Package com.cobber.fta.token
Class TokenStream
- Object
  - com.cobber.fta.token.TokenStream
public class TokenStream extends Object
A TokenStream (a set of Tokens) captures the shape and the count of the incoming data. Given data that looks like:

A09123, ICD9-90871, B00023, A12348, C89023, ICD9-90322, ICD9-44233, Z90908, Q23235

you end up with two TokenStream instances:

X99999 - count 6
XXX9-99999 - count 3

Each TokenStream preserves information about the incoming data. In the second stream above (XXX9-99999), the first Token X retains the fact that it has seen only the letter 'I', so when asked for the associated Regular Expression it can return ICD9-\d{5} and not simply \p{IsAlphabetic}{3}\d-\d{5}.
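The shape grouping described above can be sketched in a few lines. This is a standalone illustration, not the library's implementation: `ShapeCounter`, `shapeOf` and `countShapes` are hypothetical names, and the rule "letter becomes X, digit becomes 9, anything else is kept" is an assumption inferred from the two shapes in the example.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper (not part of com.cobber.fta): illustrates the shape idea only.
final class ShapeCounter {
    // Assumed rule: each letter maps to 'X', each digit to '9', other characters are kept.
    static String shapeOf(String trimmed) {
        StringBuilder shape = new StringBuilder(trimmed.length());
        for (int i = 0; i < trimmed.length(); i++) {
            char ch = trimmed.charAt(i);
            if (Character.isLetter(ch)) {
                shape.append('X');
            } else if (Character.isDigit(ch)) {
                shape.append('9');
            } else {
                shape.append(ch);
            }
        }
        return shape.toString();
    }

    // Count how many inputs share each shape, preserving first-seen order.
    static Map<String, Long> countShapes(String... inputs) {
        Map<String, Long> counts = new LinkedHashMap<>();
        for (String input : inputs) {
            counts.merge(shapeOf(input), 1L, Long::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        Map<String, Long> counts = countShapes(
                "A09123", "ICD9-90871", "B00023", "A12348", "C89023",
                "ICD9-90322", "ICD9-44233", "Z90908", "Q23235");
        // Prints "X99999 - count 6" and then "XXX9-99999 - count 3".
        counts.forEach((shape, count) -> System.out.println(shape + " - count " + count));
    }
}
```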
-
-
Field Summary
Fields (Modifier and Type, Field, Description)
static TokenStream
ANYSHAPE
The TokenStream that represents any input that is too long.
-
Constructor Summary
Constructors (Constructor, Description)
TokenStream(TokenStream other)
Construct a new TokenStream from an existing TokenStream.
TokenStream(String trimmed, long occurrences)
Construct a new TokenStream based on the input.
-
Method Summary
Methods (Modifier and Type, Method, Description)
String
getCompressedKey()
protected com.cobber.fta.token.Token[]
getCompressedTokens()
String
getKey()
static String
getKey(String trimmed)
long
getOccurrences()
String
getRegExp(boolean fitted)
Get the Regular Expression for this TokenStream.
com.cobber.fta.token.Token[]
getTokens()
boolean
isAlpha()
boolean
isAlphaNumeric()
boolean
isNumeric()
boolean
matches(String regExp)
Check if this TokenStream matches the supplied Regular Expression.
TokenStream
merge(TokenStream other)
Merge the supplied TokenStream into this one - both TokenStreams must have the same uncompressed representation.
TokenStream
mergeCompressed(TokenStream other)
Merge the supplied TokenStream into this one - both TokenStreams must have the same uncompressed representation.
TokenStream
mergeCount(long occurrences)
TokenStream
simplify()
Simplify the Compressed TokenStream to improve the Regular Expression returned.
static boolean
tooLong(String trimmed)
Is the input too long?
-
-
-
Field Detail
-
ANYSHAPE
public static final TokenStream ANYSHAPE
The TokenStream that represents any input that is too long.
-
-
Constructor Detail
-
TokenStream
public TokenStream(String trimmed, long occurrences)
Construct a new TokenStream based on the input.
Parameters:
trimmed - The trimmed input.
occurrences - The number of occurrences of this input.
-
TokenStream
public TokenStream(TokenStream other)
Construct a new TokenStream from an existing TokenStream.
Parameters:
other - The template for the new TokenStream.
-
-
Method Detail
-
getKey
public static String getKey(String trimmed)
-
tooLong
public static boolean tooLong(String trimmed)
Is the input too long?
Parameters:
trimmed - The trimmed input.
Returns:
True if the input is too long.
-
getRegExp
public String getRegExp(boolean fitted)
Get the Regular Expression for this TokenStream.
Parameters:
fitted - If true, the Regular Expression should be a 'more closely fitted' Regular Expression.
Returns:
The Java Regular Expression for this TokenStream.
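To illustrate what a 'more closely fitted' expression might mean, the sketch below builds both forms from the ICD9 inputs in the class description. It is an assumption-laden illustration, not the library's algorithm: `FittedRegExpSketch` and its methods are hypothetical, and "fitted" is assumed to mean "emit a literal wherever only one character was ever seen at a position".

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch (not part of com.cobber.fta). All inputs must share one shape.
final class FittedRegExpSketch {
    static String regExp(List<String> inputs, boolean fitted) {
        StringBuilder result = new StringBuilder();
        String runClass = null; // character class currently being repeated
        int runLength = 0;
        int length = inputs.get(0).length();
        for (int i = 0; i < length; i++) {
            // Remember every distinct character seen at this position.
            Set<Character> seen = new LinkedHashSet<>();
            for (String input : inputs) {
                seen.add(input.charAt(i));
            }
            char first = inputs.get(0).charAt(i);
            String charClass = Character.isLetter(first) ? "\\p{IsAlphabetic}"
                    : Character.isDigit(first) ? "\\d" : null;
            // Assumption: in fitted mode a position with a single observed value stays literal.
            boolean literal = charClass == null || (fitted && seen.size() == 1);
            if (!literal && charClass.equals(runClass)) {
                runLength++;
                continue;
            }
            flush(result, runClass, runLength);
            if (literal) {
                result.append(quote(first));
                runClass = null;
                runLength = 0;
            } else {
                runClass = charClass;
                runLength = 1;
            }
        }
        flush(result, runClass, runLength);
        return result.toString();
    }

    // Emit a pending run such as \d{5}; a run of one has no repeat count.
    private static void flush(StringBuilder out, String charClass, int count) {
        if (charClass == null) {
            return;
        }
        out.append(charClass);
        if (count > 1) {
            out.append('{').append(count).append('}');
        }
    }

    // Escape regular expression metacharacters; other characters pass through.
    private static String quote(char ch) {
        return "\\.[]{}()*+?^$|".indexOf(ch) >= 0 ? "\\" + ch : String.valueOf(ch);
    }

    public static void main(String[] args) {
        List<String> icd = Arrays.asList("ICD9-90871", "ICD9-90322", "ICD9-44233");
        System.out.println(regExp(icd, false)); // \p{IsAlphabetic}{3}\d-\d{5}
        System.out.println(regExp(icd, true));  // ICD9-\d{5}
    }
}
```

Both results are ordinary `java.util.regex` patterns, so either can be checked against the original inputs with `Pattern.matches`.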
-
merge
public TokenStream merge(TokenStream other)
Merge the supplied TokenStream into this one - both TokenStreams must have the same uncompressed representation.
Parameters:
other - The other TokenStream.
Returns:
The updated TokenStream, or null if the streams are not mergeable.
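The merge contract (same representation required, null otherwise) can be sketched with a minimal stand-in class. `ShapeTally` is hypothetical and tracks only a key and a count; the real TokenStream presumably also combines per-Token information, which is not modelled here.

```java
// Hypothetical stand-in (not part of com.cobber.fta): a shape key plus an occurrence count.
final class ShapeTally {
    private final String key;
    private long occurrences;

    ShapeTally(String key, long occurrences) {
        this.key = key;
        this.occurrences = occurrences;
    }

    // Fold the other tally into this one; return null when the keys differ,
    // mirroring the documented "null if not mergeable" behaviour.
    ShapeTally merge(ShapeTally other) {
        if (!key.equals(other.key)) {
            return null;
        }
        occurrences += other.occurrences;
        return this;
    }

    String getKey() {
        return key;
    }

    long getOccurrences() {
        return occurrences;
    }

    public static void main(String[] args) {
        ShapeTally first = new ShapeTally("X99999", 4);
        ShapeTally merged = first.merge(new ShapeTally("X99999", 2));
        System.out.println(merged.getKey() + " - count " + merged.getOccurrences());
        System.out.println(first.merge(new ShapeTally("XXX9-99999", 3)) == null);
    }
}
```

Because a failed merge returns null rather than throwing, callers in this sketch need to check the result before using it.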
-
mergeCount
public TokenStream mergeCount(long occurrences)
-
simplify
public TokenStream simplify()
Simplify the Compressed TokenStream to improve the Regular Expression returned. The unsimplified expression works but can be ugly, for example:

\d{2}\p{IsAlphabetic}\d{8}[\d\p{IsAlphabetic}]{4}\p{IsAlphabetic}{2}[\d\p{IsAlphabetic}]

The objective is to reduce the number of transitions to something reasonable.
Returns:
A simplified compressed representation of the current TokenStream.
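The trade-off behind simplification can be seen with plain `java.util.regex`. The simplified pattern below is a hypothetical example of a reduced form, chosen by hand (one alphanumeric class over the same total length, 2 + 1 + 8 + 4 + 2 + 1 = 18). It is not necessarily what simplify() returns.

```java
import java.util.regex.Pattern;

// Illustration only: compares the detailed expression from the description
// with one hand-written, hypothetical simplification of it.
final class SimplifyIllustration {
    static final String DETAILED =
            "\\d{2}\\p{IsAlphabetic}\\d{8}[\\d\\p{IsAlphabetic}]{4}\\p{IsAlphabetic}{2}[\\d\\p{IsAlphabetic}]";

    // Assumed simplified form: a single transition covering all 18 positions.
    static final String SIMPLIFIED = "[\\d\\p{IsAlphabetic}]{18}";

    static boolean matchesDetailed(String input) {
        return Pattern.matches(DETAILED, input);
    }

    static boolean matchesSimplified(String input) {
        return Pattern.matches(SIMPLIFIED, input);
    }

    public static void main(String[] args) {
        String sample = "12A34567890AB12CDE";
        // Anything the detailed form accepts is also accepted by the simplified form.
        System.out.println(matchesDetailed(sample) + " " + matchesSimplified(sample));
        // The simplified form is looser: it also accepts inputs the detailed form rejects.
        String lettersOnly = "ABCDEFGHIJKLMNOPQR";
        System.out.println(matchesDetailed(lettersOnly) + " " + matchesSimplified(lettersOnly));
    }
}
```

Fewer transitions make the expression easier to read, at the cost of accepting more inputs than were observed.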
-
mergeCompressed
public TokenStream mergeCompressed(TokenStream other)
Merge the supplied TokenStream into this one - both TokenStreams must have the same uncompressed representation.
Parameters:
other - The other TokenStream.
Returns:
The updated TokenStream.
-
getKey
public String getKey()
-
getTokens
public com.cobber.fta.token.Token[] getTokens()
-
getCompressedKey
public String getCompressedKey()
-
getCompressedTokens
protected com.cobber.fta.token.Token[] getCompressedTokens()
-
isAlpha
public boolean isAlpha()
Returns:
True if the TokenStream is exclusively alphabetic.
-
isAlphaNumeric
public boolean isAlphaNumeric()
Returns:
True if the TokenStream is a mix of alphabetic and numeric characters.
-
isNumeric
public boolean isNumeric()
Returns:
True if the TokenStream is exclusively numeric.
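The three predicates above can be pictured as checks on a shape string such as X99999. The sketch below is hypothetical (`ShapeKind` is not a library class) and makes two assumptions: shapes use 'X' for letters and '9' for digits, and "a mix" means at least one of each with no other characters.

```java
// Hypothetical sketch (not part of com.cobber.fta): classify a shape string.
final class ShapeKind {
    // True when the shape is non-empty and consists only of 'X' (letters).
    static boolean isAlpha(String shape) {
        return !shape.isEmpty() && shape.chars().allMatch(c -> c == 'X');
    }

    // True when the shape is non-empty and consists only of '9' (digits).
    static boolean isNumeric(String shape) {
        return !shape.isEmpty() && shape.chars().allMatch(c -> c == '9');
    }

    // Assumption: a mix needs at least one letter and one digit, and nothing else.
    static boolean isAlphaNumeric(String shape) {
        return shape.indexOf('X') >= 0
                && shape.indexOf('9') >= 0
                && shape.chars().allMatch(c -> c == 'X' || c == '9');
    }

    public static void main(String[] args) {
        System.out.println(isAlpha("XXX"));
        System.out.println(isNumeric("99999"));
        System.out.println(isAlphaNumeric("X99999"));
        System.out.println(isAlphaNumeric("XXX9-99999"));
    }
}
```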
-
getOccurrences
public long getOccurrences()
Returns:
The number of inputs this TokenStream has captured.
-
matches
public boolean matches(String regExp)
Check if this TokenStream matches the supplied Regular Expression. The matching itself is delegated to the Automaton.
Parameters:
regExp - The Regular Expression to match.
Returns:
True if the TokenStream matches the supplied Regular Expression.
-
-