Class TokenStream


  • public class TokenStream
    extends Object
    A TokenStream (a set of Tokens) captures the Shapes and counts of the incoming data stream. For example, given data that looks like:
      A09123, ICD9-90871, B00023, A12348, C89023, ICD9-90322, ICD9-44233, Z90908, Q23235
    you will end up with two TokenStream instances:
      X99999 - count 6
      XXX9-99999 - count 3
    Each TokenStream preserves information about the incoming data. In the case of the second stream above (XXX9-99999), the first Token X retains the fact that it has seen only the letter 'I', so when asked for the associated Regular Expression the knowledge exists to return ICD9-\d{5} and NOT simply \p{IsAlphabetic}{3}\d-\d{5}.
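The shape grouping described above can be sketched as follows. This is a minimal illustration, not the library's implementation: `ShapeSketch`, `shape` and `countShapes` are hypothetical names, and the mapping (letters to `X`, digits to `9`, everything else unchanged) is assumed from the example data.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper for illustration only; not part of the documented API.
final class ShapeSketch {
    // Assumed mapping: each letter becomes 'X', each digit becomes '9',
    // and any other character (such as '-') is kept unchanged.
    static String shape(String trimmed) {
        StringBuilder sb = new StringBuilder(trimmed.length());
        for (int i = 0; i < trimmed.length(); i++) {
            char ch = trimmed.charAt(i);
            if (Character.isLetter(ch)) {
                sb.append('X');
            } else if (Character.isDigit(ch)) {
                sb.append('9');
            } else {
                sb.append(ch);
            }
        }
        return sb.toString();
    }

    // Count how many inputs fall into each shape, preserving first-seen order.
    static Map<String, Long> countShapes(String... inputs) {
        Map<String, Long> counts = new LinkedHashMap<>();
        for (String input : inputs) {
            counts.merge(shape(input), 1L, Long::sum);
        }
        return counts;
    }
}
```

Calling `ShapeSketch.countShapes(...)` with the nine sample values yields two entries: `X99999` with a count of 6 and `XXX9-99999` with a count of 3.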
    • Field Summary

      Fields 
      Modifier and Type Field Description
      static TokenStream ANYSHAPE
      The TokenStream that represents any input that is too long.
    • Constructor Summary

      Constructors 
      Constructor Description
      TokenStream(TokenStream other)
      Construct a new TokenStream from an existing TokenStream.
      TokenStream(String trimmed, long occurrences)
      Construct a new TokenStream based on the input.
    • Field Detail

      • ANYSHAPE

        public static final TokenStream ANYSHAPE
        The TokenStream that represents any input that is too long.
    • Constructor Detail

      • TokenStream

        public TokenStream(String trimmed,
                           long occurrences)
        Construct a new TokenStream based on the input.
        Parameters:
        trimmed - The trimmed input.
        occurrences - The number of occurrences of this input.
      • TokenStream

        public TokenStream(TokenStream other)
        Construct a new TokenStream from an existing TokenStream.
        Parameters:
        other - The template for the new TokenStream.
    • Method Detail

      • getKey

        public static String getKey(String trimmed)
      • tooLong

        public static boolean tooLong(String trimmed)
        Is the input too long?
        Parameters:
        trimmed - The trimmed input.
        Returns:
        True if the input is too long.
      • getRegExp

        public String getRegExp(boolean fitted)
        Get the Regular Expression for this TokenStream.
        Parameters:
        fitted - If true the Regular Expression should be a 'more closely fitted' Regular Expression.
        Returns:
        The Java Regular Expression for this TokenStream.
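To make the effect of `fitted` concrete, here is a minimal sketch of one way a closer-fitting expression could be derived. It is an assumption for illustration, not the library's algorithm: `FittedRegexSketch` and its methods are hypothetical, and the sketch assumes all inputs share one shape (same length, same character class at every position).

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper for illustration only; not part of the documented API.
final class FittedRegexSketch {
    // Build a regular expression for inputs that all share one shape.
    static String regExp(List<String> sameShapeInputs, boolean fitted) {
        int length = sameShapeInputs.get(0).length();
        List<String> atoms = new ArrayList<>();
        for (int pos = 0; pos < length; pos++) {
            // Record every character observed at this position.
            Set<Character> seen = new LinkedHashSet<>();
            for (String input : sameShapeInputs) {
                seen.add(input.charAt(pos));
            }
            atoms.add(atom(seen, fitted));
        }
        return compress(atoms);
    }

    // One regex atom per position: a literal when the position is punctuation,
    // or when fitting and only a single character was ever seen; otherwise a class.
    private static String atom(Set<Character> seen, boolean fitted) {
        char first = seen.iterator().next();
        boolean punctuation = !Character.isLetterOrDigit(first);
        if (punctuation || (fitted && seen.size() == 1)) {
            return escape(first);
        }
        return Character.isDigit(first) ? "\\d" : "\\p{IsAlphabetic}";
    }

    // Escape only characters that are regex metacharacters outside a class.
    private static String escape(char ch) {
        return "\\.[]{}()*+?^$|".indexOf(ch) >= 0 ? "\\" + ch : String.valueOf(ch);
    }

    // Run-length compress adjacent identical atoms, e.g. \d\d\d becomes \d{3}.
    private static String compress(List<String> atoms) {
        StringBuilder sb = new StringBuilder();
        int i = 0;
        while (i < atoms.size()) {
            int j = i;
            while (j < atoms.size() && atoms.get(j).equals(atoms.get(i))) {
                j++;
            }
            sb.append(atoms.get(i));
            if (j - i > 1) {
                sb.append('{').append(j - i).append('}');
            }
            i = j;
        }
        return sb.toString();
    }
}
```

For the inputs `ICD9-90871`, `ICD9-90322` and `ICD9-44233`, this sketch produces `ICD9-\d{5}` when `fitted` is true and `\p{IsAlphabetic}{3}\d-\d{5}` when it is false.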
      • merge

        public TokenStream merge(TokenStream other)
        Merge the supplied TokenStream into this one - both TokenStreams must have the same uncompressed representation.
        Parameters:
        other - The other TokenStream.
        Returns:
        The updated TokenStream or null if Streams are not mergeable.
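The merge contract (combine when the representations agree, return null otherwise) can be sketched with a simplified stand-in. `ShapeCount` is a hypothetical class, and using a single string key as the "uncompressed representation" is an assumption; the real class presumably also merges per-Token information.

```java
// Hypothetical stand-in for illustration only; not part of the documented API.
final class ShapeCount {
    final String key;   // assumed to play the role of the uncompressed representation
    long occurrences;

    ShapeCount(String key, long occurrences) {
        this.key = key;
        this.occurrences = occurrences;
    }

    // Fold the other count into this one; return null when the keys differ,
    // mirroring the documented "null if Streams are not mergeable" behavior.
    ShapeCount merge(ShapeCount other) {
        if (!key.equals(other.key)) {
            return null;
        }
        occurrences += other.occurrences;
        return this;
    }
}
```

Callers of such a method need to check for null before using the result, since a failed merge leaves the receiver unchanged.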
      • mergeCount

        public TokenStream mergeCount(long occurrences)
      • simplify

        public TokenStream simplify()
        Simplify the compressed TokenStream to improve the Regular Expression returned. We have something that works, but it may be really ugly. For example, we may have:
          \d{2}\p{IsAlphabetic}\d{8}[\d\p{IsAlphabetic}]{4}\p{IsAlphabetic}{2}[\d\p{IsAlphabetic}]
        The objective is to reduce the number of transitions to something reasonable.
        Returns:
        A simplified compressed representation of the current TokenStream.
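One possible way to reduce transitions is sketched below. The strategy shown (collapse every run into a single alphanumeric class once the run count exceeds a limit) is purely an assumption for illustration; the source does not describe the actual rule. `SimplifySketch`, `Run` and `maxRuns` are hypothetical, and the sketch assumes every run is a digit, alphabetic, or alphanumeric class.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration only; not part of the documented API.
final class SimplifySketch {
    static final String ALNUM = "[\\d\\p{IsAlphabetic}]";

    // A run of 'count' consecutive characters drawn from one character class.
    static final class Run {
        final String charClass;
        final int count;

        Run(String charClass, int count) {
            this.charClass = charClass;
            this.count = count;
        }
    }

    // Assumed strategy: if there are more than maxRuns runs, replace them all
    // with one alphanumeric run covering the same total length.
    static List<Run> simplify(List<Run> runs, int maxRuns) {
        if (runs.size() <= maxRuns) {
            return runs;
        }
        int total = 0;
        for (Run run : runs) {
            total += run.count;
        }
        List<Run> result = new ArrayList<>();
        result.add(new Run(ALNUM, total));
        return result;
    }

    // Render runs as a regular expression, adding {n} only when n > 1.
    static String render(List<Run> runs) {
        StringBuilder sb = new StringBuilder();
        for (Run run : runs) {
            sb.append(run.charClass);
            if (run.count > 1) {
                sb.append('{').append(run.count).append('}');
            }
        }
        return sb.toString();
    }
}
```

Applied to the six runs of the example expression with `maxRuns` set to 3, this sketch renders `[\d\p{IsAlphabetic}]{18}`. The result is looser than the original: it accepts every string the original accepts, plus others of the same length.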
      • mergeCompressed

        public TokenStream mergeCompressed(TokenStream other)
        Merge the supplied TokenStream into this one - both TokenStreams must have the same uncompressed representation.
        Parameters:
        other - The other TokenStream.
        Returns:
        The updated TokenStream.
      • getKey

        public String getKey()
      • getTokens

        public com.cobber.fta.token.Token[] getTokens()
      • getCompressedKey

        public String getCompressedKey()
      • getCompressedTokens

        protected com.cobber.fta.token.Token[] getCompressedTokens()
      • isAlpha

        public boolean isAlpha()
        Returns:
        True if the TokenStream consists exclusively of alphabetic characters.
      • isAlphaNumeric

        public boolean isAlphaNumeric()
        Returns:
        True if the TokenStream is a mix of alphabetic and numeric characters.
      • isNumeric

        public boolean isNumeric()
        Returns:
        True if the TokenStream consists exclusively of numeric characters.
      • getOccurrences

        public long getOccurrences()
        Returns:
        The number of inputs this TokenStream has captured.
      • matches

        public boolean matches(String regExp)
        Check if this TokenStream matches the supplied Regular Expression. We use the Automaton to do all the heavy lifting.
        Parameters:
        regExp - The Regular Expression to match.
        Returns:
        True if the TokenStream matches the supplied Regular Expression.