Class Tokenizer


  • public class Tokenizer
    extends java.lang.Object
    Takes raw text and creates tokens from it. Tokens are the lowest-level text units, such as words and punctuation (spaces included).
    The resulting tokens can then be used by other NLP tasks to generate or consume data.
    • Field Detail

      • TOKEN_COUNT_FORMAT

        public static final java.text.DecimalFormat TOKEN_COUNT_FORMAT
    • Constructor Detail

    • Method Detail

      • tokenize

        public java.util.List<Token> tokenize(java.lang.String rawText)
                                       throws java.io.IOException
        Throws:
        java.io.IOException
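
The class description above says that tokens are the lowest-level text units: words, punctuation, and spaces. The sketch below illustrates that idea only. It is not the implementation behind Tokenizer: the class name TokenizerSketch, the regular expression, and the use of plain String values in place of Token are all assumptions made for illustration, and unlike the documented tokenize method this version performs no I/O and so declares no IOException.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical stand-in for illustration; not the documented Tokenizer class.
public class TokenizerSketch {

    // Assumed token rule: a run of letters/digits, a run of whitespace,
    // or any single remaining character (punctuation, symbols).
    private static final Pattern TOKEN_PATTERN =
            Pattern.compile("[\\p{L}\\p{N}]+|\\s+|.", Pattern.DOTALL);

    // Splits rawText into its lowest-level units. Plain strings stand in
    // for the Token type, whose structure is not described in this page.
    public static List<String> tokenize(String rawText) {
        List<String> tokens = new ArrayList<>();
        Matcher matcher = TOKEN_PATTERN.matcher(rawText);
        while (matcher.find()) {
            tokens.add(matcher.group());
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Print each token in brackets so whitespace tokens stay visible.
        for (String token : tokenize("Hello, world!")) {
            System.out.println("[" + token + "]");
        }
    }
}
```

Because every character of the input matches one of the three alternatives, concatenating the tokens in order reproduces the raw text, which is one reason a tokenizer might keep spaces as tokens rather than discarding them.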