Interface Segmenter

    public interface Segmenter
    Interface providing segmentation, i.e. splitting of CJK character blocks into separate tokens. This is primarily a convenience feature for users who don't need full tokenization (or who use a separate tokenizer and only need CJK processing).
    Author: Mathias Mølster Lidal
    • Method Detail

      • segment

        List<String> segment(String input,
                             Language language)
        Splits the input string into tokens and returns the list of tokens in unprocessed form (i.e. lowercased, normalized and stemmed if applicable; see StemMode for the list of stemming options). The input is assumed to contain only word characters; any punctuation and spacing tokens will be removed.
        Parameters:
        input - the text to segment.
        language - language of input text.
        Returns:
        the list of segments.
        Throws:
        ProcessingException - if an exception is encountered during processing.
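
The contract of segment can be illustrated with a minimal, self-contained sketch. Everything here is an assumption made for illustration. The Language enum and Segmenter interface below are local stand-ins that only mirror the signature shown above, and NaiveCjkSegmenter is a hypothetical toy, not the library's implementation. It emits each CJK code point as its own token, keeps runs of other letters and digits together, and drops punctuation and spacing. A real segmenter would normally choose a dictionary or model based on the language argument.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-ins so the sketch compiles on its own; the real
// Language and Segmenter types come from the library documented above.
enum Language { ENGLISH, JAPANESE, CHINESE_SIMPLIFIED }

interface Segmenter {
    List<String> segment(String input, Language language);
}

// Toy implementation (an assumption, not the library's algorithm):
// one token per CJK code point, other letter/digit runs kept whole,
// punctuation and whitespace removed.
class NaiveCjkSegmenter implements Segmenter {

    @Override
    public List<String> segment(String input, Language language) {
        // The language argument is ignored in this sketch.
        List<String> tokens = new ArrayList<>();
        StringBuilder run = new StringBuilder();
        for (int i = 0; i < input.length(); ) {
            int cp = input.codePointAt(i);
            i += Character.charCount(cp);
            if (isCjk(cp)) {
                flush(run, tokens);                       // end any pending non-CJK run
                tokens.add(new String(Character.toChars(cp)));
            } else if (Character.isLetterOrDigit(cp)) {
                run.appendCodePoint(cp);                  // extend the current word
            } else {
                flush(run, tokens);                       // punctuation or spacing: drop it
            }
        }
        flush(run, tokens);
        return tokens;
    }

    private static boolean isCjk(int cp) {
        Character.UnicodeScript script = Character.UnicodeScript.of(cp);
        return script == Character.UnicodeScript.HAN
            || script == Character.UnicodeScript.HIRAGANA
            || script == Character.UnicodeScript.KATAKANA
            || script == Character.UnicodeScript.HANGUL;
    }

    private static void flush(StringBuilder run, List<String> out) {
        if (run.length() > 0) {
            out.add(run.toString());
            run.setLength(0);
        }
    }
}

class SegmenterSketch {
    public static void main(String[] args) {
        Segmenter segmenter = new NaiveCjkSegmenter();
        // "\u4e16\u754c" is a two-character Han string, written with
        // escapes so the source file stays plain ASCII.
        List<String> tokens = segmenter.segment("hello \u4e16\u754c, ok", Language.JAPANESE);
        System.out.println(tokens);
    }
}
```

With the input above, the toy segmenter yields four tokens: hello, 世, 界 and ok. The comma and the spaces are removed, which matches the documented behaviour that punctuation and spacing tokens do not appear in the result.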