Interface Segmenter

  • All Known Implementing Classes:
    SegmenterImpl

    public interface Segmenter
    Interface providing segmentation, i.e. splitting of CJK character blocks into separate tokens. This is primarily a convenience feature for users who don't need full tokenization (or who use a separate tokenizer and only need CJK processing).
    Author:
    Mathias Mølster Lidal
    • Method Summary

      Modifier and Type Method Description
      java.util.List<java.lang.String> segment(java.lang.String input, Language language)
      Splits the input string into tokens and returns a list of tokens in unprocessed form.
    • Method Detail

      • segment

        java.util.List<java.lang.String> segment(java.lang.String input,
                                                 Language language)
        Splits the input string into tokens and returns a list of tokens in unprocessed form (i.e. lowercased, normalized and stemmed if applicable; see StemMode for the list of stemming options). The input is assumed to contain only word characters; any punctuation and spacing tokens will be removed.
        Parameters:
        input - the text to segment.
        language - language of input text.
        Returns:
        the list of segments.
        Throws:
        ProcessingException - if an exception is encountered during processing
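
Example (illustrative sketch only): the snippet below shows how a caller might use an object with the segment(input, language) contract described above. To stay self-contained it declares its own hypothetical stand-ins: the nested Segmenter interface, the Language enum, and ToySegmenter are assumptions made for this example, not the library's types, and ToySegmenter is not how SegmenterImpl works. The toy simply emits each CJK ideograph as its own segment, keeps runs of other letters and digits together, and drops punctuation and spacing. A real segmenter would typically rely on language-specific analysis instead.

```java
import java.util.ArrayList;
import java.util.List;

public class SegmenterSketch {

    // Hypothetical stand-in for the library's Language type (assumption).
    enum Language { ENGLISH, JAPANESE }

    // Hypothetical mirror of the documented method signature (assumption).
    interface Segmenter {
        List<String> segment(String input, Language language);
    }

    // Toy implementation for illustration only: one segment per CJK ideograph,
    // one segment per run of other letters/digits, everything else dropped.
    static class ToySegmenter implements Segmenter {
        @Override
        public List<String> segment(String input, Language language) {
            List<String> segments = new ArrayList<>();
            StringBuilder run = new StringBuilder();
            for (int i = 0; i < input.length(); ) {
                int cp = input.codePointAt(i);
                i += Character.charCount(cp);
                if (Character.isIdeographic(cp)) {
                    flush(run, segments);                      // end any pending non-CJK run
                    segments.add(new String(Character.toChars(cp)));
                } else if (Character.isLetterOrDigit(cp)) {
                    run.appendCodePoint(cp);                   // extend the current run
                } else {
                    flush(run, segments);                      // punctuation/spacing is removed
                }
            }
            flush(run, segments);
            return segments;
        }

        // Moves the pending run (if any) into the output list.
        private static void flush(StringBuilder run, List<String> out) {
            if (run.length() > 0) {
                out.add(run.toString());
                run.setLength(0);
            }
        }
    }

    public static void main(String[] args) {
        Segmenter segmenter = new ToySegmenter();
        // "\u4e16\u754c" is the two-ideograph word for "world".
        List<String> segments = segmenter.segment("hello, \u4e16\u754c!", Language.JAPANESE);
        System.out.println(segments.size()); // prints 3
    }
}
```

With the actual library, the caller would obtain a Segmenter from the library itself rather than defining one, and should be prepared for ProcessingException as documented above.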