Interface Segmenter

All Known Implementing Classes:
SegmenterImpl

public interface Segmenter
Interface providing segmentation, i.e. splitting of CJK character blocks into separate tokens. This is primarily a convenience feature for users who don't need full tokenization (or who use a separate tokenizer and only need CJK processing).
Author:
Mathias Mølster Lidal
  • Method Summary

    Modifier and Type
    Method
    Description
    List<String>
    segment(String input, Language language)
    Splits the input string into tokens and returns a list of tokens in unprocessed form.
  • Method Details

    • segment

      List<String> segment(String input, Language language)
      Splits the input string into tokens and returns a list of tokens in unprocessed form (i.e. lowercased, normalized and stemmed if applicable; see StemMode for the list of stemming options). The input is assumed to contain only word characters; any punctuation and spacing tokens will be removed.
      Parameters:
      input - the text to segment.
      language - language of input text.
      Returns:
      the list of segments.
      Throws:
      ProcessingException - if an exception is encountered during processing
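
      Example (illustrative sketch only): the snippet below shows how a caller might use the segment(String, Language) signature documented above. To keep it self-contained, it declares its own local Segmenter interface and a two-value Language enum; both are hypothetical stand-ins, not the library's types. NaiveSegmenter is likewise an assumption made for illustration and does not reflect how SegmenterImpl works: it emits each Han character as its own token, groups other letters and digits into words, drops punctuation and spacing, and performs no lowercasing, normalization, or stemming.

```java
import java.util.ArrayList;
import java.util.List;

class SegmenterSketch {

    // Hypothetical stand-in for the library's Language type (assumption).
    enum Language { ENGLISH, CHINESE_SIMPLIFIED }

    // Local mirror of the documented method signature, so the sketch compiles alone.
    interface Segmenter {
        List<String> segment(String input, Language language);
    }

    // Naive illustration only; NOT the behaviour of SegmenterImpl.
    static class NaiveSegmenter implements Segmenter {
        @Override
        public List<String> segment(String input, Language language) {
            List<String> segments = new ArrayList<>();
            StringBuilder word = new StringBuilder();
            for (int i = 0; i < input.length(); ) {
                int cp = input.codePointAt(i);
                i += Character.charCount(cp);
                if (Character.UnicodeScript.of(cp) == Character.UnicodeScript.HAN) {
                    // Each Han character becomes a separate token.
                    flush(word, segments);
                    segments.add(new String(Character.toChars(cp)));
                } else if (Character.isLetterOrDigit(cp)) {
                    // Other letters and digits are accumulated into one word.
                    word.appendCodePoint(cp);
                } else {
                    // Punctuation and spacing end the current word and are dropped.
                    flush(word, segments);
                }
            }
            flush(word, segments);
            return segments;
        }

        // Emits the pending word, if any, and resets the buffer.
        private static void flush(StringBuilder word, List<String> out) {
            if (word.length() > 0) {
                out.add(word.toString());
                word.setLength(0);
            }
        }
    }

    public static void main(String[] args) {
        Segmenter segmenter = new NaiveSegmenter();
        // The language argument is ignored by this naive stand-in.
        List<String> tokens = segmenter.segment("hello, 世界 foo", Language.CHINESE_SIMPLIFIED);
        System.out.println(tokens); // prints [hello, 世, 界, foo]
    }
}
```

      A real implementation would typically delegate to a language-aware tokenizer and may group CJK characters into multi-character words; the per-character split here only makes the idea of segmenting a CJK block concrete.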