Interface Tokenizer

All Superinterfaces:
TextProcessor
All Known Implementing Classes:
BertFullTokenizer, BertTokenizer, SimpleTokenizer, WordpieceTokenizer

public interface Tokenizer extends TextProcessor
The Tokenizer interface provides the ability to break down sentences into embeddable tokens.
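As a rough illustration of the contract, the sketch below implements the two abstract methods with a whitespace splitter. It is not one of the implementing classes listed above. `TokenizerSketch`, `WhitespaceTokenizer`, and the nested `Tokenizer` interface are hypothetical stand-ins that only mirror the signatures documented on this page, so the example compiles without the library on the classpath.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

public class TokenizerSketch {

    // Hypothetical stand-in that mirrors the two abstract methods documented
    // on this page; it is NOT the library's own interface.
    interface Tokenizer {
        List<String> tokenize(String sentence);

        String buildSentence(List<String> tokens);
    }

    // Assumed example implementation: splits on runs of whitespace.
    static class WhitespaceTokenizer implements Tokenizer {
        @Override
        public List<String> tokenize(String sentence) {
            String trimmed = sentence.trim();
            if (trimmed.isEmpty()) {
                return Collections.emptyList();
            }
            return Arrays.asList(trimmed.split("\\s+"));
        }

        @Override
        public String buildSentence(List<String> tokens) {
            // Joins tokens with single spaces; original spacing is not restored.
            return String.join(" ", tokens);
        }
    }

    public static void main(String[] args) {
        Tokenizer tokenizer = new WhitespaceTokenizer();
        List<String> tokens = tokenizer.tokenize("Hello  brave new world");
        System.out.println(tokens);                          // [Hello, brave, new, world]
        System.out.println(tokenizer.buildSentence(tokens)); // Hello brave new world
    }
}
```

In this sketch, buildSentence is only an approximate inverse of tokenize: the double space in the input collapses to a single space on the way back.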
  • Method Summary

    Modifier and Type       Method                               Description
    String                  buildSentence(List<String> tokens)   Combines a list of tokens to form a sentence.
    default List<String>    preprocess(List<String> tokens)      Applies the preprocessing defined to the given input tokens.
    List<String>            tokenize(String sentence)            Breaks down the given sentence into a list of tokens that can be represented by embeddings.
  • Method Details

    • preprocess

      default List<String> preprocess(List<String> tokens)
      Applies the preprocessing defined to the given input tokens.
      Specified by:
      preprocess in interface TextProcessor
      Parameters:
      tokens - the tokens created after the input text is tokenized
      Returns:
      the preprocessed tokens
    • tokenize

      List<String> tokenize(String sentence)
      Breaks down the given sentence into a list of tokens that can be represented by embeddings.
      Parameters:
      sentence - the sentence to tokenize
      Returns:
      a List of tokens
    • buildSentence

      String buildSentence(List<String> tokens)
      Combines a list of tokens to form a sentence.
      Parameters:
      tokens - the List of tokens
      Returns:
      the sentence built from the given tokens
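This page does not say what the default preprocess implementation does. The sketch below shows one plausible arrangement, under the assumption that the default tokenizes each input string and flattens the results into a single list. `PreprocessSketch`, `SpaceTokenizer`, and the nested interfaces are hypothetical stand-ins that mirror only the signatures documented above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class PreprocessSketch {

    // Hypothetical stand-in for the TextProcessor superinterface.
    interface TextProcessor {
        List<String> preprocess(List<String> tokens);
    }

    // Hypothetical stand-in for Tokenizer, mirroring the documented signatures.
    interface Tokenizer extends TextProcessor {
        List<String> tokenize(String sentence);

        String buildSentence(List<String> tokens);

        // ASSUMPTION: the default tokenizes every element and concatenates
        // the resulting lists; the real default may behave differently.
        @Override
        default List<String> preprocess(List<String> tokens) {
            List<String> result = new ArrayList<>();
            for (String token : tokens) {
                result.addAll(tokenize(token));
            }
            return result;
        }
    }

    // Assumed example implementation that splits on single spaces.
    static class SpaceTokenizer implements Tokenizer {
        @Override
        public List<String> tokenize(String sentence) {
            return Arrays.asList(sentence.split(" "));
        }

        @Override
        public String buildSentence(List<String> tokens) {
            return String.join(" ", tokens);
        }
    }

    public static void main(String[] args) {
        Tokenizer tokenizer = new SpaceTokenizer();
        List<String> input = Arrays.asList("hello world", "foo");
        System.out.println(tokenizer.preprocess(input)); // [hello, world, foo]
    }
}
```

Because preprocess is a default method, an implementing class in this sketch only has to supply tokenize and buildSentence, and can still override preprocess when it needs different behavior.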