Class SimpleTokenizer

java.lang.Object
ai.djl.modality.nlp.preprocess.SimpleTokenizer
All Implemented Interfaces:
TextProcessor, Tokenizer
Direct Known Subclasses:
BertTokenizer, WordpieceTokenizer

public class SimpleTokenizer extends Object implements Tokenizer
SimpleTokenizer is an implementation of the Tokenizer interface that converts a sentence into tokens by splitting it on a given delimiter.
  • Constructor Details

    • SimpleTokenizer

      public SimpleTokenizer(String delimiter)
      Creates an instance of SimpleTokenizer with the given delimiter.
      Parameters:
      delimiter - the delimiter on which sentences are split into tokens
    • SimpleTokenizer

      public SimpleTokenizer()
      Creates an instance of SimpleTokenizer with the default delimiter (" ").
  • Method Details

    • tokenize

      public List<String> tokenize(String sentence)
      Breaks down the given sentence into a list of tokens that can be represented by embeddings.
      Specified by:
      tokenize in interface Tokenizer
      Parameters:
      sentence - the sentence to tokenize
      Returns:
      a List of tokens
    • buildSentence

      public String buildSentence(List<String> tokens)
      Combines a list of tokens to form a sentence.
      Specified by:
      buildSentence in interface Tokenizer
      Parameters:
      tokens - the List of tokens
      Returns:
      the sentence built from the given tokens
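
The behavior described above can be sketched in plain Java without the DJL dependency. This is a minimal illustration only: DelimiterTokenizerSketch is a hypothetical stand-in, not the library class, and its method bodies are assumptions based on the descriptions on this page. In particular, this page does not say whether the delimiter is treated as a literal string or as a regular expression, nor which string buildSentence places between tokens; the sketch treats the delimiter as a literal and reuses it when joining.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

/** Hypothetical stand-in for illustration; this is NOT ai.djl.modality.nlp.preprocess.SimpleTokenizer. */
public class DelimiterTokenizerSketch {

    private final String delimiter;

    /** Mirrors the documented SimpleTokenizer(String delimiter) constructor. */
    public DelimiterTokenizerSketch(String delimiter) {
        this.delimiter = delimiter;
    }

    /** Mirrors the documented no-argument constructor: the default delimiter is " ". */
    public DelimiterTokenizerSketch() {
        this(" ");
    }

    /** Splits the sentence on the delimiter and returns the pieces as a list. */
    public List<String> tokenize(String sentence) {
        // Assumption of this sketch: Pattern.quote makes String.split treat the
        // delimiter literally instead of as a regular expression.
        return Arrays.asList(sentence.split(Pattern.quote(delimiter)));
    }

    /** Joins the tokens back into one sentence (assumed here to reuse the delimiter). */
    public String buildSentence(List<String> tokens) {
        return String.join(delimiter, tokens);
    }

    public static void main(String[] args) {
        DelimiterTokenizerSketch bySpace = new DelimiterTokenizerSketch();
        List<String> tokens = bySpace.tokenize("Deep Java Library");
        System.out.println(tokens);                        // [Deep, Java, Library]
        System.out.println(bySpace.buildSentence(tokens)); // Deep Java Library

        DelimiterTokenizerSketch byComma = new DelimiterTokenizerSketch(",");
        System.out.println(byComma.tokenize("a,b,c"));     // [a, b, c]
    }
}
```

With the real class, the corresponding calls would be new SimpleTokenizer() or new SimpleTokenizer(","), followed by tokenize(sentence) and buildSentence(tokens), using the signatures listed above.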