Class WordpieceTokenizer

java.lang.Object
ai.djl.modality.nlp.preprocess.SimpleTokenizer
ai.djl.modality.nlp.bert.WordpieceTokenizer
All Implemented Interfaces:
TextProcessor, Tokenizer

public class WordpieceTokenizer extends SimpleTokenizer
WordpieceTokenizer tokenizes a piece of text into its word pieces.

This uses a greedy longest-match-first algorithm to perform tokenization using the given vocabulary. The input text should already be cleaned and preprocessed.

 jshell> String input = "unaffable";
 jshell> wordpieceTokenizer.tokenize(input);
 ["un", "##aff", "##able"]
 

Reference implementation: Google Research Bert Tokenizer
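The greedy longest-match-first strategy can be sketched as follows. This is a minimal illustrative implementation, not DJL's internal code: the class name `WordpieceSketch`, the plain `Set<String>` vocabulary, and the helper signature are assumptions made for the example; continuation pieces carry the conventional `##` prefix shown in the jshell session above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class WordpieceSketch {

    /**
     * Greedy longest-match-first wordpiece tokenization (illustrative sketch).
     * Words longer than maxInputChars collapse to the unknown token.
     */
    static List<String> tokenize(String word, Set<String> vocab, String unknown, int maxInputChars) {
        if (word.length() > maxInputChars) {
            return Arrays.asList(unknown);
        }
        List<String> tokens = new ArrayList<>();
        int start = 0;
        while (start < word.length()) {
            String piece = null;
            // Try the longest remaining substring first, shrinking until a vocabulary match is found.
            for (int end = word.length(); end > start; end--) {
                String sub = word.substring(start, end);
                if (start > 0) {
                    sub = "##" + sub; // non-initial pieces are marked with the "##" prefix
                }
                if (vocab.contains(sub)) {
                    piece = sub;
                    start = end;
                    break;
                }
            }
            if (piece == null) {
                // No prefix of the remaining characters is in the vocabulary.
                return Arrays.asList(unknown);
            }
            tokens.add(piece);
        }
        return tokens;
    }

    public static void main(String[] args) {
        Set<String> vocab = new HashSet<>(Arrays.asList("un", "##aff", "##able"));
        System.out.println(tokenize("unaffable", vocab, "[UNK]", 200));
        // prints [un, ##aff, ##able], matching the jshell example above
    }
}
```

Because each step commits to the longest matching piece before moving on, the result depends entirely on the supplied vocabulary; "unaffable" splits into ["un", "##aff", "##able"] only when those three pieces are present.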

  • Constructor Details

    • WordpieceTokenizer

      public WordpieceTokenizer(Vocabulary vocabulary, String unknown, int maxInputChars)
      Creates an instance of WordpieceTokenizer.
      Parameters:
      vocabulary - the Vocabulary used for wordpiece tokenization
      unknown - the String that represents an unknown (out-of-vocabulary) token
      maxInputChars - the maximum number of input characters per word
  • Method Details

    • tokenize

      public List<String> tokenize(String sentence)
      Breaks down the given sentence into a list of tokens that can be represented by embeddings.
      Specified by:
      tokenize in interface Tokenizer
      Overrides:
      tokenize in class SimpleTokenizer
      Parameters:
      sentence - the sentence to tokenize
      Returns:
      a List of tokens