Class WordpieceTokenizer

  • All Implemented Interfaces:
    TextProcessor, Tokenizer

    public class WordpieceTokenizer
    extends SimpleTokenizer
    WordpieceTokenizer tokenizes a piece of text into its word pieces.

    This uses a greedy longest-match-first algorithm to perform tokenization using the given vocabulary. The input text should already be cleaned and preprocessed.

     jshell> String input = "unaffable";
     jshell> wordpieceTokenizer.tokenize(input);
     ["un", "##aff", "##able"]
     

    Reference implementation: Google Research Bert Tokenizer
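
The greedy longest-match-first strategy described above can be sketched in plain Java. This is a minimal illustration, not the library's source: the class `WordpieceSketch`, its methods, and the use of a `Set<String>` in place of a `Vocabulary` are hypothetical, and the handling of `unknown` and `maxInputChars` is an assumption modelled on the BERT reference tokenizer, where a word that is too long or cannot be fully matched becomes a single unknown token.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class WordpieceSketch {

    /** Splits one whitespace-free word into word pieces, or returns [unknown]. */
    public static List<String> splitWord(
            String word, Set<String> vocab, String unknown, int maxInputChars) {
        // Assumption: a word longer than maxInputChars is not split at all.
        if (word.length() > maxInputChars) {
            return List.of(unknown);
        }
        List<String> pieces = new ArrayList<>();
        int start = 0;
        while (start < word.length()) {
            int end = word.length();
            String match = null;
            // Try the longest remaining substring first, then shrink it from the right.
            while (start < end) {
                String candidate = word.substring(start, end);
                if (start > 0) {
                    candidate = "##" + candidate; // continuation pieces carry the "##" prefix
                }
                if (vocab.contains(candidate)) {
                    match = candidate;
                    break;
                }
                end--;
            }
            // Assumption: if any part of the word cannot be matched, the whole word is unknown.
            if (match == null) {
                return List.of(unknown);
            }
            pieces.add(match);
            start = end;
        }
        return pieces;
    }

    /** Splits a pre-cleaned sentence on whitespace and tokenizes each word. */
    public static List<String> tokenize(
            String sentence, Set<String> vocab, String unknown, int maxInputChars) {
        List<String> tokens = new ArrayList<>();
        for (String word : sentence.trim().split("\\s+")) {
            if (!word.isEmpty()) {
                tokens.addAll(splitWord(word, vocab, unknown, maxInputChars));
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        Set<String> vocab = new HashSet<>(Arrays.asList("un", "##aff", "##able"));
        System.out.println(tokenize("unaffable", vocab, "[UNK]", 200)); // [un, ##aff, ##able]
    }
}
```

Because the match is greedy, the result depends on which pieces the vocabulary contains: the longest matching prefix always wins, even if a shorter prefix would have allowed the rest of the word to be matched.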

    • Constructor Summary

      Constructors 
      Constructor Description
      WordpieceTokenizer(Vocabulary vocabulary, java.lang.String unknown, int maxInputChars)
      Creates an instance of WordpieceTokenizer.
    • Method Summary

      All Methods Instance Methods Concrete Methods 
      Modifier and Type Method Description
      java.util.List<java.lang.String> tokenize(java.lang.String sentence)
      Breaks down the given sentence into a list of tokens that can be represented by embeddings.
      • Methods inherited from class java.lang.Object

        clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
    • Constructor Detail

      • WordpieceTokenizer

        public WordpieceTokenizer(Vocabulary vocabulary,
                                  java.lang.String unknown,
                                  int maxInputChars)
        Creates an instance of WordpieceTokenizer.
        Parameters:
        vocabulary - the Vocabulary used for wordpiece tokenization
        unknown - the String that represents the unknown token
        maxInputChars - maximum number of input characters
    • Method Detail

      • tokenize

        public java.util.List<java.lang.String> tokenize(java.lang.String sentence)
        Breaks down the given sentence into a list of tokens that can be represented by embeddings.
        Specified by:
        tokenize in interface Tokenizer
        Overrides:
        tokenize in class SimpleTokenizer
        Parameters:
        sentence - the sentence to tokenize
        Returns:
        a List of tokens
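
When consuming the returned list, it can help to see how continuation pieces relate to the original words. The sketch below merges "##"-prefixed pieces back onto the preceding token. The class `WordpieceJoin` and the method `mergePieces` are hypothetical helpers written for illustration and are not part of this API; the sketch assumes the "##" continuation prefix shown in the class description.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class WordpieceJoin {

    /** Merges "##"-prefixed continuation pieces onto the token before them. */
    public static List<String> mergePieces(List<String> tokens) {
        List<String> words = new ArrayList<>();
        for (String token : tokens) {
            if (token.startsWith("##") && !words.isEmpty()) {
                // Append the piece, without its "##" prefix, to the previous word.
                int last = words.size() - 1;
                words.set(last, words.get(last) + token.substring(2));
            } else {
                // A token without the prefix starts a new word.
                words.add(token);
            }
        }
        return words;
    }

    public static void main(String[] args) {
        List<String> tokens = Arrays.asList("un", "##aff", "##able", "is", "[UNK]");
        System.out.println(mergePieces(tokens)); // [unaffable, is, [UNK]]
    }
}
```

The merge is lossy for unknown tokens: a word that was replaced by the unknown token cannot be recovered from the token list alone.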