Class BertFullTokenizer

All Implemented Interfaces:
TextProcessor, Tokenizer

public class BertFullTokenizer extends BertTokenizer
BertFullTokenizer runs end-to-end tokenization of input text.

It first runs basic preprocessors to clean the input text, then runs WordpieceTokenizer to split each resulting token into word pieces.
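
To illustrate the word-piece step, here is a minimal sketch of greedy longest-match-first splitting as used by WordPiece tokenizers in general. It does not use the library's WordpieceTokenizer; the class name WordpieceSketch, the method split, the tiny vocabulary, and the "[UNK]" token are assumptions made for this example only. Real implementations typically also cap the word length, which is omitted here.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper, not part of the library.
class WordpieceSketch {

    // Splits one word into the longest vocabulary pieces, scanning left to right.
    // Pieces that do not start the word carry the "##" continuation prefix.
    public static List<String> split(String word, Set<String> vocab, String unknown) {
        List<String> pieces = new ArrayList<>();
        int start = 0;
        while (start < word.length()) {
            int end = word.length();
            String match = null;
            // Shrink the candidate from the right until it is found in the vocabulary.
            while (start < end) {
                String candidate = word.substring(start, end);
                if (start > 0) {
                    candidate = "##" + candidate;
                }
                if (vocab.contains(candidate)) {
                    match = candidate;
                    break;
                }
                end--;
            }
            if (match == null) {
                // No piece fits at this position: the whole word maps to the unknown token.
                List<String> result = new ArrayList<>();
                result.add(unknown);
                return result;
            }
            pieces.add(match);
            start = end;
        }
        return pieces;
    }

    public static void main(String[] args) {
        // Assumed toy vocabulary for the example.
        Set<String> vocab = new HashSet<>(Arrays.asList("un", "##aff", "##able"));
        System.out.println(split("unaffable", vocab, "[UNK]")); // prints [un, ##aff, ##able]
        System.out.println(split("xyz", vocab, "[UNK]"));       // prints [[UNK]]
    }
}
```

The greedy choice keeps the number of pieces small, at the cost of never backtracking: if a long prefix matches but the remainder cannot be covered, the whole word falls back to the unknown token.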

Reference implementation: Google Research Bert Tokenizer

  • Constructor Details

    • BertFullTokenizer

      public BertFullTokenizer(Vocabulary vocabulary, boolean lowerCase)
      Creates an instance of BertFullTokenizer.
      Parameters:
      vocabulary - the BERT vocabulary
      lowerCase - whether to convert tokens to lowercase
  • Method Details

    • getVocabulary

      public Vocabulary getVocabulary()
      Returns the Vocabulary used for tokenization.
      Returns:
      the Vocabulary used for tokenization
    • tokenize

      public List<String> tokenize(String input)
      Breaks down the given sentence into a list of tokens that can be represented by embeddings.
      Specified by:
      tokenize in interface Tokenizer
      Overrides:
      tokenize in class BertTokenizer
      Parameters:
      input - the sentence to tokenize
      Returns:
      a List of tokens
    • buildSentence

      public String buildSentence(List<String> tokens)
      Combines a list of tokens to form a sentence.
      Specified by:
      buildSentence in interface Tokenizer
      Overrides:
      buildSentence in class SimpleTokenizer
      Parameters:
      tokens - the List of tokens
      Returns:
      the sentence built from the given tokens
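
One plausible way to rebuild a sentence from word pieces is to join the tokens with spaces and then glue each "##" continuation piece back onto the preceding token. The sketch below shows that idea in plain Java; it is an illustration under that assumption, not a statement of what this method does internally, and the class name BuildSentenceSketch is hypothetical.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of the library.
class BuildSentenceSketch {

    // Joins tokens with single spaces, then removes the " ##" boundary so that
    // continuation pieces attach to the token before them.
    public static String buildSentence(List<String> tokens) {
        return String.join(" ", tokens).replace(" ##", "").trim();
    }

    public static void main(String[] args) {
        List<String> tokens = Arrays.asList("un", "##aff", "##able", "words");
        System.out.println(buildSentence(tokens)); // prints unaffable words
    }
}
```

Note that this kind of reconstruction is lossy: original casing, accents, and spacing around punctuation are not recovered if preprocessing changed them.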
    • getPreprocessors

      public static List<TextProcessor> getPreprocessors(boolean lowerCase)
      Returns the list of TextProcessors used to preprocess input text for BERT models.
      Parameters:
      lowerCase - whether to convert input to lowercase
      Returns:
      the list of TextProcessors
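
As a rough picture of what basic BERT preprocessing involves, the sketch below follows the general steps of the reference basic tokenizer: split on whitespace, optionally lowercase and strip accents, then separate punctuation into its own tokens. It is written with plain JDK classes rather than TextProcessor instances; the class BasicPreprocessSketch and its methods are hypothetical, and the processors actually returned by getPreprocessors may differ in order and detail.

```java
import java.text.Normalizer;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper, not part of the library.
class BasicPreprocessSketch {

    public static List<String> preprocess(String text, boolean lowerCase) {
        List<String> tokens = new ArrayList<>();
        for (String word : text.trim().split("\\s+")) {
            if (word.isEmpty()) {
                continue;
            }
            if (lowerCase) {
                word = word.toLowerCase(Locale.ROOT);
                // Decompose accented characters, then drop the combining marks.
                word = Normalizer.normalize(word, Normalizer.Form.NFD).replaceAll("\\p{Mn}", "");
            }
            // Emit every punctuation code point as a separate token.
            StringBuilder current = new StringBuilder();
            for (int i = 0; i < word.length(); ) {
                int cp = word.codePointAt(i);
                i += Character.charCount(cp);
                if (isPunctuation(cp)) {
                    if (current.length() > 0) {
                        tokens.add(current.toString());
                        current.setLength(0);
                    }
                    tokens.add(new String(Character.toChars(cp)));
                } else {
                    current.appendCodePoint(cp);
                }
            }
            if (current.length() > 0) {
                tokens.add(current.toString());
            }
        }
        return tokens;
    }

    // Treats all non-alphanumeric printable ASCII as punctuation, plus Unicode punctuation classes.
    static boolean isPunctuation(int cp) {
        if ((cp >= 33 && cp <= 47) || (cp >= 58 && cp <= 64)
                || (cp >= 91 && cp <= 96) || (cp >= 123 && cp <= 126)) {
            return true;
        }
        int type = Character.getType(cp);
        return type == Character.CONNECTOR_PUNCTUATION
                || type == Character.DASH_PUNCTUATION
                || type == Character.START_PUNCTUATION
                || type == Character.END_PUNCTUATION
                || type == Character.INITIAL_QUOTE_PUNCTUATION
                || type == Character.FINAL_QUOTE_PUNCTUATION
                || type == Character.OTHER_PUNCTUATION;
    }

    public static void main(String[] args) {
        System.out.println(preprocess("H\u00e9llo, World!", true)); // prints [hello, ,, world, !]
    }
}
```

The output of a step like this is what a word-piece splitter would then consume, one token at a time.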