Class WordpieceTokenizer

java.lang.Object
ai.djl.modality.nlp.preprocess.SimpleTokenizer
ai.djl.modality.nlp.bert.WordpieceTokenizer
All Implemented Interfaces:
TextProcessor, Tokenizer

public class WordpieceTokenizer extends SimpleTokenizer
WordpieceTokenizer tokenizes a piece of text into its word pieces.

This uses a greedy longest-match-first algorithm to perform tokenization using the given vocabulary. The input text should already be cleaned and preprocessed.

 jshell> String input = "unaffable";
 jshell> wordpieceTokenizer.tokenize(input);
 ["un", "##aff", "##able"]
 

Reference implementation: Google Research Bert Tokenizer
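The greedy longest-match-first strategy can be sketched as follows. This is a minimal illustrative implementation, not DJL's internal code: the class name `WordpieceSketch`, the plain `Set<String>` vocabulary, and the helper signature are assumptions made for the example; continuation pieces carry the conventional `##` prefix shown in the jshell session above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class WordpieceSketch {

    /**
     * Greedy longest-match-first wordpiece tokenization (illustrative sketch).
     * Words longer than maxInputChars collapse to the unknown token.
     */
    static List<String> tokenize(String word, Set<String> vocab, String unknown, int maxInputChars) {
        if (word.length() > maxInputChars) {
            return Arrays.asList(unknown);
        }
        List<String> tokens = new ArrayList<>();
        int start = 0;
        while (start < word.length()) {
            String piece = null;
            // Try the longest remaining substring first, shrinking until a vocabulary match is found.
            for (int end = word.length(); end > start; end--) {
                String sub = word.substring(start, end);
                if (start > 0) {
                    sub = "##" + sub; // non-initial pieces are marked with the "##" prefix
                }
                if (vocab.contains(sub)) {
                    piece = sub;
                    start = end;
                    break;
                }
            }
            if (piece == null) {
                // No prefix of the remaining characters is in the vocabulary.
                return Arrays.asList(unknown);
            }
            tokens.add(piece);
        }
        return tokens;
    }

    public static void main(String[] args) {
        Set<String> vocab = new HashSet<>(Arrays.asList("un", "##aff", "##able"));
        System.out.println(tokenize("unaffable", vocab, "[UNK]", 200));
        // prints [un, ##aff, ##able], matching the jshell example above
    }
}
```

Because each step commits to the longest matching piece before moving on, the result depends entirely on the supplied vocabulary; "unaffable" splits into ["un", "##aff", "##able"] only when those three pieces are present.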

  • Constructor Details

    • WordpieceTokenizer

      public WordpieceTokenizer(Vocabulary vocabulary, String unknown, int maxInputChars)
      Creates an instance of WordpieceTokenizer.
      Parameters:
      vocabulary - the Vocabulary used for wordpiece tokenization
      unknown - the String that represents an unknown (out-of-vocabulary) token
      maxInputChars - the maximum number of input characters per word
  • Method Details

    • tokenize

      public List<String> tokenize(String sentence)
      Breaks down the given sentence into a list of tokens that can be represented by embeddings.
      Specified by:
      tokenize in interface Tokenizer
      Overrides:
      tokenize in class SimpleTokenizer
      Parameters:
      sentence - the sentence to tokenize
      Returns:
      a List of tokens