Package ai.djl.modality.nlp.bert
Class WordpieceTokenizer
- java.lang.Object
  - ai.djl.modality.nlp.preprocess.SimpleTokenizer
    - ai.djl.modality.nlp.bert.WordpieceTokenizer
All Implemented Interfaces: TextProcessor, Tokenizer
public class WordpieceTokenizer extends SimpleTokenizer
WordpieceTokenizer tokenizes a piece of text into its word pieces. It uses a greedy longest-match-first algorithm to perform tokenization using the given vocabulary. The input text should already be cleaned and preprocessed.
jshell> String input = "unaffable";
jshell> wordpieceTokenizer.tokenize(input);
["un", "##aff", "##able"]
Reference implementation: Google Research Bert Tokenizer
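The greedy longest-match-first strategy described above can be sketched in plain Java. This is a minimal illustration modeled on the Google Research BERT reference behavior, not DJL's actual source: the class name WordpieceSketch, the method tokenizeWord, and the use of a plain Set<String> in place of a Vocabulary are all assumptions made for this example. It assumes continuation pieces carry the "##" prefix, that a word longer than maxInputChars maps to the unknown token, and that a word with any unmatched remainder maps to the unknown token as a whole.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical standalone sketch; not part of DJL.
public class WordpieceSketch {

    /** Splits a single whitespace-free word into word pieces, longest match first. */
    public static List<String> tokenizeWord(
            String word, Set<String> vocab, String unknown, int maxInputChars) {
        List<String> pieces = new ArrayList<>();
        // Assumption: overly long words are not split and become the unknown token.
        if (word.length() > maxInputChars) {
            pieces.add(unknown);
            return pieces;
        }
        int start = 0;
        while (start < word.length()) {
            int end = word.length();
            String match = null;
            // Shrink the candidate from the right until it is found in the vocabulary.
            while (start < end) {
                String candidate = word.substring(start, end);
                if (start > 0) {
                    candidate = "##" + candidate; // marks a continuation piece
                }
                if (vocab.contains(candidate)) {
                    match = candidate;
                    break;
                }
                end--;
            }
            if (match == null) {
                // No prefix of the remainder is in the vocabulary:
                // the whole word is replaced by the unknown token.
                pieces.clear();
                pieces.add(unknown);
                return pieces;
            }
            pieces.add(match);
            start = end; // continue after the matched piece
        }
        return pieces;
    }

    public static void main(String[] args) {
        Set<String> vocab = new HashSet<>(Arrays.asList("un", "##aff", "##able"));
        System.out.println(tokenizeWord("unaffable", vocab, "[UNK]", 200));
        // prints [un, ##aff, ##able]
    }
}
```

Because the inner loop always tries the longest remaining substring first, "unaffable" is split at "un" only after every longer prefix has failed to match. The worst-case cost is quadratic in the word length, which is one reason a cap such as maxInputChars is useful.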
Constructor Summary
Constructor: WordpieceTokenizer(Vocabulary vocabulary, java.lang.String unknown, int maxInputChars)
Description: Creates an instance of WordpieceTokenizer.
Method Summary
Modifier and Type: java.util.List<java.lang.String>
Method: tokenize(java.lang.String sentence)
Description: Breaks down the given sentence into a list of tokens that can be represented by embeddings.
Methods inherited from class ai.djl.modality.nlp.preprocess.SimpleTokenizer
buildSentence
-
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
-
Methods inherited from interface ai.djl.modality.nlp.preprocess.Tokenizer
preprocess
Constructor Detail
-
WordpieceTokenizer
public WordpieceTokenizer(Vocabulary vocabulary, java.lang.String unknown, int maxInputChars)
Creates an instance of WordpieceTokenizer.
Parameters:
vocabulary - a DefaultVocabulary used for wordpiece tokenization
unknown - String that represents the unknown token
maxInputChars - maximum number of input characters
Method Detail
-
tokenize
public java.util.List<java.lang.String> tokenize(java.lang.String sentence)
Breaks down the given sentence into a list of tokens that can be represented by embeddings.
Specified by: tokenize in interface Tokenizer
Overrides: tokenize in class SimpleTokenizer
Parameters:
sentence - the sentence to tokenize
Returns:
a List of tokens