Package ai.djl.modality.nlp.bert
Class WordpieceTokenizer
java.lang.Object
ai.djl.modality.nlp.preprocess.SimpleTokenizer
ai.djl.modality.nlp.bert.WordpieceTokenizer
- All Implemented Interfaces:
TextProcessor, Tokenizer
WordpieceTokenizer tokenizes a piece of text into its word pieces.
This uses a greedy longest-match-first algorithm to perform tokenization using the given vocabulary. The input text should already be cleaned and preprocessed.
jshell> String input = "unaffable";
jshell> wordpieceTokenizer.tokenize(input);
["un", "##aff", "##able"]
Reference implementation: Google Research Bert Tokenizer
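The greedy longest-match-first strategy described above can be sketched in plain Java. This is a minimal illustration modelled on the reference BERT tokenizer, not the DJL source: the class name WordpieceSketch and its static tokenize helper are hypothetical, the vocabulary is a plain Set<String> rather than a DJL Vocabulary, and the handling of over-long or unmatchable words (replacing the whole word with the unknown token) is an assumption taken from the reference implementation.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch class; not part of DJL.
final class WordpieceSketch {

    // Splits the text on whitespace, then applies greedy longest-match-first
    // wordpiece matching to each word against the given vocabulary.
    static List<String> tokenize(
            String text, Set<String> vocab, String unknown, int maxInputChars) {
        List<String> output = new ArrayList<>();
        String trimmed = text.trim();
        if (trimmed.isEmpty()) {
            return output;
        }
        for (String word : trimmed.split("\\s+")) {
            // Assumption: words longer than the limit become the unknown token.
            if (word.length() > maxInputChars) {
                output.add(unknown);
                continue;
            }
            List<String> pieces = new ArrayList<>();
            boolean matchedAll = true;
            int start = 0;
            while (start < word.length()) {
                // Try the longest remaining substring first, then shrink it.
                int end = word.length();
                String match = null;
                while (start < end) {
                    String candidate = word.substring(start, end);
                    if (start > 0) {
                        // Pieces that continue a word carry the "##" prefix.
                        candidate = "##" + candidate;
                    }
                    if (vocab.contains(candidate)) {
                        match = candidate;
                        break;
                    }
                    end--;
                }
                if (match == null) {
                    // No vocabulary entry covers this position.
                    matchedAll = false;
                    break;
                }
                pieces.add(match);
                start = end;
            }
            if (matchedAll) {
                output.addAll(pieces);
            } else {
                // Assumption: an unmatchable word becomes one unknown token.
                output.add(unknown);
            }
        }
        return output;
    }

    public static void main(String[] args) {
        Set<String> vocab =
                new HashSet<>(Arrays.asList("un", "##aff", "##able", "[UNK]"));
        // Prints [un, ##aff, ##able]
        System.out.println(tokenize("unaffable", vocab, "[UNK]", 200));
        // Prints [un, ##aff, ##able, [UNK]]
        System.out.println(tokenize("unaffable xyz", vocab, "[UNK]", 200));
    }
}
```

Because the match is greedy, the longest prefix found in the vocabulary always wins, so the result depends on the vocabulary contents and is not guaranteed to be the split with the fewest pieces.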
-
Constructor Summary
Constructors
Constructor: WordpieceTokenizer(Vocabulary vocabulary, String unknown, int maxInputChars)
Description: Creates an instance of WordpieceTokenizer.
-
Method Summary
Methods inherited from class ai.djl.modality.nlp.preprocess.SimpleTokenizer
buildSentence
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
Methods inherited from interface ai.djl.modality.nlp.preprocess.Tokenizer
preprocess
-
Constructor Details
-
WordpieceTokenizer
Creates an instance of WordpieceTokenizer.
- Parameters:
vocabulary - a DefaultVocabulary used for wordpiece tokenization
unknown - String that represents the unknown token
maxInputChars - maximum number of input characters
-
-
Method Details
-
tokenize
Breaks down the given sentence into a list of tokens that can be represented by embeddings.
- Specified by:
tokenize in interface Tokenizer
- Overrides:
tokenize in class SimpleTokenizer
- Parameters:
sentence - the sentence to tokenize
- Returns:
a List of tokens
-