Package ai.djl.modality.nlp.bert
Class WordpieceTokenizer
java.lang.Object
ai.djl.modality.nlp.preprocess.SimpleTokenizer
ai.djl.modality.nlp.bert.WordpieceTokenizer
- All Implemented Interfaces:
TextProcessor, Tokenizer
WordpieceTokenizer tokenizes a piece of text into its word pieces.
This uses a greedy longest-match-first algorithm to perform tokenization using the given vocabulary. The input text should already be cleaned and preprocessed.
jshell> String input = "unaffable";
jshell> wordpieceTokenizer.tokenize(input);
["un", "##aff", "##able"]
Reference implementation: Google Research Bert Tokenizer
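The greedy longest-match-first strategy described above can be illustrated with a small standalone sketch. This is not the DJL source: the class name WordpieceSketch is hypothetical, a plain Set<String> stands in for the Vocabulary, and the sketch assumes (following the Google Research reference tokenizer) that non-initial pieces carry a "##" prefix, that input is split on whitespace, and that maxInputChars is applied per word.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Hypothetical illustration class; not part of DJL.
final class WordpieceSketch {

    // Splits each whitespace-separated word into the longest vocabulary pieces, left to right.
    static List<String> tokenize(String text, Set<String> vocab, String unknown, int maxInputChars) {
        List<String> output = new ArrayList<>();
        for (String word : text.trim().split("\\s+")) {
            if (word.isEmpty()) {
                continue; // blank input produces no tokens
            }
            if (word.length() > maxInputChars) {
                output.add(unknown); // assumption: over-long words map to the unknown token
                continue;
            }
            List<String> pieces = new ArrayList<>();
            boolean matchedAll = true;
            int start = 0;
            while (start < word.length()) {
                int end = word.length();
                String match = null;
                // Try the longest remaining substring first, shrinking from the right.
                while (start < end) {
                    String candidate = word.substring(start, end);
                    if (start > 0) {
                        candidate = "##" + candidate; // continuation pieces are prefixed
                    }
                    if (vocab.contains(candidate)) {
                        match = candidate;
                        break;
                    }
                    end--;
                }
                if (match == null) {
                    matchedAll = false; // no piece fits here, so the whole word is unknown
                    break;
                }
                pieces.add(match);
                start = end;
            }
            if (matchedAll) {
                output.addAll(pieces);
            } else {
                output.add(unknown);
            }
        }
        return output;
    }

    public static void main(String[] args) {
        Set<String> vocab = Set.of("un", "##aff", "##able");
        System.out.println(tokenize("unaffable", vocab, "[UNK]", 200)); // prints [un, ##aff, ##able]
    }
}
```

Because matching is greedy, a word is emitted as the unknown token as soon as any position has no matching piece, even if a different split of the earlier characters might have succeeded.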
Constructor Summary
Constructors
WordpieceTokenizer(Vocabulary vocabulary, String unknown, int maxInputChars): Creates an instance of WordpieceTokenizer.
Method Summary
Methods inherited from class ai.djl.modality.nlp.preprocess.SimpleTokenizer
buildSentence
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
Methods inherited from interface ai.djl.modality.nlp.preprocess.Tokenizer
preprocess
Constructor Details
WordpieceTokenizer
Creates an instance of WordpieceTokenizer.
- Parameters:
vocabulary - a DefaultVocabulary used for wordpiece tokenization
unknown - String that represents the unknown token
maxInputChars - maximum number of input characters
Method Details
tokenize
Breaks down the given sentence into a list of tokens that can be represented by embeddings.
- Specified by:
tokenize in interface Tokenizer
- Overrides:
tokenize in class SimpleTokenizer
- Parameters:
sentence - the sentence to tokenize
- Returns:
a List of tokens