Package ai.djl.modality.nlp.bert
Class WordpieceTokenizer
- java.lang.Object
  - ai.djl.modality.nlp.preprocess.SimpleTokenizer
    - ai.djl.modality.nlp.bert.WordpieceTokenizer

All Implemented Interfaces:
TextProcessor, Tokenizer
public class WordpieceTokenizer extends SimpleTokenizer
WordpieceTokenizer tokenizes a piece of text into its word pieces. It uses a greedy longest-match-first algorithm to perform tokenization using the given vocabulary. The input text should already be cleaned and preprocessed.
jshell> String input = "unaffable";
jshell> wordpieceTokenizer.tokenize(input);
["un", "##aff", "##able"]
Reference implementation: Google Research Bert Tokenizer
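
The greedy longest-match-first strategy described above can be sketched in plain Java. This is a minimal illustration under stated assumptions, not the DJL source: the class name WordpieceSketch is hypothetical, the vocabulary is a plain Set<String> rather than a DJL Vocabulary, words are split on whitespace only, and words longer than maxInputChars or words with an unmatched remainder are mapped to the unknown token, as in the Google reference implementation.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Hypothetical standalone sketch; not part of the DJL API.
class WordpieceSketch {

    static List<String> tokenize(
            String text, Set<String> vocab, String unknown, int maxInputChars) {
        List<String> output = new ArrayList<>();
        String trimmed = text.trim();
        if (trimmed.isEmpty()) {
            return output;
        }
        for (String word : trimmed.split("\\s+")) {
            // Assumption: overly long words are replaced by the unknown token.
            if (word.length() > maxInputChars) {
                output.add(unknown);
                continue;
            }
            List<String> pieces = new ArrayList<>();
            boolean matchedAll = true;
            int start = 0;
            while (start < word.length()) {
                int end = word.length();
                String match = null;
                // Try the longest candidate first, then shrink it from the right.
                while (start < end) {
                    String candidate = word.substring(start, end);
                    if (start > 0) {
                        candidate = "##" + candidate; // continuation marker
                    }
                    if (vocab.contains(candidate)) {
                        match = candidate;
                        break;
                    }
                    end--;
                }
                if (match == null) {
                    // No vocabulary entry covers this position.
                    matchedAll = false;
                    break;
                }
                pieces.add(match);
                start = end;
            }
            if (matchedAll) {
                output.addAll(pieces);
            } else {
                output.add(unknown);
            }
        }
        return output;
    }

    public static void main(String[] args) {
        Set<String> vocab = Set.of("un", "##aff", "##able");
        // prints [un, ##aff, ##able]
        System.out.println(tokenize("unaffable", vocab, "[UNK]", 100));
        // prints [un, ##aff, ##able, [UNK]]
        System.out.println(tokenize("unaffable xyz", vocab, "[UNK]", 100));
    }
}
```

Because the longest candidate is tried first at each position, "unaffable" yields "un" rather than a shorter prefix, and every piece after the first carries the "##" continuation marker.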
Constructor Summary
Constructors
Constructor: WordpieceTokenizer(Vocabulary vocabulary, java.lang.String unknown, int maxInputChars)
Description: Creates an instance of WordpieceTokenizer.
Method Summary
Modifier and Type: java.util.List<java.lang.String>
Method: tokenize(java.lang.String sentence)
Description: Breaks down the given sentence into a list of tokens that can be represented by embeddings.
Methods inherited from class ai.djl.modality.nlp.preprocess.SimpleTokenizer
buildSentence
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
Methods inherited from interface ai.djl.modality.nlp.preprocess.Tokenizer
preprocess
Constructor Detail
WordpieceTokenizer
public WordpieceTokenizer(Vocabulary vocabulary, java.lang.String unknown, int maxInputChars)
Creates an instance of WordpieceTokenizer.
Parameters:
vocabulary - a DefaultVocabulary used for wordpiece tokenization
unknown - the String that represents the unknown token
maxInputChars - maximum number of input characters
Method Detail
tokenize
public java.util.List<java.lang.String> tokenize(java.lang.String sentence)
Breaks down the given sentence into a list of tokens that can be represented by embeddings.
Specified by:
tokenize in interface Tokenizer
Overrides:
tokenize in class SimpleTokenizer
Parameters:
sentence - the sentence to tokenize
Returns:
a List of tokens