public class WordpieceTokenizer extends SimpleTokenizer
Tokenizes text into wordpieces with a greedy longest-match-first algorithm over the given vocabulary. The input text should already be cleaned and preprocessed.
jshell> String input = "unaffable";
jshell> wordpieceTokenizer.tokenize(input);
["un", "##aff", "##able"]
Reference implementation: Google Research Bert Tokenizer
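The greedy longest-match-first strategy described above can be sketched in plain Java. This is a minimal illustration modeled on the reference BERT tokenizer, not the code of this class: the class name `WordpieceSketch`, the use of a `Set<String>` in place of `SimpleVocabulary`, the `"##"` continuation prefix, and the treatment of `maxInputChars` as a per-word limit are all assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical stand-alone sketch; it does not depend on the library.
class WordpieceSketch {

    static List<String> tokenize(
            String text, Set<String> vocab, String unknown, int maxInputChars) {
        List<String> output = new ArrayList<>();
        for (String word : text.trim().split("\\s+")) {
            if (word.isEmpty()) {
                continue;
            }
            // Assumption: over-long words map to the unknown token as a whole.
            if (word.length() > maxInputChars) {
                output.add(unknown);
                continue;
            }
            List<String> pieces = new ArrayList<>();
            boolean matched = true;
            int start = 0;
            while (start < word.length()) {
                int end = word.length();
                String current = null;
                // Try the longest remaining substring first, then shrink it.
                while (start < end) {
                    String candidate = word.substring(start, end);
                    if (start > 0) {
                        candidate = "##" + candidate; // marks a word continuation
                    }
                    if (vocab.contains(candidate)) {
                        current = candidate;
                        break;
                    }
                    end--;
                }
                if (current == null) {
                    // No vocabulary entry covers this position.
                    matched = false;
                    break;
                }
                pieces.add(current);
                start = end;
            }
            if (matched) {
                output.addAll(pieces);
            } else {
                output.add(unknown);
            }
        }
        return output;
    }

    public static void main(String[] args) {
        Set<String> vocab = new HashSet<>(Arrays.asList("un", "##aff", "##able"));
        // Prints [un, ##aff, ##able]
        System.out.println(tokenize("unaffable", vocab, "[UNK]", 200));
    }
}
```

In this sketch a word that cannot be fully covered by vocabulary entries is replaced by the unknown token as a whole, rather than keeping the pieces that did match.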
Constructor and Description |
---|
WordpieceTokenizer(SimpleVocabulary vocabulary, java.lang.String unknown, int maxInputChars) Creates an instance of WordpieceTokenizer. |
Modifier and Type | Method and Description |
---|---|
java.util.List<java.lang.String> | tokenize(java.lang.String sentence) Breaks down the given sentence into a list of tokens that can be represented by embeddings. |
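Methods inherited from class SimpleTokenizer: buildSentence
Methods inherited from class java.lang.Object: clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
Methods inherited from interface Tokenizer: preprocess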
public WordpieceTokenizer(SimpleVocabulary vocabulary, java.lang.String unknown, int maxInputChars)
Creates an instance of WordpieceTokenizer.
Parameters:
vocabulary - a SimpleVocabulary used for wordpiece tokenization
unknown - the String that represents an unknown token
maxInputChars - the maximum number of input characters
public java.util.List<java.lang.String> tokenize(java.lang.String sentence)
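Description copied from class: SimpleTokenizer
Breaks down the given sentence into a list of tokens that can be represented by embeddings.
Specified by: tokenize in interface Tokenizer
Overrides: tokenize in class SimpleTokenizer
Parameters:
sentence - the sentence to tokenize
Returns:
a List of tokens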