public class WordpieceTokenizer extends SimpleTokenizer
This tokenizer uses a greedy longest-match-first algorithm to split text into word pieces from the given vocabulary. The input text should already be cleaned and preprocessed.
jshell> String input = "unaffable";
jshell> wordpieceTokenizer.tokenize(input);
["un", "##aff", "##able"]
Reference implementation: Google Research Bert Tokenizer
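
The greedy longest-match-first strategy described above can be illustrated with a minimal, self-contained sketch. This is not the library's implementation: the class `WordpieceSketch`, the helper `tokenizeWord`, the plain `Set<String>` vocabulary, and the `[UNK]` token are all hypothetical names chosen for illustration. Treating a word longer than `maxInputChars` as unknown is an assumption based on the behavior of the reference BERT tokenizer.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class WordpieceSketch {

    // Hypothetical helper: splits a single, already-cleaned word into word pieces.
    static List<String> tokenizeWord(
            String word, Set<String> vocab, String unknown, int maxInputChars) {
        List<String> pieces = new ArrayList<>();
        // Assumption: overly long words map directly to the unknown token.
        if (word.length() > maxInputChars) {
            pieces.add(unknown);
            return pieces;
        }
        int start = 0;
        while (start < word.length()) {
            int end = word.length();
            String match = null;
            // Try the longest remaining substring first, then shrink from the right.
            while (start < end) {
                String candidate = word.substring(start, end);
                if (start > 0) {
                    // Pieces that continue a word carry the "##" prefix.
                    candidate = "##" + candidate;
                }
                if (vocab.contains(candidate)) {
                    match = candidate;
                    break;
                }
                end--;
            }
            if (match == null) {
                // No vocabulary entry fits at this position: the whole word is unknown.
                pieces.clear();
                pieces.add(unknown);
                return pieces;
            }
            pieces.add(match);
            start = end;
        }
        return pieces;
    }

    public static void main(String[] args) {
        Set<String> vocab = new HashSet<>(Arrays.asList("un", "##aff", "##able"));
        // Prints [un, ##aff, ##able]
        System.out.println(tokenizeWord("unaffable", vocab, "[UNK]", 200));
    }
}
```

Because the longest candidate is tried first at each position, "unaffable" is split at "un" only after every longer prefix has failed to match the vocabulary.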
| Constructor and Description |
|---|
| WordpieceTokenizer(DefaultVocabulary vocabulary, java.lang.String unknown, int maxInputChars) Creates an instance of WordpieceTokenizer. |
| Modifier and Type | Method and Description |
|---|---|
| java.util.List<java.lang.String> | tokenize(java.lang.String sentence) Breaks down the given sentence into a list of tokens that can be represented by embeddings. |

Inherited methods: buildSentence, preprocess

Methods inherited from class java.lang.Object: clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait

public WordpieceTokenizer(DefaultVocabulary vocabulary, java.lang.String unknown, int maxInputChars)

Creates an instance of WordpieceTokenizer.

Parameters:
- vocabulary - a DefaultVocabulary used for wordpiece tokenization
- unknown - the String that represents the unknown token
- maxInputChars - the maximum number of input characters

public java.util.List<java.lang.String> tokenize(java.lang.String sentence)

Specified by: tokenize in interface Tokenizer

Overrides: tokenize in class SimpleTokenizer

Parameters:
- sentence - the sentence to tokenize

Returns: a List of tokens