public class BertFullTokenizer extends BertTokenizer
Runs basic preprocessors to clean the input text, then runs WordpieceTokenizer
to split the cleaned text into word pieces.
Reference implementation: Google Research Bert Tokenizer
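The word-piece step described above can be illustrated with a minimal, self-contained sketch. It assumes the greedy longest-match-first strategy and the "##" continuation prefix used by the reference BERT tokenizer. The class WordPieceSketch and its methods split and join are hypothetical names for illustration only; they are not part of this library, and the join helper shows just one plausible way to turn pieces back into text.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

/** Illustrative only: a hypothetical sketch, not the WordpieceTokenizer source. */
final class WordPieceSketch {

    /** Splits one already-cleaned word into pieces by greedy longest-match-first lookup. */
    static List<String> split(String word, Set<String> vocab, String unknown) {
        List<String> pieces = new ArrayList<>();
        int start = 0;
        while (start < word.length()) {
            int end = word.length();
            String match = null;
            // Shrink the candidate from the right until it is found in the vocabulary.
            while (end > start) {
                String candidate = word.substring(start, end);
                if (start > 0) {
                    candidate = "##" + candidate; // continuation pieces carry the "##" prefix
                }
                if (vocab.contains(candidate)) {
                    match = candidate;
                    break;
                }
                end--;
            }
            if (match == null) {
                // Nothing matches at this position: the whole word becomes the unknown token.
                return List.of(unknown);
            }
            pieces.add(match);
            start = end;
        }
        return pieces;
    }

    /** Assumption: pieces are rejoined by dropping the "##" markers between them. */
    static String join(List<String> pieces) {
        return String.join(" ", pieces).replace(" ##", "").trim();
    }

    public static void main(String[] args) {
        Set<String> vocab = Set.of("un", "##aff", "##able", "hello");
        List<String> pieces = split("unaffable", vocab, "[UNK]");
        System.out.println(pieces);       // prints [un, ##aff, ##able]
        System.out.println(join(pieces)); // prints unaffable
    }
}
```

A word with no matching piece at some position, such as "xyz" with the vocabulary above, collapses to the single unknown token in this sketch.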
Constructor and Description |
---|
BertFullTokenizer(Vocabulary vocabulary, boolean lowerCase): Creates an instance of BertFullTokenizer. |
Modifier and Type | Method and Description |
---|---|
static java.util.List<TextProcessor> | getPreprocessors(boolean lowerCase): Get a list of TextProcessors to process input text for Bert models. |
Vocabulary | getVocabulary(): Returns the Vocabulary used for tokenization. |
java.util.List<java.lang.String> | tokenize(java.lang.String input): Breaks down the given sentence into a list of tokens that can be represented by embeddings. |
java.lang.String | tokenToString(java.util.List<java.lang.String> tokens): Returns a string representation of the tokens. |
encode, encode, pad
buildSentence
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
preprocess
public BertFullTokenizer(Vocabulary vocabulary, boolean lowerCase)
Creates an instance of BertFullTokenizer.
Parameters:
vocabulary - the BERT vocabulary
lowerCase - whether to convert tokens to lowercase

public Vocabulary getVocabulary()
Returns the Vocabulary used for tokenization.
Returns:
the Vocabulary used for tokenization

public java.util.List<java.lang.String> tokenize(java.lang.String input)
Specified by:
tokenize in interface Tokenizer
Overrides:
tokenize in class BertTokenizer
Parameters:
input - the sentence to tokenize
Returns:
List of tokens

public java.lang.String tokenToString(java.util.List<java.lang.String> tokens)
Overrides:
tokenToString in class BertTokenizer
Parameters:
tokens - a list of tokens

public static java.util.List<TextProcessor> getPreprocessors(boolean lowerCase)
Get a list of TextProcessors to process input text for Bert models.
Parameters:
lowerCase - whether to convert input to lowercase
Returns:
a list of TextProcessors
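
To make the idea of a preprocessor list with an optional lowercasing step concrete, here is a rough, self-contained sketch. It is not the list that getPreprocessors returns: the class PreprocessSketch, its methods, and the choice of steps (punctuation splitting followed by optional lowercasing) are assumptions for illustration, loosely modeled on the basic cleaning done by the reference BERT tokenizer.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.function.UnaryOperator;

/** Illustrative only: a hypothetical stand-in for a list of TextProcessors. */
final class PreprocessSketch {

    /** Builds a chain of token-list transformations; lowercasing is added only on request. */
    static List<UnaryOperator<List<String>>> processors(boolean lowerCase) {
        List<UnaryOperator<List<String>>> chain = new ArrayList<>();
        chain.add(PreprocessSketch::splitPunctuation);
        if (lowerCase) {
            chain.add(tokens -> {
                List<String> out = new ArrayList<>();
                for (String token : tokens) {
                    out.add(token.toLowerCase(Locale.ROOT));
                }
                return out;
            });
        }
        return chain;
    }

    /** Emits every character that is not a letter or digit as its own token. */
    static List<String> splitPunctuation(List<String> tokens) {
        List<String> out = new ArrayList<>();
        for (String token : tokens) {
            StringBuilder word = new StringBuilder();
            for (char c : token.toCharArray()) {
                if (Character.isLetterOrDigit(c)) {
                    word.append(c);
                } else {
                    if (word.length() > 0) {
                        out.add(word.toString());
                        word.setLength(0);
                    }
                    out.add(String.valueOf(c));
                }
            }
            if (word.length() > 0) {
                out.add(word.toString());
            }
        }
        return out;
    }

    /** Splits on whitespace, then applies every processor in order. */
    static List<String> run(String text, boolean lowerCase) {
        List<String> tokens = new ArrayList<>();
        for (String token : text.trim().split("\\s+")) {
            if (!token.isEmpty()) {
                tokens.add(token);
            }
        }
        for (UnaryOperator<List<String>> processor : processors(lowerCase)) {
            tokens = processor.apply(tokens);
        }
        return tokens;
    }

    public static void main(String[] args) {
        // With lowerCase = true the tokens are: hello , world !
        System.out.println(run("Hello, World!", true));
    }
}
```

Returning the steps as a list, rather than hard-coding them in one method, lets a caller inspect, reorder, or extend the chain before applying it, which is the design that a static factory such as getPreprocessors suggests.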