public class BertFullTokenizer extends SimpleTokenizer
BertFullTokenizer first runs basic preprocessors to clean the input text, then runs WordpieceTokenizer
to split the cleaned text into word pieces.
Reference implementation: Google Research Bert Tokenizer
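The word-piece step can be illustrated with a minimal, self-contained sketch. It assumes the greedy longest-match-first scheme of the Google Research reference tokenizer; whether WordpieceTokenizer matches it in every detail is not stated on this page. The class name WordPieceSketch, the "##" continuation marker, the "[UNK]" fallback token, and the tiny vocabulary are all illustrative assumptions, not part of this API.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Hypothetical illustration only; this is not the library's WordpieceTokenizer.
final class WordPieceSketch {
    // Assumed BERT-style conventions: continuation marker and unknown-word token.
    static final String CONTINUATION = "##";
    static final String UNKNOWN = "[UNK]";

    /** Splits one already-cleaned word into the longest vocabulary pieces, left to right. */
    static List<String> splitWord(String word, Set<String> vocab) {
        List<String> pieces = new ArrayList<>();
        int start = 0;
        while (start < word.length()) {
            int end = word.length();
            String match = null;
            // Shrink the candidate from the right until it is found in the vocabulary.
            while (start < end) {
                String candidate = word.substring(start, end);
                if (start > 0) {
                    candidate = CONTINUATION + candidate;
                }
                if (vocab.contains(candidate)) {
                    match = candidate;
                    break;
                }
                end--;
            }
            if (match == null) {
                // No piece fits at this position, so the whole word becomes the unknown token.
                return List.of(UNKNOWN);
            }
            pieces.add(match);
            start = end;
        }
        return pieces;
    }

    /** Splits text on whitespace and applies splitWord to every word. */
    static List<String> tokenize(String text, Set<String> vocab) {
        List<String> tokens = new ArrayList<>();
        for (String word : text.trim().split("\\s+")) {
            if (!word.isEmpty()) {
                tokens.addAll(splitWord(word, vocab));
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        Set<String> vocab = Set.of("un", "##aff", "##able", "play", "##ing");
        System.out.println(tokenize("unaffable playing xyz", vocab));
        // prints [un, ##aff, ##able, play, ##ing, [UNK]]
    }
}
```

In this sketch a word with no matching piece collapses to a single unknown token instead of being partially split, which keeps the output aligned with the input words.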
Constructor and Description |
---|
BertFullTokenizer(java.lang.String filepath, boolean lowerCase) Creates an instance of BertFullTokenizer. |
Modifier and Type | Method and Description |
---|---|
static java.util.List<TextProcessor> | getPreprocessors(boolean lowerCase) Get a list of TextProcessors to process input text for Bert models. |
SimpleVocabulary | getVocabulary() Returns the SimpleVocabulary used for tokenization. |
java.util.List<java.lang.String> | tokenize(java.lang.String input) Breaks down the given sentence into a list of tokens that can be represented by embeddings. |
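Methods inherited from class SimpleTokenizer: buildSentence
Methods inherited from class java.lang.Object: clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
Methods inherited from an implemented interface: preprocess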
public BertFullTokenizer(java.lang.String filepath, boolean lowerCase)
Creates an instance of BertFullTokenizer.
Parameters:
filepath - the path to vocabulary file
lowerCase - whether to convert tokens to lowercase

public SimpleVocabulary getVocabulary()
Returns the SimpleVocabulary used for tokenization.
Returns:
the SimpleVocabulary used for tokenization

public java.util.List<java.lang.String> tokenize(java.lang.String input)
Breaks down the given sentence into a list of tokens that can be represented by embeddings.
Specified by:
tokenize in interface Tokenizer
Overrides:
tokenize in class SimpleTokenizer
Parameters:
input - the sentence to tokenize
Returns:
List of tokens

public static java.util.List<TextProcessor> getPreprocessors(boolean lowerCase)
Get a list of TextProcessors to process input text for Bert models.
Parameters:
lowerCase - whether to convert input to lowercase
Returns:
TextProcessors
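
The idea of a processor list that is applied in order, with lower-casing included only when requested, can be sketched as follows. This is a hedged illustration of the pattern, not the library's code: the class PreprocessorChainSketch, the method sketchPreprocessors, and the two processors chosen here (punctuation splitting and lower-casing) are assumptions, and the processors actually returned by getPreprocessors may differ.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.function.UnaryOperator;

// Hypothetical stand-in for a list of text processors; not the TextProcessor API itself.
final class PreprocessorChainSketch {
    /** Builds the processor list; lower-casing is appended only when lowerCase is true. */
    static List<UnaryOperator<List<String>>> sketchPreprocessors(boolean lowerCase) {
        List<UnaryOperator<List<String>>> processors = new ArrayList<>();
        processors.add(PreprocessorChainSketch::splitPunctuation);
        if (lowerCase) {
            processors.add(PreprocessorChainSketch::toLowerCase);
        }
        return processors;
    }

    /** Separates every non-alphanumeric character into its own token. */
    static List<String> splitPunctuation(List<String> tokens) {
        List<String> out = new ArrayList<>();
        for (String token : tokens) {
            StringBuilder word = new StringBuilder();
            for (char c : token.toCharArray()) {
                if (Character.isLetterOrDigit(c)) {
                    word.append(c);
                } else {
                    if (word.length() > 0) {
                        out.add(word.toString());
                        word.setLength(0);
                    }
                    out.add(String.valueOf(c));
                }
            }
            if (word.length() > 0) {
                out.add(word.toString());
            }
        }
        return out;
    }

    /** Lower-cases every token using a locale-independent rule. */
    static List<String> toLowerCase(List<String> tokens) {
        List<String> out = new ArrayList<>();
        for (String token : tokens) {
            out.add(token.toLowerCase(Locale.ROOT));
        }
        return out;
    }

    /** Applies each processor in list order, feeding the output of one into the next. */
    static List<String> preprocess(List<String> tokens, boolean lowerCase) {
        List<String> current = tokens;
        for (UnaryOperator<List<String>> processor : sketchPreprocessors(lowerCase)) {
            current = processor.apply(current);
        }
        return current;
    }
}
```

With this sketch, preprocess(List.of("Hello,", "World!"), true) yields the tokens hello, a comma, world, and an exclamation mark, while passing false keeps the original capitalization.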