Package ai.djl.modality.nlp.bert
Class BertFullTokenizer
java.lang.Object
ai.djl.modality.nlp.preprocess.SimpleTokenizer
ai.djl.modality.nlp.bert.BertTokenizer
ai.djl.modality.nlp.bert.BertFullTokenizer
All Implemented Interfaces:
TextProcessor, Tokenizer
BertFullTokenizer runs end-to-end tokenization of input text.
It first runs basic preprocessors to clean the input text and then runs WordpieceTokenizer
to split the text into word pieces.
Reference implementation: Google Research Bert Tokenizer
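To illustrate the word-piece step described above, here is a minimal, self-contained sketch of greedy longest-match-first splitting in the style of the Google Research reference tokenizer. It does not call DJL: the class WordpieceSketch, its methods, the "##" continuation prefix, the "[UNK]" token, and the tiny vocabulary are all assumptions made for illustration, and the actual WordpieceTokenizer may differ in details.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical illustration only; not part of the DJL API.
final class WordpieceSketch {
    // Assumed placeholder for words that cannot be split into known pieces.
    static final String UNKNOWN = "[UNK]";

    // Splits a single word into word pieces, always taking the longest
    // vocabulary entry that matches at the current position.
    static List<String> splitWord(String word, Set<String> vocab) {
        List<String> pieces = new ArrayList<>();
        int start = 0;
        while (start < word.length()) {
            int end = word.length();
            String match = null;
            while (start < end) {
                String candidate = word.substring(start, end);
                if (start > 0) {
                    // Assumed convention: non-initial pieces carry a "##" prefix.
                    candidate = "##" + candidate;
                }
                if (vocab.contains(candidate)) {
                    match = candidate;
                    break;
                }
                end--;
            }
            if (match == null) {
                // No piece matches at this position, so the whole word is unknown.
                return List.of(UNKNOWN);
            }
            pieces.add(match);
            start = end;
        }
        return pieces;
    }

    // Simplified end-to-end pass: lowercase, split on whitespace, then split each word.
    static List<String> tokenize(String text, Set<String> vocab) {
        List<String> tokens = new ArrayList<>();
        for (String word : text.toLowerCase(Locale.ROOT).trim().split("\\s+")) {
            if (!word.isEmpty()) {
                tokens.addAll(splitWord(word, vocab));
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        Set<String> vocab = Set.of("un", "##aff", "##able", "the", "dog");
        // prints [the, un, ##aff, ##able, dog]
        System.out.println(tokenize("The unaffable dog", vocab));
    }
}
```

The sketch replaces the real preprocessing chain with a plain lowercase-and-whitespace split, so it only shows how a cleaned word is broken into pieces once the vocabulary is known.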
Constructor Summary
Constructors
BertFullTokenizer(Vocabulary vocabulary, boolean lowerCase)
    Creates an instance of BertFullTokenizer.
Method Summary
buildSentence(List<String> tokens)
    Combines a list of tokens to form a sentence.
static List<TextProcessor> getPreprocessors(boolean lowerCase)
    Get a list of TextProcessors to process input text for Bert models.
getVocabulary()
    Returns the Vocabulary used for tokenization.
tokenize(String input)
    Breaks down the given sentence into a list of tokens that can be represented by embeddings.
Methods inherited from class ai.djl.modality.nlp.bert.BertTokenizer
encode, encode, pad, tokenToString
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
Methods inherited from interface ai.djl.modality.nlp.preprocess.Tokenizer
preprocess
Constructor Details
BertFullTokenizer
Creates an instance of BertFullTokenizer.
Parameters:
vocabulary - the BERT vocabulary
lowerCase - whether to convert tokens to lowercase
Method Details
getVocabulary
Returns the Vocabulary used for tokenization.
Returns:
the Vocabulary used for tokenization
tokenize
Breaks down the given sentence into a list of tokens that can be represented by embeddings.
Specified by:
tokenize in interface Tokenizer
Overrides:
tokenize in class BertTokenizer
Parameters:
input - the sentence to tokenize
Returns:
a List of tokens
buildSentence
Combines a list of tokens to form a sentence.
Specified by:
buildSentence in interface Tokenizer
Overrides:
buildSentence in class SimpleTokenizer
Parameters:
tokens - the List of tokens
Returns:
the sentence built from the given tokens
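As a rough illustration of how word pieces can be combined back into a sentence, the sketch below joins tokens with spaces and merges continuation pieces into the preceding token. It is a standalone approximation, not the DJL source: the class BuildSentenceSketch is hypothetical, and it assumes continuation pieces are marked with a "##" prefix.

```java
import java.util.List;

// Hypothetical illustration only; not part of the DJL API.
final class BuildSentenceSketch {
    // Joins tokens with single spaces, then removes the space and the assumed
    // "##" marker in front of every continuation piece.
    static String buildSentence(List<String> tokens) {
        return String.join(" ", tokens).replace(" ##", "").trim();
    }

    public static void main(String[] args) {
        List<String> tokens = List.of("the", "un", "##aff", "##able", "dog");
        // prints the unaffable dog
        System.out.println(buildSentence(tokens));
    }
}
```

Under these assumptions, information removed during preprocessing, such as original casing or spacing around punctuation, is not restored by this step.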
getPreprocessors
Get a list of TextProcessors to process input text for Bert models.
Parameters:
lowerCase - whether to convert input to lowercase
Returns:
List of TextProcessors
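The idea of an ordered list of processors, with lowercasing included only when requested, can be sketched as follows. This page does not list which processors getPreprocessors actually returns, so the three steps here (whitespace splitting, punctuation separation, lowercasing) are assumptions, and the class PreprocessorChainSketch uses plain functions as hypothetical stand-ins for TextProcessor.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.function.UnaryOperator;

// Hypothetical illustration only; not part of the DJL API.
final class PreprocessorChainSketch {
    // Builds the ordered chain; the lowercase step is added only when requested.
    static List<UnaryOperator<List<String>>> processors(boolean lowerCase) {
        List<UnaryOperator<List<String>>> steps = new ArrayList<>();
        steps.add(PreprocessorChainSketch::splitOnWhitespace);
        steps.add(PreprocessorChainSketch::separatePunctuation);
        if (lowerCase) {
            steps.add(PreprocessorChainSketch::toLowerCase);
        }
        return steps;
    }

    // Splits every token on runs of whitespace and drops empty results.
    static List<String> splitOnWhitespace(List<String> tokens) {
        List<String> out = new ArrayList<>();
        for (String token : tokens) {
            for (String word : token.trim().split("\\s+")) {
                if (!word.isEmpty()) {
                    out.add(word);
                }
            }
        }
        return out;
    }

    // Simplification: every character that is not a letter or digit
    // is emitted as its own token.
    static List<String> separatePunctuation(List<String> tokens) {
        List<String> out = new ArrayList<>();
        for (String token : tokens) {
            StringBuilder word = new StringBuilder();
            for (char c : token.toCharArray()) {
                if (Character.isLetterOrDigit(c)) {
                    word.append(c);
                } else {
                    if (word.length() > 0) {
                        out.add(word.toString());
                        word.setLength(0);
                    }
                    out.add(String.valueOf(c));
                }
            }
            if (word.length() > 0) {
                out.add(word.toString());
            }
        }
        return out;
    }

    static List<String> toLowerCase(List<String> tokens) {
        List<String> out = new ArrayList<>();
        for (String token : tokens) {
            out.add(token.toLowerCase(Locale.ROOT));
        }
        return out;
    }

    // Applies each processor in order, feeding the output of one into the next.
    static List<String> run(String text, boolean lowerCase) {
        List<String> tokens = List.of(text);
        for (UnaryOperator<List<String>> step : processors(lowerCase)) {
            tokens = step.apply(tokens);
        }
        return tokens;
    }

    public static void main(String[] args) {
        // prints [hello, ,, world, !]
        System.out.println(run("Hello, World!", true));
    }
}
```

Keeping each step as a separate list element means the lowerCase flag only changes which steps are present, not how the chain is applied.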