Package ai.djl.modality.nlp.bert
Class BertFullTokenizer
- java.lang.Object
  - ai.djl.modality.nlp.preprocess.SimpleTokenizer
    - ai.djl.modality.nlp.bert.BertTokenizer
      - ai.djl.modality.nlp.bert.BertFullTokenizer
- All Implemented Interfaces:
TextProcessor, Tokenizer
public class BertFullTokenizer extends BertTokenizer
BertFullTokenizer runs end-to-end tokenization of input text. It first runs basic preprocessors to clean the input text, then runs WordpieceTokenizer to split the text into word pieces.
Reference implementation: Google Research Bert Tokenizer
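The word-piece step described above can be illustrated with a minimal plain-Java sketch of greedy longest-match-first splitting, the strategy used by the Google Research reference tokenizer. The class `WordPieceSketch`, its `split` method, and the tiny vocabulary are hypothetical names introduced for illustration; this is not DJL's `WordpieceTokenizer`, whose details may differ.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

/** Hypothetical, simplified word-piece splitter; not DJL's WordpieceTokenizer. */
final class WordPieceSketch {
    private WordPieceSketch() {}

    /** Splits one word into the longest vocabulary pieces, scanning left to right. */
    static List<String> split(String word, Set<String> vocab, String unknown) {
        List<String> pieces = new ArrayList<>();
        int start = 0;
        while (start < word.length()) {
            int end = word.length();
            String match = null;
            // Shrink the candidate from the right until it is found in the vocabulary.
            while (start < end) {
                String candidate = word.substring(start, end);
                if (start > 0) {
                    // Pieces that continue a word carry the "##" prefix.
                    candidate = "##" + candidate;
                }
                if (vocab.contains(candidate)) {
                    match = candidate;
                    break;
                }
                end--;
            }
            if (match == null) {
                // No piece fits at this position: the whole word becomes the unknown token.
                return List.of(unknown);
            }
            pieces.add(match);
            start = end;
        }
        return pieces;
    }

    public static void main(String[] args) {
        // Assumed toy vocabulary, for illustration only.
        Set<String> vocab = Set.of("un", "##aff", "##able", "play", "##ing");
        System.out.println(split("unaffable", vocab, "[UNK]"));
        System.out.println(split("playing", vocab, "[UNK]"));
        System.out.println(split("xyz", vocab, "[UNK]"));
    }
}
```

Matching the longest piece first keeps frequent whole words intact and falls back to smaller pieces only when the vocabulary has no entry for the longer span.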
Constructor Summary
- BertFullTokenizer(Vocabulary vocabulary, boolean lowerCase): Creates an instance of BertFullTokenizer.
-
Method Summary
Modifier and Type, Method, Description:
- java.lang.String buildSentence(java.util.List<java.lang.String> tokens): Combines a list of tokens to form a sentence.
- static java.util.List<TextProcessor> getPreprocessors(boolean lowerCase): Gets a list of TextProcessors to process input text for BERT models.
- Vocabulary getVocabulary(): Returns the Vocabulary used for tokenization.
- java.util.List<java.lang.String> tokenize(java.lang.String input): Breaks down the given sentence into a list of tokens that can be represented by embeddings.
-
Methods inherited from class ai.djl.modality.nlp.bert.BertTokenizer
encode, encode, pad, tokenToString
-
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
-
Methods inherited from interface ai.djl.modality.nlp.preprocess.Tokenizer
preprocess
Constructor Detail
-
BertFullTokenizer
public BertFullTokenizer(Vocabulary vocabulary, boolean lowerCase)
Creates an instance of BertFullTokenizer.
Parameters:
vocabulary - the BERT vocabulary
lowerCase - whether to convert tokens to lowercase
Method Detail
-
getVocabulary
public Vocabulary getVocabulary()
Returns the Vocabulary used for tokenization.
Returns:
the Vocabulary used for tokenization
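To show why a tokenizer carries a vocabulary, here is a minimal sketch, assuming a vocabulary is simply an ordered token list with a designated unknown token. `VocabLookupSketch` and `toIndices` are hypothetical names for illustration and do not reproduce DJL's `Vocabulary` interface.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical token-to-index table; DJL's Vocabulary interface is not used here. */
final class VocabLookupSketch {
    private final Map<String, Long> indices = new HashMap<>();
    private final long unknownIndex;

    VocabLookupSketch(List<String> tokens, String unknownToken) {
        for (int i = 0; i < tokens.size(); i++) {
            // A token's index is its position in the list; the first occurrence wins.
            indices.putIfAbsent(tokens.get(i), (long) i);
        }
        Long unknown = indices.get(unknownToken);
        if (unknown == null) {
            throw new IllegalArgumentException("unknown token is not in the vocabulary");
        }
        unknownIndex = unknown;
    }

    /** Maps each token to its index, falling back to the unknown token's index. */
    long[] toIndices(List<String> tokens) {
        long[] ids = new long[tokens.size()];
        for (int i = 0; i < ids.length; i++) {
            ids[i] = indices.getOrDefault(tokens.get(i), unknownIndex);
        }
        return ids;
    }
}
```

With the assumed vocabulary `["[UNK]", "hello", "##s"]`, the tokens `hello`, `##s`, `zebra` map to the indices 1, 2, 0.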
-
tokenize
public java.util.List<java.lang.String> tokenize(java.lang.String input)
Breaks down the given sentence into a list of tokens that can be represented by embeddings.
Specified by:
tokenize in interface Tokenizer
Overrides:
tokenize in class BertTokenizer
Parameters:
input - the sentence to tokenize
Returns:
a List of tokens
-
buildSentence
public java.lang.String buildSentence(java.util.List<java.lang.String> tokens)
Combines a list of tokens to form a sentence.
Specified by:
buildSentence in interface Tokenizer
Overrides:
buildSentence in class SimpleTokenizer
Parameters:
tokens - the List of tokens
Returns:
the sentence built from the given tokens
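As an illustration of combining word-piece tokens back into a sentence, the sketch below assumes the usual BERT convention that a `##` prefix marks a piece continuing the previous token. `BuildSentenceSketch.join` is a hypothetical helper, not DJL's implementation, whose exact spacing rules may differ.

```java
import java.util.List;

/** Hypothetical sentence builder for word-piece tokens; not DJL's buildSentence. */
final class BuildSentenceSketch {
    private BuildSentenceSketch() {}

    /** Joins tokens with spaces and glues "##" continuation pieces to the previous token. */
    static String join(List<String> tokens) {
        StringBuilder sentence = new StringBuilder();
        for (String token : tokens) {
            if (token.startsWith("##")) {
                // Continuation piece: drop the marker and append without a space.
                sentence.append(token.substring(2));
            } else {
                if (sentence.length() > 0) {
                    sentence.append(' ');
                }
                sentence.append(token);
            }
        }
        return sentence.toString();
    }
}
```

For the tokens `un`, `##aff`, `##able`, `play`, `##ing`, this sketch returns `unaffable playing`. Original casing and punctuation spacing are not recovered, since tokenization discards them.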
-
getPreprocessors
public static java.util.List<TextProcessor> getPreprocessors(boolean lowerCase)
Gets a list of TextProcessors to process input text for BERT models.
Parameters:
lowerCase - whether to convert input to lowercase
Returns:
a List of TextProcessors
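The idea of a preprocessor list controlled by a `lowerCase` flag can be sketched in plain Java as a chain of token-list transformations applied in order. The steps chosen here (whitespace splitting, optional lowercasing, punctuation separation) and the names `PreprocessorChainSketch`, `processors`, and `run` are assumptions for illustration; the processors DJL actually returns may differ.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.function.UnaryOperator;

/** Hypothetical preprocessing chain; not the list returned by DJL's getPreprocessors. */
final class PreprocessorChainSketch {
    private PreprocessorChainSketch() {}

    /** Builds the ordered chain; lowercasing is included only when requested. */
    static List<UnaryOperator<List<String>>> processors(boolean lowerCase) {
        List<UnaryOperator<List<String>>> chain = new ArrayList<>();
        chain.add(PreprocessorChainSketch::splitOnWhitespace);
        if (lowerCase) {
            chain.add(PreprocessorChainSketch::toLowerCase);
        }
        chain.add(PreprocessorChainSketch::separatePunctuation);
        return chain;
    }

    /** Feeds the text through every processor in order. */
    static List<String> run(String text, boolean lowerCase) {
        List<String> tokens = List.of(text);
        for (UnaryOperator<List<String>> processor : processors(lowerCase)) {
            tokens = processor.apply(tokens);
        }
        return tokens;
    }

    private static List<String> splitOnWhitespace(List<String> tokens) {
        List<String> out = new ArrayList<>();
        for (String token : tokens) {
            for (String part : token.trim().split("\\s+")) {
                if (!part.isEmpty()) {
                    out.add(part);
                }
            }
        }
        return out;
    }

    private static List<String> toLowerCase(List<String> tokens) {
        List<String> out = new ArrayList<>();
        for (String token : tokens) {
            out.add(token.toLowerCase(Locale.ROOT));
        }
        return out;
    }

    /** Simplification: every character that is not a letter or digit becomes its own token. */
    private static List<String> separatePunctuation(List<String> tokens) {
        List<String> out = new ArrayList<>();
        for (String token : tokens) {
            StringBuilder word = new StringBuilder();
            for (int i = 0; i < token.length(); i++) {
                char c = token.charAt(i);
                if (Character.isLetterOrDigit(c)) {
                    word.append(c);
                } else {
                    if (word.length() > 0) {
                        out.add(word.toString());
                        word.setLength(0);
                    }
                    out.add(String.valueOf(c));
                }
            }
            if (word.length() > 0) {
                out.add(word.toString());
            }
        }
        return out;
    }
}
```

Calling `PreprocessorChainSketch.run("Hello, World!", true)` yields the tokens `hello`, `,`, `world`, `!`; with `lowerCase` set to `false` the original casing is kept.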