Package ai.djl.modality.nlp.preprocess
Interface Tokenizer
- All Superinterfaces:
TextProcessor
- All Known Implementing Classes:
BertFullTokenizer, BertTokenizer, SimpleTokenizer, WordpieceTokenizer
public interface Tokenizer extends TextProcessor
The Tokenizer interface provides the ability to break down sentences into embeddable tokens.
Method Summary
Modifier and Type, Method, Description:
- java.lang.String buildSentence(java.util.List<java.lang.String> tokens)
Combines a list of tokens to form a sentence.
- default java.util.List<java.lang.String> preprocess(java.util.List<java.lang.String> tokens)
Applies the preprocessing defined to the given input tokens.
- java.util.List<java.lang.String> tokenize(java.lang.String sentence)
Breaks down the given sentence into a list of tokens that can be represented by embeddings.
Method Detail
preprocess
default java.util.List<java.lang.String> preprocess(java.util.List<java.lang.String> tokens)
Applies the preprocessing defined to the given input tokens.
- Specified by:
preprocess in interface TextProcessor
- Parameters:
tokens - the tokens created after the input text is tokenized
- Returns:
the preprocessed tokens
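Because preprocess is a default method that takes and returns a list of strings, one plausible reading is that it runs tokenize over each input string and concatenates the results. The page does not state how the default is implemented, so the sketch below is an assumption. SketchTokenizer and SpaceTokenizer are hypothetical local stand-ins that only mirror the three documented signatures; they are not DJL classes.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class PreprocessSketch {

    /** Hypothetical stand-in mirroring the three documented method signatures. */
    interface SketchTokenizer {
        List<String> tokenize(String sentence);

        String buildSentence(List<String> tokens);

        // Assumption: the default tokenizes every input string and
        // flattens the per-string results into a single list.
        default List<String> preprocess(List<String> tokens) {
            List<String> result = new ArrayList<>();
            for (String text : tokens) {
                result.addAll(tokenize(text));
            }
            return result;
        }
    }

    /** Hypothetical implementation that splits on single spaces. */
    static class SpaceTokenizer implements SketchTokenizer {
        @Override
        public List<String> tokenize(String sentence) {
            return Arrays.asList(sentence.split(" "));
        }

        @Override
        public String buildSentence(List<String> tokens) {
            return String.join(" ", tokens);
        }
    }

    public static void main(String[] args) {
        SketchTokenizer tokenizer = new SpaceTokenizer();
        List<String> input = Arrays.asList("Hello world", "deep java library");
        // prints [Hello, world, deep, java, library]
        System.out.println(tokenizer.preprocess(input));
    }
}
```

Under this reading, an implementer only supplies tokenize and buildSentence, and inherits a preprocess that can sit in a chain of TextProcessor steps.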
tokenize
java.util.List<java.lang.String> tokenize(java.lang.String sentence)
Breaks down the given sentence into a list of tokens that can be represented by embeddings.
- Parameters:
sentence - the sentence to tokenize
- Returns:
a List of tokens
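The contract only says that a sentence goes in and a list of tokens comes out; the splitting rule is left to each implementation. As a minimal sketch of one possible rule, the hypothetical helper below splits on runs of whitespace and drops empty pieces. It is an illustration only and does not reproduce the behavior of any of the implementing classes listed above.

```java
import java.util.ArrayList;
import java.util.List;

class TokenizeSketch {

    /** Hypothetical helper: splits on runs of whitespace, skipping empty pieces. */
    static List<String> tokenize(String sentence) {
        List<String> tokens = new ArrayList<>();
        for (String piece : sentence.trim().split("\\s+")) {
            // Splitting an empty string yields one empty piece; skip it.
            if (!piece.isEmpty()) {
                tokens.add(piece);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // prints [DJL, breaks, text, into, tokens]
        System.out.println(tokenize("  DJL  breaks text\tinto tokens "));
        // prints []
        System.out.println(tokenize("   "));
    }
}
```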
buildSentence
java.lang.String buildSentence(java.util.List<java.lang.String> tokens)
Combines a list of tokens to form a sentence.
- Parameters:
tokens - the List of tokens
- Returns:
the sentence built from the given tokens
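buildSentence is the inverse direction of tokenize, but the round trip is only lossless when the splitting rule is reversible. The sketch below pairs a hypothetical whitespace split with a single-space join to show this; both static methods are illustrative assumptions, not DJL code.

```java
import java.util.Arrays;
import java.util.List;

class BuildSentenceSketch {

    /** Hypothetical splitting rule: break on runs of whitespace. */
    static List<String> tokenize(String sentence) {
        return Arrays.asList(sentence.trim().split("\\s+"));
    }

    /** Hypothetical joining rule: separate tokens with a single space. */
    static String buildSentence(List<String> tokens) {
        return String.join(" ", tokens);
    }

    public static void main(String[] args) {
        String original = "tokens   become a  sentence";
        String rebuilt = buildSentence(tokenize(original));
        // prints: tokens become a sentence
        System.out.println(rebuilt);
    }
}
```

Here the rebuilt sentence has its whitespace normalized, so it is not identical to the input. Callers that need the exact original text should keep it rather than rely on buildSentence.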