public interface Tokenizer extends TextProcessor
The Tokenizer interface provides the ability to break down sentences into embeddable tokens.

Modifier and Type | Method and Description |
---|---|
java.lang.String | buildSentence(java.util.List<java.lang.String> tokens) Combines a list of tokens to form a sentence. |
default java.util.List<java.lang.String> | preprocess(java.util.List<java.lang.String> tokens) Applies the preprocessing defined to the given input tokens. |
java.util.List<java.lang.String> | tokenize(java.lang.String sentence) Breaks down the given sentence into a list of tokens that can be represented by embeddings. |

default java.util.List<java.lang.String> preprocess(java.util.List<java.lang.String> tokens)
Specified by: preprocess in interface TextProcessor
Parameters: tokens - the tokens created after the input text is tokenized

java.util.List<java.lang.String> tokenize(java.lang.String sentence)
Parameters: sentence - the sentence to tokenize
Returns: List of tokens

java.lang.String buildSentence(java.util.List<java.lang.String> tokens)
Parameters: tokens - the List of tokens
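
The three methods listed above can be illustrated with a minimal sketch. To keep it self-contained, it declares local stand-ins for TextProcessor and Tokenizer that only mirror the signatures documented here; they are not the library's own sources. The body of the default preprocess method (tokenize each input string and flatten the results) is an assumption, and WhitespaceTokenizer is a hypothetical implementation written only for this example.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.stream.Collectors;

// Local stand-in for TextProcessor, reduced to the single method referenced above.
interface TextProcessor {
    List<String> preprocess(List<String> tokens);
}

// Local stand-in mirroring the documented Tokenizer signatures.
interface Tokenizer extends TextProcessor {
    List<String> tokenize(String sentence);

    String buildSentence(List<String> tokens);

    // ASSUMPTION: the default tokenizes every input string and flattens the results.
    // The documentation above does not state what the real default does.
    @Override
    default List<String> preprocess(List<String> tokens) {
        return tokens.stream()
                .map(this::tokenize)
                .flatMap(List::stream)
                .collect(Collectors.toList());
    }
}

// Hypothetical implementation: splits on runs of whitespace, joins with a single space.
final class WhitespaceTokenizer implements Tokenizer {
    @Override
    public List<String> tokenize(String sentence) {
        String trimmed = sentence.trim();
        if (trimmed.isEmpty()) {
            return Collections.emptyList();
        }
        return Arrays.asList(trimmed.split("\\s+"));
    }

    @Override
    public String buildSentence(List<String> tokens) {
        return String.join(" ", tokens);
    }
}

final class TokenizerSketch {
    public static void main(String[] args) {
        Tokenizer tokenizer = new WhitespaceTokenizer();

        // Break a sentence into tokens, then rebuild a sentence from them.
        List<String> tokens = tokenizer.tokenize("The quick  brown fox");
        System.out.println(tokens);
        System.out.println(tokenizer.buildSentence(tokens));

        // The inherited default preprocess is available without overriding it.
        System.out.println(tokenizer.preprocess(Arrays.asList("hello world", "again")));
    }
}
```

Because preprocess is a default method, an implementation only has to supply tokenize and buildSentence. Note that buildSentence is not guaranteed to invert tokenize exactly: in this sketch, repeated whitespace in the input collapses to single spaces.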