Package ai.djl.modality.nlp.preprocess
Contains utility classes for natural language pre-processing tasks.
Interface Summary

Interface       Description
TextProcessor   TextProcessor allows applying pre-processing to input tokens for natural language applications.
Tokenizer       Tokenizer interface provides the ability to break down sentences into embeddable tokens.
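
The division of labor between the two interfaces can be sketched in plain Java: a tokenizer turns a sentence into tokens, and each text processor maps a token list to a new token list. The nested interfaces below (`SketchTokenizer`, `SketchTextProcessor`) and their method names are hypothetical stand-ins written for this illustration, not the DJL interfaces themselves; check the actual `Tokenizer` and `TextProcessor` signatures before relying on them.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

class InterfaceSketch {

    // Hypothetical stand-in: breaks a sentence into tokens.
    interface SketchTokenizer {
        List<String> tokenize(String sentence);
    }

    // Hypothetical stand-in: transforms a list of tokens into a new list.
    interface SketchTextProcessor {
        List<String> preprocess(List<String> tokens);
    }

    // Tokenizes once, then applies each processor in the given order.
    static List<String> run(String sentence,
                            SketchTokenizer tokenizer,
                            List<SketchTextProcessor> processors) {
        List<String> tokens = tokenizer.tokenize(sentence);
        for (SketchTextProcessor processor : processors) {
            tokens = processor.preprocess(tokens);
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Split on runs of whitespace.
        SketchTokenizer whitespace = s -> Arrays.asList(s.trim().split("\\s+"));

        // Lower-case every token, returning a new list.
        SketchTextProcessor lower = tokens -> {
            List<String> out = new ArrayList<>();
            for (String t : tokens) {
                out.add(t.toLowerCase(Locale.ROOT));
            }
            return out;
        };

        System.out.println(run("Hello DJL World", whitespace, List.of(lower)));
    }
}
```

Because both stand-ins have a single abstract method, they can be supplied as lambdas, which is roughly the idea behind a lambda-backed processor.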

Class Summary

Class                 Description
HyphenNormalizer      Unicode normalization does not take care of "exotic" hyphens that we normally do not want in NLP input.
LambdaProcessor       TextProcessor that applies a user-defined lambda function to input tokens.
LowerCaseConvertor    LowerCaseConvertor converts every character of the input tokens to its respective lower case character.
PunctuationSeparator  PunctuationSeparator separates punctuation into a separate token.
SimpleTokenizer       SimpleTokenizer is an implementation of the Tokenizer interface that converts sentences into tokens by splitting them by a given delimiter.
TextCleaner           Removes or replaces certain characters based on a condition.
TextTerminator        A TextProcessor that adds a beginning of string and end of string token.
TextTruncator         TextProcessor that truncates text to a maximum size.
UnicodeNormalizer     Applies Unicode normalization to input strings.
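
The behaviors described in the class summary can be approximated with JDK-only code, which may help when reasoning about the order in which steps are applied. This is a minimal sketch under stated assumptions, not the DJL implementation: the class `PreprocessSketch`, its method names, the `<bos>` and `<eos>` marker strings, the hyphen range U+2010 to U+2015, and the rule "every non-alphanumeric character is punctuation" are all choices made for illustration.

```java
import java.text.Normalizer;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.regex.Pattern;

class PreprocessSketch {

    // Splits a sentence on a literal delimiter (the SimpleTokenizer idea).
    static List<String> tokenize(String sentence, String delimiter) {
        return new ArrayList<>(Arrays.asList(sentence.split(Pattern.quote(delimiter))));
    }

    // Applies NFKC normalization, then maps a few non-ASCII hyphens and
    // dashes (U+2010 to U+2015, an assumed range) to the ASCII '-'.
    static List<String> normalize(List<String> tokens) {
        List<String> out = new ArrayList<>();
        for (String t : tokens) {
            String n = Normalizer.normalize(t, Normalizer.Form.NFKC);
            out.add(n.replaceAll("[\\u2010-\\u2015]", "-"));
        }
        return out;
    }

    // Lower-cases every token.
    static List<String> lowerCase(List<String> tokens) {
        List<String> out = new ArrayList<>();
        for (String t : tokens) {
            out.add(t.toLowerCase(Locale.ENGLISH));
        }
        return out;
    }

    // Emits each non-alphanumeric character as its own token
    // (a simplification of punctuation separation).
    static List<String> separatePunctuation(List<String> tokens) {
        List<String> out = new ArrayList<>();
        for (String t : tokens) {
            StringBuilder word = new StringBuilder();
            for (char c : t.toCharArray()) {
                if (Character.isLetterOrDigit(c)) {
                    word.append(c);
                } else {
                    if (word.length() > 0) {
                        out.add(word.toString());
                        word.setLength(0);
                    }
                    out.add(String.valueOf(c));
                }
            }
            if (word.length() > 0) {
                out.add(word.toString());
            }
        }
        return out;
    }

    // Keeps at most maxSize tokens.
    static List<String> truncate(List<String> tokens, int maxSize) {
        return new ArrayList<>(tokens.subList(0, Math.min(maxSize, tokens.size())));
    }

    // Wraps the tokens in assumed begin/end markers.
    static List<String> terminate(List<String> tokens) {
        List<String> out = new ArrayList<>();
        out.add("<bos>");
        out.addAll(tokens);
        out.add("<eos>");
        return out;
    }

    // Chains the steps; truncation runs before termination so the
    // markers are never cut off.
    static List<String> pipeline(String sentence, int maxSize) {
        List<String> tokens = tokenize(sentence, " ");
        tokens = normalize(tokens);
        tokens = lowerCase(tokens);
        tokens = separatePunctuation(tokens);
        tokens = truncate(tokens, maxSize);
        return terminate(tokens);
    }

    public static void main(String[] args) {
        // Prints [<bos>, hello, ,, world, !, how, <eos>]
        System.out.println(pipeline("Hello, World! How are you?", 5));
    }
}
```

The ordering is a design choice in this sketch: normalizing before punctuation separation means an en dash is first mapped to `-` and then split out as a single ASCII token, and truncating before adding the markers keeps the sequence length at `maxSize + 2`.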