Package ai.djl.modality.nlp.preprocess
Class UnicodeNormalizer
java.lang.Object
ai.djl.modality.nlp.preprocess.UnicodeNormalizer
- All Implemented Interfaces:
TextProcessor
Applies Unicode normalization to input strings. This is particularly important when dealing
with non-English input or with text originating from OCR applications.
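To make the motivation concrete, the following is a minimal sketch of what NFKC normalization does to typical OCR-style input. It uses only the JDK's `java.text.Normalizer` rather than this class, and the class name `NfkcDemo` is a hypothetical name chosen for illustration.

```java
import java.text.Normalizer;

final class NfkcDemo {
    /** Applies NFKC, the normal form the default constructor is documented to use. */
    static String nfkc(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFKC);
    }

    public static void main(String[] args) {
        // U+FB01 LATIN SMALL LIGATURE FI, common in OCR and PDF text
        System.out.println(nfkc("\uFB01le"));           // file
        // Fullwidth digits U+FF11 U+FF12 U+FF13
        System.out.println(nfkc("\uFF11\uFF12\uFF13")); // 123
        // "e" followed by U+0301 COMBINING ACUTE ACCENT composes to U+00E9
        System.out.println(nfkc("e\u0301").length());   // 1
    }
}
```

Without this step, a ligature or fullwidth character would produce a token that looks identical to its plain counterpart but does not compare equal to it.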
Field Summary
Fields
DEFAULT_FORM
Constructor Summary
Constructors
UnicodeNormalizer()
- Default version of the Unicode Normalizer using NFKC normal form.
UnicodeNormalizer(Normalizer.Form normalForm)
- Unicode normalizer with a configurable normal form.
Method Summary
static String normalizeDefault(String s)
- Normalizes a String using a sensible default normal form.
List<String> preprocess(List<String> tokens)
- Applies the preprocessing defined to the given input tokens.
Field Details
DEFAULT_FORM
Constructor Details
UnicodeNormalizer
public UnicodeNormalizer(Normalizer.Form normalForm)
Unicode normalizer with a configurable normal form.
- Parameters:
normalForm
- The normal form to use.
UnicodeNormalizer
public UnicodeNormalizer()
Default version of the Unicode Normalizer using NFKC normal form. If you do not know what normal form you need, this is the normal form you need.
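When choosing a value for the `normalForm` parameter, it can help to see how the four constants of `java.text.Normalizer.Form` differ on the same input. The sketch below uses only the JDK; `NormalFormComparison` and `allForms` are hypothetical names for illustration and are not part of this package.

```java
import java.text.Normalizer;
import java.util.EnumMap;
import java.util.Map;

final class NormalFormComparison {
    /** Normalizes the input with every available form and collects the results. */
    static Map<Normalizer.Form, String> allForms(String s) {
        Map<Normalizer.Form, String> out = new EnumMap<>(Normalizer.Form.class);
        for (Normalizer.Form form : Normalizer.Form.values()) {
            out.put(form, Normalizer.normalize(s, form));
        }
        return out;
    }

    public static void main(String[] args) {
        // U+00E9 (precomposed e-acute) followed by U+FB01 (fi ligature)
        Map<Normalizer.Form, String> forms = allForms("\u00E9\uFB01");
        forms.forEach((form, text) ->
                System.out.println(form + " -> " + text.length() + " chars"));
    }
}
```

For this input, NFC leaves both characters alone (2 chars), NFD splits only the accent (3), NFKC expands only the ligature (3), and NFKD does both (4). The compatibility forms (NFKC, NFKD) are the ones that fold ligatures and width variants, which is why NFKC is a reasonable default for text matching.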
Method Details
normalizeDefault
public static String normalizeDefault(String s)
Normalizes a String using a sensible default normal form. Use this if you do not want to think about Unicode preprocessing.
- Parameters:
s
- Any non-null string
- Returns:
- The given string with default Unicode normalization applied.
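Because the parameter must be non-null, callers that may receive null input need a guard of their own. The sketch below shows one possible guard, written against the JDK's `java.text.Normalizer` with NFKC; this page does not state how `normalizeDefault` is implemented, so treating it as equivalent is an assumption. `SafeNormalize` and `normalizeOrEmpty` are hypothetical names.

```java
import java.text.Normalizer;

final class SafeNormalize {
    /**
     * Hypothetical helper: maps null to the empty string, otherwise applies NFKC.
     * Normalizer.normalize itself throws NullPointerException on a null source.
     */
    static String normalizeOrEmpty(String s) {
        if (s == null) {
            return "";
        }
        return Normalizer.normalize(s, Normalizer.Form.NFKC);
    }

    public static void main(String[] args) {
        System.out.println(normalizeOrEmpty(null).isEmpty()); // true
        // U+2460 CIRCLED DIGIT ONE folds to the plain digit under NFKC
        System.out.println(normalizeOrEmpty("\u2460"));       // 1
    }
}
```

Normalization is idempotent, so applying the helper to text that is already normalized returns it unchanged.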
preprocess
public List<String> preprocess(List<String> tokens)
Applies the preprocessing defined to the given input tokens.
- Specified by:
preprocess in interface TextProcessor
- Parameters:
tokens
- the tokens created after the input text is tokenized
- Returns:
- the preprocessed tokens
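As a rough illustration of token-level preprocessing, the sketch below normalizes each token in a list independently with a chosen normal form. It is a JDK-only stand-in written under the assumption that `preprocess` behaves this way; `TokenNormalizationSketch` and `normalizeTokens` are hypothetical names, not part of this class.

```java
import java.text.Normalizer;
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

final class TokenNormalizationSketch {
    /** Returns a new list in which every token has been normalized with the given form. */
    static List<String> normalizeTokens(List<String> tokens, Normalizer.Form form) {
        return tokens.stream()
                .map(token -> Normalizer.normalize(token, form))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // fl ligature (U+FB02), precomposed e-acute (U+00E9), fullwidth "AI" (U+FF21 U+FF29)
        List<String> tokens = Arrays.asList("\uFB02ow", "caf\u00E9", "\uFF21\uFF29");
        System.out.println(normalizeTokens(tokens, Normalizer.Form.NFKC));
    }
}
```

With NFKC the three tokens become "flow", "café", and "AI". Note that a compatibility form can change a token's length, so any character offsets computed before normalization may no longer line up afterwards.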