Package ai.djl.modality.nlp.preprocess
Class UnicodeNormalizer
- java.lang.Object
  - ai.djl.modality.nlp.preprocess.UnicodeNormalizer
- All Implemented Interfaces: TextProcessor
public class UnicodeNormalizer extends java.lang.Object implements TextProcessor
Applies Unicode normalization to input strings. This is particularly important for non-English input or for text originating from OCR applications, where visually identical characters can be encoded as different code point sequences.
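As a rough illustration of what this kind of normalization does, the sketch below calls the JDK's java.text.Normalizer directly instead of this class, so it runs without DJL on the classpath. The class name NfkcSketch and the helper nfkc are hypothetical and exist only for this example.

```java
import java.text.Normalizer;

public class NfkcSketch {
    // Hypothetical helper, not part of DJL: applies NFKC using the JDK normalizer.
    static String nfkc(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFKC);
    }

    public static void main(String[] args) {
        // U+FB01 (LATIN SMALL LIGATURE FI), typical of OCR output, becomes the two letters "fi".
        System.out.println(nfkc("\uFB01le"));
        // Fullwidth U+FF21 U+FF22 are mapped to ASCII "AB".
        System.out.println(nfkc("\uFF21\uFF22"));
        // "e" followed by combining acute accent U+0301 is composed into the single character U+00E9.
        System.out.println(nfkc("e\u0301").length());
    }
}
```

Without a step like this, strings that look identical on screen may fail equality checks or map to different vocabulary entries.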
Field Summary
Fields (Modifier and Type, Field, Description):
static java.text.Normalizer.Form DEFAULT_FORM
Constructor Summary
Constructors (Constructor, Description):
UnicodeNormalizer()
Default version of the Unicode Normalizer using NFKC normal form.
UnicodeNormalizer(java.text.Normalizer.Form normalForm)
Unicode normalizer with a configurable normal form.
Method Summary
Methods (Modifier and Type, Method, Description):
static java.lang.String normalizeDefault(java.lang.String s)
Normalizes a String using a sensible default normal form.
java.util.List<java.lang.String> preprocess(java.util.List<java.lang.String> tokens)
Applies the preprocessing defined to the given input tokens.
Constructor Detail
UnicodeNormalizer
public UnicodeNormalizer(java.text.Normalizer.Form normalForm)
Unicode normalizer with a configurable normal form.
Parameters:
normalForm - The normal form to use.
UnicodeNormalizer
public UnicodeNormalizer()
Default version of the Unicode Normalizer using NFKC normal form. If you do not know what normal form you need, this is the normal form you need.
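To make the choice of normal form more concrete, the sketch below compares the four values of java.text.Normalizer.Form on one short input, using only the JDK. The class NormalFormSketch and the helper normalizedLength are hypothetical names introduced for this illustration.

```java
import java.text.Normalizer;
import java.text.Normalizer.Form;

public class NormalFormSketch {
    // Hypothetical helper: number of UTF-16 code units after normalizing with the given form.
    static int normalizedLength(String s, Form form) {
        return Normalizer.normalize(s, form).length();
    }

    public static void main(String[] args) {
        // U+00E9 (e with acute accent) followed by U+FB01 (fi ligature).
        String input = "\u00E9\uFB01";
        for (Form form : Form.values()) {
            System.out.println(form + " -> " + normalizedLength(input, form) + " code units");
        }
    }
}
```

For this input, NFC yields 2 code units, NFD and NFKC yield 3, and NFKD yields 4. The canonical forms (NFC, NFD) leave the ligature intact, while the compatibility forms (NFKC, NFKD) replace it with plain letters, which is usually what text models expect.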
Method Detail
normalizeDefault
public static java.lang.String normalizeDefault(java.lang.String s)
Normalizes a String using a sensible default normal form. Use this if you do not want to think about Unicode preprocessing.
Parameters:
s - Any non-null string
Returns:
The given string with default Unicode normalization applied.
preprocess
public java.util.List<java.lang.String> preprocess(java.util.List<java.lang.String> tokens)
Applies the preprocessing defined to the given input tokens.
Specified by:
preprocess in interface TextProcessor
Parameters:
tokens - the tokens created after the input text is tokenized
Returns:
the preprocessed tokens
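A minimal sketch of token-level normalization follows. It assumes that each token is normalized independently with the configured form, which this page does not state explicitly, and it uses only the JDK. TokenNormalizationSketch and normalizeTokens are hypothetical stand-ins, not DJL code.

```java
import java.text.Normalizer;
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

public class TokenNormalizationSketch {
    // Hypothetical stand-in, not DJL code: normalizes every token on its own
    // and returns a new list in the same order.
    static List<String> normalizeTokens(List<String> tokens, Normalizer.Form form) {
        return tokens.stream()
                .map(token -> Normalizer.normalize(token, form))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // A ligature token, a fullwidth-digit token, and a token that is already normalized.
        List<String> tokens = Arrays.asList("\uFB01sh", "\uFF11\uFF12", "plain");
        // prints [fish, 12, plain]
        System.out.println(normalizeTokens(tokens, Normalizer.Form.NFKC));
    }
}
```

With DJL on the classpath, the corresponding call would be preprocess(tokens) on a UnicodeNormalizer instance, as documented above.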