Package ai.djl.modality.nlp.preprocess
Class HyphenNormalizer
java.lang.Object
ai.djl.modality.nlp.preprocess.HyphenNormalizer
- All Implemented Interfaces:
TextProcessor
Unicode normalization does not handle the "exotic" hyphens that are normally unwanted in NLP
input. This preprocessor turns all hyphen-like characters into the "normal" ASCII hyphen-minus character (U+002D).
Invisible soft hyphens are dropped from the input.
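To illustrate the first sentence, the following sketch uses the JDK's `java.text.Normalizer` (not part of DJL) to show that NFKC normalization leaves an en dash untouched and maps a non-breaking hyphen only to U+2010 HYPHEN, not to ASCII U+002D. The class name `NfkcHyphenDemo` and its helper methods are made up for this example.

```java
import java.text.Normalizer;

// Hypothetical demo class; it only uses JDK APIs, not DJL.
final class NfkcHyphenDemo {

    /** Applies Unicode NFKC normalization using the JDK normalizer. */
    static String nfkc(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFKC);
    }

    /** Formats every code point of s as U+XXXX, separated by spaces. */
    static String describe(String s) {
        StringBuilder sb = new StringBuilder();
        s.codePoints().forEach(cp -> {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("U+%04X", cp));
        });
        return sb.toString();
    }

    public static void main(String[] args) {
        String enDash = "\u2013";
        String nonBreakingHyphen = "\u2011";
        System.out.println(describe(nfkc(enDash)));            // prints U+2013
        System.out.println(describe(nfkc(nonBreakingHyphen))); // prints U+2010
    }
}
```

Neither result is the ASCII hyphen-minus, which is the gap this preprocessor is meant to close.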
Constructor Summary
Constructors
HyphenNormalizer()
Method Summary
Modifier and Type | Method | Description
static boolean | isHyphenLike(Integer codePoint) | Returns whether the given code point is a hyphen-like code point.
static String | normalizeHyphens(String s) | Replaces hyphen-like code points with ASCII "-" and removes soft hyphens.
List<String> | preprocess(List<String> tokens) | Applies the preprocessing defined to the given input tokens.
Constructor Details
HyphenNormalizer
public HyphenNormalizer()
Method Details
isHyphenLike
Returns whether the given code point is a hyphen-like code point. Tests for hyphen-minus, tilde, soft hyphen, Armenian hyphen, Hebrew punctuation maqaf, Canadian syllabics hyphen, Mongolian hyphen, non-breaking hyphen, figure dash, en dash, em dash, horizontal bar, swung dash, superscript minus, subscript minus, minus sign, double oblique hyphen, two-em dash, three-em dash, wave dash, wavy dash, and katakana-hiragana double hyphen.
- Parameters:
codePoint - a Unicode code point (not a char)
- Returns:
true if the given code point represents a hyphen-like glyph
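The membership test described above can be sketched as a lookup in a fixed set of code points. This is a minimal illustration, not the DJL source: the class name `HyphenLikeSketch` is hypothetical, the sketch takes a primitive `int` rather than `Integer`, and the numeric values are my mapping of the character names listed above to Unicode code points (in particular, "Mongolian hyphen" is assumed to mean U+1806).

```java
import java.util.Set;

// Hypothetical sketch; the real HyphenNormalizer may be implemented differently.
final class HyphenLikeSketch {

    // Assumed code points for the character names in the documentation.
    private static final Set<Integer> HYPHEN_LIKE = Set.of(
            0x002D, // hyphen-minus
            0x007E, // tilde
            0x00AD, // soft hyphen
            0x058A, // Armenian hyphen
            0x05BE, // Hebrew punctuation maqaf
            0x1400, // Canadian syllabics hyphen
            0x1806, // Mongolian todo soft hyphen (assumed meaning of "Mongolian hyphen")
            0x2011, // non-breaking hyphen
            0x2012, // figure dash
            0x2013, // en dash
            0x2014, // em dash
            0x2015, // horizontal bar
            0x2053, // swung dash
            0x207B, // superscript minus
            0x208B, // subscript minus
            0x2212, // minus sign
            0x2E17, // double oblique hyphen
            0x2E3A, // two-em dash
            0x2E3B, // three-em dash
            0x301C, // wave dash
            0x3030, // wavy dash
            0x30A0  // katakana-hiragana double hyphen
    );

    /** Returns true if the code point is in the assumed hyphen-like set. */
    static boolean isHyphenLike(int codePoint) {
        return HYPHEN_LIKE.contains(codePoint);
    }
}
```

Because the argument is a code point and not a `char`, a caller would typically obtain it with `String.codePointAt` or `String.codePoints()`, for example `HyphenLikeSketch.isHyphenLike(text.codePointAt(0))`.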
normalizeHyphens
Replaces hyphen-like code points with ASCII "-" and removes soft hyphens.
- Parameters:
s - input string to replace hyphens in
- Returns:
the same string with soft hyphens dropped and hyphen-like code points replaced by an ASCII hyphen-minus
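The drop-or-replace behavior described above could look roughly like the following. It is a sketch under stated assumptions, not the library's implementation: `NormalizeHyphensSketch` and `isDashLike` are hypothetical names, and the predicate covers only U+2010 to U+2015 and U+2212 to keep the example short.

```java
// Hypothetical sketch of soft-hyphen removal and dash replacement.
final class NormalizeHyphensSketch {

    private static final int SOFT_HYPHEN = 0x00AD;

    /** Simplified stand-in predicate: only U+2010..U+2015 and U+2212 (minus sign). */
    static boolean isDashLike(int cp) {
        return (cp >= 0x2010 && cp <= 0x2015) || cp == 0x2212;
    }

    /** Drops soft hyphens and replaces dash-like code points with ASCII '-'. */
    static String normalizeHyphens(String s) {
        StringBuilder out = new StringBuilder(s.length());
        // Iterate over code points, not chars, so surrogate pairs stay intact.
        s.codePoints().forEach(cp -> {
            if (cp == SOFT_HYPHEN) {
                return; // invisible soft hyphen: skip it
            }
            out.appendCodePoint(isDashLike(cp) ? '-' : cp);
        });
        return out.toString();
    }
}
```

With this sketch, a string containing a soft hyphen inside a word comes back with the two halves joined, and an en dash between digits comes back as an ASCII hyphen-minus.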
preprocess
Applies the preprocessing defined to the given input tokens.
- Specified by:
preprocess in interface TextProcessor
- Parameters:
tokens - the tokens created after the input text is tokenized
- Returns:
the preprocessed tokens
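A token-level preprocessor of this kind can be pictured as applying one string transformation to every token. The sketch below assumes exactly that and nothing more: `TokenPreprocessSketch` is a hypothetical name, the transformation is passed in as a `UnaryOperator<String>`, and tokens that become empty are kept, since the documentation does not say how such tokens are handled.

```java
import java.util.List;
import java.util.function.UnaryOperator;
import java.util.stream.Collectors;

// Hypothetical sketch; it does not implement DJL's TextProcessor interface.
final class TokenPreprocessSketch {

    /** Applies the normalizer to each token, preserving order and token count. */
    static List<String> preprocess(List<String> tokens, UnaryOperator<String> normalizer) {
        return tokens.stream().map(normalizer).collect(Collectors.toList());
    }
}
```

For example, `TokenPreprocessSketch.preprocess(tokens, t -> t.replace('\u2013', '-'))` would replace en dashes in each token while leaving the list structure unchanged.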