Package ai.djl.modality.nlp.preprocess
Class HyphenNormalizer
java.lang.Object
    ai.djl.modality.nlp.preprocess.HyphenNormalizer
All Implemented Interfaces:
TextProcessor
public class HyphenNormalizer extends java.lang.Object implements TextProcessor
Unicode normalization does not handle the "exotic" hyphens that are usually unwanted in NLP input. This preprocessor replaces every hyphen-like character with the ordinary ASCII hyphen-minus (U+002D). Invisible soft hyphens are dropped from the input.
Constructor Summary
HyphenNormalizer()
Method Summary
static boolean isHyphenLike(java.lang.Integer codePoint)
- Returns whether the given code point is a hyphen-like code point.
static java.lang.String normalizeHyphens(java.lang.String s)
- Replaces hyphen-like code points with ASCII "-" and removes soft hyphens.
java.util.List<java.lang.String> preprocess(java.util.List<java.lang.String> tokens)
- Applies the preprocessing defined to the given input tokens.
Method Detail
-
isHyphenLike
public static boolean isHyphenLike(java.lang.Integer codePoint)
Returns whether the given code point is a hyphen-like code point. Tests for hyphen-minus, tilde, soft hyphen, Armenian hyphen, Hebrew punctuation maqaf, Canadian syllabics hyphen, Mongolian hyphen, non-breaking hyphen, figure dash, en dash, em dash, horizontal bar, swung dash, superscript minus, subscript minus, minus sign, double oblique hyphen, two-em dash, three-em dash, wave dash, wavy dash, and katakana-hiragana double hyphen.
Parameters:
codePoint - a Unicode code point (not a char)
Returns:
true if the given code point represents a hyphen-like glyph
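The distinction between a code point and a char matters when scanning strings. The following is a minimal standalone sketch, not DJL's implementation: the class HyphenLikeSketch, its helper countHyphenLike, and the reduced set of code points are assumptions made for illustration, and the sketch takes a primitive int where the documented method takes java.lang.Integer.

```java
import java.util.Set;

final class HyphenLikeSketch {
    // Illustrative subset only; the documented method tests more code points.
    private static final Set<Integer> HYPHEN_LIKE = Set.of(
            0x002D, // hyphen-minus
            0x007E, // tilde
            0x00AD, // soft hyphen
            0x2011, // non-breaking hyphen
            0x2012, // figure dash
            0x2013, // en dash
            0x2014, // em dash
            0x2212  // minus sign
    );

    // Hypothetical stand-in for the documented isHyphenLike.
    static boolean isHyphenLike(int codePoint) {
        return HYPHEN_LIKE.contains(codePoint);
    }

    // Iterates over full code points, so surrogate pairs are never split.
    static long countHyphenLike(String s) {
        return s.codePoints().filter(HyphenLikeSketch::isHyphenLike).count();
    }

    public static void main(String[] args) {
        System.out.println(isHyphenLike("\u2013".codePointAt(0))); // true
        System.out.println(countHyphenLike("a\u2014b-c"));         // 2
    }
}
```

Using String.codePoints() or codePointAt() rather than charAt() is what lets the check work for characters outside the Basic Multilingual Plane.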
-
normalizeHyphens
public static java.lang.String normalizeHyphens(java.lang.String s)
Replaces hyphen-like code points with ASCII "-" and removes soft hyphens.
Parameters:
s - input string in which to replace hyphens
Returns:
the input string with soft hyphens dropped and hyphen-like code points replaced by an ASCII hyphen-minus
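The described behaviour (drop soft hyphens, map every other hyphen-like code point to "-") can be sketched as below. This is an illustrative re-implementation under stated assumptions, not the library's code: NormalizeHyphensSketch is a hypothetical class, and its isHyphenLike covers only a handful of the documented code points.

```java
final class NormalizeHyphensSketch {
    private static final int SOFT_HYPHEN = 0x00AD;

    // Hypothetical stand-in for the hyphen test; covers only a few code points:
    // hyphen-minus, soft hyphen, U+2011 through U+2015, and the minus sign.
    static boolean isHyphenLike(int cp) {
        return cp == 0x002D || cp == SOFT_HYPHEN
                || (cp >= 0x2011 && cp <= 0x2015) || cp == 0x2212;
    }

    static String normalizeHyphens(String s) {
        StringBuilder out = new StringBuilder(s.length());
        s.codePoints().forEach(cp -> {
            if (cp == SOFT_HYPHEN) {
                return; // invisible soft hyphens are dropped entirely
            }
            if (isHyphenLike(cp)) {
                out.append('-'); // every other hyphen-like code point becomes ASCII "-"
            } else {
                out.appendCodePoint(cp); // everything else is copied unchanged
            }
        });
        return out.toString();
    }

    public static void main(String[] args) {
        // prints "state-of-theart"
        System.out.println(normalizeHyphens("state\u2013of\u2014the\u00ADart"));
    }
}
```

The soft hyphen is checked first because it is itself in the hyphen-like set; testing in the other order would turn it into a visible "-".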
-
preprocess
public java.util.List<java.lang.String> preprocess(java.util.List<java.lang.String> tokens)
Applies the preprocessing defined to the given input tokens.
Specified by:
preprocess in interface TextProcessor
Parameters:
tokens - the tokens created after the input text is tokenized
Returns:
the preprocessed tokens
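A token-level preprocessor of this kind can be pictured as a per-token map over the list. The sketch below assumes, without confirmation from this page, that each token is normalized independently and that the number of tokens is preserved. TokenPreprocessSketch and its two-argument preprocess are hypothetical, and the normalizer passed in is a simplified stand-in that handles only the en dash and the soft hyphen.

```java
import java.util.List;
import java.util.function.UnaryOperator;
import java.util.stream.Collectors;

final class TokenPreprocessSketch {
    // Applies the given normalizer to each token and returns a new list.
    static List<String> preprocess(List<String> tokens, UnaryOperator<String> normalizer) {
        return tokens.stream().map(normalizer).collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // Stand-in normalizer: replaces the en dash, removes the soft hyphen.
        UnaryOperator<String> normalizer =
                t -> t.replace("\u2013", "-").replace("\u00AD", "");
        // prints [2019-2020, cooperate, ok]
        System.out.println(
                preprocess(List.of("2019\u20132020", "co\u00ADoperate", "ok"), normalizer));
    }
}
```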