Class HyphenNormalizer

java.lang.Object
ai.djl.modality.nlp.preprocess.HyphenNormalizer
All Implemented Interfaces:
TextProcessor

public class HyphenNormalizer extends Object implements TextProcessor
Unicode normalization does not handle the "exotic" hyphens that are usually unwanted in NLP input. This preprocessor converts all hyphen-like characters to the "normal" ASCII hyphen-minus character (U+002D). Invisible soft hyphens are dropped from the input.
  • Constructor Details

    • HyphenNormalizer

      public HyphenNormalizer()
  • Method Details

    • isHyphenLike

      public static boolean isHyphenLike(Integer codePoint)
      Returns whether the given code point is a hyphen-like code point. Tests for hyphen-minus, tilde, soft hyphen, Armenian hyphen, Hebrew punctuation maqaf, Canadian syllabics hyphen, Mongolian hyphen, non-breaking hyphen, figure dash, en dash, em dash, horizontal bar, swung dash, superscript minus, subscript minus, minus sign, double oblique hyphen, two-em dash, three-em dash, wave dash, wavy dash, and katakana-hiragana double hyphen.
      Parameters:
      codePoint - a Unicode code point (not a char)
      Returns:
      true if the given code point represents a hyphen-like glyph
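      The distinction between a code point and a char can be illustrated with a minimal standalone sketch. It does not call DJL: the class HyphenLikeSketch, its methods, and the small code point set are hypothetical stand-ins that cover only a few of the characters named above.

```java
import java.util.Set;

public class HyphenLikeSketch {
    // Illustrative subset only; the real method tests for more code points.
    private static final Set<Integer> HYPHEN_LIKE = Set.of(
            0x002D, // hyphen-minus
            0x00AD, // soft hyphen
            0x2011, // non-breaking hyphen
            0x2013, // en dash
            0x2014, // em dash
            0x2212  // minus sign
    );

    // Hypothetical stand-in for a hyphen-like test on a single code point.
    public static boolean isHyphenLike(int codePoint) {
        return HYPHEN_LIKE.contains(codePoint);
    }

    // Iterates over code points, not chars, so surrogate pairs count once.
    public static long countHyphenLike(String s) {
        return s.codePoints().filter(HyphenLikeSketch::isHyphenLike).count();
    }

    public static void main(String[] args) {
        // The trailing emoji occupies two chars but is a single code point.
        String text = "state\u2013of\u2014the-art \uD83D\uDE00";
        System.out.println("hyphen-like: " + countHyphenLike(text));
        System.out.println("chars: " + text.length()
                + ", code points: " + text.codePointCount(0, text.length()));
    }
}
```

      Iterating with String.codePoints() rather than charAt keeps the test correct if the input contains characters outside the Basic Multilingual Plane.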
    • normalizeHyphens

      public static String normalizeHyphens(String s)
      Replaces hyphen-like code points with ASCII "-" and removes soft hyphens.
      Parameters:
      s - input string to replace hyphens in
      Returns:
      the input string with soft hyphens dropped and hyphen-like code points replaced by an ASCII hyphen-minus
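      The described behavior (drop soft hyphens, replace other hyphen-like code points with "-") can be sketched as follows. This is an assumption-laden illustration, not DJL's implementation: NormalizeHyphensSketch, normalize, and the reduced isHyphenLike helper are hypothetical names, and only a few of the documented code points are covered.

```java
public class NormalizeHyphensSketch {
    private static final int SOFT_HYPHEN = 0x00AD;

    // Hypothetical helper covering a small subset of hyphen-like code points.
    public static boolean isHyphenLike(int cp) {
        switch (cp) {
            case 0x002D: // hyphen-minus
            case 0x2011: // non-breaking hyphen
            case 0x2013: // en dash
            case 0x2014: // em dash
            case 0x2212: // minus sign
                return true;
            default:
                return false;
        }
    }

    public static String normalize(String s) {
        StringBuilder sb = new StringBuilder(s.length());
        s.codePoints().forEach(cp -> {
            if (cp == SOFT_HYPHEN) {
                return; // soft hyphens are dropped entirely
            }
            if (isHyphenLike(cp)) {
                sb.append('-'); // every other hyphen-like code point becomes U+002D
            } else {
                sb.appendCodePoint(cp); // everything else is copied unchanged
            }
        });
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(normalize("2010\u20132020"));    // en dash between years
        System.out.println(normalize("co\u00ADoperation")); // soft hyphen inside a word
    }
}
```

      The soft hyphen is checked first because it is removed rather than replaced, even though it also counts as hyphen-like.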
    • preprocess

      public List<String> preprocess(List<String> tokens)
      Applies the preprocessing defined to the given input tokens.
      Specified by:
      preprocess in interface TextProcessor
      Parameters:
      tokens - the tokens created after the input text is tokenized
      Returns:
      the preprocessed tokens
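      A token-level preprocessor of this kind can be sketched as a map over the token list. The sketch below assumes that each token is normalized independently and that the list length is preserved; the page does not state either point. PreprocessSketch and normalizeToken are hypothetical names, and the normalization handles only the soft hyphen, en dash, and em dash.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

public class PreprocessSketch {
    // Hypothetical per-token normalization covering three code points only.
    public static String normalizeToken(String token) {
        return token
                .replace("\u00AD", "")   // drop soft hyphens
                .replace('\u2013', '-')  // en dash -> hyphen-minus
                .replace('\u2014', '-'); // em dash -> hyphen-minus
    }

    // Applies the normalization to every token and returns a new list.
    public static List<String> preprocess(List<String> tokens) {
        return tokens.stream()
                .map(PreprocessSketch::normalizeToken)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> tokens = Arrays.asList("pre\u2013trained", "mo\u00ADdel", "plain");
        System.out.println(preprocess(tokens));
    }
}
```

      Returning a new list leaves the caller's token list unmodified, which makes it easier to chain several text processors.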