Class HyphenNormalizer

  • All Implemented Interfaces:
    TextProcessor

    public class HyphenNormalizer
    extends java.lang.Object
    implements TextProcessor
    Unicode normalization does not handle the "exotic" hyphens and dashes that are usually unwanted in NLP input. This preprocessor replaces every hyphen-like character with the ASCII hyphen-minus (U+002D) and drops invisible soft hyphens (U+00AD) from the input.
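The first sentence of the description can be checked against the JDK's own normalizer. The sketch below is not part of this library; the class name NfkcHyphenDemo and its helper methods are hypothetical. It only illustrates that NFKC, the most aggressive standard normalization form, leaves an en dash (U+2013) untouched.

```java
import java.text.Normalizer;

class NfkcHyphenDemo {
    // Applies NFKC compatibility normalization using the JDK's java.text.Normalizer.
    static String nfkc(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFKC);
    }

    // Returns true if the string still contains a non-ASCII char after NFKC.
    static boolean hasNonAsciiAfterNfkc(String s) {
        return nfkc(s).chars().anyMatch(c -> c > 0x7F);
    }
}
```

With this sketch, nfkc("a\u2013b") returns the input unchanged, which is why a separate hyphen normalization step is useful.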
    • Method Summary

      Modifier and Type Method Description
      static boolean isHyphenLike(java.lang.Integer codePoint)
      Returns whether the given code point is a hyphen-like code point.
      static java.lang.String normalizeHyphens(java.lang.String s)
      Replaces hyphen-like code points with ASCII "-" and removes soft hyphens.
      java.util.List<java.lang.String> preprocess(java.util.List<java.lang.String> tokens)
      Applies this processor's preprocessing to the given input tokens.
      • Methods inherited from class java.lang.Object

        clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
    • Constructor Detail

      • HyphenNormalizer

        public HyphenNormalizer()
    • Method Detail

      • isHyphenLike

        public static boolean isHyphenLike(java.lang.Integer codePoint)
        Returns whether the given code point is a hyphen-like code point. Tests for: hyphen-minus, tilde, soft hyphen, Armenian hyphen, Hebrew punctuation maqaf, Canadian syllabics hyphen, Mongolian hyphen, non-breaking hyphen, figure dash, en dash, em dash, horizontal bar, swung dash, superscript minus, subscript minus, minus sign, double oblique hyphen, two-em dash, three-em dash, wave dash, wavy dash, and Katakana-Hiragana double hyphen.
        Parameters:
        codePoint - a Unicode code point (not a char)
        Returns:
        true if the given code point represents a hyphen-like glyph
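A check of this kind can be sketched as a set lookup. The class HyphenLikeSketch below is hypothetical and is not the library's implementation: it takes a primitive int rather than java.lang.Integer, and its table is only an illustrative subset of the characters named above. It also shows why the parameter is a code point: the string is walked with String.codePoints() rather than char by char.

```java
import java.util.Set;

class HyphenLikeSketch {
    // Illustrative subset of the code points listed in the description;
    // the real class may use a different and larger table.
    private static final Set<Integer> HYPHEN_LIKE = Set.of(
        0x002D, // hyphen-minus
        0x007E, // tilde
        0x00AD, // soft hyphen
        0x058A, // Armenian hyphen
        0x05BE, // Hebrew punctuation maqaf
        0x2011, // non-breaking hyphen
        0x2012, // figure dash
        0x2013, // en dash
        0x2014, // em dash
        0x2015, // horizontal bar
        0x2212  // minus sign
    );

    // Returns true if the code point is in the illustrative table.
    static boolean isHyphenLike(int codePoint) {
        return HYPHEN_LIKE.contains(codePoint);
    }

    // Counts hyphen-like code points by iterating code points, not chars.
    static long countHyphenLike(String s) {
        return s.codePoints().filter(HyphenLikeSketch::isHyphenLike).count();
    }
}
```

Under these assumptions, countHyphenLike("a\u2013b-c") counts both the en dash and the ASCII hyphen-minus.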
      • normalizeHyphens

        public static java.lang.String normalizeHyphens(java.lang.String s)
        Replaces hyphen-like code points with ASCII "-" and removes soft hyphens.
        Parameters:
        s - input string to replace hyphens in
        Returns:
        the input string with soft hyphens dropped and all other hyphen-like code points replaced by the ASCII hyphen-minus (U+002D)
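The documented behavior (drop soft hyphens, map every other hyphen-like code point to "-") could be sketched as follows. NormalizeHyphensSketch is a hypothetical stand-in, not the library's code, and its isHyphenLike helper deliberately covers only a handful of code points.

```java
class NormalizeHyphensSketch {
    private static final int SOFT_HYPHEN = 0x00AD;

    // Hypothetical, reduced stand-in for the documented isHyphenLike check.
    static boolean isHyphenLike(int cp) {
        return cp == 0x002D                      // hyphen-minus
            || cp == SOFT_HYPHEN                 // soft hyphen
            || (cp >= 0x2011 && cp <= 0x2015)    // non-breaking hyphen .. horizontal bar
            || cp == 0x2212;                     // minus sign
    }

    // Drops soft hyphens and replaces other hyphen-like code points with '-'.
    static String normalizeHyphens(String s) {
        StringBuilder out = new StringBuilder(s.length());
        s.codePoints().forEach(cp -> {
            if (cp == SOFT_HYPHEN) {
                return;                          // invisible soft hyphen: drop it
            }
            if (isHyphenLike(cp)) {
                out.append('-');                 // any other hyphen-like: ASCII hyphen-minus
            } else {
                out.appendCodePoint(cp);         // everything else is copied unchanged
            }
        });
        return out.toString();
    }
}
```

In this sketch, normalizeHyphens("2010\u20132020") yields "2010-2020" and normalizeHyphens("co\u00ADoperate") yields "cooperate". The soft hyphen is tested before the general check because it is itself hyphen-like but must be removed rather than replaced.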
      • preprocess

        public java.util.List<java.lang.String> preprocess(java.util.List<java.lang.String> tokens)
        Applies this processor's preprocessing to the given input tokens.
        Specified by:
        preprocess in interface TextProcessor
        Parameters:
        tokens - the tokens created after the input text is tokenized
        Returns:
        the preprocessed tokens
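A token-level preprocess step might look like the sketch below. It assumes that each token is normalized independently and that tokens are kept even if they become empty; the page does not confirm either point. TextProcessorSketch is a local stand-in that mirrors only the documented method signature, and HyphenTokenSketch with its reduced hyphen table is likewise hypothetical.

```java
import java.util.List;
import java.util.stream.Collectors;

// Local stand-in mirroring the documented signature; the real TextProcessor may declare more.
interface TextProcessorSketch {
    List<String> preprocess(List<String> tokens);
}

class HyphenTokenSketch implements TextProcessorSketch {
    // Reduced, hypothetical normalization: drops U+00AD, maps a few dashes to '-'.
    static String normalizeHyphens(String s) {
        StringBuilder out = new StringBuilder(s.length());
        s.codePoints().forEach(cp -> {
            if (cp == 0x00AD) {
                return;                                   // drop soft hyphen
            }
            boolean dash = (cp >= 0x2011 && cp <= 0x2015) || cp == 0x2212;
            if (dash) {
                out.append('-');
            } else {
                out.appendCodePoint(cp);
            }
        });
        return out.toString();
    }

    // Assumption: every token is normalized on its own and list order is preserved.
    @Override
    public List<String> preprocess(List<String> tokens) {
        return tokens.stream()
                     .map(HyphenTokenSketch::normalizeHyphens)
                     .collect(Collectors.toList());
    }
}
```

For example, passing the tokens "e\u2011mail" and "soft\u00ADware" to this sketch returns the tokens "e-mail" and "software".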