Class UnicodeNormalizer

  • All Implemented Interfaces:
    TextProcessor

    public class UnicodeNormalizer
    extends java.lang.Object
    implements TextProcessor
    Applies Unicode normalization to input strings. This is particularly important when dealing with non-English input or with text produced by OCR applications, where visually identical text can be encoded with different code point sequences.
    • Field Summary

      Fields 
      Modifier and Type Field Description
      static java.text.Normalizer.Form DEFAULT_FORM  
    • Constructor Summary

      Constructors 
      Constructor Description
      UnicodeNormalizer()
      Default version of the Unicode normalizer, using the NFKC normal form.
      UnicodeNormalizer(java.text.Normalizer.Form normalForm)
      Unicode normalizer with a configurable normal form.
    • Method Summary

      Modifier and Type Method Description
      static java.lang.String normalizeDefault(java.lang.String s)
      Normalizes a String using a sensible default normal form.
      java.util.List<java.lang.String> preprocess(java.util.List<java.lang.String> tokens)
      Applies the configured Unicode normalization to each of the given input tokens.
      • Methods inherited from class java.lang.Object

        clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
    • Field Detail

      • DEFAULT_FORM

        public static final java.text.Normalizer.Form DEFAULT_FORM
    • Constructor Detail

      • UnicodeNormalizer

        public UnicodeNormalizer(java.text.Normalizer.Form normalForm)
        Unicode normalizer with a configurable normal form.
        Parameters:
        normalForm - The normal form to use.
      • UnicodeNormalizer

        public UnicodeNormalizer()
        Default version of the Unicode normalizer, using the NFKC normal form. If you are unsure which normal form you need, NFKC is a reasonable choice.
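
To illustrate what the choice of normal form changes, the following minimal sketch calls the JDK's `java.text.Normalizer` directly rather than `UnicodeNormalizer`, so that it runs without this library on the classpath. The class name `NormalFormComparison` and its helper `normalize` are hypothetical and exist only for this example. It shows that NFC composes accents but keeps compatibility characters such as ligatures, while NFKC also folds those into their plain equivalents.

```java
import java.text.Normalizer;

public class NormalFormComparison {
    // Hypothetical helper: normalizes the input with the given form using the JDK normalizer.
    public static String normalize(String s, Normalizer.Form form) {
        return Normalizer.normalize(s, form);
    }

    public static void main(String[] args) {
        String ligature = "\uFB01le";   // "file" written with the single-character "fi" ligature
        String decomposed = "e\u0301";  // 'e' followed by a combining acute accent

        // NFC leaves the ligature untouched, so the string keeps its three characters.
        System.out.println(normalize(ligature, Normalizer.Form.NFC).length());   // prints 3

        // NFKC replaces the ligature with the two letters 'f' and 'i'.
        System.out.println(normalize(ligature, Normalizer.Form.NFKC));           // prints "file"

        // Both forms compose 'e' + combining accent into one precomposed character.
        System.out.println(normalize(decomposed, Normalizer.Form.NFC).length()); // prints 1
        System.out.println(normalize(decomposed, Normalizer.Form.NFKC).length()); // prints 1
    }
}
```

For OCR output, where ligatures and other presentation forms are common, the compatibility folding performed by NFKC is usually the behavior you want.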
    • Method Detail

      • normalizeDefault

        public static java.lang.String normalizeDefault(java.lang.String s)
        Normalizes a String using a sensible default normal form. Use this when you need Unicode normalization but have no specific requirements for the normal form.
        Parameters:
        s - Any non-null string
        Returns:
        The given string with default unicode normalization applied.
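
As a rough sketch of what default normalization does to a string, the example below assumes that the default form is NFKC, as stated for the no-argument constructor. The class `DefaultNormalizationSketch` and its own `normalizeDefault` method are hypothetical stand-ins built on `java.text.Normalizer`. They are not the library's implementation, and the null check is one possible reading of the "non-null" requirement on `s`.

```java
import java.text.Normalizer;
import java.util.Objects;

public class DefaultNormalizationSketch {
    // Assumption: the default form is NFKC, matching the no-argument constructor's description.
    static final Normalizer.Form ASSUMED_DEFAULT_FORM = Normalizer.Form.NFKC;

    // Hypothetical stand-in for normalizeDefault; fails fast on null input.
    public static String normalizeDefault(String s) {
        Objects.requireNonNull(s, "s must not be null");
        return Normalizer.normalize(s, ASSUMED_DEFAULT_FORM);
    }

    public static void main(String[] args) {
        // Full-width letters and digits are folded to their ASCII counterparts.
        System.out.println(normalizeDefault("\uFF21\uFF22\uFF23\uFF11")); // prints "ABC1"

        // The superscript two becomes a plain digit.
        System.out.println(normalizeDefault("m\u00B2"));                   // prints "m2"
    }
}
```

Note that compatibility folding is lossy: after normalization, "m²" and "m2" can no longer be told apart.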
      • preprocess

        public java.util.List<java.lang.String> preprocess(java.util.List<java.lang.String> tokens)
        Applies the configured Unicode normalization to each of the given input tokens.
        Specified by:
        preprocess in interface TextProcessor
        Parameters:
        tokens - the tokens created after the input text is tokenized
        Returns:
        the preprocessed tokens
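
The token-wise behavior described above can be sketched as follows. This is an illustration under assumptions, not the actual implementation: the class `TokenNormalizationSketch` is hypothetical, it does not implement `TextProcessor` (so that the example compiles on its own), and it assumes each token is normalized independently with the configured normal form.

```java
import java.text.Normalizer;
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

public class TokenNormalizationSketch {
    private final Normalizer.Form normalForm;

    // Mirrors the configurable constructor: the caller picks the normal form.
    public TokenNormalizationSketch(Normalizer.Form normalForm) {
        this.normalForm = normalForm;
    }

    // Normalizes every token independently and returns a new list in the same order.
    public List<String> preprocess(List<String> tokens) {
        return tokens.stream()
                .map(token -> Normalizer.normalize(token, normalForm))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        TokenNormalizationSketch normalizer = new TokenNormalizationSketch(Normalizer.Form.NFKC);

        // "cafe" + combining acute accent, and "fish" written with the "fi" ligature.
        List<String> tokens = Arrays.asList("cafe\u0301", "\uFB01sh");
        List<String> result = normalizer.preprocess(tokens);

        // The accent is composed and the ligature is expanded.
        System.out.println(result.equals(Arrays.asList("caf\u00E9", "fish"))); // prints true
    }
}
```

Normalizing after tokenization keeps the token count unchanged, although the length of individual tokens may change.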