Class UnicodeNormalizer

java.lang.Object
ai.djl.modality.nlp.preprocess.UnicodeNormalizer
All Implemented Interfaces:
TextProcessor

public class UnicodeNormalizer extends Object implements TextProcessor
Applies Unicode normalization to input strings. This is particularly important when dealing with non-English input or with text originating from OCR applications, where visually identical characters may be encoded in different ways.
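
To illustrate why normalization matters, the following minimal sketch uses only the JDK's java.text.Normalizer rather than this class. The class name NormalizationDemo and the helper equalAfterNfc are hypothetical names chosen for the example. It shows two strings that render identically but compare as unequal until both are brought into the same normal form.

```java
import java.text.Normalizer;

// Hypothetical demo class; it is not part of DJL.
final class NormalizationDemo {

    // Compares two strings after bringing both into NFC.
    static boolean equalAfterNfc(String a, String b) {
        String left = Normalizer.normalize(a, Normalizer.Form.NFC);
        String right = Normalizer.normalize(b, Normalizer.Form.NFC);
        return left.equals(right);
    }

    public static void main(String[] args) {
        String composed = "caf\u00E9";    // accented e as a single code point
        String decomposed = "cafe\u0301"; // plain e followed by a combining acute accent

        // The raw strings differ in length and content.
        System.out.println(composed.equals(decomposed));
        // After normalization they compare as equal.
        System.out.println(equalAfterNfc(composed, decomposed));
    }
}
```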
  • Field Details

  • Constructor Details

    • UnicodeNormalizer

      public UnicodeNormalizer(Normalizer.Form normalForm)
      Creates a Unicode normalizer that uses the given normal form.
      Parameters:
      normalForm - The normal form to use.
    • UnicodeNormalizer

      public UnicodeNormalizer()
      Creates a Unicode normalizer that uses the NFKC normal form. If you do not know which normal form you need, use this constructor.
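
The difference between normal forms can be sketched with the JDK's java.text.Normalizer, which defines the Normalizer.Form type accepted by the constructor above. The class name NormalFormComparison and its normalize helper are hypothetical. The sample string is an assumption about what OCR-like input might contain: a ligature and a fullwidth letter. NFC leaves such compatibility characters alone, while NFKC folds them into their plain equivalents.

```java
import java.text.Normalizer;

// Hypothetical demo class; it only wraps the JDK normalizer.
final class NormalFormComparison {

    // Thin wrapper so the chosen form is explicit at each call site.
    static String normalize(String s, Normalizer.Form form) {
        return Normalizer.normalize(s, form);
    }

    public static void main(String[] args) {
        // U+FB01 is the "fi" ligature, U+FF21 is a fullwidth capital A.
        String ocrLike = "\uFB01le \uFF21";

        // NFC keeps compatibility characters as they are.
        System.out.println(normalize(ocrLike, Normalizer.Form.NFC).equals(ocrLike));
        // NFKC replaces them with their plain equivalents.
        System.out.println(normalize(ocrLike, Normalizer.Form.NFKC));
    }
}
```

NFKC is often convenient for text matching and tokenization because it removes such presentation-only distinctions. The trade-off is that the original characters cannot be recovered afterwards.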
  • Method Details

    • normalizeDefault

      public static String normalizeDefault(String s)
      Normalizes a string using a sensible default normal form. Use this if you do not need control over Unicode preprocessing.
      Parameters:
      s - Any non-null string
      Returns:
      The given string with default Unicode normalization applied.
    • preprocess

      public List<String> preprocess(List<String> tokens)
      Applies this processor's preprocessing to the given input tokens.
      Specified by:
      preprocess in interface TextProcessor
      Parameters:
      tokens - the tokens created after the input text is tokenized
      Returns:
      the preprocessed tokens
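
As a rough illustration of token-level normalization, the sketch below normalizes each token in a list with the JDK's java.text.Normalizer. It is an assumption that preprocess behaves in a comparable element-wise way; the class TokenNormalizationSketch and the method normalizeTokens are hypothetical and do not reproduce the library's implementation.

```java
import java.text.Normalizer;
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical sketch; it does not use or reproduce the DJL class.
final class TokenNormalizationSketch {

    // Returns a new list in which every token is normalized to the given form.
    static List<String> normalizeTokens(List<String> tokens, Normalizer.Form form) {
        return tokens.stream()
                .map(token -> Normalizer.normalize(token, form))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // First token starts with the "fi" ligature, second uses a combining accent.
        List<String> tokens = Arrays.asList("\uFB01sh", "cafe\u0301");
        List<String> normalized = normalizeTokens(tokens, Normalizer.Form.NFKC);
        System.out.println(normalized);
    }
}
```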