Class CaseCanonicalize


  • public final class CaseCanonicalize
    extends java.lang.Object
    Implements the ECMAScript 5 Canonicalize operation used to specify how case-insensitive regular expressions match.

    From section 15.10.2.8:

    The abstract operation Canonicalize takes a character parameter ch and performs the following steps:
    • If IgnoreCase is false, return ch.
    • Let u be ch converted to upper case as if by calling the standard built-in method String.prototype.toUpperCase on the one-character String ch.
    • If u does not consist of a single character, return ch.
    • Let cu be u's character.
    • If ch's code unit value is greater than or equal to decimal 128 and cu's code unit value is less than decimal 128, then return ch.
    • Return cu.
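The quoted steps can be sketched directly in Java. This is an illustration only, not this class's actual implementation: the class name `CanonicalizeSketch` and the explicit `ignoreCase` parameter are hypothetical, and Java's `String.toUpperCase(Locale.ROOT)` is assumed to stand in for `String.prototype.toUpperCase`, although the two can disagree for some characters because they may use different Unicode versions.

```java
import java.util.Locale;

/** Hypothetical sketch of the Canonicalize steps quoted above. */
final class CanonicalizeSketch {
  private CanonicalizeSketch() {}

  /** Canonicalizes one UTF-16 code unit, step by step. */
  static char canonicalize(char ch, boolean ignoreCase) {
    // Step 1: without the ignore-case flag nothing changes.
    if (!ignoreCase) {
      return ch;
    }
    // Step 2: upper-case the one-character string.
    // Assumption: Java's locale-independent upper-casing approximates the JavaScript method.
    String u = String.valueOf(ch).toUpperCase(Locale.ROOT);
    // Step 3: multi-character results (U+00DF becomes "SS") leave ch unchanged.
    if (u.length() != 1) {
      return ch;
    }
    // Step 4: take the single resulting character.
    char cu = u.charAt(0);
    // Step 5: a non-ASCII code unit never canonicalizes to an ASCII one.
    if (ch >= 128 && cu < 128) {
      return ch;
    }
    // Step 6: otherwise the upper-case form is the canonical form.
    return cu;
  }

  public static void main(String[] args) {
    System.out.println(canonicalize('a', true));   // prints A
    System.out.println(canonicalize('a', false));  // prints a
  }
}
```

Step 5 is what keeps a character such as U+017F (LATIN SMALL LETTER LONG S), whose upper-case form is the ASCII letter S, from matching `s` under a case-insensitive pattern.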
    • Field Summary

      Fields 
      Modifier and Type Field Description
      static com.google.javascript.jscomp.regex.CharRanges CASE_SENSITIVE
      Set of code units that are case-insensitively equivalent to some other code unit according to the ECMAScript Canonicalize operation described in section 15.10.2.8.
    • Method Summary

      Modifier and Type Method Description
      static char caseCanonicalize(char ch)
      Returns the case canonical version of the given code-unit.
      static java.lang.String caseCanonicalize(java.lang.String s)
      Returns the case canonical version of the given string.
      static com.google.javascript.jscomp.regex.CharRanges expandToAllMatched(com.google.javascript.jscomp.regex.CharRanges ranges)
      Given a character range that may include case sensitive code-units, such as [0-9B-M], returns the character range that includes all the code-units in the input and those that are case-insensitively equivalent to a code-unit in the input.
      static com.google.javascript.jscomp.regex.CharRanges reduceToMinimum(com.google.javascript.jscomp.regex.CharRanges ranges)
      Given a character range that may include case sensitive code-units, such as [0-9B-M], returns the character range that includes the minimal set of code units such that for every code unit in the input there is a case-insensitively equivalent canonical code unit in the output.
      • Methods inherited from class java.lang.Object

        clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
    • Field Detail

      • CASE_SENSITIVE

        public static final com.google.javascript.jscomp.regex.CharRanges CASE_SENSITIVE
        Set of code units that are case-insensitively equivalent to some other code unit according to the ECMAScript Canonicalize operation described in section 15.10.2.8. The case sensitive characters are the ones that canonicalize to a character other than themselves, or that some other character canonicalizes to. Canonicalize is based on the definition of String.prototype.toUpperCase, which is itself based on Unicode 3.0.0 as specified in UnicodeData-3.0.0 and SpecialCasing-2.txt.

        This table was generated by running the following in Chrome:

         for (var cc = 0; cc < 0x10000; ++cc) {
           var ch = String.fromCharCode(cc);
           var u = ch.toUpperCase();
           // Keep only single-character upper-casings that differ from the input.
           if (ch !== u && u.length === 1) {
             var cu = u.charCodeAt(0);
             // Skip pairs that would map a non-ASCII code unit into ASCII.
             if (cc < 128 || cu >= 128) {
               console.log('0x' + cc.toString(16) + ', 0x' + cu.toString(16) + ',');
             }
           }
         }
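The definition of the case sensitive set given above (a code unit belongs to it if it canonicalizes to something else, or if something else canonicalizes to it) can be sketched in Java as follows. The names `CaseSensitiveSetSketch`, `canonicalize`, and `caseSensitiveUnits` are hypothetical, and the result is held in a `java.util.BitSet` rather than a `CharRanges`. Because Java's case tables are newer than Unicode 3.0.0, the computed set is not guaranteed to equal the real CASE_SENSITIVE table.

```java
import java.util.BitSet;
import java.util.Locale;

/** Hypothetical sketch: derives the case sensitive code units from a canonicalize function. */
final class CaseSensitiveSetSketch {
  private CaseSensitiveSetSketch() {}

  /** Canonicalize with the ignore-case flag assumed to be set. */
  static char canonicalize(char ch) {
    // Assumption: Java upper-casing approximates String.prototype.toUpperCase.
    String u = String.valueOf(ch).toUpperCase(Locale.ROOT);
    if (u.length() != 1) {
      return ch;
    }
    char cu = u.charAt(0);
    return (ch >= 128 && cu < 128) ? ch : cu;
  }

  /** Marks every code unit that changes under canonicalize, plus the unit it changes to. */
  static BitSet caseSensitiveUnits() {
    BitSet result = new BitSet(0x10000);
    for (int cc = 0; cc < 0x10000; ++cc) {
      char cu = canonicalize((char) cc);
      if (cu != cc) {
        result.set(cc);  // canonicalizes to a different code unit
        result.set(cu);  // is the canonical form of a different code unit
      }
    }
    return result;
  }
}
```

With this sketch, `caseSensitiveUnits().get('a')` and `caseSensitiveUnits().get('A')` are both true, while `caseSensitiveUnits().get('0')` is false.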
         
    • Method Detail

      • caseCanonicalize

        public static java.lang.String caseCanonicalize(java.lang.String s)
        Returns the case canonical version of the given string.
      • caseCanonicalize

        public static char caseCanonicalize(char ch)
        Returns the case canonical version of the given code-unit. ECMAScript 5 explicitly says that code-units are to be treated as their code-point equivalent, even surrogates.
      • expandToAllMatched

        public static com.google.javascript.jscomp.regex.CharRanges expandToAllMatched(com.google.javascript.jscomp.regex.CharRanges ranges)
        Given a character range that may include case sensitive code-units, such as [0-9B-M], returns the character range that includes all the code-units in the input and those that are case-insensitively equivalent to a code-unit in the input.
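As a rough illustration of the expansion described above, the sketch below works on a plain `Set<Character>` instead of `CharRanges`. It treats two code units as case-insensitively equivalent when they share a canonical form. The names `ExpandSketch`, `canonicalize`, and `expand` are hypothetical, Java upper-casing is assumed to approximate the ECMAScript behavior, and the real method may be implemented quite differently (for example, from the precomputed table).

```java
import java.util.Locale;
import java.util.Set;
import java.util.TreeSet;

/** Hypothetical sketch of expanding a set to everything it matches case-insensitively. */
final class ExpandSketch {
  private ExpandSketch() {}

  /** Canonicalize with the ignore-case flag assumed to be set. */
  static char canonicalize(char ch) {
    String u = String.valueOf(ch).toUpperCase(Locale.ROOT);
    if (u.length() != 1) {
      return ch;
    }
    char cu = u.charAt(0);
    return (ch >= 128 && cu < 128) ? ch : cu;
  }

  /** Returns the input plus every code unit whose canonical form matches one in the input. */
  static Set<Character> expand(Set<Character> input) {
    // Collect the canonical forms present in the input.
    Set<Character> canonical = new TreeSet<>();
    for (char ch : input) {
      canonical.add(canonicalize(ch));
    }
    // Scan all UTF-16 code units and keep those sharing a canonical form.
    Set<Character> expanded = new TreeSet<>(input);
    for (int cc = 0; cc < 0x10000; ++cc) {
      if (canonical.contains(canonicalize((char) cc))) {
        expanded.add((char) cc);
      }
    }
    return expanded;
  }
}
```

For the input set containing `0` and `B`, this sketch returns `0`, `B`, and `b`: the digit has no case variants, and `b` is the only other code unit whose canonical form is `B`.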
      • reduceToMinimum

        public static com.google.javascript.jscomp.regex.CharRanges reduceToMinimum(com.google.javascript.jscomp.regex.CharRanges ranges)
        Given a character range that may include case sensitive code-units, such as [0-9B-M], returns the character range that includes the minimal set of code units such that for every code unit in the input there is a case-insensitively equivalent canonical code unit in the output.
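One way to picture the reduction is to replace every code unit by its canonical form and drop duplicates. The sketch below does that on a plain `Set<Character>`. The names `ReduceSketch`, `canonicalize`, and `reduce` are hypothetical, and it is an assumption that the real method always keeps the upper-case canonical form.

```java
import java.util.Locale;
import java.util.Set;
import java.util.TreeSet;

/** Hypothetical sketch of reducing a set to one canonical representative per case class. */
final class ReduceSketch {
  private ReduceSketch() {}

  /** Canonicalize with the ignore-case flag assumed to be set. */
  static char canonicalize(char ch) {
    String u = String.valueOf(ch).toUpperCase(Locale.ROOT);
    if (u.length() != 1) {
      return ch;
    }
    char cu = u.charAt(0);
    return (ch >= 128 && cu < 128) ? ch : cu;
  }

  /** Maps each code unit to its canonical form; the set removes duplicates. */
  static Set<Character> reduce(Set<Character> input) {
    Set<Character> reduced = new TreeSet<>();
    for (char ch : input) {
      reduced.add(canonicalize(ch));
    }
    return reduced;
  }
}
```

Given `b`, `B`, and `0`, this sketch returns only `0` and `B`, since `b` and `B` share the canonical form `B`.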