Class CaseCanonicalize
java.lang.Object
    com.google.javascript.jscomp.regex.CaseCanonicalize
public final class CaseCanonicalize extends java.lang.Object
Implements the ECMAScript 5 Canonicalize operation used to specify how case-insensitive regular expressions match. From section 15.10.2.8:
The abstract operation Canonicalize takes a character parameter ch and performs the following steps:
- If IgnoreCase is false, return ch.
- Let u be ch converted to upper case as if by calling the standard built-in method String.prototype.toUpperCase on the one-character String ch.
- If u does not consist of a single character, return ch.
- Let cu be u's character.
- If ch's code unit value is greater than or equal to decimal 128 and cu's code unit value is less than decimal 128, then return ch.
- Return cu.
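The steps above can be sketched in stand-alone Java as follows. This is only an illustration, not the Closure Compiler implementation: the class name `CanonicalizeSketch` and the `ignoreCase` parameter are hypothetical, and Java's `String.toUpperCase` follows the Unicode version of the running JVM rather than the Unicode 3.0.0 data the specification refers to, so a few code units may canonicalize differently from the table in this class.

```java
import java.util.Locale;

/** Hypothetical stand-alone sketch of the ES5 Canonicalize steps. */
final class CanonicalizeSketch {

  /** Applies the quoted steps to one UTF-16 code unit. */
  static char canonicalize(char ch, boolean ignoreCase) {
    // Step 1: without IgnoreCase, every character is its own canonical form.
    if (!ignoreCase) {
      return ch;
    }
    // Step 2: upper-case the one-character string (assumption: the JVM's
    // Unicode tables stand in for String.prototype.toUpperCase).
    String u = String.valueOf(ch).toUpperCase(Locale.ROOT);
    // Step 3: multi-character results (such as "SS") leave ch unchanged.
    if (u.length() != 1) {
      return ch;
    }
    // Step 4: take the single resulting code unit.
    char cu = u.charAt(0);
    // Step 5: a non-ASCII code unit never canonicalizes to an ASCII one.
    if (ch >= 128 && cu < 128) {
      return ch;
    }
    // Step 6: otherwise the upper-case form is the canonical form.
    return cu;
  }

  public static void main(String[] args) {
    System.out.println(canonicalize('a', true));
    System.out.println(canonicalize('a', false));
  }
}
```

Step 5 is what keeps a pattern such as /s/i from matching U+017F (LATIN SMALL LETTER LONG S), even though that character upper-cases to "S".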
Field Summary
Fields
static com.google.javascript.jscomp.regex.CharRanges CASE_SENSITIVE
    Set of code units that are case-insensitively equivalent to some other code unit according to the EcmaScript Canonicalize operation described in section 15.10.2.8.
Method Summary
static char caseCanonicalize(char ch)
    Returns the case canonical version of the given code-unit.
static java.lang.String caseCanonicalize(java.lang.String s)
    Returns the case canonical version of the given string.
static com.google.javascript.jscomp.regex.CharRanges expandToAllMatched(com.google.javascript.jscomp.regex.CharRanges ranges)
    Given a character range that may include case sensitive code-units, such as [0-9B-M], returns the character range that includes all the code-units in the input and those that are case-insensitively equivalent to a code-unit in the input.
static com.google.javascript.jscomp.regex.CharRanges reduceToMinimum(com.google.javascript.jscomp.regex.CharRanges ranges)
    Given a character range that may include case sensitive code-units, such as [0-9B-M], returns the character range that includes the minimal set of code units such that for every code unit in the input there is a case-insensitively equivalent canonical code unit in the output.
Field Detail
CASE_SENSITIVE
public static final com.google.javascript.jscomp.regex.CharRanges CASE_SENSITIVE
Set of code units that are case-insensitively equivalent to some other code unit according to the EcmaScript Canonicalize operation described in section 15.10.2.8. The case sensitive characters are the ones that canonicalize to a character other than themselves or have a character that canonicalizes to them. Canonicalize is based on the definition of String.prototype.toUpperCase, which is itself based on Unicode 3.0.0 as specified at UnicodeData-3.0.0 and SpecialCasings-2.txt. This table was generated by running the below on Chrome:
  for (var cc = 0; cc < 0x10000; ++cc) {
    var ch = String.fromCharCode(cc);
    var u = ch.toUpperCase();
    if (ch != u && u.length === 1) {
      var cu = u.charCodeAt(0);
      if (cc <= 128 || u.charCodeAt(0) > 128) {
        print('0x' + cc.toString(16) + ', 0x' + cu.toString(16) + ',');
      }
    }
  }
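The definition of the case sensitive set given above (code units that canonicalize to something else, plus the code units they canonicalize to) can be sketched in Java as follows. This is an illustration under stated assumptions: `CaseSensitiveSetSketch` is a hypothetical class, a `java.util.BitSet` stands in for `CharRanges`, and Java's `toUpperCase` replaces the Unicode 3.0.0 tables, so the result may differ from the actual `CASE_SENSITIVE` constant for some code units.

```java
import java.util.BitSet;
import java.util.Locale;

/** Hypothetical sketch that derives a case-sensitive code unit set. */
final class CaseSensitiveSetSketch {

  /** ES5 Canonicalize with IgnoreCase assumed true (JVM Unicode tables). */
  static char canonicalize(char ch) {
    String u = String.valueOf(ch).toUpperCase(Locale.ROOT);
    if (u.length() != 1) {
      return ch;
    }
    char cu = u.charAt(0);
    return (ch >= 128 && cu < 128) ? ch : cu;
  }

  /** Collects every code unit that takes part in a case mapping. */
  static BitSet caseSensitiveCodeUnits() {
    BitSet set = new BitSet(0x10000);
    for (int cc = 0; cc < 0x10000; cc++) {
      char cu = canonicalize((char) cc);
      if (cu != cc) {
        set.set(cc); // canonicalizes to a character other than itself
        set.set(cu); // has a character that canonicalizes to it
      }
    }
    return set;
  }

  public static void main(String[] args) {
    BitSet set = caseSensitiveCodeUnits();
    System.out.println("case-sensitive code units: " + set.cardinality());
  }
}
```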
Method Detail
caseCanonicalize
public static java.lang.String caseCanonicalize(java.lang.String s)
Returns the case canonical version of the given string.
caseCanonicalize
public static char caseCanonicalize(char ch)
Returns the case canonical version of the given code-unit. ECMAScript 5 explicitly says that code-units are to be treated as their code-point equivalent, even surrogates.
expandToAllMatched
public static com.google.javascript.jscomp.regex.CharRanges expandToAllMatched(com.google.javascript.jscomp.regex.CharRanges ranges)
Given a character range that may include case sensitive code-units, such as [0-9B-M], returns the character range that includes all the code-units in the input and those that are case-insensitively equivalent to a code-unit in the input.
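A minimal sketch of this expansion, assuming a `java.util.BitSet` of UTF-16 code units in place of `CharRanges` (whose API is not shown here) and Java's `toUpperCase` in place of the Unicode 3.0.0 tables. The class `ExpandSketch` and its helpers are hypothetical names used only for illustration.

```java
import java.util.BitSet;
import java.util.Locale;

/** Hypothetical sketch of expanding a set to all case-insensitive matches. */
final class ExpandSketch {

  /** ES5 Canonicalize with IgnoreCase assumed true (JVM Unicode tables). */
  static char canonicalize(char ch) {
    String u = String.valueOf(ch).toUpperCase(Locale.ROOT);
    if (u.length() != 1) {
      return ch;
    }
    char cu = u.charAt(0);
    return (ch >= 128 && cu < 128) ? ch : cu;
  }

  /** Returns the input plus every code unit sharing a canonical form with it. */
  static BitSet expandToAllMatched(BitSet input) {
    // Canonical forms of everything in the input.
    BitSet canonical = new BitSet(0x10000);
    for (int i = input.nextSetBit(0); i >= 0; i = input.nextSetBit(i + 1)) {
      canonical.set(canonicalize((char) i));
    }
    // Any code unit whose canonical form is in that set is matched.
    BitSet out = (BitSet) input.clone();
    for (int cc = 0; cc < 0x10000; cc++) {
      if (canonical.get(canonicalize((char) cc))) {
        out.set(cc);
      }
    }
    return out;
  }

  /** Builds the example set [0-9B-M]. */
  static BitSet example() {
    BitSet set = new BitSet(0x10000);
    set.set('0', '9' + 1);
    set.set('B', 'M' + 1);
    return set;
  }

  public static void main(String[] args) {
    System.out.println(expandToAllMatched(example()));
  }
}
```

Under these assumptions, expanding [0-9B-M] adds the lower-case letters b through m and leaves the digits unchanged.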
reduceToMinimum
public static com.google.javascript.jscomp.regex.CharRanges reduceToMinimum(com.google.javascript.jscomp.regex.CharRanges ranges)
Given a character range that may include case sensitive code-units, such as [0-9B-M], returns the character range that includes the minimal set of code units such that for every code unit in the input there is a case-insensitively equivalent canonical code unit in the output.
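One way to read this reduction is as mapping every input code unit to its canonical form and keeping only those forms. The sketch below takes that reading as an assumption; it is not the actual implementation. `ReduceSketch` is a hypothetical class, a `java.util.BitSet` stands in for `CharRanges`, and Java's `toUpperCase` replaces the Unicode 3.0.0 tables.

```java
import java.util.BitSet;
import java.util.Locale;

/** Hypothetical sketch of reducing a set to canonical code units only. */
final class ReduceSketch {

  /** ES5 Canonicalize with IgnoreCase assumed true (JVM Unicode tables). */
  static char canonicalize(char ch) {
    String u = String.valueOf(ch).toUpperCase(Locale.ROOT);
    if (u.length() != 1) {
      return ch;
    }
    char cu = u.charAt(0);
    return (ch >= 128 && cu < 128) ? ch : cu;
  }

  /** Keeps one canonical representative per case-insensitive class. */
  static BitSet reduceToMinimum(BitSet input) {
    BitSet out = new BitSet(0x10000);
    for (int i = input.nextSetBit(0); i >= 0; i = input.nextSetBit(i + 1)) {
      out.set(canonicalize((char) i));
    }
    return out;
  }

  /** Builds the set of code units from lo to hi, inclusive. */
  static BitSet range(char lo, char hi) {
    BitSet set = new BitSet(0x10000);
    set.set(lo, hi + 1);
    return set;
  }

  public static void main(String[] args) {
    BitSet input = range('0', '9');
    input.or(range('b', 'm'));
    input.or(range('B', 'M'));
    System.out.println(reduceToMinimum(input));
  }
}
```

Under this reading, an input that mixes b-m and B-M collapses to B-M alone, since both cases share the same canonical code units.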