public class SpoofChecker extends Object
This class is intended to check strings, typically identifiers of some type, such as URLs, for the presence of characters that are likely to be visually confusing - for cases where the displayed form of an identifier may not be what it appears to be.
Unicode Technical Report #36, "Unicode Security Considerations" (http://unicode.org/reports/tr36), and Unicode Technical Standard #39, "Unicode Security Mechanisms" (http://unicode.org/reports/tr39), give more background on security and spoofing issues with Unicode identifiers. The tests and checks provided by this module implement the recommendations from these Unicode documents.
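The tests available on identifiers fall into two general categories: tests on a single identifier, and tests of whether two specific identifiers are confusable with each other.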
The steps to perform confusability testing are:
1. Create a SpoofChecker.Builder.
2. Configure the Builder for the desired set of tests.
3. Build a SpoofChecker from the Builder.
4. Perform the checks using the SpoofChecker. The results indicate which (if any) of the selected tests have identified possible problems with the identifier. Results are reported as a set of SpoofCheck flags; this mirrors the form in which the set of tests to perform was originally specified to the SpoofChecker.
A SpoofChecker instance may be used repeatedly to perform checks on any number of identifiers.
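Because both the requested tests and the reported results are sets of flags, they can be combined and inspected with ordinary bitwise operations. The following is a minimal plain-Java sketch of that idea, not ICU code: the class name SpoofFlagsSketch, the helpers has and describe, and the numeric values are assumptions made for illustration, and real code would refer to the SpoofChecker constants of the same names.

```java
import java.util.ArrayList;
import java.util.List;

final class SpoofFlagsSketch {
    // Placeholder bit values for this sketch only (assumption); real code
    // should reference the SpoofChecker constants rather than literals.
    static final int SINGLE_SCRIPT_CONFUSABLE = 1;
    static final int MIXED_SCRIPT_CONFUSABLE = 2;
    static final int WHOLE_SCRIPT_CONFUSABLE = 4;

    /** Returns true if the given flag bit is present in the flag set. */
    static boolean has(int flags, int flag) {
        return (flags & flag) != 0;
    }

    /** Lists the names of the sketch's flags that are present in a flag set. */
    static List<String> describe(int flags) {
        List<String> names = new ArrayList<>();
        if (has(flags, SINGLE_SCRIPT_CONFUSABLE)) names.add("SINGLE_SCRIPT_CONFUSABLE");
        if (has(flags, MIXED_SCRIPT_CONFUSABLE)) names.add("MIXED_SCRIPT_CONFUSABLE");
        if (has(flags, WHOLE_SCRIPT_CONFUSABLE)) names.add("WHOLE_SCRIPT_CONFUSABLE");
        return names;
    }

    public static void main(String[] args) {
        // Tests are requested by OR-ing flags together...
        int requested = SINGLE_SCRIPT_CONFUSABLE
                | MIXED_SCRIPT_CONFUSABLE
                | WHOLE_SCRIPT_CONFUSABLE;
        // ...and a result comes back in the same form (made-up value here).
        int reported = MIXED_SCRIPT_CONFUSABLE;
        System.out.println("requested: " + describe(requested));
        System.out.println("reported:  " + describe(reported));
    }
}
```

A result of zero means that none of the requested tests reported a problem.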
Thread Safety: The methods on SpoofChecker objects are thread safe. The test functions for checking a single identifier, or for testing whether two identifiers are potentially confusable, may be called concurrently from multiple threads using the same SpoofChecker instance.
Descriptions of the available checks.
When testing whether pairs of identifiers are confusable, with areConfusable(), the relevant tests are:
- SINGLE_SCRIPT_CONFUSABLE: All of the characters from the two identifiers are from a single script, and the two identifiers are visually confusable.
- MIXED_SCRIPT_CONFUSABLE: At least one of the identifiers contains characters from more than one script, and the two identifiers are visually confusable.
- WHOLE_SCRIPT_CONFUSABLE: Each of the two identifiers is of a single script, but the two identifiers are from different scripts, and they are visually confusable.
The safest approach is to enable all three of these checks as a group.
ANY_CASE is a modifier for the above tests. If the identifiers being checked can be of mixed case and are used in a case-sensitive manner, this option should be specified.
If the identifiers being checked are used in a case-insensitive manner, and if they are displayed to users in lower-case form only, the ANY_CASE option should not be specified. In that case, confusability issues involving upper case letters will not be reported.
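One way to picture a two-identifier confusability test is the skeleton idea from UTS #39, which this class exposes through getSkeleton(): each identifier is reduced to a canonical form, and two identifiers are treated as confusable when those forms are equal. The sketch below illustrates the idea with plain JDK classes. The class name SkeletonSketch, the helper names, and the five-entry mapping table are assumptions for illustration only; the real test relies on the full Unicode confusables data, and this sketch does not model the ANY_CASE modifier or the script-based distinctions between the three tests.

```java
import java.text.Normalizer;
import java.util.Map;

final class SkeletonSketch {
    // Tiny illustrative mapping from a code point to a look-alike prototype.
    // This is NOT the Unicode confusables table, only a hand-picked sample.
    private static final Map<Integer, String> PROTOTYPES = Map.of(
            0x0430, "a",   // CYRILLIC SMALL LETTER A
            0x0435, "e",   // CYRILLIC SMALL LETTER IE
            0x043E, "o",   // CYRILLIC SMALL LETTER O
            0x0440, "p",   // CYRILLIC SMALL LETTER ER
            0x0441, "c");  // CYRILLIC SMALL LETTER ES

    /** Decompose, replace each code point by its prototype, decompose again. */
    static String skeleton(String id) {
        String nfd = Normalizer.normalize(id, Normalizer.Form.NFD);
        StringBuilder sb = new StringBuilder();
        nfd.codePoints().forEach(cp -> {
            String proto = PROTOTYPES.get(cp);
            if (proto != null) {
                sb.append(proto);
            } else {
                sb.appendCodePoint(cp);
            }
        });
        return Normalizer.normalize(sb.toString(), Normalizer.Form.NFD);
    }

    /** Two identifiers look confusable here when their skeletons match. */
    static boolean looksConfusable(String a, String b) {
        return skeleton(a).equals(skeleton(b));
    }

    public static void main(String[] args) {
        // The first string uses Cyrillic letters in place of Latin "a".
        String spoofed = "p\u0430yp\u0430l";
        System.out.println(looksConfusable(spoofed, "paypal"));
    }
}
```

Comparing skeletons rather than the raw strings is what lets two identifiers with different code points be recognized as visually equivalent.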
When performing tests on a single identifier, with the check() family of functions, the relevant tests are:
- MIXED_SCRIPT_CONFUSABLE: the identifier contains characters from multiple scripts, and there exists an identifier of a single script that is visually confusable.
- WHOLE_SCRIPT_CONFUSABLE: the identifier consists of characters from a single script, and there exists a visually confusable identifier. The visually confusable identifier also consists of characters from a single script, but not the same script as the identifier being checked.
- ANY_CASE: modifies the mixed script and whole script confusables tests. If specified, the checks will find confusable characters of any case. If this flag is not set, the test is performed assuming case folded identifiers.
- SINGLE_SCRIPT: check that the identifier contains only characters from a single script. (Characters from the common and inherited scripts are ignored.) This is not a test for confusable identifiers.
- INVISIBLE: check an identifier for the presence of invisible characters, such as zero-width spaces, or character sequences that are likely not to display, such as multiple occurrences of the same non-spacing mark. This check does not test the input string as a whole for conformance to any particular syntax for identifiers.
- CHAR_LIMIT: check that an identifier contains only characters from a specified set of acceptable characters. See Builder.setAllowedChars() and Builder.setAllowedLocales().
Note on Scripts:
Characters from the Unicode Scripts "Common" and "Inherited" are ignored when considering the script of an identifier. Common characters include digits and symbols that are normally used with text from many different scripts.
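The rule about ignoring Common and Inherited characters can be illustrated with the JDK's own script data. The sketch below is not how SpoofChecker is implemented; it only shows the rule, using java.lang.Character.UnicodeScript. The class name ScriptSetSketch and its two helper methods are assumptions made for this example.

```java
import java.lang.Character.UnicodeScript;
import java.util.EnumSet;
import java.util.Set;

final class ScriptSetSketch {
    /** Collects the scripts used by a string, skipping COMMON and INHERITED. */
    static Set<UnicodeScript> scriptsOf(String id) {
        Set<UnicodeScript> scripts = EnumSet.noneOf(UnicodeScript.class);
        id.codePoints().forEach(cp -> {
            UnicodeScript script = UnicodeScript.of(cp);
            if (script != UnicodeScript.COMMON && script != UnicodeScript.INHERITED) {
                scripts.add(script);
            }
        });
        return scripts;
    }

    /** True when at most one script remains after ignoring COMMON and INHERITED. */
    static boolean isSingleScript(String id) {
        return scriptsOf(id).size() <= 1;
    }

    public static void main(String[] args) {
        // Digits are COMMON, so they do not add a second script.
        System.out.println(scriptsOf("abc123"));
        // A Cyrillic letter among Latin letters yields two scripts.
        System.out.println(scriptsOf("p\u0430ypal"));
    }
}
```

Under this rule an identifier made only of digits and punctuation has no script at all, which the sketch treats as trivially single-script.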
Modifier and Type | Class and Description |
---|---|
static class | SpoofChecker.Builder: SpoofChecker Builder. |
static class | SpoofChecker.CheckResult: A struct-like class to hold the results of a Spoof Check operation. |
static class | SpoofChecker.RestrictionLevel: Constants from UAX 31 for use in setRestrictionLevel. |
Modifier and Type | Field and Description |
---|---|
static int | ALL_CHECKS: Enable all spoof checks. |
static int | ANY_CASE: Any Case Modifier for confusable identifier tests. |
static int | CHAR_LIMIT: Check that an identifier contains only characters from a specified set of acceptable characters. |
static UnicodeSet | INCLUSION: Security Profile constant from UAX 31 for use in setAllowedChars. |
static int | INVISIBLE: Check an identifier for the presence of invisible characters, such as zero-width spaces, or character sequences that are likely not to display, such as multiple occurrences of the same non-spacing mark. |
static int | MIXED_NUMBERS: Check that an identifier does not mix numbers. |
static int | MIXED_SCRIPT_CONFUSABLE: Mixed script confusable test. |
static UnicodeSet | RECOMMENDED: Security Profile constant from UAX 31 for use in setAllowedChars. |
static int | RESTRICTION_LEVEL: Check that an identifier is no looser than the specified RestrictionLevel. |
static int | SINGLE_SCRIPT: Deprecated. Use RESTRICTION_LEVEL. |
static int | SINGLE_SCRIPT_CONFUSABLE: Single script confusable test. |
static int | WHOLE_SCRIPT_CONFUSABLE: Whole script confusable test. |
Modifier and Type | Method and Description |
---|---|
int | areConfusable(String s1, String s2): Check whether two specified strings are visually confusable. |
boolean | failsChecks(String text): Check the specified string for possible security issues. |
boolean | failsChecks(String text, SpoofChecker.CheckResult checkResult): Check the specified string for possible security issues. |
UnicodeSet | getAllowedChars(): Get a UnicodeSet for the characters permitted in an identifier. |
Set<ULocale> | getAllowedLocales(): Get a list of locales for the scripts that are acceptable in strings to be checked. |
int | getChecks(): Get the set of checks that this Spoof Checker has been configured to perform. |
SpoofChecker.RestrictionLevel | getRestrictionLevel(): Get the Restriction Level that is being tested. |
String | getSkeleton(int type, String id): Get the "skeleton" for an identifier string. |
public static final UnicodeSet INCLUSION
public static final UnicodeSet RECOMMENDED
public static final int SINGLE_SCRIPT_CONFUSABLE
public static final int MIXED_SCRIPT_CONFUSABLE
public static final int WHOLE_SCRIPT_CONFUSABLE
public static final int ANY_CASE
public static final int RESTRICTION_LEVEL
public static final int SINGLE_SCRIPT
public static final int INVISIBLE
public static final int CHAR_LIMIT
public static final int MIXED_NUMBERS
public static final int ALL_CHECKS
public SpoofChecker.RestrictionLevel getRestrictionLevel()
public int getChecks()
public Set<ULocale> getAllowedLocales()
public UnicodeSet getAllowedChars()
public boolean failsChecks(String text, SpoofChecker.CheckResult checkResult)
text - A String to be checked for possible security issues.
checkResult - Output parameter, indicates which specific tests failed. May be null if the information is not wanted.
public boolean failsChecks(String text)
text - A String to be checked for possible security issues.
public int areConfusable(String s1, String s2)
s1 - The first of the two strings to be compared for confusability.
s2 - The second of the two strings to be compared for confusability.
public String getSkeleton(int type, String id)
type - The type of skeleton, corresponding to which of the Unicode confusable data tables to use. The default is Mixed-Script, Lowercase. Allowed options are SINGLE_SCRIPT_CONFUSABLE and ANY_CASE. The two flags may be ORed.
id - The input identifier whose skeleton will be generated.
Copyright (c) 2013 IBM Corporation and others.