Class JaroWinklerDistance

  • All Implemented Interfaces:
    SimilarityScore<Double>

    public class JaroWinklerDistance
    extends Object
    implements SimilarityScore<Double>
    A similarity algorithm indicating the percentage of matched characters between two character sequences.

    The Jaro measure is the weighted sum of the percentage of matched characters from each sequence and of transposed characters. Winkler increased this measure for matching initial characters.

    This implementation is based on the Jaro Winkler similarity algorithm from http://en.wikipedia.org/wiki/Jaro%E2%80%93Winkler_distance.

    This code has been adapted from Apache Commons Lang 3.3.
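For reference, the measure described above is usually written as follows. This is the formulation from the linked Wikipedia article, not taken from this class's source; the symbols m (matched characters), t (half the number of matched characters that are out of order), ℓ (length of the common prefix, capped at 4) and p (scaling factor, commonly 0.1) are that article's conventions and are assumptions as far as this implementation is concerned.

```latex
\mathrm{sim}_j =
\begin{cases}
0 & \text{if } m = 0,\\[4pt]
\dfrac{1}{3}\left(\dfrac{m}{|s_1|} + \dfrac{m}{|s_2|} + \dfrac{m - t}{m}\right) & \text{otherwise,}
\end{cases}
\qquad
\mathrm{sim}_w = \mathrm{sim}_j + \ell\, p \,\bigl(1 - \mathrm{sim}_j\bigr)
```

For "frog" and "fog", m = 3, t = 0 and ℓ = 1, so sim_j = (3/4 + 3/3 + 3/3) / 3 ≈ 0.9167 and sim_w ≈ 0.9167 + 1 · 0.1 · 0.0833 = 0.925, which agrees with the 0.93 listed for apply once rounded to two decimals.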

    Since:
    1.0
    • Field Detail

      • INDEX_NOT_FOUND

        public static final int INDEX_NOT_FOUND
        Represents a failed index search.
        See Also:
        Constant Field Values
    • Constructor Detail

      • JaroWinklerDistance

        public JaroWinklerDistance()
    • Method Detail

      • apply

        public Double apply(CharSequence left,
                            CharSequence right)
        Computes the Jaro-Winkler score, which indicates the similarity between two CharSequences. Higher values mean greater similarity.
         distance.apply(null, null)          = IllegalArgumentException
         distance.apply("","")               = 0.0
         distance.apply("","a")              = 0.0
         distance.apply("aaapppp", "")       = 0.0
         distance.apply("frog", "fog")       = 0.93
         distance.apply("fly", "ant")        = 0.0
         distance.apply("elephant", "hippo") = 0.44
         distance.apply("hippo", "elephant") = 0.44
         distance.apply("hippo", "zzzzzzzz") = 0.0
         distance.apply("hello", "hallo")    = 0.88
         distance.apply("ABC Corporation", "ABC Corp") = 0.93
         distance.apply("D N H Enterprises Inc", "D & H Enterprises, Inc.") = 0.95
         distance.apply("My Gym Children's Fitness Center", "My Gym. Childrens Fitness") = 0.92
         distance.apply("PENNSYLVANIA", "PENNCISYLVNIA")    = 0.88
         
        Specified by:
        apply in interface SimilarityScore<Double>
        Parameters:
        left - the first CharSequence, must not be null
        right - the second CharSequence, must not be null
        Returns:
        the similarity score between the two inputs
        Throws:
        IllegalArgumentException - if either CharSequence input is null
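The listed results can be reproduced by hand from three counts: the number of matches, the number of half transpositions, and the common prefix length. The following is a minimal sketch of that final step, not this class's actual code. The class name `JaroWinklerScoreSketch`, the method `score`, the scaling factor of 0.1 and the prefix cap of 4 are all assumptions taken from the standard formulation. The sketch also does not round, whereas the values listed above appear to be rounded to two decimals.

```java
final class JaroWinklerScoreSketch {

    // Assumed constants from the standard Winkler formulation; not confirmed by this page.
    private static final double SCALING_FACTOR = 0.1;
    private static final int MAX_PREFIX = 4;

    /**
     * Hypothetical helper: combines precomputed counts into a Jaro-Winkler similarity.
     *
     * @param matches            number of matching characters
     * @param halfTranspositions half the number of matched characters that are out of order
     * @param prefix             length of the common prefix
     * @param leftLength         length of the first sequence
     * @param rightLength        length of the second sequence
     */
    public static double score(int matches, int halfTranspositions, int prefix,
                               int leftLength, int rightLength) {
        if (matches == 0) {
            return 0.0; // no characters in common, so the similarity is zero
        }
        double m = matches;
        // Jaro measure: mean of the two match ratios and the in-order ratio.
        double jaro = (m / leftLength + m / rightLength + (m - halfTranspositions) / m) / 3.0;
        // Winkler adjustment: boost the score for a shared prefix (capped, by assumption, at 4).
        int cappedPrefix = Math.min(prefix, MAX_PREFIX);
        return jaro + cappedPrefix * SCALING_FACTOR * (1.0 - jaro);
    }

    public static void main(String[] args) {
        // "frog" vs "fog": 3 matches (f, o, g), no transpositions, common prefix "f".
        System.out.println(score(3, 0, 1, 4, 3));
        // "hello" vs "hallo": 4 matches (h, l, l, o), no transpositions, common prefix "h".
        System.out.println(score(4, 0, 1, 5, 5));
        // "fly" vs "ant": nothing matches.
        System.out.println(score(0, 0, 0, 3, 3));
    }
}
```

Unrounded, the first two calls come out at approximately 0.925 and 0.88, in line with the 0.93 and 0.88 listed for "frog"/"fog" and "hello"/"hallo".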
      • matches

        protected static int[] matches(CharSequence first,
                                       CharSequence second)
        Returns an array holding the number of Jaro-Winkler matches, the number of half transpositions, and the common prefix length for two strings.
        Parameters:
        first - the first string to be matched
        second - the second string to be matched
        Returns:
        mtp array containing: matches, half transpositions, and prefix
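To make the three returned counts concrete, here is a minimal sketch of how they are commonly derived in the textbook Jaro procedure. It is not the body of matches itself: the class name `JaroMatchesSketch`, the method `matchCounts`, the match window of max(len) / 2 - 1 and the uncapped prefix are assumptions for illustration, and the real method may differ in detail.

```java
final class JaroMatchesSketch {

    /**
     * Hypothetical stand-in for a matches-style helper.
     * Returns {matches, half transpositions, common prefix length}.
     */
    public static int[] matchCounts(CharSequence first, CharSequence second) {
        // Two equal characters only count as a match if their positions are
        // at most this far apart (textbook Jaro window; an assumption here).
        int window = Math.max(Math.max(first.length(), second.length()) / 2 - 1, 0);
        boolean[] firstMatched = new boolean[first.length()];
        boolean[] secondMatched = new boolean[second.length()];

        int matches = 0;
        for (int i = 0; i < first.length(); i++) {
            int lo = Math.max(0, i - window);
            int hi = Math.min(second.length(), i + window + 1);
            for (int j = lo; j < hi; j++) {
                if (!secondMatched[j] && first.charAt(i) == second.charAt(j)) {
                    firstMatched[i] = true;
                    secondMatched[j] = true;
                    matches++;
                    break; // each character is matched at most once
                }
            }
        }

        // Compare the matched characters of both sequences in order. Every
        // position where they differ is one out-of-order character, and two
        // out-of-order characters make up one transposition.
        int outOfOrder = 0;
        int k = 0;
        for (int i = 0; i < first.length(); i++) {
            if (!firstMatched[i]) {
                continue;
            }
            while (!secondMatched[k]) {
                k++;
            }
            if (first.charAt(i) != second.charAt(k)) {
                outOfOrder++;
            }
            k++;
        }

        // Length of the shared prefix, left uncapped in this sketch.
        int prefix = 0;
        int limit = Math.min(first.length(), second.length());
        while (prefix < limit && first.charAt(prefix) == second.charAt(prefix)) {
            prefix++;
        }

        return new int[] {matches, outOfOrder / 2, prefix};
    }

    public static void main(String[] args) {
        System.out.println(java.util.Arrays.toString(matchCounts("frog", "fog")));
        System.out.println(java.util.Arrays.toString(matchCounts("MARTHA", "MARHTA")));
    }
}
```

For "frog" and "fog" the sketch finds three matches, no transpositions and a one-character prefix. "MARTHA" and "MARHTA" give six matches, one transposition (the swapped T and H) and a three-character prefix.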