Class StringSimilarity

java.lang.Object
io.github.jspinak.brobot.util.string.StringSimilarity

@Component public class StringSimilarity extends Object
Calculates string similarity using the Levenshtein distance algorithm.

This component provides methods to compute similarity scores between strings based on their edit distance. The similarity score ranges from 0.0 (completely different) to 1.0 (identical), making it useful for fuzzy string matching.

Algorithm details:

  • Uses Levenshtein distance (minimum edit operations needed)
  • Normalizes by the longer string's length for consistent scoring
  • Case-insensitive comparison in edit distance calculation
  • Space-optimized implementation (one array instead of a full matrix)
  • Important: Character transpositions (e.g., "ab" → "ba") count as TWO edits (deletion + insertion), not one. This differs from Damerau-Levenshtein distance.

Similarity formula:

 similarity = (longerLength - editDistance) / longerLength
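A minimal sketch of the formula above, assuming the edit distance has already been computed. The names SimilarityFormulaSketch and normalize are hypothetical and not part of Brobot. The cast to double matters: with integer division, every score below 1.0 would truncate to 0.

```java
final class SimilarityFormulaSketch {

    // Hypothetical helper: applies the documented formula to a precomputed edit distance.
    static double normalize(int longerLength, int editDistance) {
        if (longerLength == 0) {
            // Both strings are empty; the documentation treats them as identical.
            return 1.0;
        }
        // Cast before dividing so the result is not truncated by integer division.
        return (longerLength - editDistance) / (double) longerLength;
    }

    public static void main(String[] args) {
        System.out.println(normalize(5, 0)); // "hello" vs "hello": 1.0
        System.out.println(normalize(5, 1)); // "hello" vs "hallo": 0.8
        System.out.println(normalize(5, 2)); // "hello" vs "help":  0.6
        System.out.println(normalize(3, 3)); // "abc" vs "xyz":     0.0
    }
}
```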
 

Use cases:

  • OCR result validation and selection
  • Fuzzy text matching in UI automation
  • Detecting typos or variations in user input
  • Finding best matches in string collections
  • Duplicate detection with tolerance
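As an illustration of the "best match in a collection" use case, the sketch below picks the highest-scoring candidate at or above a threshold. BestMatchSketch, bestMatch, and minScore are hypothetical names, not Brobot API; the scorer is passed in as a function so the example stays self-contained.

```java
import java.util.List;
import java.util.Optional;
import java.util.function.ToDoubleBiFunction;

final class BestMatchSketch {

    // Returns the candidate with the highest score that is at least minScore.
    // On ties, the earliest candidate in the list wins.
    static Optional<String> bestMatch(String target,
                                      List<String> candidates,
                                      ToDoubleBiFunction<String, String> scorer,
                                      double minScore) {
        String best = null;
        double bestScore = -1.0;
        for (String candidate : candidates) {
            double score = scorer.applyAsDouble(target, candidate);
            if (score >= minScore && score > bestScore) {
                best = candidate;
                bestScore = score;
            }
        }
        return Optional.ofNullable(best);
    }
}
```

Because the documented similarity method takes two strings and returns a double, StringSimilarity::similarity fits the scorer parameter, for example when choosing among several OCR readings of the same label.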

Performance characteristics:

  • Time complexity: O(m × n) where m, n are string lengths
  • Space complexity: O(min(m, n)) for similarity; O(n), with n = s2.length(), for a direct editDistance call
  • Suitable for short to moderate string lengths; running time grows with the product of the two lengths

Based on: https://stackoverflow.com/questions/955110/similarity-string-comparison-in-java/16018452

Thread safety: All methods are stateless and thread-safe.

  • Constructor Details

    • StringSimilarity

      public StringSimilarity()
  • Method Details

    • similarity

      public static double similarity(String s1, String s2)
      Calculates the similarity score between two strings.

      Returns a normalized score between 0.0 and 1.0, where:

      • 1.0 = Identical strings
      • 0.5 = Edit distance is half the longer string's length
      • 0.0 = Completely different (edit distance equals longer length)

      Algorithm steps:

      1. Identify longer and shorter strings
      2. Calculate edit distance between them
      3. Normalize by longer string's length
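The three steps above can be sketched as follows. This is an illustration under stated assumptions, not the Brobot source: SimilarityStepsSketch is a hypothetical class, and the distance function is passed in as a parameter so the example stays self-contained.

```java
import java.util.function.ToIntBiFunction;

final class SimilarityStepsSketch {

    static double similarity(String s1, String s2,
                             ToIntBiFunction<String, String> distance) {
        // Step 1: identify the longer and the shorter string.
        String longer = s1;
        String shorter = s2;
        if (s1.length() < s2.length()) {
            longer = s2;
            shorter = s1;
        }
        int longerLength = longer.length();
        if (longerLength == 0) {
            return 1.0; // both strings are empty, treated as identical
        }
        // Step 2: calculate the edit distance between them.
        int edits = distance.applyAsInt(longer, shorter);
        // Step 3: normalize by the longer string's length.
        return (longerLength - edits) / (double) longerLength;
    }
}
```

The documented editDistance(String, String) method matches the ToIntBiFunction shape, so StringSimilarity::editDistance could be passed as the distance argument.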

      Examples:

      • similarity("hello", "hello") = 1.0
      • similarity("hello", "hallo") = 0.8
      • similarity("hello", "help") = 0.6
      • similarity("abc", "xyz") = 0.0

      Special cases:

      • Both empty strings: Returns 1.0 (considered identical)
      • Exactly one empty string: Returns 0.0 (the edit distance equals the other string's length)
      • Order independent: similarity(a,b) = similarity(b,a)
      Parameters:
      s1 - the first string to compare
      s2 - the second string to compare
      Returns:
      similarity score between 0.0 and 1.0 inclusive
    • editDistance

      public static int editDistance(String s1, String s2)
      Calculates the Levenshtein edit distance between two strings.

      The edit distance is the minimum number of single-character edits (insertions, deletions, or substitutions) required to transform one string into another.

      Note on transpositions: This implementation uses standard Levenshtein distance, which counts character transpositions (swapping adjacent characters) as TWO edits. For example, "ab" → "ba" has an edit distance of 2 (delete 'b', insert 'b'), not 1. Use Damerau-Levenshtein distance if you need transpositions to count as a single edit.
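To make the difference concrete, the sketch below computes both variants with a full DP matrix. With allowTransposition set to false it is plain Levenshtein; with true it adds the optimal string alignment rule (the restricted form of Damerau-Levenshtein). TranspositionSketch is a hypothetical class, not part of Brobot, and unlike the documented method it compares characters case-sensitively.

```java
final class TranspositionSketch {

    static int distance(String a, String b, boolean allowTransposition) {
        int[][] d = new int[a.length() + 1][b.length() + 1];
        for (int i = 0; i <= a.length(); i++) {
            d[i][0] = i; // delete all i characters of a
        }
        for (int j = 0; j <= b.length(); j++) {
            d[0][j] = j; // insert all j characters of b
        }
        for (int i = 1; i <= a.length(); i++) {
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                int best = Math.min(
                        Math.min(d[i - 1][j] + 1,   // deletion
                                 d[i][j - 1] + 1),  // insertion
                        d[i - 1][j - 1] + cost);    // substitution or match
                // Optimal string alignment: swapping two adjacent characters costs one edit.
                if (allowTransposition && i > 1 && j > 1
                        && a.charAt(i - 1) == b.charAt(j - 2)
                        && a.charAt(i - 2) == b.charAt(j - 1)) {
                    best = Math.min(best, d[i - 2][j - 2] + 1);
                }
                d[i][j] = best;
            }
        }
        return d[a.length()][b.length()];
    }

    public static void main(String[] args) {
        System.out.println(distance("ab", "ba", false)); // 2 (standard Levenshtein)
        System.out.println(distance("ab", "ba", true));  // 1 (transposition counted once)
    }
}
```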

      Implementation details:

      • Space-optimized dynamic programming approach
      • Uses single array instead of full matrix
      • Case-insensitive comparison (converts to lowercase)
      • Processes strings character by character
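A single-array version of the dynamic program described above might look like the sketch below. It is an independent illustration, not the Brobot source: EditDistanceSketch is a hypothetical class, and the use of Locale.ROOT for lowercasing is an assumption made to keep the behavior independent of the default locale.

```java
import java.util.Locale;

final class EditDistanceSketch {

    static int editDistance(String s1, String s2) {
        // Case-insensitive comparison: lowercase both inputs first.
        String a = s1.toLowerCase(Locale.ROOT);
        String b = s2.toLowerCase(Locale.ROOT);

        // row[j] holds the distance between the current prefix of a and b[0..j).
        int[] row = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            row[j] = j; // empty prefix of a: j insertions
        }
        for (int i = 1; i <= a.length(); i++) {
            int diagonal = row[0]; // value from the previous row, column j - 1
            row[0] = i;            // empty prefix of b: i deletions
            for (int j = 1; j <= b.length(); j++) {
                int above = row[j]; // value from the previous row, column j
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                row[j] = Math.min(
                        Math.min(above + 1,       // deletion
                                 row[j - 1] + 1), // insertion
                        diagonal + cost);         // substitution or match
                diagonal = above;
            }
        }
        return row[b.length()];
    }
}
```

Only one row of length s2.length() + 1 is kept; the diagonal variable carries the one value from the previous row that would otherwise be overwritten.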

      Algorithm visualization:

       s1 = "cat", s2 = "cut"
       Edit operations: substitute 'a' with 'u'
       Edit distance = 1
       

      Examples:

      • editDistance("kitten", "sitting") = 3
      • editDistance("saturday", "sunday") = 3
      • editDistance("abc", "abc") = 0
      • editDistance("abc", "") = 3

      Performance notes:

      • Time: O(m × n) where m = s1.length(), n = s2.length()
      • Space: O(n) - only stores one row of the DP matrix
      • Lowercase conversion adds a linear pass over each string but makes the comparison case-insensitive

      Based on: http://rosettacode.org/wiki/Levenshtein_distance#Java

      Parameters:
      s1 - the source string
      s2 - the target string
      Returns:
      the minimum number of edits needed to transform s1 into s2
    • printSimilarity

      public static void printSimilarity(String s, String t)
      Prints a formatted similarity report for two strings.

      Outputs the similarity score with 3 decimal places along with the compared strings in quotes for clarity. Useful for debugging and analysis of string matching results.

      Output format:

       0.800 is the similarity between "hello" and "hallo"
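A report line in this format can be produced with String.format, as in the sketch below. SimilarityReportSketch and its format method are hypothetical, and the score is passed in rather than computed; Locale.ROOT is an assumption that keeps the decimal separator a dot regardless of the default locale.

```java
import java.util.Locale;

final class SimilarityReportSketch {

    // Builds the report line: score with 3 decimal places, both strings in quotes.
    static String format(double score, String s, String t) {
        return String.format(Locale.ROOT,
                "%.3f is the similarity between \"%s\" and \"%s\"", score, s, t);
    }

    public static void main(String[] args) {
        // Prints: 0.800 is the similarity between "hello" and "hallo"
        System.out.println(format(0.8, "hello", "hallo"));
    }
}
```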
       

      Use cases:

      • Debugging OCR results
      • Analyzing text matching thresholds
      • Logging similarity calculations
      • Testing string comparison algorithms
      Parameters:
      s - the first string to compare
      t - the second string to compare