Package

com.nexthink.utils.parsing

distance

package distance

Linear Supertypes
AnyRef, Any

Type Members

  1. trait EditDistance[E] extends AnyRef

Value Members

  1. object DiceSorensenDistance

  2. object JaroWinklerDistance

    The Jaro-Winkler distance measures the similarity between two strings. This metric is best suited to short strings such as people's names, since it compares characters within a limited window (whereas edit distance methods compare all characters).

    See https://en.wikipedia.org/wiki/Jaro%E2%80%93Winkler_distance for the definition. See http://alias-i.com/lingpipe/docs/api/com/aliasi/spell/JaroWinklerDistance.html for a detailed explanation of the algorithm.
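
    The windowed comparison described above can be sketched as follows. This is a minimal illustration of the textbook Jaro-Winkler formula, not the code behind `JaroWinklerDistance`: the object name `JaroWinklerSketch`, its method names, the 0.1 scaling factor and the prefix cap of 4 are assumptions taken from the usual definition, and the result is a similarity in [0, 1] where 1 means identical.

    ```scala
    // Hypothetical sketch; names and parameters are assumptions, not the library's API.
    object JaroWinklerSketch {

      /** Jaro similarity: characters match only if they lie within a limited window of each other. */
      def jaro(a: String, b: String): Double =
        if (a.isEmpty && b.isEmpty) 1.0
        else {
          // Half the longer length, minus one (never negative).
          val window = math.max(0, math.max(a.length, b.length) / 2 - 1)
          val aMatched = new Array[Boolean](a.length)
          val bMatched = new Array[Boolean](b.length)
          var matches = 0
          for (i <- a.indices) {
            val lo = math.max(0, i - window)
            val hi = math.min(b.length - 1, i + window)
            // First still-unmatched position of b inside the window holding the same character.
            (lo to hi).find(k => !bMatched(k) && a(i) == b(k)).foreach { k =>
              aMatched(i) = true
              bMatched(k) = true
              matches += 1
            }
          }
          if (matches == 0) 0.0
          else {
            // Matched characters in order of appearance; out-of-order pairs count as half a transposition each.
            val aSeq = a.indices.filter(i => aMatched(i)).map(i => a(i))
            val bSeq = b.indices.filter(k => bMatched(k)).map(k => b(k))
            val transpositions = aSeq.zip(bSeq).count { case (x, y) => x != y } / 2
            val m = matches.toDouble
            (m / a.length + m / b.length + (m - transpositions) / m) / 3.0
          }
        }

      /** Winkler adjustment: boost the Jaro score for a shared prefix of up to 4 characters. */
      def similarity(a: String, b: String, scaling: Double = 0.1): Double = {
        val j = jaro(a, b)
        val prefix = math.min(4, a.zip(b).takeWhile { case (x, y) => x == y }.length)
        j + prefix * scaling * (1.0 - j)
      }
    }
    ```

    With these assumptions, `JaroWinklerSketch.similarity("MARTHA", "MARHTA")` evaluates to roughly 0.961: all six characters match, one transposition is counted, and the shared prefix "MAR" raises the Jaro score of about 0.944.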

  3. object LevenshteinDistance extends EditDistance[Char]

    Levenshtein distance is the classical string difference metric. It is the minimum number of single-character edits (i.e. insertions, deletions or substitutions) required to change one word into the other. It is typically implemented with a dynamic programming approach.

    See https://en.wikipedia.org/wiki/Levenshtein_distance
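
    The dynamic programming approach mentioned above can be sketched as follows. This is an illustrative two-row implementation and makes no claim about how `LevenshteinDistance` is actually written; the object name `LevenshteinSketch` and its `distance` method are hypothetical.

    ```scala
    // Hypothetical sketch; not the library's LevenshteinDistance implementation.
    object LevenshteinSketch {

      /** Minimum number of single-character insertions, deletions or substitutions turning `a` into `b`. */
      def distance(a: String, b: String): Int = {
        // previous(j) = distance between the prefix of `a` handled so far and the first j characters of `b`.
        var previous = Array.range(0, b.length + 1)
        for (i <- 1 to a.length) {
          val current = new Array[Int](b.length + 1)
          current(0) = i // turning the first i characters of `a` into "" takes i deletions
          for (j <- 1 to b.length) {
            val substitution = previous(j - 1) + (if (a(i - 1) == b(j - 1)) 0 else 1)
            val deletion = previous(j) + 1
            val insertion = current(j - 1) + 1
            current(j) = math.min(substitution, math.min(deletion, insertion))
          }
          previous = current
        }
        previous(b.length)
      }
    }
    ```

    For example, `LevenshteinSketch.distance("kitten", "sitting")` is 3: two substitutions (k to s, e to i) and one insertion (g).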

  4. object NgramDistance extends EditDistance[String]

    N-gram edit distance is an edit distance metric which considers multiple characters at a time. It takes the idea of Levenshtein distance and treats each n-gram as a character. As a result, insertions and deletions which don't involve double letters are penalized more heavily with n-grams than with unigrams. In essence, it introduces a notion of context and favors strings with continuous stretches of equal characters (since it multiplies the number of comparisons). It is generally used with bigrams, which offer the best efficiency/performance ratio. We also refine this approach with some level of partial credit for n-grams that share common characters. In addition, string affixing allows the first character to participate in the same number of n-grams as an intermediate character. Also, words that don't begin with the same n-1 characters receive a penalty for not matching the prefix.

    See http://webdocs.cs.ualberta.ca/~kondrak/papers/spire05.pdf (N-Gram Similarity and Distance, Grzegorz Kondrak, 2005). This approach is described in "Taming Text", chapter 4 "Fuzzy string matching", https://www.manning.com/books/taming-text
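
    To make the idea of treating each n-gram as a character concrete, here is a simplified sketch. It is not the `NgramDistance` implementation: it omits partial credit and any normalization, the padding symbol `_` and prefix-only padding are assumptions, and the method names in `NgramSketch` merely echo the signatures listed on this page without claiming to match their behaviour.

    ```scala
    // Hypothetical sketch; padding symbol, padding side and method bodies are assumptions.
    object NgramSketch {
      val Pad = '_' // assumed padding symbol

      /** Prepend arity - 1 padding symbols so the first character appears in `arity` n-grams. */
      def affix(string: String)(arity: Int): String = {
        require(arity >= 1, "arity must be at least 1")
        (Pad.toString * (arity - 1)) + string
      }

      /** All contiguous substrings of length `arity`, in order. */
      def ngrams(string: String)(arity: Int): List[String] =
        if (string.length < arity) Nil else string.sliding(arity).toList

      def ngramsWithAffixing(string: String)(arity: Int): List[String] =
        ngrams(affix(string)(arity))(arity)

      /** Plain Levenshtein distance over arbitrary sequences (unit costs, no partial credit). */
      def editDistance[A](a: Seq[A], b: Seq[A]): Int = {
        val bs = b.toIndexedSeq
        val firstRow = (0 to bs.length).toVector
        val lastRow = a.foldLeft(firstRow) { (previous, x) =>
          // Build the next row left to right; current.last is the cell to the left.
          bs.indices.foldLeft(Vector(previous.head + 1)) { (current, j) =>
            val substitution = previous(j) + (if (x == bs(j)) 0 else 1)
            val deletion = previous(j + 1) + 1
            val insertion = current.last + 1
            current :+ math.min(substitution, math.min(deletion, insertion))
          }
        }
        lastRow.last
      }

      /** Edit distance where each affixed bigram plays the role of a character. */
      def bigramDistance(a: String, b: String): Int =
        editDistance(ngramsWithAffixing(a)(2), ngramsWithAffixing(b)(2))
    }
    ```

    Under these assumptions, `NgramSketch.ngramsWithAffixing("word")(2)` yields `List("_w", "wo", "or", "rd")`, so "w" takes part in two bigrams just like "o". Deleting the "o" costs 1 at the character level (`editDistance("word".toSeq, "wrd".toSeq)`) but 2 at the bigram level (`bigramDistance("word", "wrd")`), which illustrates the heavier penalty for deletions that don't involve double letters.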

  5. def affix(string: String)(arity: Int): String

  6. def bigramsWithAffixing(string: String): Seq[String]

  7. def ngrams(string: String)(arity: Int): Seq[String]

  8. def ngramsWithAffixing(string: String)(arity: Int): Seq[String]

  9. def tokenizeWords(s: String): Array[String]

  10. def trigramsWithAffixing(string: String): Seq[String]

