Package

com.nexthink.utils.parsing

distance


package distance

Linear Supertypes

AnyRef, Any

Type Members

trait EditDistance[E] extends AnyRef

Value Members

object DiceSorensenDistance
object JaroWinklerDistance

The Jaro-Winkler distance measures the similarity between two strings. This metric is best suited to short strings such as people's names, since it compares characters only within a limited window (whereas edit distance methods compare all characters).
See https://en.wikipedia.org/wiki/Jaro%E2%80%93Winkler_distance for the definition. See http://alias-i.com/lingpipe/docs/api/com/aliasi/spell/JaroWinklerDistance.html for a detailed explanation of the algorithm.
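As an illustration of the windowed comparison described above, here is a minimal sketch of the textbook Jaro-Winkler similarity. It is not the code of `JaroWinklerDistance`: the object name `JaroWinklerSketch`, the method names, the scaling factor of 0.1 and the prefix cap of 4 are assumptions taken from the common definition, and the sketch returns a similarity in [0, 1], whereas the object documented here may report a distance instead.

```scala
object JaroWinklerSketch {
  /** Jaro similarity in [0, 1]; 1.0 means the strings are identical. */
  def jaro(a: String, b: String): Double =
    if (a.isEmpty && b.isEmpty) 1.0
    else if (a.isEmpty || b.isEmpty) 0.0
    else {
      // Characters only match if their positions differ by at most `window`.
      val window = math.max(0, math.max(a.length, b.length) / 2 - 1)
      val aMatched = new Array[Boolean](a.length)
      val bMatched = new Array[Boolean](b.length)
      var matches = 0
      for (i <- a.indices) {
        val hi = math.min(b.length - 1, i + window)
        var j = math.max(0, i - window)
        var found = false
        while (!found && j <= hi) {
          if (!bMatched(j) && a(i) == b(j)) {
            aMatched(i) = true
            bMatched(j) = true
            matches += 1
            found = true
          }
          j += 1
        }
      }
      if (matches == 0) 0.0
      else {
        // Matched characters that appear in a different order count as half a transposition each.
        val aSeq = a.indices.filter(aMatched(_)).map(a(_))
        val bSeq = b.indices.filter(bMatched(_)).map(b(_))
        val transpositions = aSeq.zip(bSeq).count { case (x, y) => x != y } / 2
        val m = matches.toDouble
        (m / a.length + m / b.length + (m - transpositions) / m) / 3.0
      }
    }

  /** Jaro similarity boosted for a shared prefix of up to 4 characters (assumed defaults). */
  def jaroWinkler(a: String, b: String, scaling: Double = 0.1): Double = {
    val j = jaro(a, b)
    val limit = math.min(4, math.min(a.length, b.length))
    val prefix = (0 until limit).takeWhile(i => a(i) == b(i)).length
    j + prefix * scaling * (1.0 - j)
  }
}
```

For the classic pair "MARTHA" and "MARHTA", all six characters match with one transposition, so the Jaro similarity is about 0.944 and the three-character common prefix raises the Jaro-Winkler value to about 0.961.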
object LevenshteinDistance extends EditDistance[Char]

Levenshtein distance is the classical string difference metric. It is the minimum number of single-character edits (insertions, deletions or substitutions) required to change one word into another. It is typically implemented with a dynamic programming approach.
See https://en.wikipedia.org/wiki/Levenshtein_distance
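The dynamic programming approach mentioned above can be sketched as follows. This is a generic two-row implementation written for illustration, not the code of `LevenshteinDistance`; the object name `LevenshteinSketch` and the method name `levenshtein` are hypothetical.

```scala
object LevenshteinSketch {
  /** Minimum number of single-character insertions, deletions or substitutions. */
  def levenshtein(a: String, b: String): Int = {
    // previous(j) = distance between the processed prefix of `a` and the first j characters of `b`.
    var previous = Array.tabulate(b.length + 1)(identity)
    for (i <- 1 to a.length) {
      val current = new Array[Int](b.length + 1)
      current(0) = i // deleting the first i characters of `a`
      for (j <- 1 to b.length) {
        val substitution = previous(j - 1) + (if (a(i - 1) == b(j - 1)) 0 else 1)
        val deletion = previous(j) + 1
        val insertion = current(j - 1) + 1
        current(j) = math.min(substitution, math.min(deletion, insertion))
      }
      previous = current
    }
    previous(b.length)
  }
}
```

Keeping only two rows of the table brings memory use down to O(|b|) while the running time stays O(|a| × |b|). For example, `LevenshteinSketch.levenshtein("kitten", "sitting")` is 3: two substitutions and one insertion.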
object NgramDistance extends EditDistance[String]

N-gram edit distance is an edit distance metric which considers multiple characters at a time. It takes the idea of Levenshtein distance and treats each n-gram as a character. As a result, insertions and deletions which don't involve double letters are penalized more heavily with n-grams than with unigrams. In essence, it introduces a notion of context and favors strings with continuous stretches of equal characters (since it multiplies the number of comparisons). It is generally used with bigrams, which offer the best efficiency/performance trade-off. We also refine this approach with some level of partial credit for n-grams that share common characters. In addition, string affixing allows the first character to participate in the same number of n-grams as an intermediate character. Finally, words that don't begin with the same n-1 characters receive a penalty for not matching the prefix.
See http://webdocs.cs.ualberta.ca/~kondrak/papers/spire05.pdf (N-Gram Similarity and Distance, Grzegorz Kondrak, 2005). This approach is described in "Taming Text", chapter 4 "Fuzzy string matching", https://www.manning.com/books/taming-text
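To make the idea of "treating each n-gram as a character" concrete, here is a simplified sketch. It pads the string with n-1 prefix characters, extracts the n-grams, and runs a plain unit-cost edit distance over the resulting sequences. It deliberately omits the partial credit for n-grams that share characters, so it is not the metric computed by `NgramDistance`. All names (`NgramSketch`, `padPrefix`, `slidingNgrams`, `editDistance`, `bigramDistance`) and the `'_'` padding character are assumptions made for the example and do not reflect the behavior of `affix` or `ngrams` in this package.

```scala
object NgramSketch {
  /** Prepends arity-1 padding characters so the first letter appears in `arity` n-grams. */
  def padPrefix(s: String, arity: Int, pad: Char = '_'): String = {
    require(arity >= 1, "arity must be at least 1")
    pad.toString * (arity - 1) + s
  }

  /** All contiguous substrings of length `arity`, in order. */
  def slidingNgrams(s: String, arity: Int): Seq[String] = {
    require(arity >= 1, "arity must be at least 1")
    if (s.length < arity) Vector.empty else s.sliding(arity).toVector
  }

  /** Unit-cost edit distance over sequences of arbitrary elements (no partial credit). */
  def editDistance[E](a: Seq[E], b: Seq[E]): Int = {
    val start: Vector[Int] = Vector.range(0, b.length + 1)
    a.foldLeft(start) { (prev, x) =>
      // Build the next row left to right; its first cell counts the deletions so far.
      b.zipWithIndex.foldLeft(Vector(prev.head + 1)) { case (cur, (y, j)) =>
        val substitution = prev(j) + (if (x == y) 0 else 1)
        cur :+ math.min(substitution, math.min(prev(j + 1) + 1, cur.last + 1))
      }
    }.last
  }

  /** Edit distance where each prefix-padded bigram plays the role of one character. */
  def bigramDistance(a: String, b: String): Int =
    editDistance(slidingNgrams(padPrefix(a, 2), 2), slidingNgrams(padPrefix(b, 2), 2))
}
```

With these definitions, "hello" yields the bigrams `_h, he, el, ll, lo`. Replacing a single letter to get "hallo" changes two bigrams, so `bigramDistance("hello", "hallo")` is 2 while the character-level distance is 1, which illustrates how n-grams add context to the comparison.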
def affix(string: String)(arity: Int): String
def bigramsWithAffixing(string: String): Seq[String]
def ngrams(string: String)(arity: Int): Seq[String]
def ngramsWithAffixing(string: String)(arity: Int): Seq[String]
def tokenizeWords(s: String): Array[String]
def trigramsWithAffixing(string: String): Seq[String]
