minimum cosine similarity two items must have; pairs below this threshold are discarded from the result set
number of random vectors (hyperplanes), d, used to generate bit vectors of length d
beam factor, i.e. how many neighbours are considered in the sliding window
number of times the bit sets are permuted
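To illustrate how d random hyperplanes turn a vector into a bit vector of length d, here is a minimal sketch in plain Python. The names `random_hyperplanes` and `signature`, the Gaussian sampling, and the `>= 0` sign convention are assumptions made for this example, not the implementation's actual API.

```python
import random


def random_hyperplanes(d, dim, seed=0):
    """Draw d random Gaussian vectors (hyperplane normals) of dimension dim."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(d)]


def signature(vector, hyperplanes):
    """Return one bit per hyperplane.

    Bit i is 1 when the dot product with hyperplane i is non-negative
    (assumed convention), otherwise 0.
    """
    return [
        1 if sum(v * h for v, h in zip(vector, plane)) >= 0 else 0
        for plane in hyperplanes
    ]


# Hypothetical usage: 16 hyperplanes over 3-dimensional input vectors.
planes = random_hyperplanes(d=16, dim=3)
sig = signature([1.0, 2.0, 3.0], planes)
```

Because only the sign of each dot product is kept, scaling a vector by a positive factor does not change its signature, which is why such signatures suit cosine similarity.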
Creates a sliding window
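A sliding window over a sorted list of signatures could look roughly like the sketch below. The function name `sliding_window` and the behaviour for inputs shorter than the beam (a single partial window) are assumptions for illustration.

```python
def sliding_window(items, beam):
    """Yield every run of `beam` consecutive items.

    If there are fewer than `beam` items, a single window holding all of
    them is yielded (assumed behaviour); an empty input yields nothing.
    """
    if beam < 1:
        raise ValueError("beam must be at least 1")
    if len(items) <= beam:
        if items:
            yield list(items)
        return
    for start in range(len(items) - beam + 1):
        yield list(items[start:start + beam])
```

A larger beam compares each signature with more of its sorted neighbours, trading extra work for a better chance of finding similar items.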
Generates a random permutation of size n
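One simple way to generate such a permutation is to shuffle the index range 0..n-1. The sketch below assumes that representation (a list where position i holds the source index) and uses a hypothetical `seed` parameter for reproducibility.

```python
import random


def random_permutation(n, seed=None):
    """Return the indices 0..n-1 in a random order."""
    rng = random.Random(seed)
    perm = list(range(n))
    rng.shuffle(perm)  # in-place Fisher-Yates shuffle
    return perm
```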
Finds the k nearest neighbors of every object in a data set among the other objects in the same data set.
Generate all pairs and emit if cosine of pair > minCosineSimilarity
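The pair generation and threshold check can be sketched as follows. The `(id, vector)` tuple layout and the names `cosine` and `similar_pairs` are assumptions for this example; only the strict `>` comparison against the minimum cosine similarity comes from the description above.

```python
import math
from itertools import combinations


def cosine(a, b):
    """Cosine similarity of two equal-length dense vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    # A zero vector has no direction; treat its similarity as 0 (assumption).
    return dot / norm if norm else 0.0


def similar_pairs(window, min_cosine_similarity):
    """Yield (id_a, id_b, similarity) for every pair above the threshold.

    `window` is a list of (id, vector) tuples (assumed layout).
    """
    for (id_a, vec_a), (id_b, vec_b) in combinations(window, 2):
        sim = cosine(vec_a, vec_b)
        if sim > min_cosine_similarity:
            yield id_a, id_b, sim
```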
Orders an RDD of signatures by their bit set representation
Permutes a bit set representation of a vector by a given permutation
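Applying a permutation to a bit vector might look like the sketch below. It assumes the bits are held in a plain list and that output position i takes its bit from `permutation[i]`; the actual implementation may use the opposite direction or a different bit set type.

```python
def permute_bits(bits, permutation):
    """Reorder `bits` so that result[i] == bits[permutation[i]]."""
    if len(bits) != len(permutation):
        raise ValueError("bits and permutation must have the same length")
    return [bits[src] for src in permutation]
```

Sorting the permuted bit vectors brings together items that agree on a different set of leading bits each time, so repeated permutations surface neighbours that a single ordering would miss.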
Permutes a signature by a given permutation
LSH implementation as described in 'Randomized Algorithms and NLP: Using Locality Sensitive Hash Functions for High Speed Noun Clustering' by Ravichandran et al. See the original publication for a detailed description of the parameters.
http://dl.acm.org/citation.cfm?id=1219917