A simple, fixed-size bit set implementation.
Lsh implementation as described in 'Randomized Algorithms and NLP: Using Locality Sensitive Hash Function for High Speed Noun Clustering' by Ravichandran et al.
Brute force O(n2) method to compute exact nearest neighbours. As this is a very expensive computation O(n2) an additional sample parameter may be passed such that neighbours are just computed for a random fraction.
Implementation based on approximated cosine distances.
Standard Lsh implementation.
Brute force O(size(query) * size(catalog)) method to compute exact nearest neighbours for rows in the query matrix.
An id with it's hash encoding and original vector.
Represents an RDD from grouping items of its parent RDD in fixed size blocks by passing a sliding window over them.
NOTE: both classes are copied from mllib and slightly modified since these classes are mllib private! Modified lines are marked with comments
An id with it's hash encoding.
interface defining similarity measurement between 2 vectors
implementation of VectorDistance that computes cosine similarity between two vectors
Compares two bit sets according to the first different bit
Compares two bit sets for their equality
Returns a string representation of a BitSet
Take distinct matrix entry values based on the indices only.
Take distinct matrix entry values based on the indices only. The actual values are discarded.
Returns the hamming distance between two bit vectors
Approximates the cosine distance of two bit sets using their hamming distance
Returns a local k by d matrix with random gaussian entries mean=0.
Returns a local k by d matrix with random gaussian entries mean=0.0 and std=1.0
This is a k by d matrix as it is multiplied by the input matrix
Converts a given input matrix to a bit set representation using random hyperplanes
Converts a given input matrix to a bit set representation using random hyperplanes
Converts a vector to a bit set by replacing all values of x with sign(x)