abstract class MinHasher[H] extends Monoid[MinHashSignature]
Instances of MinHasher can create, combine, and compare fixed-sized signatures of arbitrarily sized sets.
A signature is represented by a byte array of approx maxBytes size. You can initialize a signature with a single element, usually a Long or String. You can combine any two set's signatures to produce the signature of their union. You can compare any two set's signatures to estimate their Jaccard similarity. You can use a set's signature to estimate the number of distinct values in the set. You can also use a combination of the above to estimate the size of the intersection of two sets from their signatures. The more bytes in the signature, the more accurate all of the above will be.
You can also use these signatures to quickly find similar sets without doing n^2 comparisons. Each signature is assigned to several buckets; sets whose signatures end up in the same bucket are likely to be similar. The targetThreshold controls the desired level of similarity - the higher the threshold, the more efficiently you can find all the similar sets.
This abstract superclass is generic with regards to the size of the hash used. Depending on the number of unique values in the domain of the sets, you may want a MinHasher16, a MinHasher32, or a new custom subclass.
This implementation is modeled after Chapter 3 of Ullman and Rajaraman's Mining of Massive Datasets: http://infolab.stanford.edu/~ullman/mmds/ch3a.pdf
- Alphabetic
- By Inheritance
- MinHasher
- Monoid
- AdditiveMonoid
- Monoid
- Semigroup
- AdditiveSemigroup
- Semigroup
- Serializable
- AnyRef
- Any
- Hide All
- Show All
- Public
- Protected
Instance Constructors
- new MinHasher(numHashes: Int, numBands: Int)(implicit n: Numeric[H])
Abstract Value Members
- abstract def buildArray(left: Array[Byte], right: Array[Byte])(fn: (H, H) => H): Array[Byte]
Decode two signatures into hash values, combine them somehow, and produce a new array
Decode two signatures into hash values, combine them somehow, and produce a new array
- Attributes
- protected
- abstract def buildArray(fn: => H): Array[Byte]
Initialize a byte array by generating hash values
Initialize a byte array by generating hash values
- Attributes
- protected
- abstract def hashSize: Int
The number of bytes used for each hash in the signature
- abstract def maxHash: H
Maximum value the hash can take on (not 2*hashSize because of signed types)
Concrete Value Members
- final def !=(arg0: Any): Boolean
- Definition Classes
- AnyRef → Any
- final def ##: Int
- Definition Classes
- AnyRef → Any
- final def ==(arg0: Any): Boolean
- Definition Classes
- AnyRef → Any
- def additive: algebra.Monoid[MinHashSignature]
These are from algebra.Monoid
- final def asInstanceOf[T0]: T0
- Definition Classes
- Any
- def assertNotZero(v: MinHashSignature): Unit
- Definition Classes
- Monoid
- def buckets(sig: MinHashSignature): List[Long]
Bucket keys to use for quickly finding other similar items via locality sensitive hashing
- def clone(): AnyRef
- Attributes
- protected[lang]
- Definition Classes
- AnyRef
- Annotations
- @throws(classOf[java.lang.CloneNotSupportedException]) @native()
- def combine(l: MinHashSignature, r: MinHashSignature): MinHashSignature
- Definition Classes
- Semigroup → Semigroup
- def combineAll(t: TraversableOnce[MinHashSignature]): MinHashSignature
- Definition Classes
- Monoid → Monoid
- def combineAllOption(as: IterableOnce[MinHashSignature]): Option[MinHashSignature]
- Definition Classes
- Monoid → Semigroup
- def combineN(a: MinHashSignature, n: Int): MinHashSignature
- Definition Classes
- Monoid → Semigroup
- def empty: MinHashSignature
- Definition Classes
- Monoid → Monoid
- final def eq(arg0: AnyRef): Boolean
- Definition Classes
- AnyRef
- def equals(arg0: AnyRef): Boolean
- Definition Classes
- AnyRef → Any
- val estimatedThreshold: Double
Useful for understanding the effects of numBands and numRows
- def finalize(): Unit
- Attributes
- protected[lang]
- Definition Classes
- AnyRef
- Annotations
- @throws(classOf[java.lang.Throwable])
- final def getClass(): Class[_ <: AnyRef]
- Definition Classes
- AnyRef → Any
- Annotations
- @native()
- def hashCode(): Int
- Definition Classes
- AnyRef → Any
- Annotations
- @native()
- def init(fn: (MurmurHash128) => (Long, Long)): MinHashSignature
Create a signature for an arbitrary value
- def init(value: String): MinHashSignature
Create a signature for a single String value
- def init(value: Long): MinHashSignature
Create a signature for a single Long value
- def isEmpty(a: MinHashSignature)(implicit ev: Eq[MinHashSignature]): Boolean
- Definition Classes
- Monoid
- final def isInstanceOf[T0]: Boolean
- Definition Classes
- Any
- def isNonZero(v: MinHashSignature): Boolean
- Definition Classes
- Monoid
- def isZero(a: MinHashSignature)(implicit ev: Eq[MinHashSignature]): Boolean
- Definition Classes
- AdditiveMonoid
- final def ne(arg0: AnyRef): Boolean
- Definition Classes
- AnyRef
- def nonZeroOption(v: MinHashSignature): Option[MinHashSignature]
- Definition Classes
- Monoid
- final def notify(): Unit
- Definition Classes
- AnyRef
- Annotations
- @native()
- final def notifyAll(): Unit
- Definition Classes
- AnyRef
- Annotations
- @native()
- val numBands: Int
- val numBytes: Int
For explanation of the "bands" and "rows" see Ullman and Rajaraman
- val numHashes: Int
- val numRows: Int
- def plus(left: MinHashSignature, right: MinHashSignature): MinHashSignature
Set union
Set union
- Definition Classes
- MinHasher → AdditiveSemigroup
- def positiveSumN(a: MinHashSignature, n: Int): MinHashSignature
- Attributes
- protected[this]
- Definition Classes
- AdditiveSemigroup
- def probabilityOfInclusion(sim: Double): Double
Useful for understanding the effects of numBands and numRows
- def repeatedCombineN(a: MinHashSignature, n: Int): MinHashSignature
- Attributes
- protected[this]
- Definition Classes
- Semigroup
- def similarity(left: MinHashSignature, right: MinHashSignature): Double
Esimate Jaccard similarity (size of union / size of intersection)
- def sum(vs: TraversableOnce[MinHashSignature]): MinHashSignature
- Definition Classes
- Monoid → AdditiveMonoid
- def sumN(a: MinHashSignature, n: Int): MinHashSignature
- Definition Classes
- AdditiveMonoid → AdditiveSemigroup
- def sumOption(iter: TraversableOnce[MinHashSignature]): Option[MinHashSignature]
Returns an instance of
T
calculated by summing all instances initer
in one pass.Returns an instance of
T
calculated by summing all instances initer
in one pass. ReturnsNone
ifiter
is empty, elseSome[T]
.- iter
instances of
T
to be combined- returns
None
ifiter
is empty, else an option value containing the summedT
- final def synchronized[T0](arg0: => T0): T0
- Definition Classes
- AnyRef
- def toString(): String
- Definition Classes
- AnyRef → Any
- def trySum(as: TraversableOnce[MinHashSignature]): Option[MinHashSignature]
- Definition Classes
- AdditiveMonoid → AdditiveSemigroup
- final def wait(): Unit
- Definition Classes
- AnyRef
- Annotations
- @throws(classOf[java.lang.InterruptedException])
- final def wait(arg0: Long, arg1: Int): Unit
- Definition Classes
- AnyRef
- Annotations
- @throws(classOf[java.lang.InterruptedException])
- final def wait(arg0: Long): Unit
- Definition Classes
- AnyRef
- Annotations
- @throws(classOf[java.lang.InterruptedException]) @native()
- val zero: MinHashSignature
Signature for empty set, needed to be a proper Monoid
Signature for empty set, needed to be a proper Monoid
- Definition Classes
- MinHasher → AdditiveMonoid