abstract class MinHasher[H] extends Monoid[MinHashSignature]

Instances of MinHasher can create, combine, and compare fixed-size signatures of arbitrarily sized sets.

A signature is represented by a byte array of approximately maxBytes in size. You can initialize a signature with a single element, usually a Long or String. You can combine the signatures of any two sets to produce the signature of their union, compare the signatures of any two sets to estimate their Jaccard similarity, and use a set's signature to estimate the number of distinct values in the set. By combining these operations, you can also estimate the size of the intersection of two sets from their signatures. The more bytes in the signature, the more accurate all of these estimates will be.
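The operations described above can be sketched with the standard library alone. This is a minimal illustration of the MinHash idea, not the library's implementation: the MinHashSketch object, the Array[Int] signature representation, the fixed numHashes of 128, and the use of seeded MurmurHash3.stringHash as the hash family are all assumptions made for the example.

```scala
import scala.util.hashing.MurmurHash3

// Hypothetical stand-in for a MinHasher; not the library's code.
object MinHashSketch {
  // Assumed signature length; more hashes give a more accurate estimate.
  val numHashes = 128

  // Signature of a one-element set: one 32-bit hash per seed.
  def init(value: String): Array[Int] =
    Array.tabulate(numHashes)(seed => MurmurHash3.stringHash(value, seed))

  // Signature of the union: element-wise minimum of the two signatures.
  def plus(left: Array[Int], right: Array[Int]): Array[Int] =
    left.zip(right).map { case (l, r) => math.min(l, r) }

  // Jaccard estimate: fraction of positions where the signatures agree.
  def similarity(left: Array[Int], right: Array[Int]): Double =
    left.zip(right).count { case (l, r) => l == r }.toDouble / numHashes

  // Signature of a non-empty set, built by combining one-element signatures.
  def signatureOf(values: Set[String]): Array[Int] =
    values.iterator.map(init).reduce(plus)
}
```

For example, comparing MinHashSketch.signatureOf(Set("a", "b", "c", "d")) with MinHashSketch.signatureOf(Set("c", "d", "e", "f")) should give an estimate near the true Jaccard similarity of 2/6, with an error that shrinks as numHashes grows.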

You can also use these signatures to quickly find similar sets without doing n^2 comparisons. Each signature is assigned to several buckets; sets whose signatures end up in the same bucket are likely to be similar. The targetThreshold controls the desired level of similarity - the higher the threshold, the more efficiently you can find all the similar sets.
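The bucketing described above can be sketched as follows. This is a minimal illustration of the banding scheme from Chapter 3 of Mining of Massive Datasets, not the library's code: BandingSketch, the Array[Int] signature representation, the use of MurmurHash3.arrayHash for band keys, and the requirement that numHashes be a multiple of numBands are all assumptions made for the example. The two formulas are the standard ones from that chapter.

```scala
import scala.util.hashing.MurmurHash3

// Hypothetical sketch of LSH banding; names mirror members on this page,
// but the bodies are assumptions.
final case class BandingSketch(numHashes: Int, numBands: Int) {
  require(numBands > 0 && numHashes % numBands == 0,
    "numHashes must be a multiple of numBands")

  // Each band covers numRows consecutive hash values of the signature.
  val numRows: Int = numHashes / numBands

  // One bucket key per band: band index in the high 32 bits,
  // hash of the band's rows in the low 32 bits.
  def buckets(signature: Array[Int]): List[Long] = {
    require(signature.length == numHashes, "signature has the wrong length")
    signature.grouped(numRows).zipWithIndex.map { case (band, i) =>
      (i.toLong << 32) | (MurmurHash3.arrayHash(band) & 0xffffffffL)
    }.toList
  }

  // Probability that two sets with Jaccard similarity sim share a bucket.
  def probabilityOfInclusion(sim: Double): Double =
    1.0 - math.pow(1.0 - math.pow(sim, numRows), numBands)

  // Approximate similarity at which the inclusion curve rises most steeply.
  val estimatedThreshold: Double = math.pow(1.0 / numBands, 1.0 / numRows)
}
```

Two sets become candidates for comparison when their bucket lists share at least one key. Fewer rows per band (more bands) lowers the threshold and admits more candidates; more rows per band raises it and admits fewer.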

This abstract superclass is generic with regard to the size of the hash used. Depending on the number of unique values in the domain of the sets, you may want a MinHasher16, a MinHasher32, or a new custom subclass.

This implementation is modeled after Chapter 3 of Ullman and Rajaraman's Mining of Massive Datasets: http://infolab.stanford.edu/~ullman/mmds/ch3a.pdf

Linear Supertypes
Monoid[MinHashSignature], AdditiveMonoid[MinHashSignature], cats.kernel.Monoid[MinHashSignature], Semigroup[MinHashSignature], AdditiveSemigroup[MinHashSignature], cats.kernel.Semigroup[MinHashSignature], Serializable, AnyRef, Any

Instance Constructors

  1. new MinHasher(numHashes: Int, numBands: Int)(implicit n: Numeric[H])

Abstract Value Members

  1. abstract def buildArray(left: Array[Byte], right: Array[Byte])(fn: (H, H) => H): Array[Byte]

    Decode two signatures into hash values, combine them somehow, and produce a new array

    Attributes
    protected
  2. abstract def buildArray(fn: => H): Array[Byte]

    Initialize a byte array by generating hash values

    Attributes
    protected
  3. abstract def hashSize: Int

    The number of bytes used for each hash in the signature

  4. abstract def maxHash: H

    Maximum value the hash can take on (not necessarily 2^(8*hashSize) - 1, because of signed types)

Concrete Value Members

  1. final def !=(arg0: Any): Boolean
    Definition Classes
    AnyRef → Any
  2. final def ##: Int
    Definition Classes
    AnyRef → Any
  3. final def ==(arg0: Any): Boolean
    Definition Classes
    AnyRef → Any
  4. def additive: algebra.Monoid[MinHashSignature]

    These are from algebra.Monoid

    Definition Classes
    Monoid → AdditiveMonoid → Semigroup → AdditiveSemigroup
  5. final def asInstanceOf[T0]: T0
    Definition Classes
    Any
  6. def assertNotZero(v: MinHashSignature): Unit
    Definition Classes
    Monoid
  7. def buckets(sig: MinHashSignature): List[Long]

    Bucket keys to use for quickly finding other similar items via locality sensitive hashing

  8. def clone(): AnyRef
    Attributes
    protected[lang]
    Definition Classes
    AnyRef
    Annotations
    @throws(classOf[java.lang.CloneNotSupportedException]) @native()
  9. def combine(l: MinHashSignature, r: MinHashSignature): MinHashSignature
    Definition Classes
    Semigroup → Semigroup
  10. def combineAll(t: TraversableOnce[MinHashSignature]): MinHashSignature
    Definition Classes
    Monoid → Monoid
  11. def combineAllOption(as: IterableOnce[MinHashSignature]): Option[MinHashSignature]
    Definition Classes
    Monoid → Semigroup
  12. def combineN(a: MinHashSignature, n: Int): MinHashSignature
    Definition Classes
    Monoid → Semigroup
  13. def empty: MinHashSignature
    Definition Classes
    Monoid → Monoid
  14. final def eq(arg0: AnyRef): Boolean
    Definition Classes
    AnyRef
  15. def equals(arg0: AnyRef): Boolean
    Definition Classes
    AnyRef → Any
  16. val estimatedThreshold: Double

    Useful for understanding the effects of numBands and numRows

  17. def finalize(): Unit
    Attributes
    protected[lang]
    Definition Classes
    AnyRef
    Annotations
    @throws(classOf[java.lang.Throwable])
  18. final def getClass(): Class[_ <: AnyRef]
    Definition Classes
    AnyRef → Any
    Annotations
    @native()
  19. def hashCode(): Int
    Definition Classes
    AnyRef → Any
    Annotations
    @native()
  20. def init(fn: (MurmurHash128) => (Long, Long)): MinHashSignature

    Create a signature for an arbitrary value

  21. def init(value: String): MinHashSignature

    Create a signature for a single String value

  22. def init(value: Long): MinHashSignature

    Create a signature for a single Long value

  23. def isEmpty(a: MinHashSignature)(implicit ev: Eq[MinHashSignature]): Boolean
    Definition Classes
    Monoid
  24. final def isInstanceOf[T0]: Boolean
    Definition Classes
    Any
  25. def isNonZero(v: MinHashSignature): Boolean
    Definition Classes
    Monoid
  26. def isZero(a: MinHashSignature)(implicit ev: Eq[MinHashSignature]): Boolean
    Definition Classes
    AdditiveMonoid
  27. final def ne(arg0: AnyRef): Boolean
    Definition Classes
    AnyRef
  28. def nonZeroOption(v: MinHashSignature): Option[MinHashSignature]
    Definition Classes
    Monoid
  29. final def notify(): Unit
    Definition Classes
    AnyRef
    Annotations
    @native()
  30. final def notifyAll(): Unit
    Definition Classes
    AnyRef
    Annotations
    @native()
  31. val numBands: Int
  32. val numBytes: Int

    For an explanation of "bands" and "rows", see Ullman and Rajaraman

  33. val numHashes: Int
  34. val numRows: Int
  35. def plus(left: MinHashSignature, right: MinHashSignature): MinHashSignature

    Set union

    Definition Classes
    MinHasher → AdditiveSemigroup
  36. def positiveSumN(a: MinHashSignature, n: Int): MinHashSignature
    Attributes
    protected[this]
    Definition Classes
    AdditiveSemigroup
  37. def probabilityOfInclusion(sim: Double): Double

    Useful for understanding the effects of numBands and numRows

  38. def repeatedCombineN(a: MinHashSignature, n: Int): MinHashSignature
    Attributes
    protected[this]
    Definition Classes
    Semigroup
  39. def similarity(left: MinHashSignature, right: MinHashSignature): Double

    Estimate Jaccard similarity (size of intersection / size of union)

  40. def sum(vs: TraversableOnce[MinHashSignature]): MinHashSignature
    Definition Classes
    Monoid → AdditiveMonoid
  41. def sumN(a: MinHashSignature, n: Int): MinHashSignature
    Definition Classes
    AdditiveMonoid → AdditiveSemigroup
  42. def sumOption(iter: TraversableOnce[MinHashSignature]): Option[MinHashSignature]

    Returns an instance of T calculated by summing all instances in iter in one pass. Returns None if iter is empty, else Some[T].

    iter

    instances of T to be combined

    returns

    None if iter is empty, else an option value containing the summed T

    Definition Classes
    Semigroup
    Note

    Override if there is a faster way to compute this sum than iter.reduceLeftOption using plus.

  43. final def synchronized[T0](arg0: => T0): T0
    Definition Classes
    AnyRef
  44. def toString(): String
    Definition Classes
    AnyRef → Any
  45. def trySum(as: TraversableOnce[MinHashSignature]): Option[MinHashSignature]
    Definition Classes
    AdditiveMonoid → AdditiveSemigroup
  46. final def wait(): Unit
    Definition Classes
    AnyRef
    Annotations
    @throws(classOf[java.lang.InterruptedException])
  47. final def wait(arg0: Long, arg1: Int): Unit
    Definition Classes
    AnyRef
    Annotations
    @throws(classOf[java.lang.InterruptedException])
  48. final def wait(arg0: Long): Unit
    Definition Classes
    AnyRef
    Annotations
    @throws(classOf[java.lang.InterruptedException]) @native()
  49. val zero: MinHashSignature

    Signature for empty set, needed to be a proper Monoid

    Definition Classes
    MinHasher → AdditiveMonoid
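
The roles of zero and plus in the member list above can be illustrated with a small standalone sketch. It assumes, for the sake of the example only, that a signature is an Array[Int] and that Int.MaxValue plays the role of maxHash; SignatureMonoidSketch is a hypothetical name and this is not the library's implementation. Because union is an element-wise minimum, a signature filled with the largest hash value leaves any other signature unchanged, which is what makes it a valid identity for the empty set.

```scala
// Hypothetical sketch of the monoid structure; not the library's code.
object SignatureMonoidSketch {
  val numHashes = 8
  // Assumed stand-in for maxHash with 32-bit signed hashes.
  val maxHash: Int = Int.MaxValue

  // Empty-set signature: every slot holds the largest possible hash.
  val zero: Array[Int] = Array.fill(numHashes)(maxHash)

  // Set union: element-wise minimum of the two signatures.
  def plus(left: Array[Int], right: Array[Int]): Array[Int] =
    left.zip(right).map { case (l, r) => math.min(l, r) }

  // Union of many signatures; an empty collection yields zero.
  def sum(signatures: Iterable[Array[Int]]): Array[Int] =
    signatures.foldLeft(zero)(plus)
}
```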
