org.bdgenomics.utils.minhash

MinHash

object MinHash extends Serializable

This object provides methods for estimating pair-wise Jaccard similarity using MinHash signatures. The algorithm is described in chapter 3 of:

Rajaraman, Anand, and Jeffrey David Ullman. Mining of massive datasets. Cambridge University Press, 2011.

This chapter may be freely (and legally) downloaded from:

http://infolab.stanford.edu/~ullman/mmds/ch3.pdf
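
As a rough illustration of the idea, the sketch below builds MinHash signatures for two small sets and compares the fraction of agreeing signature positions with the exact Jaccard similarity. It is a minimal, self-contained example and not this library's implementation: the object name MinHashSketch, the linear hash family h(x) = (a*x + b) mod p, and all helper names are assumptions made for illustration, and the input elements are assumed to be non-negative integers.

```scala
import scala.util.Random

object MinHashSketch {
  // Prime modulus for the illustrative hash family (2^31 - 1).
  private val Prime = 2147483647L

  /** Exact Jaccard similarity: |A intersect B| / |A union B|. */
  def jaccard(a: Set[Int], b: Set[Int]): Double =
    if (a.isEmpty && b.isEmpty) 1.0
    else (a intersect b).size.toDouble / (a union b).size.toDouble

  /** Draws `length` coefficient pairs (a, b), with a >= 1, for h(x) = (a*x + b) mod Prime. */
  def hashFamily(length: Int, seed: Long): Array[(Long, Long)] = {
    val rng = new Random(seed)
    Array.fill(length)((1L + rng.nextInt(Int.MaxValue - 1), rng.nextInt(Int.MaxValue).toLong))
  }

  /** MinHash signature: for each hash function, the minimum hash over the set's elements. */
  def signature(items: Set[Int], family: Array[(Long, Long)]): Array[Long] = {
    require(items.nonEmpty, "cannot compute a signature for an empty set")
    family.map { case (a, b) => items.map(x => Math.floorMod(a * x + b, Prime)).min }
  }

  /** Fraction of signature positions that agree; this estimates the Jaccard similarity. */
  def estimate(s1: Array[Long], s2: Array[Long]): Double = {
    require(s1.length == s2.length, "signatures must have the same length")
    s1.zip(s2).count { case (x, y) => x == y }.toDouble / s1.length
  }
}
```

For example, `MinHashSketch.jaccard(Set(1, 2, 3), Set(2, 3, 4))` is 0.5, and the estimate obtained from two signatures built with the same hash family should approach that value as the signature length grows.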

Linear Supertypes
Serializable, Serializable, AnyRef, Any

Value Members

  1. final def !=(arg0: AnyRef): Boolean

    Definition Classes
    AnyRef
  2. final def !=(arg0: Any): Boolean

    Definition Classes
    Any
  3. final def ##(): Int

    Definition Classes
    AnyRef → Any
  4. final def ==(arg0: AnyRef): Boolean

    Definition Classes
    AnyRef
  5. final def ==(arg0: Any): Boolean

    Definition Classes
    Any
  6. def approximateMinHash[T <: MinHashable](rdd: RDD[T], signatureLength: Int, bands: Int, randomSeed: Option[Long] = None): RDD[(Double, (T, T))]

    Implements an approximate pair-wise MinHash similarity check.

    Here, "approximate" describes the all-pairs comparison, not the similarity measure: MinHash signature comparison always approximates Jaccard similarity. This method additionally uses a locality sensitive hashing (LSH) based approach to reduce the number of comparisons required, so not every pair of inputs is compared.

    We use the LSH technique described in section 3.4.1 of the Rajaraman and Ullman text. This technique divides each MinHash signature into _b_ bands. For a MinHash signature of length _l_, we require b * r = l, where _r_ is the number of rows in each band. For given _b_ and _r_, pairs with similarity above roughly (1/b)^(1/r) are likely to be compared; this threshold is approximate, so some pairs above it may be missed and some pairs below it may still be compared.

    T

    This function will operate on RDDs containing any type T that extends the MinHashable trait.

    rdd

    The RDD of data points to compute similarity on.

    signatureLength

    The length of MinHash signature to use.

    bands

    The number of bands to use for LSH. Must divide signatureLength evenly.

    randomSeed

    An optional seed for random number generation.

    returns

    Returns an RDD containing the pairs of elements that were compared, with their similarity, as a tuple of (similarity, (elem1, elem2)).

    Exceptions thrown
    IllegalArgumentException

    Thrown if the number of bands does not divide the signature length evenly.
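
    The banding arithmetic described for this method can be sketched as follows. This is an illustrative, self-contained example and not the library's code: the object name BandingSketch and its helpers are hypothetical. The candidate probability 1 - (1 - s^r)^b comes from the same chapter of the Rajaraman and Ullman text; it is the chance that two items with Jaccard similarity _s_ agree on every row of at least one band.

```scala
object BandingSketch {
  /** Rows per band; mirrors the documented requirement that b * r = l. */
  def rowsPerBand(signatureLength: Int, bands: Int): Int = {
    // require throws IllegalArgumentException when the condition fails.
    require(bands > 0 && signatureLength % bands == 0,
      s"bands ($bands) must divide signatureLength ($signatureLength) evenly")
    signatureLength / bands
  }

  /** Approximate similarity threshold (1/b)^(1/r). */
  def threshold(bands: Int, rows: Int): Double =
    math.pow(1.0 / bands, 1.0 / rows)

  /** Probability that a pair with similarity s shares at least one band: 1 - (1 - s^r)^b. */
  def candidateProbability(s: Double, bands: Int, rows: Int): Double =
    1.0 - math.pow(1.0 - math.pow(s, rows), bands)

  /** Splits a signature into (bandIndex, rows) keys; two items with an equal key share a bucket. */
  def bandKeys(signature: Seq[Long], bands: Int): List[(Int, Seq[Long])] = {
    val r = rowsPerBand(signature.length, bands)
    signature.grouped(r).toList.zipWithIndex.map { case (rows, i) => (i, rows) }
  }
}
```

    With a signature length of 100 and 20 bands, each band has 5 rows and the threshold is about 0.55, while a pair with similarity 0.8 becomes a candidate with probability of about 0.9996. Increasing the number of bands lowers the threshold, which finds more similar pairs at the cost of more comparisons.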

  7. final def asInstanceOf[T0]: T0

    Definition Classes
    Any
  8. def clone(): AnyRef

    Attributes
    protected[java.lang]
    Definition Classes
    AnyRef
    Annotations
    @throws( ... )
  9. final def eq(arg0: AnyRef): Boolean

    Definition Classes
    AnyRef
  10. def equals(arg0: Any): Boolean

    Definition Classes
    AnyRef → Any
  11. def exactMinHash[T <: MinHashable](rdd: RDD[T], signatureLength: Int, randomSeed: Option[Long] = None): RDD[(Double, (T, T))]

    Implements an exact pair-wise MinHash similarity check.

    Here, "exact" describes the all-pairs comparison, not the similarity measure: MinHash signature comparison approximates Jaccard similarity, but this method compares _every_ pair of inputs, as opposed to the locality sensitive hashing (LSH) based approach, which compares only a subset of pairs.

    T

    This function will operate on RDDs containing any type T that extends the MinHashable trait.

    rdd

    The RDD of data points to compute similarity on.

    signatureLength

    The length of MinHash signature to use.

    randomSeed

    An optional seed for random number generation.

    returns

    Returns an RDD containing all pairs of elements, with their similarity, as a tuple of (similarity, (elem1, elem2)).

    Note

    This operation may be expensive, as it performs a cartesian product of all elements in the input RDD.
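
    To get a feel for the cost noted above, the following sketch counts the pairs involved for an input of _n_ elements. The object and method names are hypothetical, and whether the library drops self-pairs or mirrored pairs after the cartesian product is not stated here, so both counts are shown.

```scala
object AllPairsCost {
  /** Ordered pairs produced by a full cartesian product of n elements, self-pairs included. */
  def cartesianPairs(n: Long): Long = n * n

  /** Distinct unordered pairs among n elements, excluding self-pairs. */
  def unorderedPairs(n: Long): Long = n * (n - 1) / 2
}
```

    For 1,000 elements this is 1,000,000 ordered pairs or 499,500 distinct pairs; either way the work grows quadratically with the size of the input RDD.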

  12. def finalize(): Unit

    Attributes
    protected[java.lang]
    Definition Classes
    AnyRef
    Annotations
    @throws( classOf[java.lang.Throwable] )
  13. final def getClass(): Class[_]

    Definition Classes
    AnyRef → Any
  14. def hashCode(): Int

    Definition Classes
    AnyRef → Any
  15. final def isInstanceOf[T0]: Boolean

    Definition Classes
    Any
  16. final def ne(arg0: AnyRef): Boolean

    Definition Classes
    AnyRef
  17. final def notify(): Unit

    Definition Classes
    AnyRef
  18. final def notifyAll(): Unit

    Definition Classes
    AnyRef
  19. final def synchronized[T0](arg0: ⇒ T0): T0

    Definition Classes
    AnyRef
  20. def toString(): String

    Definition Classes
    AnyRef → Any
  21. final def wait(): Unit

    Definition Classes
    AnyRef
    Annotations
    @throws( ... )
  22. final def wait(arg0: Long, arg1: Int): Unit

    Definition Classes
    AnyRef
    Annotations
    @throws( ... )
  23. final def wait(arg0: Long): Unit

    Definition Classes
    AnyRef
    Annotations
    @throws( ... )
