CompactWordEmbeddingMap

This class and its companion object have been backported from Eidos, where the class is (or was) an optional replacement for WordEmbeddingMap, used for performance reasons. It loads data faster from disk and stores it more compactly in memory. It does not, however, include all the operations of processors' Word2Vec. For instance, logMultiplicativeTextSimilarity is not included, but could probably be added. Other methods, such as getWordVector, which in Word2Vec returns an Array[Double], would be inefficient to include because per-word arrays of doubles (or floats) are no longer part of the design. For documentation beyond what immediately follows, both the companion object and the related test case (org.clulab.embeddings.TestCompactWord2Vec) may be helpful.

The class is typically instantiated by the apply method of the companion object, which takes as arguments a filename and then two booleans: "resource", which specifies whether the named file exists as a resource or is instead stored on the broader filesystem, and "cached", which specifies whether the data consists of Java-serialized objects (see the save method) or is instead in the standard vector text format. The apply method arranges for the file to be read in the appropriate way and converted into a map whose keys are the words and whose values are row numbers in an implied 2-dimensional matrix of all the vector values, which is also included in the constructor. So, rather than each word being mapped to an independent mini array as in Word2Vec, each word is mapped to an integer row number of a single, larger matrix/array.
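The layout described above can be pictured with a small, self-contained sketch. It is not the actual implementation: CompactSketch, build, and the standalone dotProduct helper are hypothetical names chosen for illustration, and the sketch assumes the vectors have already been parsed into (word, vector) pairs.

```scala
// Illustrative sketch only: CompactSketch and its members are hypothetical
// names, not the real CompactWordEmbeddingMap internals.
object CompactSketch {

  // Packs per-word vectors into one flat array and records each word's row.
  // Returns (word -> row map, flat row-major array, number of columns).
  def build(entries: Seq[(String, Array[Float])]): (Map[String, Int], Array[Float], Int) = {
    require(entries.nonEmpty, "at least one entry is needed to infer the dimension")
    val columns = entries.head._2.length
    val array = new Array[Float](entries.length * columns)
    val map = entries.zipWithIndex.map { case ((word, vector), row) =>
      require(vector.length == columns, s"inconsistent dimension for $word")
      // Row `row` occupies array(row * columns) until array((row + 1) * columns).
      System.arraycopy(vector, 0, array, row * columns, columns)
      word -> row
    }.toMap
    (map, array, columns)
  }

  // Dot product of two rows, computed in place without copying either row.
  def dotProduct(array: Array[Float], columns: Int, row1: Int, row2: Int): Float = {
    val offset1 = row1 * columns
    val offset2 = row2 * columns
    var sum = 0f
    var i = 0
    while (i < columns) {
      sum += array(offset1 + i) * array(offset2 + i)
      i += 1
    }
    sum
  }
}
```

Because every row lives in the same array, row-level operations such as a dot product need only two offsets, which is one plausible reason a method that hands out per-word arrays fits this design poorly.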

To take advantage of the faster load times, the vector data file needs to be converted from text format into binary format (Java-serialized objects) for loadBin. The test case includes an example. In some preprocessing phase, call CompactWord2Vec(filename, resource = false, cached = false) on the file containing the vectors in text format, such as glove.840B.300d.txt. "resource" is usually false because the file can be very large, too large to include as a resource. On the resulting return value, call save(compactFilename). Thereafter, for normal, speedy processing, use CompactWord2Vec(compactFilename, resource = false, cached = true).
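As a rough illustration of why the cached form can load quickly, the sketch below writes a word list and a flat float array with plain Java serialization and reads them back. CacheSketch and its save and load methods are hypothetical names, and the two-object layout is an assumption made for the example; the actual format written by save is not specified here.

```scala
import java.io.{BufferedInputStream, BufferedOutputStream, File, FileInputStream, FileOutputStream}
import java.io.{ObjectInputStream, ObjectOutputStream}

// Illustrative sketch only: the names and the on-disk layout are assumptions.
object CacheSketch {

  // Writes the words and the flat vector array as two serialized objects.
  def save(filename: String, words: Array[String], array: Array[Float]): Unit = {
    val out = new ObjectOutputStream(new BufferedOutputStream(new FileOutputStream(filename)))
    try {
      out.writeObject(words)
      out.writeObject(array)
    }
    finally out.close()
  }

  // Reads both objects back in the order they were written; no text parsing
  // or per-number conversion is needed on this path.
  def load(filename: String): (Array[String], Array[Float]) = {
    val in = new ObjectInputStream(new BufferedInputStream(new FileInputStream(filename)))
    try {
      val words = in.readObject().asInstanceOf[Array[String]]
      val array = in.readObject().asInstanceOf[Array[Float]]
      (words, array)
    }
    finally in.close()
  }
}
```

The text format must be split and parsed number by number, whereas the serialized arrays are read back as whole objects, so the one-time conversion cost is paid in preprocessing rather than on every load.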

Linear Supertypes

WordEmbeddingMap, AnyRef, Any

Instance Constructors

new CompactWordEmbeddingMap(buildType: BuildType)

Value Members

final def !=(arg0: Any): Boolean

Definition Classes
AnyRef → Any
final def ##(): Int

Definition Classes
AnyRef → Any
final def ==(arg0: Any): Boolean

Definition Classes
AnyRef → Any
def add(dest: Array[Float], srcRow: Int): Unit

Attributes
protected
def addWeighted(dest: Array[Float], srcRow: Int, weight: Float): Unit

Attributes
protected
val array: Array[Float]

Attributes
protected
final def asInstanceOf[T0]: T0

Definition Classes
Any
def avgSimilarity(texts1: Iterable[String], texts2: Iterable[String]): Float

Definition Classes
CompactWordEmbeddingMap → WordEmbeddingMap
val buildType: BuildType

Attributes
protected
def clone(): AnyRef

Attributes
protected[java.lang]
Definition Classes
AnyRef
Annotations
@throws( ... )
val columns: Int
def compare(left: Option[IndexedSeq[Float]], right: Option[IndexedSeq[Float]]): Boolean
def compare(left: ImplMapType, right: ImplMapType): Boolean
def compare(lefts: IndexedSeq[Float], rights: IndexedSeq[Float]): Boolean
val dim: Int

The dimension of an embedding vector

Definition Classes
CompactWordEmbeddingMap → WordEmbeddingMap
def dotProduct(row1: Int, row2: Int): Float
final def eq(arg0: AnyRef): Boolean

Definition Classes
AnyRef
def equals(other: Any): Boolean

Definition Classes
CompactWordEmbeddingMap → AnyRef → Any
def finalize(): Unit

Attributes
protected[java.lang]
Definition Classes
AnyRef
Annotations
@throws( classOf[java.lang.Throwable] )
def get(word: String): Option[IndexedSeq[Float]]

Retrieves the embedding for this word, if it exists in the map

Definition Classes
CompactWordEmbeddingMap → WordEmbeddingMap
final def getClass(): Class[_]

Definition Classes
AnyRef → Any
def getOrElseUnknown(word: String): IndexedSeq[Float]

Retrieves the embedding for this word; if it doesn't exist in the map, uses the embedding of the Unknown token instead

Definition Classes
CompactWordEmbeddingMap → WordEmbeddingMap
def hashCode(): Int

Definition Classes
CompactWordEmbeddingMap → AnyRef → Any
final def isInstanceOf[T0]: Boolean

Definition Classes
Any
def isOutOfVocabulary(word: String): Boolean

Definition Classes
CompactWordEmbeddingMap → WordEmbeddingMap
def keys: Set[String]

Returns all keys present in the map, excluding the key for the unknown token

Definition Classes
CompactWordEmbeddingMap → WordEmbeddingMap
def knownKeys: Iterable[String]
def makeCompositeVector(text: Iterable[String]): Array[Float]

Computes the embedding of a text as an unweighted average of the embeddings of all its words

Definition Classes
CompactWordEmbeddingMap → WordEmbeddingMap
def makeCompositeVectorWeighted(text: Iterable[String], weights: Iterable[Float]): Array[Float]

Definition Classes
CompactWordEmbeddingMap → WordEmbeddingMap
val map: ImplMapType

Attributes
protected
def mkTextFromMap(): String

Attributes
protected
final def ne(arg0: AnyRef): Boolean

Definition Classes
AnyRef
final def notify(): Unit

Definition Classes
AnyRef
final def notifyAll(): Unit

Definition Classes
AnyRef
val rows: Int
def save(filename: String): Unit

Save this object in binary format.

Definition Classes
CompactWordEmbeddingMap → WordEmbeddingMap
def saveKryo(filename: String): Unit
final def synchronized[T0](arg0: ⇒ T0): T0

Definition Classes
AnyRef
def toString(): String

Definition Classes
AnyRef → Any
val unkEmbeddingOpt: Option[IndexedSeq[Float]]
def unknownEmbedding: IndexedSeq[Float]

The embedding corresponding to the unknown token

Definition Classes
CompactWordEmbeddingMap → WordEmbeddingMap
final def wait(): Unit

Definition Classes
AnyRef
Annotations
@throws( ... )
final def wait(arg0: Long, arg1: Int): Unit

Definition Classes
AnyRef
Annotations
@throws( ... )
final def wait(arg0: Long): Unit

Definition Classes
AnyRef
Annotations
@throws( ... )

Related Docs: object CompactWordEmbeddingMap | package embeddings

class CompactWordEmbeddingMap extends WordEmbeddingMap
