Create a LocalTextSet from an array of TextFeature.
Generate a TextSet for ranking using Relation array.
Array of Relation.
LocalTextSet that contains all Relation.id1. For each TextFeature in corpus1, text must have been transformed to indexedTokens of the same length.
LocalTextSet that contains all Relation.id2. For each TextFeature in corpus2, text must have been transformed to indexedTokens of the same length.
LocalTextSet.
Used to generate a TextSet for ranking.
This method does the following:
1. For each Relation.id1, find the list of Relation.id2 with the corresponding Relation.label that comes together with Relation.id1. In other words, group the relations by Relation.id1.
2. Join with the corpus to transform each id into indexedTokens. Note: make sure that the corpus has been transformed by SequenceShaper and WordIndexer.
3. For each list, generate a TextFeature having a Sample with:
   - feature of shape (listLength, text1Length + text2Length).
   - label of shape (listLength, 1).
RDD of Relation.
DistributedTextSet that contains all Relation.id1. For each TextFeature in corpus1, text must have been transformed to indexedTokens of the same length.
DistributedTextSet that contains all Relation.id2. For each TextFeature in corpus2, text must have been transformed to indexedTokens of the same length.
DistributedTextSet.
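The grouping and joining steps described above can be sketched in plain Python, using dictionaries in place of the TextSet corpora (each corpus maps an id to its fixed-length list of indexed tokens). The function name and data layout here are illustrative, not the library's actual API:

```python
def relations_to_ranking_samples(relations, corpus1, corpus2):
    """Sketch of the ranking TextSet generation described above.

    relations: list of (id1, id2, label) tuples.
    corpus1/corpus2: dicts mapping id -> fixed-length list of indexed tokens.
    Returns a list of (feature, label) pairs where feature has shape
    (listLength, text1Length + text2Length) and label has shape (listLength, 1).
    """
    # Step 1: group relations by id1.
    grouped = {}
    for id1, id2, label in relations:
        grouped.setdefault(id1, []).append((id2, label))

    # Steps 2-3: join each id with its indexed tokens and build one
    # (feature, label) pair per id1 group.
    samples = []
    for id1, pairs in grouped.items():
        feature = [corpus1[id1] + corpus2[id2] for id2, _ in pairs]
        label = [[lab] for _, lab in pairs]
        samples.append((feature, label))
    return samples
```

Each row of a feature is the concatenation of the id1 tokens and one id2's tokens, which is why both corpora must already contain indexedTokens of the same length.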
Generate a TextSet for pairwise training using Relation array.
Array of Relation.
LocalTextSet that contains all Relation.id1. For each TextFeature in corpus1, text must have been transformed to indexedTokens of the same length.
LocalTextSet that contains all Relation.id2. For each TextFeature in corpus2, text must have been transformed to indexedTokens of the same length.
LocalTextSet.
Used to generate a TextSet for pairwise training.
This method does the following:
1. Generate all RelationPairs (id1, id2Positive, id2Negative) from the Relations.
2. Join the RelationPairs with the corpus to transform each id into indexedTokens. Note: make sure that the corpus has been transformed by SequenceShaper and WordIndexer.
3. For each pair, generate a TextFeature having a Sample with:
   - feature of shape (2, text1Length + text2Length).
   - label of value [1 0], as the positive relation is placed before the negative one.
RDD of Relation.
DistributedTextSet that contains all Relation.id1. For each TextFeature in corpus1, text must have been transformed to indexedTokens of the same length.
DistributedTextSet that contains all Relation.id2. For each TextFeature in corpus2, text must have been transformed to indexedTokens of the same length.
DistributedTextSet.
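The pair-generation steps above can likewise be sketched in plain Python. All names are illustrative; each corpus is modeled as a dict from id to its indexed tokens, and labels of 1/0 mark positive/negative relations:

```python
from itertools import product

def relations_to_pairwise_samples(relations, corpus1, corpus2):
    """Sketch of the pairwise-training TextSet generation described above.

    relations: list of (id1, id2, label) tuples with label 1 (positive)
    or 0 (negative).
    Returns a list of (feature, label) pairs where feature has shape
    (2, text1Length + text2Length) and label is [1, 0]: the positive
    relation is placed before the negative one.
    """
    # Step 1: split id2s into positives and negatives per id1, then pair them.
    pos, neg = {}, {}
    for id1, id2, label in relations:
        (pos if label == 1 else neg).setdefault(id1, []).append(id2)

    # Steps 2-3: for each (id1, id2Positive, id2Negative) triple, join with
    # the corpora and stack the positive row above the negative row.
    samples = []
    for id1 in pos:
        for p, n in product(pos[id1], neg.get(id1, [])):
            feature = [corpus1[id1] + corpus2[p], corpus1[id1] + corpus2[n]]
            samples.append((feature, [1, 0]))
    return samples
```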
Create a DistributedTextSet from an RDD of TextFeature.
Read text files with labels from a directory.
The directory structure is expected to be the following:
path
├── dir1 - text1, text2, ...
├── dir2 - text1, text2, ...
└── dir3 - text1, text2, ...
Under the target path, there ought to be N subdirectories (dir1 to dirN). Each subdirectory represents a category and contains all the texts that belong to that category. Each category will be given a label according to its position in ascending order among all subdirectories. Each text will be given the label of the subdirectory in which it is located. Labels start from 0.
The folder path to the texts. Both local and distributed file systems (such as HDFS) are supported. To read from a distributed file system, sc needs to be specified.
An instance of SparkContext. If specified, texts will be read as a DistributedTextSet. Default is null and in this case texts will be read as a LocalTextSet.
Integer. A suggested value for the minimal number of partitions for the input texts. This only needs to be specified when sc is not null. Default is 1.
TextSet.
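For the local-file-system case, the labeling scheme described above can be sketched in plain Python (the function name is illustrative, not the library's API):

```python
import os

def read_labeled_texts(path):
    """Sketch of reading labeled texts from the directory layout above.

    Subdirectories of path are sorted in ascending order and assigned
    labels starting from 0; every text file inherits the label of the
    subdirectory it is located in. Returns a list of (text, label) pairs.
    """
    samples = []
    categories = sorted(d for d in os.listdir(path)
                        if os.path.isdir(os.path.join(path, d)))
    for label, category in enumerate(categories):
        cat_dir = os.path.join(path, category)
        for name in sorted(os.listdir(cat_dir)):
            with open(os.path.join(cat_dir, name), encoding="utf-8") as f:
                samples.append((f.read(), label))
    return samples
```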
Read texts with id from a csv file.
Read texts with id from a csv file. Each record is expected to contain the following two fields, in order: id (String) and text (String).
The path to the csv file. Both local and distributed file systems (such as HDFS) are supported. To read from a distributed file system, sc needs to be specified.
An instance of SparkContext. If specified, texts will be read as a DistributedTextSet. Default is null and in this case texts will be read as a LocalTextSet.
Integer. A suggested value for the minimal number of partitions for the input texts. This only needs to be specified when sc is not null. Default is 1.
TextSet.
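The expected record layout can be illustrated with Python's standard csv module (a minimal sketch; the function name is hypothetical):

```python
import csv
import io

def parse_id_text_csv(csv_text):
    """Parse records where each row holds exactly two fields, in order:
    id (String) and text (String). Returns a list of (id, text) tuples.
    """
    reader = csv.reader(io.StringIO(csv_text))
    return [(row[0], row[1]) for row in reader if row]
```

Note that quoted fields allow the text itself to contain commas, as the usage below shows.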
Read texts with id from parquet file.
Read texts with id from a parquet file. The schema should be the following: "id" (String) and "text" (String).
The path to the parquet file.
An instance of SQLContext.
DistributedTextSet.
Assign each word an index to form a map.
Array of words.
Existing map of word indices, if any. Default is null, in which case a new map with indices starting from 1 will be generated. If not null, the generated map will preserve the word indices in existingMap and assign subsequent indices to new words.
wordIndex map.
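The indexing behavior described above can be sketched in plain Python (the function name is illustrative; indices start from 1, and an existing map's indices are preserved):

```python
def generate_word_index(words, existing_map=None):
    """Assign each word an index to form a map.

    If existing_map is None, a new map with indices starting from 1 is
    generated. Otherwise, the indices in existing_map are preserved and
    subsequent indices are assigned to words not already in the map.
    """
    word_index = dict(existing_map) if existing_map else {}
    # Continue numbering after the largest existing index (or start at 1).
    next_index = max(word_index.values(), default=0) + 1
    for word in words:
        if word not in word_index:
            word_index[word] = next_index
            next_index += 1
    return word_index
```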