Create a LocalTextSet from an array of TextFeature.
Generate a TextSet for ranking using Relation array.
Array of Relation.
LocalTextSet that contains all Relation.id1. For each TextFeature in corpus1, text must have been transformed to indexedTokens of the same length.
LocalTextSet that contains all Relation.id2. For each TextFeature in corpus2, text must have been transformed to indexedTokens of the same length.
LocalTextSet.
Used to generate a TextSet for ranking.
This method does the following:
1. For each Relation.id1, find the list of Relation.id2 with the corresponding Relation.label that comes together with Relation.id1. In other words, group the relations by Relation.id1.
2. Join with the corpus to transform each id into indexedTokens. Note: make sure that the corpus has been transformed by SequenceShaper and WordIndexer.
3. For each list, generate a TextFeature having a Sample with:
   - feature of shape (listLength, text1Length + text2Length).
   - label of shape (listLength, 1).
RDD of Relation.
DistributedTextSet that contains all Relation.id1. For each TextFeature in corpus1, text must have been transformed to indexedTokens of the same length.
DistributedTextSet that contains all Relation.id2. For each TextFeature in corpus2, text must have been transformed to indexedTokens of the same length.
DistributedTextSet.
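The grouping and joining steps described above can be sketched in plain Python, using dictionaries in place of the TextSet corpora (each corpus maps an id to its fixed-length list of indexed tokens). The function name and data layout here are illustrative, not the library's actual API:

```python
def relations_to_ranking_samples(relations, corpus1, corpus2):
    """Sketch of the ranking TextSet generation described above.

    relations: list of (id1, id2, label) tuples.
    corpus1/corpus2: dicts mapping id -> fixed-length list of indexed tokens.
    Returns a list of (feature, label) pairs where feature has shape
    (listLength, text1Length + text2Length) and label has shape (listLength, 1).
    """
    # Step 1: group relations by id1.
    grouped = {}
    for id1, id2, label in relations:
        grouped.setdefault(id1, []).append((id2, label))

    # Steps 2-3: join each id with its indexed tokens and build one
    # (feature, label) pair per id1 group.
    samples = []
    for id1, pairs in grouped.items():
        feature = [corpus1[id1] + corpus2[id2] for id2, _ in pairs]
        label = [[lab] for _, lab in pairs]
        samples.append((feature, label))
    return samples
```

Each row of a feature is the concatenation of the id1 tokens and one id2's tokens, which is why both corpora must already contain indexedTokens of the same length.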
Generate a TextSet for pairwise training using Relation array.
Array of Relation.
LocalTextSet that contains all Relation.id1. For each TextFeature in corpus1, text must have been transformed to indexedTokens of the same length.
LocalTextSet that contains all Relation.id2. For each TextFeature in corpus2, text must have been transformed to indexedTokens of the same length.
LocalTextSet.
Used to generate a TextSet for pairwise training.
This method does the following:
1. Generate all RelationPairs (id1, id2Positive, id2Negative) from the Relations.
2. Join the RelationPairs with the corpus to transform each id into indexedTokens. Note: make sure that the corpus has been transformed by SequenceShaper and WordIndexer.
3. For each pair, generate a TextFeature having a Sample with:
   - feature of shape (2, text1Length + text2Length).
   - label of value [1 0], as the positive relation is placed before the negative one.
RDD of Relation.
DistributedTextSet that contains all Relation.id1. For each TextFeature in corpus1, text must have been transformed to indexedTokens of the same length.
DistributedTextSet that contains all Relation.id2. For each TextFeature in corpus2, text must have been transformed to indexedTokens of the same length.
DistributedTextSet.
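The pair-generation steps above can likewise be sketched in plain Python. All names are illustrative; each corpus is modeled as a dict from id to its indexed tokens, and labels of 1/0 mark positive/negative relations:

```python
from itertools import product

def relations_to_pairwise_samples(relations, corpus1, corpus2):
    """Sketch of the pairwise-training TextSet generation described above.

    relations: list of (id1, id2, label) tuples with label 1 (positive)
    or 0 (negative).
    Returns a list of (feature, label) pairs where feature has shape
    (2, text1Length + text2Length) and label is [1, 0]: the positive
    relation is placed before the negative one.
    """
    # Step 1: split id2s into positives and negatives per id1, then pair them.
    pos, neg = {}, {}
    for id1, id2, label in relations:
        (pos if label == 1 else neg).setdefault(id1, []).append(id2)

    # Steps 2-3: for each (id1, id2Positive, id2Negative) triple, join with
    # the corpora and stack the positive row above the negative row.
    samples = []
    for id1 in pos:
        for p, n in product(pos[id1], neg.get(id1, [])):
            feature = [corpus1[id1] + corpus2[p], corpus1[id1] + corpus2[n]]
            samples.append((feature, [1, 0]))
    return samples
```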
Create a DistributedTextSet from an RDD of TextFeature.
Read text files with labels from a directory.
The directory structure is expected to be the following:
path
├── dir1 - text1, text2, ...
├── dir2 - text1, text2, ...
└── dir3 - text1, text2, ...
Under the target path, there ought to be N subdirectories (dir1 to dirN). Each subdirectory represents a category and contains all the texts that belong to that category. Each category will be given a label according to its position in ascending order among all subdirectories. Each text will be given the label of the subdirectory in which it is located. Labels start from 0.
The folder path to the texts. Both local and distributed file systems (such as HDFS) are supported. To read from a distributed file system, sc needs to be specified.
An instance of SparkContext. If specified, texts will be read as a DistributedTextSet. Default is null and in this case texts will be read as a LocalTextSet.
Integer. A suggested value for the minimal number of partitions for the input texts. This only needs to be specified when sc is not null. Default is 1.
TextSet.
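For the local-file-system case, the labeling scheme described above can be sketched in plain Python (the function name is illustrative, not the library's API):

```python
import os

def read_labeled_texts(path):
    """Sketch of reading labeled texts from the directory layout above.

    Subdirectories of path are sorted in ascending order and assigned
    labels starting from 0; every text file inherits the label of the
    subdirectory it is located in. Returns a list of (text, label) pairs.
    """
    samples = []
    categories = sorted(d for d in os.listdir(path)
                        if os.path.isdir(os.path.join(path, d)))
    for label, category in enumerate(categories):
        cat_dir = os.path.join(path, category)
        for name in sorted(os.listdir(cat_dir)):
            with open(os.path.join(cat_dir, name), encoding="utf-8") as f:
                samples.append((f.read(), label))
    return samples
```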
Read texts with id from a csv file.
Read texts with id from a csv file. Each record is expected to contain the following two fields, in order: id (String) and text (String).
The path to the csv file. Both local and distributed file systems (such as HDFS) are supported. To read from a distributed file system, sc needs to be specified.
An instance of SparkContext. If specified, texts will be read as a DistributedTextSet. Default is null and in this case texts will be read as a LocalTextSet.
Integer. A suggested value for the minimal number of partitions for the input texts. This only needs to be specified when sc is not null. Default is 1.
TextSet.
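The expected record layout can be illustrated with Python's standard csv module (a minimal sketch; the function name is hypothetical):

```python
import csv
import io

def parse_id_text_csv(csv_text):
    """Parse records where each row holds exactly two fields, in order:
    id (String) and text (String). Returns a list of (id, text) tuples.
    """
    reader = csv.reader(io.StringIO(csv_text))
    return [(row[0], row[1]) for row in reader if row]
```

Note that quoted fields allow the text itself to contain commas, as the usage below shows.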
Read texts with id from parquet file.
Read texts with id from a parquet file. The schema should be the following: "id" (String) and "text" (String).
The path to the parquet file.
An instance of SQLContext.
DistributedTextSet.
Assign each word an index to form a map.
Array of words.
Existing map of word indices, if any. Default is null, in which case a new map with indices starting from 1 will be generated. If not null, the generated map will preserve the word indices in existingMap and assign subsequent indices to new words.
wordIndex map.
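The indexing behavior described above can be sketched in plain Python (the function name is illustrative; indices start from 1, and an existing map's indices are preserved):

```python
def generate_word_index(words, existing_map=None):
    """Assign each word an index to form a map.

    If existing_map is None, a new map with indices starting from 1 is
    generated. Otherwise, the indices in existing_map are preserved and
    subsequent indices are assigned to words not already in the map.
    """
    word_index = dict(existing_map) if existing_map else {}
    # Continue numbering after the largest existing index (or start at 1).
    next_index = max(word_index.values(), default=0) + 1
    for word in words:
        if word not in word_index:
            word_index[word] = next_index
            next_index += 1
    return word_index
```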