Generate BigDL Sample. See TextFeatureToSample for more details.
Generate the wordIndex map based on word frequencies sorted in descending order. Returns the resulting map, which is also stored in 'wordIndex'. Call this only after tokenize; otherwise an exception is thrown. See word2idx for more details.
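The frequency-ordered indexing described above can be sketched in plain Python. This only illustrates the idea and is not the library's implementation: the helper name `build_word_index`, the list-of-token-lists input, and the alphabetical tie-breaking are all assumptions made for the example. A text that has not been tokenized is represented here as `None`, and the sketch raises an error in that case to mirror the requirement that tokenization comes first.

```python
from collections import Counter


def build_word_index(tokenized_texts):
    """Map each word to an index starting from 1, most frequent word first.

    Hypothetical helper: ties are broken alphabetically, which is an
    assumption of this sketch and not something the documentation specifies.
    """
    if any(tokens is None for tokens in tokenized_texts):
        # Mirrors the documented requirement: tokenize before indexing.
        raise ValueError("texts must be tokenized before building the word index")
    counts = Counter(word for tokens in tokenized_texts for word in tokens)
    ordered = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return {word: index for index, (word, _) in enumerate(ordered, start=1)}


texts = [["the", "cat", "sat"], ["the", "cat"], ["the"]]
word_index = build_word_index(texts)
print(word_index)  # {'the': 1, 'cat': 2, 'sat': 3}
```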
Get the word index map of this TextSet. Returns null if the TextSet has not yet been transformed from words to indices.
Whether it is a DistributedTextSet.
Whether it is a LocalTextSet.
Do normalization on tokens. See Normalizer for more details.
Randomly split into an array of TextSets according to the provided weights. Currently only available for DistributedTextSet.
Array of Double indicating the split portions.
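To make the role of the weights concrete, here is a minimal plain-Python sketch of a weighted random split. It is not the library's code: the function name `random_split`, the `seed` parameter, and the choice to normalize weights that do not sum to 1 are assumptions of this example. Each item is assigned independently, so the split sizes are only approximately proportional to the weights.

```python
import random


def random_split(items, weights, seed=0):
    """Randomly assign each item to one split, with probability proportional
    to that split's weight. Weights are normalized, so they need not sum to 1.
    """
    if not weights or any(w < 0 for w in weights) or sum(weights) <= 0:
        raise ValueError("weights must be non-negative with a positive sum")
    rng = random.Random(seed)  # seeded so the split is reproducible
    splits = [[] for _ in weights]
    for item in items:
        # random.choices draws one split index according to the weights.
        (chosen,) = rng.choices(range(len(weights)), weights=weights, k=1)
        splits[chosen].append(item)
    return splits


train, validation = random_split(list(range(100)), [0.8, 0.2])
print(len(train), len(validation))
```

Every item lands in exactly one split, so the splits together always cover the original data.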
Shape the sequence of tokens to a fixed length. Padding element will be "##". See SequenceShaper for more details.
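Shaping to a fixed length means padding short sequences and truncating long ones. The sketch below shows one way this could behave in plain Python. The `"##"` padding element comes from the description above; the function name `shape_sequence`, padding at the end, and the `trunc_mode` parameter that selects which end is truncated are assumptions of this example, so check SequenceShaper for the actual options.

```python
def shape_sequence(tokens, length, pad_element="##", trunc_mode="pre"):
    """Return a new list with exactly `length` elements.

    trunc_mode is a hypothetical parameter of this sketch:
    "pre" drops tokens from the beginning, "post" drops them from the end.
    Short sequences are padded at the end with pad_element.
    """
    if length <= 0:
        raise ValueError("length must be positive")
    if trunc_mode not in ("pre", "post"):
        raise ValueError("trunc_mode must be 'pre' or 'post'")
    if len(tokens) > length:
        return tokens[-length:] if trunc_mode == "pre" else tokens[:length]
    return tokens + [pad_element] * (length - len(tokens))


print(shape_sequence(["hello", "world"], 4))
# ['hello', 'world', '##', '##']
print(shape_sequence(["a", "b", "c", "d", "e"], 3))
# ['c', 'd', 'e']
print(shape_sequence(["a", "b", "c", "d", "e"], 3, trunc_mode="post"))
# ['a', 'b', 'c']
```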
Convert TextSet to DataSet of Sample.
Convert to a DistributedTextSet.
A SparkContext must be specified to convert a LocalTextSet to a DistributedTextSet. In this case, you may also want to specify partitionNum, which defaults to 4.
Convert to a LocalTextSet.
Do tokenization on original text. See Tokenizer for more details.
Transform from one TextSet to another.
Map word tokens to indices. Indices start from 1 and follow the occurrence frequency of each word, sorted in descending order. See WordIndexer for more details. After word2idx, you can get the wordIndex map by calling 'getWordIndex'.
Integer. The number of highest-frequency words to remove, for cases where they are treated as stopwords. Default is 0, meaning nothing is removed.
Integer. The maximum number of words to take into consideration. Default is -1, meaning all words are considered.
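The interaction of the two parameters is easier to see on a small example. The following plain-Python sketch is a stand-in for the behaviour described above, not the library's implementation. The name `word2idx_sketch`, the snake_case parameter names, and three behaviours are assumptions of this example: the top words are removed before the size cap is applied, frequency ties are broken alphabetically, and tokens left out of the map become a hypothetical `unknown_index` of 0.

```python
from collections import Counter


def word2idx_sketch(tokenized_texts, remove_top_n=0, max_words_num=-1,
                    unknown_index=0):
    """Return (indexed_texts, word_index) for a list of token lists.

    Assumptions of this sketch, not documented behaviour: frequency ties are
    broken alphabetically, and tokens left out of the map become unknown_index.
    """
    counts = Counter(word for tokens in tokenized_texts for word in tokens)
    ordered = sorted(counts, key=lambda word: (-counts[word], word))
    kept = ordered[remove_top_n:]        # drop the most frequent words first
    if max_words_num >= 0:
        kept = kept[:max_words_num]      # then cap the vocabulary size
    word_index = {word: index for index, word in enumerate(kept, start=1)}
    indexed = [[word_index.get(word, unknown_index) for word in tokens]
               for tokens in tokenized_texts]
    return indexed, word_index


texts = [["the", "cat", "sat"], ["the", "cat", "ran"], ["the", "dog"]]
indexed, word_index = word2idx_sketch(texts, remove_top_n=1, max_words_num=2)
print(word_index)  # {'cat': 1, 'dog': 2}
print(indexed)     # [[0, 1, 0], [0, 1, 0], [0, 2]]
```

Here "the" is removed as the single most frequent word, and only the next two words by frequency receive indices.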