TextSet

Instance Constructors

new TextSet()

Abstract Value Members

abstract def generateWordIndexMap(removeTopN: Int = 0, maxWordsNum: Int = 5000, minFreq: Int = 1, existingMap: Map[String, Int] = null): Map[String, Int]

Generate wordIndex map based on sorted word frequencies in descending order.
Generate wordIndex map based on sorted word frequencies in descending order. Return the result map, which will also be stored in 'wordIndex'. Make sure you call this after tokenize. Otherwise you will get an exception. See word2idx for more details.
abstract def isDistributed: Boolean

Whether it is a DistributedTextSet.
abstract def isLocal: Boolean

Whether it is a LocalTextSet.
abstract def loadWordIndex(path: String): TextSet

Load the wordIndex map which was saved after the training, so that this TextSet can directly use this wordIndex during inference.
Load the wordIndex map which was saved after the training, so that this TextSet can directly use this wordIndex during inference. Each separate line should be "word id".
Note that after calling loadWordIndex, you do not need to specify any argument when calling word2idx in the preprocessing pipeline as now you are using exactly the loaded wordIndex for transformation.
For LocalTextSet, load txt from a local file system. For DistributedTextSet, load txt from a local or distributed file system (such as HDFS).
path
The path to the text file.
abstract def randomSplit(weights: Array[Double]): Array[TextSet]

Randomly split into array of TextSet with provided weights.
Randomly split into array of TextSet with provided weights. Only available for DistributedTextSet for now.
weights
Array of Double indicating the split portions.
abstract def toDataSet: DataSet[Sample[Float]]

Convert TextSet to DataSet of Sample.
abstract def toDistributed(sc: SparkContext = null, partitionNum: Int = 4): DistributedTextSet

Convert to a DistributedTextSet.
Convert to a DistributedTextSet.
Need to specify SparkContext to convert a LocalTextSet to a DistributedTextSet. In this case, you may also want to specify partitionNum, the default of which is 4.
abstract def toLocal(): LocalTextSet

Convert to a LocalTextSet.
abstract def transform(transformer: Preprocessing[TextFeature, TextFeature]): TextSet

Transform from one TextSet to another.

Concrete Value Members

final def !=(arg0: Any): Boolean

Definition Classes
AnyRef → Any
final def ##(): Int

Definition Classes
AnyRef → Any
def ->(transformer: Preprocessing[TextFeature, TextFeature]): TextSet
final def ==(arg0: Any): Boolean

Definition Classes
AnyRef → Any
final def asInstanceOf[T0]: T0

Definition Classes
Any
def clone(): AnyRef

Attributes
protected[java.lang]
Definition Classes
AnyRef
Annotations
@throws( ... )
final def eq(arg0: AnyRef): Boolean

Definition Classes
AnyRef
def equals(arg0: Any): Boolean

Definition Classes
AnyRef → Any
def finalize(): Unit

Attributes
protected[java.lang]
Definition Classes
AnyRef
Annotations
@throws( classOf[java.lang.Throwable] )
def generateSample(): TextSet

Generate BigDL Sample.
Generate BigDL Sample. Need to word2idx first. See TextFeatureToSample for more details.
final def getClass(): Class[_]

Definition Classes
AnyRef → Any
def getWordIndex: Map[String, Int]

Get the word index map of this TextSet.
Get the word index map of this TextSet. If the TextSet hasn't been transformed from word to index, null will be returned.
def hashCode(): Int

Definition Classes
AnyRef → Any
final def isInstanceOf[T0]: Boolean

Definition Classes
Any
final def ne(arg0: AnyRef): Boolean

Definition Classes
AnyRef
def normalize(): TextSet

Do normalization on tokens.
Do normalization on tokens. Need to tokenize first. See Normalizer for more details.
final def notify(): Unit

Definition Classes
AnyRef
final def notifyAll(): Unit

Definition Classes
AnyRef
def saveWordIndex(path: String): Unit

Save wordIndex map to text file, which can be used for future inference.
Save wordIndex map to text file, which can be used for future inference. Each separate line will be "word id".
For LocalTextSet, save txt to a local file system. For DistributedTextSet, save txt to a local or distributed file system (such as HDFS).
path
The path to the text file.
def setWordIndex(vocab: Map[String, Int]): TextSet.this.type

Assign a wordIndex map for this TextSet to use during word2idx.
Assign a wordIndex map for this TextSet to use during word2idx. If you load the wordIndex from the saved file, you are recommended to use loadWordIndex directly.
vocab
Map of each word (String) and its index (integer).
def shapeSequence(len: Int, truncMode: TruncMode = TruncMode.pre, padElement: Int = 0): TextSet

Shape the sequence of indices to a fixed length.
Shape the sequence of indices to a fixed length. Need to word2idx first. See SequenceShaper for more details.
final def synchronized[T0](arg0: ⇒ T0): T0

Definition Classes
AnyRef
def toString(): String

Definition Classes
AnyRef → Any
def tokenize(): TextSet

Do tokenization on original text.
Do tokenization on original text. See Tokenizer for more details.
final def wait(): Unit

Definition Classes
AnyRef
Annotations
@throws( ... )
final def wait(arg0: Long, arg1: Int): Unit

Definition Classes
AnyRef
Annotations
@throws( ... )
final def wait(arg0: Long): Unit

Definition Classes
AnyRef
Annotations
@throws( ... )
def word2idx(removeTopN: Int = 0, maxWordsNum: Int = 1, minFreq: Int = 1, existingMap: Map[String, Int] = null): TextSet

Map word tokens to indices.
Map word tokens to indices. Important: Take care that this method behaves a bit differently for training and inference.
During the training, you need to generate a new wordIndex map according to the texts you are dealing with. Thus this method will first do the map generation and then convert words to indices based on the generated map. You can specify the following arguments which pose some constraints when generating the map. In the result map, index will start from 1 and corresponds to the occurrence frequency of each word sorted in descending order. Here we adopt the convention that index 0 will be reserved for unknown words. After word2idx, you can get the generated wordIndex map by calling 'getWordIndex'. Also, you can call saveWordIndex to save this wordIndex map to be used in future training.
removeTopN
Non-negative integer. Remove the topN words with highest frequencies in the case where those are treated as stopwords. Default is 0, namely remove nothing.
maxWordsNum
Integer. The maximum number of words to be taken into consideration. Default is -1, namely all words will be considered. Otherwise, it should be a positive integer.
minFreq
Positive integer. Only those words with frequency >= minFreq will be taken into consideration. Default is 1, namely all words that occur will be considered.
existingMap
Existing map of word index if any. Default is null and in this case a new map with index starting from 1 will be generated. If not null, then the generated map will preserve the word index in existingMap and assign subsequent indices to new words. ---------------------------------------Inference-------------------------------------------- During the inference, you are supposed to use exactly the same wordIndex map as in the training stage instead of generating a new one. Thus please be aware that you do not need to specify any of the above arguments. You need to call loadWordIndex or setWordIndex beforehand for map loading. Need to tokenize first. See WordIndexer for more details.

Related Docs: object TextSet | package text

abstract class TextSet extends AnyRef

Instance Constructors

new TextSet()

Abstract Value Members

abstract def generateWordIndexMap(removeTopN: Int = 0, maxWordsNum: Int = 5000, minFreq: Int = 1, existingMap: Map[String, Int] = null): Map[String, Int]

abstract def isDistributed: Boolean

abstract def isLocal: Boolean

abstract def loadWordIndex(path: String): TextSet

abstract def randomSplit(weights: Array[Double]): Array[TextSet]

abstract def toDataSet: DataSet[Sample[Float]]

abstract def toDistributed(sc: SparkContext = null, partitionNum: Int = 4): DistributedTextSet

abstract def toLocal(): LocalTextSet

abstract def transform(transformer: Preprocessing[TextFeature, TextFeature]): TextSet

Concrete Value Members

final def !=(arg0: Any): Boolean

final def ##(): Int

def ->(transformer: Preprocessing[TextFeature, TextFeature]): TextSet

final def ==(arg0: Any): Boolean

final def asInstanceOf[T0]: T0

def clone(): AnyRef

final def eq(arg0: AnyRef): Boolean

def equals(arg0: Any): Boolean

def finalize(): Unit

def generateSample(): TextSet

final def getClass(): Class[_]

def getWordIndex: Map[String, Int]

def hashCode(): Int

final def isInstanceOf[T0]: Boolean

final def ne(arg0: AnyRef): Boolean

def normalize(): TextSet

final def notify(): Unit

final def notifyAll(): Unit

def saveWordIndex(path: String): Unit

def setWordIndex(vocab: Map[String, Int]): TextSet.this.type

def shapeSequence(len: Int, truncMode: TruncMode = TruncMode.pre, padElement: Int = 0): TextSet

final def synchronized[T0](arg0: ⇒ T0): T0

def toString(): String

def tokenize(): TextSet

final def wait(): Unit

final def wait(arg0: Long, arg1: Int): Unit

final def wait(arg0: Long): Unit

def word2idx(removeTopN: Int = 0, maxWordsNum: Int = 1, minFreq: Int = 1, existingMap: Map[String, Int] = null): TextSet

Inherited from AnyRef

Inherited from Any

Ungrouped