Class

com.intel.analytics.zoo.feature.text

LocalTextSet

class LocalTextSet extends TextSet

A LocalTextSet consists of an array of TextFeature.

Linear Supertypes
  1. TextSet
  2. AnyRef
  3. Any

Instance Constructors

  1. new LocalTextSet(array: Array[TextFeature])

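
As a rough illustration of how the constructor fits together with the value members listed below, the following sketch builds a small in-memory LocalTextSet and runs the usual preprocessing chain. It assumes the Analytics Zoo jar is on the classpath and that a `TextFeature(text, label)` factory is available (an assumption, since this page does not document TextFeature); the object name `LocalTextSetPipelineSketch` and the sample texts are made up for the example.

```scala
import com.intel.analytics.zoo.feature.text.{LocalTextSet, TextFeature, TextSet}

object LocalTextSetPipelineSketch {
  // Builds a tiny corpus and applies the preprocessing steps in the
  // order the member documentation requires.
  def run(): TextSet = {
    // Assumption: TextFeature(text, label) exists as a factory method.
    val features = Array(
      TextFeature("Hello world", 0),
      TextFeature("Hello Analytics Zoo", 1)
    )
    val textSet = new LocalTextSet(features)
    textSet
      .tokenize()             // must come before normalize and word2idx
      .normalize()
      .word2idx()             // training mode: generates a new wordIndex map
      .shapeSequence(len = 5) // requires word2idx
      .generateSample()       // requires word2idx
  }

  def main(args: Array[String]): Unit = {
    val transformed = run()
    println(transformed.getWordIndex)
  }
}
```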

Value Members

  1. final def !=(arg0: Any): Boolean

    Definition Classes
    AnyRef → Any
  2. final def ##(): Int

    Definition Classes
    AnyRef → Any
  3. def ->(transformer: Preprocessing[TextFeature, TextFeature]): TextSet

    Definition Classes
    TextSet
  4. final def ==(arg0: Any): Boolean

    Definition Classes
    AnyRef → Any
  5. var array: Array[TextFeature]

  6. final def asInstanceOf[T0]: T0

    Definition Classes
    Any
  7. def clone(): AnyRef

    Attributes
    protected[java.lang]
    Definition Classes
    AnyRef
    Annotations
    @throws( ... )
  8. final def eq(arg0: AnyRef): Boolean

    Definition Classes
    AnyRef
  9. def equals(arg0: Any): Boolean

    Definition Classes
    AnyRef → Any
  10. def finalize(): Unit

    Attributes
    protected[java.lang]
    Definition Classes
    AnyRef
    Annotations
    @throws( classOf[java.lang.Throwable] )
  11. def generateSample(): TextSet

    Generate BigDL Sample. word2idx must be called first. See TextFeatureToSample for more details.

    Definition Classes
    TextSet
  12. def generateWordIndexMap(removeTopN: Int = 0, maxWordsNum: Int = -1, minFreq: Int = 1, existingMap: Map[String, Int] = null): Map[String, Int]

    Generate the wordIndex map based on word frequencies sorted in descending order. Returns the resulting map, which is also stored in 'wordIndex'. Call this after tokenize; otherwise an exception is thrown. See word2idx for more details.

    Definition Classes
    LocalTextSet → TextSet
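
The map generation described above can be sketched in plain Scala. This is an illustrative re-implementation, not the library's code: the names `WordIndexSketch` and `build` are hypothetical, an empty Map stands in for the null default, ties in frequency are broken alphabetically to keep the result deterministic, and removeTopN is assumed to be applied before maxWordsNum.

```scala
object WordIndexSketch {
  // Illustrative only: mirrors the documented parameters, not the library internals.
  def build(
      tokens: Seq[String],
      removeTopN: Int = 0,
      maxWordsNum: Int = -1,
      minFreq: Int = 1,
      existingMap: Map[String, Int] = Map.empty): Map[String, Int] = {
    // Count occurrences of each word.
    val counts = tokens.groupBy(identity).map { case (w, ws) => (w, ws.size) }
    val sorted = counts.toSeq
      .filter { case (_, c) => c >= minFreq }
      // Descending frequency; alphabetical tie-break is an assumption.
      .sortBy { case (w, c) => (-c, w) }
      .map(_._1)
    // Drop the most frequent words, then cap the vocabulary size if requested.
    val afterRemoval = sorted.drop(removeTopN)
    val kept = if (maxWordsNum > 0) afterRemoval.take(maxWordsNum) else afterRemoval
    // Keep existing indices; new words continue after the current maximum.
    val newWords = kept.filterNot(existingMap.contains)
    val start = if (existingMap.isEmpty) 1 else existingMap.values.max + 1
    existingMap ++ newWords.zipWithIndex.map { case (w, i) => (w, start + i) }
  }
}
```

For the tokens `Seq("a", "b", "a", "c", "b", "a")`, calling `WordIndexSketch.build(tokens, removeTopN = 1)` drops "a" and yields `Map("b" -> 1, "c" -> 2)`.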
  13. final def getClass(): Class[_]

    Definition Classes
    AnyRef → Any
  14. def getWordIndex: Map[String, Int]

    Get the word index map of this TextSet. Returns null if the TextSet has not yet been transformed from words to indices.

    Definition Classes
    TextSet
  15. def hashCode(): Int

    Definition Classes
    AnyRef → Any
  16. def isDistributed: Boolean

    Whether it is a DistributedTextSet.

    Definition Classes
    LocalTextSet → TextSet
  17. final def isInstanceOf[T0]: Boolean

    Definition Classes
    Any
  18. def isLocal: Boolean

    Whether it is a LocalTextSet.

    Definition Classes
    LocalTextSet → TextSet
  19. def loadWordIndex(path: String): TextSet

    Load the wordIndex map that was saved after training, so that this TextSet can use it directly during inference. Each line should have the form "word id".

    After calling loadWordIndex, you do not need to specify any argument when calling word2idx in the preprocessing pipeline, because the loaded wordIndex is used as is for the transformation.

    For LocalTextSet, the text file is loaded from the local file system. For DistributedTextSet, it is loaded from a local or distributed file system (such as HDFS).

    path

    The path to the text file.

    Definition Classes
    LocalTextSet → TextSet
  20. final def ne(arg0: AnyRef): Boolean

    Definition Classes
    AnyRef
  21. def normalize(): TextSet

    Normalize tokens. tokenize must be called first. See Normalizer for more details.

    Definition Classes
    TextSet
  22. final def notify(): Unit

    Definition Classes
    AnyRef
  23. final def notifyAll(): Unit

    Definition Classes
    AnyRef
  24. def randomSplit(weights: Array[Double]): Array[TextSet]

    Randomly split into an array of TextSet according to the provided weights. Currently available only for DistributedTextSet.

    weights

    Array of Double indicating the split portions.

    Definition Classes
    LocalTextSet → TextSet
  25. def saveWordIndex(path: String): Unit

    Save the wordIndex map to a text file, which can be used for future inference. Each line will have the form "word id".

    For LocalTextSet, the text file is saved to the local file system. For DistributedTextSet, it is saved to a local or distributed file system (such as HDFS).

    path

    The path to the text file.

    Definition Classes
    LocalTextSet → TextSet
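
To make the "word id" file format concrete, here is a small stand-alone round trip in plain Scala. It does not call the library; `WordIndexFileSketch`, `save`, and `load` are hypothetical names, and the single-space separator is an assumption read off the "word id" description.

```scala
import java.io.PrintWriter
import scala.io.Source

object WordIndexFileSketch {
  // Writes one "word id" pair per line, separated by a single space.
  def save(wordIndex: Map[String, Int], path: String): Unit = {
    val writer = new PrintWriter(path, "UTF-8")
    try wordIndex.foreach { case (word, id) => writer.println(s"$word $id") }
    finally writer.close()
  }

  // Reads the pairs back, splitting each non-empty line at its last space.
  def load(path: String): Map[String, Int] = {
    val source = Source.fromFile(path, "UTF-8")
    try {
      source.getLines().filter(_.trim.nonEmpty).map { line =>
        val sep = line.lastIndexOf(' ')
        line.substring(0, sep) -> line.substring(sep + 1).trim.toInt
      }.toMap
    } finally source.close()
  }
}
```

With the real API, the equivalent steps would be saveWordIndex(path) after training and loadWordIndex(path) before calling word2idx at inference time.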
  26. def setWordIndex(vocab: Map[String, Int]): LocalTextSet.this.type

    Assign a wordIndex map for this TextSet to use during word2idx. If the wordIndex comes from a saved file, use loadWordIndex directly instead.

    vocab

    Map from each word (String) to its index (Int).

    Definition Classes
    TextSet
  27. def shapeSequence(len: Int, truncMode: TruncMode = TruncMode.pre, padElement: Int = 0): TextSet

    Shape the sequence of indices to a fixed length. word2idx must be called first. See SequenceShaper for more details.

    Definition Classes
    TextSet
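
A plain-Scala sketch of what shaping to a fixed length means may help here. It is not the library's SequenceShaper: `SequenceShapeSketch`, `Trunc`, `Pre`, and `Post` are hypothetical stand-ins, and two behaviours are assumptions rather than facts stated on this page: that "pre" truncation drops elements from the beginning, and that padding is appended at the end.

```scala
object SequenceShapeSketch {
  // Hypothetical stand-in for the library's TruncMode values.
  sealed trait Trunc
  case object Pre extends Trunc  // drop elements from the beginning
  case object Post extends Trunc // drop elements from the end

  def shape(
      indices: Array[Int],
      len: Int,
      mode: Trunc = Pre,
      padElement: Int = 0): Array[Int] = {
    if (indices.length > len) {
      mode match {
        case Pre  => indices.takeRight(len) // keep the last len elements
        case Post => indices.take(len)      // keep the first len elements
      }
    } else {
      // Assumption: shorter sequences are padded at the end.
      indices ++ Array.fill(len - indices.length)(padElement)
    }
  }
}
```

Under these assumptions, shaping `Array(1, 2, 3, 4, 5)` to length 3 keeps `3, 4, 5` in the default mode, and shaping `Array(1, 2)` to length 4 gives `1, 2, 0, 0`.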
  28. final def synchronized[T0](arg0: ⇒ T0): T0

    Definition Classes
    AnyRef
  29. def toDataSet: DataSet[Sample[Float]]

    Convert TextSet to DataSet of Sample.

    Definition Classes
    LocalTextSet → TextSet
  30. def toDistributed(sc: SparkContext, partitionNum: Int = 4): DistributedTextSet

    Convert to a DistributedTextSet.

    A SparkContext is required to convert a LocalTextSet to a DistributedTextSet. You may also specify partitionNum, which defaults to 4.

    Definition Classes
    LocalTextSet → TextSet
  31. def toLocal(): LocalTextSet

    Convert to a LocalTextSet.

    Definition Classes
    LocalTextSet → TextSet
  32. def toString(): String

    Definition Classes
    AnyRef → Any
  33. def tokenize(): TextSet

    Tokenize the original text. See Tokenizer for more details.

    Definition Classes
    TextSet
  34. def transform(transformer: Preprocessing[TextFeature, TextFeature]): TextSet

    Transform from one TextSet to another.

    Definition Classes
    LocalTextSet → TextSet
  35. final def wait(): Unit

    Definition Classes
    AnyRef
    Annotations
    @throws( ... )
  36. final def wait(arg0: Long, arg1: Int): Unit

    Definition Classes
    AnyRef
    Annotations
    @throws( ... )
  37. final def wait(arg0: Long): Unit

    Definition Classes
    AnyRef
    Annotations
    @throws( ... )
  38. def word2idx(removeTopN: Int = 0, maxWordsNum: Int = -1, minFreq: Int = 1, existingMap: Map[String, Int] = null): TextSet

    Map word tokens to indices. Important: this method behaves differently for training and inference.

    Training: you need to generate a new wordIndex map from the texts you are dealing with. This method therefore first generates the map and then converts words to indices based on it. The arguments below constrain how the map is generated. In the resulting map, indices start from 1 and follow the occurrence frequency of each word sorted in descending order. By convention, index 0 is reserved for unknown words. After word2idx, you can get the generated wordIndex map by calling 'getWordIndex'. You can also call saveWordIndex to save this wordIndex map for use in future training.

    removeTopN

    Non-negative integer. Remove the topN words with highest frequencies in the case where those are treated as stopwords. Default is 0, namely remove nothing.

    maxWordsNum

    Integer. The maximum number of words to be taken into consideration. Default is -1, namely all words will be considered. Otherwise, it should be a positive integer.

    minFreq

    Positive integer. Only those words with frequency >= minFreq will be taken into consideration. Default is 1, namely all words that occur will be considered.

    existingMap

    Existing map of word index, if any. Default is null, in which case a new map with indices starting from 1 is generated. If not null, the generated map preserves the word indices in existingMap and assigns subsequent indices to new words.

    Inference: you are supposed to use exactly the same wordIndex map as in the training stage instead of generating a new one. You therefore do not need to specify any of the above arguments, but you do need to call loadWordIndex or setWordIndex beforehand to load the map.

    In both cases, tokenize must be called first. See WordIndexer for more details.

    Definition Classes
    TextSet
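
The inference-time behaviour, where an existing map is reused and nothing is regenerated, can be pictured with a few lines of plain Scala. This is a sketch following the stated convention that index 0 is reserved for unknown words; `WordLookupSketch`, `UnknownIndex`, and `toIndices` are hypothetical names, and how the library's WordIndexer actually treats unseen words should be checked against its own documentation.

```scala
object WordLookupSketch {
  // Index reserved for words that are not in the map (per the convention above).
  val UnknownIndex = 0

  // Converts tokens to indices using a fixed, previously generated map.
  def toIndices(tokens: Seq[String], wordIndex: Map[String, Int]): Seq[Int] =
    tokens.map(word => wordIndex.getOrElse(word, UnknownIndex))
}
```

Because the map is fixed, the same token always receives the same index at training and inference time, which is why loadWordIndex or setWordIndex has to be called before word2idx during inference.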
