DistributedTextSet is composed of an RDD of TextFeature.
LocalTextSet is composed of an array of TextFeature.
Shape the sequence of indices to a fixed length. If the original sequence is longer than the target length, it is truncated from either the beginning or the end. If the original sequence is shorter than the target length, it is padded at the end. Requires word2idx first. Input key: TextFeature.indexedTokens. Output key: TextFeature.indexedTokens. The original sequence of indices is replaced by the shaped sequence.
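The truncation and padding rules above can be illustrated with a small plain-Python sketch. It is not the library's implementation: the function name `shape_sequence`, the `trunc_mode` values `"pre"`/`"post"`, and the default padding value `0` are assumptions made for this example.

```python
def shape_sequence(indices, target_len, trunc_mode="pre", pad_element=0):
    """Return a list of exactly target_len indices (hypothetical helper)."""
    if trunc_mode not in ("pre", "post"):
        raise ValueError("trunc_mode must be 'pre' or 'post'")
    if len(indices) > target_len:
        if trunc_mode == "pre":
            # Truncate from the beginning: keep the last target_len elements.
            return list(indices[len(indices) - target_len:])
        # Truncate from the end: keep the first target_len elements.
        return list(indices[:target_len])
    # Shorter (or equal-length) sequences are padded at the end.
    return list(indices) + [pad_element] * (target_len - len(indices))


long_seq = [1, 2, 3, 4, 5]
print(shape_sequence(long_seq, 3))          # [3, 4, 5]
print(shape_sequence(long_seq, 3, "post"))  # [1, 2, 3]
print(shape_sequence([1, 2], 4))            # [1, 2, 0, 0]
```

Padding only at the end keeps the real indices in their original positions, whichever truncation mode is chosen.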
Each TextFeature keeps the information of a single text record. It can hold the various states (if any) of a text, e.g. original text content, URI, category label, tokens, index representation of tokens, BigDL Sample representation, prediction result and so on. It uses a HashMap to store all of this data. Each key is a string that identifies the corresponding value.
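The keyed-storage idea can be sketched in plain Python with a dictionary standing in for the HashMap. The class name `TextFeatureSketch`, its methods, and the literal key strings are hypothetical and only mirror the description above. They are not the library's API.

```python
class TextFeatureSketch:
    """Holds every state of one text record in a single string-keyed map."""

    def __init__(self, text, label=None):
        # The original text is always present; the label is optional.
        self._state = {"text": text}
        if label is not None:
            self._state["label"] = label

    def contains(self, key):
        return key in self._state

    def get(self, key):
        return self._state[key]

    def set(self, key, value):
        # Later stages add new keys or overwrite existing ones.
        self._state[key] = value
        return self

    def keys(self):
        return set(self._state)


feature = TextFeatureSketch("Hello world", label=1)
feature.set("tokens", ["hello", "world"])
print(sorted(feature.keys()))  # ['label', 'text', 'tokens']
```

Storing every stage under its own key lets a transformer check that its required input exists before it runs.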
Transform indexedTokens and label (if any) of a TextFeature to a BigDL Sample. Requires word2idx first. Input key: TextFeature.indexedTokens and TextFeature.label (if any). Output key: TextFeature.sample.
TextSet wraps a set of TextFeature.
Base class of Transformers that transform TextFeature.
Transform text to an array of string tokens. Input key: TextFeature.text. Output key: TextFeature.tokens.
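As a rough illustration of the input and output keys, the sketch below models a TextFeature as a plain Python dictionary and splits the text on whitespace. The splitting rule, the function name `tokenize`, and the literal key strings are assumptions for this example. The description above does not say how the library splits text.

```python
def tokenize(feature):
    """Read feature['text'] and return a copy with feature['tokens'] added."""
    if "text" not in feature:
        raise KeyError("feature has no text to tokenize")
    out = dict(feature)
    # Assumption: split on runs of whitespace; punctuation stays attached.
    out["tokens"] = feature["text"].split()
    return out


print(tokenize({"text": "Hello, World!  foo"})["tokens"])
# ['Hello,', 'World!', 'foo']
```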
Given a wordIndex map, transform tokens to their corresponding indices. Words that are not in the map are dropped. Requires tokenization first. Input key: TextFeature.tokens. Output key: TextFeature.indexedTokens.
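The lookup step can be sketched in plain Python, again with a dictionary standing in for a TextFeature. The function name `word2idx` and the literal key strings are hypothetical, and the sketch assumes that words missing from the map are simply skipped.

```python
def word2idx(feature, word_index):
    """Map feature['tokens'] to indices via word_index, skipping unknown words."""
    if "tokens" not in feature:
        raise KeyError("feature has no tokens; tokenize first")
    out = dict(feature)
    # Assumption: tokens absent from word_index are skipped, not replaced.
    out["indexedTokens"] = [
        word_index[token] for token in feature["tokens"] if token in word_index
    ]
    return out


vocab = {"hello": 1, "world": 2}
indexed = word2idx({"tokens": ["hello", "unknown", "world"]}, vocab)
print(indexed["indexedTokens"])  # [1, 2]
```

Because unknown words are skipped, the index sequence can be shorter than the token sequence, which is one reason to shape it to a fixed length afterwards.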
Removes all dirty (non-English-alphabet) characters from tokens and converts words to lowercase. Requires tokenization first. Input key: TextFeature.tokens. Output key: TextFeature.tokens. The original tokens are replaced by the normalized tokens.
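A minimal plain-Python sketch of this normalization follows, assuming that "non-English-alphabet" means anything outside `a-z` and `A-Z`. The function name `normalize` and the dictionary stand-in for a TextFeature are hypothetical. Whether the library keeps tokens that become empty is not stated above, so this sketch keeps them.

```python
import re

# Assumption: only ASCII letters count as English-alphabet characters.
_NON_ALPHA = re.compile(r"[^a-zA-Z]")


def normalize(feature):
    """Strip non-letter characters from each token and lowercase the result."""
    if "tokens" not in feature:
        raise KeyError("feature has no tokens; tokenize first")
    out = dict(feature)
    # The normalized list replaces the original tokens under the same key.
    out["tokens"] = [_NON_ALPHA.sub("", token).lower() for token in feature["tokens"]]
    return out


print(normalize({"tokens": ["Hello,", "World!", "42"]})["tokens"])
# ['hello', 'world', '']
```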