public class SentenceLengthNormalizer extends java.lang.Object implements TextProcessor
SentenceLengthNormalizer normalizes the length of all the input sentences to the specified number of tokens.
If an input sentence has more tokens than the fixed sentence length, it is truncated to that length. If it has fewer tokens, padding tokens are inserted until it reaches the fixed sentence length.
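The truncate-or-pad rule described above can be sketched in a few lines. This is only an illustration under stated assumptions, not the library's implementation: `PadOrTruncateSketch`, its `normalize` method, and the `<pad>` token string are hypothetical names, and padding at the end of the sentence is an assumption the description does not confirm.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper for illustration only; not part of the documented API.
class PadOrTruncateSketch {

    // Returns a new list with exactly `length` tokens: extra tokens are dropped,
    // missing ones are filled with `pad` at the end (end-padding is an assumption).
    static List<String> normalize(List<String> tokens, int length, String pad) {
        List<String> out = new ArrayList<>(length);
        for (int i = 0; i < length; i++) {
            out.add(i < tokens.size() ? tokens.get(i) : pad);
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> sentence = Arrays.asList("the", "cat", "sat");
        System.out.println(normalize(sentence, 2, "<pad>")); // [the, cat]
        System.out.println(normalize(sentence, 5, "<pad>")); // [the, cat, sat, <pad>, <pad>]
    }
}
```

Either way the result always has the same number of tokens, which is what lets sentences of different lengths be processed together as fixed-size input.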
Constructor and Description |
---|
SentenceLengthNormalizer() Creates a TextProcessor that normalizes the length of the input. |
SentenceLengthNormalizer(int sentenceLength, boolean addEosBosTokens) Creates a TextProcessor that normalizes the length of the input to the given sentence length. |
SentenceLengthNormalizer(int sentenceLength, boolean addEosBosTokens, java.lang.String paddingToken, java.lang.String eosToken, java.lang.String bosToken) Creates a TextProcessor that normalizes the length of the input to the given sentence length. |
Modifier and Type | Method and Description |
---|---|
int | getLastValidLength() Returns the valid length of the sentence that was last served as input to preprocess(List). |
java.util.List<java.lang.String> | preprocess(java.util.List<java.lang.String> tokens) Applies the preprocessing defined to the given input tokens. |
public SentenceLengthNormalizer()
Creates a TextProcessor that normalizes the length of the input.

public SentenceLengthNormalizer(int sentenceLength, boolean addEosBosTokens)
Creates a TextProcessor that normalizes the length of the input to the given sentence length.
Parameters:
sentenceLength - the sentence length
addEosBosTokens - whether to add EOS and BOS tokens before normalizing the sentence length

public SentenceLengthNormalizer(int sentenceLength, boolean addEosBosTokens, java.lang.String paddingToken, java.lang.String eosToken, java.lang.String bosToken)
Creates a TextProcessor that normalizes the length of the input to the given sentence length.
Parameters:
sentenceLength - the sentence length
addEosBosTokens - whether to add EOS and BOS tokens before normalizing the sentence length
paddingToken - the padding token to use if the input has fewer tokens than the sentence length
eosToken - the end-of-sentence token
bosToken - the beginning-of-sentence token

public java.util.List<java.lang.String> preprocess(java.util.List<java.lang.String> tokens)
Specified by: preprocess in interface TextProcessor
Parameters:
tokens - the tokens created after the input text is tokenized

public int getLastValidLength()
Returns the valid length of the sentence that was last served as input to preprocess(List). If no sentence has been preprocessed before this method is called, it returns -1.
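
To make the interaction between preprocess(List) and getLastValidLength() concrete, here is a minimal stand-in class. It is a sketch under assumptions, not the library class: `NormalizerSketch` is a hypothetical name, the `<pad>`, `<eos>` and `<bos>` strings are placeholders, and "valid length" is assumed to mean the number of non-padding tokens kept in the output. Only the -1 initial value and the ordering (EOS and BOS tokens added before the length is normalized) come from the documentation above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in written for illustration; it is not the library class.
class NormalizerSketch {
    private final int sentenceLength;
    private final boolean addEosBosTokens;
    // Placeholder token strings chosen for this sketch.
    private final String paddingToken = "<pad>";
    private final String eosToken = "<eos>";
    private final String bosToken = "<bos>";
    // Stays at -1 until preprocess has been called at least once.
    private int lastValidLength = -1;

    NormalizerSketch(int sentenceLength, boolean addEosBosTokens) {
        this.sentenceLength = sentenceLength;
        this.addEosBosTokens = addEosBosTokens;
    }

    List<String> preprocess(List<String> tokens) {
        // Add the BOS/EOS markers first, then normalize the length.
        List<String> marked = new ArrayList<>(tokens);
        if (addEosBosTokens) {
            marked.add(0, bosToken);
            marked.add(eosToken);
        }
        // Assumption: valid length = number of non-padding tokens in the output.
        lastValidLength = Math.min(marked.size(), sentenceLength);
        List<String> out = new ArrayList<>(marked.subList(0, lastValidLength));
        while (out.size() < sentenceLength) {
            out.add(paddingToken);
        }
        return out;
    }

    int getLastValidLength() {
        return lastValidLength;
    }

    public static void main(String[] args) {
        NormalizerSketch normalizer = new NormalizerSketch(6, true);
        System.out.println(normalizer.getLastValidLength()); // -1
        System.out.println(normalizer.preprocess(Arrays.asList("the", "cat", "sat")));
        // [<bos>, the, cat, sat, <eos>, <pad>]
        System.out.println(normalizer.getLastValidLength()); // 5
    }
}
```

Note that in this sketch a sentence longer than the fixed length loses its trailing EOS token to truncation; whether the real class behaves the same way is not stated in this page.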