Package sentencepiece
Interface SentencepieceModel.TrainerSpecOrBuilder
-
- All Superinterfaces:
com.google.protobuf.GeneratedMessageV3.ExtendableMessageOrBuilder<SentencepieceModel.TrainerSpec>, com.google.protobuf.MessageLiteOrBuilder, com.google.protobuf.MessageOrBuilder
- All Known Implementing Classes:
SentencepieceModel.TrainerSpec, SentencepieceModel.TrainerSpec.Builder
- Enclosing class:
- SentencepieceModel
public static interface SentencepieceModel.TrainerSpecOrBuilder extends com.google.protobuf.GeneratedMessageV3.ExtendableMessageOrBuilder<SentencepieceModel.TrainerSpec>
-
-
Method Summary
Modifier and Type, Method
Description
java.lang.String getAcceptLanguage(int index)
List of the languages this model can accept.
com.google.protobuf.ByteString getAcceptLanguageBytes(int index)
List of the languages this model can accept.
int getAcceptLanguageCount()
List of the languages this model can accept.
java.util.List<java.lang.String> getAcceptLanguageList()
List of the languages this model can accept.
boolean getAllowWhitespaceOnlyPieces()
Allows pieces that only contain whitespaces instead of appearing only as prefix or suffix of other pieces.
int getBosId()
<s>
java.lang.String getBosPiece()
optional string bos_piece = 46 [default = "<s>"];
com.google.protobuf.ByteString getBosPieceBytes()
optional string bos_piece = 46 [default = "<s>"];
boolean getByteFallback()
Decomposes unknown pieces into UTF-8 bytes.
float getCharacterCoverage()
Training parameters.
java.lang.String getControlSymbols(int index)
Vocabulary management. Defines control symbols used as an indicator to change the behavior of the decoder.
com.google.protobuf.ByteString getControlSymbolsBytes(int index)
Vocabulary management. Defines control symbols used as an indicator to change the behavior of the decoder.
int getControlSymbolsCount()
Vocabulary management. Defines control symbols used as an indicator to change the behavior of the decoder.
java.util.List<java.lang.String> getControlSymbolsList()
Vocabulary management. Defines control symbols used as an indicator to change the behavior of the decoder.
int getEosId()
</s>
java.lang.String getEosPiece()
optional string eos_piece = 47 [default = "</s>"];
com.google.protobuf.ByteString getEosPieceBytes()
optional string eos_piece = 47 [default = "</s>"];
boolean getHardVocabLimit()
`vocab_size` is treated as a hard limit.
java.lang.String getInput(int index)
General parameters. Input corpus files.
com.google.protobuf.ByteString getInputBytes(int index)
General parameters. Input corpus files.
int getInputCount()
General parameters. Input corpus files.
java.lang.String getInputFormat()
Input corpus format: "text": one-sentence-per-line text format (default); "tsv": sentence <tab> freq.
com.google.protobuf.ByteString getInputFormatBytes()
Input corpus format: "text": one-sentence-per-line text format (default); "tsv": sentence <tab> freq.
java.util.List<java.lang.String> getInputList()
General parameters. Input corpus files.
long getInputSentenceSize()
Maximum number of sentences the trainer loads from the `input` parameter.
int getMaxSentenceLength()
The maximum sentence length in bytes.
int getMaxSentencepieceLength()
SentencePiece parameters which control the shapes of sentence pieces.
int getMiningSentenceSize()
Deprecated.
java.lang.String getModelPrefix()
Output model file prefix.
com.google.protobuf.ByteString getModelPrefixBytes()
Output model file prefix.
SentencepieceModel.TrainerSpec.ModelType getModelType()
optional .sentencepiece.TrainerSpec.ModelType model_type = 3 [default = UNIGRAM];
int getNumSubIterations()
Number of EM sub iterations.
int getNumThreads()
Number of threads in the training.
int getPadId()
<pad> (padding)
java.lang.String getPadPiece()
optional string pad_piece = 48 [default = "<pad>"];
com.google.protobuf.ByteString getPadPieceBytes()
optional string pad_piece = 48 [default = "<pad>"];
java.lang.String getRequiredChars()
Defines required characters.
com.google.protobuf.ByteString getRequiredCharsBytes()
Defines required characters.
int getSeedSentencepieceSize()
The size of seed sentencepieces.
int getSelfTestSampleSize()
Size of self-test samples, which are encoded in the model file.
float getShrinkingFactor()
In every EM sub-iteration, keeps the top `shrinking_factor` * `current sentencepieces size` pieces with respect to the loss of the sentence piece.
boolean getShuffleInputSentence()
optional bool shuffle_input_sentence = 19 [default = true];
boolean getSplitByNumber()
When `split_by_number` is true, puts a boundary at each transition between number and non-number.
boolean getSplitByUnicodeScript()
Uses Unicode script to split sentence pieces.
boolean getSplitByWhitespace()
Uses white space to split sentence pieces.
boolean getSplitDigits()
Splits all digits (0-9) into separate pieces.
boolean getTrainExtremelyLargeCorpus()
Increases bit depth to allow unigram model training on large (>10M sentences) corpora.
int getTrainingSentenceSize()
Deprecated.
boolean getTreatWhitespaceAsSuffix()
Adds whitespace symbol (_) as a suffix instead of prefix.
int getUnkId()
Reserved special meta tokens.
java.lang.String getUnkPiece()
optional string unk_piece = 45 [default = "<unk>"];
com.google.protobuf.ByteString getUnkPieceBytes()
optional string unk_piece = 45 [default = "<unk>"];
java.lang.String getUnkSurface()
Encodes <unk> into U+2047 (DOUBLE QUESTION MARK), since this character can be useful both for user and developer.
com.google.protobuf.ByteString getUnkSurfaceBytes()
Encodes <unk> into U+2047 (DOUBLE QUESTION MARK), since this character can be useful both for user and developer.
boolean getUseAllVocab()
Uses all symbols for vocab extraction.
java.lang.String getUserDefinedSymbols(int index)
Defines user-defined symbols.
com.google.protobuf.ByteString getUserDefinedSymbolsBytes(int index)
Defines user-defined symbols.
int getUserDefinedSymbolsCount()
Defines user-defined symbols.
java.util.List<java.lang.String> getUserDefinedSymbolsList()
Defines user-defined symbols.
int getVocabSize()
Vocabulary size.
boolean getVocabularyOutputPieceScore()
When creating the vocabulary file, defines whether or not to additionally output the score for each piece.
boolean hasAllowWhitespaceOnlyPieces()
Allows pieces that only contain whitespaces instead of appearing only as prefix or suffix of other pieces.
boolean hasBosId()
<s>
boolean hasBosPiece()
optional string bos_piece = 46 [default = "<s>"];
boolean hasByteFallback()
Decomposes unknown pieces into UTF-8 bytes.
boolean hasCharacterCoverage()
Training parameters.
boolean hasEosId()
</s>
boolean hasEosPiece()
optional string eos_piece = 47 [default = "</s>"];
boolean hasHardVocabLimit()
`vocab_size` is treated as a hard limit.
boolean hasInputFormat()
Input corpus format: "text": one-sentence-per-line text format (default); "tsv": sentence <tab> freq.
boolean hasInputSentenceSize()
Maximum number of sentences the trainer loads from the `input` parameter.
boolean hasMaxSentenceLength()
The maximum sentence length in bytes.
boolean hasMaxSentencepieceLength()
SentencePiece parameters which control the shapes of sentence pieces.
boolean hasMiningSentenceSize()
Deprecated.
boolean hasModelPrefix()
Output model file prefix.
boolean hasModelType()
optional .sentencepiece.TrainerSpec.ModelType model_type = 3 [default = UNIGRAM];
boolean hasNumSubIterations()
Number of EM sub iterations.
boolean hasNumThreads()
Number of threads in the training.
boolean hasPadId()
<pad> (padding)
boolean hasPadPiece()
optional string pad_piece = 48 [default = "<pad>"];
boolean hasRequiredChars()
Defines required characters.
boolean hasSeedSentencepieceSize()
The size of seed sentencepieces.
boolean hasSelfTestSampleSize()
Size of self-test samples, which are encoded in the model file.
boolean hasShrinkingFactor()
In every EM sub-iteration, keeps the top `shrinking_factor` * `current sentencepieces size` pieces with respect to the loss of the sentence piece.
boolean hasShuffleInputSentence()
optional bool shuffle_input_sentence = 19 [default = true];
boolean hasSplitByNumber()
When `split_by_number` is true, puts a boundary at each transition between number and non-number.
boolean hasSplitByUnicodeScript()
Uses Unicode script to split sentence pieces.
boolean hasSplitByWhitespace()
Uses white space to split sentence pieces.
boolean hasSplitDigits()
Splits all digits (0-9) into separate pieces.
boolean hasTrainExtremelyLargeCorpus()
Increases bit depth to allow unigram model training on large (>10M sentences) corpora.
boolean hasTrainingSentenceSize()
Deprecated.
boolean hasTreatWhitespaceAsSuffix()
Adds whitespace symbol (_) as a suffix instead of prefix.
boolean hasUnkId()
Reserved special meta tokens.
boolean hasUnkPiece()
optional string unk_piece = 45 [default = "<unk>"];
boolean hasUnkSurface()
Encodes <unk> into U+2047 (DOUBLE QUESTION MARK), since this character can be useful both for user and developer.
boolean hasUseAllVocab()
Uses all symbols for vocab extraction.
boolean hasVocabSize()
Vocabulary size.
boolean hasVocabularyOutputPieceScore()
When creating the vocabulary file, defines whether or not to additionally output the score for each piece.
Methods inherited from interface com.google.protobuf.GeneratedMessageV3.ExtendableMessageOrBuilder
getDefaultInstanceForType, getExtension, getExtension, getExtension, getExtension, getExtension, getExtension, getExtensionCount, getExtensionCount, getExtensionCount, hasExtension, hasExtension, hasExtension
-
-
-
-
Method Detail
-
getInputList
java.util.List<java.lang.String> getInputList()
General parameters. Input corpus files. The trainer accepts the following two formats: A) Monolingual: plain text, one sentence per line. B) Bilingual: TSV, source sentence <tab> target sentence. When bilingual data is passed, a shared vocabulary model is built. Note that the input file must be a raw corpus, not a preprocessed corpus. The trainer only loads the first `input_sentence_size` sentences specified with this parameter.
repeated string input = 1;
- Returns:
- A list containing the input.
-
getInputCount
int getInputCount()
General parameters. Input corpus files. The trainer accepts the following two formats: A) Monolingual: plain text, one sentence per line. B) Bilingual: TSV, source sentence <tab> target sentence. When bilingual data is passed, a shared vocabulary model is built. Note that the input file must be a raw corpus, not a preprocessed corpus. The trainer only loads the first `input_sentence_size` sentences specified with this parameter.
repeated string input = 1;
- Returns:
- The count of input.
-
getInput
java.lang.String getInput(int index)
General parameters. Input corpus files. The trainer accepts the following two formats: A) Monolingual: plain text, one sentence per line. B) Bilingual: TSV, source sentence <tab> target sentence. When bilingual data is passed, a shared vocabulary model is built. Note that the input file must be a raw corpus, not a preprocessed corpus. The trainer only loads the first `input_sentence_size` sentences specified with this parameter.
repeated string input = 1;
- Parameters:
index - The index of the element to return.
- Returns:
- The input at the given index.
-
getInputBytes
com.google.protobuf.ByteString getInputBytes(int index)
General parameters. Input corpus files. The trainer accepts the following two formats: A) Monolingual: plain text, one sentence per line. B) Bilingual: TSV, source sentence <tab> target sentence. When bilingual data is passed, a shared vocabulary model is built. Note that the input file must be a raw corpus, not a preprocessed corpus. The trainer only loads the first `input_sentence_size` sentences specified with this parameter.
repeated string input = 1;
- Parameters:
index - The index of the value to return.
- Returns:
- The bytes of the input at the given index.
-
hasInputFormat
boolean hasInputFormat()
Input corpus format: "text": one-sentence-per-line text format (default); "tsv": sentence <tab> freq.
optional string input_format = 7;
- Returns:
- Whether the inputFormat field is set.
-
getInputFormat
java.lang.String getInputFormat()
Input corpus format: "text": one-sentence-per-line text format (default); "tsv": sentence <tab> freq.
optional string input_format = 7;
- Returns:
- The inputFormat.
-
getInputFormatBytes
com.google.protobuf.ByteString getInputFormatBytes()
Input corpus format: "text": one-sentence-per-line text format (default); "tsv": sentence <tab> freq.
optional string input_format = 7;
- Returns:
- The bytes for inputFormat.
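The two `input_format` values could be handled along the following lines. This is a minimal sketch, not the trainer's own parser: `InputFormatSketch` and `parseLine` are hypothetical names, and it assumes that a "text" line counts once and that the frequency in a "tsv" line is an integer after the last tab.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.Map;

final class InputFormatSketch {
    // Parses one corpus line into (sentence, frequency).
    // Assumption: "text" lines count once; "tsv" lines are "sentence<TAB>freq".
    static Map.Entry<String, Long> parseLine(String line, String inputFormat) {
        if ("tsv".equals(inputFormat)) {
            int tab = line.lastIndexOf('\t');
            if (tab < 0) {
                throw new IllegalArgumentException("tsv line has no tab: " + line);
            }
            long freq = Long.parseLong(line.substring(tab + 1).trim());
            return new SimpleEntry<>(line.substring(0, tab), freq);
        }
        // Default "text" format: the whole line is one sentence.
        return new SimpleEntry<>(line, 1L);
    }

    public static void main(String[] args) {
        System.out.println(parseLine("hello world", "text"));
        System.out.println(parseLine("hello world\t42", "tsv"));
    }
}
```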
-
hasModelPrefix
boolean hasModelPrefix()
Output model file prefix. <model_prefix>.model and <model_prefix>.vocab are generated.
optional string model_prefix = 2;
- Returns:
- Whether the modelPrefix field is set.
-
getModelPrefix
java.lang.String getModelPrefix()
Output model file prefix. <model_prefix>.model and <model_prefix>.vocab are generated.
optional string model_prefix = 2;
- Returns:
- The modelPrefix.
-
getModelPrefixBytes
com.google.protobuf.ByteString getModelPrefixBytes()
Output model file prefix. <model_prefix>.model and <model_prefix>.vocab are generated.
optional string model_prefix = 2;
- Returns:
- The bytes for modelPrefix.
-
hasModelType
boolean hasModelType()
optional .sentencepiece.TrainerSpec.ModelType model_type = 3 [default = UNIGRAM];
- Returns:
- Whether the modelType field is set.
-
getModelType
SentencepieceModel.TrainerSpec.ModelType getModelType()
optional .sentencepiece.TrainerSpec.ModelType model_type = 3 [default = UNIGRAM];
- Returns:
- The modelType.
-
hasVocabSize
boolean hasVocabSize()
Vocabulary size. 8k is the default size.
optional int32 vocab_size = 4 [default = 8000];
- Returns:
- Whether the vocabSize field is set.
-
getVocabSize
int getVocabSize()
Vocabulary size. 8k is the default size.
optional int32 vocab_size = 4 [default = 8000];
- Returns:
- The vocabSize.
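The relationship between a `has...` and a `get...` accessor for an optional field with a default can be sketched as follows. `OptionalFieldSketch` is a hypothetical stand-in written for illustration, not the generated class; it only mirrors the documented default of 8000 and the usual proto2 behavior that the getter returns the default while the field is unset.

```java
final class OptionalFieldSketch {
    // Hypothetical stand-in for: optional int32 vocab_size = 4 [default = 8000];
    private static final int DEFAULT_VOCAB_SIZE = 8000;
    private boolean vocabSizeSet = false;
    private int vocabSize = DEFAULT_VOCAB_SIZE;

    // Reports whether the field was explicitly set.
    boolean hasVocabSize() {
        return vocabSizeSet;
    }

    // Returns the explicit value, or the default when the field is unset.
    int getVocabSize() {
        return vocabSize;
    }

    // In generated code the setter lives on the Builder; it is inlined here.
    void setVocabSize(int value) {
        vocabSizeSet = true;
        vocabSize = value;
    }

    public static void main(String[] args) {
        OptionalFieldSketch spec = new OptionalFieldSketch();
        System.out.println(spec.hasVocabSize() + " " + spec.getVocabSize()); // false 8000
        spec.setVocabSize(32000);
        System.out.println(spec.hasVocabSize() + " " + spec.getVocabSize()); // true 32000
    }
}
```

Checking `hasVocabSize()` first is only needed when the caller must distinguish "left at the default" from "explicitly set to 8000".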
-
getAcceptLanguageList
java.util.List<java.lang.String> getAcceptLanguageList()
List of the languages this model can accept. Since the model is language-agnostic, this field is used as a reference.
repeated string accept_language = 5;
- Returns:
- A list containing the acceptLanguage.
-
getAcceptLanguageCount
int getAcceptLanguageCount()
List of the languages this model can accept. Since the model is language-agnostic, this field is used as a reference.
repeated string accept_language = 5;
- Returns:
- The count of acceptLanguage.
-
getAcceptLanguage
java.lang.String getAcceptLanguage(int index)
List of the languages this model can accept. Since the model is language-agnostic, this field is used as a reference.
repeated string accept_language = 5;
- Parameters:
index - The index of the element to return.
- Returns:
- The acceptLanguage at the given index.
-
getAcceptLanguageBytes
com.google.protobuf.ByteString getAcceptLanguageBytes(int index)
List of the languages this model can accept. Since the model is language-agnostic, this field is used as a reference.
repeated string accept_language = 5;
- Parameters:
index - The index of the value to return.
- Returns:
- The bytes of the acceptLanguage at the given index.
-
hasSelfTestSampleSize
boolean hasSelfTestSampleSize()
Size of self-test samples, which are encoded in the model file.
optional int32 self_test_sample_size = 6 [default = 0];
- Returns:
- Whether the selfTestSampleSize field is set.
-
getSelfTestSampleSize
int getSelfTestSampleSize()
Size of self-test samples, which are encoded in the model file.
optional int32 self_test_sample_size = 6 [default = 0];
- Returns:
- The selfTestSampleSize.
-
hasCharacterCoverage
boolean hasCharacterCoverage()
Training parameters. Uses characters which cover the corpus with the ratio of `character_coverage`. This parameter determines the basic alphabet of sentence pieces. The remaining 1.0 - `character_coverage` fraction of characters is treated as UNK. See also the required_chars field.
optional float character_coverage = 10 [default = 0.9995];
- Returns:
- Whether the characterCoverage field is set.
-
getCharacterCoverage
float getCharacterCoverage()
Training parameters. Uses characters which cover the corpus with the ratio of `character_coverage`. This parameter determines the basic alphabet of sentence pieces. The remaining 1.0 - `character_coverage` fraction of characters is treated as UNK. See also the required_chars field.
optional float character_coverage = 10 [default = 0.9995];
- Returns:
- The characterCoverage.
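One way to read the coverage ratio is as a cut-off on cumulative character frequency. The sketch below assumes that interpretation: it keeps the most frequent code points until they account for at least the requested share of all occurrences. `CharacterCoverageSketch` and `alphabet` are hypothetical names, and the real trainer may select and break ties differently.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

final class CharacterCoverageSketch {
    // Keeps the most frequent code points until they account for at least
    // `coverage` of all occurrences; the rest would be treated as UNK.
    static Set<Integer> alphabet(List<String> corpus, double coverage) {
        Map<Integer, Long> counts = new TreeMap<>();
        long total = 0;
        for (String sentence : corpus) {
            for (int cp : sentence.codePoints().toArray()) {
                counts.merge(cp, 1L, Long::sum);
                total++;
            }
        }
        List<Map.Entry<Integer, Long>> sorted = new ArrayList<>(counts.entrySet());
        // Most frequent first; the stable sort keeps code point order for ties.
        sorted.sort((a, b) -> Long.compare(b.getValue(), a.getValue()));
        Set<Integer> kept = new LinkedHashSet<>();
        long covered = 0;
        for (Map.Entry<Integer, Long> e : sorted) {
            if (total == 0 || (double) covered / total >= coverage) {
                break;
            }
            kept.add(e.getKey());
            covered += e.getValue();
        }
        return kept;
    }

    public static void main(String[] args) {
        List<String> corpus = Arrays.asList("aaaa", "aaab", "ac");
        System.out.println(alphabet(corpus, 0.8).size());
        System.out.println(alphabet(corpus, 1.0).size());
    }
}
```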
-
hasInputSentenceSize
boolean hasInputSentenceSize()
Maximum number of sentences the trainer loads from the `input` parameter. The trainer simply loads the `input` files in sequence, so it is better to shuffle the input corpus randomly.
optional uint64 input_sentence_size = 11 [default = 0];
- Returns:
- Whether the inputSentenceSize field is set.
-
getInputSentenceSize
long getInputSentenceSize()
Maximum number of sentences the trainer loads from the `input` parameter. The trainer simply loads the `input` files in sequence, so it is better to shuffle the input corpus randomly.
optional uint64 input_sentence_size = 11 [default = 0];
- Returns:
- The inputSentenceSize.
-
hasShuffleInputSentence
boolean hasShuffleInputSentence()
optional bool shuffle_input_sentence = 19 [default = true];
- Returns:
- Whether the shuffleInputSentence field is set.
-
getShuffleInputSentence
boolean getShuffleInputSentence()
optional bool shuffle_input_sentence = 19 [default = true];
- Returns:
- The shuffleInputSentence.
-
hasMiningSentenceSize
@Deprecated boolean hasMiningSentenceSize()
Deprecated. Maximum number of sentences used to make seed sentence pieces. An extended suffix array is constructed to extract frequent substrings from the corpus. This uses 20N working space, where N is the size of the corpus.
optional int32 mining_sentence_size = 12 [deprecated = true];
- Returns:
- Whether the miningSentenceSize field is set.
-
getMiningSentenceSize
@Deprecated int getMiningSentenceSize()
Deprecated. Maximum number of sentences used to make seed sentence pieces. An extended suffix array is constructed to extract frequent substrings from the corpus. This uses 20N working space, where N is the size of the corpus.
optional int32 mining_sentence_size = 12 [deprecated = true];
- Returns:
- The miningSentenceSize.
-
hasTrainingSentenceSize
@Deprecated boolean hasTrainingSentenceSize()
Deprecated. Maximum number of sentences used to train sentence pieces.
optional int32 training_sentence_size = 13 [deprecated = true];
- Returns:
- Whether the trainingSentenceSize field is set.
-
getTrainingSentenceSize
@Deprecated int getTrainingSentenceSize()
Deprecated. Maximum number of sentences used to train sentence pieces.
optional int32 training_sentence_size = 13 [deprecated = true];
- Returns:
- The trainingSentenceSize.
-
hasSeedSentencepieceSize
boolean hasSeedSentencepieceSize()
The size of seed sentencepieces. `seed_sentencepiece_size` must be larger than `vocab_size`.
optional int32 seed_sentencepiece_size = 14 [default = 1000000];
- Returns:
- Whether the seedSentencepieceSize field is set.
-
getSeedSentencepieceSize
int getSeedSentencepieceSize()
The size of seed sentencepieces. `seed_sentencepiece_size` must be larger than `vocab_size`.
optional int32 seed_sentencepiece_size = 14 [default = 1000000];
- Returns:
- The seedSentencepieceSize.
-
hasShrinkingFactor
boolean hasShrinkingFactor()
In every EM sub-iteration, keeps the top `shrinking_factor` * `current sentencepieces size` pieces with respect to the loss of the sentence piece. This value should be smaller than 1.0.
optional float shrinking_factor = 15 [default = 0.75];
- Returns:
- Whether the shrinkingFactor field is set.
-
getShrinkingFactor
float getShrinkingFactor()
In every EM sub-iteration, keeps the top `shrinking_factor` * `current sentencepieces size` pieces with respect to the loss of the sentence piece. This value should be smaller than 1.0.
optional float shrinking_factor = 15 [default = 0.75];
- Returns:
- The shrinkingFactor.
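To get a feel for how `shrinking_factor` interacts with `seed_sentencepiece_size` and `vocab_size`, the sketch below lists the piece counts after successive pruning rounds. It assumes that pruning simply repeats until `vocab_size` is reached, which may not match the trainer's actual stopping rule; `ShrinkingScheduleSketch` and `schedule` are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;

final class ShrinkingScheduleSketch {
    // Lists the piece counts after each pruning round, starting from the seed
    // size and keeping shrinkingFactor * current size, floored at vocabSize.
    static List<Integer> schedule(int seedSize, int vocabSize, double shrinkingFactor) {
        if (vocabSize <= 0 || seedSize <= vocabSize) {
            throw new IllegalArgumentException("seed size must be larger than a positive vocab size");
        }
        // Stricter than "should be smaller than 1.0": needed so the loop ends.
        if (shrinkingFactor <= 0.0 || shrinkingFactor >= 1.0) {
            throw new IllegalArgumentException("shrinking factor must be in (0, 1)");
        }
        List<Integer> sizes = new ArrayList<>();
        int current = seedSize;
        while (current > vocabSize) {
            current = Math.max(vocabSize, (int) (current * shrinkingFactor));
            sizes.add(current);
        }
        return sizes;
    }

    public static void main(String[] args) {
        // Small numbers for readability; the documented defaults are far larger.
        System.out.println(schedule(1000, 500, 0.75)); // [750, 562, 500]
    }
}
```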
-
hasMaxSentenceLength
boolean hasMaxSentenceLength()
The maximum sentence length in bytes. Sentences longer than `max_sentence_length` are simply ignored. Longer input tends to bring the following risks:
* Overflow during EM training (unigram language model only).
* Performance drop because of the O(n log n) cost in BPE.
optional int32 max_sentence_length = 18 [default = 4192];
- Returns:
- Whether the maxSentenceLength field is set.
-
getMaxSentenceLength
int getMaxSentenceLength()
The maximum sentence length in bytes. Sentences longer than `max_sentence_length` are simply ignored. Longer input tends to bring the following risks:
* Overflow during EM training (unigram language model only).
* Performance drop because of the O(n log n) cost in BPE.
optional int32 max_sentence_length = 18 [default = 4192];
- Returns:
- The maxSentenceLength.
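Because the limit is expressed in bytes, multi-byte text reaches it sooner than its character count suggests. The following sketch filters sentences by their UTF-8 length; `MaxSentenceLengthSketch` and `dropLongSentences` are hypothetical names, and the assumption that a sentence exactly at the limit is kept is mine.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class MaxSentenceLengthSketch {
    // Keeps only sentences whose UTF-8 encoding fits within the limit.
    static List<String> dropLongSentences(List<String> sentences, int maxSentenceLength) {
        List<String> kept = new ArrayList<>();
        for (String s : sentences) {
            // Length is measured in UTF-8 bytes, not in chars or code points.
            if (s.getBytes(StandardCharsets.UTF_8).length <= maxSentenceLength) {
                kept.add(s);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // "abc" is 3 bytes; the three CJK characters take 9 bytes in UTF-8.
        List<String> input = Arrays.asList("abc", "\u65e5\u672c\u8a9e");
        System.out.println(dropLongSentences(input, 5).size()); // 1
    }
}
```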
-
hasNumThreads
boolean hasNumThreads()
Number of threads in the training.
optional int32 num_threads = 16 [default = 16];
- Returns:
- Whether the numThreads field is set.
-
getNumThreads
int getNumThreads()
Number of threads in the training.
optional int32 num_threads = 16 [default = 16];
- Returns:
- The numThreads.
-
hasNumSubIterations
boolean hasNumSubIterations()
Number of EM sub iterations.
optional int32 num_sub_iterations = 17 [default = 2];
- Returns:
- Whether the numSubIterations field is set.
-
getNumSubIterations
int getNumSubIterations()
Number of EM sub iterations.
optional int32 num_sub_iterations = 17 [default = 2];
- Returns:
- The numSubIterations.
-
hasMaxSentencepieceLength
boolean hasMaxSentencepieceLength()
SentencePiece parameters which control the shapes of sentence pieces. Maximum length of a sentence piece.
optional int32 max_sentencepiece_length = 20 [default = 16];
- Returns:
- Whether the maxSentencepieceLength field is set.
-
getMaxSentencepieceLength
int getMaxSentencepieceLength()
SentencePiece parameters which control the shapes of sentence pieces. Maximum length of a sentence piece.
optional int32 max_sentencepiece_length = 20 [default = 16];
- Returns:
- The maxSentencepieceLength.
-
hasSplitByUnicodeScript
boolean hasSplitByUnicodeScript()
Uses Unicode script to split sentence pieces. When `split_by_unicode_script` is true, a sentence piece may not include multiple Unicode scripts, e.g. "F1" is not a valid piece. Exception: CJ characters (Hiragana/Katakana/Han) are all handled as one script type, since a Japanese word can consist of multiple scripts. This exception is always applied regardless of the accept-language parameter.
optional bool split_by_unicode_script = 21 [default = true];
- Returns:
- Whether the splitByUnicodeScript field is set.
-
getSplitByUnicodeScript
boolean getSplitByUnicodeScript()
Uses Unicode script to split sentence pieces. When `split_by_unicode_script` is true, a sentence piece may not include multiple Unicode scripts, e.g. "F1" is not a valid piece. Exception: CJ characters (Hiragana/Katakana/Han) are all handled as one script type, since a Japanese word can consist of multiple scripts. This exception is always applied regardless of the accept-language parameter.
optional bool split_by_unicode_script = 21 [default = true];
- Returns:
- The splitByUnicodeScript.
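The single-script rule and its CJ exception can be approximated with the JDK's `Character.UnicodeScript`. This is only a sketch: `UnicodeScriptSketch` and its methods are hypothetical, and it relies on Java's script table (where digits are `COMMON`), which may classify some characters differently from the trainer.

```java
import java.lang.Character.UnicodeScript;

final class UnicodeScriptSketch {
    // Maps Hiragana, Katakana and Han to one class, as the exception describes.
    static UnicodeScript scriptClass(int codePoint) {
        UnicodeScript s = UnicodeScript.of(codePoint);
        if (s == UnicodeScript.HIRAGANA || s == UnicodeScript.KATAKANA) {
            return UnicodeScript.HAN;
        }
        return s;
    }

    // True when every code point of the piece falls in the same script class.
    static boolean isSingleScript(String piece) {
        return piece.codePoints()
                .map(cp -> scriptClass(cp).ordinal())
                .distinct()
                .count() <= 1;
    }

    public static void main(String[] args) {
        System.out.println(isSingleScript("F1"));    // false: LATIN and COMMON
        System.out.println(isSingleScript("hello")); // true
        // Two Han characters followed by two Hiragana characters.
        System.out.println(isSingleScript("\u6f22\u5b57\u304b\u306a")); // true
    }
}
```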
-
hasSplitByNumber
boolean hasSplitByNumber()
When `split_by_number` is true, puts a boundary at each transition between number and non-number. To treat "F1" as one token, set this flag to false.
optional bool split_by_number = 23 [default = true];
- Returns:
- Whether the splitByNumber field is set.
-
getSplitByNumber
boolean getSplitByNumber()
When `split_by_number` is true, puts a boundary at each transition between number and non-number. To treat "F1" as one token, set this flag to false.
optional bool split_by_number = 23 [default = true];
- Returns:
- The splitByNumber.
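The boundary rule can be illustrated with a small pre-splitting routine. This is a sketch under the assumption that "number" means a run of digits as reported by `Character.isDigit`; `SplitByNumberSketch` and `split` are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;

final class SplitByNumberSketch {
    // Cuts the text wherever a digit run meets a non-digit run.
    static List<String> split(String text) {
        List<String> parts = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        boolean prevDigit = false;
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            boolean digit = Character.isDigit(c);
            // A change between digit and non-digit closes the current run.
            if (current.length() > 0 && digit != prevDigit) {
                parts.add(current.toString());
                current.setLength(0);
            }
            current.append(c);
            prevDigit = digit;
        }
        if (current.length() > 0) {
            parts.add(current.toString());
        }
        return parts;
    }

    public static void main(String[] args) {
        System.out.println(split("F1"));        // [F, 1]
        System.out.println(split("abc123def")); // [abc, 123, def]
    }
}
```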
-
hasSplitByWhitespace
boolean hasSplitByWhitespace()
Uses white space to split sentence pieces. When `split_by_whitespace` is false, a piece may contain white space in the middle, e.g., "in_the".
optional bool split_by_whitespace = 22 [default = true];
- Returns:
- Whether the splitByWhitespace field is set.
-
getSplitByWhitespace
boolean getSplitByWhitespace()
Uses white space to split sentence pieces. When `split_by_whitespace` is false, a piece may contain white space in the middle, e.g., "in_the".
optional bool split_by_whitespace = 22 [default = true];
- Returns:
- The splitByWhitespace.
-
hasTreatWhitespaceAsSuffix
boolean hasTreatWhitespaceAsSuffix()
Adds the whitespace symbol (_) as a suffix instead of a prefix, e.g., _hello => hello_. When `treat_whitespace_as_suffix` is true, NormalizerSpec::add_dummy_prefix will add the dummy whitespace to the end of the sentence.
optional bool treat_whitespace_as_suffix = 24 [default = false];
- Returns:
- Whether the treatWhitespaceAsSuffix field is set.
-
getTreatWhitespaceAsSuffix
boolean getTreatWhitespaceAsSuffix()
Adds the whitespace symbol (_) as a suffix instead of a prefix, e.g., _hello => hello_. When `treat_whitespace_as_suffix` is true, NormalizerSpec::add_dummy_prefix will add the dummy whitespace to the end of the sentence.
optional bool treat_whitespace_as_suffix = 24 [default = false];
- Returns:
- The treatWhitespaceAsSuffix.
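The effect of the flag on word-level marking can be sketched as follows. The code writes the marker as "_" because this page does, assumes the dummy whitespace is enabled so every word carries a marker, and joins the marked words with a space purely for display; `WhitespaceMarkerSketch` and `mark` are hypothetical names.

```java
final class WhitespaceMarkerSketch {
    // The page writes the whitespace symbol as "_", so this sketch does too.
    static final String MARKER = "_";

    // Attaches the marker before each word, or after it in suffix mode.
    static String mark(String sentence, boolean treatWhitespaceAsSuffix) {
        String[] words = sentence.trim().split("\\s+");
        StringBuilder out = new StringBuilder();
        for (String w : words) {
            if (w.isEmpty()) {
                continue; // an empty input yields one empty token
            }
            if (out.length() > 0) {
                out.append(' '); // display separator only
            }
            out.append(treatWhitespaceAsSuffix ? w + MARKER : MARKER + w);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(mark("hello world", false)); // _hello _world
        System.out.println(mark("hello world", true));  // hello_ world_
    }
}
```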
-
hasAllowWhitespaceOnlyPieces
boolean hasAllowWhitespaceOnlyPieces()
Allows pieces that only contain whitespaces instead of appearing only as prefix or suffix of other pieces.
optional bool allow_whitespace_only_pieces = 26 [default = false];
- Returns:
- Whether the allowWhitespaceOnlyPieces field is set.
-
getAllowWhitespaceOnlyPieces
boolean getAllowWhitespaceOnlyPieces()
Allows pieces that only contain whitespaces instead of appearing only as prefix or suffix of other pieces.
optional bool allow_whitespace_only_pieces = 26 [default = false];
- Returns:
- The allowWhitespaceOnlyPieces.
-
hasSplitDigits
boolean hasSplitDigits()
Split all digits (0-9) into separate pieces.
optional bool split_digits = 25 [default = false];
- Returns:
- Whether the splitDigits field is set.
-
getSplitDigits
boolean getSplitDigits()
Split all digits (0-9) into separate pieces.
optional bool split_digits = 25 [default = false];
- Returns:
- The splitDigits.
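Unlike `split_by_number`, which keeps a run of digits together, `split_digits` isolates each digit. A minimal sketch of that pre-splitting step follows; `SplitDigitsSketch` and `split` are hypothetical names, and grouping the non-digit characters into runs is a simplification for display.

```java
import java.util.ArrayList;
import java.util.List;

final class SplitDigitsSketch {
    // Emits every ASCII digit (0-9) as its own piece and groups other characters.
    static List<String> split(String text) {
        List<String> parts = new ArrayList<>();
        StringBuilder run = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c >= '0' && c <= '9') {
                // Flush the pending non-digit run before the digit piece.
                if (run.length() > 0) {
                    parts.add(run.toString());
                    run.setLength(0);
                }
                parts.add(String.valueOf(c));
            } else {
                run.append(c);
            }
        }
        if (run.length() > 0) {
            parts.add(run.toString());
        }
        return parts;
    }

    public static void main(String[] args) {
        System.out.println(split("v2024")); // [v, 2, 0, 2, 4]
    }
}
```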
-
getControlSymbolsList
java.util.List<java.lang.String> getControlSymbolsList()
Vocabulary management. Defines control symbols used as an indicator to change the behavior of the decoder. <s> and </s> are pre-defined. We can use this field to encode various meta information, including a language indicator in a multilingual model. These symbols are not visible to users, but visible to the decoder. Note that when the input sentence contains control symbols, they are not treated as one token, but segmented into normal pieces. Control symbols must be inserted independently from the segmentation.
repeated string control_symbols = 30;
- Returns:
- A list containing the controlSymbols.
-
getControlSymbolsCount
int getControlSymbolsCount()
Vocabulary management. Defines control symbols used as an indicator to change the behavior of the decoder. <s> and </s> are pre-defined. We can use this field to encode various meta information, including a language indicator in a multilingual model. These symbols are not visible to users, but visible to the decoder. Note that when the input sentence contains control symbols, they are not treated as one token, but segmented into normal pieces. Control symbols must be inserted independently from the segmentation.
repeated string control_symbols = 30;
- Returns:
- The count of controlSymbols.
-
getControlSymbols
java.lang.String getControlSymbols(int index)
Vocabulary management. Defines control symbols used as an indicator to change the behavior of the decoder. <s> and </s> are pre-defined. We can use this field to encode various meta information, including a language indicator in a multilingual model. These symbols are not visible to users, but visible to the decoder. Note that when the input sentence contains control symbols, they are not treated as one token, but segmented into normal pieces. Control symbols must be inserted independently from the segmentation.
repeated string control_symbols = 30;
- Parameters:
index - The index of the element to return.
- Returns:
- The controlSymbols at the given index.
-
getControlSymbolsBytes
com.google.protobuf.ByteString getControlSymbolsBytes(int index)
Vocabulary management. Defines control symbols used as an indicator to change the behavior of the decoder. <s> and </s> are pre-defined. We can use this field to encode various meta information, including a language indicator in a multilingual model. These symbols are not visible to users, but visible to the decoder. Note that when the input sentence contains control symbols, they are not treated as one token, but segmented into normal pieces. Control symbols must be inserted independently from the segmentation.
repeated string control_symbols = 30;
- Parameters:
index - The index of the value to return.
- Returns:
- The bytes of the controlSymbols at the given index.
-
getUserDefinedSymbolsList
java.util.List<java.lang.String> getUserDefinedSymbolsList()
Defines user-defined symbols. These symbols are added with an extremely high score so that they are always treated as one unique symbol in any context. A typical use of user_defined_symbols is as a placeholder for named entities.
repeated string user_defined_symbols = 31;
- Returns:
- A list containing the userDefinedSymbols.
-
getUserDefinedSymbolsCount
int getUserDefinedSymbolsCount()
Defines user-defined symbols. These symbols are added with an extremely high score so that they are always treated as one unique symbol in any context. A typical use of user_defined_symbols is as a placeholder for named entities.
repeated string user_defined_symbols = 31;
- Returns:
- The count of userDefinedSymbols.
-
getUserDefinedSymbols
java.lang.String getUserDefinedSymbols(int index)
Defines user-defined symbols. These symbols are added with an extremely high score so that they are always treated as one unique symbol in any context. A typical use of user_defined_symbols is as a placeholder for named entities.
repeated string user_defined_symbols = 31;
- Parameters:
index - The index of the element to return.
- Returns:
- The userDefinedSymbols at the given index.
-
getUserDefinedSymbolsBytes
com.google.protobuf.ByteString getUserDefinedSymbolsBytes(int index)
Defines user defined symbols. These symbols are added with extremely high score so they are always treated as one unique symbol in any context. Typical usage of user_defined_symbols is placeholder for named entities.
repeated string user_defined_symbols = 31;
- Parameters:
index
- The index of the value to return.
- Returns:
- The bytes of the userDefinedSymbols at the given index.
-
hasRequiredChars
boolean hasRequiredChars()
Defines required characters. Each UTF-8 character in this string is included in the character set regardless of the character_coverage value. Unlike user_defined_symbols, these characters have scores based on their frequency in the input sentences, and the model can form subwords using characters in this field.
optional string required_chars = 36;
- Returns:
- Whether the requiredChars field is set.
-
getRequiredChars
java.lang.String getRequiredChars()
Defines required characters. Each UTF-8 character in this string is included in the character set regardless of the character_coverage value. Unlike user_defined_symbols, these characters have scores based on their frequency in the input sentences, and the model can form subwords using characters in this field.
optional string required_chars = 36;
- Returns:
- The requiredChars.
-
getRequiredCharsBytes
com.google.protobuf.ByteString getRequiredCharsBytes()
Defines required characters. Each UTF-8 character in this string is included in the character set regardless of the character_coverage value. Unlike user_defined_symbols, these characters have scores based on their frequency in the input sentences, and the model can form subwords using characters in this field.
optional string required_chars = 36;
- Returns:
- The bytes for requiredChars.
-
hasByteFallback
boolean hasByteFallback()
Decomposes unknown pieces into UTF-8 bytes.
optional bool byte_fallback = 35 [default = false];
- Returns:
- Whether the byteFallback field is set.
-
getByteFallback
boolean getByteFallback()
Decomposes unknown pieces into UTF-8 bytes.
optional bool byte_fallback = 35 [default = false];
- Returns:
- The byteFallback.
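
To make "decomposes unknown pieces into UTF-8 bytes" concrete, here is a minimal sketch of the decomposition step only. The class and method names are hypothetical and are not part of this interface, and the <0xNN> spelling of byte pieces is an assumption based on the convention commonly seen in SentencePiece vocabularies; this page does not state it.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of SentencepieceModel.
final class ByteFallbackSketch {

    /** Returns one byte piece per UTF-8 byte of the input, spelled <0xNN> (assumed format). */
    public static List<String> toBytePieces(String unknownPiece) {
        List<String> pieces = new ArrayList<>();
        for (byte b : unknownPiece.getBytes(StandardCharsets.UTF_8)) {
            // Mask to 0..255 so bytes above 0x7F are not printed as negative values.
            pieces.add(String.format("<0x%02X>", b & 0xFF));
        }
        return pieces;
    }

    public static void main(String[] args) {
        // U+2047 is three bytes in UTF-8: E2 81 87.
        System.out.println(toBytePieces("\u2047")); // prints [<0xE2>, <0x81>, <0x87>]
    }
}
```

With byte_fallback left at its default of false, no such decomposition takes place; how unknown pieces are then represented is outside the scope of this sketch.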
-
hasVocabularyOutputPieceScore
boolean hasVocabularyOutputPieceScore()
When creating the vocabulary file, defines whether or not to additionally output the score for each piece.
optional bool vocabulary_output_piece_score = 32 [default = true];
- Returns:
- Whether the vocabularyOutputPieceScore field is set.
-
getVocabularyOutputPieceScore
boolean getVocabularyOutputPieceScore()
When creating the vocabulary file, defines whether or not to additionally output the score for each piece.
optional bool vocabulary_output_piece_score = 32 [default = true];
- Returns:
- The vocabularyOutputPieceScore.
-
hasHardVocabLimit
boolean hasHardVocabLimit()
`vocab_size` is treated as a hard limit: training crashes if the model cannot produce a vocabulary of size `vocab_size`. When `hard_vocab_limit` is false, `vocab_size` is treated as a soft limit. Note that when model_type=char, hard_vocab_limit = false is always assumed.
optional bool hard_vocab_limit = 33 [default = true];
- Returns:
- Whether the hardVocabLimit field is set.
-
getHardVocabLimit
boolean getHardVocabLimit()
`vocab_size` is treated as a hard limit: training crashes if the model cannot produce a vocabulary of size `vocab_size`. When `hard_vocab_limit` is false, `vocab_size` is treated as a soft limit. Note that when model_type=char, hard_vocab_limit = false is always assumed.
optional bool hard_vocab_limit = 33 [default = true];
- Returns:
- The hardVocabLimit.
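
The difference between a hard and a soft limit can be sketched as follows. This is an illustration of the rule described above, not the trainer's actual code: the class, the method, and the achievableSize parameter (the number of pieces the trainer could actually produce) are hypothetical.

```java
// Hypothetical illustration of hard vs. soft vocab_size handling.
final class VocabLimitSketch {

    /**
     * Returns the vocabulary size that would be used.
     * requestedSize mirrors vocab_size; achievableSize is an assumed input.
     */
    public static int resolveVocabSize(int requestedSize, int achievableSize, boolean hardVocabLimit) {
        if (achievableSize >= requestedSize) {
            return requestedSize; // The request can be met exactly.
        }
        if (hardVocabLimit) {
            // Hard limit: failing to reach vocab_size is an error.
            throw new IllegalStateException(
                "cannot produce " + requestedSize + " pieces, only " + achievableSize);
        }
        // Soft limit: settle for the smaller vocabulary.
        return achievableSize;
    }

    public static void main(String[] args) {
        System.out.println(resolveVocabSize(8000, 500, false)); // prints 500
    }
}
```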
-
hasUseAllVocab
boolean hasUseAllVocab()
Uses all symbols for vocabulary extraction. This flag is valid if the model type is either CHAR or WORD.
optional bool use_all_vocab = 34 [default = false];
- Returns:
- Whether the useAllVocab field is set.
-
getUseAllVocab
boolean getUseAllVocab()
Uses all symbols for vocabulary extraction. This flag is valid if the model type is either CHAR or WORD.
optional bool use_all_vocab = 34 [default = false];
- Returns:
- The useAllVocab.
-
hasUnkId
boolean hasUnkId()
Reserved special meta tokens. * An id of -1 means the token is not used. * unk_id must not be -1. Ids must start with 0 and be contiguous.
optional int32 unk_id = 40 [default = 0];
- Returns:
- Whether the unkId field is set.
-
getUnkId
int getUnkId()
Reserved special meta tokens. * An id of -1 means the token is not used. * unk_id must not be -1. Ids must start with 0 and be contiguous.
optional int32 unk_id = 40 [default = 0];
- Returns:
- The unkId.
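
The constraints on the reserved ids can be checked with a few lines of code. The sketch below encodes one reading of the rule above (enabled ids are distinct and cover 0..n-1, and -1 disables a token); the trainer's own validation may differ, and the class and method names are hypothetical.

```java
import java.util.TreeSet;

// Hypothetical validator for the unk/bos/eos/pad id settings.
final class ReservedIdSketch {

    public static boolean isValid(int unkId, int bosId, int eosId, int padId) {
        if (unkId == -1) {
            return false; // unk_id must not be -1.
        }
        TreeSet<Integer> enabled = new TreeSet<>();
        for (int id : new int[] {unkId, bosId, eosId, padId}) {
            if (id == -1) {
                continue; // -1 means the token is not used.
            }
            if (id < 0 || !enabled.add(id)) {
                return false; // Other negative ids and duplicates are rejected.
            }
        }
        // n distinct non-negative ids are contiguous from 0 exactly when the largest is n - 1.
        return enabled.last() == enabled.size() - 1;
    }

    public static void main(String[] args) {
        // Defaults from this page: unk_id = 0, bos_id = 1, eos_id = 2, pad_id = -1.
        System.out.println(isValid(0, 1, 2, -1)); // prints true
    }
}
```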
-
hasBosId
boolean hasBosId()
<s>
optional int32 bos_id = 41 [default = 1];
- Returns:
- Whether the bosId field is set.
-
getBosId
int getBosId()
<s>
optional int32 bos_id = 41 [default = 1];
- Returns:
- The bosId.
-
hasEosId
boolean hasEosId()
</s>
optional int32 eos_id = 42 [default = 2];
- Returns:
- Whether the eosId field is set.
-
getEosId
int getEosId()
</s>
optional int32 eos_id = 42 [default = 2];
- Returns:
- The eosId.
-
hasPadId
boolean hasPadId()
<pad> (padding)
optional int32 pad_id = 43 [default = -1];
- Returns:
- Whether the padId field is set.
-
getPadId
int getPadId()
<pad> (padding)
optional int32 pad_id = 43 [default = -1];
- Returns:
- The padId.
-
hasUnkPiece
boolean hasUnkPiece()
optional string unk_piece = 45 [default = "<unk>"];
- Returns:
- Whether the unkPiece field is set.
-
getUnkPiece
java.lang.String getUnkPiece()
optional string unk_piece = 45 [default = "<unk>"];
- Returns:
- The unkPiece.
-
getUnkPieceBytes
com.google.protobuf.ByteString getUnkPieceBytes()
optional string unk_piece = 45 [default = "<unk>"];
- Returns:
- The bytes for unkPiece.
-
hasBosPiece
boolean hasBosPiece()
optional string bos_piece = 46 [default = "<s>"];
- Returns:
- Whether the bosPiece field is set.
-
getBosPiece
java.lang.String getBosPiece()
optional string bos_piece = 46 [default = "<s>"];
- Returns:
- The bosPiece.
-
getBosPieceBytes
com.google.protobuf.ByteString getBosPieceBytes()
optional string bos_piece = 46 [default = "<s>"];
- Returns:
- The bytes for bosPiece.
-
hasEosPiece
boolean hasEosPiece()
optional string eos_piece = 47 [default = "</s>"];
- Returns:
- Whether the eosPiece field is set.
-
getEosPiece
java.lang.String getEosPiece()
optional string eos_piece = 47 [default = "</s>"];
- Returns:
- The eosPiece.
-
getEosPieceBytes
com.google.protobuf.ByteString getEosPieceBytes()
optional string eos_piece = 47 [default = "</s>"];
- Returns:
- The bytes for eosPiece.
-
hasPadPiece
boolean hasPadPiece()
optional string pad_piece = 48 [default = "<pad>"];
- Returns:
- Whether the padPiece field is set.
-
getPadPiece
java.lang.String getPadPiece()
optional string pad_piece = 48 [default = "<pad>"];
- Returns:
- The padPiece.
-
getPadPieceBytes
com.google.protobuf.ByteString getPadPieceBytes()
optional string pad_piece = 48 [default = "<pad>"];
- Returns:
- The bytes for padPiece.
-
hasUnkSurface
boolean hasUnkSurface()
Encodes <unk> into U+2047 (DOUBLE QUESTION MARK), since this character is useful for both users and developers: it makes it easy to see that <unk> was emitted.
optional string unk_surface = 44 [default = " \342\201\207 "];
- Returns:
- Whether the unkSurface field is set.
-
getUnkSurface
java.lang.String getUnkSurface()
Encodes <unk> into U+2047 (DOUBLE QUESTION MARK), since this character is useful for both users and developers: it makes it easy to see that <unk> was emitted.
optional string unk_surface = 44 [default = " \342\201\207 "];
- Returns:
- The unkSurface.
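
The default value is written with octal byte escapes, which can obscure what it contains. As a small check, the sketch below rebuilds the default from those bytes and confirms that it decodes to U+2047 surrounded by single spaces. The class and method names are hypothetical.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper that decodes the default of unk_surface.
final class UnkSurfaceSketch {

    /** Rebuilds " \342\201\207 " from its raw UTF-8 bytes. */
    public static String defaultUnkSurface() {
        // Octal 342, 201, 207 are the bytes 0xE2, 0x81, 0x87; 0x20 is a space.
        byte[] utf8 = {0x20, (byte) 0342, (byte) 0201, (byte) 0207, 0x20};
        return new String(utf8, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String surface = defaultUnkSurface();
        // Index 0 is the leading space; index 1 is the double question mark.
        System.out.printf("U+%04X%n", surface.codePointAt(1)); // prints U+2047
    }
}
```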
-
getUnkSurfaceBytes
com.google.protobuf.ByteString getUnkSurfaceBytes()
Encodes <unk> into U+2047 (DOUBLE QUESTION MARK), since this character is useful for both users and developers: it makes it easy to see that <unk> was emitted.
optional string unk_surface = 44 [default = " \342\201\207 "];
- Returns:
- The bytes for unkSurface.
-
hasTrainExtremelyLargeCorpus
boolean hasTrainExtremelyLargeCorpus()
Increases bit depth to allow unigram model training on large (>10M sentences) corpora. A side effect of enabling this flag is increased memory usage.
optional bool train_extremely_large_corpus = 49 [default = false];
- Returns:
- Whether the trainExtremelyLargeCorpus field is set.
-
getTrainExtremelyLargeCorpus
boolean getTrainExtremelyLargeCorpus()
Increases bit depth to allow unigram model training on large (>10M sentences) corpora. A side effect of enabling this flag is increased memory usage.
optional bool train_extremely_large_corpus = 49 [default = false];
- Returns:
- The trainExtremelyLargeCorpus.
-
-