All Superinterfaces:

com.google.protobuf.GeneratedMessageV3.ExtendableMessageOrBuilder<SentencepieceModel.TrainerSpec>, com.google.protobuf.MessageLiteOrBuilder, com.google.protobuf.MessageOrBuilder

All Known Implementing Classes:

SentencepieceModel.TrainerSpec, SentencepieceModel.TrainerSpec.Builder

Enclosing class:

SentencepieceModel
```
public static interface SentencepieceModel.TrainerSpecOrBuilder
extends com.google.protobuf.GeneratedMessageV3.ExtendableMessageOrBuilder<SentencepieceModel.TrainerSpec>
```

Method Summary

Modifier and Type	Method	Description
`java.lang.String`	`getAcceptLanguage(int index)`	List of the languages this model can accept.
`com.google.protobuf.ByteString`	`getAcceptLanguageBytes(int index)`	List of the languages this model can accept.
`int`	`getAcceptLanguageCount()`	List of the languages this model can accept.
`java.util.List<java.lang.String>`	`getAcceptLanguageList()`	List of the languages this model can accept.
`boolean`	`getAllowWhitespaceOnlyPieces()`	Allows pieces that only contain whitespaces instead of appearing only as prefix or suffix of other pieces.
`int`	`getBosId()`	ID of the `<s>` (beginning-of-sentence) piece.
`java.lang.String`	`getBosPiece()`	`optional string bos_piece = 46 [default = "<s>"];`
`com.google.protobuf.ByteString`	`getBosPieceBytes()`	`optional string bos_piece = 46 [default = "<s>"];`
`boolean`	`getByteFallback()`	Decomposes unknown pieces into UTF-8 bytes.
`float`	`getCharacterCoverage()`	Training parameters. Uses characters which cover the corpus with the ratio of `character_coverage`.
`java.lang.String`	`getControlSymbols(int index)`	Defines control symbols used as an indicator to change the behavior of the decoder.
`com.google.protobuf.ByteString`	`getControlSymbolsBytes(int index)`	Defines control symbols used as an indicator to change the behavior of the decoder.
`int`	`getControlSymbolsCount()`	Defines control symbols used as an indicator to change the behavior of the decoder.
`java.util.List<java.lang.String>`	`getControlSymbolsList()`	Defines control symbols used as an indicator to change the behavior of the decoder.
`int`	`getEosId()`	ID of the `</s>` (end-of-sentence) piece.
`java.lang.String`	`getEosPiece()`	`optional string eos_piece = 47 [default = "</s>"];`
`com.google.protobuf.ByteString`	`getEosPieceBytes()`	`optional string eos_piece = 47 [default = "</s>"];`
`boolean`	`getHardVocabLimit()`	`vocab_size` is treated as hard limit.
`java.lang.String`	`getInput(int index)`	Input corpus files.
`com.google.protobuf.ByteString`	`getInputBytes(int index)`	Input corpus files.
`int`	`getInputCount()`	Input corpus files.
`java.lang.String`	`getInputFormat()`	Input corpus format: "text" (one sentence per line, the default) or "tsv" (sentence <tab> freq).
`com.google.protobuf.ByteString`	`getInputFormatBytes()`	Input corpus format: "text" (one sentence per line, the default) or "tsv" (sentence <tab> freq).
`java.util.List<java.lang.String>`	`getInputList()`	Input corpus files.
`long`	`getInputSentenceSize()`	Maximum number of sentences the trainer loads from the `input` parameter.
`int`	`getMaxSentenceLength()`	The maximum sentence length in bytes.
`int`	`getMaxSentencepieceLength()`	Maximum length of a sentence piece.
`int`	`getMiningSentenceSize()`	Deprecated.
`java.lang.String`	`getModelPrefix()`	Output model file prefix.
`com.google.protobuf.ByteString`	`getModelPrefixBytes()`	Output model file prefix.
`SentencepieceModel.TrainerSpec.ModelType`	`getModelType()`	`optional .sentencepiece.TrainerSpec.ModelType model_type = 3 [default = UNIGRAM];`
`int`	`getNumSubIterations()`	Number of EM sub iterations.
`int`	`getNumThreads()`	Number of threads in the training.
`int`	`getPadId()`	ID of the `<pad>` (padding) piece.
`java.lang.String`	`getPadPiece()`	`optional string pad_piece = 48 [default = "<pad>"];`
`com.google.protobuf.ByteString`	`getPadPieceBytes()`	`optional string pad_piece = 48 [default = "<pad>"];`
`java.lang.String`	`getRequiredChars()`	Defines required characters.
`com.google.protobuf.ByteString`	`getRequiredCharsBytes()`	Defines required characters.
`int`	`getSeedSentencepieceSize()`	The size of seed sentencepieces.
`int`	`getSelfTestSampleSize()`	Size of self-test samples, which are encoded in the model file.
`float`	`getShrinkingFactor()`	In every EM sub-iteration, keeps the top `shrinking_factor` * `current sentencepieces size` pieces with respect to the loss of the sentence piece.
`boolean`	`getShuffleInputSentence()`	`optional bool shuffle_input_sentence = 19 [default = true];`
`boolean`	`getSplitByNumber()`	When `split_by_number` is true, puts a boundary at each transition between number and non-number.
`boolean`	`getSplitByUnicodeScript()`	Uses Unicode script to split sentence pieces.
`boolean`	`getSplitByWhitespace()`	Use a white space to split sentence pieces.
`boolean`	`getSplitDigits()`	Split all digits (0-9) into separate pieces.
`boolean`	`getTrainExtremelyLargeCorpus()`	Increase bit depth to allow unigram model training on large (>10M sentences) corpora.
`int`	`getTrainingSentenceSize()`	Deprecated.
`boolean`	`getTreatWhitespaceAsSuffix()`	Adds whitespace symbol (_) as a suffix instead of prefix.
`int`	`getUnkId()`	Reserved special meta tokens. ID of the `<unk>` piece.
`java.lang.String`	`getUnkPiece()`	`optional string unk_piece = 45 [default = "<unk>"];`
`com.google.protobuf.ByteString`	`getUnkPieceBytes()`	`optional string unk_piece = 45 [default = "<unk>"];`
`java.lang.String`	`getUnkSurface()`	Encodes <unk> into U+2047 (DOUBLE QUESTION MARK), since this character can be useful both for user and developer.
`com.google.protobuf.ByteString`	`getUnkSurfaceBytes()`	Encodes <unk> into U+2047 (DOUBLE QUESTION MARK), since this character can be useful both for user and developer.
`boolean`	`getUseAllVocab()`	Uses all symbols for vocab extraction.
`java.lang.String`	`getUserDefinedSymbols(int index)`	Defines user-defined symbols.
`com.google.protobuf.ByteString`	`getUserDefinedSymbolsBytes(int index)`	Defines user-defined symbols.
`int`	`getUserDefinedSymbolsCount()`	Defines user-defined symbols.
`java.util.List<java.lang.String>`	`getUserDefinedSymbolsList()`	Defines user-defined symbols.
`int`	`getVocabSize()`	Vocabulary size.
`boolean`	`getVocabularyOutputPieceScore()`	When creating the vocabulary file, defines whether or not to additionally output the score for each piece.
`boolean`	`hasAllowWhitespaceOnlyPieces()`	Allows pieces that only contain whitespaces instead of appearing only as prefix or suffix of other pieces.
`boolean`	`hasBosId()`	ID of the `<s>` (beginning-of-sentence) piece.
`boolean`	`hasBosPiece()`	`optional string bos_piece = 46 [default = "<s>"];`
`boolean`	`hasByteFallback()`	Decomposes unknown pieces into UTF-8 bytes.
`boolean`	`hasCharacterCoverage()`	Training parameters. Uses characters which cover the corpus with the ratio of `character_coverage`.
`boolean`	`hasEosId()`	ID of the `</s>` (end-of-sentence) piece.
`boolean`	`hasEosPiece()`	`optional string eos_piece = 47 [default = "</s>"];`
`boolean`	`hasHardVocabLimit()`	`vocab_size` is treated as hard limit.
`boolean`	`hasInputFormat()`	Input corpus format: "text" (one sentence per line, the default) or "tsv" (sentence <tab> freq).
`boolean`	`hasInputSentenceSize()`	Maximum number of sentences the trainer loads from the `input` parameter.
`boolean`	`hasMaxSentenceLength()`	The maximum sentence length in bytes.
`boolean`	`hasMaxSentencepieceLength()`	Maximum length of a sentence piece.
`boolean`	`hasMiningSentenceSize()`	Deprecated.
`boolean`	`hasModelPrefix()`	Output model file prefix.
`boolean`	`hasModelType()`	`optional .sentencepiece.TrainerSpec.ModelType model_type = 3 [default = UNIGRAM];`
`boolean`	`hasNumSubIterations()`	Number of EM sub iterations.
`boolean`	`hasNumThreads()`	Number of threads in the training.
`boolean`	`hasPadId()`	ID of the `<pad>` (padding) piece.
`boolean`	`hasPadPiece()`	`optional string pad_piece = 48 [default = "<pad>"];`
`boolean`	`hasRequiredChars()`	Defines required characters.
`boolean`	`hasSeedSentencepieceSize()`	The size of seed sentencepieces.
`boolean`	`hasSelfTestSampleSize()`	Size of self-test samples, which are encoded in the model file.
`boolean`	`hasShrinkingFactor()`	In every EM sub-iteration, keeps the top `shrinking_factor` * `current sentencepieces size` pieces with respect to the loss of the sentence piece.
`boolean`	`hasShuffleInputSentence()`	`optional bool shuffle_input_sentence = 19 [default = true];`
`boolean`	`hasSplitByNumber()`	When `split_by_number` is true, puts a boundary at each transition between number and non-number.
`boolean`	`hasSplitByUnicodeScript()`	Uses Unicode script to split sentence pieces.
`boolean`	`hasSplitByWhitespace()`	Use a white space to split sentence pieces.
`boolean`	`hasSplitDigits()`	Split all digits (0-9) into separate pieces.
`boolean`	`hasTrainExtremelyLargeCorpus()`	Increase bit depth to allow unigram model training on large (>10M sentences) corpora.
`boolean`	`hasTrainingSentenceSize()`	Deprecated.
`boolean`	`hasTreatWhitespaceAsSuffix()`	Adds whitespace symbol (_) as a suffix instead of prefix.
`boolean`	`hasUnkId()`	Reserved special meta tokens. ID of the `<unk>` piece.
`boolean`	`hasUnkPiece()`	`optional string unk_piece = 45 [default = "<unk>"];`
`boolean`	`hasUnkSurface()`	Encodes <unk> into U+2047 (DOUBLE QUESTION MARK), since this character can be useful both for user and developer.
`boolean`	`hasUseAllVocab()`	Uses all symbols for vocab extraction.
`boolean`	`hasVocabSize()`	Vocabulary size.
`boolean`	`hasVocabularyOutputPieceScore()`	When creating the vocabulary file, defines whether or not to additionally output the score for each piece.

Methods inherited from interface com.google.protobuf.GeneratedMessageV3.ExtendableMessageOrBuilder
getDefaultInstanceForType, getExtension, getExtension, getExtension, getExtension, getExtension, getExtension, getExtensionCount, getExtensionCount, getExtensionCount, hasExtension, hasExtension, hasExtension

Methods inherited from interface com.google.protobuf.MessageLiteOrBuilder
isInitialized

Methods inherited from interface com.google.protobuf.MessageOrBuilder
findInitializationErrors, getAllFields, getDescriptorForType, getField, getInitializationErrorString, getOneofFieldDescriptor, getRepeatedField, getRepeatedFieldCount, getUnknownFields, hasField, hasOneof

Method Detail

getInputList

java.util.List<java.lang.String> getInputList()

 General parameters
 Input corpus files.
  Trainer accepts the following two formats:
  A) Monolingual: plain text, one sentence per line.
  B) Bilingual:   TSV, source sentence <tab> target sentence
  When bilingual data is passed, a shared vocabulary model is built.
  Note that the input file must be a raw corpus, not a preprocessed corpus.
  Trainer only loads the first `input_sentence_size` sentences specified
  with this parameter.

repeated string input = 1;

Returns: A list containing the input.

getInputCount

int getInputCount()

 General parameters
 Input corpus files.
  Trainer accepts the following two formats:
  A) Monolingual: plain text, one sentence per line.
  B) Bilingual:   TSV, source sentence <tab> target sentence
  When bilingual data is passed, a shared vocabulary model is built.
  Note that the input file must be a raw corpus, not a preprocessed corpus.
  Trainer only loads the first `input_sentence_size` sentences specified
  with this parameter.

repeated string input = 1;

Returns: The count of input.

getInput

java.lang.String getInput(int index)

 General parameters
 Input corpus files.
  Trainer accepts the following two formats:
  A) Monolingual: plain text, one sentence per line.
  B) Bilingual:   TSV, source sentence <tab> target sentence
  When bilingual data is passed, a shared vocabulary model is built.
  Note that the input file must be a raw corpus, not a preprocessed corpus.
  Trainer only loads the first `input_sentence_size` sentences specified
  with this parameter.

repeated string input = 1;

Parameters: index - The index of the element to return.
Returns: The input at the given index.

getInputBytes

com.google.protobuf.ByteString getInputBytes(int index)

 General parameters
 Input corpus files.
  Trainer accepts the following two formats:
  A) Monolingual: plain text, one sentence per line.
  B) Bilingual:   TSV, source sentence <tab> target sentence
  When bilingual data is passed, a shared vocabulary model is built.
  Note that the input file must be a raw corpus, not a preprocessed corpus.
  Trainer only loads the first `input_sentence_size` sentences specified
  with this parameter.

repeated string input = 1;

Parameters: index - The index of the value to return.
Returns: The bytes of the input at the given index.
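
The accessors for a repeated field expose the same data in different shapes. The sketch below illustrates this with a hypothetical stand-in interface, `InputView`, which is not part of the generated API; it only mirrors `getInputCount()`, `getInput(int)` and `getInputList()` so that the example runs without the protobuf runtime. With the real classes, the helper would presumably take a `SentencepieceModel.TrainerSpecOrBuilder` instead.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical stand-in mirroring the repeated-field getters documented above.
interface InputView {
    int getInputCount();
    String getInput(int index);
    List<String> getInputList();
}

final class RepeatedFieldExample {
    // Collects the input paths one by one, using the count and the indexed getter.
    static List<String> collectByIndex(InputView spec) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i < spec.getInputCount(); i++) {
            out.add(spec.getInput(i));
        }
        return out;
    }

    // Small in-memory implementation, for illustration only.
    static InputView of(List<String> paths) {
        final List<String> copy = Collections.unmodifiableList(new ArrayList<>(paths));
        return new InputView() {
            @Override public int getInputCount() { return copy.size(); }
            @Override public String getInput(int index) { return copy.get(index); }
            @Override public List<String> getInputList() { return copy; }
        };
    }
}
```

Index-based access and the list view yield the same elements in the same order; the list view is usually the more convenient of the two for iteration.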

hasInputFormat

boolean hasInputFormat()

 Input corpus format:
 "text": one-sentence-per-line text format (default)
 "tsv":  sentence <tab> freq

optional string input_format = 7;

Returns: Whether the inputFormat field is set.

getInputFormat

java.lang.String getInputFormat()

 Input corpus format:
 "text": one-sentence-per-line text format (default)
 "tsv":  sentence <tab> freq

optional string input_format = 7;

Returns: The inputFormat.

getInputFormatBytes

com.google.protobuf.ByteString getInputFormatBytes()

 Input corpus format:
 "text": one-sentence-per-line text format (default)
 "tsv":  sentence <tab> freq

optional string input_format = 7;

Returns: The bytes for inputFormat.
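
As an illustration of the "tsv" layout (sentence, a tab, then a frequency), here is a minimal parsing sketch. The class `TsvLineExample` and its behaviour on malformed lines are assumptions made for this example; the documentation above does not say how the trainer itself parses such lines.

```java
import java.util.AbstractMap;
import java.util.Map;

final class TsvLineExample {
    // Splits one "sentence<tab>freq" line into its two parts.
    // Assumption: the frequency is the text after the last tab.
    static Map.Entry<String, Long> parse(String line) {
        int tab = line.lastIndexOf('\t');
        if (tab < 0) {
            throw new IllegalArgumentException("expected sentence<tab>freq: " + line);
        }
        String sentence = line.substring(0, tab);
        long freq = Long.parseLong(line.substring(tab + 1).trim());
        return new AbstractMap.SimpleImmutableEntry<>(sentence, freq);
    }
}
```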

hasModelPrefix

boolean hasModelPrefix()

 Output model file prefix.
 <model_prefix>.model and <model_prefix>.vocab are generated.

optional string model_prefix = 2;

Returns: Whether the modelPrefix field is set.

getModelPrefix

java.lang.String getModelPrefix()

 Output model file prefix.
 <model_prefix>.model and <model_prefix>.vocab are generated.

optional string model_prefix = 2;

Returns: The modelPrefix.

getModelPrefixBytes

com.google.protobuf.ByteString getModelPrefixBytes()

 Output model file prefix.
 <model_prefix>.model and <model_prefix>.vocab are generated.

optional string model_prefix = 2;

Returns: The bytes for modelPrefix.
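
The two generated file names follow directly from the prefix. A minimal sketch, using a hypothetical helper class `ModelPrefixExample` that is not part of the API:

```java
final class ModelPrefixExample {
    // <model_prefix>.model
    static String modelFile(String modelPrefix) {
        return modelPrefix + ".model";
    }

    // <model_prefix>.vocab
    static String vocabFile(String modelPrefix) {
        return modelPrefix + ".vocab";
    }
}
```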

hasModelType
```
boolean hasModelType()
```
optional .sentencepiece.TrainerSpec.ModelType model_type = 3 [default = UNIGRAM];

Returns:

Whether the modelType field is set.

getModelType
```
SentencepieceModel.TrainerSpec.ModelType getModelType()
```
optional .sentencepiece.TrainerSpec.ModelType model_type = 3 [default = UNIGRAM];

Returns:

The modelType.

hasVocabSize
```
boolean hasVocabSize()
```
```
 Vocabulary size. 8k is the default size.
 
```
optional int32 vocab_size = 4 [default = 8000];
Returns:

Whether the vocabSize field is set.

getVocabSize
```
int getVocabSize()
```
```
 Vocabulary size. 8k is the default size.
 
```
optional int32 vocab_size = 4 [default = 8000];
Returns:

The vocabSize.
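
For a proto2 `optional` field with a declared default, the getter returns the default when the field has not been set, so only the `has` method reveals whether a value was given explicitly. The sketch below shows that pairing with a hypothetical stand-in interface, `VocabSizeView`, which mirrors `hasVocabSize()` and `getVocabSize()` so the example runs without the protobuf runtime.

```java
// Hypothetical stand-in for the hasVocabSize()/getVocabSize() pair.
interface VocabSizeView {
    boolean hasVocabSize();
    int getVocabSize();
}

final class OptionalFieldExample {
    // Default declared in the field definition: [default = 8000].
    static final int DEFAULT_VOCAB_SIZE = 8000;

    // Mimics a message in which vocab_size was never set.
    static VocabSizeView unset() {
        return new VocabSizeView() {
            @Override public boolean hasVocabSize() { return false; }
            @Override public int getVocabSize() { return DEFAULT_VOCAB_SIZE; }
        };
    }

    // Mimics a message in which vocab_size was set explicitly.
    static VocabSizeView explicit(final int value) {
        return new VocabSizeView() {
            @Override public boolean hasVocabSize() { return true; }
            @Override public int getVocabSize() { return value; }
        };
    }

    static String describe(VocabSizeView spec) {
        return spec.hasVocabSize()
                ? "vocab_size explicitly set to " + spec.getVocabSize()
                : "vocab_size not set; default " + spec.getVocabSize() + " applies";
    }
}
```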

getAcceptLanguageList

java.util.List<java.lang.String> getAcceptLanguageList()

 List of the languages this model can accept.
 Since the model is language-agnostic, this field is used as a reference.

repeated string accept_language = 5;

Returns: A list containing the acceptLanguage.

getAcceptLanguageCount

int getAcceptLanguageCount()

 List of the languages this model can accept.
 Since the model is language-agnostic, this field is used as a reference.

repeated string accept_language = 5;

Returns: The count of acceptLanguage.

getAcceptLanguage

java.lang.String getAcceptLanguage(int index)

 List of the languages this model can accept.
 Since the model is language-agnostic, this field is used as a reference.

repeated string accept_language = 5;

Parameters: index - The index of the element to return.
Returns: The acceptLanguage at the given index.

getAcceptLanguageBytes

com.google.protobuf.ByteString getAcceptLanguageBytes(int index)

 List of the languages this model can accept.
 Since the model is language-agnostic, this field is used as a reference.

repeated string accept_language = 5;

Parameters: index - The index of the value to return.
Returns: The bytes of the acceptLanguage at the given index.

hasSelfTestSampleSize
```
boolean hasSelfTestSampleSize()
```
```
 Size of self-test samples, which are encoded in the model file.
 
```
optional int32 self_test_sample_size = 6 [default = 0];
Returns:

Whether the selfTestSampleSize field is set.

getSelfTestSampleSize
```
int getSelfTestSampleSize()
```
```
 Size of self-test samples, which are encoded in the model file.
 
```
optional int32 self_test_sample_size = 6 [default = 0];
Returns:

The selfTestSampleSize.

hasCharacterCoverage

boolean hasCharacterCoverage()

 Training parameters.
 Uses characters which cover the corpus with the ratio of `character_coverage`.
 This parameter determines the basic alphabet of sentence pieces.
 The remaining 1.0 - `character_coverage` fraction of characters is treated as UNK.
 See also the required_chars field.

optional float character_coverage = 10 [default = 0.9995];

Returns: Whether the characterCoverage field is set.

getCharacterCoverage

float getCharacterCoverage()

 Training parameters.
 Uses characters which cover the corpus with the ratio of `character_coverage`.
 This parameter determines the basic alphabet of sentence pieces.
 The remaining 1.0 - `character_coverage` fraction of characters is treated as UNK.
 See also the required_chars field.

optional float character_coverage = 10 [default = 0.9995];

Returns: The characterCoverage.
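
One way to picture the coverage ratio is to keep the most frequent characters until their cumulative share of the corpus reaches `character_coverage`; whatever is left over falls outside the alphabet. The sketch below does that for a small string. It is an illustration under that assumption, not the trainer's actual selection code, and `CharacterCoverageExample` is a hypothetical name.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

final class CharacterCoverageExample {
    // Returns the most frequent code points whose cumulative share reaches `coverage`.
    static Set<Integer> alphabet(String corpus, double coverage) {
        Map<Integer, Long> freq = new HashMap<>();
        long total = 0;
        for (int i = 0; i < corpus.length(); ) {
            int cp = corpus.codePointAt(i);
            i += Character.charCount(cp);
            freq.merge(cp, 1L, Long::sum);
            total++;
        }
        List<Map.Entry<Integer, Long>> sorted = new ArrayList<>(freq.entrySet());
        // Most frequent first; ties broken by code point so the result is deterministic.
        sorted.sort((a, b) -> {
            int c = Long.compare(b.getValue(), a.getValue());
            return c != 0 ? c : Integer.compare(a.getKey(), b.getKey());
        });
        Set<Integer> kept = new LinkedHashSet<>();
        long covered = 0;
        for (Map.Entry<Integer, Long> e : sorted) {
            if ((double) covered / total >= coverage) {
                break; // requested coverage reached; remaining characters are left out
            }
            kept.add(e.getKey());
            covered += e.getValue();
        }
        return kept;
    }
}
```

With the corpus `"aaaaaaaabz"` and a coverage of 0.85, `a` and `b` are kept and `z` is left out, since the first two already cover 90% of the characters.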

hasInputSentenceSize

boolean hasInputSentenceSize()

 Maximum number of sentences the trainer loads from the `input` parameter.
 Trainer simply loads the `input` files in sequence.
 It is better to shuffle the input corpus randomly.

optional uint64 input_sentence_size = 11 [default = 0];

Returns: Whether the inputSentenceSize field is set.

getInputSentenceSize

long getInputSentenceSize()

 Maximum number of sentences the trainer loads from the `input` parameter.
 Trainer simply loads the `input` files in sequence.
 It is better to shuffle the input corpus randomly.

optional uint64 input_sentence_size = 11 [default = 0];

Returns: The inputSentenceSize.

hasShuffleInputSentence
```
boolean hasShuffleInputSentence()
```
optional bool shuffle_input_sentence = 19 [default = true];

Returns:

Whether the shuffleInputSentence field is set.

getShuffleInputSentence
```
boolean getShuffleInputSentence()
```
optional bool shuffle_input_sentence = 19 [default = true];

Returns:

The shuffleInputSentence.

hasMiningSentenceSize

@Deprecated
boolean hasMiningSentenceSize()

Deprecated.

 Maximum number of sentences used to make seed sentence pieces.
 An extended suffix array is constructed to extract frequent
 sub-strings from the corpus. This uses 20N working space,
 where N is the size of the corpus.

optional int32 mining_sentence_size = 12 [deprecated = true];

Returns: Whether the miningSentenceSize field is set.

getMiningSentenceSize

@Deprecated
int getMiningSentenceSize()

Deprecated.

 Maximum number of sentences used to make seed sentence pieces.
 An extended suffix array is constructed to extract frequent
 sub-strings from the corpus. This uses 20N working space,
 where N is the size of the corpus.

optional int32 mining_sentence_size = 12 [deprecated = true];

Returns: The miningSentenceSize.

hasTrainingSentenceSize
```
@Deprecated
boolean hasTrainingSentenceSize()
```
Deprecated.
```
 Maximum size of sentences to train sentence pieces.
 
```
optional int32 training_sentence_size = 13 [deprecated = true];
Returns:

Whether the trainingSentenceSize field is set.

getTrainingSentenceSize
```
@Deprecated
int getTrainingSentenceSize()
```
Deprecated.
```
 Maximum size of sentences to train sentence pieces.
 
```
optional int32 training_sentence_size = 13 [deprecated = true];
Returns:

The trainingSentenceSize.

hasSeedSentencepieceSize
```
boolean hasSeedSentencepieceSize()
```
```
 The size of seed sentencepieces.
 `seed_sentencepiece_size` must be larger than `vocab_size`.
 
```
optional int32 seed_sentencepiece_size = 14 [default = 1000000];
Returns:

Whether the seedSentencepieceSize field is set.

getSeedSentencepieceSize

int getSeedSentencepieceSize()

 The size of seed sentencepieces.
 `seed_sentencepiece_size` must be larger than `vocab_size`.

optional int32 seed_sentencepiece_size = 14 [default = 1000000];

Returns: The seedSentencepieceSize.
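
The constraint between the two sizes can be checked before training. A minimal sketch with a hypothetical helper, `SeedSizeCheckExample`; whether and how the trainer itself validates this is not stated here.

```java
final class SeedSizeCheckExample {
    // True when `seed_sentencepiece_size` is strictly larger than `vocab_size`.
    static boolean isValid(int seedSentencepieceSize, int vocabSize) {
        return seedSentencepieceSize > vocabSize;
    }
}
```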

hasShrinkingFactor

boolean hasShrinkingFactor()

 In every EM sub-iteration, keeps the top
 `shrinking_factor` * `current sentencepieces size` pieces with respect to
 the loss of the sentence piece. This value should be smaller than 1.0.

optional float shrinking_factor = 15 [default = 0.75];

Returns: Whether the shrinkingFactor field is set.

getShrinkingFactor

float getShrinkingFactor()

 In every EM sub-iteration, keeps the top
 `shrinking_factor` * `current sentencepieces size` pieces with respect to
 the loss of the sentence piece. This value should be smaller than 1.0.

optional float shrinking_factor = 15 [default = 0.75];

Returns: The shrinkingFactor.
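
To see how repeated shrinking reduces the piece set, the sketch below computes the sizes after successive pruning steps. The stopping rule (clamping at a target size) and the class name `ShrinkingFactorExample` are assumptions for illustration; the trainer's real schedule may differ.

```java
import java.util.ArrayList;
import java.util.List;

final class ShrinkingFactorExample {
    // Sizes of the piece set after each pruning step, until `targetSize` is reached.
    static List<Integer> pruningSchedule(int seedSize, int targetSize, double shrinkingFactor) {
        if (shrinkingFactor <= 0.0 || shrinkingFactor >= 1.0) {
            throw new IllegalArgumentException("shrinking factor must be in (0, 1)");
        }
        if (targetSize < 1 || seedSize < targetSize) {
            throw new IllegalArgumentException("need 1 <= targetSize <= seedSize");
        }
        List<Integer> sizes = new ArrayList<>();
        int current = seedSize;
        while (current > targetSize) {
            // Keep the top fraction, but never drop below the target.
            current = Math.max(targetSize, (int) (shrinkingFactor * current));
            sizes.add(current);
        }
        return sizes;
    }
}
```

For example, starting from 1000 pieces with a factor of 0.75 and a target of 500, the sizes are 750, 562 and finally 500.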

hasMaxSentenceLength

boolean hasMaxSentenceLength()

 The maximum sentence length in bytes. Sentences longer than
 `max_sentence_length` are simply ignored.
 Longer input tends to bring the following risks:
  * Overflow during EM training (unigram language model only)
  * Performance drop because of O(n log n) cost in BPE.

optional int32 max_sentence_length = 18 [default = 4192];

Returns: Whether the maxSentenceLength field is set.

getMaxSentenceLength

int getMaxSentenceLength()

 The maximum sentence length in bytes. Sentences longer than
 `max_sentence_length` are simply ignored.
 Longer input tends to bring the following risks:
  * Overflow during EM training (unigram language model only)
  * Performance drop because of O(n log n) cost in BPE.

optional int32 max_sentence_length = 18 [default = 4192];

Returns: The maxSentenceLength.
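
Because the limit is expressed in bytes rather than characters, a short sentence in a multi-byte script can exceed it sooner than its character count suggests. The sketch below filters sentences by encoded length; it assumes UTF-8 encoding, and `MaxSentenceLengthExample` is a hypothetical helper rather than part of the API.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

final class MaxSentenceLengthExample {
    // Keeps only sentences whose UTF-8 encoding fits within `maxSentenceLength` bytes.
    static List<String> dropLongSentences(List<String> sentences, int maxSentenceLength) {
        List<String> kept = new ArrayList<>();
        for (String s : sentences) {
            if (s.getBytes(StandardCharsets.UTF_8).length <= maxSentenceLength) {
                kept.add(s);
            }
        }
        return kept;
    }
}
```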

hasNumThreads
```
boolean hasNumThreads()
```
```
 Number of threads in the training.
 
```
optional int32 num_threads = 16 [default = 16];
Returns:

Whether the numThreads field is set.

getNumThreads
```
int getNumThreads()
```
```
 Number of threads in the training.
 
```
optional int32 num_threads = 16 [default = 16];
Returns:

The numThreads.

hasNumSubIterations
```
boolean hasNumSubIterations()
```
```
 Number of EM sub iterations.
 
```
optional int32 num_sub_iterations = 17 [default = 2];
Returns:

Whether the numSubIterations field is set.

getNumSubIterations
```
int getNumSubIterations()
```
```
 Number of EM sub iterations.
 
```
optional int32 num_sub_iterations = 17 [default = 2];
Returns:

The numSubIterations.

hasMaxSentencepieceLength

boolean hasMaxSentencepieceLength()

 SentencePiece parameters which control the shape of sentence pieces.
 Maximum length of a sentence piece.

optional int32 max_sentencepiece_length = 20 [default = 16];

Returns: Whether the maxSentencepieceLength field is set.

getMaxSentencepieceLength

int getMaxSentencepieceLength()

 SentencePiece parameters which control the shape of sentence pieces.
 Maximum length of a sentence piece.

optional int32 max_sentencepiece_length = 20 [default = 16];

Returns: The maxSentencepieceLength.

hasSplitByUnicodeScript

boolean hasSplitByUnicodeScript()

 Uses Unicode script to split sentence pieces.
 When `split_by_unicode_script` is true, a sentence piece is not allowed to
 include multiple Unicode scripts, e.g. "F1" is not a valid piece.
 Exception: CJ characters (Hiragana/Katakana/Han) are all handled
 as one script type, since a Japanese word can consist of multiple scripts.
 This exception is always applied regardless of the accept-language
 parameter.

optional bool split_by_unicode_script = 21 [default = true];

Returns: Whether the splitByUnicodeScript field is set.

getSplitByUnicodeScript

boolean getSplitByUnicodeScript()

 Uses Unicode script to split sentence pieces.
 When `split_by_unicode_script` is true, a sentence piece is not allowed to
 include multiple Unicode scripts, e.g. "F1" is not a valid piece.
 Exception: CJ characters (Hiragana/Katakana/Han) are all handled
 as one script type, since a Japanese word can consist of multiple scripts.
 This exception is always applied regardless of the accept-language
 parameter.

optional bool split_by_unicode_script = 21 [default = true];

Returns: The splitByUnicodeScript.
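
The single-script rule can be approximated with the JDK's `Character.UnicodeScript`. The sketch below treats Hiragana, Katakana and Han as one script, as described above, and rejects any piece that mixes scripts. It is a simplified illustration: the JDK's script data and the handling of special characters may not match the trainer's, and `UnicodeScriptExample` is a hypothetical name.

```java
final class UnicodeScriptExample {
    // Folds Hiragana and Katakana into Han so Japanese text counts as one script.
    static Character.UnicodeScript normalize(Character.UnicodeScript s) {
        if (s == Character.UnicodeScript.HIRAGANA || s == Character.UnicodeScript.KATAKANA) {
            return Character.UnicodeScript.HAN;
        }
        return s;
    }

    // True when every code point of `piece` belongs to the same (normalized) script.
    static boolean isSingleScript(String piece) {
        Character.UnicodeScript prev = null;
        for (int i = 0; i < piece.length(); ) {
            int cp = piece.codePointAt(i);
            i += Character.charCount(cp);
            Character.UnicodeScript s = normalize(Character.UnicodeScript.of(cp));
            if (prev != null && prev != s) {
                return false;
            }
            prev = s;
        }
        return true;
    }
}
```

In the JDK, `F` is in the LATIN script and `1` in COMMON, so `isSingleScript("F1")` is false, in line with the "F1" example above.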

hasSplitByNumber

boolean hasSplitByNumber()

 When `split_by_number` is true, puts a boundary at each transition between
 number and non-number. To treat "F1" as one token, set this flag
 to false.

optional bool split_by_number = 23 [default = true];

Returns: Whether the splitByNumber field is set.

getSplitByNumber

boolean getSplitByNumber()

 When `split_by_number` is true, puts a boundary at each transition between
 number and non-number. To treat "F1" as one token, set this flag
 to false.

optional bool split_by_number = 23 [default = true];

Returns: The splitByNumber.
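
The boundaries implied by `split_by_number` can be visualized by cutting a string wherever it switches between digits and non-digits. This is only a sketch of where the boundaries fall, using a hypothetical class `SplitByNumberExample` and `Character.isDigit` as the definition of "number"; the trainer applies the rule as a constraint on candidate pieces rather than as a standalone splitter.

```java
import java.util.ArrayList;
import java.util.List;

final class SplitByNumberExample {
    // Cuts `text` at every transition between a digit and a non-digit.
    static List<String> split(String text) {
        List<String> parts = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        boolean prevDigit = false;
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            boolean digit = Character.isDigit(c);
            if (current.length() > 0 && digit != prevDigit) {
                parts.add(current.toString());
                current.setLength(0);
            }
            current.append(c);
            prevDigit = digit;
        }
        if (current.length() > 0) {
            parts.add(current.toString());
        }
        return parts;
    }
}
```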

hasSplitByWhitespace

boolean hasSplitByWhitespace()

 Uses whitespace to split sentence pieces.
 When `split_by_whitespace` is false, a piece may contain
 whitespace in the middle, e.g., "in_the".

optional bool split_by_whitespace = 22 [default = true];

Returns: Whether the splitByWhitespace field is set.

getSplitByWhitespace

boolean getSplitByWhitespace()

 Uses whitespace to split sentence pieces.
 When `split_by_whitespace` is false, a piece may contain
 whitespace in the middle, e.g., "in_the".

optional bool split_by_whitespace = 22 [default = true];

Returns: The splitByWhitespace.

hasTreatWhitespaceAsSuffix

boolean hasTreatWhitespaceAsSuffix()

 Adds whitespace symbol (_) as a suffix instead of prefix. e.g., _hello =>
 hello_. When `treat_whitespace_as_suffix` is true,
 NormalizerSpec::add_dummy_prefix will add the dummy whitespace to the end
 of sentence.

optional bool treat_whitespace_as_suffix = 24 [default = false];

Returns: Whether the treatWhitespaceAsSuffix field is set.

getTreatWhitespaceAsSuffix

boolean getTreatWhitespaceAsSuffix()

 Adds whitespace symbol (_) as a suffix instead of prefix. e.g., _hello =>
 hello_. When `treat_whitespace_as_suffix` is true,
 NormalizerSpec::add_dummy_prefix will add the dummy whitespace to the end
 of sentence.

optional bool treat_whitespace_as_suffix = 24 [default = false];

Returns: The treatWhitespaceAsSuffix.
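
The effect of the flag on where the whitespace marker lands can be shown on whitespace-separated words. The sketch below writes the marker as `_`, following the notation used above, and assumes a dummy whitespace is added at the start (prefix mode) or end (suffix mode) of the sentence. `WhitespaceMarkerExample` is a hypothetical helper, not the library's tokenizer.

```java
import java.util.ArrayList;
import java.util.List;

final class WhitespaceMarkerExample {
    // Attaches the whitespace marker to each word, as a prefix or as a suffix.
    static List<String> mark(String sentence, boolean treatWhitespaceAsSuffix) {
        List<String> out = new ArrayList<>();
        for (String w : sentence.trim().split("\\s+")) {
            if (w.isEmpty()) {
                continue; // an empty sentence yields no words
            }
            out.add(treatWhitespaceAsSuffix ? w + "_" : "_" + w);
        }
        return out;
    }
}
```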

hasAllowWhitespaceOnlyPieces
```
boolean hasAllowWhitespaceOnlyPieces()
```
```
 Allows pieces that only contain whitespaces instead of appearing only as
 prefix or suffix of other pieces.
 
```
optional bool allow_whitespace_only_pieces = 26 [default = false];
Returns:

Whether the allowWhitespaceOnlyPieces field is set.

getAllowWhitespaceOnlyPieces

boolean getAllowWhitespaceOnlyPieces()

 Allows pieces that only contain whitespaces instead of appearing only as
 prefix or suffix of other pieces.

optional bool allow_whitespace_only_pieces = 26 [default = false];

Returns: The allowWhitespaceOnlyPieces.

hasSplitDigits
```
boolean hasSplitDigits()
```
```
 Split all digits (0-9) into separate pieces.
 
```
optional bool split_digits = 25 [default = false];
Returns:

Whether the splitDigits field is set.

getSplitDigits
```
boolean getSplitDigits()
```
```
 Split all digits (0-9) into separate pieces.
 
```
optional bool split_digits = 25 [default = false];
Returns:

The splitDigits.
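
With `split_digits` enabled, every digit ends up in a piece of its own. A minimal sketch of that behaviour on a plain string, using a hypothetical class `SplitDigitsExample`; non-digit runs are kept together here only to make the digit handling visible.

```java
import java.util.ArrayList;
import java.util.List;

final class SplitDigitsExample {
    // Emits each digit 0-9 as its own element; other characters stay grouped in runs.
    static List<String> split(String text) {
        List<String> parts = new ArrayList<>();
        StringBuilder run = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c >= '0' && c <= '9') {
                if (run.length() > 0) {
                    parts.add(run.toString());
                    run.setLength(0);
                }
                parts.add(String.valueOf(c));
            } else {
                run.append(c);
            }
        }
        if (run.length() > 0) {
            parts.add(run.toString());
        }
        return parts;
    }
}
```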

getControlSymbolsList

java.util.List<java.lang.String> getControlSymbolsList()

 Vocabulary management
 Defines control symbols used as an indicator to
 change the behavior of the decoder. <s> and </s> are pre-defined.
 We can use this field to encode various meta information,
 including a language indicator in a multilingual model.
 These symbols are not visible to users, but visible to
 the decoder. Note that when the input sentence contains control symbols,
 they are not treated as one token, but segmented into normal pieces.
 Control symbols must be inserted independently from the segmentation.

repeated string control_symbols = 30;

Returns: A list containing the controlSymbols.

getControlSymbolsCount

int getControlSymbolsCount()

 Vocabulary management
 Defines control symbols used as an indicator to
 change the behavior of the decoder. <s> and </s> are pre-defined.
 We can use this field to encode various meta information,
 including a language indicator in a multilingual model.
 These symbols are not visible to users, but visible to
 the decoder. Note that when the input sentence contains control symbols,
 they are not treated as one token, but segmented into normal pieces.
 Control symbols must be inserted independently from the segmentation.

repeated string control_symbols = 30;

Returns: The count of controlSymbols.

getControlSymbols

java.lang.String getControlSymbols(int index)

 Vocabulary management
 Defines control symbols used as an indicator to
 change the behavior of the decoder. <s> and </s> are pre-defined.
 We can use this field to encode various meta information,
 including a language indicator in a multilingual model.
 These symbols are not visible to users, but visible to
 the decoder. Note that when the input sentence contains control symbols,
 they are not treated as one token, but segmented into normal pieces.
 Control symbols must be inserted independently from the segmentation.

repeated string control_symbols = 30;

Parameters: index - The index of the element to return.
Returns: The controlSymbols at the given index.

getControlSymbolsBytes

com.google.protobuf.ByteString getControlSymbolsBytes(int index)

 Vocabulary management
 Defines control symbols used as an indicator to
 change the behavior of the decoder. <s> and </s> are pre-defined.
 We can use this field to encode various meta information,
 including a language indicator in a multilingual model.
 These symbols are not visible to users, but visible to
 the decoder. Note that when the input sentence contains control symbols,
 they are not treated as one token, but segmented into normal pieces.
 Control symbols must be inserted independently from the segmentation.

repeated string control_symbols = 30;

Parameters: index - The index of the value to return.
Returns: The bytes of the controlSymbols at the given index.

getUserDefinedSymbolsList

java.util.List<java.lang.String> getUserDefinedSymbolsList()

 Defines user-defined symbols.
 These symbols are added with an extremely high score
 so they are always treated as one unique symbol in any context.
 A typical use of user_defined_symbols is as placeholders for named entities.

repeated string user_defined_symbols = 31;

Returns: A list containing the userDefinedSymbols.

getUserDefinedSymbolsCount

int getUserDefinedSymbolsCount()

 Defines user-defined symbols.
 These symbols are added with an extremely high score
 so that they are always treated as one unique symbol in any context.
 A typical use of user_defined_symbols is as a placeholder for named entities.

repeated string user_defined_symbols = 31;

Returns:: The count of userDefinedSymbols.

getUserDefinedSymbols

java.lang.String getUserDefinedSymbols(int index)

 Defines user-defined symbols.
 These symbols are added with an extremely high score
 so that they are always treated as one unique symbol in any context.
 A typical use of user_defined_symbols is as a placeholder for named entities.

repeated string user_defined_symbols = 31;

Parameters:: index - The index of the element to return.
Returns:: The userDefinedSymbols at the given index.

getUserDefinedSymbolsBytes

com.google.protobuf.ByteString getUserDefinedSymbolsBytes(int index)

 Defines user-defined symbols.
 These symbols are added with an extremely high score
 so that they are always treated as one unique symbol in any context.
 A typical use of user_defined_symbols is as a placeholder for named entities.

repeated string user_defined_symbols = 31;

Parameters:: index - The index of the value to return.
Returns:: The bytes of the userDefinedSymbols at the given index.
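
To make "always treated as one unique symbol in any context" concrete, here is a rough sketch that splits a string so each occurrence of a user-defined symbol becomes its own chunk. It is an illustration only, not the scoring mechanism the library uses; the class name, method name, and the `<PERSON>` placeholder are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

final class UserDefinedSymbolDemo {
    // Splits text so that every occurrence of a user-defined symbol is
    // emitted as a single chunk, with ordinary text kept in between.
    static List<String> isolate(String text, List<String> symbols) {
        List<String> chunks = new ArrayList<>();
        int start = 0; // start of the pending run of ordinary text
        int i = 0;
        while (i < text.length()) {
            String matched = null;
            for (String s : symbols) {
                // Prefer the longest symbol that matches at position i.
                if (!s.isEmpty() && text.startsWith(s, i)
                        && (matched == null || s.length() > matched.length())) {
                    matched = s;
                }
            }
            if (matched == null) {
                i++;
                continue;
            }
            if (i > start) {
                chunks.add(text.substring(start, i));
            }
            chunks.add(matched);
            i += matched.length();
            start = i;
        }
        if (start < text.length()) {
            chunks.add(text.substring(start));
        }
        return chunks;
    }

    public static void main(String[] args) {
        System.out.println(isolate("Call <PERSON> now", List.of("<PERSON>")));
    }
}
```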

hasRequiredChars

boolean hasRequiredChars()

 Defines required characters. Each UTF-8 character in this string is included
 in the character set regardless of the character_coverage value. Unlike
 user_defined_symbols, these characters have scores based on their frequency
 in the input sentences, and the model can form subwords using characters
 in this field.

optional string required_chars = 36;

Returns:: Whether the requiredChars field is set.

getRequiredChars

java.lang.String getRequiredChars()

 Defines required characters. Each UTF-8 character in this string is included
 in the character set regardless of the character_coverage value. Unlike
 user_defined_symbols, these characters have scores based on their frequency
 in the input sentences, and the model can form subwords using characters
 in this field.

optional string required_chars = 36;

Returns:: The requiredChars.

getRequiredCharsBytes

com.google.protobuf.ByteString getRequiredCharsBytes()

 Defines required characters. Each UTF-8 character in this string is included
 in the character set regardless of the character_coverage value. Unlike
 user_defined_symbols, these characters have scores based on their frequency
 in the input sentences, and the model can form subwords using characters
 in this field.

optional string required_chars = 36;

Returns:: The bytes for requiredChars.
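
Because required_chars is a single string in which every character counts individually, a caller may want to list those characters. The sketch below assumes that "character" means a Unicode code point (so a surrogate pair in a Java string counts once); the class and method names are hypothetical.

```java
import java.util.LinkedHashSet;
import java.util.Set;

final class RequiredCharsDemo {
    // Returns the distinct Unicode code points of requiredChars as strings,
    // in the order in which they first appear.
    static Set<String> characters(String requiredChars) {
        Set<String> out = new LinkedHashSet<>();
        requiredChars.codePoints()
                .forEach(cp -> out.add(new String(Character.toChars(cp))));
        return out;
    }

    public static void main(String[] args) {
        System.out.println(characters("abca"));
    }
}
```

Iterating with `codePoints()` rather than `toCharArray()` keeps characters outside the Basic Multilingual Plane intact.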

hasByteFallback
```
boolean hasByteFallback()
```
```
 Decomposes unknown pieces into UTF-8 bytes.
 
```
optional bool byte_fallback = 35 [default = false];
Returns:

Whether the byteFallback field is set.

getByteFallback
```
boolean getByteFallback()
```
```
 Decomposes unknown pieces into UTF-8 bytes.
 
```
optional bool byte_fallback = 35 [default = false];
Returns:

The byteFallback.
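
As a rough illustration of decomposing an unknown piece into UTF-8 bytes, the sketch below turns a string into one entry per byte. The `<0xNN>` spelling of each byte piece is an assumption made for this example, and the class and method names are hypothetical.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

final class ByteFallbackDemo {
    // Decomposes a piece into its UTF-8 bytes, one output entry per byte.
    // The "<0xNN>" naming is an assumed convention for this illustration.
    static List<String> toBytePieces(String unknownPiece) {
        List<String> pieces = new ArrayList<>();
        for (byte b : unknownPiece.getBytes(StandardCharsets.UTF_8)) {
            // Mask to 0..255 because Java bytes are signed.
            pieces.add(String.format("<0x%02X>", b & 0xFF));
        }
        return pieces;
    }

    public static void main(String[] args) {
        // U+00E9 is encoded in UTF-8 as the two bytes C3 A9.
        System.out.println(toBytePieces("\u00E9"));
    }
}
```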

hasVocabularyOutputPieceScore
```
boolean hasVocabularyOutputPieceScore()
```
```
 When creating the vocabulary file, defines whether or not to additionally
 output the score for each piece.
 
```
optional bool vocabulary_output_piece_score = 32 [default = true];
Returns:

Whether the vocabularyOutputPieceScore field is set.

getVocabularyOutputPieceScore

boolean getVocabularyOutputPieceScore()

 When creating the vocabulary file, defines whether or not to additionally
 output the score for each piece.

optional bool vocabulary_output_piece_score = 32 [default = true];

Returns:: The vocabularyOutputPieceScore.
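
Each optional field on this interface comes as a has/get pair: the has method reports whether the field was explicitly set, and the get method returns the declared default otherwise. The following stand-in class is a sketch of that behavior only; it is not the code that protobuf generates, and its name is hypothetical.

```java
final class OptionalBoolField {
    private final boolean defaultValue;
    private Boolean value; // null means "not set"

    OptionalBoolField(boolean defaultValue) {
        this.defaultValue = defaultValue;
    }

    void set(boolean v) {
        value = v;
    }

    // Mirrors hasXxx(): true only after an explicit set.
    boolean has() {
        return value != null;
    }

    // Mirrors getXxx(): the explicit value, or the declared default.
    boolean get() {
        return value != null ? value : defaultValue;
    }

    public static void main(String[] args) {
        // Modeled on vocabulary_output_piece_score, whose default is true.
        OptionalBoolField field = new OptionalBoolField(true);
        System.out.println(field.has() + " " + field.get());
    }
}
```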

hasHardVocabLimit

boolean hasHardVocabLimit()

 When true, `vocab_size` is treated as a hard limit: training crashes if
 the model cannot produce a vocabulary of size `vocab_size`.
 When `hard_vocab_limit` is false, `vocab_size` is treated
 as a soft limit. Note that when model_type=char,
 hard_vocab_limit = false is always assumed.

optional bool hard_vocab_limit = 33 [default = true];

Returns:: Whether the hardVocabLimit field is set.

getHardVocabLimit

boolean getHardVocabLimit()

 When true, `vocab_size` is treated as a hard limit: training crashes if
 the model cannot produce a vocabulary of size `vocab_size`.
 When `hard_vocab_limit` is false, `vocab_size` is treated
 as a soft limit. Note that when model_type=char,
 hard_vocab_limit = false is always assumed.

optional bool hard_vocab_limit = 33 [default = true];

Returns:: The hardVocabLimit.
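
The difference between a hard and a soft limit can be sketched as follows. This is a simplified reading of the description above, not the trainer's actual logic; `achievableSize` is a hypothetical input standing for the number of pieces the trainer could produce, and the class and method names are likewise hypothetical.

```java
final class VocabLimitDemo {
    // Returns the vocabulary size to use, or fails when a hard limit
    // cannot be met. achievableSize is a hypothetical input.
    static int resolveVocabSize(int vocabSize, int achievableSize, boolean hardVocabLimit) {
        if (achievableSize >= vocabSize) {
            return vocabSize;
        }
        if (hardVocabLimit) {
            throw new IllegalStateException(
                    "cannot produce a vocabulary of size " + vocabSize);
        }
        // Soft limit: settle for the smaller vocabulary.
        return achievableSize;
    }

    public static void main(String[] args) {
        System.out.println(resolveVocabSize(8000, 500, false));
    }
}
```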

hasUseAllVocab

boolean hasUseAllVocab()

 Uses all symbols for vocabulary extraction. This flag is valid
 if the model type is either CHAR or WORD.

optional bool use_all_vocab = 34 [default = false];

Returns:: Whether the useAllVocab field is set.

getUseAllVocab

boolean getUseAllVocab()

 Uses all symbols for vocabulary extraction. This flag is valid
 if the model type is either CHAR or WORD.

optional bool use_all_vocab = 34 [default = false];

Returns:: The useAllVocab.

hasUnkId

boolean hasUnkId()

 Reserved special meta tokens.
 * An ID of -1 means the token is not used.
 * unk_id must not be -1.
 IDs must start at 0 and be contiguous.

optional int32 unk_id = 40 [default = 0];

Returns:: Whether the unkId field is set.

getUnkId

int getUnkId()

 Reserved special meta tokens.
 * An ID of -1 means the token is not used.
 * unk_id must not be -1.
 IDs must start at 0 and be contiguous.

optional int32 unk_id = 40 [default = 0];

Returns:: The unkId.

hasBosId
```
boolean hasBosId()
```
```
 ID of the beginning-of-sentence piece, <s>.
```
optional int32 bos_id = 41 [default = 1];
Returns:

Whether the bosId field is set.

getBosId
```
int getBosId()
```
```
 ID of the beginning-of-sentence piece, <s>.
```
optional int32 bos_id = 41 [default = 1];
Returns:

The bosId.

hasEosId
```
boolean hasEosId()
```
```
 ID of the end-of-sentence piece, </s>.
```
optional int32 eos_id = 42 [default = 2];
Returns:

Whether the eosId field is set.

getEosId
```
int getEosId()
```
```
 ID of the end-of-sentence piece, </s>.
```
optional int32 eos_id = 42 [default = 2];
Returns:

The eosId.

hasPadId
```
boolean hasPadId()
```
```
 ID of the padding piece, <pad>. The default of -1 means padding is not used.
```
optional int32 pad_id = 43 [default = -1];
Returns:

Whether the padId field is set.

getPadId
```
int getPadId()
```
```
 ID of the padding piece, <pad>. The default of -1 means padding is not used.
```
optional int32 pad_id = 43 [default = -1];
Returns:

The padId.
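
The rules for the reserved IDs (unk_id, bos_id, eos_id, pad_id) can be checked with a small helper. This sketch is one reading of those rules, not the library's own validation: it treats -1 as "not used", rejects an unused unk_id, and requires the remaining IDs to be exactly 0, 1, 2, and so on. The class and method names are hypothetical.

```java
import java.util.Arrays;

final class ReservedIdsDemo {
    // Returns true when unk_id is in use and the IDs that are in use
    // (those not equal to -1) are exactly 0..n-1 with no gaps or repeats.
    static boolean isValid(int unkId, int bosId, int eosId, int padId) {
        if (unkId == -1) {
            return false;
        }
        int[] used = Arrays.stream(new int[] {unkId, bosId, eosId, padId})
                .filter(id -> id != -1)
                .sorted()
                .toArray();
        for (int i = 0; i < used.length; i++) {
            if (used[i] != i) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // The declared defaults: unk_id = 0, bos_id = 1, eos_id = 2, pad_id = -1.
        System.out.println(isValid(0, 1, 2, -1));
    }
}
```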

hasUnkPiece
```
boolean hasUnkPiece()
```
optional string unk_piece = 45 [default = "<unk>"];

Returns:

Whether the unkPiece field is set.

getUnkPiece
```
java.lang.String getUnkPiece()
```
optional string unk_piece = 45 [default = "<unk>"];

Returns:

The unkPiece.

getUnkPieceBytes
```
com.google.protobuf.ByteString getUnkPieceBytes()
```
optional string unk_piece = 45 [default = "<unk>"];

Returns:

The bytes for unkPiece.

hasBosPiece
```
boolean hasBosPiece()
```
optional string bos_piece = 46 [default = "<s>"];

Returns:

Whether the bosPiece field is set.

getBosPiece
```
java.lang.String getBosPiece()
```
optional string bos_piece = 46 [default = "<s>"];

Returns:

The bosPiece.

getBosPieceBytes
```
com.google.protobuf.ByteString getBosPieceBytes()
```
optional string bos_piece = 46 [default = "<s>"];

Returns:

The bytes for bosPiece.

hasEosPiece
```
boolean hasEosPiece()
```
optional string eos_piece = 47 [default = "</s>"];

Returns:

Whether the eosPiece field is set.

getEosPiece
```
java.lang.String getEosPiece()
```
optional string eos_piece = 47 [default = "</s>"];

Returns:

The eosPiece.

getEosPieceBytes
```
com.google.protobuf.ByteString getEosPieceBytes()
```
optional string eos_piece = 47 [default = "</s>"];

Returns:

The bytes for eosPiece.

hasPadPiece
```
boolean hasPadPiece()
```
optional string pad_piece = 48 [default = "<pad>"];

Returns:

Whether the padPiece field is set.

getPadPiece
```
java.lang.String getPadPiece()
```
optional string pad_piece = 48 [default = "<pad>"];

Returns:

The padPiece.

getPadPieceBytes
```
com.google.protobuf.ByteString getPadPieceBytes()
```
optional string pad_piece = 48 [default = "<pad>"];

Returns:

The bytes for padPiece.

hasUnkSurface

boolean hasUnkSurface()

 Encodes <unk> as U+2047 (DOUBLE QUESTION MARK),
 since this character is useful to both users and
 developers: it makes it easy to see that <unk> was emitted.

optional string unk_surface = 44 [default = " \342\201\207 "];

Returns:: Whether the unkSurface field is set.

getUnkSurface

java.lang.String getUnkSurface()

 Encodes <unk> as U+2047 (DOUBLE QUESTION MARK),
 since this character is useful to both users and
 developers: it makes it easy to see that <unk> was emitted.

optional string unk_surface = 44 [default = " \342\201\207 "];

Returns:: The unkSurface.

getUnkSurfaceBytes

com.google.protobuf.ByteString getUnkSurfaceBytes()

 Encodes <unk> as U+2047 (DOUBLE QUESTION MARK),
 since this character is useful to both users and
 developers: it makes it easy to see that <unk> was emitted.

optional string unk_surface = 44 [default = " \342\201\207 "];

Returns:: The bytes for unkSurface.
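
The default value is written in the field declaration as the octal escapes `\342\201\207` surrounded by spaces. The sketch below decodes those three bytes as UTF-8 to show that they spell U+2047; the class and method names are hypothetical.

```java
import java.nio.charset.StandardCharsets;

final class UnkSurfaceDemo {
    // The bytes written as the octal escapes \342\201\207 in the default value.
    static final byte[] ESCAPED = {(byte) 0342, (byte) 0201, (byte) 0207};

    // Rebuilds the default surface: a space, U+2047, and another space.
    static String defaultUnkSurface() {
        return " " + new String(ESCAPED, StandardCharsets.UTF_8) + " ";
    }

    public static void main(String[] args) {
        String surface = defaultUnkSurface();
        System.out.println(String.format("U+%04X", surface.codePointAt(1)));
    }
}
```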

hasTrainExtremelyLargeCorpus

boolean hasTrainExtremelyLargeCorpus()

 Increases bit depth to allow unigram model training on large
 (>10M sentences) corpora. A side effect of enabling this flag
 is increased memory usage.

optional bool train_extremely_large_corpus = 49 [default = false];

Returns:: Whether the trainExtremelyLargeCorpus field is set.

getTrainExtremelyLargeCorpus

boolean getTrainExtremelyLargeCorpus()

 Increases bit depth to allow unigram model training on large
 (>10M sentences) corpora. A side effect of enabling this flag
 is increased memory usage.

optional bool train_extremely_large_corpus = 49 [default = false];

Returns:: The trainExtremelyLargeCorpus.
