Package com.yahoo.language.sentencepiece
Class SentencePieceEncoder
- java.lang.Object
-
- com.yahoo.language.sentencepiece.SentencePieceEncoder
-
-
Nested Class Summary
Nested Classes
Modifier and Type: static class
Class: SentencePieceEncoder.Builder
-
Nested classes/interfaces inherited from interface com.yahoo.language.process.Encoder
Encoder.FailingEncoder
-
-
Field Summary
-
Fields inherited from interface com.yahoo.language.process.Encoder
throwsOnUse
-
-
Constructor Summary
Constructors
SentencePieceEncoder(SentencePieceConfig config)
SentencePieceEncoder(SentencePieceEncoder.Builder builder)
-
Method Summary
Methods (Modifier and Type, Method, Description)
java.util.List<java.lang.Integer> encode(java.lang.String rawInput, Language language)
    Segments the given text into token segments using the SentencePiece algorithm and returns the segment ids.
com.yahoo.tensor.Tensor encode(java.lang.String rawInput, Language language, com.yahoo.tensor.TensorType type)
    Encodes directly to a tensor.
java.lang.String normalize(java.lang.String s)
java.util.List<java.lang.String> segment(java.lang.String rawInput, Language language)
    Segments the given text into token segments using the SentencePiece algorithm.
-
-
-
Constructor Detail
-
SentencePieceEncoder
@Inject public SentencePieceEncoder(SentencePieceConfig config)
-
SentencePieceEncoder
public SentencePieceEncoder(SentencePieceEncoder.Builder builder)
-
-
Method Detail
-
segment
public java.util.List<java.lang.String> segment(java.lang.String rawInput, Language language)
Segments the given text into token segments using the SentencePiece algorithm.
- Specified by:
segment in interface Segmenter
- Parameters:
rawInput - the text to segment. Any sequence of characters in the Unicode Basic Multilingual Plane (BMP) is supported.
language - the model to use, or Language.UNKNOWN to use the default model if any
- Returns:
the list of zero or more tokens resulting from segmenting the input text
-
encode
public java.util.List<java.lang.Integer> encode(java.lang.String rawInput, Language language)
Segments the given text into token segments using the SentencePiece algorithm and returns the segment ids.
- Specified by:
encode in interface Encoder
- Parameters:
rawInput - the text to segment. Any sequence of characters in the Unicode Basic Multilingual Plane (BMP) is supported.
language - the model to use, or Language.UNKNOWN to use the default model if any
- Returns:
the list of zero or more token ids resulting from segmenting the input text
-
encode
public com.yahoo.tensor.Tensor encode(java.lang.String rawInput, Language language, com.yahoo.tensor.TensorType type)
Encodes directly to a tensor.
If the tensor type is indexed 1-d (bound or unbound), this returns a tensor containing the token ids in the order they were encountered in the text. If the dimension is bound and larger than the number of tokens, the tensor is zero padded; if it is smaller, the token ids are truncated.
If the tensor type is 1-d sparse, this returns a tensor containing the token strings as keys and the token positions as values.
For any other tensor type, an IllegalArgumentException is thrown.
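The padding, truncation, and sparse rules above can be illustrated with a minimal sketch that uses only the Java standard library. It does not call SentencePieceEncoder or the com.yahoo.tensor classes: the class TensorEncodingSketch and its methods toBoundIndexed and toSparse are hypothetical names introduced here, plain arrays and maps stand in for tensors, and the zero-based positions and the handling of repeated tokens are assumptions rather than documented behavior.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only: arrays and maps stand in for tensors.
final class TensorEncodingSketch {

    /** Indexed 1-d, bound case: token ids in text order, zero padded or truncated to size. */
    public static int[] toBoundIndexed(List<Integer> tokenIds, int size) {
        int[] cells = new int[size];              // new int arrays are zero-filled, which gives the padding
        int n = Math.min(size, tokenIds.size());  // ids beyond the bound are dropped (truncation)
        for (int i = 0; i < n; i++)
            cells[i] = tokenIds.get(i);
        return cells;
    }

    /** 1-d sparse case: token string as key, token position as value. */
    public static Map<String, Integer> toSparse(List<String> tokens) {
        Map<String, Integer> cells = new LinkedHashMap<>();
        for (int i = 0; i < tokens.size(); i++)
            cells.put(tokens.get(i), i);          // assumption: zero-based positions, a repeated token keeps its last position
        return cells;
    }

    public static void main(String[] args) {
        List<Integer> ids = List.of(7, 42, 3);    // made-up token ids
        System.out.println(Arrays.toString(toBoundIndexed(ids, 5))); // prints [7, 42, 3, 0, 0]
        System.out.println(Arrays.toString(toBoundIndexed(ids, 2))); // prints [7, 42]
        System.out.println(toSparse(List.of("hel", "lo")));          // prints {hel=0, lo=1}
    }
}
```

With three ids, a bound of 5 yields two trailing zeros and a bound of 2 keeps only the first two ids, which mirrors the "too large" and "too small" cases described for the indexed tensor type.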
-
normalize
public java.lang.String normalize(java.lang.String s)
-
-