java.lang.Object
- com.yahoo.language.sentencepiece.SentencePieceEncoder

All Implemented Interfaces:

Encoder, Segmenter
```
@Beta
public class SentencePieceEncoder
extends java.lang.Object
implements Segmenter, Encoder
```
Integration with https://github.com/google/sentencepiece through http://docs.djl.ai/extensions/sentencepiece/index.html. SentencePiece is a language-agnostic tokenizer for neural nets.

Author:

bratseth

Nested Class Summary

Nested Classes
Modifier and Type Class Description

static class SentencePieceEncoder.Builder
- Nested classes/interfaces inherited from interface com.yahoo.language.process.Encoder
  Encoder.FailingEncoder

Field Summary
- Fields inherited from interface com.yahoo.language.process.Encoder
  throwsOnUse

Constructor Summary

Constructors
Constructor	Description
`SentencePieceEncoder(SentencePieceConfig config)`
`SentencePieceEncoder(SentencePieceEncoder.Builder builder)`

Method Summary

All Methods Instance Methods Concrete Methods
Modifier and Type	Method	Description
`java.util.List<java.lang.Integer>`	`encode(java.lang.String rawInput, Language language)`	Segments the given text into token segments using the SentencePiece algorithm and returns the segment ids.
`com.yahoo.tensor.Tensor`	`encode(java.lang.String rawInput, Language language, com.yahoo.tensor.TensorType type)`	Encodes directly to a tensor.
`java.lang.String`	`normalize(java.lang.String s)`
`java.util.List<java.lang.String>`	`segment(java.lang.String rawInput, Language language)`	Segments the given text into token segments using the SentencePiece algorithm.

Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait

- Constructor Detail
  - SentencePieceEncoder
```
@Inject
public SentencePieceEncoder(SentencePieceConfig config)
```
  - SentencePieceEncoder
```
public SentencePieceEncoder(SentencePieceEncoder.Builder builder)
```
- Method Detail
  - segment
```
public java.util.List<java.lang.String> segment(java.lang.String rawInput,
                                                Language language)
```
    Segments the given text into token segments using the SentencePiece algorithm.
    
    Specified by:
    
    segment in interface Segmenter
    
    Parameters:
    
    rawInput - the text to segment. Any sequence of characters in the Unicode Basic Multilingual Plane (BMP) is supported.
    
    language - the model to use, or Language.UNKNOWN to use the default model if any
    
    Returns:
    
    the list of zero or more tokens resulting from segmenting the input text
  - encode
```
public java.util.List<java.lang.Integer> encode(java.lang.String rawInput,
                                                Language language)
```
    Segments the given text into token segments using the SentencePiece algorithm and returns the segment ids.
    
    Specified by:
    
    encode in interface Encoder
    
    Parameters:
    
    rawInput - the text to segment. Any sequence of characters in the Unicode Basic Multilingual Plane (BMP) is supported.
    
    language - the model to use, or Language.UNKNOWN to use the default model if any
    
    Returns:
    
    the list of zero or more token ids resulting from segmenting the input text
  - encode
```
public com.yahoo.tensor.Tensor encode(java.lang.String rawInput,
                                      Language language,
                                      com.yahoo.tensor.TensorType type)
```
    Encodes directly to a tensor.
    
    If the tensor type is indexed 1-d (bound or unbound), this returns a tensor containing the token ids in the order they were encountered in the text. If the dimension is bound and larger than the number of tokens, the remaining cells are zero-padded; if it is smaller, the token ids are truncated.
    
    If the tensor type is 1-d sparse, this returns a tensor containing the token strings as keys and the token positions as values.
    
    If the tensor is of any other type, an IllegalArgumentException is thrown.
    
    Specified by:
    
    encode in interface Encoder
    
    Parameters:
    
    rawInput - the text to encode
    
    language - the language of the text, or UNKNOWN to use language-independent encoding
    
    type - the type of the tensor to be returned
    
    Returns:
    
    the text encoded into a tensor of the supplied type
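    
    The padding, truncation and sparse-key rules described for this method can be illustrated without any Vespa dependency. The following is a minimal sketch only, not the class's implementation: the class TensorEncodingSketch, its two helper methods, and the token ids and strings are all hypothetical, and plain Java arrays and maps stand in for com.yahoo.tensor.Tensor. How a repeated token is handled in the sparse case is not stated above, so the sketch simply assumes the last position wins.
```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration class; not part of com.yahoo.language.sentencepiece.
public final class TensorEncodingSketch {

    /** Bound indexed 1-d case: token ids in text order, zero-padded or truncated to the bound size. */
    static int[] toBoundIndexed(List<Integer> tokenIds, int boundSize) {
        int[] cells = new int[boundSize];             // Java zero-initializes, which provides the padding
        int n = Math.min(tokenIds.size(), boundSize); // ids beyond the bound are dropped (truncation)
        for (int i = 0; i < n; i++)
            cells[i] = tokenIds.get(i);
        return cells;
    }

    /** Sparse 1-d case: token string as key, token position as value. */
    static Map<String, Integer> toMapped(List<String> tokens) {
        Map<String, Integer> cells = new LinkedHashMap<>();
        for (int i = 0; i < tokens.size(); i++)
            cells.put(tokens.get(i), i);              // assumption: a repeated token keeps its last position
        return cells;
    }

    public static void main(String[] args) {
        List<Integer> ids = List.of(5, 9, 2);         // made-up token ids
        System.out.println(Arrays.toString(toBoundIndexed(ids, 5))); // bound larger than input: padded
        System.out.println(Arrays.toString(toBoundIndexed(ids, 2))); // bound smaller than input: truncated
        System.out.println(toMapped(List.of("hello", "world")));     // made-up token strings
    }
}
```
    With a bound of 5 the three ids are followed by two zeros, and with a bound of 2 only the first two ids are kept. An unbound indexed dimension would need neither step, since the tensor can simply grow to the number of tokens.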
  - normalize
```
public java.lang.String normalize(java.lang.String s)
```
