Class SentencePieceEncoder

  • All Implemented Interfaces:
    Encoder, Segmenter

    @Beta
    public class SentencePieceEncoder
    extends java.lang.Object
    implements Segmenter, Encoder
    Integration with https://github.com/google/sentencepiece through http://docs.djl.ai/extensions/sentencepiece/index.html. SentencePiece is a language-agnostic tokenizer for neural nets.
    Author:
    bratseth
    • Method Summary

      Modifier and Type Method Description
      java.util.List<java.lang.Integer> encode(java.lang.String rawInput, Language language)
      Segments the given text into token segments using the SentencePiece algorithm and returns the segment ids.
      com.yahoo.tensor.Tensor encode(java.lang.String rawInput, Language language, com.yahoo.tensor.TensorType type)
      Encodes directly to a tensor.
      java.lang.String normalize(java.lang.String s)
      java.util.List<java.lang.String> segment(java.lang.String rawInput, Language language)
      Segments the given text into token segments using the SentencePiece algorithm.
      • Methods inherited from class java.lang.Object

        clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
    • Method Detail

      • segment

        public java.util.List<java.lang.String> segment(java.lang.String rawInput,
                                                        Language language)
        Segments the given text into token segments using the SentencePiece algorithm.
        Specified by:
        segment in interface Segmenter
        Parameters:
        rawInput - the text to segment. Any sequence of characters in the Unicode Basic Multilingual Plane (BMP) is supported.
        language - the model to use, or Language.UNKNOWN to use the default model if any
        Returns:
        the list of zero or more tokens resulting from segmenting the input text
      • encode

        public java.util.List<java.lang.Integer> encode(java.lang.String rawInput,
                                                        Language language)
        Segments the given text into token segments using the SentencePiece algorithm and returns the segment ids.
        Specified by:
        encode in interface Encoder
        Parameters:
        rawInput - the text to segment. Any sequence of characters in the Unicode Basic Multilingual Plane (BMP) is supported.
        language - the model to use, or Language.UNKNOWN to use the default model if any
        Returns:
        the list of zero or more token ids resulting from segmenting the input text
      • encode

        public com.yahoo.tensor.Tensor encode(java.lang.String rawInput,
                                              Language language,
                                              com.yahoo.tensor.TensorType type)

        Encodes directly to a tensor.

        If the tensor type is indexed 1-d (bound or unbound), this returns a tensor containing the token ids in the order they were encountered in the text. If the dimension is bound and larger than the number of tokens, the tensor is zero-padded; if it is smaller, the token ids are truncated.

        If the tensor type is 1-d sparse, this returns a tensor with the token strings as keys and the token positions as values.

        If the tensor type is anything else, an IllegalArgumentException is thrown.

        Specified by:
        encode in interface Encoder
        Parameters:
        rawInput - the text to encode
        language - the language of the text, or UNKNOWN to use language-independent encoding
        type - the type of the tensor to be returned
        Returns:
        the text encoded into a tensor of the supplied type
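        The padding, truncation and sparse-key rules described for this method can be illustrated with a small standalone sketch. It does not use the Vespa or SentencePiece classes: the class TensorEncodingSketch, its methods toBoundIndexed and toSparse, and the sample token ids and strings are all hypothetical names chosen for illustration. Zero-based positions and "last position wins" for repeated tokens are assumptions, since the description above does not specify them.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only; this is NOT the SentencePieceEncoder implementation.
final class TensorEncodingSketch {

    // Bound indexed 1-d case: fit the token ids into exactly `size` cells.
    // Missing cells stay 0 (zero padding); surplus token ids are dropped (truncation).
    static int[] toBoundIndexed(List<Integer> tokenIds, int size) {
        int[] cells = new int[size];                  // Java zero-initializes int arrays
        int n = Math.min(size, tokenIds.size());
        for (int i = 0; i < n; i++) {
            cells[i] = tokenIds.get(i);
        }
        return cells;
    }

    // Sparse 1-d case: token string as key, token position as value.
    // Assumptions: positions are zero-based, and a repeated token keeps its last position.
    static Map<String, Integer> toSparse(List<String> tokens) {
        Map<String, Integer> cells = new LinkedHashMap<>();
        for (int i = 0; i < tokens.size(); i++) {
            cells.put(tokens.get(i), i);
        }
        return cells;
    }

    public static void main(String[] args) {
        List<Integer> ids = List.of(101, 7, 42);      // made-up token ids
        System.out.println(Arrays.toString(toBoundIndexed(ids, 5))); // [101, 7, 42, 0, 0]
        System.out.println(Arrays.toString(toBoundIndexed(ids, 2))); // [101, 7]
        System.out.println(toSparse(List.of("he", "llo")));          // {he=0, llo=1}
    }
}
```

        With a real encoder, the token ids and token strings would come from encode(rawInput, language) and segment(rawInput, language) respectively, and the result would be a com.yahoo.tensor.Tensor rather than an array or map.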
      • normalize

        public java.lang.String normalize(java.lang.String s)