Class

com.johnsnowlabs.nlp.annotators.tokenizer.bpe

Gpt2Tokenizer

Related Doc: package bpe


class Gpt2Tokenizer extends BpeTokenizer

Linear Supertypes
BpeTokenizer, AnyRef, Any

Instance Constructors

  1. new Gpt2Tokenizer(merges: Map[(String, String), Int], vocab: Map[String, Int], specialTokens: SpecialTokens, padWithSentenceTokens: Boolean = true, prependString: String = "")


Value Members

  1. final def !=(arg0: Any): Boolean

    Definition Classes
    AnyRef → Any
  2. final def ##(): Int

    Definition Classes
    AnyRef → Any
  3. final def ==(arg0: Any): Boolean

    Definition Classes
    AnyRef → Any
  4. val appendForPieceId: Option[String]

    Attributes
    protected
    Definition Classes
    BpeTokenizer
  5. final def asInstanceOf[T0]: T0

    Definition Classes
    Any
  6. def bpe(indToken: IndexedToken): Array[TokenPiece]

    Runs the BPE algorithm. The goal is to represent the token by the largest subwords available in the known vocabulary. If the token as a whole is not known, it is split into smaller subwords until all of them are known.

    returns

    Array of TokenPieces corresponding to the encoded token

    Attributes
    protected
    Definition Classes
    BpeTokenizer
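
    The description above can be illustrated with a minimal, self-contained sketch of the standard greedy BPE merge loop. It is not the library's implementation: the object `BpeSketch` and its helpers are hypothetical, it works on plain strings instead of `IndexedToken` and `TokenPiece`, and it assumes that a lower rank means an earlier merge.

    ```scala
    object BpeSketch {
      // Adjacent symbol pairs of the current word.
      def pairs(word: Vector[String]): Vector[(String, String)] =
        word.zip(word.drop(1))

      // Repeatedly merge the adjacent pair with the lowest rank
      // until no ranked pair is left in the word.
      def bpe(token: String, ranks: Map[(String, String), Int]): Vector[String] = {
        var word = token.toVector.map(_.toString) // start from single characters
        var merging = true
        while (merging) {
          val ranked = pairs(word).filter(ranks.contains)
          if (ranked.isEmpty) merging = false
          else {
            val best = ranked.minBy(ranks)
            val merged = Vector.newBuilder[String]
            var i = 0
            while (i < word.length) {
              if (i < word.length - 1 && word(i) == best._1 && word(i + 1) == best._2) {
                merged += (best._1 + best._2) // fuse the pair into one symbol
                i += 2
              } else {
                merged += word(i)
                i += 1
              }
            }
            word = merged.result()
          }
        }
        word
      }
    }
    ```

    With the hypothetical ranks `("l","o") -> 0`, `("lo","w") -> 1` and `("e","r") -> 2`, the token `lower` is merged step by step into the pieces `low` and `er`.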
  7. val bpeRanks: Map[(String, String), Int]

    Attributes
    protected
    Definition Classes
    BpeTokenizer
  8. val cache: Map[String, Array[String]]

    Cache for already encoded tokens.

    Attributes
    protected
    Definition Classes
    BpeTokenizer
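
    As a rough illustration of why such a cache helps, the sketch below memoizes an encoding function by token text. The class `CachedEncoder` and the `misses` counter are hypothetical, and whether the library backs `cache` with a mutable map is an assumption.

    ```scala
    import scala.collection.mutable

    object CacheSketch {
      // Wraps an encoding function with a cache keyed by the token text.
      class CachedEncoder(encode: String => Array[String]) {
        private val cache = mutable.Map.empty[String, Array[String]]
        var misses = 0 // counts how often the wrapped function actually ran

        def apply(token: String): Array[String] =
          cache.getOrElseUpdate(token, { misses += 1; encode(token) })
      }
    }
    ```

    Encoding the same token twice runs the wrapped function only once; the second call is answered from the map.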
  9. def clone(): AnyRef

    Attributes
    protected[java.lang]
    Definition Classes
    AnyRef
    Annotations
    @throws( ... )
  10. def decodeTokens(tokens: Array[Int]): String

  11. def encode(indTokens: Array[IndexedToken]): Array[TokenPiece]

    Definition Classes
    BpeTokenizer
  12. def encode(indToken: IndexedToken): Array[TokenPiece]

    Definition Classes
    BpeTokenizer
  13. final def eq(arg0: AnyRef): Boolean

    Definition Classes
    AnyRef
  14. def equals(arg0: Any): Boolean

    Definition Classes
    AnyRef → Any
  15. def finalize(): Unit

    Attributes
    protected[java.lang]
    Definition Classes
    AnyRef
    Annotations
    @throws( classOf[java.lang.Throwable] )
  16. def getBpeRanking: ((String, String)) ⇒ Int

    Rankings for the byte pairs, derived from merges.txt.

    Attributes
    protected
    Definition Classes
    BpeTokenizer
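
    A minimal sketch of how a rank map of the documented type `Map[(String, String), Int]` could be derived from merges.txt-style lines. The names `parseMerges` and `ranking` are hypothetical, the line index is assumed to be the rank, and returning `Int.MaxValue` for unknown pairs is an assumption rather than documented behaviour.

    ```scala
    object MergesSketch {
      // Parse lines of the form "left right" into a rank map.
      // A leading "#version" header line, if present, is skipped.
      def parseMerges(lines: Seq[String]): Map[(String, String), Int] =
        lines
          .map(_.trim)
          .filter(_.nonEmpty)
          .filterNot(_.startsWith("#version"))
          .map(_.split(" "))
          .collect { case Array(left, right) => (left, right) }
          .zipWithIndex
          .toMap

      // Rank lookup with the same shape as getBpeRanking;
      // unknown pairs sort last (assumption).
      def ranking(ranks: Map[(String, String), Int]): ((String, String)) => Int =
        pair => ranks.getOrElse(pair, Int.MaxValue)
    }
    ```

    Pairs that appear earlier in the file get lower ranks and are therefore merged first by a greedy BPE loop.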
  17. def getBytePairs(word: Array[String]): Array[(String, String)]

    Creates the sequence of byte pairs of the word.

    Attributes
    protected
    Definition Classes
    BpeTokenizer
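
    For illustration, pairing each symbol with its right-hand neighbour can be sketched as follows. `bytePairs` is a hypothetical stand-in with the same parameter and return types as the documented method; the library's own implementation may differ.

    ```scala
    object BytePairsSketch {
      // Pair every symbol with the symbol that follows it.
      // Words with fewer than two symbols yield no pairs.
      def bytePairs(word: Array[String]): Array[(String, String)] =
        word.zip(word.drop(1))
    }
    ```

    For the symbols `l`, `o`, `w` this yields the pairs `(l, o)` and `(o, w)`.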
  18. final def getClass(): Class[_]

    Definition Classes
    AnyRef → Any
  19. def getTokenPieces(indToken: IndexedToken, word: Array[String], processedToken: String): Array[TokenPiece]

    Attributes
    protected
    Definition Classes
    BpeTokenizer
  20. def hashCode(): Int

    Definition Classes
    AnyRef → Any
  21. final def isInstanceOf[T0]: Boolean

    Definition Classes
    Any
  22. val merges: Map[(String, String), Int]

    Definition Classes
    BpeTokenizer
  23. final def ne(arg0: AnyRef): Boolean

    Definition Classes
    AnyRef
  24. final def notify(): Unit

    Definition Classes
    AnyRef
  25. final def notifyAll(): Unit

    Definition Classes
    AnyRef
  26. var padWithSentenceTokens: Boolean

    Definition Classes
    BpeTokenizer
  27. def performMerges(wordChars: Array[String], charPairs: Array[(String, String)]): Array[String]

    Attributes
    protected
    Definition Classes
    BpeTokenizer
  28. def preProcessTokenForBpe(token: String): String

    Definition Classes
    Gpt2Tokenizer → BpeTokenizer
  29. val prependForPieceId: Option[String]

    Definition Classes
    Gpt2Tokenizer → BpeTokenizer
  30. val sentencePadding: (String, String)

    Special tokens of the model for processing

    Definition Classes
    BpeTokenizer
  31. val specialTokens: SpecialTokens

    Definition Classes
    BpeTokenizer
  32. def splitOnSpecialToken(specialToken: SpecialToken, text: String): ListBuffer[String]

    Splits the individual sub-texts on special tokens, e.g. masking tokens.

    Attributes
    protected
    Definition Classes
    BpeTokenizer
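
    A minimal sketch of splitting on a special token, under stated assumptions: the special token is passed as a plain `String` rather than the library's `SpecialToken` type, the token is kept as its own element, and surrounding whitespace is left untouched. `splitOn` is a hypothetical name.

    ```scala
    import scala.collection.mutable.ListBuffer

    object SpecialTokenSplitSketch {
      // Split text around every occurrence of the special token,
      // keeping the token itself as a separate element.
      def splitOn(specialToken: String, text: String): ListBuffer[String] = {
        val parts = ListBuffer.empty[String]
        var rest = text
        var idx = rest.indexOf(specialToken)
        // The nonEmpty guard avoids an endless loop on an empty token.
        while (specialToken.nonEmpty && idx >= 0) {
          if (idx > 0) parts += rest.substring(0, idx)
          parts += specialToken
          rest = rest.substring(idx + specialToken.length)
          idx = rest.indexOf(specialToken)
        }
        if (rest.nonEmpty) parts += rest
        parts
      }
    }
    ```

    Keeping the special token as a separate element lets a later step map it directly to its id instead of running BPE over it.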
  33. val splitPattern: Regex

  34. final def synchronized[T0](arg0: ⇒ T0): T0

    Definition Classes
    AnyRef
  35. def toString(): String

    Definition Classes
    AnyRef → Any
  36. def tokenize(sentence: Sentence): Array[IndexedToken]

    Tokenizes the sentence, taking special tokens and the split algorithm into account.

    Definition Classes
    BpeTokenizer
  37. def tokenizeSubText(text: String, indexOffset: Int): Array[IndexedToken]

    Needs to be implemented

    Definition Classes
    Gpt2Tokenizer → BpeTokenizer
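
    This page does not state how the override splits text. As a sketch only, the regular expression published with the original GPT-2 encoder can be applied as shown below; that `splitPattern` holds the same expression is an assumption. `splitWithOffsets` is a hypothetical name, and it returns `(text, begin, end)` triples with an exclusive end instead of `IndexedToken` values.

    ```scala
    object Gpt2SplitSketch {
      // Pattern from the original GPT-2 encoder (assumed, not confirmed,
      // to match the documented splitPattern).
      val splitPattern =
        """'s|'t|'re|'ve|'m|'ll|'d| ?\p{L}+| ?\p{N}+| ?[^\s\p{L}\p{N}]+|\s+(?!\S)|\s+""".r

      // Returns (text, begin, end) triples; offsets are shifted by
      // indexOffset and end is exclusive.
      def splitWithOffsets(text: String, indexOffset: Int): Array[(String, Int, Int)] =
        splitPattern
          .findAllMatchIn(text)
          .map(m => (m.matched, m.start + indexOffset, m.end + indexOffset))
          .toArray
    }
    ```

    Note that a leading space stays attached to the word that follows it, which is why ` world` is a single piece when splitting `Hello world!`.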
  38. val vocab: Map[String, Int]

    Definition Classes
    BpeTokenizer
  39. final def wait(): Unit

    Definition Classes
    AnyRef
    Annotations
    @throws( ... )
  40. final def wait(arg0: Long, arg1: Int): Unit

    Definition Classes
    AnyRef
    Annotations
    @throws( ... )
  41. final def wait(arg0: Long): Unit

    Definition Classes
    AnyRef
    Annotations
    @throws( ... )
