WordpieceTokenizer (Deep Java Library 0.13.0 API specification)

java.lang.Object
- ai.djl.modality.nlp.preprocess.SimpleTokenizer
- - ai.djl.modality.nlp.bert.WordpieceTokenizer

All Implemented Interfaces:

TextProcessor, Tokenizer
```
public class WordpieceTokenizer
extends SimpleTokenizer
```
WordpieceTokenizer tokenizes a piece of text into its word pieces.
This uses a greedy longest-match-first algorithm to perform tokenization using the given vocabulary. The input text should already be cleaned and preprocessed.
```
 jshell> String input = "unaffable";
 jshell> wordpieceTokenizer.tokenize(input);
 ["un", "##aff", "##able"]
 
```
Reference implementation: Google Research Bert Tokenizer
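
The greedy longest-match-first strategy described above can be sketched for a single word as follows. This is an illustrative, standalone approximation modeled on the reference BERT tokenizer, not the DJL source: the class `WordpieceSketch`, the method `tokenizeWord`, and the use of a plain `Set<String>` in place of `DefaultVocabulary` are all assumptions made for the example. It also assumes that a word longer than `maxInputChars`, or a word with a stretch that matches no vocabulary entry, is replaced by the unknown token.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical standalone sketch; not part of the DJL API.
public class WordpieceSketch {

    // Splits one word into word pieces using greedy longest-match-first lookup.
    static List<String> tokenizeWord(
            String word, Set<String> vocab, String unknown, int maxInputChars) {
        List<String> pieces = new ArrayList<>();
        // Assumption: over-long words map to the unknown token.
        if (word.length() > maxInputChars) {
            pieces.add(unknown);
            return pieces;
        }
        int start = 0;
        while (start < word.length()) {
            int end = word.length();
            String match = null;
            // Try the longest remaining substring first, then shrink from the right.
            while (start < end) {
                String candidate = word.substring(start, end);
                if (start > 0) {
                    // Pieces that continue a word carry the "##" prefix.
                    candidate = "##" + candidate;
                }
                if (vocab.contains(candidate)) {
                    match = candidate;
                    break;
                }
                end--;
            }
            if (match == null) {
                // No vocabulary entry covers this position: the whole word is unknown.
                pieces.clear();
                pieces.add(unknown);
                return pieces;
            }
            pieces.add(match);
            start = end;
        }
        return pieces;
    }

    public static void main(String[] args) {
        Set<String> vocab = new HashSet<>(Arrays.asList("un", "##aff", "##able"));
        // Prints [un, ##aff, ##able]
        System.out.println(tokenizeWord("unaffable", vocab, "[UNK]", 200));
    }
}
```

With the three-entry vocabulary above, the sketch reproduces the pieces shown in the jshell example. A word with no matching pieces, such as "xyz", comes back as the single unknown token.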

Constructor Summary

Constructors
Constructor and Description
`WordpieceTokenizer(DefaultVocabulary vocabulary, java.lang.String unknown, int maxInputChars)` Creates an instance of `WordpieceTokenizer`.

Method Summary

All Methods Instance Methods Concrete Methods
Modifier and Type	Method and Description
`java.util.List<java.lang.String>`	`tokenize(java.lang.String sentence)` Breaks down the given sentence into a list of tokens that can be represented by embeddings.

Methods inherited from class ai.djl.modality.nlp.preprocess.SimpleTokenizer
buildSentence

Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait

Methods inherited from interface ai.djl.modality.nlp.preprocess.Tokenizer
preprocess

- Constructor Detail
  - WordpieceTokenizer
```
public WordpieceTokenizer(DefaultVocabulary vocabulary,
                          java.lang.String unknown,
                          int maxInputChars)
```
    Creates an instance of WordpieceTokenizer.
    
    Parameters:
    
    vocabulary - a DefaultVocabulary used for wordpiece tokenization
    
    unknown - the String that represents the unknown token
    
    maxInputChars - maximum number of input characters
- Method Detail
  - tokenize
```
public java.util.List<java.lang.String> tokenize(java.lang.String sentence)
```
    Breaks down the given sentence into a list of tokens that can be represented by embeddings.
    
    Specified by:
    
    tokenize in interface Tokenizer
    
    Overrides:
    
    tokenize in class SimpleTokenizer
    
    Parameters:
    
    sentence - the sentence to tokenize
    
    Returns:
    
    a List of tokens
