BertFullTokenizer (Deep Java Library 0.8.0 API specification)


java.lang.Object
- ai.djl.modality.nlp.preprocess.SimpleTokenizer
- - ai.djl.modality.nlp.bert.BertFullTokenizer

All Implemented Interfaces:

TextProcessor, Tokenizer
```
public class BertFullTokenizer
extends SimpleTokenizer
```
BertFullTokenizer runs end-to-end tokenization of input text.
It first runs basic preprocessors to clean the input text, then runs WordpieceTokenizer to split each word into word pieces.
Reference implementation: Google Research BERT Tokenizer
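
The word-piece step can be illustrated with a small standalone sketch. It is not DJL's WordpieceTokenizer; it only mirrors the greedy longest-match-first idea from the Google Research reference implementation. The class name `WordPieceSketch`, the method `split`, and the `##` continuation prefix handling shown here are illustrative assumptions.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Set;

// Illustrative only: a hypothetical helper, not part of the DJL API.
final class WordPieceSketch {
    private WordPieceSketch() {}

    /**
     * Splits a single whitespace-free word into word pieces by repeatedly
     * taking the longest vocabulary entry that matches at the current offset.
     * Pieces after the first carry the "##" continuation prefix.
     */
    static List<String> split(String word, Set<String> vocab, String unknown) {
        List<String> pieces = new ArrayList<>();
        int start = 0;
        while (start < word.length()) {
            int end = word.length();
            String match = null;
            // Shrink the candidate from the right until it is in the vocabulary.
            while (start < end) {
                String candidate = word.substring(start, end);
                if (start > 0) {
                    candidate = "##" + candidate;
                }
                if (vocab.contains(candidate)) {
                    match = candidate;
                    break;
                }
                end--;
            }
            if (match == null) {
                // No piece matches here: the whole word becomes the unknown token.
                return Collections.singletonList(unknown);
            }
            pieces.add(match);
            start = end;
        }
        return pieces;
    }
}
```

With a vocabulary containing `un`, `##aff` and `##able`, calling `WordPieceSketch.split("unaffable", vocab, "[UNK]")` yields the pieces `un`, `##aff`, `##able`; a word with no matching pieces yields only the unknown token.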

Constructor Summary

Constructors
Constructor and Description

BertFullTokenizer(java.lang.String filepath, boolean lowerCase)
Creates an instance of BertFullTokenizer.

Method Summary

Modifier and Type	Method and Description
`static java.util.List<TextProcessor>`	`getPreprocessors(boolean lowerCase)` Gets a list of `TextProcessor`s to process input text for BERT models.
`SimpleVocabulary`	`getVocabulary()` Returns the `SimpleVocabulary` used for tokenization.
`java.util.List<java.lang.String>`	`tokenize(java.lang.String input)` Breaks down the given sentence into a list of tokens that can be represented by embeddings.

Methods inherited from class ai.djl.modality.nlp.preprocess.SimpleTokenizer
buildSentence

Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait

Methods inherited from interface ai.djl.modality.nlp.preprocess.Tokenizer
preprocess

- Constructor Detail
  - BertFullTokenizer
```
public BertFullTokenizer(java.lang.String filepath,
                         boolean lowerCase)
```
    Creates an instance of BertFullTokenizer.
    
    Parameters:
    
    filepath - the path to the vocabulary file
    
    lowerCase - whether to convert tokens to lowercase
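
    This page does not describe the vocabulary file format. The original BERT release ships a `vocab.txt` with one token per line, where the line position serves as the token index. Assuming the file passed as `filepath` follows that layout, the mapping could be sketched as below. `VocabIndexSketch` and `index` are hypothetical names, not DJL classes or methods; the lines could be obtained with `java.nio.file.Files.readAllLines`.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustrative only: assumes a one-token-per-line vocabulary layout.
final class VocabIndexSketch {
    private VocabIndexSketch() {}

    /** Maps each token to the zero-based number of the line it appears on. */
    static Map<String, Integer> index(List<String> lines) {
        Map<String, Integer> tokenToIndex = new LinkedHashMap<>();
        for (int i = 0; i < lines.size(); i++) {
            // Keep the first occurrence if a token is repeated in the file.
            tokenToIndex.putIfAbsent(lines.get(i).trim(), i);
        }
        return tokenToIndex;
    }
}
```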
- Method Detail
  - getVocabulary
```
public SimpleVocabulary getVocabulary()
```
    Returns the SimpleVocabulary used for tokenization.
    
    Returns:
    
    the SimpleVocabulary used for tokenization
  - tokenize
```
public java.util.List<java.lang.String> tokenize(java.lang.String input)
```
    Breaks down the given sentence into a list of tokens that can be represented by embeddings.
    
    Specified by:
    
    tokenize in interface Tokenizer
    
    Overrides:
    
    tokenize in class SimpleTokenizer
    
    Parameters:
    
    input - the sentence to tokenize
    
    Returns:
    
    a List of tokens
  - getPreprocessors
```
public static java.util.List<TextProcessor> getPreprocessors(boolean lowerCase)
```
    Gets a list of TextProcessors to process input text for BERT models.
    
    Parameters:
    
    lowerCase - whether to convert input to lowercase
    
    Returns:
    
    a List of TextProcessors
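
    The idea of a preprocessor list whose contents depend on `lowerCase` can be sketched without DJL. The example below models each processor as a function from a token list to a token list and applies them in order. The concrete steps (trimming, dropping empty tokens, lowercasing) are assumptions chosen for illustration; the processors actually returned by `getPreprocessors` are not listed on this page. `PreprocessorChainSketch`, `steps` and `apply` are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.function.UnaryOperator;
import java.util.stream.Collectors;

// Illustrative only: stands in for a list of TextProcessor instances.
final class PreprocessorChainSketch {
    private PreprocessorChainSketch() {}

    /** Builds the ordered list of steps; lowercasing is added only on request. */
    static List<UnaryOperator<List<String>>> steps(boolean lowerCase) {
        List<UnaryOperator<List<String>>> steps = new ArrayList<>();
        // Step 1: trim surrounding whitespace and drop tokens that become empty.
        steps.add(tokens -> tokens.stream()
                .map(String::trim)
                .filter(t -> !t.isEmpty())
                .collect(Collectors.toList()));
        if (lowerCase) {
            // Step 2 (optional): convert every token to lowercase.
            steps.add(tokens -> tokens.stream()
                    .map(t -> t.toLowerCase(Locale.ROOT))
                    .collect(Collectors.toList()));
        }
        return steps;
    }

    /** Runs the tokens through every step in list order. */
    static List<String> apply(List<UnaryOperator<List<String>>> steps, List<String> tokens) {
        List<String> current = tokens;
        for (UnaryOperator<List<String>> step : steps) {
            current = step.apply(current);
        }
        return current;
    }
}
```

    Keeping the steps in a list makes the order explicit and lets a caller add or remove a step, such as lowercasing for an uncased model, without touching the others.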
