BertFullTokenizer (Deep Java Library 0.33.0 API specification)

java.lang.Object

ai.djl.modality.nlp.preprocess.SimpleTokenizer

ai.djl.modality.nlp.bert.BertTokenizer

ai.djl.modality.nlp.bert.BertFullTokenizer

All Implemented Interfaces: TextProcessor, Tokenizer

public class BertFullTokenizer extends BertTokenizer

BertFullTokenizer runs end-to-end tokenization of input text.

It first runs basic preprocessors to clean the input text, then runs WordpieceTokenizer to split each resulting token into word pieces.

Reference implementation: Google Research Bert Tokenizer
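The two stages described above can be illustrated with a small standalone sketch. It is not DJL's implementation: the class `WordPieceSketch`, its method names, the toy vocabulary, and the `[UNK]` fallback are assumptions made for illustration, and the cleaning stage is deliberately simplified (it works on `char` values and treats every non-alphanumeric character as punctuation). The second stage uses the greedy longest-match-first strategy of the reference BERT tokenizer, marking continuation pieces with `##`.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;

/** Illustrative sketch only; names and behavior are assumptions, not DJL's code. */
final class WordPieceSketch {

    private static final String UNKNOWN = "[UNK]";

    /** Stage 1: optionally lower-case, split on whitespace, and isolate punctuation. */
    static List<String> basicTokenize(String text, boolean lowerCase) {
        String cleaned = lowerCase ? text.toLowerCase(Locale.ROOT) : text;
        List<String> words = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < cleaned.length(); i++) {
            char c = cleaned.charAt(i);
            if (Character.isWhitespace(c)) {
                flush(current, words);
            } else if (!Character.isLetterOrDigit(c)) {
                // Simplification: every non-alphanumeric character becomes its own token.
                flush(current, words);
                words.add(String.valueOf(c));
            } else {
                current.append(c);
            }
        }
        flush(current, words);
        return words;
    }

    private static void flush(StringBuilder current, List<String> words) {
        if (current.length() > 0) {
            words.add(current.toString());
            current.setLength(0);
        }
    }

    /** Stage 2: greedy longest-match-first split of a single word into word pieces. */
    static List<String> wordPiece(String word, Set<String> vocab) {
        List<String> pieces = new ArrayList<>();
        int start = 0;
        while (start < word.length()) {
            int end = word.length();
            String match = null;
            while (start < end) {
                String candidate = word.substring(start, end);
                if (start > 0) {
                    candidate = "##" + candidate; // continuation piece
                }
                if (vocab.contains(candidate)) {
                    match = candidate;
                    break;
                }
                end--;
            }
            if (match == null) {
                // No piece matches at this position: the whole word is unknown.
                return List.of(UNKNOWN);
            }
            pieces.add(match);
            start = end;
        }
        return pieces;
    }

    /** Runs both stages in order. */
    static List<String> tokenize(String text, Set<String> vocab, boolean lowerCase) {
        List<String> tokens = new ArrayList<>();
        for (String word : basicTokenize(text, lowerCase)) {
            tokens.addAll(wordPiece(word, vocab));
        }
        return tokens;
    }

    public static void main(String[] args) {
        Set<String> vocab = Set.of("un", "##aff", "##able", "hello", "!");
        System.out.println(tokenize("Hello unaffable!", vocab, true));
        // prints [hello, un, ##aff, ##able, !]
    }
}
```

With the real class, the vocabulary would come from a Vocabulary instance passed to the constructor rather than from a plain `Set<String>`.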

Constructor Summary

| Constructor | Description |
| --- | --- |
| BertFullTokenizer(Vocabulary vocabulary, boolean lowerCase) | Creates an instance of BertFullTokenizer. |

Method Summary

| Modifier and Type | Method | Description |
| --- | --- | --- |
| String | buildSentence(List<String> tokens) | Combines a list of tokens to form a sentence. |
| static List<TextProcessor> | getPreprocessors(boolean lowerCase) | Gets a list of TextProcessors to process input text for BERT models. |
| Vocabulary | getVocabulary() | Returns the Vocabulary used for tokenization. |
| List<String> | tokenize(String input) | Breaks down the given sentence into a list of tokens that can be represented by embeddings. |

Methods inherited from class ai.djl.modality.nlp.bert.BertTokenizer
encode, encode, pad, tokenToString

Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait

Methods inherited from interface ai.djl.modality.nlp.preprocess.Tokenizer
preprocess

Constructor Details
- BertFullTokenizer
  
  public BertFullTokenizer(Vocabulary vocabulary, boolean lowerCase)
  
  Creates an instance of BertFullTokenizer.
  
  Parameters:
  
  vocabulary - the BERT vocabulary
  
  lowerCase - whether to convert tokens to lowercase
Method Details
- getVocabulary
  
  public Vocabulary getVocabulary()
  
  Returns the Vocabulary used for tokenization.
  
  Returns:
  
  the Vocabulary used for tokenization
- tokenize
  
  public List<String> tokenize(String input)
  
  Breaks down the given sentence into a list of tokens that can be represented by embeddings.
  
  Specified by:
  
  tokenize in interface Tokenizer
  
  Overrides:
  
  tokenize in class BertTokenizer
  
  Parameters:
  
  input - the sentence to tokenize
  
  Returns:
  
  a List of tokens
- buildSentence
  
  public String buildSentence(List<String> tokens)
  
  Combines a list of tokens to form a sentence.
  
  Specified by:
  
  buildSentence in interface Tokenizer
  
  Overrides:
  
  buildSentence in class SimpleTokenizer
  
  Parameters:
  
  tokens - the List of tokens
  
  Returns:
  
  the sentence built from the given tokens
- getPreprocessors
  
  public static List<TextProcessor> getPreprocessors(boolean lowerCase)
  
  Gets a list of TextProcessors to process input text for BERT models.
  
  Parameters:
  
  lowerCase - whether to convert input to lowercase
  
  Returns:
  
  a List of TextProcessors
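To make the idea of an ordered preprocessor list concrete, here is a minimal standalone sketch. It does not use DJL: each step is modeled as a `UnaryOperator<List<String>>`, a hypothetical stand-in for a TextProcessor that maps a token list to a token list, and the class `PreprocessorChainSketch` with its three steps (whitespace splitting, optional lower-casing, punctuation separation) is an assumption chosen for illustration. The processors that getPreprocessors actually returns may differ in number, order, and behavior.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.function.UnaryOperator;

/** Illustrative sketch only; the steps below are assumptions, not DJL's processors. */
final class PreprocessorChainSketch {

    /** Builds an ordered chain; lower-casing is included only when requested. */
    static List<UnaryOperator<List<String>>> preprocessors(boolean lowerCase) {
        List<UnaryOperator<List<String>>> chain = new ArrayList<>();
        chain.add(PreprocessorChainSketch::splitWhitespace);
        if (lowerCase) {
            chain.add(PreprocessorChainSketch::toLowerCase);
        }
        chain.add(PreprocessorChainSketch::separatePunctuation);
        return chain;
    }

    /** Feeds the text through every step in order. */
    static List<String> apply(List<UnaryOperator<List<String>>> chain, String text) {
        List<String> tokens = List.of(text);
        for (UnaryOperator<List<String>> step : chain) {
            tokens = step.apply(tokens);
        }
        return tokens;
    }

    private static List<String> splitWhitespace(List<String> tokens) {
        List<String> out = new ArrayList<>();
        for (String token : tokens) {
            for (String part : token.trim().split("\\s+")) {
                if (!part.isEmpty()) {
                    out.add(part);
                }
            }
        }
        return out;
    }

    private static List<String> toLowerCase(List<String> tokens) {
        List<String> out = new ArrayList<>();
        for (String token : tokens) {
            out.add(token.toLowerCase(Locale.ROOT));
        }
        return out;
    }

    private static List<String> separatePunctuation(List<String> tokens) {
        List<String> out = new ArrayList<>();
        for (String token : tokens) {
            StringBuilder word = new StringBuilder();
            for (char c : token.toCharArray()) {
                if (Character.isLetterOrDigit(c)) {
                    word.append(c);
                } else {
                    // Emit the pending word, then the punctuation mark as its own token.
                    if (word.length() > 0) {
                        out.add(word.toString());
                        word.setLength(0);
                    }
                    out.add(String.valueOf(c));
                }
            }
            if (word.length() > 0) {
                out.add(word.toString());
            }
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(apply(preprocessors(true), "  Hello, World! "));
        // prints [hello, ,, world, !]
    }
}
```

Keeping the steps in a list means the `lowerCase` flag only decides whether one step is present; the rest of the chain is unchanged, which matches how a cased and an uncased vocabulary can share the same remaining preprocessing.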