Class BertTokenizer

All Implemented Interfaces:
TextProcessor, Tokenizer
Direct Known Subclasses:
BertFullTokenizer

public class BertTokenizer extends SimpleTokenizer
BertTokenizer is a class that helps you encode a question and a paragraph as BERT input tokens.
  • Constructor Details

    • BertTokenizer

      public BertTokenizer()
  • Method Details

    • tokenize

      public List<String> tokenize(String input)
      Breaks down the given sentence into a list of tokens that can be represented by embeddings.
      Specified by:
      tokenize in interface Tokenizer
      Overrides:
      tokenize in class SimpleTokenizer
      Parameters:
      input - the sentence to tokenize
      Returns:
      a List of tokens
    • tokenToString

      public String tokenToString(List<String> tokens)
      Returns a string representation of the tokens.
      Parameters:
      tokens - a list of tokens
      Returns:
      a string representation of the tokens
    • pad

      public <E> List<E> pad(List<E> tokens, E padItem, int num)
      Pads the tokens to the required length.
      Type Parameters:
      E - the element type of the List
      Parameters:
      tokens - the input tokens
      padItem - the item appended at the end as padding
      num - the total length after padding
      Returns:
      a list of padded tokens
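
      As an illustration of the padding contract described above, here is a minimal standalone sketch. It is not the library's source: the class name PadSketch is hypothetical, and the choice to return an input that is already num elements or longer unchanged (no truncation) is an assumption.

```java
import java.util.ArrayList;
import java.util.List;

/** Standalone sketch of end-padding; PadSketch is a hypothetical name, not a library class. */
final class PadSketch {

    /** Appends padItem until the list holds num elements. */
    static <E> List<E> pad(List<E> tokens, E padItem, int num) {
        if (tokens.size() >= num) {
            // Assumption: input that is already long enough is returned unchanged.
            return tokens;
        }
        List<E> padded = new ArrayList<>(num);
        padded.addAll(tokens);
        for (int i = tokens.size(); i < num; i++) {
            padded.add(padItem);
        }
        return padded;
    }

    public static void main(String[] args) {
        List<String> tokens = List.of("[CLS]", "hello", "[SEP]");
        // prints [[CLS], hello, [SEP], [PAD], [PAD]]
        System.out.println(pad(tokens, "[PAD]", 5));
    }
}
```

      Because the method is generic in E, the same call shape can pad a list of token strings or a parallel list of numeric ids.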
    • encode

      public BertToken encode(String question, String paragraph)
      Encodes a question and a paragraph.
      Parameters:
      question - the input question
      paragraph - the input paragraph
      Returns:
      a BertToken holding the encoded question and paragraph
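
      The sketch below shows the conventional BERT layout for a question and paragraph pair: [CLS] question [SEP] paragraph [SEP], with segment id 0 for the question part and 1 for the paragraph part. That this method produces exactly this layout is an assumption based on the usual BERT convention, not something this page states. The class and method names (EncodeLayoutSketch, joinPair, tokenTypes) are hypothetical, and the input is assumed to be already tokenized.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical illustration of a BERT pair layout; not the library's implementation. */
final class EncodeLayoutSketch {

    /** Builds [CLS] question [SEP] paragraph [SEP] from already tokenized input. */
    static List<String> joinPair(List<String> question, List<String> paragraph) {
        List<String> tokens = new ArrayList<>();
        tokens.add("[CLS]");
        tokens.addAll(question);
        tokens.add("[SEP]");
        tokens.addAll(paragraph);
        tokens.add("[SEP]");
        return tokens;
    }

    /** Segment ids: 0 for [CLS], the question and its [SEP]; 1 for the paragraph and its [SEP]. */
    static List<Long> tokenTypes(int questionSize, int paragraphSize) {
        List<Long> types = new ArrayList<>();
        for (int i = 0; i < questionSize + 2; i++) {
            types.add(0L);
        }
        for (int i = 0; i < paragraphSize + 1; i++) {
            types.add(1L);
        }
        return types;
    }

    public static void main(String[] args) {
        List<String> question = List.of("who", "wrote", "it", "?");
        List<String> paragraph = List.of("alice", "wrote", "it", ".");
        System.out.println(joinPair(question, paragraph));
        System.out.println(tokenTypes(question.size(), paragraph.size()));
    }
}
```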
    • encode

      public BertToken encode(String question, String paragraph, int maxLength)
      Encodes a question and a paragraph with a maximum length.
      Parameters:
      question - the input question
      paragraph - the input paragraph
      maxLength - the maximum length of the encoded sequence
      Returns:
      a BertToken holding the encoded question and paragraph
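
      One plausible reading of "with max length" is that the encoded sequence is padded up to maxLength and an attention mask marks which positions hold real tokens. The sketch below illustrates that idea only. The names MaxLengthSketch, padTokens and attentionMask are hypothetical, the "[PAD]" filler is the usual BERT convention rather than something this page confirms, and leaving longer input untruncated is an assumption.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Hypothetical sketch of padding an encoded sequence to a fixed length. */
final class MaxLengthSketch {

    /** Pads tokens with "[PAD]" up to maxLength; longer input is left as is (assumption). */
    static List<String> padTokens(List<String> tokens, int maxLength) {
        List<String> padded = new ArrayList<>(tokens);
        while (padded.size() < maxLength) {
            padded.add("[PAD]");
        }
        return padded;
    }

    /** Attention mask: 1 for each real token, 0 for each padding slot up to maxLength. */
    static List<Long> attentionMask(int validTokens, int maxLength) {
        List<Long> mask = new ArrayList<>(Collections.nCopies(validTokens, 1L));
        while (mask.size() < maxLength) {
            mask.add(0L);
        }
        return mask;
    }

    public static void main(String[] args) {
        List<String> tokens = List.of("[CLS]", "hi", "[SEP]", "there", "[SEP]");
        System.out.println(padTokens(tokens, 8));
        System.out.println(attentionMask(tokens.size(), 8));
    }
}
```

      Keeping the token list and the mask the same length lets a downstream model ignore the padded positions.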