Class BertFullTokenizer

    • Constructor Detail

      • BertFullTokenizer

        public BertFullTokenizer(Vocabulary vocabulary,
                                 boolean lowerCase)
        Creates an instance of BertFullTokenizer.
        Parameters:
        vocabulary - the BERT vocabulary
        lowerCase - whether to convert tokens to lowercase
    • Method Detail

      • tokenize

        public java.util.List<java.lang.String> tokenize(java.lang.String input)
        Breaks down the given sentence into a list of tokens that can be represented by embeddings.
        Specified by:
        tokenize in interface Tokenizer
        Overrides:
        tokenize in class SimpleTokenizer
        Parameters:
        input - the sentence to tokenize
        Returns:
        a List of tokens
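
The following is a minimal, self-contained sketch of the kind of sub-word splitting a BERT tokenizer typically performs: greedy longest-match-first WordPiece over a vocabulary. It is not the library's implementation. The class `WordPieceSketch`, the `Set<String>` vocabulary, and the `[UNK]` and `##` conventions are assumptions made for illustration, and punctuation splitting and Unicode normalization are left out.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;

/** Illustrative only: a simplified stand-in, not the BertFullTokenizer implementation. */
final class WordPieceSketch {

    private WordPieceSketch() {}

    /** Splits on whitespace, optionally lowercases, then applies greedy WordPiece matching. */
    static List<String> tokenize(String input, Set<String> vocab, boolean lowerCase) {
        List<String> tokens = new ArrayList<>();
        String text = lowerCase ? input.toLowerCase(Locale.ROOT) : input;
        for (String word : text.trim().split("\\s+")) {
            if (!word.isEmpty()) {
                tokens.addAll(wordPiece(word, vocab));
            }
        }
        return tokens;
    }

    /** Greedy longest-match-first split of one word; non-initial pieces carry a "##" prefix. */
    static List<String> wordPiece(String word, Set<String> vocab) {
        List<String> pieces = new ArrayList<>();
        int start = 0;
        while (start < word.length()) {
            int end = word.length();
            String match = null;
            // Shrink the candidate from the right until it is found in the vocabulary.
            while (start < end) {
                String candidate = (start > 0 ? "##" : "") + word.substring(start, end);
                if (vocab.contains(candidate)) {
                    match = candidate;
                    break;
                }
                end--;
            }
            if (match == null) {
                // No piece matches at this position: the whole word maps to the unknown token.
                return List.of("[UNK]");
            }
            pieces.add(match);
            start = end;
        }
        return pieces;
    }
}
```

With a vocabulary containing `hello`, `un`, `##aff`, and `##able`, calling `WordPieceSketch.tokenize("Hello unaffable", vocab, true)` in this sketch yields `hello`, `un`, `##aff`, `##able`.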
      • buildSentence

        public java.lang.String buildSentence(java.util.List<java.lang.String> tokens)
        Combines a list of tokens to form a sentence.
        Specified by:
        buildSentence in interface Tokenizer
        Overrides:
        buildSentence in class SimpleTokenizer
        Parameters:
        tokens - the List of tokens
        Returns:
        the sentence built from the given tokens
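
As a rough illustration of how sub-word tokens can be combined back into a sentence, the sketch below joins tokens with single spaces and glues `##`-prefixed continuation pieces onto the preceding token. The class `BuildSentenceSketch` is hypothetical, and the exact joining rules of the real method are assumed here, not taken from the library source.

```java
import java.util.List;

/** Illustrative only: one plausible way to rebuild text from WordPiece-style tokens. */
final class BuildSentenceSketch {

    private BuildSentenceSketch() {}

    static String buildSentence(List<String> tokens) {
        StringBuilder sb = new StringBuilder();
        for (String token : tokens) {
            if (token.startsWith("##")) {
                // Continuation piece: append directly to the previous token, dropping the prefix.
                sb.append(token.substring(2));
            } else {
                // Word-initial token: separate from the previous word with one space.
                if (sb.length() > 0) {
                    sb.append(' ');
                }
                sb.append(token);
            }
        }
        return sb.toString();
    }
}
```

Note that any casing or punctuation spacing lost during tokenization is not restored by this sketch, so the rebuilt sentence may differ from the original input.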
      • getPreprocessors

        public static java.util.List<TextProcessor> getPreprocessors(boolean lowerCase)
        Gets the list of TextProcessors used to preprocess input text for BERT models.
        Parameters:
        lowerCase - whether to convert input to lowercase
        Returns:
        a List of TextProcessors
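
To show the general idea of an ordered preprocessing pipeline whose contents depend on a `lowerCase` flag, here is a small sketch that models each processor as a `UnaryOperator<List<String>>`. The class `PreprocessorPipelineSketch`, its `apply` helper, and the two steps shown (whitespace splitting and lowercasing) are assumptions for illustration; the processors actually returned by `getPreprocessors` may differ.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.function.UnaryOperator;
import java.util.stream.Collectors;

/** Illustrative only: a stand-in for a list of text processors applied in order. */
final class PreprocessorPipelineSketch {

    private PreprocessorPipelineSketch() {}

    /** Builds the pipeline; the lowercasing step is included only when requested. */
    static List<UnaryOperator<List<String>>> getPreprocessors(boolean lowerCase) {
        List<UnaryOperator<List<String>>> processors = new ArrayList<>();
        // Step 1: split every entry on runs of whitespace and drop empty fragments.
        processors.add(tokens -> tokens.stream()
                .flatMap(t -> Arrays.stream(t.trim().split("\\s+")))
                .filter(t -> !t.isEmpty())
                .collect(Collectors.toList()));
        if (lowerCase) {
            // Step 2 (optional): lowercase each token in a locale-independent way.
            processors.add(tokens -> tokens.stream()
                    .map(t -> t.toLowerCase(Locale.ROOT))
                    .collect(Collectors.toList()));
        }
        return processors;
    }

    /** Feeds the input through each processor in list order. */
    static List<String> apply(String input, List<UnaryOperator<List<String>>> processors) {
        List<String> tokens = List.of(input);
        for (UnaryOperator<List<String>> processor : processors) {
            tokens = processor.apply(tokens);
        }
        return tokens;
    }
}
```

Keeping the processors in a list makes the order explicit and lets a caller reuse the same preprocessing outside the tokenizer, which is presumably why the method is exposed as a static helper.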