Class BertTokenizer

    • Constructor Summary

      Constructors 
      Constructor Description
      BertTokenizer()  
    • Method Summary

      Modifier and Type Method Description
      BertToken encode(java.lang.String question, java.lang.String paragraph)
      Encodes a question and a paragraph.
      BertToken encode(java.lang.String question, java.lang.String paragraph, int maxLength)
      Encodes a question and a paragraph, subject to a maximum length.
      <E> java.util.List<E> pad(java.util.List<E> tokens, E padItem, int num)
      Pads the tokens to the required length.
      java.util.List<java.lang.String> tokenize(java.lang.String input)
      Breaks down the given sentence into a list of tokens that can be represented by embeddings.
      java.lang.String tokenToString(java.util.List<java.lang.String> tokens)
      Returns a string representation of the tokens.
      • Methods inherited from class java.lang.Object

        clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
    • Constructor Detail

      • BertTokenizer

        public BertTokenizer()
    • Method Detail

      • tokenize

        public java.util.List<java.lang.String> tokenize(java.lang.String input)
        Breaks down the given sentence into a list of tokens that can be represented by embeddings.
        Specified by:
        tokenize in interface Tokenizer
        Overrides:
        tokenize in class SimpleTokenizer
        Parameters:
        input - the sentence to tokenize
        Returns:
        a List of tokens
      • tokenToString

        public java.lang.String tokenToString(java.util.List<java.lang.String> tokens)
        Returns a string representation of the tokens.
        Parameters:
        tokens - a list of tokens
        Returns:
        a string representation of the tokens
      • pad

        public <E> java.util.List<E> pad(java.util.List<E> tokens,
                                         E padItem,
                                         int num)
        Pads the tokens to the required length.
        Type Parameters:
        E - the element type of the list
        Parameters:
        tokens - the input tokens
        padItem - the item appended to the end of the list as padding
        num - the total length after padding
        Returns:
        a list of padded tokens
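
        The padding contract described above can be illustrated with a minimal, standalone sketch. PadSketch is a hypothetical class written for illustration and is not the library's implementation. In particular, returning a list that is already long enough unchanged (rather than truncating it) is an assumption that this page does not confirm.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class PadSketch {

    // Hypothetical helper that mirrors the documented signature pad(tokens, padItem, num).
    static <E> List<E> pad(List<E> tokens, E padItem, int num) {
        // Assumption: a list that already has num or more items is returned as-is.
        if (tokens.size() >= num) {
            return tokens;
        }
        // Copy the input so the caller's list is not modified.
        List<E> padded = new ArrayList<>(num);
        padded.addAll(tokens);
        // Append padItem until the list reaches the requested total length.
        while (padded.size() < num) {
            padded.add(padItem);
        }
        return padded;
    }

    public static void main(String[] args) {
        List<String> tokens = Arrays.asList("hello", "world");
        // prints [hello, world, [PAD], [PAD]]
        System.out.println(pad(tokens, "[PAD]", 4));
    }
}
```

        Because the method is generic in E, the same call shape would work for string tokens and for numeric lists such as masks.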
      • encode

        public BertToken encode(java.lang.String question,
                                java.lang.String paragraph)
        Encodes a question and a paragraph.
        Parameters:
        question - the input question
        paragraph - the input paragraph
        Returns:
        a BertToken holding the encoded question and paragraph
      • encode

        public BertToken encode(java.lang.String question,
                                java.lang.String paragraph,
                                int maxLength)
        Encodes a question and a paragraph, subject to a maximum length.
        Parameters:
        question - the input question
        paragraph - the input paragraph
        maxLength - the maximum length of the encoded sequence
        Returns:
        a BertToken holding the encoded question and paragraph
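
        This page does not describe what a BertToken contains, so the following is only a sketch of the layout conventionally used for BERT question answering inputs: [CLS] question [SEP] paragraph [SEP], with token types, an attention mask, and padding up to maxLength. Everything here is an assumption made for illustration: EncodeSketch, Encoded, and simpleTokenize are hypothetical names, the whitespace tokenizer stands in for the real tokenize method, and the actual class may behave differently.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class EncodeSketch {

    /** Hypothetical stand-in for BertToken; the real class may expose different fields. */
    static final class Encoded {
        final List<String> tokens = new ArrayList<>();
        final List<Long> tokenTypes = new ArrayList<>();
        final List<Long> attentionMask = new ArrayList<>();
        int validLength;
    }

    // Stand-in tokenizer: lowercases and splits on whitespace (an assumption, not the real tokenize).
    static List<String> simpleTokenize(String text) {
        String trimmed = text.trim().toLowerCase();
        return trimmed.isEmpty() ? new ArrayList<>() : Arrays.asList(trimmed.split("\\s+"));
    }

    static Encoded encode(String question, String paragraph, int maxLength) {
        Encoded out = new Encoded();
        // Question segment: token type 0.
        add(out, "[CLS]", 0L);
        for (String token : simpleTokenize(question)) {
            add(out, token, 0L);
        }
        add(out, "[SEP]", 0L);
        // Paragraph segment: token type 1.
        for (String token : simpleTokenize(paragraph)) {
            add(out, token, 1L);
        }
        add(out, "[SEP]", 1L);
        // Number of real (non-padding) tokens.
        out.validLength = out.tokens.size();
        // Assumption: shorter inputs are padded to maxLength; longer inputs are left untruncated.
        while (out.tokens.size() < maxLength) {
            out.tokens.add("[PAD]");
            out.tokenTypes.add(0L);
            out.attentionMask.add(0L);
        }
        return out;
    }

    // Appends one real token with its segment type and an attention mask value of 1.
    private static void add(Encoded out, String token, long type) {
        out.tokens.add(token);
        out.tokenTypes.add(type);
        out.attentionMask.add(1L);
    }

    public static void main(String[] args) {
        Encoded encoded = encode("Who wrote it", "Ada wrote it", 10);
        System.out.println(encoded.tokens);
        System.out.println(encoded.tokenTypes);
        System.out.println(encoded.attentionMask);
        System.out.println(encoded.validLength);
    }
}
```

        The three lists stay the same length, which is what lets a model consume them as parallel inputs; validLength records how many positions hold real tokens rather than padding.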