public class BertTokenizer extends SimpleTokenizer
Constructor and Description |
---|
BertTokenizer() |
Modifier and Type | Method and Description |
---|---|
BertToken | encode(java.lang.String question, java.lang.String paragraph) Encodes questions and paragraph sentences. |
BertToken | encode(java.lang.String question, java.lang.String paragraph, int maxLength) Encodes questions and paragraph sentences with max length. |
<E> java.util.List<E> | pad(java.util.List<E> tokens, E padItem, int num) Pads the tokens to the required length. |
java.util.List<java.lang.String> | tokenize(java.lang.String input) Breaks down the given sentence into a list of tokens that can be represented by embeddings. |
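Methods inherited from class SimpleTokenizer: buildSentence
Methods inherited from class java.lang.Object: clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
Methods inherited from an implemented interface: preprocess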
tokenize
public java.util.List<java.lang.String> tokenize(java.lang.String input)
Specified by: tokenize in interface Tokenizer
Overrides: tokenize in class SimpleTokenizer
Parameters:
input - the sentence to tokenize
Returns:
List of tokens

pad
public <E> java.util.List<E> pad(java.util.List<E> tokens, E padItem, int num)
Type Parameters:
E - the type of the List
Parameters:
tokens - the input tokens
padItem - the things to pad at the end
num - the total length after padding

encode
public BertToken encode(java.lang.String question, java.lang.String paragraph)
Parameters:
question - the input question
paragraph - the input paragraph

encode
public BertToken encode(java.lang.String question, java.lang.String paragraph, int maxLength)
Parameters:
question - the input question
paragraph - the input paragraph
maxLength - the maxLength
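
The pad contract above (append padItem at the end until the list holds num elements) and the general idea of encoding a question/paragraph pair to a fixed length can be illustrated with a small standalone sketch. It does not call this class and is not its implementation. The class name BertInputSketch, the helpers buildPair and tokenTypes, and the special tokens [CLS], [SEP], and [PAD] are assumptions taken from the common BERT input convention; the behavior of pad when the list is already longer than num is also an assumption, since the page does not specify it.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Illustrative sketch only; all names here are hypothetical, not the library's API. */
public class BertInputSketch {

    // Assumed special tokens, following the usual BERT convention.
    static final String CLS = "[CLS]";
    static final String SEP = "[SEP]";
    static final String PAD = "[PAD]";

    /** Appends padItem at the end until the list holds num elements. */
    static <E> List<E> pad(List<E> tokens, E padItem, int num) {
        List<E> padded = new ArrayList<>(tokens);
        // Assumption: a list that is already long enough is returned unchanged.
        while (padded.size() < num) {
            padded.add(padItem);
        }
        return padded;
    }

    /** Builds [CLS] question [SEP] paragraph [SEP] from already-split tokens. */
    static List<String> buildPair(List<String> question, List<String> paragraph) {
        List<String> out = new ArrayList<>();
        out.add(CLS);
        out.addAll(question);
        out.add(SEP);
        out.addAll(paragraph);
        out.add(SEP);
        return out;
    }

    /** Segment ids: 0 for [CLS], question, first [SEP]; 1 for paragraph and last [SEP]. */
    static List<Long> tokenTypes(int questionLen, int paragraphLen) {
        List<Long> types = new ArrayList<>();
        for (int i = 0; i < questionLen + 2; i++) {
            types.add(0L);
        }
        for (int i = 0; i < paragraphLen + 1; i++) {
            types.add(1L);
        }
        return types;
    }

    public static void main(String[] args) {
        List<String> question = Arrays.asList("who", "wrote", "it");
        List<String> paragraph = Arrays.asList("alice", "wrote", "it");
        int maxLength = 10;

        // Pad both sequences to the same fixed length.
        List<String> tokens = pad(buildPair(question, paragraph), PAD, maxLength);
        List<Long> types = pad(tokenTypes(question.size(), paragraph.size()), 0L, maxLength);

        System.out.println(tokens);
        System.out.println(types);
    }
}
```

In this sketch the question and paragraph are split ahead of time; in real use the tokens would come from tokenize, and the actual encode methods may truncate, pad, or lay out the pair differently.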