Class UTFTwitterTokeniser
java.lang.Object
org.terrier.indexing.tokenisation.Tokeniser
org.terrier.indexing.tokenisation.UTFTwitterTokeniser
- All Implemented Interfaces:
java.io.Serializable
public class UTFTwitterTokeniser extends Tokeniser
A tokeniser designed for use on tweets. It preserves UTF-8 encoded text
and keeps mentions intact.
- Since:
- 4.0
- Author:
- Richard McCreadie
- See Also:
- Serialized Form
Field Summary
Fields
Modifier and Type         Field                             Description
protected static boolean  DROP_LONG_TOKENS                  Whether tokens longer than MAX_TERM_LENGTH should be dropped.
protected static int      maxNumOfDigitsPerTerm             The maximum number of digits that are allowed in valid terms.
protected static int      maxNumOfSameConseqLettersPerTerm  The maximum number of consecutive same letters or digits that are allowed in valid terms.
Constructor Summary
Constructors
Constructor            Description
UTFTwitterTokeniser()
Method Summary
Modifier and Type  Method                           Description
TokenStream        tokenise(java.io.Reader reader)  Tokenises the text obtained from the specified reader.
Methods inherited from class org.terrier.indexing.tokenisation.Tokeniser
getTokeniser, getTokens, getTokens
Field Details
maxNumOfDigitsPerTerm
protected static final int maxNumOfDigitsPerTerm
The maximum number of digits that are allowed in valid terms.
- See Also:
- Constant Field Values
maxNumOfSameConseqLettersPerTerm
protected static final int maxNumOfSameConseqLettersPerTerm
The maximum number of consecutive same letters or digits that are allowed in valid terms.
- See Also:
- Constant Field Values
DROP_LONG_TOKENS
protected static final boolean DROP_LONG_TOKENS
Whether tokens longer than MAX_TERM_LENGTH should be dropped.
- See Also:
- Constant Field Values
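The three fields above describe limits applied to candidate terms. The sketch below shows how such limits could be combined into a single check. It is an illustration only, not Terrier's implementation: the class name TermFilterSketch, the threshold values (4 digits, 3 consecutive identical characters, 20 characters), the use of Character.isDigit, and the choice to truncate long terms when DROP_LONG_TOKENS is false are all assumptions. The real values are listed on the Constant Field Values page.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative sketch only; this is not the Terrier implementation. */
final class TermFilterSketch {
    // Assumed values, chosen for illustration.
    static final int MAX_DIGITS = 4;            // stands in for maxNumOfDigitsPerTerm
    static final int MAX_SAME_CONSECUTIVE = 3;  // stands in for maxNumOfSameConseqLettersPerTerm
    static final int MAX_TERM_LENGTH = 20;      // assumed length limit
    static final boolean DROP_LONG_TOKENS = true;

    /** Returns the term to keep, or null if the term is rejected. */
    static String check(String term) {
        if (term == null || term.isEmpty()) {
            return null;
        }
        if (term.length() > MAX_TERM_LENGTH) {
            if (DROP_LONG_TOKENS) {
                return null;
            }
            // Assumption: when long tokens are not dropped, they are truncated.
            term = term.substring(0, MAX_TERM_LENGTH);
        }
        int digits = 0;     // digits seen so far in the term
        int run = 0;        // length of the current run of identical characters
        char previous = 0;
        for (int i = 0; i < term.length(); i++) {
            char c = term.charAt(i);
            run = (i > 0 && c == previous) ? run + 1 : 1;
            if (run > MAX_SAME_CONSECUTIVE) {
                return null; // too many identical characters in a row
            }
            if (Character.isDigit(c) && ++digits > MAX_DIGITS) {
                return null; // too many digits in total
            }
            previous = c;
        }
        return term;
    }

    /** Applies check to each candidate and keeps the accepted terms. */
    static List<String> filter(String... candidates) {
        List<String> kept = new ArrayList<>();
        for (String candidate : candidates) {
            String checked = check(candidate);
            if (checked != null) {
                kept.add(checked);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // prints [hello, 2024, @user123]
        System.out.println(filter("hello", "coooool", "2024", "12345", "@user123"));
    }
}
```

Under these assumed thresholds, "coooool" is rejected for its run of five identical letters and "12345" for containing five digits, while a mention such as "@user123" passes unchanged.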
Constructor Details
UTFTwitterTokeniser
public UTFTwitterTokeniser()
Method Details