Class UTFTwitterTokeniser

java.lang.Object
org.terrier.indexing.tokenisation.Tokeniser
org.terrier.indexing.tokenisation.UTFTwitterTokeniser
All Implemented Interfaces:
java.io.Serializable

public class UTFTwitterTokeniser
extends Tokeniser
A tokeniser designed for use on tweets. It preserves UTF-8 encoded text and keeps mentions.
Since:
4.0
Author:
Richard McCreadie
See Also:
Serialized Form
  • Field Summary

    Fields 
    Modifier and Type Field Description
    protected static boolean DROP_LONG_TOKENS
    Whether tokens longer than MAX_TERM_LENGTH should be dropped.
    protected static int maxNumOfDigitsPerTerm
    The maximum number of digits that are allowed in valid terms.
    protected static int maxNumOfSameConseqLettersPerTerm
    The maximum number of consecutive identical letters or digits that are allowed in valid terms.
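
The two numeric limits above can be read as a per-term validity check. The following is a minimal standalone sketch of that idea, not the Terrier implementation: the class name TermCheckSketch, the method isValidTerm, and the threshold values 4 and 3 are all hypothetical, since this page does not state the defaults. DROP_LONG_TOKENS is not modelled here.

```java
public class TermCheckSketch {
    // Hypothetical thresholds chosen for illustration only;
    // the actual defaults are not given on this page.
    static final int MAX_DIGITS = 4;
    static final int MAX_SAME_CONSECUTIVE = 3;

    /**
     * Returns true if the term has at most maxDigits digits in total and
     * no run of identical characters longer than maxSameConsecutive.
     */
    static boolean isValidTerm(String term, int maxDigits, int maxSameConsecutive) {
        int digits = 0;     // digits seen so far
        int run = 0;        // length of the current run of identical code points
        int previous = -1;  // previous code point, -1 before the first one
        for (int i = 0; i < term.length(); ) {
            int cp = term.codePointAt(i);
            i += Character.charCount(cp);
            if (Character.isDigit(cp)) {
                digits++;
            }
            run = (cp == previous) ? run + 1 : 1;
            previous = cp;
            if (digits > maxDigits || run > maxSameConsecutive) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(isValidTerm("hello", MAX_DIGITS, MAX_SAME_CONSECUTIVE));  // true
        System.out.println(isValidTerm("123456", MAX_DIGITS, MAX_SAME_CONSECUTIVE)); // false: six digits
        System.out.println(isValidTerm("sooooo", MAX_DIGITS, MAX_SAME_CONSECUTIVE)); // false: five consecutive 'o'
    }
}
```

Limits of this kind are a common way to discard noisy tokens such as long numeric identifiers or elongated words, both of which are frequent in tweets.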

    Fields inherited from class org.terrier.indexing.tokenisation.Tokeniser

    EMPTY_STREAM
  • Constructor Summary

    Constructors 
    Constructor Description
    UTFTwitterTokeniser()  
  • Method Summary

    Modifier and Type Method Description
    TokenStream tokenise(java.io.Reader reader)
    Tokenises the text obtained from the specified reader.
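
To illustrate what reading tokens from a java.io.Reader while keeping mentions might look like, here is a minimal standalone sketch. It does not use or reproduce the Terrier classes: the class TweetTokeniseSketch and its methods are hypothetical, it returns a List<String> rather than a TokenStream, and the rule that a token may start with '@' is an assumption about how mentions are kept.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

public class TweetTokeniseSketch {

    /**
     * Splits the reader's text on anything that is not a letter or digit,
     * except that a token may begin with '@' (assumed mention rule).
     * Only characters in the Basic Multilingual Plane are handled.
     */
    static List<String> tokenise(Reader reader) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        try {
            int ch;
            while ((ch = reader.read()) != -1) {
                char c = (char) ch;
                boolean startsMention = c == '@' && current.length() == 0;
                if (Character.isLetterOrDigit(c) || startsMention) {
                    current.append(c);
                } else {
                    flush(current, tokens);
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        flush(current, tokens);
        return tokens;
    }

    /** Emits the pending token, skipping empty tokens and a lone '@'. */
    private static void flush(StringBuilder current, List<String> tokens) {
        boolean loneAt = current.length() == 1 && current.charAt(0) == '@';
        if (current.length() > 0 && !loneAt) {
            tokens.add(current.toString());
        }
        current.setLength(0);
    }

    public static void main(String[] args) {
        // \u00e9 is the non-ASCII letter e-acute, kept because it is a letter.
        List<String> tokens = tokenise(new StringReader("Thanks @Alice, caf\u00e9 time!"));
        System.out.println(tokens); // [Thanks, @Alice, café, time]
    }
}
```

Because Character.isLetterOrDigit accepts non-ASCII letters, accented and non-Latin text survives tokenisation, while punctuation acts as a separator.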

    Methods inherited from class org.terrier.indexing.tokenisation.Tokeniser

    getTokeniser, getTokens, getTokens

    Methods inherited from class java.lang.Object

    clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait