gate.creole.tokeniser
Class SimpleTokeniser

java.lang.Object
  extended by gate.util.AbstractFeatureBearer
      extended by gate.creole.AbstractResource
          extended by gate.creole.AbstractProcessingResource
              extended by gate.creole.AbstractLanguageAnalyser
                  extended by gate.creole.tokeniser.SimpleTokeniser
All Implemented Interfaces:
ANNIEConstants, Executable, LanguageAnalyser, ProcessingResource, Resource, FeatureBearer, NameBearer, Serializable

public class SimpleTokeniser
extends AbstractLanguageAnalyser

Implementation of a Unicode rule-based tokeniser. The tokeniser gets its rules from a file, an InputStream or a Reader, which should be passed to one of the constructors. The implementation is based on a finite state machine that is built from the set of rules. A rule has two sides, the left hand side (LHS) and the right hand side (RHS), which are separated by the ">" character. The LHS represents a regular expression that will be matched against the input, while the RHS describes a Gate2 annotation in terms of annotation type and attribute-value pairs. The matching is done using Unicode enumerated types as defined by the Character class. At the time of writing this class the supported Unicode categories were:

The accepted operators for the LHS are "+", "*" and "|", with the usual interpretations of "1 to n occurrences", "0 to n occurrences" and "boolean OR". For instance, this is a valid LHS:
"UPPERCASE_LETTER" "LOWERCASE_LETTER"+
meaning an uppercase letter followed by one or more lowercase letters. The RHS describes an annotation that is to be created and inserted in the provided annotation set in case of a match. The new annotation will span the text that has been recognised. The RHS consists of the annotation type followed by pairs of attributes and associated values. E.g. for the LHS above a possible RHS can be:
Token;kind=upperInitial;
representing an annotation of type "Token" having one attribute named "kind" with the value "upperInitial".
The entire rule will be:
"UPPERCASE_LETTER" "LOWERCASE_LETTER"+ > Token;kind=upperInitial;

The tokeniser ignores all empty lines and all lines that start with # or //.
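
The rule format described above can be illustrated with a minimal sketch. The class RuleSketch and its methods are hypothetical and are not part of the GATE API. The sketch assumes that the first ">" on a line separates the LHS from the RHS, and that leading whitespace is trimmed before the # and // prefixes are checked; neither assumption is confirmed by this page.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of GATE: splits one rule line into LHS and RHS.
class RuleSketch {
    final String lhs;   // regular expression over Unicode category names
    final String type;  // annotation type, e.g. "Token"
    final Map<String, String> features = new LinkedHashMap<>();

    RuleSketch(String line) {
        int sep = line.indexOf('>');  // assumption: the first '>' is the separator
        if (sep < 0) {
            throw new IllegalArgumentException("missing '>' in rule: " + line);
        }
        lhs = line.substring(0, sep).trim();
        // RHS: annotation type followed by attribute=value pairs, all ';'-separated
        String[] parts = line.substring(sep + 1).trim().split(";");
        type = parts[0].trim();
        for (int i = 1; i < parts.length; i++) {
            String[] kv = parts[i].split("=", 2);
            if (kv.length == 2) {
                features.put(kv[0].trim(), kv[1].trim());
            }
        }
    }

    // Empty lines and lines starting with '#' or '//' carry no rule.
    // Assumption: surrounding whitespace is trimmed before the check.
    static boolean isIgnorable(String line) {
        String t = line.trim();
        return t.isEmpty() || t.startsWith("#") || t.startsWith("//");
    }

    // Parses every significant line of a rules file that was already read into memory.
    static List<RuleSketch> parseAll(List<String> lines) {
        List<RuleSketch> rules = new ArrayList<>();
        for (String line : lines) {
            if (!isIgnorable(line)) {
                rules.add(new RuleSketch(line));
            }
        }
        return rules;
    }
}
```

Applied to the example rule above, such a parser would yield the annotation type "Token" and a single feature "kind" with the value "upperInitial".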

See Also:
Serialized Form

Nested Class Summary
 
Nested classes/interfaces inherited from class gate.creole.AbstractProcessingResource
AbstractProcessingResource.InternalStatusListener, AbstractProcessingResource.IntervalProgressListener
 
Field Summary
protected  String annotationSetName
          The annotation set where the new annotations will be added.
protected static String defaultResourceName
           
protected  Set dfsmStates
          A set containing all the states of the deterministic machine.
protected  gate.creole.tokeniser.DFSMState dInitialState
          The initial state of the deterministic machine
protected  Set fsmStates
          A set containing all the states of the non-deterministic machine.
protected  gate.creole.tokeniser.FSMState initialState
          The initial state of the non-deterministic machine.
static int maxTypeId
          The maximum int value used internally as a type id.
protected  Map<Set,gate.creole.tokeniser.DFSMState> newStates
           
static String SIMP_TOK_ANNOT_SET_PARAMETER_NAME
           
static String SIMP_TOK_DOCUMENT_PARAMETER_NAME
           
static String SIMP_TOK_ENCODING_PARAMETER_NAME
           
static String SIMP_TOK_RULES_URL_PARAMETER_NAME
           
static Map<String,Integer> stringTypeIds
          Maps from type names to internal type ids.
static Map<Integer,Integer> typeIds
          Maps from int (the static value on Character) to int (the internal value used by the tokeniser).
static String[] typeMnemonics
          Maps the internal type ids to the type names.
 
Fields inherited from class gate.creole.AbstractLanguageAnalyser
corpus, document
 
Fields inherited from class gate.creole.AbstractProcessingResource
interrupted
 
Fields inherited from class gate.creole.AbstractResource
name
 
Fields inherited from class gate.util.AbstractFeatureBearer
features
 
Fields inherited from interface gate.creole.ANNIEConstants
ANNOTATION_COREF_FEATURE_NAME, DATE_ANNOTATION_TYPE, DATE_POSTED_ANNOTATION_TYPE, DEFAULT_FILE, DOCUMENT_COREF_FEATURE_NAME, JOB_ID_ANNOTATION_TYPE, LOCATION_ANNOTATION_TYPE, LOOKUP_ANNOTATION_TYPE, LOOKUP_CLASS_FEATURE_NAME, LOOKUP_INSTANCE_FEATURE_NAME, LOOKUP_LANGUAGE_FEATURE_NAME, LOOKUP_MAJOR_TYPE_FEATURE_NAME, LOOKUP_MINOR_TYPE_FEATURE_NAME, LOOKUP_ONTOLOGY_FEATURE_NAME, MONEY_ANNOTATION_TYPE, ORGANIZATION_ANNOTATION_TYPE, PERSON_ANNOTATION_TYPE, PERSON_GENDER_FEATURE_NAME, PLUGIN_DIR, PR_NAMES, SENTENCE_ANNOTATION_TYPE, SPACE_TOKEN_ANNOTATION_TYPE, TOKEN_ANNOTATION_TYPE, TOKEN_CATEGORY_FEATURE_NAME, TOKEN_KIND_FEATURE_NAME, TOKEN_LENGTH_FEATURE_NAME, TOKEN_ORTH_FEATURE_NAME, TOKEN_STRING_FEATURE_NAME
 
Constructor Summary
SimpleTokeniser()
          Creates a tokeniser
 
Method Summary
 void execute()
          The method that does the actual tokenisation.
 String getAnnotationSetName()
           
 String getDFSMgml()
          Returns a string representation of the deterministic FSM graph using GML.
 String getEncoding()
           
 String getFSMgml()
          Returns a string representation of the non-deterministic FSM graph using GML (Graph modelling language).
 String getRulesResourceName()
           
 URL getRulesURL()
          Gets the value of the rulesURL property, which holds a URL to the file containing the rules for this tokeniser.
 Resource init()
          Initialises this tokeniser by reading the rules from an external source (provided through a URL) and building the finite state machine at the core of the tokeniser.
 void reset()
          Prepares this Processing resource for a new run.
 void setAnnotationSetName(String newAnnotationSetName)
           
 void setEncoding(String newEncoding)
           
 void setRulesResourceName(String newRulesResourceName)
           
 void setRulesURL(URL newRulesURL)
          Sets the value of the rulesURL property, which holds a URL to the file containing the rules for this tokeniser.
protected static String skipIgnoreTokens(StringTokenizer st)
          Skips the ignorable tokens from the input returning the first significant token.
 
Methods inherited from class gate.creole.AbstractLanguageAnalyser
getCorpus, getDocument, setCorpus, setDocument
 
Methods inherited from class gate.creole.AbstractProcessingResource
addProgressListener, addStatusListener, cleanup, fireProcessFinished, fireProgressChanged, fireStatusChanged, getRuntimeParameterValues, getRuntimeParameterValues, interrupt, isInterrupted, reInit, removeProgressListener, removeStatusListener
 
Methods inherited from class gate.creole.AbstractResource
checkParameterValues, getBeanInfo, getInitParameterValues, getInitParameterValues, getName, getParameterValue, getParameterValue, getParameterValues, removeResourceListeners, setName, setParameterValue, setParameterValue, setParameterValues, setParameterValues, setResourceListeners
 
Methods inherited from class gate.util.AbstractFeatureBearer
getFeatures, setFeatures
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 
Methods inherited from interface gate.ProcessingResource
reInit
 
Methods inherited from interface gate.Resource
cleanup, getParameterValue, setParameterValue, setParameterValues
 
Methods inherited from interface gate.util.FeatureBearer
getFeatures, setFeatures
 
Methods inherited from interface gate.util.NameBearer
getName, setName
 
Methods inherited from interface gate.Executable
interrupt, isInterrupted
 

Field Detail

SIMP_TOK_DOCUMENT_PARAMETER_NAME

public static final String SIMP_TOK_DOCUMENT_PARAMETER_NAME
See Also:
Constant Field Values

SIMP_TOK_ANNOT_SET_PARAMETER_NAME

public static final String SIMP_TOK_ANNOT_SET_PARAMETER_NAME
See Also:
Constant Field Values

SIMP_TOK_RULES_URL_PARAMETER_NAME

public static final String SIMP_TOK_RULES_URL_PARAMETER_NAME
See Also:
Constant Field Values

SIMP_TOK_ENCODING_PARAMETER_NAME

public static final String SIMP_TOK_ENCODING_PARAMETER_NAME
See Also:
Constant Field Values

annotationSetName

protected String annotationSetName
The annotation set where the new annotations will be added.


initialState

protected gate.creole.tokeniser.FSMState initialState
The initial state of the non-deterministic machine.


fsmStates

protected Set fsmStates
A set containing all the states of the non-deterministic machine.


dInitialState

protected gate.creole.tokeniser.DFSMState dInitialState
The initial state of the deterministic machine


dfsmStates

protected Set dfsmStates
A set containing all the states of the deterministic machine.


typeIds

public static final Map<Integer,Integer> typeIds
Maps from int (the static value on Character) to int (the internal value used by the tokeniser). The int values used by the tokeniser are consecutive values, starting from 0 and going as high as necessary. They map all the public static int members on Character.
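
As an illustration of this kind of mapping, the following minimal sketch assigns consecutive internal ids to a handful of Character category constants. The class TypeIdSketch, the choice of four categories and the -1 fallback are assumptions made for the example; the real field covers all the public static int members on Character.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration, not the GATE implementation.
class TypeIdSketch {
    // Category constants from java.lang.Character, in the order their ids are assigned.
    static final int[] CATEGORIES = {
        Character.UPPERCASE_LETTER,
        Character.LOWERCASE_LETTER,
        Character.DECIMAL_DIGIT_NUMBER,
        Character.SPACE_SEPARATOR
    };

    // Character category value -> consecutive internal id starting from 0.
    static final Map<Integer, Integer> typeIds = new HashMap<>();

    static {
        for (int i = 0; i < CATEGORIES.length; i++) {
            typeIds.put(CATEGORIES[i], i);
        }
    }

    // Returns the internal id for the category of c, or -1 if it is not mapped here.
    static int internalId(char c) {
        Integer id = typeIds.get(Character.getType(c));
        return id == null ? -1 : id;
    }
}
```

Dense ids of this kind can be used directly as array indices, for example into a per-state transition table.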


maxTypeId

public static int maxTypeId
The maximum int value used internally as a type id.


typeMnemonics

public static String[] typeMnemonics
Maps the internal type ids to the type names.


stringTypeIds

public static final Map<String,Integer> stringTypeIds
Maps from type names to internal type ids.


defaultResourceName

protected static String defaultResourceName

newStates

protected transient Map<Set,gate.creole.tokeniser.DFSMState> newStates

Constructor Detail

SimpleTokeniser

public SimpleTokeniser()
Creates a tokeniser

Method Detail

init

public Resource init()
              throws ResourceInstantiationException
Initialises this tokeniser by reading the rules from an external source (provided through a URL) and building the finite state machine at the core of the tokeniser.

Specified by:
init in interface Resource
Overrides:
init in class AbstractProcessingResource
Throws:
ResourceInstantiationException

reset

public void reset()
Prepares this Processing resource for a new run.


skipIgnoreTokens

protected static String skipIgnoreTokens(StringTokenizer st)
Skips the ignorable tokens from the input returning the first significant token. The ignorable tokens are defined by a set
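
The general idea can be sketched as follows. The class SkipSketch and the contents of its IGNORE set (space, tab and form feed) are assumptions made for the example, as is returning null when the input is exhausted; the actual set used by this class is not given on this page.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;
import java.util.StringTokenizer;

// Hypothetical illustration, not the GATE implementation.
class SkipSketch {
    // Assumed set of ignorable tokens; the real set is not documented here.
    static final Set<String> IGNORE = new HashSet<>(Arrays.asList(" ", "\t", "\f"));

    // Consumes tokens until one that is not ignorable is found.
    // Returns that token, or null when the tokenizer runs out of input.
    static String skipIgnorable(StringTokenizer st) {
        while (st.hasMoreTokens()) {
            String token = st.nextToken();
            if (!IGNORE.contains(token)) {
                return token;
            }
        }
        return null;
    }
}
```

This only has an effect when the StringTokenizer was created with returnDelims set to true, so that the delimiters themselves are returned as tokens.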


getFSMgml

public String getFSMgml()
Returns a string representation of the non-deterministic FSM graph using GML (Graph modelling language).


getDFSMgml

public String getDFSMgml()
Returns a string representation of the deterministic FSM graph using GML.
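
For readers unfamiliar with GML, the following minimal sketch writes a small directed graph in that format. The class GmlSketch, its parameters and the exact layout of the output are assumptions made for the example; the strings actually returned by getFSMgml() and getDFSMgml() may be laid out differently.

```java
// Hypothetical illustration, not the GATE implementation.
class GmlSketch {
    // nodeLabels[i] is the label of node i; edges[k] = {source, target};
    // edgeLabels[k] is the label of edge k.
    static String toGml(String[] nodeLabels, int[][] edges, String[] edgeLabels) {
        StringBuilder sb = new StringBuilder("graph [\n  directed 1\n");
        for (int i = 0; i < nodeLabels.length; i++) {
            sb.append("  node [ id ").append(i)
              .append(" label \"").append(nodeLabels[i]).append("\" ]\n");
        }
        for (int k = 0; k < edges.length; k++) {
            sb.append("  edge [ source ").append(edges[k][0])
              .append(" target ").append(edges[k][1])
              .append(" label \"").append(edgeLabels[k]).append("\" ]\n");
        }
        return sb.append("]\n").toString();
    }
}
```

In a graph like this, states would become nodes and transitions would become edges labelled with the Unicode category that triggers them.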


execute

public void execute()
             throws ExecutionException
The method that does the actual tokenisation.

Specified by:
execute in interface Executable
Overrides:
execute in class AbstractProcessingResource
Throws:
ExecutionException
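
To make the matching step concrete, the following minimal sketch hand-codes a deterministic machine for the single LHS from the class description, an uppercase letter followed by one or more lowercase letters. The class UpperInitialMatcher is hypothetical; the real tokeniser builds its machine from the rule set rather than hard-coding states, and the longest-match policy shown here is an assumption.

```java
// Hypothetical illustration, not the GATE implementation.
class UpperInitialMatcher {
    // Returns the end offset (exclusive) of the longest match starting at 'start',
    // or -1 if the text does not match at that position.
    static int match(CharSequence text, int start) {
        int state = 0;        // 0 = nothing read, 1 = uppercase read, 2 = accepting
        int lastAccept = -1;  // end offset of the longest match seen so far
        for (int i = start; i < text.length(); i++) {
            int type = Character.getType(text.charAt(i));
            if (state == 0 && type == Character.UPPERCASE_LETTER) {
                state = 1;
            } else if (state >= 1 && type == Character.LOWERCASE_LETTER) {
                state = 2;
                lastAccept = i + 1;
            } else {
                break;        // no transition for this category: stop scanning
            }
        }
        return lastAccept;
    }
}
```

For the text "Hello world" and start offset 0 this returns 5, which is the span a rule such as Token;kind=upperInitial; would annotate.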

setRulesURL

public void setRulesURL(URL newRulesURL)
Sets the value of the rulesURL property, which holds a URL to the file containing the rules for this tokeniser.

Parameters:
newRulesURL - the URL of the file containing the rules for this tokeniser

getRulesURL

public URL getRulesURL()
Gets the value of the rulesURL property, which holds a URL to the file containing the rules for this tokeniser.


setAnnotationSetName

public void setAnnotationSetName(String newAnnotationSetName)

getAnnotationSetName

public String getAnnotationSetName()

setRulesResourceName

public void setRulesResourceName(String newRulesResourceName)

getRulesResourceName

public String getRulesResourceName()

setEncoding

public void setEncoding(String newEncoding)

getEncoding

public String getEncoding()