@CreoleResource(name="GATE Unicode Tokeniser", comment="A customisable Unicode tokeniser.", helpURL="http://gate.ac.uk/userguide/sec:annie:tokeniser", icon="tokeniser") public class SimpleTokeniser extends AbstractLanguageAnalyser
The tokeniser gets its rules from an InputStream or a Reader, which should be passed to one of the constructors.
The implementation is based on a finite state machine that is built from
the set of rules.
A rule has two sides, the left hand side (LHS) and the right hand side (RHS),
which are separated by the ">" character. The LHS represents a
regular expression that will be matched against the input, while the RHS
describes a Gate2 annotation in terms of annotation type and attribute-value
pairs.
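To make the rule layout concrete, the following minimal sketch splits a rule string at the ">" separator and reads the RHS as an annotation type followed by attribute=value pairs. The class RuleSketch, its ParsedRule type, and the parse method are hypothetical names introduced for illustration; they are not part of GATE and do not reflect how SimpleTokeniser itself parses rules. The sketch also assumes that the LHS contains no ">" character.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Illustrative only: splits a tokeniser rule into its LHS and RHS parts. */
public class RuleSketch {

    /** Holds the pieces of one parsed rule (hypothetical type, not part of GATE). */
    static final class ParsedRule {
        final String lhs;              // pattern over Unicode categories
        final String annotationType;   // e.g. "Token"
        final Map<String, String> features = new LinkedHashMap<>();

        ParsedRule(String lhs, String annotationType) {
            this.lhs = lhs;
            this.annotationType = annotationType;
        }
    }

    static ParsedRule parse(String rule) {
        // Assumption: the first '>' separates the LHS from the RHS.
        int sep = rule.indexOf('>');
        if (sep < 0) {
            throw new IllegalArgumentException("Missing '>' in rule: " + rule);
        }
        String lhs = rule.substring(0, sep).trim();
        String[] parts = rule.substring(sep + 1).trim().split(";");

        // The first RHS field is the annotation type; the rest are name=value pairs.
        ParsedRule parsed = new ParsedRule(lhs, parts[0].trim());
        for (int i = 1; i < parts.length; i++) {
            String pair = parts[i].trim();
            if (pair.isEmpty()) {
                continue;
            }
            int eq = pair.indexOf('=');
            if (eq < 0) {
                throw new IllegalArgumentException("Bad attribute pair: " + pair);
            }
            parsed.features.put(pair.substring(0, eq).trim(),
                                pair.substring(eq + 1).trim());
        }
        return parsed;
    }

    public static void main(String[] args) {
        ParsedRule r = parse(
            "\"UPPERCASE_LETTER\" \"LOWERCASE_LETTER\"+ > Token;kind=upperInitial;");
        System.out.println("LHS:      " + r.lhs);
        System.out.println("Type:     " + r.annotationType);
        System.out.println("Features: " + r.features);
    }
}
```

Keeping the attribute pairs in a LinkedHashMap preserves the order in which they appear in the rule, which makes the printed result easier to compare with the rule text.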
The matching is done using Unicode enumerated types as defined by the Character
class. At the time of writing this class, the supported Unicode categories
included UPPERCASE_LETTER and LOWERCASE_LETTER, as used in this example rule:
"UPPERCASE_LETTER" "LOWERCASE_LETTER"+ > Token;kind=upperInitial;
AbstractProcessingResource.InternalStatusListener, AbstractProcessingResource.IntervalProgressListener
Modifier and Type | Field and Description |
---|---|
protected String | annotationSetName: The annotation set where the new annotations will be added. |
protected static String | defaultResourceName |
protected Set<gate.creole.tokeniser.DFSMState> | dfsmStates: A set containing all the states of the deterministic machine. |
protected gate.creole.tokeniser.DFSMState | dInitialState: The initial state of the deterministic machine. |
protected Set<gate.creole.tokeniser.FSMState> | fsmStates: A set containing all the states of the non-deterministic machine. |
protected gate.creole.tokeniser.FSMState | initialState: The initial state of the non-deterministic machine. |
static int | maxTypeId: The maximum int value used internally as a type id. |
protected Map<Set<gate.creole.tokeniser.FSMState>,gate.creole.tokeniser.DFSMState> | newStates |
static String | SIMP_TOK_ANNOT_SET_PARAMETER_NAME |
static String | SIMP_TOK_DOCUMENT_PARAMETER_NAME |
static String | SIMP_TOK_ENCODING_PARAMETER_NAME |
static String | SIMP_TOK_RULES_URL_PARAMETER_NAME |
static Map<String,Integer> | stringTypeIds: Maps from type names to internal type ids. |
static Map<Integer,Integer> | typeIds: Maps from int (the static value on Character) to int (the internal value used by the tokeniser). |
static String[] | typeMnemonics: Maps the internal type ids to the type names. |
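The relationship between typeIds, stringTypeIds and typeMnemonics can be pictured with a small sketch. It assumes that the internal ids are simply consecutive integers assigned to a handful of Character category constants; the class TypeIdSketch, its fields, and the mnemonicOf helper are hypothetical and cover only four categories, so this is not the tokeniser's actual table.

```java
import java.util.HashMap;
import java.util.Map;

/** Illustrative only: dense internal ids for a few Unicode general categories. */
public class TypeIdSketch {

    // Assumed subset of categories; index in this array is the internal id.
    static final String[] MNEMONICS = {
        "UPPERCASE_LETTER", "LOWERCASE_LETTER", "DECIMAL_DIGIT_NUMBER", "SPACE_SEPARATOR"
    };

    // The matching java.lang.Character constants, in the same order.
    static final int[] CHARACTER_TYPES = {
        Character.UPPERCASE_LETTER, Character.LOWERCASE_LETTER,
        Character.DECIMAL_DIGIT_NUMBER, Character.SPACE_SEPARATOR
    };

    // Character constant -> internal id, and mnemonic -> internal id.
    static final Map<Integer, Integer> TYPE_IDS = new HashMap<>();
    static final Map<String, Integer> STRING_TYPE_IDS = new HashMap<>();

    static {
        for (int id = 0; id < MNEMONICS.length; id++) {
            TYPE_IDS.put(CHARACTER_TYPES[id], id);
            STRING_TYPE_IDS.put(MNEMONICS[id], id);
        }
    }

    /** Returns the mnemonic of a code point's category, or null if this sketch does not map it. */
    static String mnemonicOf(int codePoint) {
        Integer id = TYPE_IDS.get(Character.getType(codePoint));
        return id == null ? null : MNEMONICS[id];
    }

    public static void main(String[] args) {
        for (char c : new char[] {'A', 'a', '7', ' ', '!'}) {
            System.out.println("'" + c + "' -> " + mnemonicOf(c));
        }
    }
}
```

Dense ids of this kind let a state machine index its transitions by a small integer instead of by the raw Character constant.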
corpus, document
interrupted
name
features
ANNOTATION_COREF_FEATURE_NAME, DATE_ANNOTATION_TYPE, DATE_POSTED_ANNOTATION_TYPE, DEFAULT_FILE, DOCUMENT_COREF_FEATURE_NAME, JOB_ID_ANNOTATION_TYPE, LOCATION_ANNOTATION_TYPE, LOOKUP_ANNOTATION_TYPE, LOOKUP_CLASS_FEATURE_NAME, LOOKUP_INSTANCE_FEATURE_NAME, LOOKUP_LANGUAGE_FEATURE_NAME, LOOKUP_MAJOR_TYPE_FEATURE_NAME, LOOKUP_MINOR_TYPE_FEATURE_NAME, LOOKUP_ONTOLOGY_FEATURE_NAME, MONEY_ANNOTATION_TYPE, ORGANIZATION_ANNOTATION_TYPE, PERSON_ANNOTATION_TYPE, PERSON_GENDER_FEATURE_NAME, PLUGIN_DIR, SENTENCE_ANNOTATION_TYPE, SPACE_TOKEN_ANNOTATION_TYPE, TOKEN_ANNOTATION_TYPE, TOKEN_CATEGORY_FEATURE_NAME, TOKEN_KIND_FEATURE_NAME, TOKEN_LENGTH_FEATURE_NAME, TOKEN_ORTH_FEATURE_NAME, TOKEN_STRING_FEATURE_NAME
Constructor and Description |
---|
SimpleTokeniser(): Creates a tokeniser. |
Modifier and Type | Method and Description |
---|---|
void | execute(): The method that does the actual tokenisation. |
String | getAnnotationSetName() |
String | getDFSMgml(): Returns a string representation of the deterministic FSM graph using GML. |
String | getEncoding() |
String | getFSMgml(): Returns a string representation of the non-deterministic FSM graph using GML (Graph Modelling Language). |
String | getRulesResourceName() |
URL | getRulesURL(): Gets the value of the rulesURL property, which holds a URL to the file containing the rules for this tokeniser. |
Resource | init(): Initialises this tokeniser by reading the rules from an external source (provided through a URL) and building the finite state machine at the core of the tokeniser. |
void | reset(): Prepares this processing resource for a new run. |
void | setAnnotationSetName(String newAnnotationSetName) |
void | setEncoding(String newEncoding) |
void | setRulesResourceName(String newRulesResourceName) |
void | setRulesURL(URL newRulesURL): Sets the value of the rulesURL property, which holds a URL to the file containing the rules for this tokeniser. |
protected static String | skipIgnoreTokens(StringTokenizer st): Skips the ignorable tokens from the input, returning the first significant token. |
getCorpus, getDocument, setCorpus, setDocument
addProgressListener, addStatusListener, cleanup, fireProcessFinished, fireProgressChanged, fireStatusChanged, getRuntimeParameterValues, getRuntimeParameterValues, interrupt, isInterrupted, reInit, removeProgressListener, removeStatusListener
checkParameterValues, flushBeanInfoCache, forgetBeanInfo, getBeanInfo, getInitParameterValues, getInitParameterValues, getName, getParameterValue, getParameterValue, getParameterValues, removeResourceListeners, setName, setParameterValue, setParameterValue, setParameterValues, setParameterValues, setResourceListeners, toString
getFeatures, setFeatures
clone, equals, finalize, getClass, hashCode, notify, notifyAll, wait, wait, wait
reInit
cleanup, getParameterValue, setParameterValue, setParameterValues
getFeatures, setFeatures
getName, setName
interrupt, isInterrupted
public static final String SIMP_TOK_DOCUMENT_PARAMETER_NAME
public static final String SIMP_TOK_ANNOT_SET_PARAMETER_NAME
public static final String SIMP_TOK_RULES_URL_PARAMETER_NAME
public static final String SIMP_TOK_ENCODING_PARAMETER_NAME
protected String annotationSetName
protected gate.creole.tokeniser.FSMState initialState
protected Set<gate.creole.tokeniser.FSMState> fsmStates
protected gate.creole.tokeniser.DFSMState dInitialState
protected Set<gate.creole.tokeniser.DFSMState> dfsmStates
public static int maxTypeId
public static String[] typeMnemonics
public static final Map<String,Integer> stringTypeIds
protected static String defaultResourceName
public SimpleTokeniser()
public Resource init() throws ResourceInstantiationException
Specified by: init in interface Resource
Overrides: init in class AbstractProcessingResource
Throws: ResourceInstantiationException
public void reset()
protected static String skipIgnoreTokens(StringTokenizer st)
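Skips the ignorable tokens from the input, returning the first significant token. The ignorable tokens are defined by a set.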
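As an illustration of what skipIgnoreTokens does, here is a minimal sketch that reads tokens from a StringTokenizer until it finds one that is not in a set of ignorable tokens. The contents of the IGNORE set (space, tab, form feed) and the choice to return null at the end of input are assumptions made for this example, not confirmed details of the GATE implementation.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;
import java.util.StringTokenizer;

/** Illustrative only: skipping ignorable tokens while scanning a rule. */
public class SkipIgnoreSketch {

    // Assumed set of ignorable tokens; the real set may differ.
    static final Set<String> IGNORE = new HashSet<>(Arrays.asList(" ", "\t", "\f"));

    /** Returns the first token not in IGNORE, or null when the input is exhausted. */
    static String skipIgnoreTokens(StringTokenizer st) {
        while (st.hasMoreTokens()) {
            String token = st.nextToken();
            if (!IGNORE.contains(token)) {
                return token;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        // returnDelims=true so that delimiters such as '"' and '+' come back as tokens.
        StringTokenizer st =
            new StringTokenizer(" \t\"UPPERCASE_LETTER\" +", " \t\f\"+", true);
        String token;
        while ((token = skipIgnoreTokens(st)) != null) {
            System.out.println("[" + token + "]");
        }
    }
}
```

Passing true as the third StringTokenizer argument matters here: without it the quote and operator characters would be dropped as delimiters and could not be told apart from whitespace.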
public String getFSMgml()
public String getDFSMgml()
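For readers unfamiliar with GML, the sketch below builds a GML description of a small directed graph as a string. It only shows the general shape of the format (a graph block containing node and edge blocks); the class GmlSketch and its toGml method are hypothetical, and the exact keys and layout emitted by getFSMgml() and getDFSMgml() are not shown in this page and may differ.

```java
/** Illustrative only: renders a small labelled graph as a GML string. */
public class GmlSketch {

    /**
     * nodeLabels[i] is the label of node i; edges[k] = {source, target};
     * edgeLabels[k] is the label of edge k.
     */
    static String toGml(String[] nodeLabels, int[][] edges, String[] edgeLabels) {
        StringBuilder sb = new StringBuilder("graph [\n  directed 1\n");
        for (int i = 0; i < nodeLabels.length; i++) {
            sb.append("  node [ id ").append(i)
              .append(" label \"").append(nodeLabels[i]).append("\" ]\n");
        }
        for (int k = 0; k < edges.length; k++) {
            sb.append("  edge [ source ").append(edges[k][0])
              .append(" target ").append(edges[k][1])
              .append(" label \"").append(edgeLabels[k]).append("\" ]\n");
        }
        return sb.append("]\n").toString();
    }

    public static void main(String[] args) {
        // A three-state machine for: "UPPERCASE_LETTER" "LOWERCASE_LETTER"+
        String gml = toGml(
            new String[] {"start", "upper", "final"},
            new int[][] {{0, 1}, {1, 2}, {2, 2}},
            new String[] {"UPPERCASE_LETTER", "LOWERCASE_LETTER", "LOWERCASE_LETTER"});
        System.out.print(gml);
    }
}
```

The resulting text can be loaded into a graph viewer that reads GML, which is a convenient way to inspect a state machine visually.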
public void execute() throws ExecutionException
Specified by: execute in interface Executable
Overrides: execute in class AbstractProcessingResource
Throws: ExecutionException
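The idea of tokenising with a deterministic machine over Unicode categories can be sketched as follows. This example hand-codes a three-state machine for the single rule "UPPERCASE_LETTER" "LOWERCASE_LETTER"+ and reports the longest match at each position. The class DfaTokenSketch, the state numbering, and the longest-match strategy are assumptions made for illustration; the sketch does not build the machine from a rules file and is not the logic of execute() itself.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative only: longest-match scanning with a hand-built DFA. */
public class DfaTokenSketch {

    // States: 0 = start, 1 = one uppercase letter seen,
    //         2 = uppercase followed by at least one lowercase letter (final).
    static int next(int state, int category) {
        if (state == 0 && category == Character.UPPERCASE_LETTER) {
            return 1;
        }
        if ((state == 1 || state == 2) && category == Character.LOWERCASE_LETTER) {
            return 2;
        }
        return -1; // no transition: the machine is stuck
    }

    /** Returns the [start, end) offsets of each longest match, scanning left to right. */
    static List<int[]> match(String text) {
        List<int[]> spans = new ArrayList<>();
        int start = 0;
        while (start < text.length()) {
            int state = 0;
            int lastFinalEnd = -1;
            // Run the machine until it gets stuck, remembering the last final state.
            // Note: charAt works on UTF-16 units, so supplementary characters are not handled.
            for (int i = start; i < text.length() && state != -1; i++) {
                state = next(state, Character.getType(text.charAt(i)));
                if (state == 2) {
                    lastFinalEnd = i + 1;
                }
            }
            if (lastFinalEnd >= 0) {
                spans.add(new int[] {start, lastFinalEnd});
                start = lastFinalEnd;   // continue after the match
            } else {
                start++;                // no match here: skip one character
            }
        }
        return spans;
    }

    public static void main(String[] args) {
        String text = "Hello World x";
        for (int[] span : match(text)) {
            System.out.println(span[0] + "-" + span[1] + " "
                + text.substring(span[0], span[1]) + " kind=upperInitial");
        }
    }
}
```

In a full tokeniser each match would become an annotation spanning the recognised text, with the type and attributes taken from the RHS of the rule that matched.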
@CreoleParameter(defaultValue="resources/tokeniser/DefaultTokeniser.rules", comment="The URL to the rules file", suffixes="rules") public void setRulesURL(URL newRulesURL)
Sets the value of the rulesURL property, which holds a URL to the file containing the rules for this tokeniser.
Parameters: newRulesURL
public URL getRulesURL()
Gets the value of the rulesURL property, which holds a URL to the file containing the rules for this tokeniser.
@RunTime @Optional @CreoleParameter(comment="The annotation set to be used for the generated annotations") public void setAnnotationSetName(String newAnnotationSetName)
public String getAnnotationSetName()
public void setRulesResourceName(String newRulesResourceName)
public String getRulesResourceName()
@CreoleParameter(defaultValue="UTF-8", comment="The encoding used for reading the definitions") public void setEncoding(String newEncoding)
public String getEncoding()