Whether to ignore case in index lookups (Default depends on model)
Enables regex pattern match (Default: false).
input annotations columns currently used
Gets the name of the annotation column that will be generated
Input annotator types: DOCUMENT, TOKEN
Columns that contain the annotations necessary to run this annotator. If not specified, AnnotatorType is used for both input and output columns.
Output annotator types: CHUNK
Resource in JSON or CSV format to map entities to patterns (Default: null).
Overrides required annotators column if different than default
Overrides annotation column name when transforming
Path to the external resource.
Unique identifier for storage (Default: this.uid)
requirement for pipeline transformation validation. It is called on fit()
required uid for storing annotator to disk
Whether to use RocksDB storage to serialize patterns (Default: true).
takes a Dataset and checks to see if all the required annotation types are present.
The Dataset to be validated
True if all the required types are present, else false
A list of (hyper-)parameter keys this annotator can take. Users can set and get the parameter values through setters and getters, respectively.
Required input and expected output annotator types
Fits an Annotator to match exact strings or regex patterns provided in a file against a Document and assigns them a named entity. The definitions can contain any number of named entities.
There are multiple ways and formats to set the extraction resource. It can be provided as a "JSON", "JSONL" or "CSV" file. A path to the file needs to be provided to setPatternsResource. The file format needs to be set as the "format" field in the option parameter map and, depending on the file type, additional parameters might need to be set.
To enable regex extraction, setEnablePatternRegex(true) needs to be called.
If the file is in a JSON format, then the rule definitions need to be given in a list with the fields "id", "label" and "patterns":
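As an illustration of that shape, the sketch below holds a small JSON rule list in a Scala string and writes it to a temporary file. The ids, labels, and patterns are invented for the example, and `JsonRulesExample` and `writeRules` are hypothetical helper names, not part of the library.

```scala
import java.nio.charset.StandardCharsets
import java.nio.file.{Files, Path}

object JsonRulesExample {
  // Hypothetical rules: ids, labels and patterns are invented for illustration.
  // The file is a single JSON list; each element carries "id", "label" and "patterns".
  val jsonRules: String =
    """[
      |  {"id": "person-jon", "label": "PERSON", "patterns": ["Jon", "John", "John Snow"]},
      |  {"id": "location-winterfell", "label": "LOCATION", "patterns": ["Winterfell"]}
      |]""".stripMargin

  // Writes the rules to a temporary file and returns its path.
  def writeRules(): Path = {
    val path = Files.createTempFile("patterns", ".json")
    Files.write(path, jsonRules.getBytes(StandardCharsets.UTF_8))
  }
}
```

The returned path could then be passed to setPatternsResource, with the "format" option set for JSON.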
The same fields also apply to a file in the JSONL format:
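A JSONL file carries one rule object per line instead of a single enclosing list. The sketch below builds such content as a Scala string; the rules are again invented for illustration, and `JsonlRulesExample` is a hypothetical name.

```scala
object JsonlRulesExample {
  // Hypothetical rules, one self-contained JSON object per line (no enclosing list).
  val jsonlRules: String = Seq(
    """{"id": "person-jon", "label": "PERSON", "patterns": ["Jon", "John", "John Snow"]}""",
    """{"id": "location-winterfell", "label": "LOCATION", "patterns": ["Winterfell"]}"""
  ).mkString("\n")
}
```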
In order to use a CSV file, an additional parameter "delimiter" needs to be set. For example, a "|" delimiter can be set by using
.setPatternsResource("patterns.csv", ReadAs.TEXT, Map("format"->"csv", "delimiter" -> "\\|"))
Example
In this example, each line of the entities file represents an entity and the associated string, delimited by "|".
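The following is a minimal sketch of such a pipeline, not the original example. It assumes Spark NLP is on the classpath and that the caller supplies an active SparkSession. The file contents, the sample sentence, the column names, and the `EntityRulerCsvExample` object are all hypothetical. Reading the rules from a local temporary file is likewise an assumption about the environment.

```scala
import java.nio.charset.StandardCharsets
import java.nio.file.Files

import com.johnsnowlabs.nlp.DocumentAssembler
import com.johnsnowlabs.nlp.annotators.Tokenizer
import com.johnsnowlabs.nlp.annotators.er.EntityRulerApproach
import com.johnsnowlabs.nlp.util.io.ReadAs
import org.apache.spark.ml.Pipeline
import org.apache.spark.sql.SparkSession

object EntityRulerCsvExample {
  // Hypothetical entities file: one "LABEL|pattern" pair per line.
  val csvRules: String = Seq(
    "PERSON|Jon",
    "PERSON|John Snow",
    "LOCATION|Winterfell"
  ).mkString("\n")

  // Fits the pipeline on one sample sentence and returns the matched chunk texts.
  def run(spark: SparkSession): Array[String] = {
    import spark.implicits._

    // Write the rules to a temporary file so the example does not depend on a fixed path.
    val path = Files.createTempFile("patterns", ".csv")
    Files.write(path, csvRules.getBytes(StandardCharsets.UTF_8))

    val documentAssembler = new DocumentAssembler()
      .setInputCol("text")
      .setOutputCol("document")

    val tokenizer = new Tokenizer()
      .setInputCols("document")
      .setOutputCol("token")

    // DOCUMENT and TOKEN annotations go in, CHUNK annotations come out.
    val entityRuler = new EntityRulerApproach()
      .setInputCols("document", "token")
      .setOutputCol("entities")
      .setPatternsResource(
        path.toString,
        ReadAs.TEXT,
        Map("format" -> "csv", "delimiter" -> "\\|"))

    val pipeline = new Pipeline()
      .setStages(Array(documentAssembler, tokenizer, entityRuler))

    val data = Seq("Jon Snow wants to be lord of Winterfell.").toDF("text")

    pipeline.fit(data).transform(data)
      .selectExpr("explode(entities.result) as entity")
      .as[String]
      .collect()
  }
}
```

Calling `EntityRulerCsvExample.run(spark)` from a Spark shell returns the text of each extracted chunk. To match regex patterns instead of exact strings, setEnablePatternRegex(true) would also be set on the annotator.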