Bidirectional MEMM sequence tagger User: mihais Date: 8/27/17
Lexicon-based NER which efficiently recognizes entities from large dictionaries by combining like matchers
Lexicon-based NER similar to CombinedLexiconNER but which also adds efficient serialization, deserialization, and storage by using the CompactTrie
A class that builds either a
depending on the value of useCompact.
The building performed here works on a text file. Both kinds of NERs are also Serializable and can be loaded as objects without the text parsing.
Implements common features used in sequence tagging Created by mihais on 6/8/17.
Generates all accepted lexical variations for an entity User: mihais Date: 10/3/17
The abstract base class for several concrete child classes used for Named Entity Recognition (NER) based on the contents of lexica, which are lists of words and phrases representing named entities
For all of these classes, NER labels are derived by the LexiconNERBuilders from the file names of the lexica or from the records in overrideKBs. This class, via the variables USE_FAST and USE_COMPACT, controls which builder is used.
The collection of child classes is small:
- The SeparatedLexiconNER is closest to the original implementation. It has a BooleanHashTrie for each label and in that trie, Boolean values indicate that the sequence of strings leading there is a named entity. Each trie structure must be searched for potential named entities.
- The CombinedLexiconNER replaces the BooleanHashTrie with an IntHashTrie, which stores an Int where the BooleanHashTrie stores a Boolean. The Int indicates which label to use for the entity just found. In this way, only one trie (or two if there are different case sensitivity settings) needs to be searched no matter how many labels there are (up to Integer.MAX_VALUE).
- The CompactLexiconNER uses the same strategy to minimize the number of tries, but also converts the tries into CompactTries which consist of arrays of integers indicating offsets into other arrays. In this way the time it takes to de/serialize the NER is reduced, and some lookup operations are made more efficient.
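The combined strategy described above can be illustrated with a toy trie. This is a minimal sketch, not the library's IntHashTrie: the names `ToyIntTrie`, `Node`, `add`, and `longestMatch` are hypothetical, and both the longest-match policy and the rule that the first label added for a sequence wins are assumptions made for the example.

```scala
import scala.collection.mutable

// Hypothetical toy trie, NOT the library's IntHashTrie. Each node maps the
// next token to a child and optionally carries the index of a label.
object TrieSketch {
  final class Node {
    val children = mutable.HashMap.empty[String, Node]
    var labelIndex: Option[Int] = None
  }

  final class ToyIntTrie {
    private val root = new Node

    // Records that the token sequence is an entity with the given label index.
    // Assumed policy: the first label added for a sequence is kept.
    def add(tokens: Seq[String], labelIndex: Int): Unit = {
      var node = root
      for (token <- tokens)
        node = node.children.getOrElseUpdate(token, new Node)
      if (node.labelIndex.isEmpty) node.labelIndex = Some(labelIndex)
    }

    // Returns (length, labelIndex) of the longest entity starting at `start`.
    def longestMatch(tokens: IndexedSeq[String], start: Int): Option[(Int, Int)] = {
      var node = root
      var best: Option[(Int, Int)] = None
      var i = start
      var done = false
      while (!done && i < tokens.length) {
        node.children.get(tokens(i)) match {
          case Some(child) =>
            node = child
            i += 1
            node.labelIndex.foreach(label => best = Some((i - start, label)))
          case None =>
            done = true
        }
      }
      best
    }
  }

  // One shared trie serves every label; the Int picks the label.
  val labels = Vector("PERSON", "LOCATION")

  def example(): Option[(Int, String)] = {
    val trie = new ToyIntTrie
    trie.add(Seq("new", "york"), 1)
    trie.add(Seq("new", "york", "city"), 1)
    trie.add(Seq("john", "smith"), 0)
    val sentence = Vector("visit", "new", "york", "city", "today")
    trie.longestMatch(sentence, 1).map { case (len, idx) => (len, labels(idx)) }
  }
}
```

With a Boolean in place of the Int, the same structure would need one trie per label, and each of them would have to be searched at every token position.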
Concrete subclasses are responsible for building various NERs. The mapping is as follows:
For an explanation of how the NERs differ from each other, see their superclass, LexiconNER.
Sequence tagger using a maximum entropy Markov model (MEMM) User: mihais Date: 8/26/17
Stores training data for sequence modeling Mandatory columns: 0 - word, 1 - label Optional columns: 2 - POS tag, 3+ SRL arguments
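As an illustration of this column layout, the following sketch parses a single row. The `Row` case class and `parseRow` function are hypothetical names introduced here, and splitting on whitespace is an assumption; the actual files may use a different separator.

```scala
object ColumnRowSketch {
  // Hypothetical container for one token row; field names are illustrative.
  final case class Row(
    word: String,            // column 0 (mandatory)
    label: String,           // column 1 (mandatory)
    posTag: Option[String],  // column 2 (optional)
    srlArgs: Seq[String]     // columns 3+ (optional)
  )

  // Assumes whitespace-separated columns.
  def parseRow(line: String): Row = {
    val cols = line.trim.split("\\s+").toVector
    require(cols.length >= 2, s"expected at least word and label columns: '$line'")
    Row(cols(0), cols(1), cols.lift(2), cols.drop(3))
  }
}
```

Calling `ColumnRowSketch.parseRow("Paris B-LOC NNP")` fills the word, label, and POS tag and leaves the SRL arguments empty.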
Lexicon-based NER, which efficiently recognizes entities from large dictionaries
Note: This is a cleaned-up version of the old RuleNER. It may have been known simply as LexiconNER at one point, but was renamed to emphasize the fact that each KB is stored in a separate matcher (BooleanHashTrie). Other variations get by with fewer matchers.
Create a SeparatedLexiconNER object using either LexiconNER.apply() or SlowLexiconNERBuilder.build() rather than by the constructor if at all possible. Use it by calling the find() method on a single sentence.
Computes P, R, F1 scores for the complete mentions produced by a sequence tagger, in the BIO notation User: mihais Date: 2/27/15
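Mention-level scoring of this kind can be sketched as follows. This is not the library's scorer: `mentions` and `score` are hypothetical names, a mention counts as correct only when its span and type both match the gold mention exactly, and a stray I- tag with no open mention is treated as outside any mention, which is an assumption.

```scala
object BioScoreSketch {
  // A mention is (start index, end index exclusive, entity type).
  type Mention = (Int, Int, String)

  // Collects complete mentions from a sequence of BIO tags.
  def mentions(tags: Seq[String]): Set[Mention] = {
    val found = Set.newBuilder[Mention]
    var start = -1
    var kind = ""
    def close(end: Int): Unit = {
      if (start >= 0) found += ((start, end, kind))
      start = -1
    }
    for ((tag, i) <- tags.zipWithIndex) {
      if (tag.startsWith("B-")) {
        close(i)
        start = i
        kind = tag.drop(2)
      } else if (tag.startsWith("I-") && start >= 0 && tag.drop(2) == kind) {
        // The open mention continues; nothing to do.
      } else {
        close(i) // "O" or a stray I- tag ends any open mention
      }
    }
    close(tags.length)
    found.result()
  }

  // Returns (precision, recall, F1) over complete mentions.
  def score(gold: Seq[String], predicted: Seq[String]): (Double, Double, Double) = {
    val g = mentions(gold)
    val p = mentions(predicted)
    val correct = (g & p).size.toDouble
    val precision = if (p.isEmpty) 0.0 else correct / p.size
    val recall = if (g.isEmpty) 0.0 else correct / g.size
    val f1 =
      if (precision + recall == 0.0) 0.0
      else 2 * precision * recall / (precision + recall)
    (precision, recall, f1)
  }
}
```

Scoring whole mentions rather than individual tags means a prediction that gets only part of an entity right earns no credit for it.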
Trait for all sequence taggers User: mihais Date: 8/25/17
Implements evaluation of a sequence tagger Created by mihais on 6/8/17.
Logger holder User: mihais Date: 8/26/17
A class that builds a SeparatedLexiconNER
The building performed here works on a text file. The SeparatedLexiconNER is also Serializable and can be loaded as an object without the text parsing.
High-level trait for a sequence tagger User: mihais Date: 10/12/17
Detects the case of a word
Reads the CoNLL-like column format
Converts the CoNLLX column-based format to our Document by reading only words and POS tags Created by mihais on 6/8/17. Last Modified: Fix compiler issue: import scala.io.Source.
Transforms -LRB-, -LCB-, etc. tokens back into "(", "{", etc. This is necessary because the POS WSJ dataset uses the -LRB- conventions to replace words in the dataset, whereas all the other datasets we use (NER, syntax) do not. Note that we keep the *POS tags* as -LRB-, -LCB-, etc., because these are standard Penn Treebank tags. We replace only the words.
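The word-only replacement can be sketched as below. The object name `BracketSketch`, the function `restoreWords`, and the exact set of escapes handled are assumptions for illustration; the library may cover a different set.

```scala
object BracketSketch {
  // Penn Treebank bracket escapes and the characters they stand for.
  private val replacements = Map(
    "-LRB-" -> "(", "-RRB-" -> ")",
    "-LCB-" -> "{", "-RCB-" -> "}",
    "-LSB-" -> "[", "-RSB-" -> "]"
  )

  // Rewrites only the word of each (word, POS tag) pair; tags pass through.
  def restoreWords(wordsAndTags: Seq[(String, String)]): Seq[(String, String)] =
    wordsAndTags.map { case (word, tag) =>
      (replacements.getOrElse(word, word), tag)
    }
}
```

Here the pair `("-LRB-", "-LRB-")` becomes `("(", "-LRB-")`: the word changes while the tag is preserved.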
Simple shell for sequence taggers Created by mihais on 6/7/17.
Lexicon-based NER which efficiently recognizes entities from large dictionaries by combining like matchers
Case insensitive matching is performed by one matcher and case sensitive by the other. Each can account for multiple KBs. Each IntHashTrie stores Ints which indicate which of the KBs an entry comes from. The KBs, either from the kbs or overrideKBs in LexiconNER.apply, have priorities, and the one with highest priority is recorded.
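The priority rule can be sketched with a plain Map standing in for the IntHashTrie. The `combine` function is hypothetical, and the sketch assumes that KBs are passed in order from highest to lowest priority, so a smaller index wins; the ordering the library actually uses between kbs and overrideKBs is not stated here.

```scala
object PrioritySketch {
  // Each KB is (label, phrases). Assumption: earlier KBs have higher priority.
  // Returns, for every phrase, the index of the highest-priority KB containing it.
  def combine(kbs: Seq[(String, Seq[String])], caseInsensitive: Boolean): Map[String, Int] = {
    val entries = kbs.zipWithIndex.flatMap { case ((_, phrases), index) =>
      phrases.map { phrase =>
        (if (caseInsensitive) phrase.toLowerCase else phrase, index)
      }
    }
    // Group by phrase and keep the smallest index, i.e. the highest priority.
    entries.groupBy(_._1).map { case (phrase, hits) => phrase -> hits.map(_._2).min }
  }
}
```

With case-insensitive matching, "Apple" from the first KB and "apple" from the second collapse into one entry that points at the first KB. With case-sensitive matching they stay separate entries.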