The BrownClustersTagger tags the tokens of a sentence with their Brown clusters.
A class that parses Google N-Gram data (http://commondatastorage.googleapis.com/books/syntactic-ngrams/index.html) to provide information about a requested n-gram.
Maps feature names to integers. Useful for serializing TrainingData instances for consumption by command-line machine learning tools.
an indexed sequence of feature names
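As a rough illustration of the encoding described above, the following sketch maps each feature name to its position in an indexed sequence. The names `Name` and `NameIndex` are hypothetical stand-ins invented for this example and are not the library's actual classes or signatures.

```scala
// Hypothetical stand-in for a feature name: a list of Symbols.
final case class Name(symbols: List[Symbol])

// Hypothetical encoder: a feature's integer code is its position in `names`.
final class NameIndex(val names: IndexedSeq[Name]) {
  // Built once so lookups do not rescan the sequence.
  // Assumption: names are distinct; for duplicates the later position wins.
  private val toIndex: Map[Name, Int] = names.zipWithIndex.toMap

  // Returns the integer code of a feature name, if the name is known.
  def indexOf(name: Name): Option[Int] = toIndex.get(name)

  // Returns the feature name stored at the given integer code.
  def nameOf(index: Int): Name = names(index)
}
```

Integer codes of this kind are what a command-line learning tool would see in place of the symbolic names.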
The name of a feature, represented as a list of Symbols.
the list of symbols comprising the feature name
A mapping from feature names to values.
Unspecified feature names are assumed to correspond to a value of zero.
the map from feature names to values
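The zero default for unspecified feature names can be sketched as a sparse map with a fallback lookup. `SparseVector` and `valueOf` are hypothetical names chosen for this example; the library's own class may expose a different interface.

```scala
// Hypothetical sparse vector: only explicitly given features are stored.
final case class SparseVector(values: Map[List[Symbol], Double]) {
  // A feature name that is absent from the map is treated as having value 0.
  def valueOf(name: List[Symbol]): Double = values.getOrElse(name, 0.0)
}
```

Storing only the specified entries keeps vectors small when most features are absent from a given instance.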
A SentenceTagger that tags tokens with unigram features from Google N-grams.
the Google N-grams resource
the Google N-grams tag type you want to create features for
A weighted linear combination of features.
map from feature names to weight coefficients
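A weighted linear combination can be sketched as a sum of weight times value over the features present in a vector. The object and method names below are hypothetical, and treating a feature with no weight as contributing zero is an assumption of this sketch, not something the description above states.

```scala
object LinearScoreSketch {
  // Hypothetical alias: a feature name as a list of Symbols.
  type Name = List[Symbol]

  // Sums weight * value over every feature in `features`.
  // Assumption: a feature without a weight coefficient contributes 0.
  def score(weights: Map[Name, Double], features: Map[Name, Double]): Double =
    features.foldLeft(0.0) { case (acc, (name, value)) =>
      acc + weights.getOrElse(name, 0.0) * value
    }
}
```

Iterating over the feature vector rather than the weight map keeps the cost proportional to the number of features actually present.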
Utility case classes to represent information associated with an ngram in the Google Ngram corpus.
Abstraction for a set of labeled feature vectors.
Provides various serialization options for different machine learning tools.
a sequence of feature vectors labeled with integer outcomes
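The description above does not say which tools or formats are supported. As one plausible illustration only, the sketch below writes a labeled vector in an SVMlight-style sparse line (the label followed by ascending `index:value` pairs with 1-based indices). `SvmLightSketch` and `toLine` are hypothetical names, and the input is assumed to be already encoded with 0-based integer feature indices.

```scala
object SvmLightSketch {
  // Renders one labeled vector as "<outcome> <index>:<value> ...".
  // Assumption: `features` is keyed by 0-based feature index; the output
  // uses 1-based indices in ascending order.
  def toLine(features: Map[Int, Double], outcome: Int): String = {
    val columns = features.toSeq.sortBy(_._1).map { case (index, value) =>
      s"${index + 1}:$value"
    }
    (outcome.toString +: columns).mkString(" ")
  }
}
```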
Encapsulates unigram info pertaining to a word. Whereas the general-purpose NgramInfo class holds a seq of SyntacticInfo objects, this class holds a single SyntacticInfo representing the info for a single gram.
A class that uses JVerbnet, a third-party wrapper library for Verbnet data (http://projects.csail.mit.edu/jverbnet/), to quickly look up various Verbnet features for a verb.
A SentenceTagger that tags sentence tokens using Verbnet frames.
the associated Verbnet resource
set to true if you want secondary (rather than primary) frames
A WrapperClassifier wraps a ProbabilisticClassifier (which uses integer-based feature names) in an interface that allows you to use the more natural org.allenai.nlpstack.parse.poly.ml FeatureVector format for classification. This is a trait that specific wrappers can extend.
Trains a WrapperClassifier from training data.
Companion object.
Object containing utility methods to parse a Google Ngram corpus. These methods are not specific to the type of corpus (unigram, bigram, etc.).
Object encapsulating some functionality specific to unigrams. Used wherever features need to be constructed based on unigrams (Google Ngram Nodes).
Provides serialization and deserialization methods based on the runtime type of the WrapperClassifier.
A class that parses Google N-Gram data (http://commondatastorage.googleapis.com/books/syntactic-ngrams/index.html) to provide information about a requested n-gram. Takes the datastore location details for a data directory and parses each file, expected to be in the following format (from https://docs.google.com/document/d/14PWeoTkrnKk9H8_7CfVbdvuoFZ7jYivNTkBX2Hj7qLw/edit):
    head_word<TAB>syntactic-ngram<TAB>total_count<TAB>counts_by_year
The counts_by_year format is a tab-separated list of year<comma>count items. Years are sorted in ascending order, and only years with non-zero counts are included.
The syntactic-ngram format is a space-separated list of tokens, each token format is: “word/pos-tag/dep-label/head-index”. The word field can contain any non-whitespace character. The other fields can contain any non-whitespace character except for ‘/’. pos-tag is a Penn-Treebank part-of-speech tag. dep-label is a stanford-basic-dependencies label. head-index is an integer, pointing to the head of the current token. “1” refers to the first token in the list, 2 the second, and 0 indicates that the head is the root of the fragment.
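The line format above can be parsed roughly as follows. This is a minimal sketch, not the library's parser: `SyntacticNgramSketch`, `Token`, `Record`, `parseToken` and `parseLine` are hypothetical names invented for this example. Because the word field may itself contain ‘/’, the sketch takes the last three slash-separated fields as pos-tag, dep-label and head-index and rejoins whatever precedes them as the word.

```scala
object SyntacticNgramSketch {
  // Hypothetical containers for one parsed line.
  final case class Token(word: String, posTag: String, depLabel: String, headIndex: Int)
  final case class Record(
    headWord: String,
    tokens: Seq[Token],
    totalCount: Long,
    countsByYear: Seq[(Int, Long)]
  )

  // Parses "word/pos-tag/dep-label/head-index". Only the word may contain
  // '/', so the last three fields are taken from the right.
  def parseToken(s: String): Token = {
    val parts = s.split("/")
    require(parts.length >= 4, s"malformed token: $s")
    val n = parts.length
    Token(parts.take(n - 3).mkString("/"), parts(n - 3), parts(n - 2), parts(n - 1).toInt)
  }

  // Parses head_word<TAB>syntactic-ngram<TAB>total_count<TAB>counts_by_year,
  // where counts_by_year is zero or more tab-separated "year,count" items.
  def parseLine(line: String): Record = {
    val fields = line.split("\t")
    require(fields.length >= 3, s"malformed line: $line")
    val tokens = fields(1).split(" ").toSeq.map(parseToken)
    val counts = fields.drop(3).toSeq.map { item =>
      item.split(",") match {
        case Array(year, count) => (year.toInt, count.toLong)
        case _ => throw new IllegalArgumentException(s"malformed year count: $item")
      }
    }
    Record(fields(0), tokens, fields(2).toLong, counts)
  }
}
```

A head-index of 0 marks the root of the fragment, so a caller that wants the head token of a non-root entry would look up `tokens(headIndex - 1)`.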