CLI to build an Elasticsearch index on Aristo corpora.
Factory for the Elasticsearch BulkProcessor.
A document that has been broken up into (potentially nested) segments. Note that there's a notion of a segment and segmenter in the nlpstack, but those are used exclusively for sentences. This class aims to capture higher-level document structure than sentences.
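A nested segment structure of this kind can be sketched as a small recursive data type. This is only an illustration: the names `Segment`, `TextSegment`, `NestedSegment`, `SegmentedDocument`, and `Segments` are hypothetical and are not taken from the library's actual class definitions.

```scala
// Hypothetical model of a document split into (potentially nested) segments.
// All names here are illustrative assumptions, not the library's real API.
sealed trait Segment

// A leaf segment holding raw text.
final case class TextSegment(text: String) extends Segment

// A segment that groups other segments, e.g. a chapter containing sections.
final case class NestedSegment(label: String, children: Seq[Segment]) extends Segment

// A document is an identifier plus its top-level segments.
final case class SegmentedDocument(id: String, segments: Seq[Segment])

object Segments {
  // Collect the text of every leaf under a segment, in document order.
  def leafTexts(segment: Segment): List[String] = segment match {
    case TextSegment(text)          => List(text)
    case NestedSegment(_, children) => children.toList.flatMap(c => leafTexts(c))
  }

  // Collect the text of every leaf in a whole document, in document order.
  def documentTexts(doc: SegmentedDocument): List[String] =
    doc.segments.toList.flatMap(s => leafTexts(s))
}
```

Flattening with `Segments.documentTexts` would give an indexer the leaf text while the nesting itself stays available for higher-level structure.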
Main indexing object. Configuration is specified in indexing.conf in org.allenai.common.indexing. See common/Readme for details.
Utility object that reads config parameters from the application config file and constructs a transport client to talk to Elasticsearch.
Script used to segment the Waterloo corpus at the sentence level. Splits docs based on <DOC> ... </DOC> tags, determines whether each doc is in "English" by computing the fraction of stop words, and discards the doc if it is not. Sentence-segments each remaining doc using nlpstack, wraps each sentence in <SENT> ... </SENT> tags, and then rewrites the entire doc to file.
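The pipeline described above can be sketched roughly as follows. This is a minimal sketch under stated assumptions, not the script's actual code: the object name `WaterlooSketch`, the stop-word list, the 0.2 threshold, and the punctuation-based sentence splitter (a stand-in for nlpstack) are all hypothetical.

```scala
// Minimal sketch of the described steps; names and thresholds are assumptions.
object WaterlooSketch {
  // Tiny illustrative stop-word list (assumption; the real list is not shown in the source).
  val stopWords: Set[String] =
    Set("the", "a", "an", "of", "and", "to", "in", "is", "it", "that")

  // Assumed cutoff; the source does not state the actual fraction used.
  val minStopWordFraction: Double = 0.2

  // Extract the body of every <DOC> ... </DOC> block.
  def splitDocs(raw: String): List[String] = {
    val docPattern = "(?s)<DOC>(.*?)</DOC>".r
    docPattern.findAllMatchIn(raw).map(m => m.group(1).trim).toList
  }

  // Fraction of tokens that are stop words (0.0 for empty text).
  def stopWordFraction(text: String): Double = {
    val tokens = text.toLowerCase.split("\\W+").filter(_.nonEmpty)
    if (tokens.isEmpty) 0.0
    else tokens.count(t => stopWords.contains(t)).toDouble / tokens.length
  }

  // Heuristic language check based on the stop-word fraction.
  def isEnglish(text: String): Boolean =
    stopWordFraction(text) >= minStopWordFraction

  // Naive splitter on sentence-final punctuation; stands in for nlpstack.
  def sentences(text: String): List[String] =
    text.split("(?<=[.!?])\\s+").map(_.trim).filter(_.nonEmpty).toList

  // Wrap each sentence in <SENT> tags and rebuild the <DOC> block.
  def rewrite(doc: String): String =
    sentences(doc).map(s => s"<SENT>$s</SENT>").mkString("<DOC>\n", "\n", "\n</DOC>")

  // Keep only docs judged English and rewrite each one.
  def process(raw: String): List[String] =
    splitDocs(raw).filter(isEnglish).map(rewrite)
}
```

A real run would read the corpus from disk and write each rewritten doc back to file; those I/O steps are left out here to keep the sketch self-contained.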
CLI to build an Elasticsearch index on Aristo corpora. To build the index, Elasticsearch must be running: download the latest version of Elasticsearch, go to the 'bin' folder, and run ./elasticsearch. See http://joelabrahamsson.com/elasticsearch-101/ to get started. Takes a Config object containing the corpus and other information necessary to build the index.