indexing

class BarronsDocumentReader extends AnyRef
class BuildCorpusIndex extends Logging

CLI to build an Elasticsearch index on Aristo corpora. To build the index, Elasticsearch must be running: download the latest version of Elasticsearch, go to the 'bin' folder, and run ./elasticsearch. See http://joelabrahamsson.com/elasticsearch-101/ to get started. Takes a Config object containing the corpus and other information needed to build the index.
class BulkProcessorUtility extends Logging

Factory for elasticsearch BulkProcessor.
case class NonTerminalSegment(segmentType: String, segments: Seq[Segment]) extends Segment with Product with Serializable
case class ParsedConfig(path: Path, isDirectory: Boolean, encoding: String, documentFormat: String) extends Product with Serializable
sealed abstract class Segment extends AnyRef
class SegmentedDocument extends Document

A document that has been broken up into (potentially nested) segments. Note that nlpstack also has a notion of segments and segmenters, but those are used exclusively for sentences. This class aims to capture document structure at a higher level than sentences.
class SegmentedDocumentBuilder extends AnyRef
case class TerminalSegment(segmentType: String, text: String) extends Segment with Product with Serializable
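
The nesting described for SegmentedDocument can be illustrated with the Segment types listed above. The following is a minimal sketch, not the library's own code: the three types are re-declared from their listed signatures so the example stands alone, and the terminalTexts helper, the SegmentSketch object, and the segment type labels ("chapter", "section", "paragraph", "title") are hypothetical names introduced only for illustration.

```scala
// Sketch only: the three Segment types are re-declared here from the
// signatures in this listing so that the example is self-contained.
sealed abstract class Segment
case class TerminalSegment(segmentType: String, text: String) extends Segment
case class NonTerminalSegment(segmentType: String, segments: Seq[Segment]) extends Segment

object SegmentSketch {
  // Hypothetical helper (not part of the library): collect the text of every
  // terminal segment in depth-first, left-to-right order.
  def terminalTexts(segment: Segment): Seq[String] = segment match {
    case TerminalSegment(_, text)        => Seq(text)
    case NonTerminalSegment(_, children) => children.flatMap(terminalTexts)
  }

  // Hypothetical segment type labels, chosen for illustration only.
  val example: Segment = NonTerminalSegment("chapter", Seq(
    TerminalSegment("title", "Cells"),
    NonTerminalSegment("section", Seq(
      TerminalSegment("paragraph", "Cells are the basic unit of life."),
      TerminalSegment("paragraph", "Plant cells have cell walls.")
    ))
  ))
}
```

Because Segment is sealed, the compiler can check that a pattern match like the one in terminalTexts covers both the terminal and non-terminal cases.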

object BuildCorpusIndex
object BuildCorpusIndexRunner extends App

Indexing main object. Configuration is specified in indexing.conf in org.allenai.common.indexing. See common/Readme for details.
object ElasticSearchTransportClientUtil extends Logging

Utility object that takes config parameters from the application config file and constructs a transport client to talk to Elasticsearch.
object ParsingUtils
object WaterlooSegmentScript extends App with Logging

Script used to segment the Waterloo corpus at the sentence level. Splits docs on <DOC> ... </DOC> tags, determines whether each doc is in English by counting the fraction of stop words, and discards the doc if it is not. Sentence-segments each remaining doc using nlpstack, wraps each sentence in <SENT> ... </SENT> tags, and then rewrites the entire doc to file.
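
The steps described for WaterlooSegmentScript can be sketched roughly as follows. This is an illustration under stated assumptions, not the script's actual code: the EnglishFilterSketch object, its method names, the short stop word list, and the 0.2 threshold are all hypothetical, and the sentence segmentation step (done with nlpstack in the real script) is not reproduced, so wrapSentences takes already-split sentences.

```scala
object EnglishFilterSketch {
  // Assumption: a tiny illustrative stop word list; the real script's list is not shown.
  val stopWords: Set[String] =
    Set("the", "a", "an", "of", "and", "to", "in", "is", "it", "that")

  // Matches the body of each <DOC> ... </DOC> block, across line breaks.
  private val docPattern = "(?s)<DOC>(.*?)</DOC>".r

  // Split raw corpus text into the trimmed bodies of its <DOC> blocks.
  def splitDocs(raw: String): Seq[String] =
    docPattern.findAllMatchIn(raw).map(_.group(1).trim).toList

  // Fraction of whitespace-separated tokens that are stop words (0.0 for empty text).
  def stopWordFraction(text: String): Double = {
    val tokens = text.toLowerCase.split("\\s+").filter(_.nonEmpty)
    if (tokens.isEmpty) 0.0
    else tokens.count(stopWords.contains).toDouble / tokens.length
  }

  // Assumption: the 0.2 default threshold is arbitrary and chosen for illustration.
  def looksEnglish(text: String, threshold: Double = 0.2): Boolean =
    stopWordFraction(text) >= threshold

  // Wrap each already-segmented sentence in <SENT> ... </SENT> tags, one per line.
  def wrapSentences(sentences: Seq[String]): String =
    sentences.map(s => s"<SENT>$s</SENT>").mkString("\n")
}
```

A caller would chain these as splitDocs, then filter with looksEnglish, then sentence-segment each surviving doc and pass the sentences to wrapSentences before writing the result out.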