org.allenai.common.indexing

package indexing

Type Members

  1. class BarronsDocumentReader extends AnyRef

  2. class BuildCorpusIndex extends Logging

    CLI to build an Elasticsearch index on Aristo corpora. Building the index requires a running Elasticsearch instance: download the latest version of Elasticsearch, go to the 'bin' folder, and run ./elasticsearch. Refer to http://joelabrahamsson.com/elasticsearch-101/ to get started. Takes a Config object containing the corpus and other information necessary to build the index.

  3. class BulkProcessorUtility extends Logging

    Factory for the Elasticsearch BulkProcessor.

  4. case class NonTerminalSegment(segmentType: String, segments: Seq[Segment]) extends Segment with Product with Serializable

  5. case class ParsedConfig(path: Path, isDirectory: Boolean, encoding: String, documentFormat: String) extends Product with Serializable

  6. sealed abstract class Segment extends AnyRef

  7. class SegmentedDocument extends Document

    A document that has been broken up into (potentially nested) segments. Note that nlpstack also has a notion of a segment and a segmenter, but those are used exclusively for sentences. This class aims to capture higher-level document structure than sentences.

  8. class SegmentedDocumentBuilder extends AnyRef

  9. case class TerminalSegment(segmentType: String, text: String) extends Segment with Product with Serializable
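
The Segment hierarchy listed above forms a small tree: a NonTerminalSegment holds child segments and a TerminalSegment holds text. The following is a minimal sketch of how such a tree might be traversed. The class signatures are copied from the listing; the enclosing object `SegmentSketch` and the helper `terminalTexts` are hypothetical names introduced only for illustration and are not part of the package.

```scala
object SegmentSketch {
  // Signatures mirror the Type Members listing; this is a standalone copy,
  // not the library's own source.
  sealed abstract class Segment
  case class TerminalSegment(segmentType: String, text: String) extends Segment
  case class NonTerminalSegment(segmentType: String, segments: Seq[Segment]) extends Segment

  /** Hypothetical helper: collects the text of every terminal segment,
    * depth first and left to right.
    */
  def terminalTexts(segment: Segment): Seq[String] = segment match {
    case TerminalSegment(_, text) => Seq(text)
    case NonTerminalSegment(_, children) => children.flatMap(terminalTexts)
  }
}
```

Because Segment is sealed, the compiler can check that the match covers both case classes, which is presumably one reason the hierarchy is declared this way.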

Value Members

  1. object BuildCorpusIndex

  2. object BuildCorpusIndexRunner extends App

    Indexing main object. Configuration is specified in indexing.conf in org.allenai.common.indexing. See common/Readme for details.

  3. object ElasticSearchTransportClientUtil extends Logging

    Utility object that reads config parameters from the application config file and constructs a transport client to talk to Elasticsearch.

  4. object ParsingUtils

  5. object WaterlooSegmentScript extends App with Logging

    Script used to segment the Waterloo corpus at the sentence level. Splits docs based on <DOC> ... </DOC> tags, determines whether each doc is in English by counting the fraction of stop words, and discards the doc if it is not. Segments each remaining doc into sentences using nlpstack, wraps each sentence in <SENT> ... </SENT> tags, and then writes the rewritten doc to file.
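
The steps described for WaterlooSegmentScript can be sketched roughly as follows. This is an illustration under stated assumptions, not the script's actual code: the object name `WaterlooSketch`, the stop-word list, the 0.2 threshold, and all helper names are hypothetical, and sentence segmentation (done with nlpstack in the real script) is left out, so `renderDoc` takes sentences that have already been split.

```scala
object WaterlooSketch {
  // Assumed stop-word list and threshold; the real script's values are not
  // given in this listing.
  val stopWords: Set[String] =
    Set("the", "a", "an", "of", "and", "to", "in", "is", "it", "that")
  val minStopWordFraction: Double = 0.2

  /** Extracts the contents of each <DOC> ... </DOC> block. */
  def splitDocs(raw: String): Seq[String] = {
    val docPattern = "(?s)<DOC>(.*?)</DOC>".r
    docPattern.findAllMatchIn(raw).map(_.group(1).trim).toList
  }

  /** Fraction of tokens that appear in the stop-word list (0.0 for empty text). */
  def stopWordFraction(text: String): Double = {
    val tokens = text.toLowerCase.split("\\W+").filter(_.nonEmpty)
    if (tokens.isEmpty) 0.0
    else tokens.count(stopWords.contains).toDouble / tokens.length
  }

  /** Treats a doc as English when enough of its tokens are stop words. */
  def looksEnglish(text: String): Boolean =
    stopWordFraction(text) >= minStopWordFraction

  /** Wraps already-segmented sentences in <SENT> tags inside a <DOC> block. */
  def renderDoc(sentences: Seq[String]): String =
    sentences.map(s => s"<SENT>$s</SENT>").mkString("<DOC>\n", "\n", "\n</DOC>")
}
```

A stop-word ratio is a cheap language filter: it needs no model, and text in other languages rarely contains many English function words.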
