org.allenai.common.indexing

BuildCorpusIndex


class BuildCorpusIndex extends Logging

CLI to build an Elasticsearch index on Aristo corpora. To build the index, Elasticsearch must be running: download the latest version of Elasticsearch, go to its 'bin' folder, and run ./elasticsearch. See http://joelabrahamsson.com/elasticsearch-101/ to get started. The constructor takes a Config object containing the corpus and other information necessary to build the index.

Linear Supertypes
Logging, AnyRef, Any

Instance Constructors

  1. new BuildCorpusIndex(config: com.typesafe.config.Config)

Value Members

  1. final def !=(arg0: Any): Boolean

    Definition Classes
    AnyRef → Any
  2. final def ##(): Int

    Definition Classes
    AnyRef → Any
  3. final def ==(arg0: Any): Boolean

    Definition Classes
    AnyRef → Any
  4. def addFileToIndex(file: File, bulkProcessor: BulkProcessor, codec: Codec, documentFormat: String): Unit

    Index a single file into elasticsearch.

    file

    to be indexed

    bulkProcessor

    to communicate with the elasticsearch instance

  5. def addSegmentToIndex(segment: String, documentFormat: String, source: String, segmentIndex: Int, bulkProcessor: BulkProcessor): Unit

    Index a single segment into elasticsearch.

    segment

    to be indexed

    documentFormat

    also describes the format of the segment

    source

    name of source for reference

    segmentIndex

    index of segment in file (for deduplication)

    bulkProcessor

    to communicate with the elasticsearch instance

  6. def addTreeToIndex(fileTree: Iterator[Path], codec: Codec, documentFormat: String): Seq[Future[Unit]]

    Index a file tree into the Elasticsearch instance. The work is divided into nThreads*4 Futures. Each Future synchronizes on currentFile, a variable used for logging, and then takes the next file from the stream if the stream is not empty.

    fileTree

    file stream to be indexed

    returns

    a sequence of Futures each representing the work done by a thread on this file tree.
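
    The work-sharing pattern described for addTreeToIndex (several Futures pulling from one shared stream under a lock) can be sketched with the standard library alone. This is a minimal illustration, not the class's actual implementation: the names SharedIteratorSketch, drain, lock, and process are hypothetical, with process standing in for the per-file indexing call and lock standing in for the currentFile monitor.

    ```scala
    import java.util.concurrent.Executors
    import java.util.concurrent.atomic.AtomicInteger
    import scala.concurrent.duration._
    import scala.concurrent.{Await, ExecutionContext, Future}

    object SharedIteratorSketch {
      /** Drain `items` with `nWorkers` Futures. `process` is a hypothetical
        * stand-in for the per-file indexing work.
        */
      def drain[A](items: Iterator[A], nWorkers: Int)(process: A => Unit)(
          implicit ec: ExecutionContext): Seq[Future[Unit]] = {
        // Hypothetical lock object; the documented class syncs on `currentFile`.
        val lock = new Object
        // Iterator is not thread-safe, so hasNext and next() run under one lock.
        def nextItem(): Option[A] = lock.synchronized {
          if (items.hasNext) Some(items.next()) else None
        }
        (1 to nWorkers).map { _ =>
          Future {
            var item = nextItem()
            while (item.isDefined) {
              process(item.get)
              item = nextItem()
            }
          }
        }
      }

      def main(args: Array[String]): Unit = {
        val nThreads = 2
        val pool = Executors.newFixedThreadPool(nThreads)
        implicit val ec: ExecutionContext = ExecutionContext.fromExecutor(pool)
        val processed = new AtomicInteger(0)
        // nThreads * 4 workers, mirroring the ratio mentioned in the description.
        val futures = drain((1 to 100).iterator, nThreads * 4) { _ =>
          processed.incrementAndGet()
          ()
        }
        Await.result(Future.sequence(futures), 10.seconds)
        println(s"processed ${processed.get} items")
        pool.shutdown()
      }
    }
    ```

    Creating more Futures than threads keeps every thread busy until the stream is exhausted; workers that find the stream already empty complete immediately.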

  7. def addWaterlooFileToIndex(inputFile: File, documentFormat: String, bulkProcessor: BulkProcessor, codec: Codec): Unit

    Index a file into the elasticsearch instance, following the convention of the waterloo corpus. Sentences are encapsulated by <SENT> ... </SENT> tags.

    inputFile

    path to the input directory

    bulkProcessor

    to communicate with the elasticsearch instance
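
    To illustrate the <SENT> ... </SENT> convention, the sketch below pulls sentences out of tagged text with a regular expression. The pattern is an assumption made for illustration: the actual value of sentenceSplitRegex is not shown in this documentation, and the names WaterlooSentSketch, sentPattern, and sentences are hypothetical.

    ```scala
    import scala.util.matching.Regex

    object WaterlooSentSketch {
      // Assumed pattern, not the documented sentenceSplitRegex.
      // (?s) lets `.` match newlines; `.*?` keeps each match inside one tag pair.
      val sentPattern: Regex = """(?s)<SENT>(.*?)</SENT>""".r

      /** Return the trimmed, non-empty text found between <SENT> and </SENT>. */
      def sentences(text: String): Iterator[String] =
        sentPattern.findAllMatchIn(text).map(_.group(1).trim).filter(_.nonEmpty)

      def main(args: Array[String]): Unit = {
        val sample =
          "<DOC><SENT>Plants need light.</SENT>\n<SENT> Water boils at 100 C. </SENT></DOC>"
        sentences(sample).foreach(println)
      }
    }
    ```

    Each extracted sentence could then be handed to a per-segment indexing call such as addSegmentToIndex, along with its position in the file.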

  8. final def asInstanceOf[T0]: T0

    Definition Classes
    Any
  9. def buildElasticSearchIndex(): Unit

    Build an index in ElasticSearch using the corpora specified in config.

  10. val buildFromScratch: Boolean

  11. val bulkProcessorUtility: BulkProcessorUtility

  12. def clone(): AnyRef

    Attributes
    protected[java.lang]
    Definition Classes
    AnyRef
    Annotations
    @throws( ... )
  13. val dumpFolderPath: String

    On failure, dump serialized requests to this path.

  14. final def eq(arg0: AnyRef): Boolean

    Definition Classes
    AnyRef
  15. def equals(arg0: Any): Boolean

    Definition Classes
    AnyRef → Any
  16. val esConfig: com.typesafe.config.Config

    Get Index Name and Index Type.

  17. def finalize(): Unit

    Attributes
    protected[java.lang]
    Definition Classes
    AnyRef
    Annotations
    @throws( classOf[java.lang.Throwable] )
  18. final def getClass(): Class[_]

    Definition Classes
    AnyRef → Any
  19. def getDatastorePathFromConfig(corpusConfig: com.typesafe.config.Config): (Path, Boolean)

  20. def getDirectoryFromDatastore(privacy: String, group: String, directory: String, version: Int): Path

  21. def getFileFromDatastore(privacy: String, group: String, directory: Option[String], file: String, version: Int): Path

  22. def getLocalPathFromConfig(corpusConfig: com.typesafe.config.Config): (Path, Boolean)

  23. def getSegmentsFromDocument(document: SegmentedDocument): Iterator[String]

  24. def hashCode(): Int

    Definition Classes
    AnyRef → Any
  25. val indexName: String

  26. val indexType: String

  27. val internalLogger: Logger

    Definition Classes
    Logging
  28. final def isInstanceOf[T0]: Boolean

    Definition Classes
    Any
  29. object logger

    Definition Classes
    Logging
  30. val nThreads: Int

  31. final def ne(arg0: AnyRef): Boolean

    Definition Classes
    AnyRef
  32. final def notify(): Unit

    Definition Classes
    AnyRef
  33. final def notifyAll(): Unit

    Definition Classes
    AnyRef
  34. def parseCorpusConfig(corpusConfig: com.typesafe.config.Config): ParsedConfig

    Take the config for a corpus, resolve paths, and return a simple object containing information about the corpus.

  35. def segmentFile(file: File, codec: Codec, documentFormat: String): Iterator[String]

  36. def segmentPlainTextFile(file: File, codec: Codec): Iterator[String]

  37. def segmentWikipediaFile(file: File, codec: Codec): Iterator[String]

  38. val sentenceSplitRegex: UnanchoredRegex

    Regex used to split sentences in waterloo corpus.

  39. final def synchronized[T0](arg0: ⇒ T0): T0

    Definition Classes
    AnyRef
  40. def toString(): String

    Definition Classes
    AnyRef → Any
  41. final def wait(): Unit

    Definition Classes
    AnyRef
    Annotations
    @throws( ... )
  42. final def wait(arg0: Long, arg1: Int): Unit

    Definition Classes
    AnyRef
    Annotations
    @throws( ... )
  43. final def wait(arg0: Long): Unit

    Definition Classes
    AnyRef
    Annotations
    @throws( ... )
