org.allenai.common.indexing

BuildCorpusIndex


class BuildCorpusIndex extends Logging

CLI to build an Elasticsearch index on Aristo corpora. Building the index requires a running Elasticsearch instance: download the latest version of Elasticsearch, go to its 'bin' folder, and run ./elasticsearch. See http://joelabrahamsson.com/elasticsearch-101/ to get started. The class takes a Config object containing the corpus and other information necessary to build the index.

Linear Supertypes
Logging, AnyRef, Any

Instance Constructors

  1. new BuildCorpusIndex(config: com.typesafe.config.Config)

Value Members

  1. final def !=(arg0: Any): Boolean

    Definition Classes
    AnyRef → Any
  2. final def ##(): Int

    Definition Classes
    AnyRef → Any
  3. final def ==(arg0: Any): Boolean

    Definition Classes
    AnyRef → Any
  4. def addFileToIndex(file: File, bulkProcessor: BulkProcessor, codec: Codec): Unit

    Index a single file into elasticsearch.

    file

    to be indexed

    bulkProcessor

    to communicate with the elasticsearch instance

  5. def addSentenceToIndex(sentence: String, source: String, sentenceIndex: Int, bulkProcessor: BulkProcessor): Unit

    Index a single sentence into elasticsearch.

    sentence

    to be indexed

    source

    name of source for reference

    sentenceIndex

    index of sentence in file (for deduplication)

    bulkProcessor

    to communicate with the elasticsearch instance

  6. def addTreeToIndex(fileTree: Iterator[Path], codec: Codec): Seq[Future[Unit]]

    Index a file tree into the Elasticsearch instance. The work is divided among nThreads*4 Futures. Each Future synchronizes on currentFile, a variable used for logging, and then takes the next file from the stream if the stream is not empty.

    fileTree

    file stream to be indexed

    returns

    a sequence of Futures each representing the work done by a thread on this file tree.

  7. def addWaterlooDirectoryToIndex(indirPath: String, codec: Codec): Seq[Future[Unit]]

    Index a folder into the elasticsearch instance, following the convention of the waterloo corpus. Sentences are encapsulated by <SENT> ... </SENT> tags.

    indirPath

    path to the input directory

  8. def addWaterlooFileToIndex(inputFile: File, bulkProcessor: BulkProcessor, codec: Codec): Unit

    Index a file into the elasticsearch instance, following the convention of the waterloo corpus. Sentences are encapsulated by <SENT> ... </SENT> tags.

    inputFile

    the input file to be indexed

    bulkProcessor

    to communicate with the elasticsearch instance

  9. final def asInstanceOf[T0]: T0

    Definition Classes
    Any
  10. def buildElasticSearchIndex(): Unit

    Build an index in ElasticSearch using the corpora specified in config.

  11. val buildFromScratch: Boolean

  12. val bulkProcessorUtility: BulkProcessorUtility

  13. def clone(): AnyRef

    Attributes
    protected[java.lang]
    Definition Classes
    AnyRef
    Annotations
    @throws( ... )
  14. val dumpFolderPath: String

    On failure, dump serialized requests to this path.

  15. final def eq(arg0: AnyRef): Boolean

    Definition Classes
    AnyRef
  16. def equals(arg0: Any): Boolean

    Definition Classes
    AnyRef → Any
  17. val esConfig: com.typesafe.config.Config

    Elasticsearch configuration, used to look up the index name and index type.

  18. def finalize(): Unit

    Attributes
    protected[java.lang]
    Definition Classes
    AnyRef
    Annotations
    @throws( classOf[java.lang.Throwable] )
  19. final def getClass(): Class[_]

    Definition Classes
    AnyRef → Any
  20. def hashCode(): Int

    Definition Classes
    AnyRef → Any
  21. val indexName: String

  22. val indexType: String

  23. val internalLogger: Logger

    Definition Classes
    Logging
  24. final def isInstanceOf[T0]: Boolean

    Definition Classes
    Any
  25. object logger

    Definition Classes
    Logging
  26. val nThreads: Int

  27. final def ne(arg0: AnyRef): Boolean

    Definition Classes
    AnyRef
  28. final def notify(): Unit

    Definition Classes
    AnyRef
  29. final def notifyAll(): Unit

    Definition Classes
    AnyRef
  30. val splitRegex: UnanchoredRegex

    Regex used to split sentences in the waterloo corpus.

  31. final def synchronized[T0](arg0: ⇒ T0): T0

    Definition Classes
    AnyRef
  32. def toString(): String

    Definition Classes
    AnyRef → Any
  33. final def wait(): Unit

    Definition Classes
    AnyRef
    Annotations
    @throws( ... )
  34. final def wait(arg0: Long, arg1: Int): Unit

    Definition Classes
    AnyRef
    Annotations
    @throws( ... )
  35. final def wait(arg0: Long): Unit

    Definition Classes
    AnyRef
    Annotations
    @throws( ... )
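
The work-sharing scheme described for addTreeToIndex (several Futures that each lock, take the next file from a shared stream, and process it) can be sketched with the standard library alone. This is a minimal illustration under stated assumptions, not the library's implementation: `SharedIteratorSketch`, `processAll`, `lock`, and `handle` are hypothetical names, and printing a file name stands in for the actual indexing call.

```scala
import java.util.concurrent.Executors
import scala.concurrent.duration.Duration
import scala.concurrent.{Await, ExecutionContext, Future}

object SharedIteratorSketch {

  /** Drains `items` with `nWorkers` Futures. Each worker repeatedly takes the
    * next element while holding a lock, then processes it outside the lock.
    */
  def processAll[A](items: Iterator[A], nWorkers: Int)(handle: A => Unit)(
      implicit ec: ExecutionContext): Seq[Future[Unit]] = {
    // Hypothetical stand-in for the object the real code synchronizes on.
    val lock = new Object

    // hasNext and next() share one critical section because
    // iterators are not thread-safe.
    def nextItem(): Option[A] = lock.synchronized {
      if (items.hasNext) Some(items.next()) else None
    }

    // One worker: keep pulling items until the shared iterator is empty.
    def drain(): Unit = {
      var item = nextItem()
      while (item.isDefined) {
        handle(item.get)
        item = nextItem()
      }
    }

    // Seq.fill evaluates its argument once per element, so this starts nWorkers Futures.
    Seq.fill(nWorkers)(Future(drain()))
  }

  def main(args: Array[String]): Unit = {
    val nThreads = 2 // assumed value, for illustration only
    val pool = Executors.newFixedThreadPool(nThreads)
    implicit val ec: ExecutionContext = ExecutionContext.fromExecutor(pool)

    val files = Iterator("a.txt", "b.txt", "c.txt")
    val futures = processAll(files, nThreads * 4)(name => println(s"indexing $name"))

    // Block until every worker has finished, then release the threads.
    Await.result(Future.sequence(futures), Duration.Inf)
    pool.shutdown()
  }
}
```

Starting more Futures than threads means a thread that finishes early can pick up another worker; workers that find the stream already empty return immediately.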

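
The waterloo convention mentioned for addWaterlooFileToIndex (sentences wrapped in <SENT> ... </SENT> tags) can be illustrated with a small extraction routine. This is a sketch only: `sentencePattern` is an assumed pattern and is not the library's actual splitRegex, and `WaterlooSentenceSketch`, `extractSentences`, and `withIndex` are hypothetical names.

```scala
import scala.util.matching.Regex

object WaterlooSentenceSketch {

  // Assumed pattern (not the library's splitRegex): lazily capture the text
  // between <SENT> and </SENT>; (?s) lets '.' match across line breaks.
  val sentencePattern: Regex = """(?s)<SENT>(.*?)</SENT>""".r

  /** Returns the trimmed, non-empty sentences found in `text`, in order. */
  def extractSentences(text: String): List[String] =
    sentencePattern
      .findAllMatchIn(text)
      .map(_.group(1).trim)
      .filter(_.nonEmpty)
      .toList

  /** Pairs each sentence with its position, mirroring the sentenceIndex
    * argument that addSentenceToIndex expects.
    */
  def withIndex(text: String): List[(String, Int)] =
    extractSentences(text).zipWithIndex

  def main(args: Array[String]): Unit = {
    val sample = "<SENT>Plants need light.</SENT>\n<SENT> Water boils at 100 C. </SENT>"
    withIndex(sample).foreach { case (sentence, i) => println(s"$i: $sentence") }
  }
}
```

Each (sentence, index) pair corresponds to one call of addSentenceToIndex, with the file name as a plausible choice for the source argument.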