Package org.terrier.indexing

Provides classes and interfaces related to the indexing of documents. The code in this package is built around three main abstractions.

The first is the concept of a Collection of documents. This can be a standard TREC test collection, or a connection to a database from which the documents are extracted.

The second abstraction is the concept of a Document. An implementation of a collection should iterate through the documents in the collection and return them one at a time. The document encapsulates the parser required to extract the information to index. Implementations of documents are provided for TREC documents, PDF documents and standard Microsoft Office formats, such as MS Word, MS PowerPoint and MS Excel.
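
The relationship between a collection and its documents can be illustrated with a minimal, self-contained sketch. The types below (SketchCollection, SketchDocument, StringDocument, InMemoryCollection) are hypothetical stand-ins written for this example; their method names are chosen for illustration and should not be read as the actual signatures of the Collection and Document interfaces in this package. The sketch assumes documents are plain strings held in memory and that parsing amounts to lowercasing and splitting on non-alphanumeric characters.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.Locale;

/** Illustrative sketch only: simplified stand-ins, not the package's real interfaces. */
public class CollectionSketch {

    /** Hypothetical document: hands out its terms one at a time. */
    interface SketchDocument {
        boolean endOfDocument();
        String getNextTerm();
    }

    /** Hypothetical collection: advances through its documents one at a time. */
    interface SketchCollection {
        boolean nextDocument();
        SketchDocument getDocument();
    }

    /** A document that owns its (trivial) parser: lowercase, split on non-alphanumerics. */
    static final class StringDocument implements SketchDocument {
        private final Iterator<String> terms;

        StringDocument(String text) {
            List<String> parsed = new ArrayList<>();
            for (String token : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
                if (!token.isEmpty()) {
                    parsed.add(token);
                }
            }
            this.terms = parsed.iterator();
        }

        public boolean endOfDocument() { return !terms.hasNext(); }
        public String getNextTerm() { return terms.next(); }
    }

    /** A collection backed by an in-memory list of raw texts. */
    static final class InMemoryCollection implements SketchCollection {
        private final Iterator<String> texts;
        private SketchDocument current;

        InMemoryCollection(List<String> texts) { this.texts = texts.iterator(); }

        public boolean nextDocument() {
            if (!texts.hasNext()) {
                return false;
            }
            current = new StringDocument(texts.next());
            return true;
        }

        public SketchDocument getDocument() { return current; }
    }

    /** Walks the collection document by document and gathers each document's terms. */
    static List<List<String>> readAll(SketchCollection collection) {
        List<List<String>> result = new ArrayList<>();
        while (collection.nextDocument()) {
            SketchDocument doc = collection.getDocument();
            List<String> terms = new ArrayList<>();
            while (!doc.endOfDocument()) {
                terms.add(doc.getNextTerm());
            }
            result.add(terms);
        }
        return result;
    }

    public static void main(String[] args) {
        SketchCollection collection =
                new InMemoryCollection(List.of("Hello, world", "PDF and Word"));
        System.out.println(readAll(collection));
    }
}
```

Keeping the parser inside the document means the consuming loop in readAll stays the same regardless of the underlying format; only the document implementation would change.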

The third abstraction is the Indexer, the process that iterates through the documents of a collection and creates the necessary data structures. Several indexer implementations are provided.
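
As a rough illustration of the indexer's role, the sketch below iterates over a list of documents and builds a toy inverted index (term, then document id, then term frequency). This is an assumption-laden simplification: the class IndexerSketch and its buildIndex method are hypothetical, documents are plain strings, and the single in-memory map stands in for whatever data structures a real indexer writes.

```java
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

/** Illustrative sketch only: a toy indexer, not one of the package's implementations. */
public class IndexerSketch {

    /**
     * Builds a map from term to postings, where the postings map each
     * document id (position in the input list) to the term's frequency.
     */
    static Map<String, Map<Integer, Integer>> buildIndex(List<String> documents) {
        Map<String, Map<Integer, Integer>> index = new TreeMap<>();
        for (int docId = 0; docId < documents.size(); docId++) {
            String text = documents.get(docId).toLowerCase(Locale.ROOT);
            for (String token : text.split("[^a-z0-9]+")) {
                if (token.isEmpty()) {
                    continue; // split can yield an empty leading token
                }
                // Create the postings map on first sight of the term,
                // then increment this document's frequency for it.
                index.computeIfAbsent(token, t -> new TreeMap<>())
                     .merge(docId, 1, Integer::sum);
            }
        }
        return index;
    }

    public static void main(String[] args) {
        List<String> docs = List.of("terrier indexes documents", "documents, documents!");
        System.out.println(buildIndex(docs));
    }
}
```

A sorted map is used here only so that the printed index is deterministic; a production indexer would typically write compressed structures to disk instead.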

  • Interface Summary
    Interface          Description
    Collection         This interface encapsulates the most fundamental concept to indexing with Terrier - a Collection.
    Document           This interface encapsulates the concept of a document during indexing.
    DocumentExtractor  Deprecated.
    Tokenizer          The specification of the interface implemented by tokeniser classes.
  • Class Summary
    Class                       Description
    CollectionFactory           Implements a factory for Collection objects.
    FileDocument                Models a document which corresponds to one file.
    FlatJSONDocument            This is a Terrier Document implementation of a document stored in JSON format.
    MSExcelDocument             Deprecated.
    MSPowerPointDocument        Deprecated.
    MSWordDocument              Deprecated.
    PDFDocument                 Implements a Document object for reading PDF documents, using Apache PDFBox.
    POIDocument                 Represents Microsoft Office documents, which are parsed by the Apache POI library.
    SimpleFileCollection        Implements a collection that can read arbitrary files on disk.
    SimpleMedlineXMLCollection  Initial implementation of a class that generates a Collection with Documents from a series of XML files in the Medline format.
    SimpleXMLCollection         Initial implementation of a class that generates a Collection with Documents from a series of XML files.
    TaggedDocument              Models a tagged document (e.g., an HTML or TREC document).
    TwitterJSONCollection       This class represents a collection of tweets stored in JSON format.
    TwitterJSONDocument         This is a Terrier Document implementation of a Tweet stored in JSON format.