MimirIndex (mimir-core 6.2 API)

java.lang.Object
- gate.mimir.MimirIndex

```
public class MimirIndex
extends Object
```
A Mímir index which can index documents and answer queries. This class is the main entry point to the Mímir API.
A Mímir index is a compound index comprising the following data elements:
- one or more sub-indexes (implemented by classes that extend AtomicIndex)
- a document collection containing the document textual content and metadata
Each sub-index indexes either a certain feature of token annotations (AtomicTokenIndex) or one or more annotation types (AtomicAnnotationIndex).

A Mímir index continually accepts documents to be indexed (through calls to indexDocument(Document)) and can answer queries through the QueryEngine instance returned by getQueryEngine().

Documents submitted for indexing are initially accumulated in RAM, during which time they cannot be searched. After documents in RAM are written to disk (a sync-to-disk operation), they become searchable. In-RAM documents are synced to disk after a certain amount of data has been accumulated (see setOccurrencesPerBatch(long)) and also at regular time intervals (see setTimeBetweenBatches(int)).
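The accumulate-then-sync behaviour described above can be pictured with a small toy model. This is only a sketch using plain JDK collections, not Mímir code: the class `ToyBatchingIndex`, its document names and the `>=` threshold comparison are assumptions made for illustration, and the time-based trigger is left out.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model, not Mimir code: documents wait in RAM until a sync makes them searchable. */
final class ToyBatchingIndex {
    private final long occurrencesPerBatch;            // threshold that triggers a sync
    private long occurrencesInRam = 0;                 // occurrences not yet synced
    private final List<String> inRam = new ArrayList<>();
    private final List<String> searchable = new ArrayList<>();
    private int batchCount = 0;

    ToyBatchingIndex(long occurrencesPerBatch) {
        this.occurrencesPerBatch = occurrencesPerBatch;
    }

    /** Accepts a document; it only becomes searchable after the next sync. */
    void indexDocument(String name, long occurrences) {
        inRam.add(name);
        occurrencesInRam += occurrences;
        if (occurrencesInRam >= occurrencesPerBatch) {  // assumed comparison
            syncToDisk();
        }
    }

    /** Moves everything held in RAM into a new searchable batch. */
    void syncToDisk() {
        if (inRam.isEmpty()) {
            return;
        }
        searchable.addAll(inRam);
        inRam.clear();
        occurrencesInRam = 0;
        batchCount++;
    }

    boolean isSearchable(String name) { return searchable.contains(name); }
    long getOccurrencesInRam() { return occurrencesInRam; }
    int getBatchCount() { return batchCount; }

    public static void main(String[] args) {
        ToyBatchingIndex index = new ToyBatchingIndex(100);
        index.indexDocument("doc-1", 60);
        System.out.println(index.isSearchable("doc-1")); // false: still in RAM
        index.indexDocument("doc-2", 50);                // 110 >= 100, so a sync runs
        System.out.println(index.isSearchable("doc-1")); // true
    }
}
```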

Client code can request a sync to disk operation by calling requestSyncToDisk().

Every sync-to-disk operation causes a new index batch to be created. All the batches are merged into an IndexCluster which is then used to serve queries. If the number of batches gets too large, it can harm efficiency, or the system can run into problems because too many files are open. To avoid this, the index batches can be compacted into a single batch. The index will automatically do that once the number of batches exceeds IndexConfig.setMaximumBatches(int).

Client code can request a compact operation by calling requestCompactIndex().
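The automatic compaction rule can be sketched in the same spirit. The class below is a toy stand-in built on JDK lists, not the Mímir implementation: `ToyBatchCompactor` and its method names are hypothetical, and merging is reduced to concatenating document names.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Toy model, not Mimir code: batches are merged once their number exceeds a limit. */
final class ToyBatchCompactor {
    private final int maximumBatches;
    private final List<List<String>> batches = new ArrayList<>();

    ToyBatchCompactor(int maximumBatches) {
        this.maximumBatches = maximumBatches;
    }

    /** Adds a batch and compacts automatically when the limit is exceeded. */
    void addBatch(List<String> documents) {
        batches.add(new ArrayList<>(documents));
        if (batches.size() > maximumBatches) {
            compact();
        }
    }

    /** Merges all batches into one; does nothing if there is at most one batch. */
    void compact() {
        if (batches.size() <= 1) {
            return;
        }
        List<String> merged = new ArrayList<>();
        for (List<String> batch : batches) {
            merged.addAll(batch);
        }
        batches.clear();
        batches.add(merged);
    }

    int getBatchCount() { return batches.size(); }

    int getDocumentCount() {
        int total = 0;
        for (List<String> batch : batches) {
            total += batch.size();
        }
        return total;
    }

    public static void main(String[] args) {
        ToyBatchCompactor compactor = new ToyBatchCompactor(2);
        compactor.addBatch(Arrays.asList("doc-1"));
        compactor.addBatch(Arrays.asList("doc-2"));
        compactor.addBatch(Arrays.asList("doc-3")); // third batch exceeds the limit of 2
        System.out.println(compactor.getBatchCount() + " batch, "
            + compactor.getDocumentCount() + " documents"); // 1 batch, 3 documents
    }
}
```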

To remain consistent, a Mímir index must be closed in an orderly manner by calling close() before the JVM is shut down.
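One way to make sure close() runs before the JVM exits is a shutdown hook. The sketch below is an assumption about how client code might arrange this, not something the Mímir API provides: `ClosableIndex` is a hypothetical stand-in that only mirrors the documented `close()` signature, and `closeHook` is an illustrative helper.

```java
import java.io.IOException;

/** Sketch only: builds a thread suitable for Runtime.addShutdownHook. */
final class OrderlyShutdown {
    /** Hypothetical stand-in mirroring the documented close() signature. */
    interface ClosableIndex {
        void close() throws InterruptedException, IOException;
    }

    /** Builds (but does not register) a thread that closes the given index. */
    static Thread closeHook(ClosableIndex index) {
        return new Thread(() -> {
            try {
                index.close();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // preserve the interrupt status
            } catch (IOException e) {
                System.err.println("Index did not close cleanly: " + e.getMessage());
            }
        }, "index-close-hook");
    }

    public static void main(String[] args) {
        ClosableIndex index = () -> System.out.println("index closed");
        // The hook runs when the JVM begins an orderly shutdown.
        Runtime.getRuntime().addShutdownHook(closeHook(index));
    }
}
```

With a real index, the lambda passed to `closeHook` would presumably be `index::close`. Since close() may be lengthy, the hook can delay JVM exit until the remaining data has been written.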

Nested Class Summary

Nested Classes
Modifier and Type	Class and Description
`protected class`	`MimirIndex.IndexMaintenanceRunner` A `Runnable` used in a background thread to perform various index maintenance tasks: check that the documents are being returned from the sub-indexers in the same order as they were submitted for indexing; update the `occurrencesInRam` value by adding the occurrences produced by indexing new documents; delete indexed documents from GATE.
`protected class`	`MimirIndex.IndexMaintenanceRunner2` A `Runnable` used in a background thread to perform various index maintenance tasks: update the `occurrencesInRam` value by subtracting the occurrence counts for all the batches that have recently been written to disk.
`protected class`	`MimirIndex.SyncToDiskTask` `TimerTask` used to regularly dump the latest documents to an on-disk batch, allowing them to become searchable.

Field Summary

Fields
Modifier and Type	Field and Description
`protected boolean`	`closed`
`static int`	`DEFAULT_INDEXING_QUEUE_SIZE` The default length for the buffer input / output queues for sub-indexers.
`static int`	`DEFAULT_OCCURRENCES_PER_BATCH` How many occurrences to index in each batch.
`static String`	`DELETED_DOCUMENT_IDS_FILE_NAME` The name for the file (stored in the root index directory) containing the serialised version of the `deletedDocumentIds`.
`protected DocumentCollection`	`documentCollection` The zipped document collection from MG4J (built during the indexing of the first token feature).
`static String`	`INDEX_CONFIG_FILENAME` The name of the file in the index directory where the index config is saved.
`protected IndexConfig`	`indexConfig` The `IndexConfig` used for this index.
`protected File`	`indexDirectory` The top level directory containing this index.
`protected int`	`indexingQueueSize`
`protected Thread`	`maintenanceThread` The thread used to clean-up GATE documents after they have been indexed.
`protected Thread`	`maintenanceThread2` Background thread used to subtract occurrence counts for batches that have recently been dumped to disk.
`protected AtomicAnnotationIndex[]`	`mentionIndexes` The annotation indexes, in the order they are listed in the `indexConfig`.
`protected static Future<Long>`	`NO_MORE_TASKS` Special value used to indicate that the index is closing and there will be no more sync tasks to process (an END_OF_QUEUE value for `syncRequests`).
`protected long`	`occurrencesInRam` The total number of occurrences in all sub-indexes that have not yet been written to disk.
`protected long`	`occurrencesPerBatch` How many occurrences to be accumulated in RAM before a new tail batch is written to disk.
`protected QueryEngine`	`queryEngine` The `QueryEngine` used to run searches on this index.
`protected AtomicIndex[]`	`subIndexes` The `tokenIndexes` and `mentionIndexes` in one single array.
`protected BlockingQueue<Future<Long>>`	`syncRequests` A list of futures representing sync-to-disk operations currently in-progress in all of the sub-indexes.
`protected AtomicTokenIndex[]`	`tokenIndexes` The token indexes, in the order they are listed in the `indexConfig`.

Constructor Summary

Constructors
Constructor and Description
`MimirIndex(File indexDirectory)` Opens an existing Mímir index.
`MimirIndex(IndexConfig indexConfig)` Creates a new Mímir index.

Method Summary

All Methods Instance Methods Concrete Methods
Modifier and Type	Method and Description
`void`	`close()` Stops this index from accepting any further documents for indexing, stops this index from accepting any more queries, finishes indexing all the currently queued documents, writes all the files to disk, after which it returns control to the calling thread.
`void`	`compactDocumentCollection()` Requests that the `DocumentCollection` contained by this index is compacted.
`void`	`deleteDocument(long documentId)` Marks a given document (identified by its ID) as deleted.
`void`	`deleteDocuments(Collection<? extends Number> documentIds)` Marks the given batch of documents (identified by ID) as deleted.
`AtomicAnnotationIndex`	`getAnnotationIndex(String annotationType)` Returns the `AtomicAnnotationIndex` instance responsible for indexing annotations of the type specified.
`DocumentCollection`	`getDocumentCollection()` Gets the `DocumentCollection` instance used by this index.
`DocumentData`	`getDocumentData(long documentID)` Gets the `DocumentData` for a given document ID, from the on-disk document collection.
`int`	`getDocumentSize(long documentId)` Gets the size (number of tokens) for a document.
`IndexConfig`	`getIndexConfig()` Gets the `IndexConfig` value for this index.
`File`	`getIndexDirectory()` Gets the top level directory for this index.
`long`	`getIndexedDocumentsCount()` Gets the total number of documents currently searchable.
`int`	`getIndexingQueueSize()` Returns the size of the indexing queue.
`long`	`getOccurrencesInRam()` Gets the current estimated number of occurrences in RAM.
`long`	`getOccurrencesPerBatch()` Gets the number of occurrences that should be used as a trigger for a sync to disk operation, leading to the creation of a new index batch.
`QueryEngine`	`getQueryEngine()` Returns the `QueryEngine` instance that can be used to post queries to this index.
`int`	`getTimeBetweenBatches()` Gets the time interval (in milliseconds) between sync-to-disk operations.
`AtomicTokenIndex`	`getTokenIndex(String featureName)` Returns the `AtomicTokenIndex` responsible for indexing a particular feature on token annotations.
`void`	`indexDocument(gate.Document document)` Queues a new document for indexing.
`boolean`	`isDeleted(long documentId)` Checks whether a given document (specified by its ID) is marked as deleted.
`protected void`	`openIndex()` Opens the index files, if any, prepares all the sub-indexers specified in the index config, and gets this index ready to start indexing documents and answer queries.
`protected void`	`readDeletedDocs()` Reads the list of deleted documents from disk.
`List<Future<Void>>`	`requestCompactIndex()` Asks each of the sub-indexes in this index to compact all their batches into a single index.
`List<Future<Long>>`	`requestSyncToDisk()` Asks this index to write to disk all the index data currently stored in RAM so that it can become searchable.
`void`	`setIndexingQueueSize(int indexingQueueSize)` Sets the size of the indexing queue(s) used by this index.
`void`	`setOccurrencesPerBatch(long occurrencesPerBatch)` Sets the number of occurrences that should trigger a sync-to-disk operation leading to a new batch being created from the data previously stored in RAM.
`void`	`setTimeBetweenBatches(int timeBetweenBatches)` Sets the time interval (in milliseconds) between sync-to-disk operations.
`void`	`undeleteDocument(long documentId)` Mark the given document (identified by ID) as not deleted.
`void`	`undeleteDocuments(Collection<? extends Number> documentIds)` Mark the given documents (identified by ID) as not deleted.
`protected void`	`writeDeletedDocsLater()` Writes the set of deleted documents to disk in a background thread, after a short delay.
`void`	`writeZipDocumentData(DocumentData docData)` Called by the first token indexer when a new document has been indexed to ask the main index to save the necessary zip collection data.

Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait

- Field Detail
  - INDEX_CONFIG_FILENAME
```
public static final String INDEX_CONFIG_FILENAME
```
    The name of the file in the index directory where the index config is saved.
    
    See Also:
    
    Constant Field Values
  - DELETED_DOCUMENT_IDS_FILE_NAME
```
public static final String DELETED_DOCUMENT_IDS_FILE_NAME
```
    The name for the file (stored in the root index directory) containing the serialised version of the deletedDocumentIds.
    
    See Also:
    
    Constant Field Values
  - DEFAULT_OCCURRENCES_PER_BATCH
```
public static final int DEFAULT_OCCURRENCES_PER_BATCH
```
    How many occurrences to index in each batch. This metric is more reliable than document counts, as it does not depend on average document size.
    
    See Also:
    
    Constant Field Values
  - DEFAULT_INDEXING_QUEUE_SIZE
```
public static final int DEFAULT_INDEXING_QUEUE_SIZE
```
    The default length for the buffer input / output queues for sub-indexers.
    
    See Also:
    
    Constant Field Values
  - NO_MORE_TASKS
```
protected static final Future<Long> NO_MORE_TASKS
```
    Special value used to indicate that the index is closing and there will be no more sync tasks to process (an END_OF_QUEUE value for syncRequests).
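An end-of-queue marker of this kind is often called a sentinel or "poison pill". The following is a generic sketch of the pattern using only `java.util.concurrent`, not the Mímir code: the class `SentinelQueueSketch`, the `drain` method and the idea of summing the results are assumptions for illustration (this page does not say what the `Long` values represent).

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.Future;
import java.util.concurrent.LinkedBlockingQueue;

/** Generic sentinel-on-a-queue sketch; names are illustrative, not Mimir's. */
final class SentinelQueueSketch {
    /** Marker compared by identity; its value is never read. */
    static final Future<Long> NO_MORE_TASKS = new CompletableFuture<>();

    /** Takes futures until the sentinel arrives and returns the sum of their results. */
    static long drain(BlockingQueue<Future<Long>> requests)
            throws InterruptedException, ExecutionException {
        long total = 0;
        while (true) {
            Future<Long> next = requests.take();   // blocks until an element is available
            if (next == NO_MORE_TASKS) {
                return total;                      // end of queue: stop consuming
            }
            total += next.get();                   // waits for that task to finish
        }
    }

    public static void main(String[] args) throws Exception {
        BlockingQueue<Future<Long>> requests = new LinkedBlockingQueue<>();
        requests.add(CompletableFuture.completedFuture(3L));
        requests.add(CompletableFuture.completedFuture(4L));
        requests.add(NO_MORE_TASKS);
        System.out.println(drain(requests)); // 7
    }
}
```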
  - occurrencesPerBatch
```
protected long occurrencesPerBatch
```
    How many occurrences to be accumulated in RAM before a new tail batch is written to disk.
  - indexConfig
```
protected IndexConfig indexConfig
```
    The IndexConfig used for this index.
  - indexDirectory
```
protected File indexDirectory
```
    The top level directory containing this index.
  - documentCollection
```
protected DocumentCollection documentCollection
```
    The zipped document collection from MG4J (built during the indexing of the first token feature). This can be used to obtain the document text and to display the content of the hits.
  - maintenanceThread
```
protected Thread maintenanceThread
```
    The thread used to clean-up GATE documents after they have been indexed.
  - maintenanceThread2
```
protected Thread maintenanceThread2
```
    Background thread used to subtract occurrence counts for batches that have recently been dumped to disk.
  - closed
```
protected volatile boolean closed
```
  - syncRequests
```
protected BlockingQueue<Future<Long>> syncRequests
```
    A list of futures representing sync-to-disk operations currently in-progress in all of the sub-indexes.
  - tokenIndexes
```
protected AtomicTokenIndex[] tokenIndexes
```
    The token indexes, in the order they are listed in the indexConfig.
  - mentionIndexes
```
protected AtomicAnnotationIndex[] mentionIndexes
```
    The annotation indexes, in the order they are listed in the indexConfig.
  - subIndexes
```
protected AtomicIndex[] subIndexes
```
    The tokenIndexes and mentionIndexes in one single array.
  - indexingQueueSize
```
protected int indexingQueueSize
```
  - occurrencesInRam
```
protected volatile long occurrencesInRam
```
    The total number of occurrences in all sub-indexes that have not yet been written to disk.
  - queryEngine
```
protected QueryEngine queryEngine
```
    The QueryEngine used to run searches on this index.
- Constructor Detail
  - MimirIndex
```
public MimirIndex(IndexConfig indexConfig)
           throws IOException,
                  IndexException
```
    Creates a new Mímir index.
    
    Parameters:
    
    indexConfig - the configuration for the index.
    
    Throws:
    
    IOException
    
    IndexException
  - MimirIndex
```
public MimirIndex(File indexDirectory)
           throws IOException,
                  IndexException
```
    Opens an existing Mímir index.
    
    Parameters:
    
    indexDirectory - the on-disk directory containing the index to be opened.
    
    Throws:
    
    IndexException - if the index cannot be opened
    
    IllegalArgumentException - if an index cannot be found at the specified location.
    
    IOException - if the index cannot be opened.
- Method Detail
  - openIndex
```
protected void openIndex()
                  throws IOException,
                         IndexException
```
    Opens the index files, if any, prepares all the sub-indexers specified in the index config, and gets this index ready to start indexing documents and answer queries.
    
    Throws:
    
    IOException
    
    IndexException
  - indexDocument
```
public void indexDocument(gate.Document document)
                   throws InterruptedException
```
    Queues a new document for indexing. The document will first go into the indexing queue, from where the various sub-indexes take their input. Once processed, the document data is stored in RAM until a sync-to-disk operation occurs. Only after that does the document become searchable.
    
    Parameters:
    
    document - the document to be indexed.
    
    Throws:
    
    InterruptedException - if the process of posting the new document to all the input queues is interrupted.
    
    IllegalStateException - if the index has already been closed.
  - requestSyncToDisk
```
public List<Future<Long>> requestSyncToDisk()
                                     throws InterruptedException
```
    Asks this index to write to disk all the index data currently stored in RAM so that it can become searchable. The work happens in several background threads (one for each sub-index) at the earliest opportunity.
    
    Returns:
    
    a list of futures that can be used to find out when the operation has completed.
    
    Throws:
    
    InterruptedException - if the current thread has been interrupted while trying to queue the sync request.
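Client code that needs to block until the requested sync has finished can simply wait on each returned future. A minimal sketch, assuming only the JDK: `FutureWaiter.awaitAll` is a hypothetical helper, not part of the Mímir API, and the futures in `main` are stand-ins for those returned by a real index.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.Future;

/** Hypothetical helper for waiting on a list of futures. */
final class FutureWaiter {
    /** Blocks until every future has completed; returns how many were awaited. */
    static int awaitAll(List<? extends Future<?>> futures)
            throws InterruptedException, ExecutionException {
        int done = 0;
        for (Future<?> future : futures) {
            future.get();   // rethrows a task failure as ExecutionException
            done++;
        }
        return done;
    }

    public static void main(String[] args) throws Exception {
        // Stand-ins for the futures a real index would return.
        List<Future<Long>> futures = new ArrayList<>();
        futures.add(CompletableFuture.completedFuture(1L));
        futures.add(CompletableFuture.completedFuture(2L));
        System.out.println(awaitAll(futures)); // 2
    }
}
```

With a real index the call would be along the lines of `FutureWaiter.awaitAll(index.requestSyncToDisk())`. Because the parameter is a wildcard list, the same helper also accepts the `List<Future<Void>>` returned by `requestCompactIndex()`.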
  - requestCompactIndex
```
public List<Future<Void>> requestCompactIndex()
                                       throws InterruptedException
```
    Asks each of the sub-indexes in this index to compact all their batches into a single index. This reduces the number of open file handles required. The work happens in several background threads (one for each sub-index) at the earliest opportunity.
    
    Returns:
    
    a list of futures (one for each sub-index) that can be used to find out when the operation has completed.
    
    Throws:
    
    InterruptedException - if the current thread has been interrupted while trying to queue the compaction request.
  - compactDocumentCollection
```
public void compactDocumentCollection()
                               throws ZipException,
                                      IOException,
                                      IndexException
```
    Requests that the DocumentCollection contained by this index is compacted. This method blocks until the compaction has completed. In normal operation, the index maintains the collection, which includes regular compactions, so there should be no reason to call this method.
    
    Throws:
    
    ZipException
    
    IOException
    
    IndexException
  - writeZipDocumentData
```
public void writeZipDocumentData(DocumentData docData)
                          throws IndexException
```
    Called by the first token indexer when a new document has been indexed to ask the main index to save the necessary zip collection data.
    
    Parameters:
    
    docData - the document data to be saved.
    
    Throws:
    
    IndexException
  - close
```
public void close()
           throws InterruptedException,
                  IOException
```
    Stops this index from accepting any further documents for indexing, stops this index from accepting any more queries, finishes indexing all the currently queued documents, writes all the files to disk, after which it returns control to the calling thread. This may be a lengthy operation, depending on the amount of data that still needs to be written to disk.
    
    Throws:
    
    InterruptedException
    
    IOException
  - getIndexConfig
```
public IndexConfig getIndexConfig()
```
    Gets the IndexConfig value for this index.
    
    Returns:
  - getQueryEngine
```
public QueryEngine getQueryEngine()
```
    Returns the QueryEngine instance that can be used to post queries to this index. Each index holds one single query engine, so the same value will always be returned by repeated calls.
    
    Returns:
  - getIndexDirectory
```
public File getIndexDirectory()
```
    Gets the top level directory for this index.
    
    Returns:
  - getOccurrencesInRam
```
public long getOccurrencesInRam()
```
    Gets the current estimated number of occurrences in RAM. An occurrence represents one term (either a token or an annotation) occurring in an indexed document. This value can be used as a good measurement of the total amount of data that is currently being stored in RAM and waiting to be synced to disk.
    
    Returns:
  - getIndexingQueueSize
```
public int getIndexingQueueSize()
```
    Returns the size of the indexing queue. See setIndexingQueueSize(int) for more comments.
    
    Returns:
  - setIndexingQueueSize
```
public void setIndexingQueueSize(int indexingQueueSize)
```
    Sets the size of the indexing queue(s) used by this index. Documents submitted for indexing are held in a queue until the indexers become ready to process them. One queue is used for each of the sub-indexes. A larger queue size can smooth out bursts of activity, but requires more memory (as a larger number of documents may need to be stored at the same time). A smaller value is more economical, but it can lead to slow-downs when certain documents take too long to index and clog up the queue. Defaults to 30.
    
    Parameters:
    
    indexingQueueSize -
  - getOccurrencesPerBatch
```
public long getOccurrencesPerBatch()
```
    Gets the number of occurrences that should be used as a trigger for a sync to disk operation, leading to the creation of a new index batch.
    
    Returns:
  - setOccurrencesPerBatch
```
public void setOccurrencesPerBatch(long occurrencesPerBatch)
```
    Sets the number of occurrences that should trigger a sync-to-disk operation leading to a new batch being created from the data previously stored in RAM. An occurrence represents one term (either a token or an annotation) occurring in an indexed document. This value can be used as a good measurement of the total amount of data that is currently being stored in RAM and waiting to be synced to disk.
    
    Parameters:
    
    occurrencesPerBatch -
  - getTimeBetweenBatches
```
public int getTimeBetweenBatches()
```
    Gets the time interval (in milliseconds) between sync-to-disk operations. This is approximately the maximum amount of time that a document can spend being stored in RAM (and thus not searchable) after having been submitted for indexing. The measurement is not precise because of the time spent by the document in the indexing queue (after being received but before being processed) and the time taken to write a new index batch to disk.
    
    Returns:
  - setTimeBetweenBatches
```
public void setTimeBetweenBatches(int timeBetweenBatches)
```
    Sets the time interval (in milliseconds) between sync-to-disk operations. This is approximately the maximum amount of time that a document can spend being stored in RAM (and thus not searchable) after having been submitted for indexing. The measurement is not precise because of the time spent by the document in the indexing queue (after being received but before being processed) and the time taken to write a new index batch to disk.
  - getDocumentCollection
```
public DocumentCollection getDocumentCollection()
```
    Gets the DocumentCollection instance used by this index. The document collection is normally fully managed by the index, so there should be no need to access it directly through this method.
    
    Returns:
  - getIndexedDocumentsCount
```
public long getIndexedDocumentsCount()
```
    Gets the total number of documents currently searchable.
    
    Returns:
  - getDocumentData
```
public DocumentData getDocumentData(long documentID)
                             throws IndexException,
                                    IOException
```
    Gets the DocumentData for a given document ID, from the on-disk document collection. In-memory caching is performed to reduce the cost of this call.
    
    Parameters:
    
    documentID - the ID of the document to be obtained.
    
    Returns:
    
    the DocumentData associated with the given document ID.
    
    Throws:
    
    IOException
    
    IndexException
  - getDocumentSize
```
public int getDocumentSize(long documentId)
```
    Gets the size (number of tokens) for a document.
    
    Parameters:
    
    documentId - the document being requested.
    
    Returns:
  - deleteDocument
```
public void deleteDocument(long documentId)
```
    Marks a given document (identified by its ID) as deleted. Deleted documents are never returned as search results.
    
    Parameters:
    
    documentId -
  - deleteDocuments
```
public void deleteDocuments(Collection<? extends Number> documentIds)
```
    Marks the given batch of documents (identified by ID) as deleted. Deleted documents are never returned as search results.
    
    Parameters:
    
    documentIds -
  - isDeleted
```
public boolean isDeleted(long documentId)
```
    Checks whether a given document (specified by its ID) is marked as deleted.
    
    Parameters:
    
    documentId -
    
    Returns:
  - undeleteDocument
```
public void undeleteDocument(long documentId)
```
    Mark the given document (identified by ID) as not deleted. Calling this method for a document ID that is not currently marked as deleted has no effect.
  - undeleteDocuments
```
public void undeleteDocuments(Collection<? extends Number> documentIds)
```
    Mark the given documents (identified by ID) as not deleted. Calling this method for a document ID that is not currently marked as deleted has no effect.
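The delete, undelete and isDeleted semantics described above amount to maintaining a set of document IDs. The sketch below illustrates those semantics with a plain `HashSet`. It is not how Mímir stores the deleted IDs, and `DeletedDocsSketch` and `visibleHits` are hypothetical names.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Toy model of the deleted-document bookkeeping; not Mimir code. */
final class DeletedDocsSketch {
    private final Set<Long> deleted = new HashSet<>();

    void deleteDocuments(Collection<? extends Number> ids) {
        for (Number id : ids) {
            deleted.add(id.longValue());
        }
    }

    void undeleteDocuments(Collection<? extends Number> ids) {
        for (Number id : ids) {
            deleted.remove(id.longValue()); // no effect if the ID was not marked
        }
    }

    boolean isDeleted(long id) {
        return deleted.contains(id);
    }

    /** Drops deleted IDs from a list of candidate hits. */
    List<Long> visibleHits(List<Long> hits) {
        List<Long> visible = new ArrayList<>();
        for (Long hit : hits) {
            if (!deleted.contains(hit)) {
                visible.add(hit);
            }
        }
        return visible;
    }

    public static void main(String[] args) {
        DeletedDocsSketch docs = new DeletedDocsSketch();
        docs.deleteDocuments(Arrays.asList(1, 2));
        docs.undeleteDocuments(Arrays.asList(2, 99)); // 99 was never deleted: ignored
        System.out.println(docs.visibleHits(Arrays.asList(1L, 2L, 3L))); // [2, 3]
    }
}
```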
  - writeDeletedDocsLater
```
protected void writeDeletedDocsLater()
```
    Writes the set of deleted documents to disk in a background thread, after a short delay. If a previous request has not started yet, this new request will replace it.
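A delayed write in which a newer request replaces a pending one can be built on a `ScheduledExecutorService`. This is one possible arrangement, shown as a sketch; the source does not say how Mímir implements it. `DelayedWriter`, `writeLater` and `shutdownAndWait` are hypothetical names, and the write itself is represented by an arbitrary `Runnable`.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

/** Sketch of a delayed write where a new request replaces one that has not started. */
final class DelayedWriter {
    private final ScheduledExecutorService executor =
        Executors.newSingleThreadScheduledExecutor();
    private final Runnable writeTask;
    private final long delayMillis;
    private ScheduledFuture<?> pending; // guarded by "this"

    DelayedWriter(Runnable writeTask, long delayMillis) {
        this.writeTask = writeTask;
        this.delayMillis = delayMillis;
    }

    /** Schedules a write after the delay, cancelling a pending one that has not started. */
    synchronized void writeLater() {
        if (pending != null) {
            pending.cancel(false); // no effect if the task is already running or done
        }
        pending = executor.schedule(writeTask, delayMillis, TimeUnit.MILLISECONDS);
    }

    /** Lets a pending write (if any) run, then stops the background thread. */
    void shutdownAndWait() throws InterruptedException {
        executor.shutdown(); // already scheduled delayed tasks still run by default
        executor.awaitTermination(10, TimeUnit.SECONDS);
    }

    public static void main(String[] args) throws InterruptedException {
        AtomicInteger writes = new AtomicInteger();
        DelayedWriter writer = new DelayedWriter(writes::incrementAndGet, 200);
        writer.writeLater();
        writer.writeLater();
        writer.writeLater(); // replaces the two earlier, still pending, requests
        writer.shutdownAndWait();
        // Expected to report a single write when all three calls arrive within the delay.
        System.out.println("writes performed: " + writes.get());
    }
}
```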
  - readDeletedDocs
```
protected void readDeletedDocs()
                        throws IOException
```
    Reads the list of deleted documents from disk.
    
    Throws:
    
    IOException
  - getTokenIndex
```
public AtomicTokenIndex getTokenIndex(String featureName)
```
    Returns the AtomicTokenIndex responsible for indexing a particular feature on token annotations.
    
    Parameters:
    
    featureName -
    
    Returns:
  - getAnnotationIndex
```
public AtomicAnnotationIndex getAnnotationIndex(String annotationType)
```
    Returns the AtomicAnnotationIndex instance responsible for indexing annotations of the type specified.
    
    Parameters:
    
    annotationType -
    
    Returns:
