public class MimirIndex extends Object
A Mímir index which can index documents and answer queries. This class is the main entry point to the Mímir API.
A Mímir index is a compound index whose data elements include one or more
sub-indexes, each of type AtomicIndex. Each sub-index indexes either a certain
feature of token annotations (AtomicTokenIndex) or one or more annotation
types (AtomicAnnotationIndex).
A Mímir index continually accepts documents to be indexed (through calls to
indexDocument(Document)) and can answer queries through the QueryEngine
instance returned by getQueryEngine().
Documents submitted for indexing are initially accumulated in RAM, during
which time they are not available for searching. After documents in RAM
are written to disk (a sync-to-disk operation), they become searchable.
In-RAM documents are synced to disk after a certain amount of data has been
accumulated (see setOccurrencesPerBatch(long)) and also at regular time
intervals (see setTimeBetweenBatches(int)). Client code can request a
sync-to-disk operation by calling requestSyncToDisk().
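The two triggers described above (accumulated data and elapsed time) can be
pictured with a small toy model. This is an illustrative sketch only, not
Mímir's implementation: the class SyncPolicySketch and all of its members are
hypothetical, and it assumes the size trigger fires once the in-RAM count
reaches the configured threshold.

```java
// Toy model only: NOT Mímir's implementation. All names are hypothetical.
final class SyncPolicySketch {
    private final long occurrencesPerBatch;   // size trigger
    private final long timeBetweenBatchesMs;  // time trigger, in milliseconds
    private long occurrencesInRam = 0;        // data not yet written to disk
    private long lastSyncMs;                  // time of the last modelled sync

    SyncPolicySketch(long occurrencesPerBatch, long timeBetweenBatchesMs, long nowMs) {
        this.occurrencesPerBatch = occurrencesPerBatch;
        this.timeBetweenBatchesMs = timeBetweenBatchesMs;
        this.lastSyncMs = nowMs;
    }

    /** Records the occurrences produced by a newly indexed document. */
    void addOccurrences(long count) {
        occurrencesInRam += count;
    }

    /** True when either trigger says the in-RAM data should be written out. */
    boolean shouldSync(long nowMs) {
        boolean enoughData = occurrencesInRam >= occurrencesPerBatch;
        boolean enoughTime = nowMs - lastSyncMs >= timeBetweenBatchesMs;
        return enoughData || enoughTime;
    }

    /** Models a completed sync: RAM is emptied and the timer restarts. */
    void markSynced(long nowMs) {
        occurrencesInRam = 0;
        lastSyncMs = nowMs;
    }

    long occurrencesInRam() {
        return occurrencesInRam;
    }
}
```

In this model an explicit request from client code would simply correspond to
calling markSynced without waiting for shouldSync to return true.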
Every sync-to-disk operation causes a new index batch to be created.
All the batches are merged into an IndexCluster, which is then used to
serve queries. If the number of batches gets too large, it can harm
efficiency, or the system can run into problems because too many files are
open. To avoid this, the index batches can be compacted into a single batch.
The index will automatically do that once the number of batches exceeds the
limit set through IndexConfig.setMaximumBatches(int).
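The batch lifecycle described above can be modelled in a few lines. The sketch
below is a toy, not Mímir code: BatchCompactionSketch is a hypothetical name,
batches are represented as plain lists of document IDs, and compaction is
assumed to run synchronously as soon as the batch count exceeds the maximum.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model only: NOT Mímir's implementation. All names are hypothetical.
final class BatchCompactionSketch {
    private final int maximumBatches;
    private final List<List<Long>> batches = new ArrayList<>();

    BatchCompactionSketch(int maximumBatches) {
        this.maximumBatches = maximumBatches;
    }

    /** Models a sync-to-disk operation: each one appends a new batch. */
    void addBatch(List<Long> documentIds) {
        batches.add(new ArrayList<>(documentIds));
        if (batches.size() > maximumBatches) {
            compact();
        }
    }

    /** Merges all batches, in order, into a single batch. */
    void compact() {
        if (batches.size() <= 1) {
            return; // nothing to merge
        }
        List<Long> merged = new ArrayList<>();
        for (List<Long> batch : batches) {
            merged.addAll(batch);
        }
        batches.clear();
        batches.add(merged);
    }

    int batchCount() {
        return batches.size();
    }

    /** All document IDs across all batches, in batch order. */
    List<Long> allDocumentIds() {
        List<Long> all = new ArrayList<>();
        for (List<Long> batch : batches) {
            all.addAll(batch);
        }
        return all;
    }
}
```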
Client code can request a compact operation by calling requestCompactIndex().
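Both requestSyncToDisk() and requestCompactIndex() return lists of futures
(List<Future<Long>> and List<Future<Void>> respectively, according to the
method summary). A caller that needs to block until the requested work has
finished might use a helper along these lines. FutureWaiter and awaitAll are
hypothetical names, not part of the Mímir API; only JDK classes are used.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.Future;

// Hypothetical helper, not part of Mímir.
final class FutureWaiter {
    /**
     * Blocks until every future in the list has completed and returns the
     * results in the same order. Fails fast on the first future that
     * completed exceptionally.
     */
    static <T> List<T> awaitAll(List<Future<T>> futures)
            throws InterruptedException, ExecutionException {
        List<T> results = new ArrayList<>(futures.size());
        for (Future<T> future : futures) {
            results.add(future.get()); // waits for this task to finish
        }
        return results;
    }
}
```

Assuming a variable index of type MimirIndex, usage would look like
FutureWaiter.awaitAll(index.requestSyncToDisk()). This page does not say what
the Long values represent, so the helper simply passes them through.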
To preserve its consistency, a Mímir index must be closed in an orderly
fashion by calling close() before the JVM is shut down.
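One way to reduce the risk of skipping the close() call is a JVM shutdown
hook. The following is a minimal sketch under stated assumptions:
OrderlyShutdownSketch and the ClosableIndex interface are hypothetical
stand-ins (the interface only mirrors the close() signature listed on this
page), and whether a shutdown hook suits a given deployment is left to the
reader.

```java
import java.io.IOException;

// Hypothetical helper, not part of Mímir.
final class OrderlyShutdownSketch {
    /** Stand-in for the one method of the index API used here. */
    interface ClosableIndex {
        void close() throws InterruptedException, IOException;
    }

    /** Builds (but does not register or start) a thread that closes the index. */
    static Thread closingThread(ClosableIndex index) {
        return new Thread(() -> {
            try {
                index.close();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // preserve interrupt status
            } catch (IOException e) {
                System.err.println("Index did not close cleanly: " + e);
            }
        }, "index-closer");
    }

    /** Registers the closing thread to run when the JVM shuts down. */
    static void closeOnJvmShutdown(ClosableIndex index) {
        Runtime.getRuntime().addShutdownHook(closingThread(index));
    }
}
```

Assuming a variable index of type MimirIndex, the hook could be registered
with OrderlyShutdownSketch.closeOnJvmShutdown(index::close). Shutdown hooks do
not run when the JVM is killed forcibly, so this complements rather than
replaces an explicit close() in the normal control flow.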
Modifier and Type | Class and Description |
---|---|
protected class | MimirIndex.IndexMaintenanceRunner - A Runnable used in a background thread to perform various index maintenance tasks: check that the documents are being returned from the sub-indexers in the same order as they were submitted for indexing; update the occurrencesInRam value by adding the occurrences produced by indexing new documents; delete indexed documents from GATE. |
protected class | MimirIndex.IndexMaintenanceRunner2 - A Runnable used in a background thread to perform various index maintenance tasks: update the occurrencesInRam value by subtracting the occurrence counts for all the batches that have recently been written to disk. |
protected class | MimirIndex.SyncToDiskTask - TimerTask used to regularly dump the latest documents to an on-disk batch, allowing them to become searchable. |
Modifier and Type | Field and Description |
---|---|
protected boolean | closed |
static int | DEFAULT_INDEXING_QUEUE_SIZE - The default length for the buffer input / output queues for sub-indexers. |
static int | DEFAULT_OCCURRENCES_PER_BATCH - How many occurrences to index in each batch. |
static String | DELETED_DOCUMENT_IDS_FILE_NAME - The name for the file (stored in the root index directory) containing the serialised version of the deletedDocumentIds. |
protected DocumentCollection | documentCollection - The zipped document collection from MG4J (built during the indexing of the first token feature). |
static String | INDEX_CONFIG_FILENAME - The name of the file in the index directory where the index config is saved. |
protected IndexConfig | indexConfig - The IndexConfig used for this index. |
protected File | indexDirectory - The top level directory containing this index. |
protected int | indexingQueueSize |
protected Thread | maintenanceThread - The thread used to clean up GATE documents after they have been indexed. |
protected Thread | maintenanceThread2 - Background thread used to subtract occurrence counts for batches that have recently been dumped to disk. |
protected AtomicAnnotationIndex[] | mentionIndexes - The annotation indexes, in the order they are listed in the indexConfig. |
protected static Future<Long> | NO_MORE_TASKS - Special value used to indicate that the index is closing and there will be no more sync tasks to process (an END_OF_QUEUE value for syncRequests). |
protected long | occurrencesInRam - The total number of occurrences in all sub-indexes that have not yet been written to disk. |
protected long | occurrencesPerBatch - How many occurrences to be accumulated in RAM before a new tail batch is written to disk. |
protected QueryEngine | queryEngine - The QueryEngine used to run searches on this index. |
protected AtomicIndex[] | subIndexes - The tokenIndexes and mentionIndexes in one single array. |
protected BlockingQueue<Future<Long>> | syncRequests - A list of futures representing sync-to-disk operations currently in progress in all of the sub-indexes. |
protected AtomicTokenIndex[] | tokenIndexes - The token indexes, in the order they are listed in the indexConfig. |
Constructor and Description |
---|
MimirIndex(File indexDirectory) - Opens an existing Mímir index. |
MimirIndex(IndexConfig indexConfig) - Creates a new Mímir index. |
Modifier and Type | Method and Description |
---|---|
void | close() - Stops this index from accepting any further documents for indexing, stops this index from accepting any more queries, finishes indexing all the currently queued documents, writes all the files to disk, after which it returns control to the calling thread. |
void | compactDocumentCollection() - Requests that the DocumentCollection contained by this index is compacted. |
void | deleteDocument(long documentId) - Marks a given document (identified by its ID) as deleted. |
void | deleteDocuments(Collection<? extends Number> documentIds) - Marks the given batch of documents (identified by ID) as deleted. |
AtomicAnnotationIndex | getAnnotationIndex(String annotationType) - Returns the AtomicAnnotationIndex instance responsible for indexing annotations of the type specified. |
DocumentCollection | getDocumentCollection() - Gets the DocumentCollection instance used by this index. |
DocumentData | getDocumentData(long documentID) - Gets the DocumentData for a given document ID, from the on-disk document collection. |
int | getDocumentSize(long documentId) - Gets the size (number of tokens) for a document. |
IndexConfig | getIndexConfig() - Gets the IndexConfig value for this index. |
File | getIndexDirectory() - Gets the top level directory for this index. |
long | getIndexedDocumentsCount() - Gets the total number of documents currently searchable. |
int | getIndexingQueueSize() - Returns the size of the indexing queue. |
long | getOccurrencesInRam() - Gets the current estimated number of occurrences in RAM. |
long | getOccurrencesPerBatch() - Gets the number of occurrences that should be used as a trigger for a sync-to-disk operation, leading to the creation of a new index batch. |
QueryEngine | getQueryEngine() - Returns the QueryEngine instance that can be used to post queries to this index. |
int | getTimeBetweenBatches() - Gets the time interval (in milliseconds) between sync-to-disk operations. |
AtomicTokenIndex | getTokenIndex(String featureName) - Returns the AtomicTokenIndex responsible for indexing a particular feature on token annotations. |
void | indexDocument(gate.Document document) - Queues a new document for indexing. |
boolean | isDeleted(long documentId) - Checks whether a given document (specified by its ID) is marked as deleted. |
protected void | openIndex() - Opens the index files, if any, prepares all the sub-indexers specified in the index config, and gets this index ready to start indexing documents and answering queries. |
protected void | readDeletedDocs() - Reads the list of deleted documents from disk. |
List<Future<Void>> | requestCompactIndex() - Asks each of the sub-indexes in this index to compact all their batches into a single index. |
List<Future<Long>> | requestSyncToDisk() - Asks this index to write to disk all the index data currently stored in RAM so that it can become searchable. |
void | setIndexingQueueSize(int indexingQueueSize) - Sets the size of the indexing queue(s) used by this index. |
void | setOccurrencesPerBatch(long occurrencesPerBatch) - Sets the number of occurrences that should trigger a sync-to-disk operation leading to a new batch being created from the data previously stored in RAM. |
void | setTimeBetweenBatches(int timeBetweenBatches) - Sets the time interval (in milliseconds) between sync-to-disk operations. |
void | undeleteDocument(long documentId) - Marks the given document (identified by ID) as not deleted. |
void | undeleteDocuments(Collection<? extends Number> documentIds) - Marks the given documents (identified by ID) as not deleted. |
protected void | writeDeletedDocsLater() - Writes the set of deleted documents to disk in a background thread, after a short delay. |
void | writeZipDocumentData(DocumentData docData) - Called by the first token indexer when a new document has been indexed to ask the main index to save the necessary zip collection data. |
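
public static final String INDEX_CONFIG_FILENAME
The name of the file in the index directory where the index config is saved.

public static final String DELETED_DOCUMENT_IDS_FILE_NAME
The name for the file (stored in the root index directory) containing the
serialised version of the deletedDocumentIds.

public static final int DEFAULT_OCCURRENCES_PER_BATCH
How many occurrences to index in each batch.

public static final int DEFAULT_INDEXING_QUEUE_SIZE
The default length for the buffer input / output queues for sub-indexers.

protected static final Future<Long> NO_MORE_TASKS
Special value used to indicate that the index is closing and there will be
no more sync tasks to process (an END_OF_QUEUE value for syncRequests).

protected long occurrencesPerBatch
How many occurrences to be accumulated in RAM before a new tail batch is
written to disk.

protected IndexConfig indexConfig
The IndexConfig used for this index.

protected File indexDirectory
The top level directory containing this index.

protected DocumentCollection documentCollection
The zipped document collection from MG4J (built during the indexing of the
first token feature).

protected Thread maintenanceThread
The thread used to clean up GATE documents after they have been indexed.

protected Thread maintenanceThread2
Background thread used to subtract occurrence counts for batches that have
recently been dumped to disk.

protected volatile boolean closed

protected BlockingQueue<Future<Long>> syncRequests
A list of futures representing sync-to-disk operations currently in progress
in all of the sub-indexes.

protected AtomicTokenIndex[] tokenIndexes
The token indexes, in the order they are listed in the indexConfig.

protected AtomicAnnotationIndex[] mentionIndexes
The annotation indexes, in the order they are listed in the indexConfig.

protected AtomicIndex[] subIndexes
The tokenIndexes and mentionIndexes in one single array.

protected int indexingQueueSize

protected volatile long occurrencesInRam
The total number of occurrences in all sub-indexes that have not yet been
written to disk.

protected QueryEngine queryEngine
The QueryEngine used to run searches on this index.

public MimirIndex(IndexConfig indexConfig) throws IOException, IndexException
Creates a new Mímir index.
Parameters:
indexConfig - the configuration for the index.
Throws:
IOException
IndexException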
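
public MimirIndex(File indexDirectory) throws IOException, IndexException
Opens an existing Mímir index.
Parameters:
indexDirectory - the on-disk directory containing the index to be opened.
Throws:
IndexException - if the index cannot be opened.
IllegalArgumentException - if an index cannot be found at the specified location.
IOException - if the index cannot be opened.

protected void openIndex() throws IOException, IndexException
Opens the index files, if any, prepares all the sub-indexers specified in
the index config, and gets this index ready to start indexing documents and
answering queries.
Throws:
IOException
IndexException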

public void indexDocument(gate.Document document) throws InterruptedException
Queues a new document for indexing.
Parameters:
document - the document to be indexed.
Throws:
InterruptedException - if the process of posting the new document to all the input queues is interrupted.
IllegalStateException - if the index has already been closed.

public List<Future<Long>> requestSyncToDisk() throws InterruptedException
Asks this index to write to disk all the index data currently stored in RAM
so that it can become searchable.
Throws:
InterruptedException - if the current thread has been interrupted while trying to queue the sync request.

public List<Future<Void>> requestCompactIndex() throws InterruptedException
Asks each of the sub-indexes in this index to compact all their batches into
a single index.
Throws:
InterruptedException - if the current thread has been interrupted while trying to queue the compaction request.

public void compactDocumentCollection() throws ZipException, IOException, IndexException
Requests that the DocumentCollection contained by this index is compacted.
This method blocks until the compaction has completed. In normal operation,
the index maintains the collection, which includes regular compactions, so
there should be no reason to call this method.
Throws:
ZipException
IOException
IndexException
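
public void writeZipDocumentData(DocumentData docData) throws IndexException
Called by the first token indexer when a new document has been indexed to ask
the main index to save the necessary zip collection data.
Parameters:
docData
Throws:
IndexException

public void close() throws InterruptedException, IOException
Stops this index from accepting any further documents for indexing, stops
this index from accepting any more queries, finishes indexing all the
currently queued documents, writes all the files to disk, after which it
returns control to the calling thread.
Throws:
InterruptedException
IOException

public IndexConfig getIndexConfig()
Gets the IndexConfig value for this index.

public QueryEngine getQueryEngine()
Returns the QueryEngine instance that can be used to post queries to this
index. Each index holds one single query engine, so the same value will
always be returned by repeated calls.

public File getIndexDirectory()
Gets the top level directory for this index.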
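
public long getOccurrencesInRam()
Gets the current estimated number of occurrences in RAM.

public int getIndexingQueueSize()
Returns the size of the indexing queue. See setIndexingQueueSize(int) for
more comments.

public void setIndexingQueueSize(int indexingQueueSize)
Sets the size of the indexing queue(s) used by this index.
Parameters:
indexingQueueSize

public long getOccurrencesPerBatch()
Gets the number of occurrences that should be used as a trigger for a
sync-to-disk operation, leading to the creation of a new index batch.

public void setOccurrencesPerBatch(long occurrencesPerBatch)
Sets the number of occurrences that should trigger a sync-to-disk operation
leading to a new batch being created from the data previously stored in RAM.
Parameters:
occurrencesPerBatch

public int getTimeBetweenBatches()
Gets the time interval (in milliseconds) between sync-to-disk operations.

public void setTimeBetweenBatches(int timeBetweenBatches)
Sets the time interval (in milliseconds) between sync-to-disk operations.

public DocumentCollection getDocumentCollection()
Gets the DocumentCollection instance used by this index. The document
collection is normally fully managed by the index, so there should be no need
to access it directly through this method.

public long getIndexedDocumentsCount()
Gets the total number of documents currently searchable.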
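
public DocumentData getDocumentData(long documentID) throws IndexException, IOException
Gets the DocumentData for a given document ID, from the on-disk document
collection. In-memory caching is performed to reduce the cost of this call.
Parameters:
documentID - the ID of the document to be obtained.
Returns:
the DocumentData associated with the given document ID.
Throws:
IOException
IndexException

public int getDocumentSize(long documentId)
Gets the size (number of tokens) for a document.
Parameters:
documentId - the document being requested.

public void deleteDocument(long documentId)
Marks a given document (identified by its ID) as deleted.
Parameters:
documentId

public void deleteDocuments(Collection<? extends Number> documentIds)
Marks the given batch of documents (identified by ID) as deleted.
Parameters:
documentIds

public boolean isDeleted(long documentId)
Checks whether a given document (specified by its ID) is marked as deleted.
Parameters:
documentId

public void undeleteDocument(long documentId)
Marks the given document (identified by ID) as not deleted.

public void undeleteDocuments(Collection<? extends Number> documentIds)
Marks the given documents (identified by ID) as not deleted.

protected void writeDeletedDocsLater()
Writes the set of deleted documents to disk in a background thread, after a
short delay.

protected void readDeletedDocs() throws IOException
Reads the list of deleted documents from disk.
Throws:
IOException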

public AtomicTokenIndex getTokenIndex(String featureName)
Returns the AtomicTokenIndex responsible for indexing a particular feature on
token annotations.
Parameters:
featureName

public AtomicAnnotationIndex getAnnotationIndex(String annotationType)
Returns the AtomicAnnotationIndex instance responsible for indexing
annotations of the type specified.
Parameters:
annotationType

Copyright © 2021 GATE. All rights reserved.