public abstract class AtomicIndex extends Object implements Runnable
An inverted index associating terms with documents. Terms can be either token feature values, or annotations. Optionally, a direct index may also be present.
An atomic index manages a head index (the principal data) and a set of tail indexes (batches containing updates). Additionally, the data representing all the new documents that have been queued for indexing since the last tail was written are stored in RAM.
When direct indexing is enabled, the term IDs in the direct index are different from the term IDs in the inverted index. In the inverted index, a term's ID is its position in the lexicographically sorted list of all terms. In the direct index, a term's ID is its position in the list sorted by the time each term was first seen during indexing.
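To make the two numbering schemes concrete, here is a minimal sketch. It is not Mímir code: `TermIdSketch` and its methods are hypothetical names, and plain Java collections stand in for the real index structures.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeSet;

/** Illustrative only; none of these names belong to the Mímir API. */
final class TermIdSketch {

  /** Inverted-index style: ID = position in the lexicographically sorted term list. */
  static Map<String, Long> invertedIds(List<String> termsInArrivalOrder) {
    Map<String, Long> ids = new LinkedHashMap<>();
    long next = 0;
    // TreeSet removes duplicates and iterates in natural (lexicographic) order.
    for (String term : new TreeSet<>(termsInArrivalOrder)) {
      ids.put(term, next++);
    }
    return ids;
  }

  /** Direct-index style: ID = position in the order terms were first seen. */
  static Map<String, Long> directIds(List<String> termsInArrivalOrder) {
    Map<String, Long> ids = new LinkedHashMap<>();
    for (String term : termsInArrivalOrder) {
      // Only the first sighting of a term assigns an ID.
      ids.putIfAbsent(term, (long) ids.size());
    }
    return ids;
  }

  public static void main(String[] args) {
    List<String> seen = List.of("pear", "apple", "pear", "fig");
    System.out.println(invertedIds(seen)); // {apple=0, fig=1, pear=2}
    System.out.println(directIds(seen));   // {pear=0, apple=1, fig=2}
  }
}
```

The same term can therefore carry a different ID in each index, so an ID obtained from one index should not be used to look up a term in the other.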
The head and tail batches can be combined into a new head by a compact operation.
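The sketch below illustrates one way a compact operation could merge a head and its tails. It assumes, purely for illustration, that each batch numbers its documents locally from zero and that batches are concatenated in order; `CompactSketch` and `Batch` are hypothetical names and do not reflect the actual on-disk MG4J format.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Illustrative only; a toy model of combining batches into a new head. */
final class CompactSketch {

  /** Hypothetical batch: a document count plus term -> batch-local document IDs. */
  static final class Batch {
    final long documentCount;
    final Map<String, List<Long>> postings;

    Batch(long documentCount, Map<String, List<Long>> postings) {
      this.documentCount = documentCount;
      this.postings = postings;
    }
  }

  /** Concatenates the batches in order, shifting document IDs by the documents seen so far. */
  static Batch compact(List<Batch> batches) {
    Map<String, List<Long>> merged = new TreeMap<>();
    long offset = 0;
    for (Batch batch : batches) {
      for (Map.Entry<String, List<Long>> e : batch.postings.entrySet()) {
        List<Long> target = merged.computeIfAbsent(e.getKey(), k -> new ArrayList<>());
        for (long localId : e.getValue()) {
          target.add(localId + offset);
        }
      }
      offset += batch.documentCount;
    }
    return new Batch(offset, merged);
  }

  public static void main(String[] args) {
    Batch head = new Batch(3, Map.of("apple", List.of(0L, 2L), "fig", List.of(1L)));
    Batch tail = new Batch(2, Map.of("apple", List.of(1L), "pear", List.of(0L, 1L)));
    Batch newHead = compact(List.of(head, tail));
    System.out.println(newHead.documentCount); // 5
    System.out.println(newHead.postings);      // {apple=[0, 2, 4], fig=[1], pear=[3, 4]}
  }
}
```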
Modifier and Type | Class and Description |
---|---|
protected static class |
AtomicIndex.MG4JIndex
Class representing an MG4J index batch, such as the head or any of the
tails.
|
protected static class |
AtomicIndex.PostingsList
An in-RAM representation of a postings list
|
Modifier and Type | Field and Description |
---|---|
protected it.unimi.dsi.util.Properties |
additionalDirectProperties
A set of properties added to the ones obtained from the direct index writer
when writing out batches.
|
protected it.unimi.dsi.util.Properties |
additionalProperties
A set of properties added to the ones obtained from the index writer when
writing out batches.
|
protected List<AtomicIndex.MG4JIndex> |
batches
A list containing the head and tails of this index.
|
protected RunnableFuture<Long> |
batchWriteTask
If a request was made to write the in-RAM index data to disk this value
will be non-null.
|
protected RunnableFuture<Void> |
compactIndexTask
If a request was made to compact the index (combine all sub-indexes
into a new head) this value will be non-null.
|
protected it.unimi.dsi.lang.MutableString |
currentTerm
A mutable string used to create instances of MutableString on the cheap.
|
static String |
DIRECT_INDEX_NAME_SUFFIX
Files belonging to the direct index get this suffix added to their
basename.
|
static String |
DIRECT_TERMS_FILENAME |
protected it.unimi.di.big.mg4j.index.Index |
directIndex
The direct index for this atomic index.
|
protected it.unimi.dsi.fastutil.objects.Object2LongMap<String> |
directTermIds
This map associates direct index terms with their IDs.
|
protected it.unimi.dsi.fastutil.objects.ObjectBigList<String> |
directTerms
The terms in the direct index, in the order they were first seen during
indexing.
|
static String |
DOCUMENTS_QUEUE_FILE_NAME
The file name (under the current directory for this atomic index) for the
directory containing the documents that have been queued for indexing, but
not yet indexed.
|
protected int |
documentsInRAM
The number of documents currently stored in RAM.
|
protected it.unimi.dsi.fastutil.ints.IntArrayList |
documentSizesInRAM
The sizes (numbers of terms) for all the documents indexed in RAM.
|
protected boolean |
hasDirectIndex
Is direct indexing enabled? Direct indexes are used to find terms
occurring in given documents.
|
static String |
HEAD_FILE_NAME
The file name (under the current directory for this atomic index) which
stores the principal index.
|
static String |
HEAD_NEW_EXT
The file extension used for the temporary directory where the updated head
is being built.
|
static String |
HEAD_OLD_EXT
The file extension used for the temporary directory where the old head
index is being stored while the newly updated one is being installed.
|
protected File |
indexDirectory
The directory where this atomic index stores its files.
|
protected Thread |
indexingThread
The single thread used to index documents.
|
protected BlockingQueue<GATEDocument> |
inputQueue
Documents to be indexed are queued in this queue.
|
protected it.unimi.di.big.mg4j.index.Index |
invertedIndex
The cluster-view of all the MG4J indexes that are part of this index (i.e.
|
protected int |
maxDocSizeInRAM
The size (number of terms) for the longest document indexed but not yet
saved.
|
protected String |
name
The name of this atomic index.
|
protected long |
occurrencesInRAM
The number of occurrences represented in RAM and not yet written to disk.
|
protected BlockingQueue<GATEDocument> |
outputQueue
Documents that have been indexed are passed on to this queue.
|
protected MimirIndex |
parent
The MimirIndex that this atomic index is a member of.
|
static String |
TAIL_FILE_NAME_PREFIX
The prefix used for file names (under the current directory for this
atomic index) for updates to the head index.
|
protected static com.google.common.io.PatternFilenameFilter |
TAILS_FILENAME_FILTER |
protected it.unimi.dsi.fastutil.objects.Object2ReferenceOpenHashMap<it.unimi.dsi.lang.MutableString,AtomicIndex.PostingsList> |
termMap
An in-memory inverted index that gets dumped to files for each batch.
|
protected it.unimi.di.big.mg4j.index.TermProcessor |
termProcessor
The term processor used to process the feature values being indexed.
|
protected int |
tokenPosition
The position of the current (or most-recently used) token in the current
document.
|
Modifier | Constructor and Description |
---|---|
protected |
AtomicIndex(MimirIndex parent,
String name,
boolean hasDirectIndex,
it.unimi.di.big.mg4j.index.TermProcessor termProcessor,
BlockingQueue<GATEDocument> inputQueue,
BlockingQueue<GATEDocument> outputQueue)
Creates a new AtomicIndex
|
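The constructor places the index between an input queue and an output queue, and the class is a `Runnable` whose `run()` method indexes whatever is queued. The following sketch shows that general pattern under stated assumptions: strings stand in for `GATEDocument`, and the `DUMP_BATCH` marker, the word-count "indexing", and the class name `IndexingLoopSketch` are all hypothetical. It is not the actual implementation.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.Future;
import java.util.concurrent.FutureTask;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.RunnableFuture;

/** Illustrative only; a toy single-threaded indexing loop driven by queues. */
final class IndexingLoopSketch implements Runnable {
  // Marker objects compared by identity; stand-ins for special queue values.
  static final String END_OF_QUEUE = new String("END_OF_QUEUE");
  static final String DUMP_BATCH = new String("DUMP_BATCH");

  final BlockingQueue<String> inputQueue = new LinkedBlockingQueue<>();
  final BlockingQueue<String> outputQueue = new LinkedBlockingQueue<>();
  private volatile RunnableFuture<Long> batchWriteTask;
  private long occurrencesInRAM; // touched only by the indexing thread

  /** Queues a dump request; the future completes once the indexing thread has served it. */
  Future<Long> requestSyncToDisk() throws InterruptedException {
    FutureTask<Long> task = new FutureTask<>(() -> {
      long written = occurrencesInRAM; // pretend the batch was written to disk
      occurrencesInRAM = 0;
      return written;
    });
    batchWriteTask = task;
    inputQueue.put(DUMP_BATCH);
    return task;
  }

  @Override
  public void run() {
    try {
      while (true) {
        String doc = inputQueue.take();
        if (doc == END_OF_QUEUE) break;   // sentinel: stop indexing
        if (doc == DUMP_BATCH) {          // request marker: run the pending task
          batchWriteTask.run();
          continue;
        }
        occurrencesInRAM += doc.split("\\s+").length; // toy "indexing"
        outputQueue.put(doc);             // pass the indexed document on
      }
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
    }
  }

  public static void main(String[] args) throws Exception {
    IndexingLoopSketch index = new IndexingLoopSketch();
    Thread indexingThread = new Thread(index);
    indexingThread.start();
    index.inputQueue.put("a small document");
    index.inputQueue.put("another one");
    System.out.println(index.requestSyncToDisk().get()); // 5
    index.inputQueue.put(END_OF_QUEUE);
    indexingThread.join();
  }
}
```

Because the request travels through the same queue as the documents, it is served only after every document queued before it, and all index state is touched by a single thread.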
Modifier and Type | Method and Description |
---|---|
protected abstract void |
calculateStartPositionForAnnotation(gate.Annotation ann,
GATEDocument gateDocument)
Calculate the starting position for the given annotation, storing
it in tokenPosition.
|
protected abstract String[] |
calculateTermStringForAnnotation(gate.Annotation ann,
GATEDocument gateDocument)
Determine the string (or strings, if there are alternatives) that should
be stored in the index for the given annotation.
|
void |
close()
Notifies this index to stop its indexing operations, and waits for all data
to be written.
|
protected static void |
combineDirectIndexes(List<AtomicIndex.MG4JIndex> inputIndexes,
String outputBasename)
Given a set of direct indexes (MG4J indexes, with counts, but no positions,
that form a lexical cluster) this method produces one single output index
containing the data from all the input indexes.
|
protected void |
compactIndex()
Combines all the currently existing batches, generating a new head index.
|
protected void |
documentEnding(GATEDocument gateDocument)
Hook for subclasses, called after annotations for this document
have been processed.
|
protected void |
documentStarting(GATEDocument gateDocument)
Hook for subclasses, called before processing the annotations
for this document.
|
protected abstract void |
flush()
Closes all file-based resources.
|
static void |
generateTermMap(File termsFile,
File termmapFile,
File bloomFilterFile)
Given a terms file (text file with one term per line) this method generates
the corresponding termmap file (binary representation of a StringMap).
|
protected abstract gate.Annotation[] |
getAnnotsToProcess(GATEDocument gateDocument)
Get the annotations that are to be processed for a document,
in increasing order of offset.
|
int |
getBatchCount()
Returns the number of batches in this atomic index.
|
it.unimi.di.big.mg4j.index.Index |
getDirectIndex()
Gets the direct index for this atomic index.
|
CharSequence |
getDirectTerm(long termId)
Gets the term string for a given direct term ID.
|
long |
getDirectTermOccurenceCount(long directTermId)
Gets the occurrence count in the whole index for a given direct term,
specified by a direct term ID (which must have been obtained from the
direct index of this index).
|
it.unimi.dsi.fastutil.objects.ObjectBigList<? extends CharSequence> |
getDirectTerms()
Gets the list of direct terms for this index.
|
it.unimi.di.big.mg4j.index.Index |
getIndex()
Gets the inverted index (an Index value) that can be used to
search this atomic index.
|
File |
getIndexDirectory()
Gets the top level directory for this atomic index.
|
BlockingQueue<GATEDocument> |
getInputQueue()
Gets the input queue used by this atomic index.
|
String |
getName()
Gets the name of this atomic index.
|
BlockingQueue<GATEDocument> |
getOutputQueue()
Gets the output queue used by this atomic index.
|
MimirIndex |
getParent()
Gets the top level MimirIndex to which this atomic index belongs.
|
boolean |
hasDirectIndex()
Is a direct index configured for this atomic index?
|
protected void |
indexCurrentTerm()
Adds the value in currentTerm to the index.
|
protected void |
initIndex()
Opens the index and prepares it for indexing and searching.
|
static String |
longToTerm(long value)
Converts a long value into a String containing a zero-padded hexadecimal
representation of the input value.
|
protected void |
newBatch()
Starts a new MG4J batch.
|
protected static it.unimi.di.big.mg4j.index.Index |
openDirectIndexCluster(List<AtomicIndex.MG4JIndex> batches)
Opens the direct index files from all the batches and combines them into
a LexicalCluster.
|
protected static it.unimi.di.big.mg4j.index.Index |
openInvertedIndexCluster(List<AtomicIndex.MG4JIndex> batches,
it.unimi.di.big.mg4j.index.TermProcessor termProcessor)
Creates a documental cluster from a list of AtomicIndex.MG4JIndex values.
|
protected AtomicIndex.MG4JIndex |
openSubIndex(String subIndexDirname)
Opens one sub-index, specified as a directory inside this Atomic Index's
index directory.
|
protected void |
processAnnotation(gate.Annotation ann,
GATEDocument gateDocument)
Indexes one annotation (either a Token or a semantic annotation).
|
protected void |
processDocument(GATEDocument gateDocument)
Adds the supplied document to the in-RAM index.
|
Future<Void> |
requestCompactIndex()
Requests this atomic index to compact its on-disk batches into a single
batch.
|
Future<Long> |
requestSyncToDisk()
Instructs this index to dump to disk all the in-RAM index data at the first
opportunity.
|
void |
run()
Runnable implementation: indexes the documents queued to the input
queue.
|
protected long |
writeCurrentBatch()
Writes all the data currently stored in RAM to a new index batch.
|
protected void |
writeDirectIndex(File batchDir)
Writes the in-RAM data to a new direct index batch.
|
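As an illustration of the zero-padded hexadecimal conversion described for `longToTerm(long)`, here is a minimal sketch. The 16-digit width and lower-case digits are assumptions (the page does not state them), and `toPaddedHex` is a hypothetical stand-in rather than the library method.

```java
/** Illustrative only; the padding width and letter case are assumptions. */
final class LongToTermSketch {

  /** Formats a long as 16 hexadecimal digits, left-padded with zeros. */
  static String toPaddedHex(long value) {
    return String.format("%016x", value);
  }

  public static void main(String[] args) {
    System.out.println(toPaddedHex(255L)); // 00000000000000ff
    System.out.println(toPaddedHex(-1L));  // ffffffffffffffff

    // Padding makes string order agree with numeric order for non-negative values.
    System.out.println("9".compareTo("10") > 0);                      // true
    System.out.println(toPaddedHex(9L).compareTo(toPaddedHex(10L)) < 0); // true
  }
}
```

A fixed width is useful wherever numbers are stored as terms, because terms are compared as strings and unpadded numbers would sort incorrectly.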
public static final String HEAD_FILE_NAME
public static final String HEAD_NEW_EXT
public static final String HEAD_OLD_EXT
public static final String TAIL_FILE_NAME_PREFIX
public static final String DIRECT_TERMS_FILENAME
public static final String DIRECT_INDEX_NAME_SUFFIX
public static final String DOCUMENTS_QUEUE_FILE_NAME
protected static final com.google.common.io.PatternFilenameFilter TAILS_FILENAME_FILTER
protected String name
protected File indexDirectory
protected it.unimi.di.big.mg4j.index.TermProcessor termProcessor
protected int maxDocSizeInRAM
protected long occurrencesInRAM
protected MimirIndex parent
The MimirIndex that this atomic index is a member of.
protected List<AtomicIndex.MG4JIndex> batches
protected it.unimi.di.big.mg4j.index.Index invertedIndex
protected it.unimi.di.big.mg4j.index.Index directIndex
If hasDirectIndex() is false, then this index will be null.
protected it.unimi.dsi.util.Properties additionalProperties
protected it.unimi.dsi.util.Properties additionalDirectProperties
protected boolean hasDirectIndex
protected it.unimi.dsi.fastutil.objects.Object2LongMap<String> directTermIds
protected it.unimi.dsi.fastutil.objects.ObjectBigList<String> directTerms
protected Thread indexingThread
protected BlockingQueue<GATEDocument> inputQueue
protected BlockingQueue<GATEDocument> outputQueue
protected int tokenPosition
protected it.unimi.dsi.lang.MutableString currentTerm
protected int documentsInRAM
protected it.unimi.dsi.fastutil.objects.Object2ReferenceOpenHashMap<it.unimi.dsi.lang.MutableString,AtomicIndex.PostingsList> termMap
protected it.unimi.dsi.fastutil.ints.IntArrayList documentSizesInRAM
protected RunnableFuture<Void> compactIndexTask
protected RunnableFuture<Long> batchWriteTask
protected AtomicIndex(MimirIndex parent, String name, boolean hasDirectIndex, it.unimi.di.big.mg4j.index.TermProcessor termProcessor, BlockingQueue<GATEDocument> inputQueue, BlockingQueue<GATEDocument> outputQueue) throws IOException, IndexException
parent - the MimirIndex containing this atomic index.
name - the name of the sub-index, e.g. token-i or mentions-j
indexDirectory - the directory where this index should store all its files.
hasDirectIndex - should a direct index be used?
inputQueue - the input queue for documents to be indexed.
outputQueue - the output queue for documents that have been indexed.
IndexException
IOException
public static void generateTermMap(File termsFile, File termmapFile, File bloomFilterFile) throws IOException
A BloomFilter can also be generated, if a suitable target file is provided.
termsFile - the input file
termmapFile - the output termmap file, or null if a termmap is not required.
bloomFilterFile - the file to be used for writing the BloomFilter for the index, or null if a Bloom filter is not required.
IOException
protected static final it.unimi.di.big.mg4j.index.Index openInvertedIndexCluster(List<AtomicIndex.MG4JIndex> batches, it.unimi.di.big.mg4j.index.TermProcessor termProcessor)
Creates a documental cluster from a list of AtomicIndex.MG4JIndex values.
batches - the indexes to be combined into a cluster
termProcessor - the term processor to be used (can be null)
protected static final it.unimi.di.big.mg4j.index.Index openDirectIndexCluster(List<AtomicIndex.MG4JIndex> batches)
Opens the direct index files from all the batches and combines them into a LexicalCluster.
batches - the batches to be opened.
public static final String longToTerm(long value)
value - the value to convert.
protected void initIndex() throws IOException, IndexException
IndexException
IOException
public String getName()
public boolean hasDirectIndex()
protected void newBatch()
protected long writeCurrentBatch() throws IOException, IndexException
IOException
IndexException
protected void writeDirectIndex(File batchDir) throws IOException, IndexException
batchDir -
IOException
IndexException
protected void compactIndex() throws IndexException, IOException, org.apache.commons.configuration.ConfigurationException
IndexException
IOException
org.apache.commons.configuration.ConfigurationException
protected static void combineDirectIndexes(List<AtomicIndex.MG4JIndex> inputIndexes, String outputBasename) throws IOException, org.apache.commons.configuration.ConfigurationException
inputIndexes -
outputBasename -
IOException
org.apache.commons.configuration.ConfigurationException
public Future<Long> requestSyncToDisk() throws InterruptedException
Returns a Future value that, upon completion, will return the number of occurrences written to disk.
InterruptedException - if this thread is interrupted while trying to queue the dump request.
public Future<Void> requestCompactIndex() throws InterruptedException
Returns a Future which can be used to find out when the compaction operation has completed.
InterruptedException - if this thread is interrupted while trying to queue the compaction request.
protected AtomicIndex.MG4JIndex openSubIndex(String subIndexDirname) throws IOException, IndexException
subIndexDirname -
IOException
IndexException
public void run()
Indexing stops when a GATEDocument.END_OF_QUEUE value is sent to the input queue.
protected abstract void flush() throws IOException
IOException
public void close() throws InterruptedException
InterruptedException - if the waiting thread is interrupted before the indexing thread has finished writing all the data.
protected void documentStarting(GATEDocument gateDocument) throws IndexException
IndexException
protected void documentEnding(GATEDocument gateDocument) throws IndexException
IndexException
protected abstract gate.Annotation[] getAnnotsToProcess(GATEDocument gateDocument) throws IndexException
IndexException
protected abstract void calculateStartPositionForAnnotation(gate.Annotation ann, GATEDocument gateDocument) throws IndexException
Calculate the starting position for the given annotation, storing it in tokenPosition. The starting position is the index of the token within the document where the annotation starts, and must be >= the previous value of tokenPosition.
ann -
gateDocument -
IndexException
protected abstract String[] calculateTermStringForAnnotation(gate.Annotation ann, GATEDocument gateDocument) throws IndexException
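Determine the string (or strings, if there are alternatives) that should be stored in the index for the given annotation. Implementations may instead place the term in currentTerm, in which case null should be returned.
If the current term should not be indexed (e.g. it's a stop word), then the implementation should return an empty String array.
ann -
gateDocument -
IndexException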
protected void processDocument(GATEDocument gateDocument) throws IndexException
gateDocument - the document to index
IndexException
protected void processAnnotation(gate.Annotation ann, GATEDocument gateDocument) throws IndexException
ann - the annotation to be indexed
gateDocument - the GATEDocument containing the annotation
IndexException
IOException
protected void indexCurrentTerm()
Adds the value in currentTerm to the index.
IOException
public File getIndexDirectory()
MimirIndex which includes this atomic index.
public MimirIndex getParent()
Gets the top level MimirIndex to which this atomic index belongs.
public BlockingQueue<GATEDocument> getInputQueue()
public BlockingQueue<GATEDocument> getOutputQueue()
Documents passed to the output queue have their occurrence count (see GATEDocument.getOccurrences()) increased by the number of occurrences generated by indexing the document in this atomic index.
public it.unimi.di.big.mg4j.index.Index getIndex()
Gets the inverted index (an Index value) that can be used to search this atomic index. This will normally be a DocumentalCluster view over all the batches contained.
public it.unimi.di.big.mg4j.index.Index getDirectIndex()
Gets the direct index for this atomic index. The returned value is non-null only if the atomic index was configured to have a direct index upon its construction (see AtomicIndex(MimirIndex, String, File, boolean, TermProcessor, BlockingQueue, BlockingQueue)).
You can check if a direct index has been configured by calling hasDirectIndex().
The direct index is queried using document IDs converted to terms (see longToTerm(long)).
The search results are a set of "document IDs", which are actually term IDs. The actual term string corresponding to the returned term IDs can be obtained by calling getDirectTerm(long).
public CharSequence getDirectTerm(long termId)
termId - the ID for the term being sought.
public it.unimi.dsi.fastutil.objects.ObjectBigList<? extends CharSequence> getDirectTerms()
public long getDirectTermOccurenceCount(long directTermId) throws IOException
directTermId -
IOException
public int getBatchCount()
Copyright © 2021 GATE. All rights reserved.