org.apache.lucene.index (The Adobe Experience Manager SDK 2022.7.8005.20220711T194049Z-220600)

Code to maintain and access indices.

Postings APIs
- Fields
- Terms
- Documents
- Positions
Index Statistics

Postings APIs

Fields

Fields is the initial entry point into the postings APIs. It can be obtained in several ways:

// access indexed fields for an index segment
Fields fields = reader.fields();
// alternatively, access term vector fields for a specified document
Fields vectorFields = reader.getTermVectors(docid);

Fields implements Java's Iterable interface, so it's easy to enumerate the list of fields:

// enumerate list of fields
for (String field : fields) {
  // access the terms for this field
  Terms terms = fields.terms(field);
}

Terms

Terms represents the collection of terms within a field, exposes some metadata and statistics, and provides an API for enumeration.

// metadata about the field
System.out.println("positions? " + terms.hasPositions());
System.out.println("offsets? " + terms.hasOffsets());
System.out.println("payloads? " + terms.hasPayloads());
// iterate through terms
TermsEnum termsEnum = terms.iterator(null);
BytesRef term = null;
while ((term = termsEnum.next()) != null) {
  doSomethingWith(termsEnum.term());
}

TermsEnum provides an iterator over the list of terms within a field, some statistics about the term, and methods to access the term's documents and positions.

// seek to a specific term
boolean found = termsEnum.seekExact(new BytesRef("foobar"));
if (found) {
  // get the document frequency
  System.out.println(termsEnum.docFreq());
  // enumerate through documents
  DocsEnum docs = termsEnum.docs(null, null);
  // enumerate through documents and positions
  DocsAndPositionsEnum docsAndPositions = termsEnum.docsAndPositions(null, null);
}

Documents

DocsEnum is an extension of DocIdSetIterator that iterates over the list of documents for a term, along with the term frequency within each document.

int docid;
while ((docid = docsEnum.nextDoc()) != DocIdSetIterator.NO_MORE_DOCS) {
  System.out.println(docid);
  System.out.println(docsEnum.freq());
}
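
The iteration contract used above (call nextDoc() until it returns the NO_MORE_DOCS sentinel, and read freq() only while positioned on a document) can be sketched in plain Java. This is a toy model and not Lucene code: the class ToyDocsEnum and its array-backed storage are assumptions made for illustration; only the sentinel value Integer.MAX_VALUE mirrors DocIdSetIterator.NO_MORE_DOCS.

```java
/** Toy stand-in for a DocsEnum over one term; hypothetical, not part of Lucene. */
final class ToyDocsEnum {
    /** Sentinel returned once the postings list is exhausted. */
    static final int NO_MORE_DOCS = Integer.MAX_VALUE;

    private final int[] docids; // ascending document ids containing the term
    private final int[] freqs;  // freqs[i] is the term frequency in docids[i]
    private int index = -1;     // -1 means "not positioned yet"

    ToyDocsEnum(int[] docids, int[] freqs) {
        if (docids.length != freqs.length) {
            throw new IllegalArgumentException("docids and freqs must have equal length");
        }
        this.docids = docids;
        this.freqs = freqs;
    }

    /** Advances to the next document, or returns NO_MORE_DOCS when exhausted. */
    int nextDoc() {
        if (index < docids.length) {
            index++;
        }
        return index < docids.length ? docids[index] : NO_MORE_DOCS;
    }

    /** Term frequency in the current document; only valid while positioned on one. */
    int freq() {
        return freqs[index];
    }

    public static void main(String[] args) {
        ToyDocsEnum docsEnum = new ToyDocsEnum(new int[] {2, 5, 9}, new int[] {1, 3, 2});
        int docid;
        while ((docid = docsEnum.nextDoc()) != NO_MORE_DOCS) {
            System.out.println(docid + " freq=" + docsEnum.freq());
        }
    }
}
```

Using a sentinel rather than a separate hasNext() call keeps the loop to a single method call per document, which is the same shape as the loop shown above.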

Positions

DocsAndPositionsEnum is an extension of DocsEnum that additionally allows iteration over the positions at which a term occurred within the document, and over any additional per-position information (offsets and payload).

int docid;
while ((docid = docsAndPositionsEnum.nextDoc()) != DocIdSetIterator.NO_MORE_DOCS) {
  System.out.println(docid);
  int freq = docsAndPositionsEnum.freq();
  for (int i = 0; i < freq; i++) {
     System.out.println(docsAndPositionsEnum.nextPosition());
     System.out.println(docsAndPositionsEnum.startOffset());
     System.out.println(docsAndPositionsEnum.endOffset());
     System.out.println(docsAndPositionsEnum.getPayload());
  }
}

Index Statistics

Term statistics

TermsEnum.docFreq(): Returns the number of documents that contain at least one occurrence of the term. This statistic is always available for an indexed term. Note that it will also count deleted documents; when segments are merged, the statistic is updated as those deleted documents are merged away.
TermsEnum.totalTermFreq(): Returns the number of occurrences of this term across all documents. Note that this statistic is unavailable (returns -1) if term frequencies were omitted from the index (DOCS_ONLY) for the field. Like docFreq(), it will also count occurrences that appear in deleted documents.
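
The effect of deletions on these two statistics can be illustrated with a toy single-field segment in plain Java. This is a sketch under stated assumptions, not Lucene code: ToySegment, its delete() and merge() methods, and the map-based postings are hypothetical names chosen for the example. Deleting a document only marks it, so docFreq() and totalTermFreq() keep counting it until a merge rewrites the postings.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.BitSet;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Toy single-field segment; a plain-Java model, not Lucene code. */
final class ToySegment {
    private final List<List<String>> docs;       // docid -> tokens of the field
    private final BitSet deleted = new BitSet(); // docids marked as deleted
    // term -> (docid -> term frequency within that document)
    private final Map<String, Map<Integer, Integer>> postings = new TreeMap<>();

    ToySegment(List<List<String>> docs) {
        this.docs = docs;
        for (int docid = 0; docid < docs.size(); docid++) {
            for (String token : docs.get(docid)) {
                postings.computeIfAbsent(token, t -> new TreeMap<>())
                        .merge(docid, 1, Integer::sum);
            }
        }
    }

    /** Marks a document as deleted; the postings are left untouched. */
    void delete(int docid) {
        deleted.set(docid);
    }

    /** Number of documents containing the term, deleted ones included. */
    int docFreq(String term) {
        Map<Integer, Integer> termDocs = postings.get(term);
        return termDocs == null ? 0 : termDocs.size();
    }

    /** Number of occurrences of the term, occurrences in deleted documents included. */
    long totalTermFreq(String term) {
        Map<Integer, Integer> termDocs = postings.get(term);
        long total = 0;
        if (termDocs != null) {
            for (int freq : termDocs.values()) {
                total += freq;
            }
        }
        return total;
    }

    /** Models a merge: rebuilds the segment from live documents only. */
    ToySegment merge() {
        List<List<String>> live = new ArrayList<>();
        for (int docid = 0; docid < docs.size(); docid++) {
            if (!deleted.get(docid)) {
                live.add(docs.get(docid));
            }
        }
        return new ToySegment(live);
    }

    public static void main(String[] args) {
        ToySegment segment = new ToySegment(Arrays.asList(
                Arrays.asList("lucene", "index", "lucene"),
                Arrays.asList("index", "reader"),
                Arrays.asList("lucene")));
        segment.delete(0);
        System.out.println(segment.docFreq("lucene"));         // prints 2: doc 0 still counted
        System.out.println(segment.merge().docFreq("lucene")); // prints 1: doc 0 merged away
    }
}
```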

Field statistics

Terms.size(): Returns the number of unique terms in the field. This statistic may be unavailable (returns -1) for some Terms implementations such as MultiTerms, where it cannot be efficiently computed. Note that this count also includes terms that appear only in deleted documents: when segments are merged such terms are also merged away and the statistic is then updated.
Terms.getDocCount(): Returns the number of documents that contain at least one occurrence of any term for this field. This can be thought of as a Field-level docFreq(). Like docFreq() it will also count deleted documents.
Terms.getSumDocFreq(): Returns the number of postings (term-document mappings in the inverted index) for the field. This can be thought of as the sum of TermsEnum.docFreq() across all terms in the field, and like docFreq() it will also count postings that appear in deleted documents.
Terms.getSumTotalTermFreq(): Returns the number of tokens for the field. This can be thought of as the sum of TermsEnum.totalTermFreq() across all terms in the field, and like totalTermFreq() it will also count occurrences that appear in deleted documents, and will be unavailable (returns -1) if term frequencies were omitted from the index (DOCS_ONLY) for the field.
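
How the field-level statistics relate to the per-term ones can be shown with a small plain-Java model. This is only an illustration: ToyFieldStats and its add() method are hypothetical, and the nested map stands in for a field's postings. The method names mirror the Terms methods listed above so the correspondence is easy to follow.

```java
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

/** Toy postings for one field; a plain-Java model, not Lucene code. */
final class ToyFieldStats {
    // term -> (docid -> term frequency within that document)
    private final Map<String, Map<Integer, Integer>> postings = new TreeMap<>();

    /** Records that the term occurs freq times in the given document. */
    void add(String term, int docid, int freq) {
        postings.computeIfAbsent(term, t -> new TreeMap<>()).put(docid, freq);
    }

    /** Number of unique terms in the field. */
    long size() {
        return postings.size();
    }

    /** Number of documents with at least one term in the field. */
    int getDocCount() {
        Set<Integer> seen = new HashSet<>();
        for (Map<Integer, Integer> termDocs : postings.values()) {
            seen.addAll(termDocs.keySet());
        }
        return seen.size();
    }

    /** Number of postings: the sum of docFreq over all terms. */
    long getSumDocFreq() {
        long sum = 0;
        for (Map<Integer, Integer> termDocs : postings.values()) {
            sum += termDocs.size();
        }
        return sum;
    }

    /** Number of tokens: the sum of totalTermFreq over all terms. */
    long getSumTotalTermFreq() {
        long sum = 0;
        for (Map<Integer, Integer> termDocs : postings.values()) {
            for (int freq : termDocs.values()) {
                sum += freq;
            }
        }
        return sum;
    }

    public static void main(String[] args) {
        ToyFieldStats field = new ToyFieldStats();
        field.add("index", 0, 1);
        field.add("index", 1, 1);
        field.add("lucene", 0, 2);
        field.add("reader", 1, 1);
        System.out.println(field.size());                // prints 3
        System.out.println(field.getDocCount());         // prints 2
        System.out.println(field.getSumDocFreq());       // prints 4
        System.out.println(field.getSumTotalTermFreq()); // prints 5
    }
}
```

Dividing getSumTotalTermFreq() by getDocCount() gives an average field length (here 5 / 2 = 2.5 tokens), a quantity that length-normalizing scoring models commonly need.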

Segment statistics

IndexReader.maxDoc(): Returns the number of documents (including deleted documents) in the index.
IndexReader.numDocs(): Returns the number of live documents (excluding deleted documents) in the index.
IndexReader.numDeletedDocs(): Returns the number of deleted documents in the index.
Fields.size(): Returns the number of indexed fields.
Fields.getUniqueTermCount(): Returns the number of indexed terms, the sum of Terms.size() across all fields.
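
The three document counts above are tied together by the identity numDeletedDocs() = maxDoc() - numDocs(). A minimal plain-Java sketch of that relationship follows; ToyDocCounts and its live-documents BitSet are assumptions made for illustration and are not Lucene classes.

```java
import java.util.BitSet;

/** Toy model of a reader's document counts; hypothetical, not Lucene code. */
final class ToyDocCounts {
    private final int maxDoc;      // one greater than the largest docid ever assigned
    private final BitSet liveDocs; // bit set means the document is not deleted

    ToyDocCounts(int maxDoc) {
        this.maxDoc = maxDoc;
        this.liveDocs = new BitSet(maxDoc);
        liveDocs.set(0, maxDoc); // all documents start out live
    }

    /** Marks a document as deleted; maxDoc is unaffected. */
    void delete(int docid) {
        liveDocs.clear(docid);
    }

    /** Number of documents, deleted ones included. */
    int maxDoc() {
        return maxDoc;
    }

    /** Number of live documents. */
    int numDocs() {
        return liveDocs.cardinality();
    }

    /** Number of deleted documents. */
    int numDeletedDocs() {
        return maxDoc() - numDocs();
    }

    public static void main(String[] args) {
        ToyDocCounts reader = new ToyDocCounts(5);
        reader.delete(1);
        reader.delete(3);
        System.out.println(reader.maxDoc());         // prints 5
        System.out.println(reader.numDocs());        // prints 3
        System.out.println(reader.numDeletedDocs()); // prints 2
    }
}
```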

Document statistics

Document statistics are available during the indexing process for an indexed field. Typically a Similarity implementation will store some of these values (possibly in a lossy way) in the normalization value for the document, in its Similarity.computeNorm(org.apache.lucene.index.FieldInvertState) method.

FieldInvertState.getLength(): Returns the number of tokens for this field in the document. Note that this is just the number of times that TokenStream.incrementToken() returned true, and is unrelated to the values in PositionIncrementAttribute.
FieldInvertState.getNumOverlap(): Returns the number of tokens for this field in the document that had a position increment of zero. This can be used to compute a document length that discounts artificial tokens such as synonyms.
FieldInvertState.getPosition(): Returns the accumulated position value for this field in the document, computed from the values of PositionIncrementAttribute and including the gaps returned by Analyzer.getPositionIncrementGap(java.lang.String) across multivalued fields.
FieldInvertState.getOffset(): Returns the total character offset value for this field in the document, computed from the values of OffsetAttribute returned by TokenStream.end() and including the gaps returned by Analyzer.getOffsetGap(java.lang.String) across multivalued fields.
FieldInvertState.getUniqueTermCount(): Returns the number of unique terms encountered for this field in the document.
FieldInvertState.getMaxTermFrequency(): Returns the maximum frequency across all unique terms encountered for this field in the document.
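
The way several of these per-document statistics accumulate as tokens arrive can be sketched in plain Java. This is a simplified model, not Lucene's FieldInvertState: ToyInvertState, addToken() and discountedLength() are hypothetical names, and position and offset tracking are left out deliberately because Lucene's exact bookkeeping for them is not reproduced here.

```java
import java.util.HashMap;
import java.util.Map;

/** Toy accumulator for one field of one document; hypothetical, not Lucene code. */
final class ToyInvertState {
    private int length;            // number of tokens seen
    private int numOverlap;        // tokens with a position increment of zero
    private int maxTermFrequency;  // highest frequency of any single term
    private final Map<String, Integer> termFreqs = new HashMap<>();

    /** Records one token together with its position increment. */
    void addToken(String term, int positionIncrement) {
        length++;
        if (positionIncrement == 0) {
            numOverlap++; // e.g. a synonym stacked on the previous token
        }
        int freq = termFreqs.merge(term, 1, Integer::sum);
        maxTermFrequency = Math.max(maxTermFrequency, freq);
    }

    int getLength() { return length; }
    int getNumOverlap() { return numOverlap; }
    int getUniqueTermCount() { return termFreqs.size(); }
    int getMaxTermFrequency() { return maxTermFrequency; }

    /** Document length that discounts zero-increment tokens such as synonyms. */
    int discountedLength() {
        return length - numOverlap;
    }

    public static void main(String[] args) {
        ToyInvertState state = new ToyInvertState();
        state.addToken("fast", 1);
        state.addToken("quick", 0); // synonym at the same position as "fast"
        state.addToken("car", 1);
        state.addToken("fast", 1);
        System.out.println(state.getLength());           // prints 4
        System.out.println(state.getNumOverlap());       // prints 1
        System.out.println(state.getUniqueTermCount());  // prints 3
        System.out.println(state.getMaxTermFrequency()); // prints 2
        System.out.println(state.discountedLength());    // prints 3
    }
}
```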

Additional user-supplied statistics can be added to the document as DocValues fields and accessed via AtomicReader.getNumericDocValues(java.lang.String).

Interface Summary
Interface	Description
IndexableField	Represents a single field for indexing.
IndexableFieldType	Describes the properties of a field.
IndexReader.ReaderClosedListener	A custom listener that's invoked when the IndexReader is closed.
SegmentReader.CoreClosedListener	Called when the shared core for this SegmentReader is closed.
TwoPhaseCommit	An interface for implementations that support 2-phase commit.

Class Summary
Class	Description
AtomicReader	`AtomicReader` is an abstract class, providing an interface for accessing an index.
AtomicReaderContext	`IndexReaderContext` for `AtomicReader` instances.
BaseCompositeReader<R extends IndexReader>	Base class for implementing `CompositeReader`s based on an array of sub-readers.
BinaryDocValues	A per-document byte[]
CheckIndex	Basic tool and API to check the health of an index and write a new segments file that removes reference to problematic segments.
CheckIndex.Status	Returned from `CheckIndex.checkIndex()` detailing the health and status of the index.
CheckIndex.Status.DocValuesStatus	Status from testing DocValues
CheckIndex.Status.FieldNormStatus	Status from testing field norms.
CheckIndex.Status.SegmentInfoStatus	Holds the status of each segment in the index.
CheckIndex.Status.StoredFieldStatus	Status from testing stored fields.
CheckIndex.Status.TermIndexStatus	Status from testing term index.
CheckIndex.Status.TermVectorStatus	Status from testing term vectors.
CompositeReader	Instances of this reader type can only be used to get stored fields from the underlying AtomicReaders, but it is not possible to directly retrieve postings.
CompositeReaderContext	`IndexReaderContext` for `CompositeReader` instance.
CompoundFileExtractor	Command-line tool for extracting sub-files out of a compound file.
ConcurrentMergeScheduler	A `MergeScheduler` that runs each merge using a separate thread.
DirectoryReader	DirectoryReader is an implementation of `CompositeReader` that can read indexes in a `Directory`.
DocsAndPositionsEnum	Also iterates through positions.
DocsEnum	Iterates through the documents and term freqs.
DocTermOrds	This class enables fast access to multiple term ords for a specified field across all docIDs.
FieldInfo	Access to the Field Info file that describes document fields and whether or not they are indexed.
FieldInfos	Collection of `FieldInfo`s (accessible by number or by name).
FieldInvertState	This class tracks the number and position / offset parameters of terms being added to the index.
Fields	Flex API for access to fields and terms
FilterAtomicReader	A `FilterAtomicReader` contains another AtomicReader, which it uses as its basic source of data, possibly transforming the data along the way or providing additional functionality.
FilterAtomicReader.FilterDocsAndPositionsEnum	Base class for filtering `DocsAndPositionsEnum` implementations.
FilterAtomicReader.FilterDocsEnum	Base class for filtering `DocsEnum` implementations.
FilterAtomicReader.FilterFields	Base class for filtering `Fields` implementations.
FilterAtomicReader.FilterTerms	Base class for filtering `Terms` implementations.
FilterAtomicReader.FilterTermsEnum	Base class for filtering `TermsEnum` implementations.
FilterDirectoryReader	A FilterDirectoryReader wraps another DirectoryReader, allowing implementations to transform or extend it.
FilterDirectoryReader.StandardReaderWrapper	A no-op SubReaderWrapper that simply returns the parent DirectoryReader's original subreaders.
FilterDirectoryReader.SubReaderWrapper	Factory class passed to FilterDirectoryReader constructor that allows subclasses to wrap the filtered DirectoryReader's subreaders.
FilteredTermsEnum	Abstract class for enumerating a subset of all terms.
IndexCommit	Expert: represents a single commit into an index as seen by the `IndexDeletionPolicy` or `IndexReader`.
IndexDeletionPolicy	Expert: policy for deletion of stale `index commits`.
IndexFileNames	This class contains useful constants representing filenames and extensions used by Lucene, convenience methods for querying whether a file name matches an extension (`matchesExtension`), and methods for generating file names from a segment name, generation and extension (`fileNameFromGeneration`, `segmentFileName`).
IndexReader	IndexReader is an abstract class, providing an interface for accessing an index.
IndexReaderContext	A struct-like class that represents a hierarchical relationship between `IndexReader` instances.
IndexSplitter	Command-line tool that enables listing segments in an index, copying specific segments to another index, and deleting segments from an index.
IndexUpgrader	This is an easy-to-use tool that upgrades all segments of an index from previous Lucene versions to the current segment file format.
IndexWriter	An `IndexWriter` creates and maintains an index.
IndexWriter.IndexReaderWarmer	If `DirectoryReader.open(IndexWriter,boolean)` has been called (i.e., this writer is in near real-time mode), then after a merge completes, this class can be invoked to warm the reader on the newly merged segment, before the merge commits.
IndexWriterConfig	Holds all the configuration that is used to create an `IndexWriter`.
KeepOnlyLastCommitDeletionPolicy	This `IndexDeletionPolicy` implementation keeps only the most recent commit and immediately removes all prior commits after a new commit is done.
LiveIndexWriterConfig	Holds all the configuration used by `IndexWriter` with few setters for settings that can be changed on an `IndexWriter` instance "live".
LogByteSizeMergePolicy	This is a `LogMergePolicy` that measures size of a segment as the total byte size of the segment's files.
LogDocMergePolicy	This is a `LogMergePolicy` that measures size of a segment as the number of documents (not taking deletions into account).
LogMergePolicy	This class implements a `MergePolicy` that tries to merge segments into levels of exponentially increasing size, where each level has fewer segments than the value of the merge factor.
MergePolicy	Expert: a MergePolicy determines the sequence of primitive merge operations.
MergePolicy.DocMap	A map of doc IDs.
MergePolicy.MergeSpecification	A MergeSpecification instance provides the information necessary to perform multiple merges.
MergePolicy.OneMerge	OneMerge provides the information necessary to perform an individual primitive merge operation, resulting in a single new segment.
MergeScheduler	Expert: `IndexWriter` uses an instance implementing this interface to execute the merges selected by a `MergePolicy`.
MergeState	Holds common state used during segment merging.
MergeState.CheckAbort	Class for recording units of work when merging segments.
MergeState.DocMap	Remaps docids around deletes during merge
MultiDocsAndPositionsEnum	Exposes flex API, merged from flex API of sub-segments.
MultiDocsAndPositionsEnum.EnumWithSlice	Holds a `DocsAndPositionsEnum` along with the corresponding `ReaderSlice`.
MultiDocsEnum	Exposes `DocsEnum`, merged from `DocsEnum` API of sub-segments.
MultiDocsEnum.EnumWithSlice	Holds a `DocsEnum` along with the corresponding `ReaderSlice`.
MultiDocValues	A wrapper for CompositeReader providing access to DocValues.
MultiDocValues.MultiSortedDocValues	Implements SortedDocValues over n subs, using an OrdinalMap.
MultiDocValues.MultiSortedSetDocValues	Implements SortedSetDocValues over n subs, using an OrdinalMap.
MultiDocValues.OrdinalMap	Maps per-segment ordinals to/from the global ordinal space.
MultiFields	Exposes flex API, merged from flex API of sub-segments.
MultiPassIndexSplitter	This tool splits input index into multiple equal parts.
MultiReader	A `CompositeReader` which reads multiple indexes, appending their content.
MultiTerms	Exposes flex API, merged from flex API of sub-segments.
MultiTermsEnum	Exposes `TermsEnum` API, merged from `TermsEnum` API of sub-segments.
NoDeletionPolicy	An `IndexDeletionPolicy` which keeps all index commits around, never deleting them.
NoMergePolicy	A `MergePolicy` which never returns merges to execute (hence its name).
NoMergeScheduler	A `MergeScheduler` which never executes any merges.
NumericDocValues	A per-document numeric value.
OrdTermState	An ordinal based `TermState`
ParallelAtomicReader	An `AtomicReader` which reads multiple, parallel indexes.
ParallelCompositeReader	A `CompositeReader` which reads multiple, parallel indexes.
PersistentSnapshotDeletionPolicy	A `SnapshotDeletionPolicy` which adds a persistence layer so that snapshots can be maintained across the life of an application.
PKIndexSplitter	Split an index based on a `Filter`.
ReaderManager	Utility class to safely share `DirectoryReader` instances across multiple threads, while periodically reopening.
ReaderSlice	Subreader slice from a parent composite reader.
ReaderUtil	Common util methods for dealing with `IndexReader`s and `IndexReaderContext`s.
SegmentCommitInfo	Embeds a [read-only] SegmentInfo and adds per-commit fields.
SegmentInfo	Information about a segment such as its name, directory, and files related to the segment.
SegmentInfos	A collection of segmentInfo objects with methods for operating on those segments in relation to the file system.
SegmentInfos.FindSegmentsFile	Utility class for executing code that needs to do something with the current segments file.
SegmentReader	IndexReader implementation over a single segment.
SegmentReadState	Holder class for common parameters used during read.
SegmentWriteState	Holder class for common parameters used during write.
SerialMergeScheduler	A `MergeScheduler` that simply does each merge sequentially, using the current thread.
SimpleMergedSegmentWarmer	A very simple merged segment warmer that just ensures data structures are initialized.
SingleTermsEnum	Subclass of FilteredTermsEnum for enumerating a single term.
SingletonSortedSetDocValues	Exposes multi-valued view over a single-valued instance.
SlowCompositeReaderWrapper	This class forces a composite reader (e.g. a `MultiReader` or `DirectoryReader`) to emulate an atomic reader.
SnapshotDeletionPolicy	An `IndexDeletionPolicy` that wraps any other `IndexDeletionPolicy` and adds the ability to hold and later release snapshots of an index.
SortedDocValues	A per-document byte[] with presorted values.
SortedSetDocValues	A per-document set of presorted byte[] values.
StoredFieldVisitor	Expert: provides a low-level means of accessing the stored field values in an index.
Term	A Term represents a word from text.
TermContext	Maintains an `IndexReader` `TermState` view over `IndexReader` instances containing a single term.
Terms	Access to the terms in a specific field.
TermsEnum	Iterator to seek (`TermsEnum.seekCeil(BytesRef)`, `TermsEnum.seekExact(BytesRef)`) or step through (`BytesRefIterator.next()`) terms to obtain frequency information (`TermsEnum.docFreq()`), or a `DocsEnum` or `DocsAndPositionsEnum` for the current term (`TermsEnum.docs(org.apache.lucene.util.Bits, org.apache.lucene.index.DocsEnum)`).
TermState	Encapsulates all required internal state to position the associated `TermsEnum` without re-seeking.
TieredMergePolicy	Merges segments of approximately equal size, subject to an allowed number of segments per tier.
TrackingIndexWriter	Class that tracks changes to a delegated IndexWriter, used by `ControlledRealTimeReopenThread` to ensure specific changes are visible.
TwoPhaseCommitTool	A utility for executing 2-phase commit on several objects.
UpgradeIndexMergePolicy	This `MergePolicy` is used for upgrading all existing segments of an index when calling `IndexWriter.forceMerge(int)`.

Enum Summary
Enum	Description
FieldInfo.DocValuesType	DocValues types.
FieldInfo.IndexOptions	Controls how much information is stored in the postings lists.
IndexWriterConfig.OpenMode	Specifies the open mode for `IndexWriter`.
MergePolicy.MergeTrigger	MergeTrigger is passed to `MergePolicy.findMerges(MergeTrigger, SegmentInfos)` to indicate the event that triggered the merge.
StoredFieldVisitor.Status	Enumeration of possible return values for `StoredFieldVisitor.needsField(org.apache.lucene.index.FieldInfo)`.
TermsEnum.SeekStatus	Represents returned result from `TermsEnum.seekCeil(org.apache.lucene.util.BytesRef)`.

Exception Summary
Exception	Description
CorruptIndexException	This exception is thrown when Lucene detects an inconsistency in the index.
IndexFormatTooNewException	This exception is thrown when Lucene detects an index that is newer than this Lucene version.
IndexFormatTooOldException	This exception is thrown when Lucene detects an index that is too old for this Lucene version.
IndexNotFoundException	Signals that no index was found in the Directory.
MergePolicy.MergeAbortedException	Thrown when a merge was explicitly aborted because `IndexWriter.close(boolean)` was called with `false`.
MergePolicy.MergeException	Exception thrown if there are any problems while executing a merge.
TwoPhaseCommitTool.CommitFailException	Thrown by `TwoPhaseCommitTool.execute(TwoPhaseCommit...)` when an object fails to commit().
TwoPhaseCommitTool.PrepareCommitFailException	Thrown by `TwoPhaseCommitTool.execute(TwoPhaseCommit...)` when an object fails to prepareCommit().
