Package org.apache.lucene.index
Postings APIs
Fields
Fields is the initial entry point into the postings APIs. It can be obtained in several ways:
    // access indexed fields for an index segment
    Fields fields = reader.fields();
    // access term vector fields for a specified document
    Fields fields = reader.getTermVectors(docid);
Fields implements Java's Iterable interface, so it's easy to enumerate the list of fields:
    // enumerate list of fields
    for (String field : fields) {
      // access the terms for this field
      Terms terms = fields.terms(field);
    }
Terms
Terms represents the collection of terms within a field. It exposes some metadata and statistics,
and an API for enumeration.
    // metadata about the field
    System.out.println("positions? " + terms.hasPositions());
    System.out.println("offsets? " + terms.hasOffsets());
    System.out.println("payloads? " + terms.hasPayloads());
    // iterate through terms
    TermsEnum termsEnum = terms.iterator(null);
    BytesRef term = null;
    while ((term = termsEnum.next()) != null) {
      doSomethingWith(termsEnum.term());
    }
TermsEnum
provides an iterator over the list
of terms within a field, some statistics about the term,
and methods to access the term's documents and
positions.
    // seek to a specific term
    boolean found = termsEnum.seekExact(new BytesRef("foobar"));
    if (found) {
      // get the document frequency
      System.out.println(termsEnum.docFreq());
      // enumerate through documents
      DocsEnum docs = termsEnum.docs(null, null);
      // enumerate through documents and positions
      DocsAndPositionsEnum docsAndPositions = termsEnum.docsAndPositions(null, null);
    }
Documents
DocsEnum
is an extension of
DocIdSetIterator
that iterates over the list of
documents for a term, along with the term frequency within that document.
    int docid;
    while ((docid = docsEnum.nextDoc()) != DocIdSetIterator.NO_MORE_DOCS) {
      System.out.println(docid);
      System.out.println(docsEnum.freq());
    }
Positions
DocsAndPositionsEnum
is an extension of
DocsEnum
that additionally allows iteration over the positions at which a term occurred
within the document, and any additional per-position information (offsets and payload).
    int docid;
    while ((docid = docsAndPositionsEnum.nextDoc()) != DocIdSetIterator.NO_MORE_DOCS) {
      System.out.println(docid);
      int freq = docsAndPositionsEnum.freq();
      for (int i = 0; i < freq; i++) {
        System.out.println(docsAndPositionsEnum.nextPosition());
        System.out.println(docsAndPositionsEnum.startOffset());
        System.out.println(docsAndPositionsEnum.endOffset());
        System.out.println(docsAndPositionsEnum.getPayload());
      }
    }
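The iteration contract described above (advance with nextDoc() until the NO_MORE_DOCS sentinel, then call nextPosition() at most freq() times per document) can be sketched without a Lucene dependency. The class ToyPostingsEnum and its dump helper below are hypothetical names invented for this illustration; only the sentinel value Integer.MAX_VALUE matches DocIdSetIterator.NO_MORE_DOCS.

```java
// Toy model (not Lucene code): postings for one term held in plain arrays.
class ToyPostingsEnum {
    // Same sentinel value that DocIdSetIterator.NO_MORE_DOCS uses.
    static final int NO_MORE_DOCS = Integer.MAX_VALUE;

    private final int[] docIds;       // ascending document ids
    private final int[][] positions;  // positions[i] belongs to docIds[i]
    private int docIndex = -1;
    private int posIndex = 0;

    ToyPostingsEnum(int[] docIds, int[][] positions) {
        this.docIds = docIds;
        this.positions = positions;
    }

    // Advances to the next document and resets the position cursor.
    int nextDoc() {
        docIndex++;
        posIndex = 0;
        return docIndex < docIds.length ? docIds[docIndex] : NO_MORE_DOCS;
    }

    // Term frequency within the current document.
    int freq() {
        return positions[docIndex].length;
    }

    // Must be called at most freq() times per document.
    int nextPosition() {
        return positions[docIndex][posIndex++];
    }

    // Walks all documents and positions using the same loop shape as the snippets above.
    static String dump(ToyPostingsEnum postings) {
        StringBuilder sb = new StringBuilder();
        int doc;
        while ((doc = postings.nextDoc()) != NO_MORE_DOCS) {
            int freq = postings.freq();
            sb.append("doc=").append(doc).append(" freq=").append(freq).append(" pos=");
            for (int i = 0; i < freq; i++) {
                if (i > 0) {
                    sb.append(',');
                }
                sb.append(postings.nextPosition());
            }
            sb.append('\n');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        ToyPostingsEnum postings =
            new ToyPostingsEnum(new int[] {2, 5}, new int[][] {{0, 4}, {7}});
        System.out.print(dump(postings));
    }
}
```

The sketch leaves out offsets and payloads; in the real API those are read after each nextPosition() call, as the snippet above shows.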
Index Statistics
Term statistics
TermsEnum.docFreq(): Returns the number of documents that contain at least one occurrence of the term. This statistic is always available for an indexed term. Note that it also counts deleted documents; when segments are merged, the statistic is updated as those deleted documents are merged away.
TermsEnum.totalTermFreq(): Returns the number of occurrences of this term across all documents. Note that this statistic is unavailable (returns -1) if term frequencies were omitted from the index (DOCS_ONLY) for the field. Like docFreq(), it also counts occurrences that appear in deleted documents.
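To make the two term statistics concrete, here is a minimal sketch over a toy in-memory postings list. It does not call Lucene; the class TermStatsSketch and its methods are hypothetical names, and the -1 return for omitted frequencies simply mirrors the convention described above.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Toy model (not Lucene code): one term's postings as docId -> term frequency.
class TermStatsSketch {
    // Entries for deleted documents stay here until a merge rewrites the postings.
    private final Map<Integer, Integer> postings = new LinkedHashMap<>();
    private final boolean freqsIndexed;

    TermStatsSketch(boolean freqsIndexed) {
        this.freqsIndexed = freqsIndexed;
    }

    void add(int docId, int freq) {
        postings.put(docId, freq);
    }

    // Number of documents containing the term, whether deleted or not.
    int docFreq() {
        return postings.size();
    }

    // Total occurrences across all documents, or -1 when frequencies were omitted.
    long totalTermFreq() {
        if (!freqsIndexed) {
            return -1;
        }
        long sum = 0;
        for (int freq : postings.values()) {
            sum += freq;
        }
        return sum;
    }

    // Simulates a merge that drops a deleted document from the postings.
    void mergeAway(int deletedDocId) {
        postings.remove(deletedDocId);
    }

    public static void main(String[] args) {
        TermStatsSketch term = new TermStatsSketch(true);
        term.add(0, 2);
        term.add(3, 1);
        term.add(7, 4); // assume doc 7 was deleted but not yet merged away
        System.out.println(term.docFreq() + " " + term.totalTermFreq());
        term.mergeAway(7);
        System.out.println(term.docFreq() + " " + term.totalTermFreq());
    }
}
```

Both statistics shrink only after mergeAway, which is the behavior the notes about deleted documents describe.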
Field statistics
Terms.size(): Returns the number of unique terms in the field. This statistic may be unavailable (returns -1) for some Terms implementations such as MultiTerms, where it cannot be efficiently computed. Note that this count also includes terms that appear only in deleted documents: when segments are merged such terms are also merged away and the statistic is then updated.
Terms.getDocCount(): Returns the number of documents that contain at least one occurrence of any term for this field. This can be thought of as a Field-level docFreq(). Like docFreq() it will also count deleted documents.
Terms.getSumDocFreq(): Returns the number of postings (term-document mappings in the inverted index) for the field. This can be thought of as the sum of TermsEnum.docFreq() across all terms in the field, and like docFreq() it will also count postings that appear in deleted documents.
Terms.getSumTotalTermFreq(): Returns the number of tokens for the field. This can be thought of as the sum of TermsEnum.totalTermFreq() across all terms in the field, and like totalTermFreq() it will also count occurrences that appear in deleted documents, and will be unavailable (returns -1) if term frequencies were omitted from the index (DOCS_ONLY) for the field.
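The relationship between the field-level and term-level statistics can be illustrated with a small stand-alone sketch. It models a field's postings as nested maps rather than using Lucene; FieldStatsSketch and its method names are hypothetical and chosen only to echo the methods listed above.

```java
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

// Toy model (not Lucene code): a field's postings as term -> (docId -> frequency).
class FieldStatsSketch {
    private final Map<String, Map<Integer, Integer>> postings = new TreeMap<>();

    void add(String term, int docId, int freq) {
        postings.computeIfAbsent(term, t -> new TreeMap<>()).put(docId, freq);
    }

    // Number of unique terms in the field.
    long size() {
        return postings.size();
    }

    // Number of documents that contain at least one term in the field.
    int docCount() {
        Set<Integer> docs = new HashSet<>();
        for (Map<Integer, Integer> perDoc : postings.values()) {
            docs.addAll(perDoc.keySet());
        }
        return docs.size();
    }

    // Sum of the per-term document frequencies, i.e. the number of postings.
    long sumDocFreq() {
        long sum = 0;
        for (Map<Integer, Integer> perDoc : postings.values()) {
            sum += perDoc.size();
        }
        return sum;
    }

    // Sum of the per-term total frequencies, i.e. the number of tokens.
    long sumTotalTermFreq() {
        long sum = 0;
        for (Map<Integer, Integer> perDoc : postings.values()) {
            for (int freq : perDoc.values()) {
                sum += freq;
            }
        }
        return sum;
    }

    public static void main(String[] args) {
        FieldStatsSketch field = new FieldStatsSketch();
        field.add("apache", 0, 1);
        field.add("lucene", 0, 2);
        field.add("lucene", 1, 1);
        field.add("index", 2, 3);
        System.out.println(field.size() + " " + field.docCount() + " "
            + field.sumDocFreq() + " " + field.sumTotalTermFreq());
    }
}
```

Note that docCount() can be smaller than sumDocFreq(), because a document containing several distinct terms contributes one posting per term but is counted once.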
Segment statistics
IndexReader.maxDoc(): Returns the number of documents (including deleted documents) in the index.
IndexReader.numDocs(): Returns the number of live documents (excluding deleted documents) in the index.
IndexReader.numDeletedDocs(): Returns the number of deleted documents in the index.
Fields.size(): Returns the number of indexed fields.
Fields.getUniqueTermCount(): Returns the number of indexed terms, the sum of Terms.size() across all fields.
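The three document counts above are tied together by a simple identity: live plus deleted documents equals maxDoc. A minimal sketch, assuming a bit set of live documents as the only bookkeeping (SegmentStatsSketch is a hypothetical class, not part of Lucene):

```java
import java.util.BitSet;

// Toy model (not Lucene code): a segment that tracks live documents in a bit set.
class SegmentStatsSketch {
    private final int maxDoc;
    private final BitSet liveDocs;

    SegmentStatsSketch(int maxDoc) {
        this.maxDoc = maxDoc;
        this.liveDocs = new BitSet(maxDoc);
        liveDocs.set(0, maxDoc); // every document starts out live
    }

    // Marks a document as deleted; it still occupies a doc id until merged away.
    void delete(int docId) {
        liveDocs.clear(docId);
    }

    // All documents, including deleted ones.
    int maxDoc() {
        return maxDoc;
    }

    // Live documents only.
    int numDocs() {
        return liveDocs.cardinality();
    }

    // Deleted documents, derived from the other two counts.
    int numDeletedDocs() {
        return maxDoc() - numDocs();
    }

    public static void main(String[] args) {
        SegmentStatsSketch segment = new SegmentStatsSketch(10);
        segment.delete(3);
        segment.delete(8);
        System.out.println(segment.maxDoc() + " " + segment.numDocs()
            + " " + segment.numDeletedDocs());
    }
}
```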
Document statistics
Document statistics are available during the indexing process for an indexed field. Typically,
a Similarity implementation will store some of these values (possibly in a lossy way) in the
normalization value for the document, in its
Similarity.computeNorm(org.apache.lucene.index.FieldInvertState) method.
FieldInvertState.getLength(): Returns the number of tokens for this field in the document. Note that this is just the number of times that TokenStream.incrementToken() returned true, and is unrelated to the values in PositionIncrementAttribute.
FieldInvertState.getNumOverlap(): Returns the number of tokens for this field in the document that had a position increment of zero. This can be used to compute a document length that discounts artificial tokens such as synonyms.
FieldInvertState.getPosition(): Returns the accumulated position value for this field in the document: computed from the values of PositionIncrementAttribute and including Analyzer.getPositionIncrementGap(java.lang.String)s across multivalued fields.
FieldInvertState.getOffset(): Returns the total character offset value for this field in the document: computed from the values of OffsetAttribute returned by TokenStream.end(), and including Analyzer.getOffsetGap(java.lang.String)s across multivalued fields.
FieldInvertState.getUniqueTermCount(): Returns the number of unique terms encountered for this field in the document.
FieldInvertState.getMaxTermFrequency(): Returns the maximum frequency across all unique terms encountered for this field in the document.
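The per-document statistics listed above can be sketched by walking a token stream once. The example below is a simplified stand-in that uses parallel arrays of terms and position increments instead of a TokenStream; DocStatsSketch is a hypothetical class, offsets and multivalued-field gaps are left out, and the accumulated position is modeled as the plain sum of increments, which may differ from the exact bookkeeping the indexer performs.

```java
import java.util.HashMap;
import java.util.Map;

// Toy model (not Lucene code): statistics gathered while inverting one field of one document.
class DocStatsSketch {
    int length;            // number of tokens seen
    int numOverlap;        // tokens with a position increment of zero
    int position;          // simplified: sum of all position increments
    int uniqueTermCount;   // distinct terms
    int maxTermFrequency;  // highest within-document frequency of any term

    static DocStatsSketch invert(String[] terms, int[] positionIncrements) {
        DocStatsSketch stats = new DocStatsSketch();
        Map<String, Integer> freqs = new HashMap<>();
        for (int i = 0; i < terms.length; i++) {
            stats.length++;
            if (positionIncrements[i] == 0) {
                stats.numOverlap++;
            }
            stats.position += positionIncrements[i];
            int freq = freqs.merge(terms[i], 1, Integer::sum);
            stats.maxTermFrequency = Math.max(stats.maxTermFrequency, freq);
        }
        stats.uniqueTermCount = freqs.size();
        return stats;
    }

    // Document length that ignores stacked tokens such as synonyms.
    int discountedLength() {
        return length - numOverlap;
    }

    public static void main(String[] args) {
        // "quick" is stacked on "fast" as a synonym (position increment 0).
        String[] terms = {"fast", "quick", "fox", "fast"};
        int[] increments = {1, 0, 1, 1};
        DocStatsSketch stats = invert(terms, increments);
        System.out.println(stats.length + " " + stats.numOverlap + " "
            + stats.uniqueTermCount + " " + stats.maxTermFrequency + " "
            + stats.discountedLength());
    }
}
```

A Similarity that wants to discount synonyms could base its norm on something like discountedLength() rather than the raw length.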
Additional user-supplied statistics can be added to the document as DocValues fields and
accessed via AtomicReader.getNumericDocValues(java.lang.String).
Interface Summary
Interface: Description
IndexableField: Represents a single field for indexing.
IndexableFieldType: Describes the properties of a field.
IndexReader.ReaderClosedListener: A custom listener that's invoked when the IndexReader is closed.
SegmentReader.CoreClosedListener: Called when the shared core for this SegmentReader is closed.
TwoPhaseCommit: An interface for implementations that support 2-phase commit.
Class Summary
Class: Description
AtomicReader: AtomicReader is an abstract class, providing an interface for accessing an index.
AtomicReaderContext: IndexReaderContext for AtomicReader instances.
BaseCompositeReader<R extends IndexReader>: Base class for implementing CompositeReaders based on an array of sub-readers.
BinaryDocValues: A per-document byte[].
CheckIndex: Basic tool and API to check the health of an index and write a new segments file that removes reference to problematic segments.
CheckIndex.Status: Returned from CheckIndex.checkIndex() detailing the health and status of the index.
CheckIndex.Status.DocValuesStatus: Status from testing DocValues.
CheckIndex.Status.FieldNormStatus: Status from testing field norms.
CheckIndex.Status.SegmentInfoStatus: Holds the status of each segment in the index.
CheckIndex.Status.StoredFieldStatus: Status from testing stored fields.
CheckIndex.Status.TermIndexStatus: Status from testing term index.
CheckIndex.Status.TermVectorStatus: Status from testing term vectors.
CompositeReader: Instances of this reader type can only be used to get stored fields from the underlying AtomicReaders, but it is not possible to directly retrieve postings.
CompositeReaderContext: IndexReaderContext for CompositeReader instance.
CompoundFileExtractor: Command-line tool for extracting sub-files out of a compound file.
ConcurrentMergeScheduler: A MergeScheduler that runs each merge using a separate thread.
DirectoryReader: DirectoryReader is an implementation of CompositeReader that can read indexes in a Directory.
DocsAndPositionsEnum: Also iterates through positions.
DocsEnum: Iterates through the documents and term freqs.
DocTermOrds: This class enables fast access to multiple term ords for a specified field across all docIDs.
FieldInfo: Access to the Field Info file that describes document fields and whether or not they are indexed.
FieldInfos: Collection of FieldInfos (accessible by number or by name).
FieldInvertState: This class tracks the number and position / offset parameters of terms being added to the index.
Fields: Flex API for access to fields and terms.
FilterAtomicReader: A FilterAtomicReader contains another AtomicReader, which it uses as its basic source of data, possibly transforming the data along the way or providing additional functionality.
FilterAtomicReader.FilterDocsAndPositionsEnum: Base class for filtering DocsAndPositionsEnum implementations.
FilterAtomicReader.FilterDocsEnum: Base class for filtering DocsEnum implementations.
FilterAtomicReader.FilterFields: Base class for filtering Fields implementations.
FilterAtomicReader.FilterTerms: Base class for filtering Terms implementations.
FilterAtomicReader.FilterTermsEnum: Base class for filtering TermsEnum implementations.
FilterDirectoryReader: A FilterDirectoryReader wraps another DirectoryReader, allowing implementations to transform or extend it.
FilterDirectoryReader.StandardReaderWrapper: A no-op SubReaderWrapper that simply returns the parent DirectoryReader's original subreaders.
FilterDirectoryReader.SubReaderWrapper: Factory class passed to FilterDirectoryReader constructor that allows subclasses to wrap the filtered DirectoryReader's subreaders.
FilteredTermsEnum: Abstract class for enumerating a subset of all terms.
IndexCommit: Expert: represents a single commit into an index as seen by the IndexDeletionPolicy or IndexReader.
IndexDeletionPolicy: Expert: policy for deletion of stale index commits.
IndexFileNames: This class contains useful constants representing filenames and extensions used by Lucene, as well as convenience methods for querying whether a file name matches an extension (matchesExtension) and for generating file names from a segment name, generation and extension (fileNameFromGeneration, segmentFileName).
IndexReader: IndexReader is an abstract class, providing an interface for accessing an index.
IndexReaderContext: A struct-like class that represents a hierarchical relationship between IndexReader instances.
IndexSplitter: Command-line tool that enables listing segments in an index, copying specific segments to another index, and deleting segments from an index.
IndexUpgrader: This is an easy-to-use tool that upgrades all segments of an index from previous Lucene versions to the current segment file format.
IndexWriter: An IndexWriter creates and maintains an index.
IndexWriter.IndexReaderWarmer: If DirectoryReader.open(IndexWriter,boolean) has been called (i.e., this writer is in near real-time mode), then after a merge completes, this class can be invoked to warm the reader on the newly merged segment, before the merge commits.
IndexWriterConfig: Holds all the configuration that is used to create an IndexWriter.
KeepOnlyLastCommitDeletionPolicy: This IndexDeletionPolicy implementation keeps only the most recent commit and immediately removes all prior commits after a new commit is done.
LiveIndexWriterConfig: Holds all the configuration used by IndexWriter with few setters for settings that can be changed on an IndexWriter instance "live".
LogByteSizeMergePolicy: This is a LogMergePolicy that measures size of a segment as the total byte size of the segment's files.
LogDocMergePolicy: This is a LogMergePolicy that measures size of a segment as the number of documents (not taking deletions into account).
LogMergePolicy: This class implements a MergePolicy that tries to merge segments into levels of exponentially increasing size, where each level has fewer segments than the value of the merge factor.
MergePolicy: Expert: a MergePolicy determines the sequence of primitive merge operations.
MergePolicy.DocMap: A map of doc IDs.
MergePolicy.MergeSpecification: A MergeSpecification instance provides the information necessary to perform multiple merges.
MergePolicy.OneMerge: OneMerge provides the information necessary to perform an individual primitive merge operation, resulting in a single new segment.
MergeScheduler: Expert: IndexWriter uses an instance implementing this interface to execute the merges selected by a MergePolicy.
MergeState: Holds common state used during segment merging.
MergeState.CheckAbort: Class for recording units of work when merging segments.
MergeState.DocMap: Remaps docids around deletes during merge.
MultiDocsAndPositionsEnum: Exposes flex API, merged from flex API of sub-segments.
MultiDocsAndPositionsEnum.EnumWithSlice: Holds a DocsAndPositionsEnum along with the corresponding ReaderSlice.
MultiDocsEnum
MultiDocsEnum.EnumWithSlice: Holds a DocsEnum along with the corresponding ReaderSlice.
MultiDocValues: A wrapper for CompositeIndexReader providing access to DocValues.
MultiDocValues.MultiSortedDocValues: Implements SortedDocValues over n subs, using an OrdinalMap.
MultiDocValues.MultiSortedSetDocValues: Implements MultiSortedSetDocValues over n subs, using an OrdinalMap.
MultiDocValues.OrdinalMap: Maps per-segment ordinals to/from global ordinal space.
MultiFields: Exposes flex API, merged from flex API of sub-segments.
MultiPassIndexSplitter: This tool splits input index into multiple equal parts.
MultiReader: A CompositeReader which reads multiple indexes, appending their content.
MultiTerms: Exposes flex API, merged from flex API of sub-segments.
MultiTermsEnum
NoDeletionPolicy: An IndexDeletionPolicy which keeps all index commits around, never deleting them.
NoMergePolicy: A MergePolicy which never returns merges to execute (hence its name).
NoMergeScheduler: A MergeScheduler which never executes any merges.
NumericDocValues: A per-document numeric value.
OrdTermState: An ordinal based TermState.
ParallelAtomicReader: An AtomicReader which reads multiple, parallel indexes.
ParallelCompositeReader: A CompositeReader which reads multiple, parallel indexes.
PersistentSnapshotDeletionPolicy: A SnapshotDeletionPolicy which adds a persistence layer so that snapshots can be maintained across the life of an application.
PKIndexSplitter: Split an index based on a Filter.
ReaderManager: Utility class to safely share DirectoryReader instances across multiple threads, while periodically reopening.
ReaderSlice: Subreader slice from a parent composite reader.
ReaderUtil: Common util methods for dealing with IndexReaders and IndexReaderContexts.
SegmentCommitInfo: Embeds a [read-only] SegmentInfo and adds per-commit fields.
SegmentInfo: Information about a segment such as its name, directory, and files related to the segment.
SegmentInfos: A collection of segmentInfo objects with methods for operating on those segments in relation to the file system.
SegmentInfos.FindSegmentsFile: Utility class for executing code that needs to do something with the current segments file.
SegmentReader: IndexReader implementation over a single segment.
SegmentReadState: Holder class for common parameters used during read.
SegmentWriteState: Holder class for common parameters used during write.
SerialMergeScheduler: A MergeScheduler that simply does each merge sequentially, using the current thread.
SimpleMergedSegmentWarmer: A very simple merged segment warmer that just ensures data structures are initialized.
SingleTermsEnum: Subclass of FilteredTermsEnum for enumerating a single term.
SingletonSortedSetDocValues: Exposes multi-valued view over a single-valued instance.
SlowCompositeReaderWrapper: This class forces a composite reader (e.g. a MultiReader or DirectoryReader) to emulate an atomic reader.
SnapshotDeletionPolicy: An IndexDeletionPolicy that wraps any other IndexDeletionPolicy and adds the ability to hold and later release snapshots of an index.
SortedDocValues: A per-document byte[] with presorted values.
SortedSetDocValues: A per-document set of presorted byte[] values.
StoredFieldVisitor: Expert: provides a low-level means of accessing the stored field values in an index.
Term: A Term represents a word from text.
TermContext
Terms: Access to the terms in a specific field.
TermsEnum: Iterator to seek (TermsEnum.seekCeil(BytesRef), TermsEnum.seekExact(BytesRef)) or step through (BytesRefIterator.next()) terms to obtain frequency information (TermsEnum.docFreq()), DocsEnum or DocsAndPositionsEnum for the current term (TermsEnum.docs(org.apache.lucene.util.Bits, org.apache.lucene.index.DocsEnum)).
TermState: Encapsulates all required internal state to position the associated TermsEnum without re-seeking.
TieredMergePolicy: Merges segments of approximately equal size, subject to an allowed number of segments per tier.
TrackingIndexWriter: Class that tracks changes to a delegated IndexWriter, used by ControlledRealTimeReopenThread to ensure specific changes are visible.
TwoPhaseCommitTool: A utility for executing 2-phase commit on several objects.
UpgradeIndexMergePolicy: This MergePolicy is used for upgrading all existing segments of an index when calling IndexWriter.forceMerge(int).
Enum Summary
Enum: Description
FieldInfo.DocValuesType: DocValues types.
FieldInfo.IndexOptions: Controls how much information is stored in the postings lists.
IndexWriterConfig.OpenMode: Specifies the open mode for IndexWriter.
MergePolicy.MergeTrigger: MergeTrigger is passed to MergePolicy.findMerges(MergeTrigger, SegmentInfos) to indicate the event that triggered the merge.
StoredFieldVisitor.Status: Enumeration of possible return values for StoredFieldVisitor.needsField(org.apache.lucene.index.FieldInfo).
TermsEnum.SeekStatus: Represents returned result from TermsEnum.seekCeil(org.apache.lucene.util.BytesRef).
Exception Summary
Exception: Description
CorruptIndexException: This exception is thrown when Lucene detects an inconsistency in the index.
IndexFormatTooNewException: This exception is thrown when Lucene detects an index that is newer than this Lucene version.
IndexFormatTooOldException: This exception is thrown when Lucene detects an index that is too old for this Lucene version.
IndexNotFoundException: Signals that no index was found in the Directory.
MergePolicy.MergeAbortedException: Thrown when a merge was explicitly aborted because IndexWriter.close(boolean) was called with false.
MergePolicy.MergeException: Exception thrown if there are any problems while executing a merge.
TwoPhaseCommitTool.CommitFailException: Thrown by TwoPhaseCommitTool.execute(TwoPhaseCommit...) when an object fails to commit().
TwoPhaseCommitTool.PrepareCommitFailException: Thrown by TwoPhaseCommitTool.execute(TwoPhaseCommit...) when an object fails to prepareCommit().