Package org.apache.lucene.codecs.bloom
Class FuzzySet
- java.lang.Object
-
- org.apache.lucene.codecs.bloom.FuzzySet
-
- All Implemented Interfaces:
Accountable
public class FuzzySet extends Object implements Accountable
A class used to represent a set of many, potentially large, values (e.g. many long strings such as URLs), using a significantly smaller amount of memory.The set is "lossy" in that it cannot definitively state that is does contain a value but it can definitively say if a value is not in the set. It can therefore be used as a Bloom Filter. Another application of the set is that it can be used to perform fuzzy counting because it can estimate reasonably accurately how many unique values are contained in the set.
This class is NOT threadsafe.
Internally a Bitset is used to record values and once a client has finished recording a stream of values the
downsize(float)
method can be used to create a suitably smaller set that is sized appropriately for the number of values recorded and desired saturation levels.- WARNING: This API is experimental and might change in incompatible ways in the next release.
-
-
Nested Class Summary
Nested Classes Modifier and Type Class Description static class
FuzzySet.ContainsResult
Result fromcontains(BytesRef)
: can never return definitively YES (always MAYBE), but can sometimes definitely return NO.
-
Field Summary
Fields Modifier and Type Field Description static int
VERSION_CURRENT
static int
VERSION_SPI
static int
VERSION_START
-
Fields inherited from interface org.apache.lucene.util.Accountable
NULL_ACCOUNTABLE
-
-
Method Summary
All Methods Static Methods Instance Methods Concrete Methods Modifier and Type Method Description void
addValue(BytesRef value)
Records a value in the set.FuzzySet.ContainsResult
contains(BytesRef value)
The main method required for a Bloom filter which, given a value determines set membership.static FuzzySet
createSetBasedOnMaxMemory(int maxNumBytes)
static FuzzySet
createSetBasedOnQuality(int maxNumUniqueValues, float desiredMaxSaturation)
static FuzzySet
deserialize(DataInput in)
FuzzySet
downsize(float targetMaxSaturation)
static int
getEstimatedNumberUniqueValuesAllowingForCollisions(int setSize, int numRecordedBits)
int
getEstimatedUniqueValues()
static int
getNearestSetSize(int maxNumberOfBits)
Rounds down required maxNumberOfBits to the nearest number that is made up of all ones as a binary number.static int
getNearestSetSize(int maxNumberOfValuesExpected, float desiredSaturation)
Use this method to choose a set size where accuracy (low content saturation) is more important than deciding how much memory to throw at the problem.float
getSaturation()
static HashFunction
hashFunctionForVersion(int version)
long
ramBytesUsed()
void
serialize(DataOutput out)
Serializes the data set to file using the following format: FuzzySet -->FuzzySetVersion,HashFunctionName,BloomSize, NumBitSetWords,BitSetWordNumBitSetWords HashFunctionName -->String
The name of a ServiceProvider registeredHashFunction
FuzzySetVersion -->Uint32
The version number of theFuzzySet
class BloomSize -->Uint32
The modulo value used to project hashes into the field's Bitset NumBitSetWords -->Uint32
The number of longs (as returned fromFixedBitSet.getBits()
) BitSetWord -->Long
A long from the array returned byFixedBitSet.getBits()
String
toString()
-
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, wait, wait, wait
-
Methods inherited from interface org.apache.lucene.util.Accountable
getChildResources
-
-
-
-
Field Detail
-
VERSION_SPI
public static final int VERSION_SPI
- See Also:
- Constant Field Values
-
VERSION_START
public static final int VERSION_START
- See Also:
- Constant Field Values
-
VERSION_CURRENT
public static final int VERSION_CURRENT
- See Also:
- Constant Field Values
-
-
Method Detail
-
hashFunctionForVersion
public static HashFunction hashFunctionForVersion(int version)
-
getNearestSetSize
public static int getNearestSetSize(int maxNumberOfBits)
Rounds down required maxNumberOfBits to the nearest number that is made up of all ones as a binary number. Use this method where controlling memory use is paramount.
-
getNearestSetSize
public static int getNearestSetSize(int maxNumberOfValuesExpected, float desiredSaturation)
Use this method to choose a set size where accuracy (low content saturation) is more important than deciding how much memory to throw at the problem.- Parameters:
desiredSaturation
- A number between 0 and 1 expressing the % of bits set once all values have been recorded- Returns:
- The size of the set nearest to the required size
-
createSetBasedOnMaxMemory
public static FuzzySet createSetBasedOnMaxMemory(int maxNumBytes)
-
createSetBasedOnQuality
public static FuzzySet createSetBasedOnQuality(int maxNumUniqueValues, float desiredMaxSaturation)
-
contains
public FuzzySet.ContainsResult contains(BytesRef value)
The main method required for a Bloom filter which, given a value determines set membership. Unlike a conventional set, the fuzzy set returns NO or MAYBE rather than true or false.- Returns:
- NO or MAYBE
-
serialize
public void serialize(DataOutput out) throws IOException
Serializes the data set to file using the following format:- FuzzySet -->FuzzySetVersion,HashFunctionName,BloomSize, NumBitSetWords,BitSetWordNumBitSetWords
- HashFunctionName -->
String
The name of a ServiceProvider registeredHashFunction
- FuzzySetVersion -->
Uint32
The version number of theFuzzySet
class - BloomSize -->
Uint32
The modulo value used to project hashes into the field's Bitset - NumBitSetWords -->
Uint32
The number of longs (as returned fromFixedBitSet.getBits()
) - BitSetWord -->
Long
A long from the array returned byFixedBitSet.getBits()
- Parameters:
out
- Data output stream- Throws:
IOException
- If there is a low-level I/O error
-
deserialize
public static FuzzySet deserialize(DataInput in) throws IOException
- Throws:
IOException
-
addValue
public void addValue(BytesRef value) throws IOException
Records a value in the set. The referenced bytes are hashed and then modulo n'd where n is the chosen size of the internal bitset.- Parameters:
value
- the key value to be hashed- Throws:
IOException
- If there is a low-level I/O error
-
downsize
public FuzzySet downsize(float targetMaxSaturation)
- Parameters:
targetMaxSaturation
- A number between 0 and 1 describing the % of bits that would ideally be set in the result. Lower values have better accuracy but require more space.- Returns:
- a smaller FuzzySet or null if the current set is already over-saturated
-
getEstimatedUniqueValues
public int getEstimatedUniqueValues()
-
getEstimatedNumberUniqueValuesAllowingForCollisions
public static int getEstimatedNumberUniqueValuesAllowingForCollisions(int setSize, int numRecordedBits)
-
getSaturation
public float getSaturation()
-
ramBytesUsed
public long ramBytesUsed()
- Specified by:
ramBytesUsed
in interfaceAccountable
-
-