Package org.apache.lucene.codecs.bloom
Class FuzzySet
- java.lang.Object
-
- org.apache.lucene.codecs.bloom.FuzzySet
-
- All Implemented Interfaces:
Accountable
public class FuzzySet extends Object implements Accountable
A class used to represent a set of many, potentially large, values (e.g. many long strings such as URLs), using a significantly smaller amount of memory.The set is "lossy" in that it cannot definitively state that is does contain a value but it can definitively say if a value is not in the set. It can therefore be used as a Bloom Filter. Another application of the set is that it can be used to perform fuzzy counting because it can estimate reasonably accurately how many unique values are contained in the set.
This class is NOT threadsafe.
Internally a Bitset is used to record values and once a client has finished recording a stream of values the
downsize(float)
method can be used to create a suitably smaller set that is sized appropriately for the number of values recorded and desired saturation levels.- WARNING: This API is experimental and might change in incompatible ways in the next release.
-
-
Nested Class Summary
Nested Classes Modifier and Type Class Description static class
FuzzySet.ContainsResult
Result fromcontains(BytesRef)
: can never return definitively YES (always MAYBE), but can sometimes definitely return NO.
-
Field Summary
-
Fields inherited from interface org.apache.lucene.util.Accountable
NULL_ACCOUNTABLE
-
-
Method Summary
All Methods Static Methods Instance Methods Concrete Methods Modifier and Type Method Description void
addValue(BytesRef value)
Records a value in the set.FuzzySet.ContainsResult
contains(BytesRef value)
The main method required for a Bloom filter which, given a value determines set membership.static FuzzySet
createOptimalSet(int maxNumUniqueValues, float targetMaxFpp)
static FuzzySet
createSetBasedOnMaxMemory(int maxNumBytes)
static FuzzySet
createSetBasedOnQuality(int maxNumUniqueValues, float desiredMaxSaturation, int version)
static FuzzySet
deserialize(DataInput in)
FuzzySet
downsize(float targetMaxSaturation)
static int
getEstimatedNumberUniqueValuesAllowingForCollisions(int setSize, int numRecordedBits)
int
getEstimatedUniqueValues()
static int
getNearestSetSize(int maxNumberOfBits)
Rounds down required maxNumberOfBits to the nearest number that is made up of all ones as a binary number.static int
getNearestSetSize(int maxNumberOfValuesExpected, float desiredSaturation)
Use this method to choose a set size where accuracy (low content saturation) is more important than deciding how much memory to throw at the problem.float
getSaturation()
float
getTargetMaxSaturation()
long
ramBytesUsed()
void
serialize(DataOutput out)
Serializes the data set to file using the following format: FuzzySet -->hashCount,BloomSize, NumBitSetWords,BitSetWordNumBitSetWords hashCount -->Uint32
The number of hash functions (k).String
toString()
-
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, wait, wait, wait
-
Methods inherited from interface org.apache.lucene.util.Accountable
getChildResources
-
-
-
-
Method Detail
-
getNearestSetSize
public static int getNearestSetSize(int maxNumberOfBits)
Rounds down required maxNumberOfBits to the nearest number that is made up of all ones as a binary number. Use this method where controlling memory use is paramount.
-
getNearestSetSize
public static int getNearestSetSize(int maxNumberOfValuesExpected, float desiredSaturation)
Use this method to choose a set size where accuracy (low content saturation) is more important than deciding how much memory to throw at the problem.- Parameters:
desiredSaturation
- A number between 0 and 1 expressing the % of bits set once all values have been recorded- Returns:
- The size of the set nearest to the required size
-
createSetBasedOnMaxMemory
public static FuzzySet createSetBasedOnMaxMemory(int maxNumBytes)
-
createSetBasedOnQuality
public static FuzzySet createSetBasedOnQuality(int maxNumUniqueValues, float desiredMaxSaturation, int version)
-
createOptimalSet
public static FuzzySet createOptimalSet(int maxNumUniqueValues, float targetMaxFpp)
-
contains
public FuzzySet.ContainsResult contains(BytesRef value)
The main method required for a Bloom filter which, given a value determines set membership. Unlike a conventional set, the fuzzy set returns NO or MAYBE rather than true or false. Hash generation follows the same principles asaddValue(BytesRef)
- Returns:
- NO or MAYBE
-
serialize
public void serialize(DataOutput out) throws IOException
Serializes the data set to file using the following format:- FuzzySet -->hashCount,BloomSize, NumBitSetWords,BitSetWordNumBitSetWords
- hashCount -->
Uint32
The number of hash functions (k). - BloomSize -->
Uint32
The modulo value used to project hashes into the field's Bitset - NumBitSetWords -->
Uint32
The number of longs (as returned fromFixedBitSet.getBits()
) - BitSetWord -->
Long
A long from the array returned byFixedBitSet.getBits()
- Parameters:
out
- Data output stream- Throws:
IOException
- If there is a low-level I/O error
-
deserialize
public static FuzzySet deserialize(DataInput in) throws IOException
- Throws:
IOException
-
addValue
public void addValue(BytesRef value) throws IOException
Records a value in the set. The referenced bytes are hashed. From the 64-bit generated hash, two 32-bit hashes are derived from the msb and lsb which can be used to derive more hashes (see https://www.eecs.harvard.edu/~michaelm/postscripts/rsa2008.pdf). Finally, each generated hash is modulo n'd where n is the chosen size of the internal bitset.- Parameters:
value
- the key value to be hashed- Throws:
IOException
- If there is a low-level I/O error
-
downsize
public FuzzySet downsize(float targetMaxSaturation)
- Parameters:
targetMaxSaturation
- A number between 0 and 1 describing the % of bits that would ideally be set in the result. Lower values have better accuracy but require more space.- Returns:
- a smaller FuzzySet or null if the current set is already over-saturated
-
getEstimatedUniqueValues
public int getEstimatedUniqueValues()
-
getEstimatedNumberUniqueValuesAllowingForCollisions
public static int getEstimatedNumberUniqueValuesAllowingForCollisions(int setSize, int numRecordedBits)
-
getTargetMaxSaturation
public float getTargetMaxSaturation()
-
getSaturation
public float getSaturation()
-
ramBytesUsed
public long ramBytesUsed()
- Specified by:
ramBytesUsed
in interfaceAccountable
-
-