Function<T,java.lang.Long>, Object2LongFunction<T>, Size64, java.io.Serializable, java.util.function.Function<T,java.lang.Long>, java.util.function.ToLongFunction<T>
public class GOVMinimalPerfectHashFunction<T> extends AbstractHashFunction<T> implements java.io.Serializable
Given a list of keys without duplicates, the builder of this class finds a minimal
perfect hash function for the list. Subsequent calls to the getLong(Object) method will
return a distinct number for each key in the list. For keys not in the list, the
resulting number is unspecified. In some (rare) cases it might be possible to establish that a
key was not in the original list, and in that case -1 will be returned;
by signing the function (see below), you can guarantee with a prescribed probability
that -1 will be returned on keys not in the original list. The class can then be
saved by serialisation and reused later.
This class uses a chunked hash store to provide highly scalable construction. Note that at construction time
you can pass a ChunkedHashStore containing the keys (associated with any value); however, if the store is
rebuilt because of a ChunkedHashStore.DuplicateException, it will be rebuilt associating with each key its
ordinal position.
For convenience, this class provides a main method that reads from standard input a (possibly
gzip'd) sequence of newline-separated strings, and writes a serialised minimal
perfect hash function for the given list.
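As an illustration of the input format only, the following sketch reads newline-separated keys from a stream that may or may not be gzip-compressed. It is not this class's main method: the class name KeyReader, the methods readKeys and gzip, and the detection of gzip input by its two magic bytes are assumptions made for this example (the actual tool may rely on a command-line option instead).

```java
import java.io.BufferedInputStream;
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

/** Hypothetical helper, not part of the library. */
final class KeyReader {
    /** Reads one key per line, decompressing first if the stream starts with the gzip magic bytes. */
    static List<String> readKeys(InputStream raw) {
        try {
            BufferedInputStream in = new BufferedInputStream(raw);
            in.mark(2);                 // remember the start so we can peek at two bytes
            int b0 = in.read();
            int b1 = in.read();
            in.reset();                 // rewind to the start of the stream
            boolean gzipped = b0 == 0x1f && b1 == 0x8b;
            InputStream source = gzipped ? new GZIPInputStream(in) : in;
            List<String> keys = new ArrayList<>();
            try (BufferedReader reader =
                     new BufferedReader(new InputStreamReader(source, StandardCharsets.UTF_8))) {
                for (String line; (line = reader.readLine()) != null; ) {
                    keys.add(line);
                }
            }
            return keys;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Demo helper: gzip-compresses a byte array in memory. */
    static byte[] gzip(byte[] data) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (GZIPOutputStream out = new GZIPOutputStream(bytes)) {
                out.write(data);
            }
            return bytes.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // A real tool would pass System.in; in-memory streams keep the demo self-contained.
        byte[] plain = "alpha\nbeta\ngamma\n".getBytes(StandardCharsets.UTF_8);
        System.out.println(readKeys(new ByteArrayInputStream(plain)));
        System.out.println(readKeys(new ByteArrayInputStream(gzip(plain))));
    }
}
```

Both calls in main yield the same list of keys, which is what a builder would then receive as its key collection.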
Optionally, it is possible to sign the minimal perfect hash function. A w-bit signature will
be associated with each key, so that getLong(Object) will return -1 on strings that are not
in the original key set. As usual, false positives are possible with probability 2^-w.
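The signing idea can be sketched as follows. This is a toy model, not the library's code: a HashMap stands in for the minimal perfect hash function, the class SignedLookupSketch and its methods are hypothetical, and the signature is derived from String.hashCode() through a standard 64-bit mixing step, whereas a real implementation would take signature bits from its own key hashes.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical toy model of a signed minimal perfect hash function. */
final class SignedLookupSketch {
    private final Map<String, Integer> rank = new HashMap<>(); // stand-in for the unsigned function
    private final long[] signatures;                           // one w-bit signature per key
    private final long signatureMask;

    SignedLookupSketch(String[] keys, int signatureWidth) {
        // A shift by 64 would wrap around in Java, so the full-width mask is special-cased.
        signatureMask = signatureWidth == 64 ? -1L : (1L << signatureWidth) - 1;
        signatures = new long[keys.length];
        for (int i = 0; i < keys.length; i++) {
            rank.put(keys[i], i);
            signatures[i] = mix(keys[i].hashCode()) & signatureMask;
        }
    }

    /** A bijective 64-bit mixer (the MurmurHash3 finaliser); any good hash would do here. */
    static long mix(long z) {
        z = (z ^ (z >>> 33)) * 0xff51afd7ed558ccdL;
        z = (z ^ (z >>> 33)) * 0xc4ceb9fe1a85ec53L;
        return z ^ (z >>> 33);
    }

    /** Models the unsigned function: correct on members, an arbitrary slot otherwise. */
    private int unsignedRank(String key) {
        Integer r = rank.get(key);
        return r != null ? r : Math.floorMod(key.hashCode(), signatures.length);
    }

    /** Returns the rank of the key, or -1 if the stored signature does not match. */
    long getLong(String key) {
        int r = unsignedRank(key);
        return signatures[r] == (mix(key.hashCode()) & signatureMask) ? r : -1;
    }

    public static void main(String[] args) {
        String[] keys = { "alpha", "beta", "gamma" };
        SignedLookupSketch f = new SignedLookupSketch(keys, 16);
        for (String k : keys) System.out.println(k + " -> " + f.getLong(k));
        // For a non-member the result is -1 unless its 16-bit signature happens to collide.
        System.out.println("delta -> " + f.getLong("delta"));
    }
}
```

A wider signature lowers the false-positive probability at the cost of w additional bits per key.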
This implementation is multithreaded: each chunk returned by the ChunkedHashStore is processed independently. By
default, this class uses Runtime.availableProcessors() parallel threads, but never more than 16. If you wish to
set a specific number of threads, you can do so through the system property "it.unimi.dsi.sux4j.mph.threads".
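The rule described above can be expressed in a few lines. This is a sketch of the documented behaviour, not the library's source: the class ThreadCountSketch and the method numberOfThreads are hypothetical names, and only the property name and the cap of 16 come from the text.

```java
/** Hypothetical sketch of the documented thread-count rule. */
final class ThreadCountSketch {
    /** Property name as given in the documentation above. */
    static final String NUMBER_OF_THREADS_PROPERTY = "it.unimi.dsi.sux4j.mph.threads";

    /** Returns the property value if set to an integer, else min(16, available processors). */
    static int numberOfThreads() {
        int byDefault = Math.min(16, Runtime.getRuntime().availableProcessors());
        // Integer.getInteger falls back to the default when the property is absent or unparsable.
        return Integer.getInteger(NUMBER_OF_THREADS_PROPERTY, byDefault);
    }

    public static void main(String[] args) {
        System.out.println("default:  " + numberOfThreads());
        System.setProperty(NUMBER_OF_THREADS_PROPERTY, "4");
        System.out.println("override: " + numberOfThreads());
    }
}
```

On the command line, the same override is usually passed as a JVM option of the form -Dit.unimi.dsi.sux4j.mph.threads=4.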
The details of the data structure can be found in “Fast Scalable Construction of (Minimal Perfect Hash) Functions”, by Marco Genuzio, Giuseppe Ottaviano and Sebastiano Vigna, 15th International Symposium on Experimental Algorithms (SEA 2016), Lecture Notes in Computer Science, Springer, 2016.
We generate a random 3-regular hypergraph and give it an orientation. From the orientation, we generate a random linear system on F_3 (the field with three elements), where the variables in the k-th equation are the vertices of the k-th hyperedge, and the known term of the k-th equation is the vertex giving orientation to the k-th hyperedge. Then, we solve the system and store the solution, which provides a perfect hash function.
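To make the role of the linear system concrete, here is a toy instance with three keys and five vertices. Several things are assumptions of this sketch: the hyperedges are hard-coded rather than derived from key hashes, the known term of each equation is taken to be the position (0, 1 or 2) of the orienting vertex inside its hyperedge, and the system is solved by brute force over all assignments, which merely stands in for the solver described in the paper. All names are hypothetical.

```java
/** Hypothetical toy instance of the F_3 linear system; not the library's code. */
final class F3SystemSketch {
    /** Tries all 3^m assignments and returns the first one satisfying every equation, or null. */
    static int[] solve(int numVertices, int[][] edges, int[] hingeIndex) {
        int[] x = new int[numVertices];
        long total = 1;
        for (int i = 0; i < numVertices; i++) total *= 3;
        for (long code = 0; code < total; code++) {
            long c = code;
            for (int v = 0; v < numVertices; v++) { // decode the counter in base 3
                x[v] = (int) (c % 3);
                c /= 3;
            }
            if (satisfies(x, edges, hingeIndex)) return x;
        }
        return null;
    }

    /** Checks x[v0] + x[v1] + x[v2] = hingeIndex[k] (mod 3) for every hyperedge k. */
    static boolean satisfies(int[] x, int[][] edges, int[] hingeIndex) {
        for (int k = 0; k < edges.length; k++) {
            int[] e = edges[k];
            if ((x[e[0]] + x[e[1]] + x[e[2]]) % 3 != hingeIndex[k]) return false;
        }
        return true;
    }

    /** Perfect hash value of a key: the vertex of its hyperedge selected by the stored values. */
    static int lookup(int[] x, int[] edge) {
        return edge[(x[edge[0]] + x[edge[1]] + x[edge[2]]) % 3];
    }

    public static void main(String[] args) {
        int[][] edges = { { 0, 1, 2 }, { 1, 2, 3 }, { 2, 3, 4 } }; // one hyperedge per key
        int[] hingeIndex = { 0, 2, 2 };                            // orientation: vertices 0, 3, 4
        int[] x = solve(5, edges, hingeIndex);
        for (int[] e : edges) System.out.println(lookup(x, e));    // prints 0, 3, 4
    }
}
```

Because the orientation assigns a different vertex to each hyperedge, the three lookups return three distinct vertices, which is exactly the perfect-hash property.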
To obtain a minimal perfect hash function, we simply notice that whenever we have to assign a value
to a vertex, we can take care of using the number 3 instead of 0 if the vertex is actually the
output value for some key. The final value of the minimal perfect hash function is the number
of nonzero pairs of bits that precede the perfect hash value for the key. To compute this
number, we use broadword programming in each chunk.
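The ranking step can be illustrated on a single 64-bit word holding up to 32 two-bit values. This is a minimal sketch and not necessarily the library's implementation: the class NonzeroPairRank and the helpers pack and rank are hypothetical, and the toy handles only vertices that fit in one word.

```java
/** Hypothetical single-word illustration of ranking by nonzero bit pairs. */
final class NonzeroPairRank {
    /** Counts the 2-bit pairs of x that are not 00, using one OR, one mask and a popcount. */
    static int countNonzeroPairs(long x) {
        // After the OR, each even bit is set exactly when its pair is nonzero.
        return Long.bitCount((x | (x >>> 1)) & 0x5555555555555555L);
    }

    /** Packs values in {0, 1, 2, 3} into a long, value i at bits 2i and 2i+1 (at most 32 values). */
    static long pack(int[] values) {
        long word = 0;
        for (int i = 0; i < values.length; i++) word |= (long) values[i] << (2 * i);
        return word;
    }

    /** Number of nonzero pairs strictly before the given vertex (vertex must be below 32). */
    static int rank(long word, int vertex) {
        long below = word & ((1L << (2 * vertex)) - 1);
        return countNonzeroPairs(below);
    }

    public static void main(String[] args) {
        // Vertices 0, 3 and 4 are outputs for some key; 0 and 4 would hold 0, so they store 3.
        long word = pack(new int[] { 3, 0, 0, 2, 3 });
        System.out.println(rank(word, 0)); // 0
        System.out.println(rank(word, 3)); // 1
        System.out.println(rank(word, 4)); // 2
    }
}
```

Since 3 is congruent to 0 modulo 3, storing 3 in place of 0 leaves the sums of the perfect hash function unchanged while marking the vertex as used, so the ranks of the used vertices run from 0 to n - 1.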
Since the system must have ≈10% more variables than equations to be solvable, and each variable
is stored in two bits, a GOVMinimalPerfectHashFunction on n keys requires 2.2n bits.
Modifier and Type | Class | Description |
---|---|---|
static class | GOVMinimalPerfectHashFunction.Builder<T> | A builder class for GOVMinimalPerfectHashFunction. |
Modifier and Type | Field | Description |
---|---|---|
protected long[] | array | The bit array supporting bitVector. |
protected LongArrayBitVector | bitVector | The bit vector underlying values. |
protected long[] | edgeOffsetAndSeed | A long containing the cumulating function of the chunk edges (i.e., keys) in the lower 56 bits, and the local seed of each chunk in the upper 8 bits. |
protected long | globalSeed | The seed used to generate the initial hash triple. |
static int | LOG2_CHUNK_SIZE | The logarithm of the desired chunk size. |
protected long | n | The number of keys. |
static java.lang.String | NUMBER_OF_THREADS_PROPERTY | The system property used to set the number of parallel threads. |
static long | serialVersionUID | |
protected long | signatureMask | The mask to compare signatures, or zero for no signatures. |
protected LongBigList | signatures | The signatures. |
protected TransformationStrategy<? super T> | transform | The transformation strategy. |
protected LongBigList | values | The final magick: the list of modulo-3 values that define the output of the minimal perfect hash function. |
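The layout described for edgeOffsetAndSeed (a cumulative edge count in the lower 56 bits and a per-chunk seed in the upper 8 bits of each long) can be sketched as follows. The class OffsetSeedPacking, the constant OFFSET_MASK as defined here, and the methods pack, edgeOffset and seed are hypothetical names chosen for illustration; only the 56/8 bit split comes from the field description.

```java
/** Hypothetical illustration of a 56-bit offset and 8-bit seed packed into one long. */
final class OffsetSeedPacking {
    /** Mask selecting the lower 56 bits. */
    static final long OFFSET_MASK = -1L >>> 8;

    /** Packs a cumulative edge offset (below 2^56) and a local seed (0..255) into one long. */
    static long pack(long edgeOffset, int seed) {
        if (edgeOffset < 0 || edgeOffset > OFFSET_MASK) throw new IllegalArgumentException("offset out of range");
        if (seed < 0 || seed > 255) throw new IllegalArgumentException("seed out of range");
        return ((long) seed << 56) | edgeOffset;
    }

    /** Extracts the cumulative edge offset from the lower 56 bits. */
    static long edgeOffset(long packed) {
        return packed & OFFSET_MASK;
    }

    /** Extracts the local seed from the upper 8 bits. */
    static int seed(long packed) {
        return (int) (packed >>> 56); // unsigned shift, so the result is in 0..255
    }

    public static void main(String[] args) {
        long packed = pack(1000, 7);
        System.out.println(edgeOffset(packed)); // 1000
        System.out.println(seed(packed));       // 7
    }
}
```

Keeping both quantities in one long means a lookup touches a single array entry per chunk to obtain the chunk's starting position and its seed.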
defRetValue
Modifier | Constructor | Description |
---|---|---|
protected | GOVMinimalPerfectHashFunction(java.lang.Iterable<? extends T> keys, TransformationStrategy<? super T> transform, int signatureWidth, java.io.File tempDir, ChunkedHashStore<T> chunkedHashStore) | Creates a new minimal perfect hash function for the given keys. |
Modifier and Type | Method | Description |
---|---|---|
static int | countNonzeroPairs(long x) | Counts the number of nonzero pairs of bits in a long. |
void | dump(java.lang.String file) | |
long | getLong(java.lang.Object key) | |
long | getLongByTriple(long[] triple) | Low-level access to the output of this minimal perfect hash function. |
static void | main(java.lang.String[] arg) | |
long | numBits() | Returns the number of bits used by this structure. |
long | size64() | |
protected static long | vertexOffset(long edgeOffsetSeed) | |
containsKey, size
defaultReturnValue, defaultReturnValue
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
applyAsLong, get, put, put, remove, removeLong
public static final long serialVersionUID
public static final java.lang.String NUMBER_OF_THREADS_PROPERTY
public static final int LOG2_CHUNK_SIZE
protected final long n
protected final long globalSeed
protected final long[] edgeOffsetAndSeed
The method vertexOffset(long) returns the chunk (i.e., vertex) cumulative value starting from the edge cumulative value.
protected final LongBigList values
protected final LongArrayBitVector bitVector
The bit vector underlying values.
protected transient long[] array
The bit array supporting bitVector.
protected final TransformationStrategy<? super T> transform
protected final long signatureMask
protected final LongBigList signatures
protected GOVMinimalPerfectHashFunction(java.lang.Iterable<? extends T> keys, TransformationStrategy<? super T> transform, int signatureWidth, java.io.File tempDir, ChunkedHashStore<T> chunkedHashStore) throws java.io.IOException
Parameters:
keys - the keys to hash, or null.
transform - a transformation strategy for the keys.
signatureWidth - a signature width, or 0 for no signature.
tempDir - a temporary directory for the store files, or null for the standard temporary directory.
chunkedHashStore - a chunked hash store containing the keys, or null; the store can be unchecked, but in this case keys and transform must be non-null.
Throws:
java.io.IOException
public static final int countNonzeroPairs(long x)
x - a long.
Returns the number of nonzero pairs of bits in x.
protected static long vertexOffset(long edgeOffsetSeed)
public long numBits()
public long getLong(java.lang.Object key)
Specified by: getLong in interface Object2LongFunction<T>
public long getLongByTriple(long[] triple)
This method makes it possible to build several kinds of functions on the same ChunkedHashStore and
then retrieve the resulting values by generating a single triple of hashes. The method
TwoStepsGOV3Function.getLong(Object) is a good example of this technique.
triple - a triple generated as documented in ChunkedHashStore.
public long size64()
Specified by: size64 in interface Size64
Overrides: size64 in class AbstractHashFunction<T>
public void dump(java.lang.String file) throws java.io.IOException
Throws: java.io.IOException
public static void main(java.lang.String[] arg) throws java.lang.NoSuchMethodException, java.io.IOException, JSAPException
Throws: java.lang.NoSuchMethodException, java.io.IOException, JSAPException