All Implemented Interfaces: Function<T,java.lang.Long>, Object2LongFunction<T>, Size64, java.io.Serializable, java.util.function.Function<T,java.lang.Long>, java.util.function.ToLongFunction<T>
public class CHDMinimalPerfectHashFunction<T> extends AbstractHashFunction<T> implements java.io.Serializable
Given a list of keys without duplicates, the builder of this class finds a minimal
perfect hash function for the list. Subsequent calls to the getLong(Object) method will
return a distinct number for each key in the list. For keys not in the list, the
resulting number is not specified. In some (rare) cases it might be possible to establish that a
key was not in the original list, and in that case -1 will be returned;
by signing the function (see below), you can guarantee with a prescribed probability
that -1 will be returned on keys not in the original list. The resulting function can then be
saved by serialisation and reused later.
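The contract described above can be checked mechanically. The following is a minimal sketch, not part of the library: `MphContractCheck` and `isMinimalPerfect` are hypothetical names, and the check relies only on the fact that this class implements `java.util.function.ToLongFunction<T>`. "Minimal" is taken to mean that the n keys are mapped one-to-one onto 0..n-1.

```java
import java.util.BitSet;
import java.util.List;
import java.util.function.ToLongFunction;

/** Hypothetical helper: checks the minimal-perfect-hash contract on a key list. */
final class MphContractCheck {
    /** Returns true if f maps the keys one-to-one onto 0..keys.size()-1. */
    static <T> boolean isMinimalPerfect(ToLongFunction<? super T> f, List<? extends T> keys) {
        final int n = keys.size();
        final BitSet seen = new BitSet(n);
        for (T key : keys) {
            final long v = f.applyAsLong(key);
            if (v < 0 || v >= n) return false;    // out of range (this includes -1)
            if (seen.get((int) v)) return false;  // two keys share the same value
            seen.set((int) v);
        }
        return true;
    }

    public static void main(String[] args) {
        final List<String> keys = List.of("a", "bb", "ccc");
        // Toy stand-in functions; a real function built by this class could be passed instead.
        System.out.println(isMinimalPerfect(s -> s.length() - 1, keys)); // true
        System.out.println(isMinimalPerfect(s -> 0L, keys));             // false
    }
}
```

Such a check is only meaningful for keys in the original list; for other keys the returned value is unspecified unless the function is signed.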
This class uses a chunked hash store to provide highly scalable construction. Note that at construction time
you can pass a ChunkedHashStore containing the keys (associated with any value); however, if the store is
rebuilt because of a DuplicateException, it will be rebuilt associating with each key its ordinal position.
The memory requirements for the algorithm we use are ≈2 bits per key for load factor
equal to one and λ = 5. Thus, this class
can use ≈10% less memory than a GOVMinimalPerfectHashFunction.
However, its construction time is an order of magnitude larger, and query time is about 50% slower. Different tradeoffs between construction time, query time and space can be obtained by tweaking the load factor and the parameter λ (see the paper below for their exact meaning).
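As a back-of-the-envelope illustration of the ≈2 bits per key figure, the sketch below converts a bits-per-key estimate into bytes, and a total bit count (such as the value returned by numBits()) back into bits per key. `MphSpaceEstimate` and its methods are hypothetical names, and the figures in `main` are made-up inputs rather than measurements.

```java
/** Hypothetical helper for rough space estimates. */
final class MphSpaceEstimate {
    /** Estimated size in bytes of a structure using bitsPerKey bits for each of n keys. */
    static long estimatedBytes(long n, double bitsPerKey) {
        return (long) Math.ceil(n * bitsPerKey / 8.0);
    }

    /** Bits per key, given a total bit count such as the one returned by numBits(). */
    static double bitsPerKey(long totalBits, long n) {
        return (double) totalBits / n;
    }

    public static void main(String[] args) {
        System.out.println(estimatedBytes(1_000_000_000L, 2.0)); // 250000000
        System.out.println(bitsPerKey(2_100_000L, 1_000_000L));  // 2.1
    }
}
```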
For convenience, this class provides a main method that reads from standard input a (possibly
gzip'd) sequence of newline-separated strings, and writes a serialised minimal
perfect hash function for the given list.
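The input format accepted by the main method can be reproduced with the standard library alone. The sketch below is an assumption-laden illustration, not the code of the main method: `KeyListReader` is a hypothetical name, UTF-8 is assumed as the encoding, and gzip input is detected by sniffing the two gzip magic bytes rather than by any command-line option.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.PushbackInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.GZIPInputStream;

/** Hypothetical helper: reads newline-separated keys from a plain or gzip'd stream (Java 9+). */
final class KeyListReader {
    static List<String> readKeys(InputStream in) throws IOException {
        final PushbackInputStream pushback = new PushbackInputStream(in, 2);
        final byte[] magic = new byte[2];
        final int read = pushback.readNBytes(magic, 0, 2);
        if (read > 0) pushback.unread(magic, 0, read);  // put the sniffed bytes back
        // A gzip stream starts with the bytes 0x1F 0x8B.
        final boolean gzipped = read == 2 && (magic[0] & 0xFF) == 0x1F && (magic[1] & 0xFF) == 0x8B;
        final InputStream source = gzipped ? new GZIPInputStream(pushback) : pushback;

        final List<String> keys = new ArrayList<>();
        try (BufferedReader reader = new BufferedReader(new InputStreamReader(source, StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) keys.add(line);
        }
        return keys;
    }

    public static void main(String[] args) throws IOException {
        final byte[] plain = "alpha\nbeta\n".getBytes(StandardCharsets.UTF_8);
        System.out.println(readKeys(new ByteArrayInputStream(plain))); // [alpha, beta]
    }
}
```

The resulting list could then be handed to the builder as the key collection; duplicates are not removed here, so the input is assumed to be duplicate-free.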
Optionally, it is possible to sign the minimal perfect hash function. A w-bit signature will
be associated with each key, so that getLong(Object) will return -1 on strings that are not
in the original key set. As usual, false positives are possible with probability 2^(-w).
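The signing idea can be sketched independently of the library. In the toy class below, `SignedLookup` and its constructor parameters are hypothetical: `mph` stands for any function mapping the keys to 0..n-1, and `hash` for a second hash from which the w-bit signature is taken. The field names `signatureMask` and `signatures` mirror the fields of this class, but how the library actually derives and stores signatures is not shown here.

```java
import java.util.List;
import java.util.function.ToLongFunction;

/** Toy sketch of a signed lookup; not the Sux4J implementation. */
final class SignedLookup<T> {
    private final ToLongFunction<? super T> mph;   // hypothetical: maps each key to 0..n-1
    private final ToLongFunction<? super T> hash;  // hypothetical: source of signature bits
    private final long signatureMask;              // zero means "no signatures"
    private final long[] signatures;               // signature of the key with a given rank

    SignedLookup(List<? extends T> keys, ToLongFunction<? super T> mph,
                 ToLongFunction<? super T> hash, int width) {
        this.mph = mph;
        this.hash = hash;
        this.signatureMask = width == 64 ? -1L : (1L << width) - 1;
        this.signatures = new long[keys.size()];
        for (T key : keys) signatures[(int) mph.applyAsLong(key)] = hash.applyAsLong(key) & signatureMask;
    }

    /** Returns the rank of the key, or -1 if its signature does not match the stored one. */
    long getLong(T key) {
        final long r = mph.applyAsLong(key);
        if (r < 0 || r >= signatures.length) return -1;
        return signatures[(int) r] == (hash.applyAsLong(key) & signatureMask) ? r : -1;
    }

    /** Probability that an unknown key passes a w-bit signature check: 2^(-w). */
    static double falsePositiveProbability(int width) {
        return Math.scalb(1.0, -width);
    }

    public static void main(String[] args) {
        final SignedLookup<String> f = new SignedLookup<>(
            List.of("a", "bb", "ccc"), s -> (s.length() - 1) % 3, String::hashCode, 16);
        System.out.println(f.getLong("bb"));                // 1
        System.out.println(f.getLong("zz"));                // -1
        System.out.println(falsePositiveProbability(8));    // 0.00390625
    }
}
```

With 8-bit signatures the same query for "zz" would be a false positive in this toy setup, since the low 8 bits of the hash codes of "bb" and "zz" coincide; wider signatures make such collisions exponentially less likely.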
The technique used is described by Djamal Belazzougui, Fabiano C. Botelho and Martin Dietzfelbinger
in “Hash, displace and compress”, Algorithms - ESA 2009, LNCS 5757, pages 682−693, 2009.
However, with respect to the algorithm described in the paper, this implementation
is much more scalable, as it uses a ChunkedHashStore
to split the generation of large key sets into generation of smaller functions for each chunk (of size
approximately 2^16).
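To make the chunking idea concrete, here is one plausible way to split a key set into a power-of-two number of chunks of roughly 2^16 keys each, using the top bits of a 64-bit hash to select the chunk. This is a sketch under assumptions: `ChunkingSketch`, `log2Chunks` and `chunkOf` are hypothetical names, the value 16 is assumed from the text above, and the actual logic of ChunkedHashStore may differ.

```java
/** Hypothetical sketch of hash-based chunking; not the ChunkedHashStore implementation. */
final class ChunkingSketch {
    static final int LOG2_CHUNK_SIZE = 16;  // assumed value, matching "approximately 2^16"

    /** Base-2 logarithm of the number of chunks used for n keys (floor of log2(n / 2^16)). */
    static int log2Chunks(long n) {
        final long q = n >>> LOG2_CHUNK_SIZE;
        return q == 0 ? 0 : 63 - Long.numberOfLeadingZeros(q);
    }

    /** Chunk index of a key, taken from the top bits of its 64-bit hash. */
    static int chunkOf(long hash, int log2Chunks) {
        return log2Chunks == 0 ? 0 : (int) (hash >>> (64 - log2Chunks));
    }

    public static void main(String[] args) {
        final int log2 = log2Chunks(1L << 20);
        System.out.println(1 << log2);          // 16 chunks for 2^20 keys
        System.out.println(chunkOf(-1L, log2)); // 15, the last chunk
    }
}
```

Because each chunk gets its own small function, the working set during construction stays bounded no matter how large the whole key set is.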
Modifier and Type | Class | Description |
---|---|---|
static class | CHDMinimalPerfectHashFunction.Builder<T> | A builder class for CHDMinimalPerfectHashFunction. |
Modifier and Type | Field | Description |
---|---|---|
protected EliasFanoLongBigList | coefficients | The displacement coefficients. |
protected long | globalSeed | The seed used to generate the initial hash triple. |
static int | LOG2_CHUNK_SIZE | The logarithm of the desired chunk size. |
protected long | n | The number of keys. |
protected SparseRank | rank | The sparse ranking structure containing the unused entries. |
static long | serialVersionUID | |
protected long | signatureMask | The mask to compare signatures, or zero for no signatures. |
protected LongBigList | signatures | The signatures. |
protected TransformationStrategy<? super T> | transform | The transformation strategy. |
defRetValue
Modifier | Constructor | Description |
---|---|---|
protected | CHDMinimalPerfectHashFunction(java.lang.Iterable<? extends T> keys, TransformationStrategy<? super T> transform, int lambda, double loadFactor, int signatureWidth, java.io.File tempDir, ChunkedHashStore<T> chunkedHashStore) | Creates a new CHD minimal perfect hash function for the given keys. |
Modifier and Type | Method | Description |
---|---|---|
long | getLong(java.lang.Object key) | |
static void | main(java.lang.String[] arg) | |
long | numBits() | Returns the number of bits used by this structure. |
long | size64() | |
containsKey, size
defaultReturnValue, defaultReturnValue
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
applyAsLong, get, put, put, remove, removeLong
public static final long serialVersionUID
public static final int LOG2_CHUNK_SIZE
protected final long n
protected final long globalSeed
protected final TransformationStrategy<? super T> transform
protected final EliasFanoLongBigList coefficients
protected final SparseRank rank
protected final long signatureMask
protected final LongBigList signatures
protected CHDMinimalPerfectHashFunction(java.lang.Iterable<? extends T> keys, TransformationStrategy<? super T> transform, int lambda, double loadFactor, int signatureWidth, java.io.File tempDir, ChunkedHashStore<T> chunkedHashStore) throws java.io.IOException
Parameters:
keys - the keys to hash, or null.
transform - a transformation strategy for the keys.
lambda - the average bucket size.
loadFactor - the load factor.
signatureWidth - a signature width, or 0 for no signature.
tempDir - a temporary directory for the store files, or null for the standard temporary directory.
chunkedHashStore - a chunked hash store containing the keys, or null; the store can be unchecked, but in this case keys and transform must be non-null.
Throws:
java.io.IOException
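The roles of lambda and loadFactor may be easier to see in a toy hash-and-displace construction. The sketch below is a simplified illustration, not the implementation in this class: `ToyHashDisplace` and all its members are hypothetical, keys are plain strings hashed through `String.hashCode()`, the displacement of a bucket is used as a hash seed instead of the displacement pairs of the paper, and ranking over the unused slots is a plain array rather than a sparse structure.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Toy hash-and-displace construction; a simplified sketch, not the Sux4J implementation. */
final class ToyHashDisplace {
    private final int buckets;         // about n / lambda
    private final int slots;           // about n / loadFactor
    private final int[] displacement;  // one displacement (here: a hash seed) per bucket
    private final int[] unusedBefore;  // unusedBefore[p] = number of empty slots in [0, p)

    ToyHashDisplace(List<String> keys, int lambda, double loadFactor) {
        if (lambda < 1 || loadFactor <= 0 || loadFactor > 1) throw new IllegalArgumentException();
        final int n = keys.size();
        buckets = Math.max(1, (n + lambda - 1) / lambda);
        slots = Math.max(1, (int) Math.ceil(n / loadFactor));
        displacement = new int[buckets];

        // Distribute the keys into buckets of average size lambda.
        final List<List<String>> byBucket = new ArrayList<>();
        for (int b = 0; b < buckets; b++) byBucket.add(new ArrayList<>());
        for (String key : keys) byBucket.get(bucketOf(key)).add(key);

        // Place the largest buckets first, while the table is still mostly empty.
        final Integer[] order = new Integer[buckets];
        for (int b = 0; b < buckets; b++) order[b] = b;
        Arrays.sort(order, (x, y) -> byBucket.get(y).size() - byBucket.get(x).size());
        final boolean[] used = new boolean[slots];
        for (int b : order) displacement[b] = place(byBucket.get(b), used);

        // Ranking over the unused slots turns the perfect hash into a minimal one.
        unusedBefore = new int[slots];
        for (int p = 1; p < slots; p++) unusedBefore[p] = unusedBefore[p - 1] + (used[p - 1] ? 0 : 1);
    }

    /** Finds the first displacement sending every key of the bucket to a distinct free slot. */
    private int place(List<String> bucket, boolean[] used) {
        for (int d = 0; d < (1 << 20); d++) {
            final List<Integer> taken = new ArrayList<>();
            for (String key : bucket) {
                final int p = slotOf(key, d);
                if (used[p]) break;
                used[p] = true;
                taken.add(p);
            }
            if (taken.size() == bucket.size()) return d;
            for (int p : taken) used[p] = false;  // roll back and try the next displacement
        }
        throw new IllegalStateException("No displacement found (duplicate keys or colliding hash codes?)");
    }

    private int bucketOf(String key) {
        return (int) Long.remainderUnsigned(mix(key.hashCode()), buckets);
    }

    private int slotOf(String key, int d) {
        return (int) Long.remainderUnsigned(mix(key.hashCode() ^ ((long) (d + 1) << 32)), slots);
    }

    /** Returns the rank in 0..n-1 of a key of the original list; unspecified for other keys. */
    long getLong(String key) {
        final int p = slotOf(key, displacement[bucketOf(key)]);
        return p - unusedBefore[p];
    }

    /** A 64-bit mixing function (the SplitMix64 finaliser). */
    private static long mix(long x) {
        x += 0x9E3779B97F4A7C15L;
        x = (x ^ (x >>> 30)) * 0xBF58476D1CE4E5B9L;
        x = (x ^ (x >>> 27)) * 0x94D049BB133111EBL;
        return x ^ (x >>> 31);
    }

    public static void main(String[] args) {
        final List<String> keys = new ArrayList<>();
        for (int i = 0; i < 20; i++) keys.add("key" + i);
        final ToyHashDisplace f = new ToyHashDisplace(keys, 5, 0.8);
        for (String key : keys) System.out.println(key + " -> " + f.getLong(key));
    }
}
```

In this toy version a larger lambda means fewer buckets, hence fewer displacements to store but a longer search for each one, while a smaller load factor leaves more free slots and makes each search succeed sooner at the cost of a larger ranking structure.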
public long numBits()
public long getLong(java.lang.Object key)
Specified by: getLong in interface Object2LongFunction<T>
public long size64()
Specified by: size64 in interface Size64
Overrides: size64 in class AbstractHashFunction<T>
public static void main(java.lang.String[] arg) throws java.lang.NoSuchMethodException, java.io.IOException, JSAPException
Throws:
java.lang.NoSuchMethodException
java.io.IOException
JSAPException