it.unimi.dsi.util
Class ImmutableExternalPrefixMap

java.lang.Object
  extended by it.unimi.dsi.fastutil.objects.AbstractObject2LongFunction<CharSequence>
      extended by it.unimi.dsi.util.AbstractPrefixMap
          extended by it.unimi.dsi.util.ImmutableExternalPrefixMap
All Implemented Interfaces:
Function<CharSequence,Long>, Object2LongFunction<CharSequence>, PrefixMap<MutableString>, StringMap<MutableString>, Serializable

public class ImmutableExternalPrefixMap
extends AbstractPrefixMap
implements Serializable

An immutable prefix map mostly stored in external memory. An ImmutableExternalPrefixMap compresses words using a HuTuckerCodec and approximates intervals using an ImmutableBinaryTrie that uses the same codec.

This class releases on a dump stream most of the data that would be contained in the corresponding internal-memory map. More precisely, each block (with user-definable length, possibly the size of a basic disk I/O operation) is filled as much as possible with strings front coded and compressed with a HuTuckerCodec. Each block starts with the length of the first string in unary, followed by the encoding of the string. Then, for each string we write in unary the length of the common prefix (in characters) with the previous string, the length of the remaining suffix (in characters) and finally the encoded suffix. Note that if the encoding of a string is longer than a block, the string will occupy more than one block.

We keep track using an ImmutableBinaryTrie of the strings at the start of each block: thus, we are able to retrieve the interval corresponding to a given prefix by calling getApproximatedInterval() and scanning at most two blocks.

Self-contained or non-self-contained

There are two kinds of external prefix maps: self-contained and non-self-contained. In the first case, you get a serialised object that you can load at any time. The dump stream is serialised with the object and expanded at each deserialisation in the Java temporary directory. If you deserialise a map several times, you will get correspondingly many copies of the dump stream in the temporary directory. The dump streams are deleted when the JVM exits. This mechanism is not very efficient, but since this class implements several interfaces it is essential that clients can make the thing work in a standard way.

Alternatively, you can give at creation time a filename for the dump stream. The resulting non-self-contained external prefix map can be serialised, but after deserialisation you need to set back the dump stream filename or even directly the dump stream (for instance, to an input bit stream wrapping a byte array where the dump stream has been loaded). You can deserialise many copies of an external prefix map, letting all copies share the same dump stream.

This data structure is not synchronised, and concurrent reads may cause problems because of clashes in the usage of the underlying input bit stream. It would not be a good idea in any case to open a new stream for each caller, as that would certainly lead to disk thrashing.

The main method of this class helps in building large external prefix maps.

Since:
0.9.3
Author:
Sebastiano Vigna
See Also:
Serialized Form

Field Summary
protected  int[] blockOffset
          An array parallel to blockStart giving the offset in blocks in the dump file of the corresponding word in blockStart.
protected  long blockSize
          The block size of this (in bits).
protected  int[] blockStart
          The index of the first word in each block, plus an additional entry containing size.
protected  Object2IntLinkedOpenHashMap<MutableString> cache
          A cache for the most recent queries
static int CACHE_MAX_SIZE
          The maximum number of entry in the cache map.
protected  Char2IntOpenHashMap char2symbol
          A map from characters to symbols of the coder.
protected  Decoder decoder
          A decoder used to read data from the dump stream.
protected  InputBitStream dumpStream
          A reference to the dump stream.
protected  ImmutableBinaryTrie<CharSequence> intervalApproximator
          The in-memory data structure used to approximate intervals..
protected  boolean iteratorIsUsable
          If true, the creation of the last DumpStreamIterator was not followed by a call to any get method.
protected  boolean selfContained
          Whether this map is self-contained.
static long serialVersionUID
           
protected  int size
          The number of terms in this map.
static int STD_BLOCK_SIZE
          The standard block size (in bytes).
protected  char[] symbol2char
          A map (given by an array) from symbols in the coder to characters.
 
Fields inherited from class it.unimi.dsi.util.AbstractPrefixMap
list, prefixMap, rangeMap
 
Fields inherited from class it.unimi.dsi.fastutil.objects.AbstractObject2LongFunction
defRetValue
 
Constructor Summary
ImmutableExternalPrefixMap(Iterable<? extends CharSequence> terms)
          Creates an external prefix map with block size STD_BLOCK_SIZE.
ImmutableExternalPrefixMap(Iterable<? extends CharSequence> terms, CharSequence dumpStreamFilename)
          Creates an external map with block size STD_BLOCK_SIZE and specified dump stream.
ImmutableExternalPrefixMap(Iterable<? extends CharSequence> terms, int blockSizeInBytes)
          Creates an external map with specified block size.
ImmutableExternalPrefixMap(Iterable<? extends CharSequence> terms, int blockSizeInBytes, CharSequence dumpStreamFilename)
          map external map.
 
Method Summary
 boolean containsKey(Object term)
           
 Interval getInterval(CharSequence prefix)
          Returns the range of strings having a given prefix.
 long getLong(Object o)
           
protected  MutableString getTerm(int index, MutableString s)
          Writes a string specified by index into a MutableString.
 ObjectIterator<CharSequence> iterator()
          Returns an iterator over the map.
static void main(String[] arg)
           
 void setDumpStream(CharSequence dumpStreamFilename)
          Sets the dump stream of this external prefix map to a given filename.
 void setDumpStream(InputBitStream dumpStream)
          Sets the dump stream of this external prefix map to a given input bit stream.
 int size()
           
 
Methods inherited from class it.unimi.dsi.util.AbstractPrefixMap
list, prefixMap, rangeMap
 
Methods inherited from class it.unimi.dsi.fastutil.objects.AbstractObject2LongFunction
clear, defaultReturnValue, defaultReturnValue, get, put, put, remove, removeLong
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 
Methods inherited from interface it.unimi.dsi.fastutil.objects.Object2LongFunction
defaultReturnValue, defaultReturnValue, put, removeLong
 
Methods inherited from interface it.unimi.dsi.fastutil.Function
clear, get, put, remove
 

Field Detail

serialVersionUID

public static final long serialVersionUID
See Also:
Constant Field Values

STD_BLOCK_SIZE

public static final int STD_BLOCK_SIZE
The standard block size (in bytes).

See Also:
Constant Field Values

CACHE_MAX_SIZE

public static final int CACHE_MAX_SIZE
The maximum number of entry in the cache map.

See Also:
Constant Field Values

intervalApproximator

protected final ImmutableBinaryTrie<CharSequence> intervalApproximator
The in-memory data structure used to approximate intervals..


blockSize

protected final long blockSize
The block size of this (in bits).


decoder

protected final Decoder decoder
A decoder used to read data from the dump stream.


symbol2char

protected final char[] symbol2char
A map (given by an array) from symbols in the coder to characters.


char2symbol

protected final Char2IntOpenHashMap char2symbol
A map from characters to symbols of the coder.


size

protected final int size
The number of terms in this map.


blockStart

protected final int[] blockStart
The index of the first word in each block, plus an additional entry containing size.


blockOffset

protected final int[] blockOffset
An array parallel to blockStart giving the offset in blocks in the dump file of the corresponding word in blockStart. If there are no overflows, this will just be an initial segment of the natural numbers, but overflows cause jumps.


selfContained

protected final boolean selfContained
Whether this map is self-contained.


iteratorIsUsable

protected transient boolean iteratorIsUsable
If true, the creation of the last DumpStreamIterator was not followed by a call to any get method.


dumpStream

protected transient InputBitStream dumpStream
A reference to the dump stream.


cache

protected transient Object2IntLinkedOpenHashMap<MutableString> cache
A cache for the most recent queries

Constructor Detail

ImmutableExternalPrefixMap

public ImmutableExternalPrefixMap(Iterable<? extends CharSequence> terms,
                                  int blockSizeInBytes,
                                  CharSequence dumpStreamFilename)
                           throws IOException
map external map.

This constructor does not assume that strings returned by terms.iterator() will be distinct. Thus, it can be safely used with FileLinesCollection.

Parameters:
terms - an iterable whose iterator will enumerate in lexicographical order the terms for the map.
blockSizeInBytes - the block size (in bytes).
dumpStreamFilename - the name of the dump stream, or null for a map with an automatic dump stream.
Throws:
IOException

ImmutableExternalPrefixMap

public ImmutableExternalPrefixMap(Iterable<? extends CharSequence> terms,
                                  CharSequence dumpStreamFilename)
                           throws IOException
Creates an external map with block size STD_BLOCK_SIZE and specified dump stream.

This constructor does not assume that strings returned by terms.iterator() will be distinct. Thus, it can be safely used with FileLinesCollection.

Parameters:
terms - a collection whose iterator will enumerate in lexicographical order the terms for the map.
dumpStreamFilename - the name of the dump stream, or null for a map with an automatic dump stream.
Throws:
IOException

ImmutableExternalPrefixMap

public ImmutableExternalPrefixMap(Iterable<? extends CharSequence> terms,
                                  int blockSizeInBytes)
                           throws IOException
Creates an external map with specified block size.

This constructor does not assume that strings returned by terms.iterator() will be distinct. Thus, it can be safely used with FileLinesCollection.

Parameters:
blockSizeInBytes - the block size (in bytes).
terms - a collection whose iterator will enumerate in lexicographical order the terms for the map.
Throws:
IOException

ImmutableExternalPrefixMap

public ImmutableExternalPrefixMap(Iterable<? extends CharSequence> terms)
                           throws IOException
Creates an external prefix map with block size STD_BLOCK_SIZE.

This constructor does not assume that strings returned by terms.iterator() will be distinct. Thus, it can be safely used with FileLinesCollection.

Parameters:
terms - a collection whose iterator will enumerate in lexicographical order the terms for the map.
Throws:
IOException
Method Detail

setDumpStream

public void setDumpStream(CharSequence dumpStreamFilename)
                   throws FileNotFoundException
Sets the dump stream of this external prefix map to a given filename.

This method sets the dump file used by this map, and should be only called after deserialisation, providing exactly the file generated at creation time. Essentially anything can happen if you do not follow the rules.

Note that this method will attempt to close the old stream, if present.

Parameters:
dumpStreamFilename - the name of the dump file.
Throws:
FileNotFoundException
See Also:
setDumpStream(InputBitStream)

setDumpStream

public void setDumpStream(InputBitStream dumpStream)
Sets the dump stream of this external prefix map to a given input bit stream.

This method sets the dump file used by this map, and should be only called after deserialisation, providing a repositionable stream containing exactly the file generated at creation time. Essentially anything can happen if you do not follow the rules.

Using this method you can load an external prefix map in core memory, enjoying the compactness of the data structure, but getting much more speed.

Note that this method will attemp to close the old stream, if present.

Parameters:
dumpStream - a repositionable input bit stream containing exactly the dump stream generated at creation time.
See Also:
setDumpStream(CharSequence)

getInterval

public Interval getInterval(CharSequence prefix)
Description copied from class: AbstractPrefixMap
Returns the range of strings having a given prefix.

Specified by:
getInterval in class AbstractPrefixMap
Parameters:
prefix - a prefix.
Returns:
the corresponding range of strings as an interval.

getTerm

protected MutableString getTerm(int index,
                                MutableString s)
Description copied from class: AbstractPrefixMap
Writes a string specified by index into a MutableString.

Specified by:
getTerm in class AbstractPrefixMap
Parameters:
index - the index of a string.
s - a mutable string.
Returns:
string.

containsKey

public boolean containsKey(Object term)
Specified by:
containsKey in interface Function<CharSequence,Long>

getLong

public long getLong(Object o)
Specified by:
getLong in interface Object2LongFunction<CharSequence>

iterator

public ObjectIterator<CharSequence> iterator()
Returns an iterator over the map.

The iterator returned by this method scans directly the dump stream.

Note that the returned iterator uses the same stream as all get methods. Calling such methods while the iterator is being used will produce an IllegalStateException.

Returns:
an iterator over the map that just scans the dump stream.

size

public int size()
Specified by:
size in interface Function<CharSequence,Long>

main

public static void main(String[] arg)
                 throws ClassNotFoundException,
                        IOException,
                        JSAPException,
                        SecurityException,
                        NoSuchMethodException
Throws:
ClassNotFoundException
IOException
JSAPException
SecurityException
NoSuchMethodException