it.unimi.dsi.util
Class FrontCodedStringList

java.lang.Object
  extended by java.util.AbstractCollection<K>
      extended by it.unimi.dsi.fastutil.objects.AbstractObjectCollection<K>
          extended by it.unimi.dsi.fastutil.objects.AbstractObjectList<MutableString>
              extended by it.unimi.dsi.util.FrontCodedStringList
All Implemented Interfaces:
ObjectCollection<MutableString>, ObjectIterable<MutableString>, ObjectList<MutableString>, Stack<MutableString>, Serializable, Comparable<List<? extends MutableString>>, Iterable<MutableString>, Collection<MutableString>, List<MutableString>

public class FrontCodedStringList
extends AbstractObjectList<MutableString>
implements Serializable

Compact storage of strings using front-coding compression.

This class stores a list of strings using front-coding compression (of course, the compression will be reasonable only if the list is sorted, but you could also use instances of this class just as a handy way to manage a large amount of strings). It implements an immutable ObjectList that returns the i-th string (as a MutableString) when the get(int) method is called with argument i. The returned mutable string may be freely modified.

As a commodity, this class provides a main method that reads from standard input a sequence of newline-separated words, and writes a corresponding serialized front-coded string list.

Implementation Details

To store the list of strings, we use either a UTF-8 coded ByteArrayFrontCodedList, or a CharArrayFrontCodedList, depending on the value of the utf8 parameter at creation time. In the first case, if the strings are ASCII-oriented the resulting array will be much smaller, but access times will increase manifold, as each string must be UTF-8 encoded before being returned.

See Also:
Serialized Form

Nested Class Summary
 
Nested classes/interfaces inherited from class it.unimi.dsi.fastutil.objects.AbstractObjectList
AbstractObjectList.ObjectSubList<K>
 
Field Summary
protected  ByteArrayFrontCodedList byteFrontCodedList
          The underlying ByteArrayFrontCodedList, or null.
protected  CharArrayFrontCodedList charFrontCodedList
          The underlying CharArrayFrontCodedList, or null.
static long serialVersionUID
           
protected  boolean utf8
          Whether this front-coded list is UTF-8 encoded.
 
Constructor Summary
FrontCodedStringList(Collection<? extends CharSequence> c, int ratio, boolean utf8)
          Creates a new front-coded string list containing the character sequences contained in the given collection.
FrontCodedStringList(Iterator<? extends CharSequence> words, int ratio, boolean utf8)
          Creates a new front-coded string list containing the character sequences returned by the given iterator.
 
Method Summary
protected static char[] byte2Char(byte[] a, char[] s)
           
protected static int countUTF8Chars(byte[] a)
           
 MutableString get(int index)
          Returns the element at the specified position in this front-coded as a mutable string.
 void get(int index, MutableString s)
          Returns the element at the specified position in this front-coded list by storing it in a mutable string.
 ObjectListIterator<MutableString> listIterator(int k)
           
static void main(String[] arg)
           
 int ratio()
          Returns the ratio of the underlying front-coded list.
 int size()
           
 boolean utf8()
          Returns whether this front-coded string list is storing its strings as UTF-8 encoded bytes.
 
Methods inherited from class it.unimi.dsi.fastutil.objects.AbstractObjectList
add, add, addAll, addAll, addElements, addElements, compareTo, contains, ensureIndex, ensureRestrictedIndex, equals, getElements, hashCode, indexOf, iterator, lastIndexOf, listIterator, objectListIterator, objectListIterator, objectSubList, peek, pop, push, remove, removeElements, set, size, subList, top, toString
 
Methods inherited from class it.unimi.dsi.fastutil.objects.AbstractObjectCollection
containsAll, isEmpty, objectIterator, removeAll, retainAll, toArray, toArray
 
Methods inherited from class java.util.AbstractCollection
clear, remove
 
Methods inherited from class java.lang.Object
clone, finalize, getClass, notify, notifyAll, wait, wait, wait
 
Methods inherited from interface java.util.List
clear, containsAll, isEmpty, remove, removeAll, retainAll, toArray, toArray
 
Methods inherited from interface it.unimi.dsi.fastutil.objects.ObjectCollection
objectIterator, toArray
 
Methods inherited from interface it.unimi.dsi.fastutil.Stack
isEmpty
 

Field Detail

serialVersionUID

public static final long serialVersionUID
See Also:
Constant Field Values

byteFrontCodedList

protected final ByteArrayFrontCodedList byteFrontCodedList
The underlying ByteArrayFrontCodedList, or null.


charFrontCodedList

protected final CharArrayFrontCodedList charFrontCodedList
The underlying CharArrayFrontCodedList, or null.


utf8

protected final boolean utf8
Whether this front-coded list is UTF-8 encoded.

Constructor Detail

FrontCodedStringList

public FrontCodedStringList(Iterator<? extends CharSequence> words,
                            int ratio,
                            boolean utf8)
Creates a new front-coded string list containing the character sequences returned by the given iterator.

Parameters:
words - an iterator returning character sequences.
ratio - the desired ratio.
utf8 - if true, the strings will be stored as UTF-8 byte arrays.

FrontCodedStringList

public FrontCodedStringList(Collection<? extends CharSequence> c,
                            int ratio,
                            boolean utf8)
Creates a new front-coded string list containing the character sequences contained in the given collection.

Parameters:
c - a collection containing character sequences.
ratio - the desired ratio.
utf8 - if true, the strings will be stored as UTF-8 byte arrays.
Method Detail

utf8

public boolean utf8()
Returns whether this front-coded string list is storing its strings as UTF-8 encoded bytes.

Returns:
true if this front-coded string list is keeping its data as an array of UTF-8 encoded bytes.

ratio

public int ratio()
Returns the ratio of the underlying front-coded list.

Returns:
the ratio of the underlying front-coded list.

get

public MutableString get(int index)
Returns the element at the specified position in this front-coded as a mutable string.

Specified by:
get in interface List<MutableString>
Parameters:
index - an index in the list.
Returns:
a MutableString that will contain the string at the specified position. The string may be freely modified.

get

public void get(int index,
                MutableString s)
Returns the element at the specified position in this front-coded list by storing it in a mutable string.

Parameters:
index - an index in the list.
s - a mutable string that will contain the string at the specified position.

countUTF8Chars

protected static int countUTF8Chars(byte[] a)

byte2Char

protected static char[] byte2Char(byte[] a,
                                  char[] s)

listIterator

public ObjectListIterator<MutableString> listIterator(int k)
Specified by:
listIterator in interface ObjectList<MutableString>
Specified by:
listIterator in interface List<MutableString>
Overrides:
listIterator in class AbstractObjectList<MutableString>

size

public int size()
Specified by:
size in interface Collection<MutableString>
Specified by:
size in interface List<MutableString>
Specified by:
size in class AbstractCollection<MutableString>

main

public static void main(String[] arg)
                 throws IOException,
                        JSAPException,
                        NoSuchMethodException
Throws:
IOException
JSAPException
NoSuchMethodException