Class CompressionAlgorithm

  • All Implemented Interfaces:
    org.apache.hadoop.conf.Configurable

    public class CompressionAlgorithm
    extends org.apache.hadoop.conf.Configured
    There is a static initializer in Compression that finds all implementations of CompressionAlgorithmConfiguration and initializes a CompressionAlgorithm instance for each. This establishes the following call graph: initialization by the static initializer, followed by calls to getCodec(), createCompressionStream(OutputStream, Compressor, int), and createDecompressionStream(InputStream, Decompressor, int). In some cases, the calls to the compression and decompression methods will request a different buffer size for the stream. Note that if the requested buffer size is zero, the buffer size is not set for that algorithm; the codec's default is used instead.
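    As a usage sketch, the call graph above might be exercised as follows. How the CompressionAlgorithm instance is obtained is not shown (the static initializer in Compression creates them, and its import is omitted here), and the OutputStream return type of createCompressionStream is an assumption made by symmetry with createDecompressionStream.

        import java.io.ByteArrayInputStream;
        import java.io.ByteArrayOutputStream;
        import java.io.IOException;
        import java.io.InputStream;
        import java.io.OutputStream;

        import org.apache.hadoop.io.compress.Compressor;
        import org.apache.hadoop.io.compress.Decompressor;

        public class CompressionRoundTrip {

          // Compress data following the documented call graph. A buffer size of 0
          // tells the algorithm to keep the codec's default buffer size.
          static byte[] compress(CompressionAlgorithm algorithm, byte[] data) throws IOException {
            ByteArrayOutputStream sink = new ByteArrayOutputStream();
            Compressor compressor = algorithm.getCompressor();
            try {
              OutputStream out = algorithm.createCompressionStream(sink, compressor, 0);
              out.write(data);
              out.close(); // flushes and finishes the compressed stream
            } finally {
              algorithm.returnCompressor(compressor); // hand the Compressor back for re-use
            }
            return sink.toByteArray();
          }

          // Decompress with the mirror-image sequence of calls.
          static byte[] decompress(CompressionAlgorithm algorithm, byte[] compressed) throws IOException {
            Decompressor decompressor = algorithm.getDecompressor();
            try (InputStream in = algorithm.createDecompressionStream(
                new ByteArrayInputStream(compressed), decompressor, 0)) {
              return in.readAllBytes();
            } finally {
              algorithm.returnDecompressor(decompressor);
            }
          }
        }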

    The buffer size is configured in the codec by way of a Hadoop Configuration reference. One approach would be to share a single Configuration object, but calls to createCompressionStream and createDecompressionStream with non-default buffer sizes would then have to modify that shared object. Concurrent calls would mutate the Configuration underneath one another, requiring synchronization to avoid the ill effects of co-modification. To avoid synchronization entirely, we instead create each codec with its own Configuration object and cache the codecs for re-use. A default codec is created statically, as mentioned above, so that a codec is always available at loader initialization.
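    A minimal sketch of the per-codec Configuration approach, assuming Hadoop's GzipCodec as the example codec and io.file.buffer.size as the buffer-size key it reads:

        import org.apache.hadoop.conf.Configuration;
        import org.apache.hadoop.io.compress.CompressionCodec;
        import org.apache.hadoop.io.compress.GzipCodec;
        import org.apache.hadoop.util.ReflectionUtils;

        public class PrivateConfCodecs {

          // Build a codec whose Configuration is private to it. Nothing else holds
          // a reference to conf, so no later call can mutate it underneath a stream.
          static CompressionCodec gzCodecWithBufferSize(int bufferSize) {
            Configuration conf = new Configuration();
            if (bufferSize > 0) {
              // Assumed key: gzip consults io.file.buffer.size for its stream
              // buffer; other codecs read their own keys.
              conf.setInt("io.file.buffer.size", bufferSize);
            }
            // ReflectionUtils injects conf via the Configurable interface.
            return ReflectionUtils.newInstance(GzipCodec.class, conf);
          }
        }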

    There is a Guava cache defined within Algorithm that allows us to cache codecs for re-use. Since each cached codec has its own Configuration object and therefore never needs to be mutated, there is no concern about using the codecs concurrently; the Guava cache exists to enforce a maximum cache size and to provide efficient, concurrent read/write access to the cache itself.
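    Illustratively, such a cache might be declared as below; the field name and maximum size are assumptions, not the actual values in the code:

        import com.google.common.cache.CacheBuilder;
        import com.google.common.cache.CacheLoader;
        import com.google.common.cache.LoadingCache;

        import org.apache.hadoop.conf.Configuration;
        import org.apache.hadoop.io.compress.CompressionCodec;
        import org.apache.hadoop.io.compress.GzipCodec;
        import org.apache.hadoop.util.ReflectionUtils;

        public class CodecCacheSketch {

          // Bounded, thread-safe cache from requested buffer size to a codec
          // created with that size baked into its private Configuration. The
          // maximum size of 128 is an illustrative bound, not the source's value.
          static final LoadingCache<Integer, CompressionCodec> GZ_CODECS =
              CacheBuilder.newBuilder().maximumSize(128)
                  .build(new CacheLoader<Integer, CompressionCodec>() {
                    @Override
                    public CompressionCodec load(Integer bufferSize) {
                      Configuration conf = new Configuration();
                      conf.setInt("io.file.buffer.size", bufferSize);
                      return ReflectionUtils.newInstance(GzipCodec.class, conf);
                    }
                  });

          static CompressionCodec gzCodec(int bufferSize) {
            // The cache synchronizes loading internally; callers never lock, and
            // the returned codecs are never mutated, so concurrent use is safe.
            return GZ_CODECS.getUnchecked(bufferSize);
          }
        }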

    To summarize the Algorithm-specific details implemented in the code (a sketch of the selection rule follows this list):

    • LZO will always use the default LZO codec, because the buffer size is never overridden within it.

    • LZ4 will always use the default LZ4 codec, because the buffer size is never overridden within it.

    • GZ will use the default GZ codec for the compression stream, but may use a different codec instance for the decompression stream if the requested buffer size does not match the default GZ buffer size of 32k.

    • Snappy will use the default Snappy codec with its default buffer size of 64k for the compression stream, but will use a cached codec if the requested buffer size differs from the default.
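    For example, the GZ rule might look like the following sketch (the method and constant names are illustrative, and CodecCacheSketch refers to the cache sketch above):

        import org.apache.hadoop.io.compress.CompressionCodec;

        public class BufferSizeSelection {

          static final int GZ_DEFAULT_BUFFER_SIZE = 32 * 1024; // 32k, per the list above

          // Pick a codec for a GZ decompression stream: a requested size of zero
          // or the 32k default reuses the statically created codec; any other
          // size pulls a codec configured with that size from the cache.
          static CompressionCodec gzDecompressionCodec(CompressionCodec defaultCodec,
              int requestedBufferSize) {
            if (requestedBufferSize <= 0 || requestedBufferSize == GZ_DEFAULT_BUFFER_SIZE) {
              return defaultCodec;
            }
            return CodecCacheSketch.gzCodec(requestedBufferSize);
          }
        }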

    • Method Detail

      • createDecompressionStream

        public InputStream createDecompressionStream(InputStream downStream,
                                                     org.apache.hadoop.io.compress.Decompressor decompressor,
                                                     int downStreamBufferSize)
                                              throws IOException
        Creates a decompression stream that wraps downStream, using the given Decompressor and the requested buffer size. A downStreamBufferSize of zero leaves the codec's default buffer size in place.
        Throws:
        IOException
      • getCompressor

        public org.apache.hadoop.io.compress.Compressor getCompressor()
        Returns a Compressor for this algorithm's codec; it should be handed back via returnCompressor(Compressor) when no longer needed.
      • returnCompressor

        public void returnCompressor(org.apache.hadoop.io.compress.Compressor compressor)
        Returns the specified Compressor to the codec cache if it is not null.
      • getDecompressor

        public org.apache.hadoop.io.compress.Decompressor getDecompressor()
        Returns a Decompressor for this algorithm's codec; it should be handed back via returnDecompressor(Decompressor) when no longer needed.
      • returnDecompressor

        public void returnDecompressor(org.apache.hadoop.io.compress.Decompressor decompressor)
        Returns the specified Decompressor to the codec cache if it is not null.
      • getName

        public String getName()
        Returns the name of the compression algorithm.
        Returns:
        the name