Interface HtsCodec<D extends HtsDecoderOptions,E extends HtsEncoderOptions>

Type Parameters:
D - the decoder options type for this codec
E - the encoder options type for this codec
All Superinterfaces:
Upgradeable
All Known Subinterfaces:
HaploidReferenceCodec, ReadsCodec, VariantsCodec
All Known Implementing Classes:
BAMCodec, BAMCodecV1_0, CRAMCodec, CRAMCodecV2_1, CRAMCodecV3_0, FASTACodecV1_0, HtsgetBAMCodec, HtsgetBAMCodecV1_2, SAMCodec, SAMCodecV1_0, VCFCodec, VCFCodecV3_2, VCFCodecV3_3, VCFCodecV4_0, VCFCodecV4_1, VCFCodecV4_2, VCFCodecV4_3

public interface HtsCodec<D extends HtsDecoderOptions,E extends HtsEncoderOptions> extends Upgradeable
Base interface implemented by all htsjdk.beta.plugin codecs.

Codec Components

Each version of a file format supported by the htsjdk.beta.plugin framework is represented by a trio of components: an HtsCodec, an HtsEncoder, and an HtsDecoder.

The HtsCodec is a lightweight and long-lived object that resides in an HtsCodecRegistry. A registry is used to resolve requests for an HtsEncoder or HtsDecoder that matches a given resource. The HtsEncoder and HtsDecoder objects do the work of actually writing and reading records to and from underlying resources.

A default, static, immutable HtsCodecRegistry is populated with HtsCodecs that are discovered and instantiated statically via a ServiceLoader, and can be accessed using HtsDefaultRegistry. A private, mutable registry can be created at runtime via HtsCodecRegistry.createPrivateRegistry(), and populated dynamically by calls to HtsCodecRegistry.registerCodec(HtsCodec).

The primary responsibility of an HtsCodec is to satisfy requests made by the framework during codec resolution, inspecting and recognizing input URIs and stream resources that match the supported format and version, and providing an HtsEncoder or HtsDecoder on demand, once a match is made.

Content Types

The plugin framework supports four different types of HTS data, called content types:

For each content type, there is a corresponding set of codec/decoder/encoder interfaces that are implemented by components that support that content type. These interfaces extend generic base interfaces, and provide generic parameter type instantiations appropriate for that content type. As an example, see ReadsDecoder which defines the interface for all HtsDecoders for the HtsContentType.ALIGNED_READS content type. The different implementations of component trios for a given content type all use the same content-type-specific interfaces, but each over a different combination of underlying file format and version.

The generic, base interfaces that are common to all codecs, encoders, and decoders are:

The packages containing the content type-specific interface definitions for each of the four different content types are:

Example Content Type: Reads

As an example, the htsjdk.beta.plugin.reads package defines the following interfaces that extend the generic base interfaces for codecs with content type HtsContentType.ALIGNED_READS:

Codec Resolution

The plugin framework uses registered codecs to conduct a series of probes into the structure and format of an input or output resource in order to find a matching codec that can produce an encoder or decoder for that resource. The values returned from the codec methods are used by the framework to prune a list of candidate codecs down, until a match is found. During codec resolution, the codec methods are called in the following order:

  1. ownsURI(IOPath)
  2. canDecodeURI(IOPath)
  3. canDecodeSignature(SignatureStream, String)
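
The pruning order above can be illustrated with a minimal, self-contained sketch. It is not the framework's resolver: ProbeCodec, ExtensionCodec, SchemeCodec, ResolutionSketch, and the "demo" URI scheme are all hypothetical names introduced here, java.net.URI stands in for IOPath, and a byte array stands in for the SignatureStream. The sketch only shows how the three probes narrow the candidate list, and that the stream is never opened once a codec claims the URI.

```java
import java.net.URI;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.function.Supplier;

// Hypothetical stand-in for the probe methods of a codec; NOT the htsjdk HtsCodec interface.
interface ProbeCodec {
    String name();
    boolean ownsURI(URI uri);
    boolean canDecodeURI(URI uri);
    boolean canDecodeSignature(byte[] fragment);
}

// Hypothetical codec that matches on a file extension and a signature prefix.
final class ExtensionCodec implements ProbeCodec {
    private final String name;
    private final String extension;
    private final byte[] magic;

    ExtensionCodec(String name, String extension, String magic) {
        this.name = name;
        this.extension = extension;
        this.magic = magic.getBytes(StandardCharsets.US_ASCII);
    }

    public String name() { return name; }

    // Formats that only need a file extension never own a URI.
    public boolean ownsURI(URI uri) { return false; }

    public boolean canDecodeURI(URI uri) {
        String path = uri.getPath();
        return path != null && path.endsWith(extension);
    }

    public boolean canDecodeSignature(byte[] fragment) {
        if (fragment.length < magic.length) return false;
        for (int i = 0; i < magic.length; i++) {
            if (fragment[i] != magic[i]) return false;
        }
        return true;
    }
}

// Hypothetical codec that claims every URI using the made-up "demo" scheme.
final class SchemeCodec implements ProbeCodec {
    public String name() { return "demo-scheme"; }
    public boolean ownsURI(URI uri) { return "demo".equals(uri.getScheme()); }
    public boolean canDecodeURI(URI uri) { return ownsURI(uri); }
    public boolean canDecodeSignature(byte[] fragment) { return false; }
}

final class ResolutionSketch {
    // Returns the names of the codecs that survive all applicable probes.
    static List<String> resolve(List<ProbeCodec> registered, URI uri, Supplier<byte[]> streamStart) {
        // Step 1: if any codec owns the URI, only the owners remain candidates.
        List<ProbeCodec> candidates = new ArrayList<>();
        for (ProbeCodec codec : registered) {
            if (codec.ownsURI(uri)) candidates.add(codec);
        }
        boolean owned = !candidates.isEmpty();
        if (!owned) candidates.addAll(registered);

        // Step 2: prune by URI format (typically the file extension).
        candidates.removeIf(codec -> !codec.canDecodeURI(uri));

        // Step 3: probe the start of the stream, but never for an owned URI.
        if (!owned && !candidates.isEmpty()) {
            byte[] fragment = streamStart.get();
            candidates.removeIf(codec -> !codec.canDecodeSignature(fragment));
        }

        List<String> names = new ArrayList<>();
        for (ProbeCodec codec : candidates) names.add(codec.name());
        return names;
    }
}
```

Passing the stream as a Supplier keeps the sketch honest about ordering: the supplier is only invoked after the URI probes have run, and not at all when a codec owns the URI.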

See the HtsCodecResolver methods for more detail on the resolution protocol:

Formats That Use a Custom URI or Protocol Scheme

Many file formats consist of a single file that resides on a file system that is supported by a java.nio file system provider. Codecs that support such formats are generally agnostic about the IOPath or URI protocol scheme used to identify their resources, and assume that file contents can be accessed directly via a single stream created via a java.nio file system provider.

However, some file formats use a specific, well known URI format or protocol scheme, often to identify a remote or otherwise specially-formatted resource, such as a local database that is distributed across multiple physical files. These codecs may bypass direct java.nio file system access, and instead use specialized code to access their underlying resources.

For example, the BAMCodecV1_0 assumes that IOPath resources can be accessed as a stream on a single file via either the "file://" protocol, or other protocols such as "gs://" or "hdfs://" that have java.nio file system providers. It does not require or assume a particular URI format, and is agnostic about URI scheme.

In contrast, the HtsgetBAMCodecV1_2 codec is a specialized codec that handles remote resources via the "http://" protocol. It uses http to access the underlying resource, and bypasses direct java.nio file system access.

Codecs for formats that use a custom URI format or protocol scheme such as htsget must be able to determine if they can decode or encode a resource purely by inspecting the IOPath/URI, and should follow these guidelines:

Codec Implementation Guidelines

  • An HtsCodec class should implement only a single version of a single file format.
  • HtsCodec instances may be shared across multiple registries, and should generally be immutable (HtsEncoder and HtsDecoder implementations may be mutable).
  • For file formats that use a separate index resource to handle index queries, the getDecoder(Bundle, HtsDecoderOptions) implementation should not attempt to automatically resolve the companion index in order to satisfy index queries, if the index resource is not provided in the input bundle. HtsDecoders for such file formats should only satisfy index queries if the input bundle explicitly specifies the index resource. For file formats that do not require a separate index resource to be specified (such as those that rely on a remote access mechanism), it is permissible to satisfy index queries without requiring the index resource to be included in the bundle.
  • Codecs should avoid throwing exceptions from methods used during codec resolution (which includes all methods other than getDecoder(Bundle, HtsDecoderOptions) and getEncoder(Bundle, HtsEncoderOptions)).
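
The index guideline above can be sketched as follows. This is only an illustration under stated assumptions: a Map from a content-type key to a URI stands in for the input Bundle, and the class name IndexPolicySketch, the key strings, and both method names are hypothetical rather than part of the framework. The point is that index queries are enabled only when the caller supplied the index explicitly; the sketch never derives a companion index location from the primary resource.

```java
import java.net.URI;
import java.util.Map;
import java.util.Optional;

final class IndexPolicySketch {
    // Hypothetical bundle keys, used only for this illustration.
    static final String READS_KEY = "ALIGNED_READS";
    static final String INDEX_KEY = "READS_INDEX";

    // Returns the index only if the caller put one in the bundle; never guesses a sibling file.
    static Optional<URI> explicitIndex(Map<String, URI> bundle) {
        return Optional.ofNullable(bundle.get(INDEX_KEY));
    }

    // A decoder built from this bundle would accept index queries only when this returns true.
    static boolean canQueryByIndex(Map<String, URI> bundle) {
        return bundle.containsKey(READS_KEY) && explicitIndex(bundle).isPresent();
    }
}
```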

  • Method Details

    • getContentType

      HtsContentType getContentType()
      Get the HtsContentType for this codec.

      Returns:
      the HtsContentType for this codec. The HtsContentType determines the interfaces, including the HEADER and RECORD types, used by this codec's HtsEncoder and HtsDecoder. Each implementation of a given content type exposes the same interfaces, but over a different file format or version. For example, both the BAM and HTSGET_BAM codecs have content type HtsContentType.ALIGNED_READS, and are derived from ReadsCodec, but the serialized file formats and access mechanisms for the two codecs are different.
    • getFileFormat

      String getFileFormat()
      Get the name of the file format supported by this codec. The format name defines the underlying format handled by this codec, and also corresponds to the format of the primary bundle resource that is required when decoding or encoding (see BundleResourceType and BundleResource.getFileFormat()).
      Returns:
      the name of the underlying file format handled by this codec
    • getVersion

      HtsVersion getVersion()
      Get the version of the file format returned by getFileFormat() that is supported by this codec.
      Returns:
      the file format version (HtsVersion) supported by this codec
    • getDisplayName

      default String getDisplayName()
      Get a user-friendly display name for this codec.

      It is recommended that the display name minimally include both the name of the supported file format and the supported version.
      Returns:
      a user-friendly display name for this codec
    • ownsURI

      default boolean ownsURI(IOPath ioPath)
      Determine if this codec "owns" the URI contained in ioPath (see IOPath.getURI()).

      A codec "owns" the URI only if it has specific requirements on the URI protocol scheme, URI format, or query parameters that go beyond a simple file extension, AND it explicitly recognizes the URI as conforming to those requirements. File formats that only require a specific file extension should always return false from ownsURI(htsjdk.io.IOPath), and should instead use the extension as a filter in canDecodeURI(IOPath).

      Returning true from this method will cause the framework to bypass the stream-oriented signature probing that is used to resolve inputs to a codec handler. During codec resolution, if any registered codec returns true for this method on ioPath, the signature probing protocol will instead:

      1. immediately prune the list of candidate codecs to only those that return true for this method on ioPath
      2. not attempt to obtain an InputStream on the IOPath containing the URI, on the assumption that special handling is required in order to access the underlying resource (e.g., an htsget codec would claim an "http://" URI if the rest of the URI conforms to the expected format for that codec's protocol).

      Any codec that returns true from ownsURI(IOPath) for a given IOPath must also return true from canDecodeURI(IOPath) for the same IOPath. For custom URI handlers, codecs should avoid making remote calls to determine the suitability or accessibility of the input resource; the return value for this method should be based only on the format of the URI that is presented. Operations that require remote access that can fail, such as validating server connectivity, authentication, or authorization, should be deferred until data is requested by the caller via the codec's HtsEncoder or HtsDecoder. Since this method is used during codec resolution, implementations should avoid calling methods that may throw exceptions.

      Parameters:
      ioPath - the ioPath to inspect
      Returns:
      true if the ioPath's URI represents a custom URI that this codec handles
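
As a minimal sketch of the ownsURI contract described above, the following class decides ownership purely from the shape of the URI, makes no remote calls, and avoids throwing for unexpected input. The class name OwnsUriSketch and the matching rule (an http or https scheme plus a "/reads/" path segment) are hypothetical choices made for illustration, and java.net.URI stands in for IOPath; a real codec would apply the rules of its own protocol.

```java
import java.net.URI;

final class OwnsUriSketch {
    // Hypothetical rule: claim http(s) URIs whose path contains a "/reads/" segment.
    static boolean ownsURI(URI uri) {
        String scheme = uri.getScheme();
        String path = uri.getPath();
        // Relative and opaque URIs have no scheme or no path; reject them without throwing.
        if (scheme == null || path == null) {
            return false;
        }
        boolean isHttp = scheme.equalsIgnoreCase("http") || scheme.equalsIgnoreCase("https");
        return isHttp && path.contains("/reads/");
    }

    // A codec that owns a URI must also report that it can decode that URI.
    static boolean canDecodeURI(URI uri) {
        return ownsURI(uri);
    }
}
```

Connectivity, authentication, and authorization checks are deliberately absent; per the contract above they belong in the encoder or decoder, at the point where data is actually requested.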
    • canDecodeURI

      boolean canDecodeURI(IOPath ioPath)
      Determine if the URI for ioPath (obtained via IOPath.getURI()) conforms to the expected URI format for this codec's file format.

      Most implementations only look at the file extension (see IOPath.hasExtension(java.lang.String)). For codecs that implement formats that use specific, well known file extensions, the codec should reject inputs that do not conform to any of the accepted extensions. If the format does not use a specific extension, or if the codec cannot determine if it can decode the underlying resource without inspecting the underlying stream, it is safe to return true, after which the framework will subsequently call this codec's canDecodeSignature(SignatureStream, String) method, at which time the codec can inspect the actual underlying stream via the SignatureStream.

      Implementations should generally not inspect the URI's protocol scheme unless the file format supported by the codec requires the use of a specific protocol scheme. For codecs that do own a specific scheme or URI format, the return values for ownsURI(IOPath) and canDecodeURI(IOPath) must always be the same (both true or both false) for a given IOPath. For codecs that do not use a custom URI (and rely on NIO access), ownsURI(IOPath) should always return false, with only the value returned from canDecodeURI(IOPath) varying based on features such as file extension probes.

      It is never safe to attempt to directly inspect the underlying stream for ioPath in this method. If the stream needs to be inspected, it should be done using the signature stream when the canDecodeSignature(SignatureStream, String) method is called.

      For custom URI handlers (see ownsURI(IOPath)), codecs should avoid making remote calls to determine the suitability of the input resource; the return value for this method should be based only on the format of the URI that is presented. Since this method is used during codec resolution, implementations should avoid calling methods that may throw exceptions.
      Parameters:
      ioPath - to be decoded
      Returns:
      true if the codec can provide a decoder for this URI
    • canDecodeSignature

      boolean canDecodeSignature(SignatureStream signatureStream, String sourceName)
      Determine if the codec can decode an input stream by inspecting a signature embedded within the stream.

      The signatureStream will contain only a fragment of the actual input stream, taken from the start of the stream, the size of which will be the lesser of:

      1. the number of bytes returned by getSignatureProbeLength()
      2. the entire input stream, for streams that are smaller than getSignatureProbeLength()

      Codecs that handle custom URIs that reference remote resources (those that return true for ownsURI(htsjdk.io.IOPath)) should generally not inspect the stream, and should return false from this method, since the method will never be called with any resource for which ownsURI(htsjdk.io.IOPath) returned true. Since this method is used during codec resolution, implementations should avoid calling methods that may throw exceptions.

      Parameters:
      signatureStream - the stream to be inspected for the resource's embedded signature and version
      sourceName - a display name describing the source of the input stream, for use in error messages
      Returns:
      true if this codec recognizes the stream by its signature, and can provide a decoder to decode the stream, otherwise false
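
A signature probe for a plain-text format might look like the sketch below. It assumes an uncompressed VCF 4.2 stream, whose first line begins with "##fileformat=VCFv4.2", and uses a plain java.io.InputStream in place of SignatureStream; the class name SignatureProbeSketch is hypothetical. Because the method runs during codec resolution, an I/O failure is reported as "no match" rather than propagated.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

final class SignatureProbeSketch {
    // Leading bytes of a plain-text VCF 4.2 file, used as the signature in this sketch.
    static final byte[] SIGNATURE = "##fileformat=VCFv4.2".getBytes(StandardCharsets.US_ASCII);

    // Stand-in for canDecodeSignature(SignatureStream, String); sourceName is only for messages.
    static boolean canDecodeSignature(InputStream fragment, String sourceName) {
        byte[] buffer = new byte[SIGNATURE.length];
        int total = 0;
        try {
            // Read until the buffer is full or the fragment ends.
            while (total < buffer.length) {
                int n = fragment.read(buffer, total, buffer.length - total);
                if (n < 0) break;
                total += n;
            }
        } catch (IOException e) {
            // Resolution methods should not throw; treat unreadable input as a non-match.
            return false;
        }
        return total == SIGNATURE.length && Arrays.equals(buffer, SIGNATURE);
    }
}
```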
    • getSignatureLength

      int getSignatureLength()
      Get the number of bytes in the format and version signature used by the file format supported by this codec.
      Returns:
      if the file format supported by this codec is not remote, and is accessible via a local file or stream, the size of the unique signature/version for this file format; otherwise 0.

      Note: Codecs that are custom URI handlers (those that return true for ownsURI(htsjdk.io.IOPath)) should always return 0 from this method. Since this method is used during codec resolution, implementations should avoid calling methods that may throw exceptions.
    • getSignatureProbeLength

      default int getSignatureProbeLength()
      Get the number of bytes needed by this codec to probe an input stream for a format/version signature, and determine if it can supply a decoder for the stream.
      Returns:
      the number of bytes this codec must consume from a stream in order to determine whether it can decode that stream. This number may differ from the actual signature size as returned by getSignatureLength() for codecs that support compressed or encrypted streams, since they may require a larger and more semantically meaningful input fragment (such as an entire encrypted or compressed block) in order to inspect the plaintext signature.

      Therefore signatureProbeLength should be expressed in "compressed/encrypted" space rather than "plaintext" space. The length returned from this method is used to determine the size of the SignatureStream that is subsequently passed to canDecodeSignature(SignatureStream, String).

      Note: Codecs that are custom URI handlers (those that return true for ownsURI(IOPath)) should always return 0 from this method when it is called. Since this method is used during codec resolution, implementations should avoid calling methods that may throw exceptions.
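
The difference between the signature length and the probe length can be seen with a compressed stream. The sketch below assumes a made-up format whose 5-byte plaintext signature "DEMO1" sits inside a gzip stream; the class name, the constants, and the helper methods are all hypothetical, and byte arrays stand in for the SignatureStream. The probe fragment is the lesser of the probe length and the whole stream, and the probe length is measured in compressed bytes.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

final class ProbeLengthSketch {
    // Hypothetical plaintext signature; its length plays the role of getSignatureLength().
    static final byte[] SIGNATURE = "DEMO1".getBytes(StandardCharsets.US_ASCII);
    static final int SIGNATURE_LENGTH = SIGNATURE.length;
    // Hypothetical probe length in compressed space; plays the role of getSignatureProbeLength().
    static final int SIGNATURE_PROBE_LENGTH = 64 * 1024;

    // The fragment handed to the codec: the lesser of the probe length and the whole stream.
    static byte[] probeFragment(byte[] stream, int probeLength) {
        return Arrays.copyOf(stream, Math.min(stream.length, probeLength));
    }

    // Decompress the fragment and compare the first plaintext bytes with the signature.
    static boolean canDecodeSignature(byte[] fragment) {
        try (InputStream in = new GZIPInputStream(new ByteArrayInputStream(fragment))) {
            byte[] plain = new byte[SIGNATURE_LENGTH];
            int total = 0;
            while (total < plain.length) {
                int n = in.read(plain, total, plain.length - total);
                if (n < 0) break;
                total += n;
            }
            return total == plain.length && Arrays.equals(plain, SIGNATURE);
        } catch (IOException e) {
            // Truncated or non-gzip input is a non-match, not an error.
            return false;
        }
    }

    // Test helper: gzip a short ASCII string in memory.
    static byte[] gzip(String text) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (GZIPOutputStream out = new GZIPOutputStream(bytes)) {
            out.write(text.getBytes(StandardCharsets.US_ASCII));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }
}
```

A fragment cut to SIGNATURE_LENGTH compressed bytes is too short to hold even the gzip header, so the plaintext signature cannot be recovered from it; this is why the probe length is sized in compressed space.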

    • getDecoder

      HtsDecoder<?,? extends HtsRecord> getDecoder(Bundle inputBundle, D decoderOptions)
      Get an HtsDecoder to decode the provided inputs. The input bundle must contain resources of the type required by this codec. To find a codec appropriate for decoding a given resource, use an HtsCodecResolver obtained from an HtsCodecRegistry.

      The framework will never call this method unless either ownsURI(IOPath), or canDecodeURI(IOPath) and canDecodeSignature(SignatureStream, String), return true for inputBundle.

      Parameters:
      inputBundle - input to be decoded. To get a decoder for use with index queries that use HtsQuery methods, the bundle must contain an index resource.
      decoderOptions - options for the decoder to use
      Returns:
      an HtsDecoder that can decode the provided inputs
    • getEncoder

      HtsEncoder<?,? extends HtsRecord> getEncoder(Bundle outputBundle, E encoderOptions)
      Get an HtsEncoder to encode to the provided outputs. The output bundle must contain resources of the type required by this codec. To find a codec appropriate for encoding a given resource, use an HtsCodecResolver obtained from an HtsCodecRegistry.

      The framework will never call this method unless either ownsURI(IOPath), or canDecodeURI(IOPath) returned true for outputBundle.
      Parameters:
      outputBundle - target output for the encoder
      encoderOptions - encoder options to use
      Returns:
      an HtsEncoder suitable for writing to the provided outputs