Interface HtsCodec<D extends HtsDecoderOptions,E extends HtsEncoderOptions>
- Type Parameters:
D - the decoder options type for this codec
E - the encoder options type for this codec
- All Superinterfaces:
Upgradeable
- All Known Subinterfaces:
HaploidReferenceCodec, ReadsCodec, VariantsCodec
- All Known Implementing Classes:
BAMCodec, BAMCodecV1_0, CRAMCodec, CRAMCodecV2_1, CRAMCodecV3_0, FASTACodecV1_0, HtsgetBAMCodec, HtsgetBAMCodecV1_2, SAMCodec, SAMCodecV1_0, VCFCodec, VCFCodecV3_2, VCFCodecV3_3, VCFCodecV4_0, VCFCodecV4_1, VCFCodecV4_2, VCFCodecV4_3
Base interface for htsjdk.beta.plugin codecs.
Codec Components
Each version of a file format supported by the htsjdk.beta.plugin framework is represented by a trio of components:
- a codec that implements HtsCodec
- an encoder that implements HtsEncoder
- a decoder that implements HtsDecoder
The HtsCodec is a lightweight and long-lived object that resides in an HtsCodecRegistry. A registry is used to resolve requests for an HtsEncoder or HtsDecoder that matches a given resource. The HtsEncoder and HtsDecoder objects do the work of actually writing and reading records to and from underlying resources.
A default, static, immutable HtsCodecRegistry is populated with HtsCodecs that are discovered and instantiated statically via a ServiceLoader, and can be accessed using HtsDefaultRegistry. A private, mutable registry can be created at runtime via HtsCodecRegistry.createPrivateRegistry(), and populated dynamically by calls to HtsCodecRegistry.registerCodec(HtsCodec).
The primary responsibility of an HtsCodec is to satisfy requests made by the framework during codec resolution, inspecting and recognizing input URIs and stream resources that match the supported format and version, and providing an HtsEncoder or HtsDecoder on demand, once a match is made.
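The division of labor described above can be illustrated with a small, self-contained sketch. All of the types below (`SketchCodec`, `SketchDecoder`, `SketchRegistry`) are hypothetical stand-ins that only mirror the roles of codec, decoder, and registry; they are not the htsjdk interfaces, and the real resolution protocol is richer than a single name check.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;

// Hypothetical stand-ins that mirror the roles described above; not the htsjdk API.
interface SketchDecoder {
    String describe();
}

// A codec is lightweight and long-lived: it only recognizes resources and
// creates decoders on demand.
interface SketchCodec {
    String fileFormat();
    String version();
    boolean recognizes(String resourceName);
    SketchDecoder newDecoder(String resourceName);
}

// A registry holds codecs and resolves a request for a decoder against them.
final class SketchRegistry {
    private final List<SketchCodec> codecs = new ArrayList<>();

    void registerCodec(SketchCodec codec) {
        codecs.add(codec);
    }

    Optional<SketchDecoder> resolveDecoder(String resourceName) {
        return codecs.stream()
                .filter(c -> c.recognizes(resourceName))
                .findFirst()
                .map(c -> c.newDecoder(resourceName));
    }
}

final class RegistrySketchDemo {
    // Builds a private registry holding one hypothetical codec for ".fasta" resources.
    static SketchRegistry newRegistry() {
        SketchRegistry registry = new SketchRegistry();
        registry.registerCodec(new SketchCodec() {
            public String fileFormat() { return "FASTA"; }
            public String version() { return "1.0"; }
            public boolean recognizes(String name) { return name.endsWith(".fasta"); }
            public SketchDecoder newDecoder(String name) {
                return () -> fileFormat() + " v" + version() + " decoder for " + name;
            }
        });
        return registry;
    }

    public static void main(String[] args) {
        SketchRegistry registry = newRegistry();
        System.out.println(registry.resolveDecoder("ref.fasta")
                .map(SketchDecoder::describe).orElse("no match"));
        System.out.println(registry.resolveDecoder("reads.bam")
                .map(SketchDecoder::describe).orElse("no match"));
    }
}
```

The codec object itself holds no stream or file state, which is why it can live in a registry for the lifetime of the process while decoders are created per resource.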
Content Types
The plugin framework supports four different types of HTS data, called content types:
- HtsContentType.ALIGNED_READS
- HtsContentType.HAPLOID_REFERENCE
- HtsContentType.VARIANT_CONTEXTS
- HtsContentType.FEATURES
For each content type, there is a corresponding set of codec/decoder/encoder interfaces that are implemented by components that support that content type. These interfaces extend generic base interfaces, and provide generic parameter type instantiations appropriate for that content type. As an example, see ReadsDecoder, which defines the interface for all HtsDecoders for the HtsContentType.ALIGNED_READS content type. The different implementations of component trios for a given content type all use the same content-type-specific interfaces, but each over a different combination of underlying file format and version.
The generic, base interfaces that are common to all codecs, encoders, and decoders are:
- HtsCodec: base codec interface
- HtsEncoder: base encoder interface
- HtsEncoderOptions: base options interface for encoders
- HtsDecoder: base decoder interface
- HtsDecoderOptions: base options interface for decoders
- a class with string constants for each supported file format for that type
- Bundle: an optional type-specific Bundle implementation
The packages containing the content-type-specific interface definitions for each of the four different content types are:
- For HtsContentType.ALIGNED_READS codecs, see the htsjdk.beta.plugin.reads package
- For HtsContentType.HAPLOID_REFERENCE codecs, see the htsjdk.beta.plugin.hapref package
- For HtsContentType.VARIANT_CONTEXTS codecs, see the htsjdk.beta.plugin.variants package
- For HtsContentType.FEATURES codecs, see the htsjdk.beta.plugin.features package
Example Content Type: Reads
As an example, the htsjdk.beta.plugin.reads package defines the following interfaces that extend the generic base interfaces for codecs with content type HtsContentType.ALIGNED_READS:
- ReadsCodec: reads codec interface, extends the generic HtsCodec interface
- ReadsEncoder: reads encoder interface, extends the generic HtsEncoder interface
- ReadsEncoderOptions: reads encoder options, extends the generic HtsEncoderOptions interface
- ReadsDecoder: reads decoder interface, extends the generic HtsDecoder interface
- ReadsDecoderOptions: reads decoder options, extends the generic HtsDecoderOptions interface
- ReadsFormats: a class with string constants for each possible supported reads file format
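The way a content-type-specific interface pins down the generic parameters of a base interface can be sketched as follows. Every name here (`BaseCodec`, `SketchReadsCodec`, the option methods, and so on) is a hypothetical stand-in chosen for illustration; the actual htsjdk declarations carry more methods and type parameters.

```java
// Hypothetical stand-ins for the generic base interfaces; not the htsjdk declarations.
interface BaseDecoderOptions {}
interface BaseEncoderOptions {}

interface BaseCodec<D extends BaseDecoderOptions, E extends BaseEncoderOptions> {
    String describeOptions(D decoderOptions, E encoderOptions);
}

// Content-type-specific options extend the generic options interfaces.
interface SketchReadsDecoderOptions extends BaseDecoderOptions {
    boolean eagerlyDecode(); // hypothetical option
}

interface SketchReadsEncoderOptions extends BaseEncoderOptions {
    boolean presorted(); // hypothetical option
}

// The content-type-specific codec interface fixes the generic parameters once...
interface SketchReadsCodec
        extends BaseCodec<SketchReadsDecoderOptions, SketchReadsEncoderOptions> {}

// ...so each format/version implementation works with the reads-specific option types.
final class SketchBamLikeCodec implements SketchReadsCodec {
    @Override
    public String describeOptions(SketchReadsDecoderOptions d, SketchReadsEncoderOptions e) {
        return "eager=" + d.eagerlyDecode() + ", presorted=" + e.presorted();
    }
}
```

Because the type arguments are bound at the content-type level, every implementation for that content type exposes the same option types regardless of the file format it serializes.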
Codec Resolution
The plugin framework uses registered codecs to conduct a series of probes into the structure and format of an input or output resource in order to find a matching codec that can produce an encoder or decoder for that resource. The values returned from the codec methods are used by the framework to prune a list of candidate codecs down, until a match is found. During codec resolution, the codec methods are called in the following order:
- ownsURI(IOPath)
- canDecodeURI(IOPath)
- canDecodeSignature(SignatureStream, String)
See the HtsCodecResolver methods for more detail on the resolution protocol:
- HtsCodecResolver.resolveForDecoding(Bundle)
- HtsCodecResolver.resolveForEncoding(Bundle)
- HtsCodecResolver.resolveForEncoding(Bundle, HtsVersion)
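The pruning idea behind codec resolution can be sketched in a few lines. This is a simplified model under stated assumptions, not the HtsCodecResolver implementation: `ProbeCodec`, `SimpleProbeCodec`, the "remote://" scheme, the ".fa" extension, and the "AAA1"/"AAA2" magic strings are all invented for the example, and URIs are plain strings rather than IOPath objects.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical, simplified view of a codec used only for this sketch; not the htsjdk API.
interface ProbeCodec {
    String name();
    boolean ownsURI(String uri);
    boolean canDecodeURI(String uri);
    boolean canDecodeSignature(byte[] streamPrefix);
}

// Stand-in codec: either owns a URI scheme, or matches by extension and magic bytes.
final class SimpleProbeCodec implements ProbeCodec {
    private final String name;
    private final String ownedScheme; // null when the codec owns no URI scheme
    private final String extension;
    private final String magic;

    SimpleProbeCodec(String name, String ownedScheme, String extension, String magic) {
        this.name = name;
        this.ownedScheme = ownedScheme;
        this.extension = extension;
        this.magic = magic;
    }

    public String name() { return name; }

    public boolean ownsURI(String uri) {
        return ownedScheme != null && uri.startsWith(ownedScheme);
    }

    public boolean canDecodeURI(String uri) {
        return ownedScheme != null ? ownsURI(uri) : uri.endsWith(extension);
    }

    public boolean canDecodeSignature(byte[] streamPrefix) {
        if (ownedScheme != null) {
            return false; // custom URI handlers never probe the stream
        }
        byte[] expected = magic.getBytes(StandardCharsets.US_ASCII);
        return streamPrefix.length >= expected.length
                && Arrays.equals(Arrays.copyOf(streamPrefix, expected.length), expected);
    }
}

final class ResolutionSketch {
    // Prunes candidates: URI ownership first, then URI filtering, then signature probing.
    static List<String> resolve(List<ProbeCodec> candidates, String uri, byte[] streamPrefix) {
        List<ProbeCodec> owners = new ArrayList<>();
        for (ProbeCodec c : candidates) {
            if (c.ownsURI(uri)) {
                owners.add(c);
            }
        }
        List<String> matches = new ArrayList<>();
        if (!owners.isEmpty()) {
            // A codec claimed the URI, so stream probing is skipped entirely.
            for (ProbeCodec c : owners) {
                if (c.canDecodeURI(uri)) {
                    matches.add(c.name());
                }
            }
            return matches;
        }
        for (ProbeCodec c : candidates) {
            if (c.canDecodeURI(uri) && c.canDecodeSignature(streamPrefix)) {
                matches.add(c.name());
            }
        }
        return matches;
    }

    static List<ProbeCodec> sampleCodecs() {
        return Arrays.asList(
                new SimpleProbeCodec("FormatA v1", null, ".fa", "AAA1"),
                new SimpleProbeCodec("FormatA v2", null, ".fa", "AAA2"),
                new SimpleProbeCodec("Remote", "remote://", null, null));
    }
}
```

In this model two versions of the same format share an extension, so the URI check alone leaves both as candidates and only the signature probe separates them.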
Formats That Use a Custom URI or Protocol Scheme
Many file formats consist of a single file that resides on a file system that is supported by a java.nio file system provider. Codecs that support such formats are generally agnostic about the IOPath or URI protocol scheme used to identify their resources, and assume that file contents can be accessed directly via a single stream created via a java.nio file system provider.
However, some file formats use a specific, well-known URI format or protocol scheme, often to identify a remote or otherwise specially formatted resource, such as a local database that is distributed across multiple physical files. Codecs for these formats may bypass direct java.nio file system access, and instead use specialized code to access their underlying resources.
For example, the BAMCodecV1_0 assumes that IOPath resources can be accessed as a stream on a single file via either the "file://" protocol, or other protocols such as gs:// or hdfs:// that have java.nio file system providers. It does not require or assume a particular URI format, and is agnostic about URI scheme.
In contrast, the HtsgetBAMCodecV1_2 codec is a specialized codec that handles remote resources via the "http://" protocol. It uses http to access the underlying resource, and bypasses direct java.nio file system access.
Codecs for formats that use a custom URI format or protocol scheme such as htsget must be able to determine if they can decode or encode a resource purely by inspecting the IOPath/URI, and should follow these guidelines:
- return true when ownsURI(IOPath) is presented with an IOPath with a conforming URI
- return true when canDecodeURI(IOPath) is presented with an IOPath with a conforming URI
- ensure that for a given IOPath, ownsURI(IOPath) == canDecodeURI(IOPath)
- always return 0 from the getSignatureProbeLength() method
- always return 0 from the getSignatureLength() method
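The guidelines above can be sketched for an imaginary custom-URI codec. This is an illustration only: the class name `CustomUriCodecSketch` and the rule "http scheme plus a path starting with /reads/" are assumptions made for the example, not the rule used by HtsgetBAMCodecV1_2, and the methods take a java.net.URI rather than an IOPath.

```java
import java.net.URI;

// Hypothetical custom-URI codec sketch; the URI rule below is invented for illustration.
final class CustomUriCodecSketch {
    // Recognizes a URI purely by its shape, without opening any connection.
    static boolean ownsURI(URI uri) {
        String scheme = uri.getScheme();
        String path = uri.getPath(); // null for opaque URIs such as "mailto:..."
        return "http".equalsIgnoreCase(scheme) && path != null && path.startsWith("/reads/");
    }

    // For a custom URI handler this must always agree with ownsURI.
    static boolean canDecodeURI(URI uri) {
        return ownsURI(uri);
    }

    // No stream is ever probed for an owned URI, so both lengths are 0.
    static int getSignatureProbeLength() {
        return 0;
    }

    static int getSignatureLength() {
        return 0;
    }
}
```

Delegating canDecodeURI to ownsURI makes the equality guideline hold by construction, and neither method touches the network, so neither can fail on connectivity or authorization.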
Codec Implementation Guidelines
- An HtsCodec class should implement only a single version of a single file format.
- HtsCodec instances may be shared across multiple registries, and should generally be immutable (HtsEncoder and HtsDecoder implementations may be mutable).
- For file formats that use a separate index resource to handle index queries, the getDecoder(Bundle, HtsDecoderOptions) implementation should not attempt to automatically resolve the companion index in order to satisfy index queries if the index resource is not provided in the input bundle. HtsDecoders for such file formats should only satisfy index queries if the input bundle explicitly specifies the index resource. For file formats that do not require a separate index resource to be specified (such as those that rely on a remote access mechanism), it is permissible to satisfy index queries without requiring the index resource to be included in the bundle.
- Codecs should avoid throwing exceptions from methods used during codec resolution (which includes all methods other than getDecoder(Bundle, HtsDecoderOptions) and getEncoder(Bundle, HtsEncoderOptions)).
Method Summary
Modifier and Type, Method, Description:
- boolean canDecodeSignature(SignatureStream signatureStream, String sourceName): Determine if the codec can decode an input stream by inspecting a signature embedded within the stream.
- boolean canDecodeURI(IOPath ioPath): Determine if the URI for ioPath (obtained via IOPath.getURI()) conforms to the expected URI format for this codec's file format.
- HtsContentType getContentType(): Get the HtsContentType for this codec.
- HtsDecoder<?, ? extends HtsRecord> getDecoder(Bundle inputBundle, D decoderOptions): Get an HtsDecoder to decode the provided inputs.
- default String getDisplayName(): Get a user-friendly display name for this codec.
- HtsEncoder<?, ? extends HtsRecord> getEncoder(Bundle outputBundle, E encoderOptions): Get an HtsEncoder to encode to the provided outputs.
- String getFileFormat(): Get the name of the file format supported by this codec.
- int getSignatureLength(): Get the number of bytes in the format and version signature used by the file format supported by this codec.
- default int getSignatureProbeLength(): Get the number of bytes needed by this codec to probe an input stream for a format/version signature, and determine if it can supply a decoder for the stream.
- HtsVersion getVersion(): Get the version of the file format returned by getFileFormat() that is supported by this codec.
- default boolean ownsURI(IOPath ioPath): Determine if this codec "owns" the URI contained in ioPath (see IOPath.getURI()).
Methods inherited from interface htsjdk.beta.plugin.Upgradeable: runVersionUpgrade
Method Details
-
getContentType
HtsContentType getContentType()
Get the HtsContentType for this codec.
- Returns:
- the HtsContentType for this codec. The HtsContentType determines the interfaces, including the HEADER and RECORD types, used by this codec's HtsEncoder and HtsDecoder. Each implementation of a given content type exposes the same interfaces, but over a different file format or version. For example, both the BAM and HTSGET_BAM codecs have content type HtsContentType.ALIGNED_READS, and are derived from ReadsCodec, but the serialized file formats and access mechanisms for the two codecs are different.
-
getFileFormat
String getFileFormat()
Get the name of the file format supported by this codec. The format name defines the underlying format handled by this codec, and also corresponds to the format of the primary bundle resource that is required when decoding or encoding (see BundleResourceType and BundleResource.getFileFormat()).
- Returns:
- the name of the underlying file format handled by this codec
-
getVersion
HtsVersion getVersion()
Get the version of the file format returned by getFileFormat() that is supported by this codec.
- Returns:
- the file format version (HtsVersion) supported by this codec
-
getDisplayName
Get a user-friendly display name for this codec. It is recommended that the display name minimally include both the name of the supported file format and the supported version.
- Returns:
- a user-friendly display name for this codec
-
ownsURI
Determine if this codec "owns" the URI contained in ioPath (see IOPath.getURI()).
A codec "owns" the URI only if it has specific requirements on the URI protocol scheme, URI format, or query parameters that go beyond a simple file extension, AND it explicitly recognizes the URI as conforming to those requirements. File formats that only require a specific file extension should always return false from ownsURI(htsjdk.io.IOPath), and should instead use the extension as a filter in canDecodeURI(IOPath).
Returning true from this method will cause the framework to bypass the stream-oriented signature probing that is used to resolve inputs to a codec handler. During codec resolution, if any registered codec returns true for this method on ioPath, the signature probing protocol will instead:
- immediately prune the list of candidate codecs to only those that return true for this method on ioPath
- not attempt to obtain an InputStream on the IOPath containing the URI, on the assumption that special handling is required in order to access the underlying resource (e.g., an htsget codec would claim an "http://" URI if the rest of the URI conforms to the expected format for that codec's protocol).
Any codec that returns true from ownsURI(IOPath) for a given IOPath must also return true from canDecodeURI(IOPath) for the same IOPath. For custom URI handlers, codecs should avoid making remote calls to determine the suitability or accessibility of the input resource; the return value for this method should be based only on the format of the URI that is presented. Operations that require remote access that can fail, such as validating server connectivity, authentication, or authorization, should be deferred until data is requested by the caller via the codec's HtsEncoder or HtsDecoder. Since this method is used during codec resolution, implementations should avoid calling methods that may throw exceptions.
- Parameters:
ioPath - the ioPath to inspect
- Returns:
- true if the ioPath's URI represents a custom URI that this codec handles
-
canDecodeURI
Determine if the URI for ioPath (obtained via IOPath.getURI()) conforms to the expected URI format for this codec's file format. Most implementations only look at the file extension (see IOPath.hasExtension(java.lang.String)). For codecs that implement formats that use specific, well-known file extensions, the codec should reject inputs that do not conform to any of the accepted extensions. If the format does not use a specific extension, or if the codec cannot determine if it can decode the underlying resource without inspecting the underlying stream, it is safe to return true, after which the framework will subsequently call this codec's canDecodeSignature(SignatureStream, String) method, at which time the codec can inspect the actual underlying stream via the SignatureStream.
Implementations should generally not inspect the URI's protocol scheme unless the file format supported by the codec requires the use of a specific protocol scheme. For codecs that do own a specific scheme or URI format, the return values for ownsURI(IOPath) and canDecodeURI(IOPath) must always be the same (both true or both false) for a given IOPath. For codecs that do not use a custom URI (and rely on NIO access), ownsURI(IOPath) should always return false, with only the value returned from canDecodeURI(IOPath) varying based on features such as file extension probes.
It is never safe to attempt to directly inspect the underlying stream for ioPath in this method. If the stream needs to be inspected, it should be done using the signature stream when the canDecodeSignature(SignatureStream, String) method is called.
For custom URI handlers (see ownsURI(IOPath)), codecs should avoid making remote calls to determine the suitability of the input resource; the return value for this method should be based only on the format of the URI that is presented. Since this method is used during codec resolution, implementations should avoid calling methods that may throw exceptions.
- Parameters:
ioPath - to be decoded
- Returns:
- true if the codec can provide a decoder for this URI
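An extension-only filter of the kind described for canDecodeURI might look like the following minimal sketch. It uses java.net.URI directly instead of IOPath, and the class name `ExtensionFilterSketch` and its list of accepted extensions are illustrative assumptions, not the set used by any htsjdk codec.

```java
import java.net.URI;
import java.util.Arrays;
import java.util.List;

final class ExtensionFilterSketch {
    // Hypothetical set of accepted extensions, chosen only for this example.
    private static final List<String> EXTENSIONS = Arrays.asList(".vcf", ".vcf.gz");

    // Inspects only the URI path: the scheme is deliberately ignored and no stream is opened.
    static boolean canDecodeURI(URI uri) {
        String path = uri.getPath(); // null for opaque URIs
        if (path == null) {
            return false;
        }
        return EXTENSIONS.stream().anyMatch(path::endsWith);
    }

    // A codec that relies on NIO access and a file extension never owns the URI.
    static boolean ownsURI(URI uri) {
        return false;
    }
}
```

Because the scheme is not examined, the same filter accepts a local file and a resource served by any other file system provider, leaving the final decision to signature probing.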
-
canDecodeSignature
Determine if the codec can decode an input stream by inspecting a signature embedded within the stream. The signatureStream will contain only a fragment of the actual input stream, taken from the start of the stream, the size of which will be the lesser of:
- the number of bytes returned by getSignatureProbeLength()
- the entire input stream, for streams that are smaller than getSignatureProbeLength()
Codecs that handle custom URIs that reference remote resources (those that return true for ownsURI(htsjdk.io.IOPath)) should generally not inspect the stream, and should return false from this method, since the method will never be called with any resource for which ownsURI(htsjdk.io.IOPath) returned true. Since this method is used during codec resolution, implementations should avoid calling methods that may throw exceptions.
- Parameters:
signatureStream - the stream to be inspected for the resource's embedded signature and version
sourceName - a display name describing the source of the input stream, for use in error messages
- Returns:
- true if this codec recognizes the stream by its signature, and can provide a decoder to decode the stream, otherwise false
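The "lesser of" rule and a plaintext signature check can be sketched with plain byte arrays. This is a simplified model: a byte array stands in for SignatureStream, the class and method names are hypothetical, and the signature used is the first header line of a VCF 4.2 file.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

final class SignatureProbeSketch {
    // VCF 4.2 files begin with this header line; used here as a plaintext signature.
    private static final byte[] SIGNATURE =
            "##fileformat=VCFv4.2".getBytes(StandardCharsets.US_ASCII);

    static int getSignatureProbeLength() {
        return SIGNATURE.length;
    }

    // Models the bounded fragment handed to the codec:
    // the lesser of the probe length and the whole stream.
    static byte[] probeFragment(byte[] fullStream) {
        return Arrays.copyOf(fullStream, Math.min(fullStream.length, getSignatureProbeLength()));
    }

    // A fragment shorter than the signature simply does not match; nothing is thrown.
    static boolean canDecodeSignature(byte[] fragment) {
        return fragment.length >= SIGNATURE.length
                && Arrays.equals(Arrays.copyOf(fragment, SIGNATURE.length), SIGNATURE);
    }
}
```

Checking the fragment length before comparing keeps the method safe for inputs that are shorter than the signature, which matters because resolution-time methods should not throw.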
-
getSignatureLength
int getSignatureLength()
Get the number of bytes in the format and version signature used by the file format supported by this codec.
- Returns:
- if the file format supported by this codec is not remote, and is accessible via a local file or stream, the size of the unique signature/version for this file format; otherwise 0.
Note: Codecs that are custom URI handlers (those that return true for ownsURI(htsjdk.io.IOPath)) should always return 0 from this method. Since this method is used during codec resolution, implementations should avoid calling methods that may throw exceptions.
-
getSignatureProbeLength
default int getSignatureProbeLength()
Get the number of bytes needed by this codec to probe an input stream for a format/version signature, and determine if it can supply a decoder for the stream.
- Returns:
- the number of bytes this codec must consume from a stream in order to determine whether it can decode that stream. This number may differ from the actual signature size as returned by getSignatureLength() for codecs that support compressed or encrypted streams, since they may require a larger and more semantically meaningful input fragment (such as an entire encrypted or compressed block) in order to inspect the plaintext signature. Therefore signatureProbeLength should be expressed in "compressed/encrypted" space rather than "plaintext" space. The length returned from this method is used to determine the size of the SignatureStream that is subsequently passed to canDecodeSignature(SignatureStream, String).
Note: Codecs that are custom URI handlers (those that return true for ownsURI(IOPath)) should always return 0 from this method when it is called. Since this method is used during codec resolution, implementations should avoid calling methods that may throw exceptions.
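The difference between "plaintext" space and "compressed" space can be demonstrated with gzip from the Java standard library. The format here is imaginary: `ProbeLengthSketch` and the 4-byte signature "HTS1" are assumptions for illustration, and gzip stands in for whatever block compression a real format uses. A fragment as long as the plaintext signature is too short to decode, since the fixed gzip header alone occupies 10 bytes.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

final class ProbeLengthSketch {
    // Hypothetical 4-byte plaintext signature for an imaginary gzip-compressed format.
    static final byte[] SIGNATURE = "HTS1".getBytes(StandardCharsets.US_ASCII);

    // Compresses the plaintext in memory.
    static byte[] gzip(byte[] plaintext) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(bytes)) {
            gz.write(plaintext);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    // Returns true only if the compressed fragment is large enough to recover the signature.
    static boolean signatureVisible(byte[] compressedFragment) {
        try (GZIPInputStream in =
                     new GZIPInputStream(new ByteArrayInputStream(compressedFragment))) {
            byte[] head = new byte[SIGNATURE.length];
            int filled = 0;
            while (filled < head.length) {
                int n = in.read(head, filled, head.length - filled);
                if (n < 0) {
                    return false; // stream ended before the signature was complete
                }
                filled += n;
            }
            return Arrays.equals(head, SIGNATURE);
        } catch (IOException e) {
            // Truncated or non-gzip input: report "no match" rather than propagating.
            return false;
        }
    }
}
```

A probe length sized in plaintext bytes would hand the codec an undecodable fragment, so a codec for such a format would report a probe length large enough to cover the compressed block that contains the signature.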
-
getDecoder
Get an HtsDecoder to decode the provided inputs. The input bundle must contain resources of the type required by this codec. To find a codec appropriate for decoding a given resource, use an HtsCodecResolver obtained from an HtsCodecRegistry.
The framework will never call this method unless either ownsURI(IOPath), or canDecodeURI(IOPath) and canDecodeSignature(SignatureStream, String), return true for inputBundle.
- Parameters:
inputBundle - input to be decoded. To get a decoder for use with index queries that use HtsQuery methods, the bundle must contain an index resource.
decoderOptions - options for the decoder to use
- Returns:
- an HtsDecoder that can decode the provided inputs
-
getEncoder
Get an HtsEncoder to encode to the provided outputs. The output bundle must contain resources of the type required by this codec. To find a codec appropriate for encoding a given resource, use an HtsCodecResolver obtained from an HtsCodecRegistry. The framework will never call this method unless either ownsURI(IOPath), or canDecodeURI(IOPath) returned true for outputBundle.
- Parameters:
outputBundle - target output for the encoder
encoderOptions - encoder options to use
- Returns:
- an HtsEncoder suitable for writing to the provided outputs
-