Class CompressedSource<T>
- java.lang.Object
-
- org.apache.beam.sdk.io.Source<T>
-
- org.apache.beam.sdk.io.BoundedSource<T>
-
- org.apache.beam.sdk.io.OffsetBasedSource<T>
-
- org.apache.beam.sdk.io.FileBasedSource<T>
-
- org.apache.beam.sdk.io.CompressedSource<T>
-
- Type Parameters:
T
- The type to read from the compressed file.
- All Implemented Interfaces:
java.io.Serializable
,HasDisplayData
@Experimental(SOURCE_SINK) public class CompressedSource<T> extends FileBasedSource<T>
A Source that reads from compressed files. ACompressedSources
wraps a delegateFileBasedSource
that is able to read the decompressed file format.For example, use the following to read from a gzip-compressed file-based source:
FileBasedSource<T> mySource = ...; PCollection<T> collection = p.apply(Read.from(CompressedSource .from(mySource) .withCompression(Compression.GZIP)));
Supported compression algorithms are
Compression.GZIP
,Compression.BZIP2
,Compression.ZIP
,Compression.ZSTD
,Compression.LZO
,Compression.LZOP
,Compression.SNAPPY
, andCompression.DEFLATE
. User-defined compression types are supported by implementing aCompressedSource.DecompressingChannelFactory
.By default, the compression algorithm is selected from those supported in
Compression
based on the file name provided to the source, namely".bz2"
indicatesCompression.BZIP2
,".gz"
indicatesCompression.GZIP
,".zip"
indicatesCompression.ZIP
,".zst"
indicatesCompression.ZSTD
,".lzo_deflate"
indicatesCompression.LZO
,".lzo"
indicatesCompression.LZOP
,".snappy"
indictedCompression.SNAPPY
, and".deflate"
indicatesCompression.DEFLATE
. If the file name does not match any of the supported algorithms, it is assumed to be uncompressed data.- See Also:
- Serialized Form
-
-
Nested Class Summary
Nested Classes Modifier and Type Class Description static class
CompressedSource.CompressedReader<T>
Reader for aCompressedSource
.static class
CompressedSource.CompressionMode
Deprecated.UseCompression
insteadstatic interface
CompressedSource.DecompressingChannelFactory
Factory interface for creating channels that decompress the content of an underlying channel.-
Nested classes/interfaces inherited from class org.apache.beam.sdk.io.FileBasedSource
FileBasedSource.FileBasedReader<T>, FileBasedSource.Mode
-
Nested classes/interfaces inherited from class org.apache.beam.sdk.io.OffsetBasedSource
OffsetBasedSource.OffsetBasedReader<T>
-
Nested classes/interfaces inherited from class org.apache.beam.sdk.io.BoundedSource
BoundedSource.BoundedReader<T>
-
Nested classes/interfaces inherited from class org.apache.beam.sdk.io.Source
Source.Reader<T>
-
-
Method Summary
All Methods Static Methods Instance Methods Concrete Methods Modifier and Type Method Description protected FileBasedSource<T>
createForSubrangeOfFile(MatchResult.Metadata metadata, long start, long end)
Creates aCompressedSource
for a subrange of a file.protected FileBasedSource.FileBasedReader<T>
createSingleFileReader(PipelineOptions options)
Creates aFileBasedReader
to read a single file.static <T> CompressedSource<T>
from(FileBasedSource<T> sourceDelegate)
Creates aCompressedSource
from an underlyingFileBasedSource
.CompressedSource.DecompressingChannelFactory
getChannelFactory()
Coder<T>
getOutputCoder()
Returns the delegate source's output coder.protected boolean
isSplittable()
Determines whether a single file represented by this source is splittable.void
populateDisplayData(DisplayData.Builder builder)
Register display data for the given transform or component.void
validate()
Validates that the delegate source is a valid source and that the channel factory is not null.CompressedSource<T>
withCompression(Compression compression)
LikewithDecompression(org.apache.beam.sdk.io.CompressedSource.DecompressingChannelFactory)
but takes a canonicalCompression
.CompressedSource<T>
withDecompression(CompressedSource.DecompressingChannelFactory channelFactory)
Return aCompressedSource
that is like this one but will decompress its underlying file with the givenCompressedSource.DecompressingChannelFactory
.-
Methods inherited from class org.apache.beam.sdk.io.FileBasedSource
createReader, createSourceForSubrange, getEmptyMatchTreatment, getEstimatedSizeBytes, getFileOrPatternSpec, getFileOrPatternSpecProvider, getMaxEndOffset, getMode, getSingleFileMetadata, split, toString
-
Methods inherited from class org.apache.beam.sdk.io.OffsetBasedSource
getBytesPerOffset, getEndOffset, getMinBundleSize, getStartOffset
-
Methods inherited from class org.apache.beam.sdk.io.Source
getDefaultOutputCoder
-
-
-
-
Method Detail
-
from
public static <T> CompressedSource<T> from(FileBasedSource<T> sourceDelegate)
Creates aCompressedSource
from an underlyingFileBasedSource
. The type of compression used will be based on the file name extension unless explicitly configured viawithDecompression(org.apache.beam.sdk.io.CompressedSource.DecompressingChannelFactory)
.
-
withDecompression
public CompressedSource<T> withDecompression(CompressedSource.DecompressingChannelFactory channelFactory)
Return aCompressedSource
that is like this one but will decompress its underlying file with the givenCompressedSource.DecompressingChannelFactory
.
-
withCompression
public CompressedSource<T> withCompression(Compression compression)
LikewithDecompression(org.apache.beam.sdk.io.CompressedSource.DecompressingChannelFactory)
but takes a canonicalCompression
.
-
validate
public void validate()
Validates that the delegate source is a valid source and that the channel factory is not null.- Overrides:
validate
in classFileBasedSource<T>
-
createForSubrangeOfFile
protected FileBasedSource<T> createForSubrangeOfFile(MatchResult.Metadata metadata, long start, long end)
Creates aCompressedSource
for a subrange of a file. Called by superclass to create a source for a single file.- Specified by:
createForSubrangeOfFile
in classFileBasedSource<T>
- Parameters:
metadata
- file backing the newFileBasedSource
.start
- starting byte offset of the newFileBasedSource
.end
- ending byte offset of the newFileBasedSource
. May be Long.MAX_VALUE, in which case it will be inferred usingFileBasedSource.getMaxEndOffset(org.apache.beam.sdk.options.PipelineOptions)
.
-
isSplittable
protected final boolean isSplittable()
Determines whether a single file represented by this source is splittable. Returns true if we are using the default decompression factory and it determines from the requested file name that the file is not compressed.- Overrides:
isSplittable
in classFileBasedSource<T>
-
createSingleFileReader
protected final FileBasedSource.FileBasedReader<T> createSingleFileReader(PipelineOptions options)
Creates aFileBasedReader
to read a single file.Uses the delegate source to create a single file reader for the delegate source. Utilizes the default decompression channel factory to not wrap the source reader if the file name does not represent a compressed file allowing for splitting of the source.
- Specified by:
createSingleFileReader
in classFileBasedSource<T>
-
populateDisplayData
public void populateDisplayData(DisplayData.Builder builder)
Description copied from class:Source
Register display data for the given transform or component.populateDisplayData(DisplayData.Builder)
is invoked by Pipeline runners to collect display data viaDisplayData.from(HasDisplayData)
. Implementations may callsuper.populateDisplayData(builder)
in order to register display data in the current namespace, but should otherwise usesubcomponent.populateDisplayData(builder)
to use the namespace of the subcomponent.By default, does not register any display data. Implementors may override this method to provide their own display data.
- Specified by:
populateDisplayData
in interfaceHasDisplayData
- Overrides:
populateDisplayData
in classFileBasedSource<T>
- Parameters:
builder
- The builder to populate with display data.- See Also:
HasDisplayData
-
getOutputCoder
public final Coder<T> getOutputCoder()
Returns the delegate source's output coder.- Overrides:
getOutputCoder
in classSource<T>
-
getChannelFactory
public final CompressedSource.DecompressingChannelFactory getChannelFactory()
-
-