Class FileBasedSource<T>
- java.lang.Object
-
- org.apache.beam.sdk.io.Source<T>
-
- org.apache.beam.sdk.io.BoundedSource<T>
-
- org.apache.beam.sdk.io.OffsetBasedSource<T>
-
- org.apache.beam.sdk.io.FileBasedSource<T>
-
- Type Parameters:
T
- Type of records represented by the source.
- All Implemented Interfaces:
java.io.Serializable
,HasDisplayData
- Direct Known Subclasses:
BlockBasedSource
,CompressedSource
public abstract class FileBasedSource<T> extends OffsetBasedSource<T>
A common base class for all file-basedSource
s. Extend this class to implement your own file-based custom source.A file-based
Source
is aSource
backed by a file pattern defined as a Java glob, a single file, or a offset range for a single file. SeeOffsetBasedSource
andRangeTracker
for semantics of offset ranges.This source stores a
String
that is aFileSystems
specification for a file or file pattern. There should be aFileSystem
registered for the file specification provided. Please refer toFileSystems
andFileSystem
for more information on this.In addition to the methods left abstract from
BoundedSource
, subclasses must implement methods to create a sub-source and a reader for a range of a single file -createForSubrangeOfFile(org.apache.beam.sdk.io.fs.MatchResult.Metadata, long, long)
andcreateSingleFileReader(org.apache.beam.sdk.options.PipelineOptions)
. Please refer toTextIO.TextSource
for an example implementation ofFileBasedSource
.- See Also:
- Serialized Form
-
-
Nested Class Summary
Nested Classes Modifier and Type Class Description static class
FileBasedSource.FileBasedReader<T>
Areader
that implements code common to readers ofFileBasedSource
s.static class
FileBasedSource.Mode
A givenFileBasedSource
represents a file resource of one of these types.-
Nested classes/interfaces inherited from class org.apache.beam.sdk.io.OffsetBasedSource
OffsetBasedSource.OffsetBasedReader<T>
-
Nested classes/interfaces inherited from class org.apache.beam.sdk.io.BoundedSource
BoundedSource.BoundedReader<T>
-
Nested classes/interfaces inherited from class org.apache.beam.sdk.io.Source
Source.Reader<T>
-
-
Constructor Summary
Constructors Modifier Constructor Description protected
FileBasedSource(MatchResult.Metadata fileMetadata, long minBundleSize, long startOffset, long endOffset)
Create aFileBasedSource
based on a single file.protected
FileBasedSource(ValueProvider<java.lang.String> fileOrPatternSpec, long minBundleSize)
LikeFileBasedSource(ValueProvider, EmptyMatchTreatment, long)
, but uses the default value ofEmptyMatchTreatment.DISALLOW
.protected
FileBasedSource(ValueProvider<java.lang.String> fileOrPatternSpec, EmptyMatchTreatment emptyMatchTreatment, long minBundleSize)
Create aFileBaseSource
based on a file or a file pattern specification, with the given strategy for treating filepatterns that do not match any files.
-
Method Summary
All Methods Instance Methods Abstract Methods Concrete Methods Modifier and Type Method Description protected abstract FileBasedSource<T>
createForSubrangeOfFile(MatchResult.Metadata fileMetadata, long start, long end)
Creates and returns a newFileBasedSource
of the same type as the currentFileBasedSource
backed by a given file and an offset range.BoundedSource.BoundedReader<T>
createReader(PipelineOptions options)
Returns a newBoundedSource.BoundedReader
that reads from this source.protected abstract FileBasedSource.FileBasedReader<T>
createSingleFileReader(PipelineOptions options)
Creates and returns an instance of aFileBasedReader
implementation for the current source assuming the source represents a single file.FileBasedSource<T>
createSourceForSubrange(long start, long end)
Returns anOffsetBasedSource
for a subrange of the current source.EmptyMatchTreatment
getEmptyMatchTreatment()
long
getEstimatedSizeBytes(PipelineOptions options)
An estimate of the total size (in bytes) of the data that would be read from this source.java.lang.String
getFileOrPatternSpec()
ValueProvider<java.lang.String>
getFileOrPatternSpecProvider()
long
getMaxEndOffset(PipelineOptions options)
Returns the actual ending offset of the current source.FileBasedSource.Mode
getMode()
MatchResult.Metadata
getSingleFileMetadata()
Returns the information about the single file that this source is reading from.protected boolean
isSplittable()
Determines whether a file represented by this source is can be split into bundles.void
populateDisplayData(DisplayData.Builder builder)
Register display data for the given transform or component.java.util.List<? extends FileBasedSource<T>>
split(long desiredBundleSizeBytes, PipelineOptions options)
Splits the source into bundles of approximatelydesiredBundleSizeBytes
.java.lang.String
toString()
void
validate()
Checks that this source is valid, before it can be used in a pipeline.-
Methods inherited from class org.apache.beam.sdk.io.OffsetBasedSource
getBytesPerOffset, getEndOffset, getMinBundleSize, getStartOffset
-
Methods inherited from class org.apache.beam.sdk.io.Source
getDefaultOutputCoder, getOutputCoder
-
-
-
-
Constructor Detail
-
FileBasedSource
protected FileBasedSource(ValueProvider<java.lang.String> fileOrPatternSpec, EmptyMatchTreatment emptyMatchTreatment, long minBundleSize)
Create aFileBaseSource
based on a file or a file pattern specification, with the given strategy for treating filepatterns that do not match any files.
-
FileBasedSource
protected FileBasedSource(ValueProvider<java.lang.String> fileOrPatternSpec, long minBundleSize)
LikeFileBasedSource(ValueProvider, EmptyMatchTreatment, long)
, but uses the default value ofEmptyMatchTreatment.DISALLOW
.
-
FileBasedSource
protected FileBasedSource(MatchResult.Metadata fileMetadata, long minBundleSize, long startOffset, long endOffset)
Create aFileBasedSource
based on a single file. This constructor must be used when creating a newFileBasedSource
for a subrange of a single file. Additionally, this constructor must be used to create newFileBasedSource
s when subclasses implement the methodcreateForSubrangeOfFile(org.apache.beam.sdk.io.fs.MatchResult.Metadata, long, long)
.See
OffsetBasedSource
for detailed descriptions ofminBundleSize
,startOffset
, andendOffset
.- Parameters:
fileMetadata
- specification of the file represented by theFileBasedSource
, in suitable form for use withFileSystems.match(List)
.minBundleSize
- minimum bundle size in bytes.startOffset
- starting byte offset.endOffset
- ending byte offset. If the specified value>= #getMaxEndOffset()
it implies#getMaxEndOffSet()
.
-
-
Method Detail
-
getSingleFileMetadata
public final MatchResult.Metadata getSingleFileMetadata()
Returns the information about the single file that this source is reading from.- Throws:
java.lang.IllegalArgumentException
- if this source is inFileBasedSource.Mode.FILEPATTERN
mode.
-
getFileOrPatternSpec
public final java.lang.String getFileOrPatternSpec()
-
getFileOrPatternSpecProvider
public final ValueProvider<java.lang.String> getFileOrPatternSpecProvider()
-
getEmptyMatchTreatment
public final EmptyMatchTreatment getEmptyMatchTreatment()
-
getMode
public final FileBasedSource.Mode getMode()
-
createSourceForSubrange
public final FileBasedSource<T> createSourceForSubrange(long start, long end)
Description copied from class:OffsetBasedSource
Returns anOffsetBasedSource
for a subrange of the current source. The subrange[start, end)
must be within the range[startOffset, endOffset)
of the current source, i.e.startOffset <= start < end <= endOffset
.- Specified by:
createSourceForSubrange
in classOffsetBasedSource<T>
-
createForSubrangeOfFile
protected abstract FileBasedSource<T> createForSubrangeOfFile(MatchResult.Metadata fileMetadata, long start, long end)
Creates and returns a newFileBasedSource
of the same type as the currentFileBasedSource
backed by a given file and an offset range. When current source is being split, this method is used to generate new sub-sources. When creating the source subclasses must call the constructorFileBasedSource(Metadata, long, long, long)
ofFileBasedSource
with corresponding parameter values passed here.- Parameters:
fileMetadata
- file backing the newFileBasedSource
.start
- starting byte offset of the newFileBasedSource
.end
- ending byte offset of the newFileBasedSource
. May be Long.MAX_VALUE, in which case it will be inferred usinggetMaxEndOffset(org.apache.beam.sdk.options.PipelineOptions)
.
-
createSingleFileReader
protected abstract FileBasedSource.FileBasedReader<T> createSingleFileReader(PipelineOptions options)
Creates and returns an instance of aFileBasedReader
implementation for the current source assuming the source represents a single file. File patterns will be handled byFileBasedSource
implementation automatically.
-
getEstimatedSizeBytes
public final long getEstimatedSizeBytes(PipelineOptions options) throws java.io.IOException
Description copied from class:BoundedSource
An estimate of the total size (in bytes) of the data that would be read from this source. This estimate is in terms of external storage size, before any decompression or other processing done by the reader.If there is no way to estimate the size of the source implementations MAY return 0L.
- Overrides:
getEstimatedSizeBytes
in classOffsetBasedSource<T>
- Throws:
java.io.IOException
-
populateDisplayData
public void populateDisplayData(DisplayData.Builder builder)
Description copied from class:Source
Register display data for the given transform or component.populateDisplayData(DisplayData.Builder)
is invoked by Pipeline runners to collect display data viaDisplayData.from(HasDisplayData)
. Implementations may callsuper.populateDisplayData(builder)
in order to register display data in the current namespace, but should otherwise usesubcomponent.populateDisplayData(builder)
to use the namespace of the subcomponent.By default, does not register any display data. Implementors may override this method to provide their own display data.
- Specified by:
populateDisplayData
in interfaceHasDisplayData
- Overrides:
populateDisplayData
in classOffsetBasedSource<T>
- Parameters:
builder
- The builder to populate with display data.- See Also:
HasDisplayData
-
split
public final java.util.List<? extends FileBasedSource<T>> split(long desiredBundleSizeBytes, PipelineOptions options) throws java.lang.Exception
Description copied from class:BoundedSource
Splits the source into bundles of approximatelydesiredBundleSizeBytes
.- Overrides:
split
in classOffsetBasedSource<T>
- Throws:
java.lang.Exception
-
isSplittable
protected boolean isSplittable() throws java.lang.Exception
Determines whether a file represented by this source is can be split into bundles.By default, a source in mode
FileBasedSource.Mode.FILEPATTERN
is always splittable, because splitting will involve expanding the file pattern and producing single-file/subrange sources, which may or may not be splittable themselves.By default, a source in
FileBasedSource.Mode.SINGLE_FILE_OR_SUBRANGE
is splittable if it is on a file system that supports efficient read seeking.Subclasses may override to provide different behavior.
- Throws:
java.lang.Exception
-
createReader
public final BoundedSource.BoundedReader<T> createReader(PipelineOptions options) throws java.io.IOException
Description copied from class:BoundedSource
Returns a newBoundedSource.BoundedReader
that reads from this source.- Specified by:
createReader
in classBoundedSource<T>
- Throws:
java.io.IOException
-
toString
public java.lang.String toString()
- Overrides:
toString
in classOffsetBasedSource<T>
-
validate
public void validate()
Description copied from class:Source
Checks that this source is valid, before it can be used in a pipeline.It is recommended to use
Preconditions
for implementing this method.- Overrides:
validate
in classOffsetBasedSource<T>
-
getMaxEndOffset
public final long getMaxEndOffset(PipelineOptions options) throws java.io.IOException
Description copied from class:OffsetBasedSource
Returns the actual ending offset of the current source. The value returned by this function will be used to clip the end of the range[startOffset, endOffset)
such that the range used is[startOffset, min(endOffset, maxEndOffset))
.As an example in which
OffsetBasedSource
is used to implement a file source, suppose that this source was constructed with anendOffset
ofLong.MAX_VALUE
to indicate that a file should be read to the end. Then this function should determine the actual, exact size of the file in bytes and return it.- Specified by:
getMaxEndOffset
in classOffsetBasedSource<T>
- Throws:
java.io.IOException
-
-