Class FileBasedSource<T>

    • Constructor Detail

      • FileBasedSource

        protected FileBasedSource​(ValueProvider<java.lang.String> fileOrPatternSpec,
                                  EmptyMatchTreatment emptyMatchTreatment,
                                  long minBundleSize)
        Create a FileBaseSource based on a file or a file pattern specification, with the given strategy for treating filepatterns that do not match any files.
      • FileBasedSource

        protected FileBasedSource​(MatchResult.Metadata fileMetadata,
                                  long minBundleSize,
                                  long startOffset,
                                  long endOffset)
        Create a FileBasedSource based on a single file. This constructor must be used when creating a new FileBasedSource for a subrange of a single file. Additionally, this constructor must be used to create new FileBasedSources when subclasses implement the method createForSubrangeOfFile(org.apache.beam.sdk.io.fs.MatchResult.Metadata, long, long).

        See OffsetBasedSource for detailed descriptions of minBundleSize, startOffset, and endOffset.

        Parameters:
        fileMetadata - specification of the file represented by the FileBasedSource, in suitable form for use with FileSystems.match(List).
        minBundleSize - minimum bundle size in bytes.
        startOffset - starting byte offset.
        endOffset - ending byte offset. If the specified value >= #getMaxEndOffset() it implies #getMaxEndOffSet().
    • Method Detail

      • getSingleFileMetadata

        public final MatchResult.Metadata getSingleFileMetadata()
        Returns the information about the single file that this source is reading from.
        Throws:
        java.lang.IllegalArgumentException - if this source is in FileBasedSource.Mode.FILEPATTERN mode.
      • getFileOrPatternSpec

        public final java.lang.String getFileOrPatternSpec()
      • getFileOrPatternSpecProvider

        public final ValueProvider<java.lang.String> getFileOrPatternSpecProvider()
      • createForSubrangeOfFile

        protected abstract FileBasedSource<T> createForSubrangeOfFile​(MatchResult.Metadata fileMetadata,
                                                                      long start,
                                                                      long end)
        Creates and returns a new FileBasedSource of the same type as the current FileBasedSource backed by a given file and an offset range. When current source is being split, this method is used to generate new sub-sources. When creating the source subclasses must call the constructor FileBasedSource(Metadata, long, long, long) of FileBasedSource with corresponding parameter values passed here.
        Parameters:
        fileMetadata - file backing the new FileBasedSource.
        start - starting byte offset of the new FileBasedSource.
        end - ending byte offset of the new FileBasedSource. May be Long.MAX_VALUE, in which case it will be inferred using getMaxEndOffset(org.apache.beam.sdk.options.PipelineOptions).
      • createSingleFileReader

        protected abstract FileBasedSource.FileBasedReader<T> createSingleFileReader​(PipelineOptions options)
        Creates and returns an instance of a FileBasedReader implementation for the current source assuming the source represents a single file. File patterns will be handled by FileBasedSource implementation automatically.
      • getEstimatedSizeBytes

        public final long getEstimatedSizeBytes​(PipelineOptions options)
                                         throws java.io.IOException
        Description copied from class: BoundedSource
        An estimate of the total size (in bytes) of the data that would be read from this source. This estimate is in terms of external storage size, before any decompression or other processing done by the reader.

        If there is no way to estimate the size of the source implementations MAY return 0L.

        Overrides:
        getEstimatedSizeBytes in class OffsetBasedSource<T>
        Throws:
        java.io.IOException
      • populateDisplayData

        public void populateDisplayData​(DisplayData.Builder builder)
        Description copied from class: Source
        Register display data for the given transform or component.

        populateDisplayData(DisplayData.Builder) is invoked by Pipeline runners to collect display data via DisplayData.from(HasDisplayData). Implementations may call super.populateDisplayData(builder) in order to register display data in the current namespace, but should otherwise use subcomponent.populateDisplayData(builder) to use the namespace of the subcomponent.

        By default, does not register any display data. Implementors may override this method to provide their own display data.

        Specified by:
        populateDisplayData in interface HasDisplayData
        Overrides:
        populateDisplayData in class OffsetBasedSource<T>
        Parameters:
        builder - The builder to populate with display data.
        See Also:
        HasDisplayData
      • split

        public final java.util.List<? extends FileBasedSource<T>> split​(long desiredBundleSizeBytes,
                                                                        PipelineOptions options)
                                                                 throws java.lang.Exception
        Description copied from class: BoundedSource
        Splits the source into bundles of approximately desiredBundleSizeBytes.
        Overrides:
        split in class OffsetBasedSource<T>
        Throws:
        java.lang.Exception
      • isSplittable

        protected boolean isSplittable()
                                throws java.lang.Exception
        Determines whether a file represented by this source is can be split into bundles.

        By default, a source in mode FileBasedSource.Mode.FILEPATTERN is always splittable, because splitting will involve expanding the file pattern and producing single-file/subrange sources, which may or may not be splittable themselves.

        By default, a source in FileBasedSource.Mode.SINGLE_FILE_OR_SUBRANGE is splittable if it is on a file system that supports efficient read seeking.

        Subclasses may override to provide different behavior.

        Throws:
        java.lang.Exception
      • validate

        public void validate()
        Description copied from class: Source
        Checks that this source is valid, before it can be used in a pipeline.

        It is recommended to use Preconditions for implementing this method.

        Overrides:
        validate in class OffsetBasedSource<T>
      • getMaxEndOffset

        public final long getMaxEndOffset​(PipelineOptions options)
                                   throws java.io.IOException
        Description copied from class: OffsetBasedSource
        Returns the actual ending offset of the current source. The value returned by this function will be used to clip the end of the range [startOffset, endOffset) such that the range used is [startOffset, min(endOffset, maxEndOffset)).

        As an example in which OffsetBasedSource is used to implement a file source, suppose that this source was constructed with an endOffset of Long.MAX_VALUE to indicate that a file should be read to the end. Then this function should determine the actual, exact size of the file in bytes and return it.

        Specified by:
        getMaxEndOffset in class OffsetBasedSource<T>
        Throws:
        java.io.IOException