Package org.apache.beam.sdk.io
Class TextRowCountEstimator
- java.lang.Object
  - org.apache.beam.sdk.io.TextRowCountEstimator
public abstract class TextRowCountEstimator extends java.lang.Object
Returns an estimated row count for the files matching a file pattern.
-
Nested Class Summary
Nested Classes
Modifier and Type   Class                                           Description
static class        TextRowCountEstimator.Builder                   Builder for TextRowCountEstimator.
static class        TextRowCountEstimator.LimitNumberOfFiles        This strategy stops sampling once enough files have been sampled.
static class        TextRowCountEstimator.LimitNumberOfTotalBytes   This strategy stops sampling when the total number of sampled bytes exceeds some threshold.
static class        TextRowCountEstimator.NoEstimationException     An exception thrown if the estimator cannot produce an estimate of the number of lines.
static class        TextRowCountEstimator.SampleAllFiles            This strategy samples all the files.
static interface    TextRowCountEstimator.SamplingStrategy          A sampling strategy determines when to stop reading further files.
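The three strategies listed above differ only in their stop condition. The following is a minimal, self-contained sketch of that idea; it does not use Beam. The interface shape, the method name stopSampling, the helper countSampledFiles, and the exact comparison used for each threshold are assumptions made for illustration, not Beam's implementation.

```java
// Illustrative sketch only: these types borrow the names from the summary
// above but are NOT Beam's classes; signatures and thresholds are assumptions.
class SamplingStrategySketch {

  /** Hypothetical stand-in for a sampling strategy: decides when to stop. */
  interface SamplingStrategy {
    boolean stopSampling(int sampledFiles, long sampledBytes);
  }

  /** Never stops early, so every matched file is sampled. */
  static SamplingStrategy sampleAllFiles() {
    return (files, bytes) -> false;
  }

  /** Stops once `limit` files have been sampled (assumed semantics). */
  static SamplingStrategy limitNumberOfFiles(int limit) {
    return (files, bytes) -> files >= limit;
  }

  /** Stops once more than `limit` bytes have been sampled in total. */
  static SamplingStrategy limitNumberOfTotalBytes(long limit) {
    return (files, bytes) -> bytes > limit;
  }

  /** Returns how many files are sampled before the strategy says stop. */
  static int countSampledFiles(
      SamplingStrategy strategy, long[] fileSizes, long bytesPerFile) {
    int sampledFiles = 0;
    long sampledBytes = 0;
    for (long size : fileSizes) {
      if (strategy.stopSampling(sampledFiles, sampledBytes)) {
        break;
      }
      // Read at most bytesPerFile bytes from each file.
      sampledBytes += Math.min(size, bytesPerFile);
      sampledFiles++;
    }
    return sampledFiles;
  }

  public static void main(String[] args) {
    long[] fileSizes = {100, 100, 100, 100};
    System.out.println(countSampledFiles(sampleAllFiles(), fileSizes, 50));
    System.out.println(countSampledFiles(limitNumberOfFiles(2), fileSizes, 50));
    System.out.println(countSampledFiles(limitNumberOfTotalBytes(120), fileSizes, 50));
  }
}
```

Checking the strategy before each file, rather than after, keeps the loop uniform: a strategy that never stops simply visits every file.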
-
Constructor Summary
Constructors
Constructor                Description
TextRowCountEstimator()
-
Method Summary
Methods
Modifier and Type                                 Method                                              Description
static TextRowCountEstimator.Builder              builder()
java.lang.Double                                  estimateRowCount(PipelineOptions pipelineOptions)   Estimates the number of non-empty rows.
abstract Compression                              getCompression()
abstract byte @Nullable []                        getDelimiters()
abstract FileIO.ReadMatches.DirectoryTreatment    getDirectoryTreatment()
abstract EmptyMatchTreatment                      getEmptyMatchTreatment()
abstract java.lang.String                         getFilePattern()
abstract long                                     getNumSampledBytesPerFile()
abstract TextRowCountEstimator.SamplingStrategy   getSamplingStrategy()
-
Method Detail
-
getNumSampledBytesPerFile
public abstract long getNumSampledBytesPerFile()
-
getDelimiters
public abstract byte @Nullable [] getDelimiters()
-
getFilePattern
public abstract java.lang.String getFilePattern()
-
getCompression
public abstract Compression getCompression()
-
getSamplingStrategy
public abstract TextRowCountEstimator.SamplingStrategy getSamplingStrategy()
-
getEmptyMatchTreatment
public abstract EmptyMatchTreatment getEmptyMatchTreatment()
-
getDirectoryTreatment
public abstract FileIO.ReadMatches.DirectoryTreatment getDirectoryTreatment()
-
builder
public static TextRowCountEstimator.Builder builder()
-
estimateRowCount
public java.lang.Double estimateRowCount(PipelineOptions pipelineOptions) throws java.io.IOException, TextRowCountEstimator.NoEstimationException
Estimates the number of non-empty rows. It samples NumSampledBytesPerFile bytes from each file until the stop condition of the sampling strategy is met. It then computes the average line size of the sampled rows and divides the total size of the files by that average. If all the sampled rows are empty and not all lines have been sampled (because the sampling strategy stopped early), it throws NoEstimationException.
- Returns:
- Estimated number of rows.
- Throws:
- TextRowCountEstimator.NoEstimationException - if all the sampled lines are empty and not all the lines in the matched files have been read.
- java.io.IOException
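The arithmetic described for estimateRowCount can be sketched without Beam as follows. This is an illustration under stated assumptions, not the library's code: the class RowCountEstimateSketch, its nested NoEstimationException, the sampledEverything flag, the newline-only delimiter, and the choice to return 0 when every line was read and all were empty are all hypothetical.

```java
import java.nio.charset.StandardCharsets;

// Illustrative sketch only; names and edge-case behavior are assumptions.
class RowCountEstimateSketch {

  /** Hypothetical stand-in for TextRowCountEstimator.NoEstimationException. */
  static class NoEstimationException extends Exception {
    NoEstimationException(String message) {
      super(message);
    }
  }

  /**
   * Estimates rows as (total size of files) / (average non-empty line size).
   *
   * @param sampledText text sampled from the files, lines separated by '\n'
   * @param totalFileBytes combined size of all matched files, in bytes
   * @param sampledEverything true if the sample covers every line of every file
   */
  static double estimateRowCount(
      String sampledText, long totalFileBytes, boolean sampledEverything)
      throws NoEstimationException {
    long nonEmptyLines = 0;
    long nonEmptyLineBytes = 0;
    for (String line : sampledText.split("\n")) {
      if (!line.isEmpty()) {
        nonEmptyLines++;
        // +1 accounts for the newline delimiter that ends the line.
        nonEmptyLineBytes += line.getBytes(StandardCharsets.UTF_8).length + 1;
      }
    }
    if (nonEmptyLines == 0) {
      if (sampledEverything) {
        return 0.0; // assumed: every line was read and none had content
      }
      throw new NoEstimationException(
          "All sampled lines are empty, but not all lines were sampled.");
    }
    double averageLineSize = (double) nonEmptyLineBytes / nonEmptyLines;
    return totalFileBytes / averageLineSize;
  }

  public static void main(String[] args) throws Exception {
    // Two non-empty lines of 4 bytes each (including '\n'), one empty line.
    System.out.println(estimateRowCount("aaa\n\nbbb\n", 400, false));
  }
}
```

In this sketch the sample "aaa\n\nbbb\n" has an average non-empty line size of 4 bytes, so 400 bytes of files yield an estimate of about 100 rows. Empty lines are excluded from the average, which matches the stated goal of counting non-empty rows.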
-