Package org.apache.parquet.avro
Class AvroParquetInputFormat<T>
- java.lang.Object
- org.apache.hadoop.mapreduce.InputFormat<K,V>
- org.apache.hadoop.mapreduce.lib.input.FileInputFormat<Void,T>
- org.apache.parquet.hadoop.ParquetInputFormat<T>
- org.apache.parquet.avro.AvroParquetInputFormat<T>
Type Parameters:
T - the Java type of objects produced by this InputFormat
public class AvroParquetInputFormat<T> extends org.apache.parquet.hadoop.ParquetInputFormat<T>
A Hadoop InputFormat for Parquet files.
Field Summary
Fields inherited from class org.apache.parquet.hadoop.ParquetInputFormat
BLOOM_FILTERING_ENABLED, COLUMN_INDEX_FILTERING_ENABLED, DICTIONARY_FILTERING_ENABLED, FILTER_PREDICATE, PAGE_VERIFY_CHECKSUM_ENABLED, READ_SUPPORT_CLASS, RECORD_FILTERING_ENABLED, SPLIT_FILES, STATS_FILTERING_ENABLED, STRICT_TYPE_CHECKING, TASK_SIDE_METADATA, UNBOUND_RECORD_FILTER
Constructor Summary
Constructors
AvroParquetInputFormat()
-
Method Summary
static void setAvroDataSupplier(org.apache.hadoop.mapreduce.Job job, Class<? extends AvroDataSupplier> supplierClass)
Uses an instance of the specified AvroDataSupplier class to control how the SpecificData instance that is used to find Avro specific records is created.
static void setAvroReadSchema(org.apache.hadoop.mapreduce.Job job, org.apache.avro.Schema avroReadSchema)
Override the Avro schema to use for reading.
static void setRequestedProjection(org.apache.hadoop.mapreduce.Job job, org.apache.avro.Schema requestedProjection)
Set the subset of columns to read (projection pushdown).
Methods inherited from class org.apache.parquet.hadoop.ParquetInputFormat
createRecordReader, getFilter, getFooters, getFooters, getFooters, getGlobalMetaData, getReadSupportClass, getReadSupportInstance, getSplits, getSplits, getUnboundRecordFilter, isSplitable, isTaskSideMetaData, listStatus, setFilterPredicate, setReadSupportClass, setReadSupportClass, setTaskSideMetaData, setUnboundRecordFilter
-
Methods inherited from class org.apache.hadoop.mapreduce.lib.input.FileInputFormat
addInputPath, addInputPathRecursively, addInputPaths, computeSplitSize, getBlockIndex, getFormatMinSplitSize, getInputDirRecursive, getInputPathFilter, getInputPaths, getMaxSplitSize, getMinSplitSize, makeSplit, makeSplit, setInputDirRecursive, setInputPathFilter, setInputPaths, setInputPaths, setMaxInputSplitSize, setMinInputSplitSize
Method Detail
setRequestedProjection
public static void setRequestedProjection(org.apache.hadoop.mapreduce.Job job, org.apache.avro.Schema requestedProjection)
Set the subset of columns to read (projection pushdown). Specified as an Avro schema, the requested projection is converted into a Parquet schema for Parquet column projection.
This is useful if the full schema is large and you only want to read a few columns, since it saves time by not reading unused columns.
If a requested projection is set, then the Avro schema used for reading must be compatible with the projection. For instance, if a column is not included in the projection, then the read schema must either omit that column or declare it optional. Use setAvroReadSchema(org.apache.hadoop.mapreduce.Job, org.apache.avro.Schema) to set a read schema, if needed.
Parameters:
job - a job
requestedProjection - the requested projection schema
See Also:
setAvroReadSchema(org.apache.hadoop.mapreduce.Job, org.apache.avro.Schema), AvroParquetOutputFormat.setSchema(org.apache.hadoop.mapreduce.Job, org.apache.avro.Schema)
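A minimal sketch of how a projection might be requested when configuring a job. The record name User, the namespace com.example, the fields id and name, the class name ProjectionExample, and the input directory are all hypothetical and only serve the illustration; it assumes the Avro, Hadoop, and parquet-avro libraries are on the classpath.

```java
import java.io.IOException;

import org.apache.avro.Schema;
import org.apache.avro.SchemaBuilder;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.parquet.avro.AvroParquetInputFormat;

public class ProjectionExample {

    // Hypothetical projection: only "id" and "name" out of a larger "User" record.
    static Schema userProjection() {
        return SchemaBuilder.record("User").namespace("com.example")
            .fields()
            .requiredLong("id")
            .requiredString("name")
            .endRecord();
    }

    // Builds a job that reads Parquet files through AvroParquetInputFormat
    // and asks for the two projected columns only.
    static Job configure(Configuration conf, String inputDir) throws IOException {
        Job job = Job.getInstance(conf, "projection-example");
        job.setInputFormatClass(AvroParquetInputFormat.class);
        FileInputFormat.addInputPath(job, new Path(inputDir));
        AvroParquetInputFormat.setRequestedProjection(job, userProjection());
        return job;
    }
}
```

Because no read schema is set in this sketch, any record fields outside the projection are simply not part of what the job asks for.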
-
setAvroReadSchema
public static void setAvroReadSchema(org.apache.hadoop.mapreduce.Job job, org.apache.avro.Schema avroReadSchema)
Override the Avro schema to use for reading. If not set, the Avro schema used for writing is used.
Differences between the read and write schemas are resolved using Avro's schema resolution rules.
Parameters:
job - a job
avroReadSchema - the requested schema
See Also:
setRequestedProjection(org.apache.hadoop.mapreduce.Job, org.apache.avro.Schema), AvroParquetOutputFormat.setSchema(org.apache.hadoop.mapreduce.Job, org.apache.avro.Schema)
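As an illustration, a read schema could be set as sketched below. The User record, the com.example namespace, the field names, and the class name ReadSchemaExample are hypothetical. The nickname field is declared optional (a union with null and a null default), which is one way a read schema can stay compatible when that column is absent from the written data or from a requested projection.

```java
import java.io.IOException;

import org.apache.avro.Schema;
import org.apache.avro.SchemaBuilder;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import org.apache.parquet.avro.AvroParquetInputFormat;

public class ReadSchemaExample {

    // Hypothetical read schema: two required fields plus one optional field.
    // optionalString produces a ["null", "string"] union with a null default.
    static Schema readSchema() {
        return SchemaBuilder.record("User").namespace("com.example")
            .fields()
            .requiredLong("id")
            .requiredString("name")
            .optionalString("nickname")
            .endRecord();
    }

    // Overrides the schema used for reading instead of relying on the
    // schema that was used for writing.
    static Job configure(Configuration conf) throws IOException {
        Job job = Job.getInstance(conf, "read-schema-example");
        job.setInputFormatClass(AvroParquetInputFormat.class);
        AvroParquetInputFormat.setAvroReadSchema(job, readSchema());
        return job;
    }
}
```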
-
setAvroDataSupplier
public static void setAvroDataSupplier(org.apache.hadoop.mapreduce.Job job, Class<? extends AvroDataSupplier> supplierClass)
Uses an instance of the specified AvroDataSupplier class to control how the SpecificData instance that is used to find Avro specific records is created.
Parameters:
job - a job
supplierClass - an Avro data supplier class
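One possible use, sketched under assumptions: a supplier that builds its SpecificData from the thread context class loader, which may help when generated record classes are not visible to the default loader. The class names CustomSupplierExample and ContextLoaderSupplier are hypothetical; the sketch relies on the AvroDataSupplier interface exposing a single get() method that returns a GenericData.

```java
import java.io.IOException;

import org.apache.avro.generic.GenericData;
import org.apache.avro.specific.SpecificData;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import org.apache.parquet.avro.AvroDataSupplier;
import org.apache.parquet.avro.AvroParquetInputFormat;

public class CustomSupplierExample {

    // Hypothetical supplier. It is public with a no-argument constructor
    // so that it can be instantiated from its class name at read time.
    public static class ContextLoaderSupplier implements AvroDataSupplier {
        @Override
        public GenericData get() {
            // Resolve specific record classes through the context class loader.
            return new SpecificData(Thread.currentThread().getContextClassLoader());
        }
    }

    // Registers the supplier class on the job.
    static Job configure(Configuration conf) throws IOException {
        Job job = Job.getInstance(conf, "supplier-example");
        job.setInputFormatClass(AvroParquetInputFormat.class);
        AvroParquetInputFormat.setAvroDataSupplier(job, ContextLoaderSupplier.class);
        return job;
    }
}
```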