Package org.apache.parquet.avro

Provides classes to store Avro data in Parquet files.


Package org.apache.parquet.avro Description

Provides classes to store Avro data in Parquet files. Avro schemas are converted to Parquet schemas as follows. Only record schemas can be converted at the top level; attempting to convert any other top-level schema type results in an error. Avro types are converted to Parquet types using the following mapping:

Avro type   Parquet type
null        no type (the field is not encoded in Parquet), unless a null union
boolean     boolean
int         int32
long        int64
float       float
double      double
bytes       binary
string      binary (with original type UTF8)
record      group containing nested fields
enum        binary (with original type ENUM)
array       group (with original type LIST) containing one repeated group field
map         group (with original type MAP) containing one repeated group field (with original type MAP_KEY_VALUE) of (key, value)
fixed       fixed_len_byte_array
union       an optional type, in the case of a null union, otherwise not supported
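
The table above can be read as a lookup from an Avro type name to a Parquet type description. The following is a minimal sketch of that lookup using only the Java standard library. It is not this package's implementation: the class `AvroToParquetSketch`, its method names, and the shortened type descriptions are hypothetical, and a null union is assumed to mean a union of exactly two branches, one of which is null.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Optional;

// Hypothetical illustration of the mapping table; not part of org.apache.parquet.avro.
final class AvroToParquetSketch {

    // Looks up the Parquet type for a non-union Avro type name.
    // An empty result means the field is not encoded in Parquet at all.
    static Optional<String> toParquet(String avroType) {
        switch (avroType) {
            case "null":    return Optional.empty();
            case "boolean": return Optional.of("boolean");
            case "int":     return Optional.of("int32");
            case "long":    return Optional.of("int64");
            case "float":   return Optional.of("float");
            case "double":  return Optional.of("double");
            case "bytes":   return Optional.of("binary");
            case "string":  return Optional.of("binary (UTF8)");
            case "record":  return Optional.of("group");
            case "enum":    return Optional.of("binary (ENUM)");
            case "array":   return Optional.of("group (LIST)");
            case "map":     return Optional.of("group (MAP)");
            case "fixed":   return Optional.of("fixed_len_byte_array");
            default:
                throw new IllegalArgumentException("unknown Avro type: " + avroType);
        }
    }

    // Assumption: a null union has exactly two branches, one of them null.
    // Such a union becomes an optional field; any other union is rejected.
    static String unionToParquet(List<String> branches) {
        if (branches.size() == 2 && branches.contains("null")) {
            String other = branches.get(0).equals("null") ? branches.get(1) : branches.get(0);
            Optional<String> mapped = toParquet(other);
            if (mapped.isPresent()) {
                return "optional " + mapped.get();
            }
        }
        throw new UnsupportedOperationException("unsupported union: " + branches);
    }

    // Only record schemas are accepted at the top level.
    static String convertTopLevel(String avroType) {
        if (!avroType.equals("record")) {
            throw new IllegalArgumentException("top-level schema must be a record, got " + avroType);
        }
        return toParquet(avroType).get();
    }

    public static void main(String[] args) {
        System.out.println(toParquet("string").get());
        System.out.println(unionToParquet(Arrays.asList("null", "long")));
        System.out.println(convertTopLevel("record"));
    }
}
```

Returning an empty `Optional` for `null` keeps the "field is not encoded" case distinct from an unsupported type, which is reported with an exception.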

For Parquet files that were not written with classes from this package, there is no Avro write schema stored in the Parquet file metadata. To read such files using classes from this package, either provide an Avro read schema or let a default Avro schema be derived using the following mapping.

Parquet type                      Avro type
boolean                           boolean
int32                             int
int64                             long
int96                             not supported
float                             float
double                            double
fixed_len_byte_array              fixed
binary (with no original type)    bytes
binary (with original type UTF8)  string
binary (with original type ENUM)  string
group (with original type LIST) containing one repeated group field  array
group (with original type MAP) containing one repeated group field (with original type MAP_KEY_VALUE) of (key, value)  map

Parquet fields that are optional are mapped to an Avro null union.
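
The derived mapping, including the rule for optional fields, can be sketched in the same way. The block below is a self-contained illustration and not the package's implementation: the class `ParquetToAvroSketch` and its methods are hypothetical, the original type is passed as a plain string (or `null` when absent), and only the combinations listed in the mapping above are modeled.

```java
// Hypothetical illustration of the derived Parquet-to-Avro mapping.
final class ParquetToAvroSketch {

    // Maps a Parquet type and its original type (null when absent) to an Avro type name.
    static String toAvro(String parquetType, String originalType) {
        switch (parquetType) {
            case "boolean": return "boolean";
            case "int32":   return "int";
            case "int64":   return "long";
            case "float":   return "float";
            case "double":  return "double";
            case "fixed_len_byte_array": return "fixed";
            case "int96":
                throw new UnsupportedOperationException("int96 is not supported");
            case "binary":
                if (originalType == null) return "bytes";
                // UTF8 and ENUM binaries are both derived as Avro strings.
                if (originalType.equals("UTF8") || originalType.equals("ENUM")) return "string";
                break;
            case "group":
                if ("LIST".equals(originalType)) return "array";
                if ("MAP".equals(originalType)) return "map";
                break;
            default:
                break;
        }
        // Combinations not listed in the mapping are outside the scope of this sketch.
        throw new IllegalArgumentException(
                "not covered by this sketch: " + parquetType + " / " + originalType);
    }

    // An optional Parquet field becomes a union of null and the mapped type.
    // The result is a display string in Avro's JSON union notation.
    static String fieldToAvro(String repetition, String parquetType, String originalType) {
        String avro = toAvro(parquetType, originalType);
        if (repetition.equals("optional")) {
            return "[\"null\", \"" + avro + "\"]";
        }
        return "\"" + avro + "\"";
    }

    public static void main(String[] args) {
        System.out.println(fieldToAvro("required", "int64", null));
        System.out.println(fieldToAvro("optional", "binary", "UTF8"));
    }
}
```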

Some conversions are lossy. Avro nulls are not represented in Parquet, so they are lost when converted back to Avro. Similarly, a Parquet enum does not store its set of values, so it cannot be converted back to an Avro enum and is read as an Avro string instead. Type names for nested records, enums, and fixed types are lost in the conversion to Parquet. Avro aliases, default values, field ordering, and documentation strings are all dropped in the conversion to Parquet. Parquet maps can have keys of any type, whereas Avro map keys are always strings.
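
To make the information loss concrete, the following sketch round-trips a single enum field through simplified stand-ins for the two schema models. Everything in it is hypothetical and for illustration only: the classes `AvroField` and `ParquetField`, the enum name `Suit`, and its values are invented, and only the enum and string cases are modeled.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical illustration of a lossy Avro -> Parquet -> Avro round trip.
final class LossyRoundTripSketch {

    // Simplified stand-in for an Avro field schema.
    static final class AvroField {
        final String type;          // e.g. "enum" or "string"
        final String typeName;      // name of the enum type, or null
        final String doc;           // documentation string, or null
        final List<String> symbols; // enum values, empty for other types

        AvroField(String type, String typeName, String doc, List<String> symbols) {
            this.type = type;
            this.typeName = typeName;
            this.doc = doc;
            this.symbols = symbols;
        }
    }

    // Simplified stand-in for a Parquet field: a type plus its original type.
    static final class ParquetField {
        final String type;
        final String originalType;

        ParquetField(String type, String originalType) {
            this.type = type;
            this.originalType = originalType;
        }
    }

    // Avro to Parquet: the type name, doc string, and enum values have no place to go.
    static ParquetField toParquet(AvroField field) {
        switch (field.type) {
            case "enum":   return new ParquetField("binary", "ENUM");
            case "string": return new ParquetField("binary", "UTF8");
            default:
                throw new IllegalArgumentException("not modeled in this sketch: " + field.type);
        }
    }

    // Parquet to Avro: UTF8 and ENUM binaries are both derived as plain strings.
    static AvroField toAvro(ParquetField field) {
        boolean textual = "UTF8".equals(field.originalType) || "ENUM".equals(field.originalType);
        if (field.type.equals("binary") && textual) {
            return new AvroField("string", null, null, Collections.<String>emptyList());
        }
        throw new IllegalArgumentException("not modeled in this sketch: " + field.type);
    }

    public static void main(String[] args) {
        AvroField before = new AvroField(
                "enum", "Suit", "Card suit", Arrays.asList("HEARTS", "SPADES"));
        AvroField after = toAvro(toParquet(before));
        System.out.println(before.type + " " + before.symbols
                + " -> " + after.type + " " + after.symbols);
    }
}
```

The field that comes back is a string with no type name, documentation, or enum values, which matches the losses described above.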

Copyright © 2015 The Apache Software Foundation. All rights reserved.