public class ParquetReaderUtility extends Object
Modifier and Type | Class and Description |
---|---|
static class | ParquetReaderUtility.DateCorruptionStatus: For most recently created parquet files, we can determine if we have corrupted dates (see DRILL-4203) based on the file metadata. |
static class | ParquetReaderUtility.NanoTimeUtils: Utilities for converting from parquet INT96 binary (Impala, Hive timestamp) to a date-time value. |
Modifier and Type | Field and Description |
---|---|
static String | ALLOWED_DRILL_VERSION_FOR_BINARY |
static long | CORRECT_CORRUPT_DATE_SHIFT: All old parquet files (those without the "is.date.correct=true" or "parquet-writer.version" properties in their metadata) have a corrupt date shift of 4881176L days, i.e. 2 * 2440588L (see the sketch below this table). |
static int | DATE_CORRUPTION_THRESHOLD: The year 5000 (or day 1106685 from the Unix epoch) is chosen as the threshold for auto-detecting date corruption. |
static int | DRILL_WRITER_VERSION_STD_DATE_FORMAT: Version 2 (and later) of the Drill Parquet writer uses the date format described in the Parquet spec. |
static long | JULIAN_DAY_NUMBER_FOR_UNIX_EPOCH: Number of days between the Julian day epoch (January 1, 4713 BC) and the Unix day epoch (January 1, 1970). |
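To make the shift and the threshold concrete, here is a small stand-alone sketch (not Drill code) of the arithmetic these constants imply, assuming autoCorrectCorruptedDate(int) simply subtracts the shift; the class name and the sample value are made up.

```java
// Stand-alone sketch (not Drill code): the arithmetic implied by the constants above.
public class CorruptDateShiftSketch {

  // Twice the Julian day number of the Unix epoch: the shift carried by corrupt values.
  static final long CORRECT_CORRUPT_DATE_SHIFT = 2 * 2440588L;  // 4881176 days

  // Epoch day for the year 5000, used as the auto-detection threshold.
  static final int DATE_CORRUPTION_THRESHOLD = 1_106_685;

  // Presumably what autoCorrectCorruptedDate(int) does: drop the shift.
  static int correct(int corruptedEpochDays) {
    return (int) (corruptedEpochDays - CORRECT_CORRUPT_DATE_SHIFT);
  }

  public static void main(String[] args) {
    int stored = 4_899_438;                       // hypothetical value read from an old file
    if (stored > DATE_CORRUPTION_THRESHOLD) {     // far beyond year 5000 -> assume corrupt
      System.out.println(correct(stored));        // 18262, i.e. 2020-01-01 in epoch days
    }
  }
}
```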
Constructor and Description |
---|
ParquetReaderUtility() |
Modifier and Type | Method and Description |
---|---|
static int | autoCorrectCorruptedDate(int corruptedDate) |
static void | checkDecimalTypeEnabled(OptionManager options) |
static ParquetReaderUtility.DateCorruptionStatus | checkForCorruptDateValuesInStatistics(org.apache.parquet.hadoop.metadata.ParquetMetadata footer, List<SchemaPath> columns, boolean autoCorrectCorruptDates): Detects corrupt date values by looking at the min/max values in the metadata. |
static boolean | containsComplexColumn(org.apache.parquet.hadoop.metadata.ParquetMetadata footer, List<SchemaPath> columns): Checks whether any of the columns in the given list is either nested or repeated. |
static void | correctDatesInMetadataCache(MetadataBase.ParquetTableMetadataBase parquetTableMetadata) |
static ParquetReaderUtility.DateCorruptionStatus | detectCorruptDates(org.apache.parquet.hadoop.metadata.ParquetMetadata footer, List<SchemaPath> columns, boolean autoCorrectCorruptDates): Checks for corrupted dates in a parquet file. |
static Map<String,org.apache.parquet.column.ColumnDescriptor> | getColNameToColumnDescriptorMapping(org.apache.parquet.hadoop.metadata.ParquetMetadata footer): Maps full column paths to all ColumnDescriptors in the file schema. |
static Map<String,org.apache.parquet.format.SchemaElement> | getColNameToSchemaElementMapping(org.apache.parquet.hadoop.metadata.ParquetMetadata footer): Maps full schema paths in the format `a`.`b`.`c` to their respective SchemaElement objects. |
static List<TypeProtos.MajorType> | getComplexTypes(List<org.apache.parquet.schema.OriginalType> originalTypes): Converts a list of OriginalTypes to a list of TypeProtos.MajorTypes. |
static TypeProtos.DataMode | getDataMode(org.apache.parquet.schema.Type.Repetition repetition): Converts Parquet's Type.Repetition to Drill's TypeProtos.DataMode. |
static String | getFullColumnPath(org.apache.parquet.column.ColumnDescriptor column): Generates the full path of the column in the format `a`.`b`.`c`. |
static int | getIntFromLEBytes(byte[] input, int start) |
static TypeProtos.MinorType | getMinorType(org.apache.parquet.schema.PrimitiveType.PrimitiveTypeName type, org.apache.parquet.schema.OriginalType originalType): Builds a minor type using the given OriginalType originalType or PrimitiveTypeName type. |
static TypeProtos.MajorType | getType(org.apache.parquet.schema.PrimitiveType.PrimitiveTypeName type, org.apache.parquet.schema.OriginalType originalType, int precision, int scale): Builds a major type using the given OriginalType originalType or PrimitiveTypeName type. |
static boolean | isLogicalListType(org.apache.parquet.schema.GroupType groupType): Checks whether the group field approximately matches the pattern for Logical Lists. |
static boolean | isLogicalMapType(org.apache.parquet.schema.GroupType groupType): Checks whether the group field matches the pattern for the Logical Map type. |
static void | transformBinaryInMetadataCache(MetadataBase.ParquetTableMetadataBase parquetTableMetadata, ParquetReaderConfig readerConfig): Transforms values for min / max binary statistics to byte arrays. |
public static final long JULIAN_DAY_NUMBER_FOR_UNIX_EPOCH
public static final long CORRECT_CORRUPT_DATE_SHIFT
public static final int DATE_CORRUPTION_THRESHOLD
public static final int DRILL_WRITER_VERSION_STD_DATE_FORMAT
See Also: CORRECT_CORRUPT_DATE_SHIFT
public static final String ALLOWED_DRILL_VERSION_FOR_BINARY
public static void checkDecimalTypeEnabled(OptionManager options)
public static int getIntFromLEBytes(byte[] input, int start)
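The page gives no description for getIntFromLEBytes; as a point of reference, the following sketch shows what a little-endian 4-byte int read from an array looks like. Treat this as an assumption about the method's behavior, implemented here with plain java.nio.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Sketch: a little-endian int read from a byte array, which is presumably the
// behavior behind getIntFromLEBytes(byte[] input, int start).
class LittleEndianIntSketch {
  static int intFromLEBytes(byte[] input, int start) {
    return ByteBuffer.wrap(input, start, 4).order(ByteOrder.LITTLE_ENDIAN).getInt();
  }

  public static void main(String[] args) {
    byte[] bytes = {0x78, 0x56, 0x34, 0x12};                  // little-endian 0x12345678
    System.out.printf("0x%08X%n", intFromLEBytes(bytes, 0));  // prints 0x12345678
  }
}
```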
public static Map<String,org.apache.parquet.format.SchemaElement> getColNameToSchemaElementMapping(org.apache.parquet.hadoop.metadata.ParquetMetadata footer)
Maps full schema paths in the format `a`.`b`.`c` to their respective SchemaElement objects.
Parameters:
footer - Parquet file metadata

public static String getFullColumnPath(org.apache.parquet.column.ColumnDescriptor column)
Generates the full path of the column in the format `a`.`b`.`c`.
Parameters:
column - ColumnDescriptor object

public static Map<String,org.apache.parquet.column.ColumnDescriptor> getColNameToColumnDescriptorMapping(org.apache.parquet.hadoop.metadata.ParquetMetadata footer)
Maps full column paths to all ColumnDescriptors in the file schema.
Parameters:
footer - Parquet file metadata
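A hedged usage sketch for the three path-mapping helpers above. It assumes a ParquetMetadata footer obtained elsewhere (for example with parquet-hadoop's ParquetFileReader); the surrounding class and method names are illustrative.

```java
import java.util.Map;
import org.apache.drill.exec.store.parquet.ParquetReaderUtility;
import org.apache.parquet.column.ColumnDescriptor;
import org.apache.parquet.format.SchemaElement;
import org.apache.parquet.hadoop.metadata.ParquetMetadata;

class ColumnMappingSketch {
  // Illustrative: look up per-column metadata by the `a`.`b`.`c` style path.
  static void printColumnPaths(ParquetMetadata footer) {
    Map<String, ColumnDescriptor> descriptorsByPath =
        ParquetReaderUtility.getColNameToColumnDescriptorMapping(footer);
    Map<String, SchemaElement> schemaElementsByPath =
        ParquetReaderUtility.getColNameToSchemaElementMapping(footer);

    for (Map.Entry<String, ColumnDescriptor> e : descriptorsByPath.entrySet()) {
      // Presumably the same full-path format that getFullColumnPath(ColumnDescriptor) produces.
      String fullPath = ParquetReaderUtility.getFullColumnPath(e.getValue());
      System.out.println(fullPath + " -> " + schemaElementsByPath.get(e.getKey()));
    }
  }
}
```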
public static int autoCorrectCorruptedDate(int corruptedDate)

public static void correctDatesInMetadataCache(MetadataBase.ParquetTableMetadataBase parquetTableMetadata)
public static void transformBinaryInMetadataCache(MetadataBase.ParquetTableMetadataBase parquetTableMetadata, ParquetReaderConfig readerConfig)
Transforms values for min / max binary statistics to byte arrays.
Parameters:
parquetTableMetadata - table metadata that should be corrected
readerConfig - parquet reader config

public static ParquetReaderUtility.DateCorruptionStatus detectCorruptDates(org.apache.parquet.hadoop.metadata.ParquetMetadata footer, List<SchemaPath> columns, boolean autoCorrectCorruptDates)
Checks for corrupted dates in a parquet file.
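A sketch of how a caller might consult detectCorruptDates before decoding DATE columns. The footer and the option value are assumed to come from the surrounding reader, the column name is made up, and printing the returned status stands in for real handling; SchemaPath.getSimplePath is used here to build the column list.

```java
import java.util.Collections;
import java.util.List;
import org.apache.drill.common.expression.SchemaPath;
import org.apache.drill.exec.store.parquet.ParquetReaderUtility;
import org.apache.parquet.hadoop.metadata.ParquetMetadata;

class DateCorruptionCheckSketch {
  static ParquetReaderUtility.DateCorruptionStatus checkDates(ParquetMetadata footer,
                                                              boolean autoCorrectCorruptDates) {
    // Columns the query actually touches; a single (hypothetical) DATE column is assumed here.
    List<SchemaPath> columns =
        Collections.singletonList(SchemaPath.getSimplePath("order_date"));

    ParquetReaderUtility.DateCorruptionStatus status =
        ParquetReaderUtility.detectCorruptDates(footer, columns, autoCorrectCorruptDates);

    // The status tells the reader whether stored DATE values need the
    // CORRECT_CORRUPT_DATE_SHIFT removed while decoding (see autoCorrectCorruptedDate above).
    System.out.println("date corruption status: " + status);
    return status;
  }
}
```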
public static ParquetReaderUtility.DateCorruptionStatus checkForCorruptDateValuesInStatistics(org.apache.parquet.hadoop.metadata.ParquetMetadata footer, List<SchemaPath> columns, boolean autoCorrectCorruptDates)
Detects corrupt date values by looking at the min/max values in the metadata. It is intended for files that may carry the old date format, i.e. where ParquetRecordWriter.WRITER_VERSION_PROPERTY < DRILL_WRITER_VERSION_STD_DATE_FORMAT. This method only checks the first Row Group, because Drill has only ever written a single Row Group per file.
Parameters:
footer - parquet footer
columns - list of column schema paths
autoCorrectCorruptDates - user setting to allow enabling/disabling of auto-correction of corrupt dates. There are some rare cases (storing dates thousands of years into the future, with tools other than Drill writing the files) that would result in the date values being "corrected" into bad values.
public static TypeProtos.MajorType getType(org.apache.parquet.schema.PrimitiveType.PrimitiveTypeName type, org.apache.parquet.schema.OriginalType originalType, int precision, int scale)
Builds a major type using the given OriginalType originalType or PrimitiveTypeName type. For DECIMAL, the returned major type carries the given scale and precision.
Parameters:
type - parquet primitive type
originalType - parquet original type
scale - type scale (used for DECIMAL type)
precision - type precision (used for DECIMAL type)
public static TypeProtos.MinorType getMinorType(org.apache.parquet.schema.PrimitiveType.PrimitiveTypeName type, org.apache.parquet.schema.OriginalType originalType)
Builds a minor type using the given OriginalType originalType or PrimitiveTypeName type.
Parameters:
type - parquet primitive type
originalType - parquet original type
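A sketch of calling the two type-mapping helpers above for a DECIMAL column and a plain INT32 column. The precision and scale values are arbitrary examples, and passing null for a missing original type is an assumption.

```java
import org.apache.drill.common.types.TypeProtos;
import org.apache.drill.exec.store.parquet.ParquetReaderUtility;
import org.apache.parquet.schema.OriginalType;
import org.apache.parquet.schema.PrimitiveType.PrimitiveTypeName;

class TypeMappingSketch {
  static void mapTypes() {
    // DECIMAL stored as fixed-length bytes: the returned major type carries
    // the given precision (18) and scale (2).
    TypeProtos.MajorType decimalType = ParquetReaderUtility.getType(
        PrimitiveTypeName.FIXED_LEN_BYTE_ARRAY, OriginalType.DECIMAL, 18, 2);

    // Plain INT32 with no original type (assumed to be accepted as null here).
    TypeProtos.MinorType intType = ParquetReaderUtility.getMinorType(
        PrimitiveTypeName.INT32, null);

    System.out.println(decimalType.getMinorType() + " / " + intType);
  }
}
```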
public static boolean containsComplexColumn(org.apache.parquet.hadoop.metadata.ParquetMetadata footer, List<SchemaPath> columns)
Checks whether any of the columns in the given list is either nested or repeated.
Parameters:
footer - Parquet file schema
columns - list of query SchemaPath objects
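A small sketch of using containsComplexColumn to ask whether a projection touches nested or repeated data; the column names and the surrounding method are made up.

```java
import java.util.Arrays;
import java.util.List;
import org.apache.drill.common.expression.SchemaPath;
import org.apache.drill.exec.store.parquet.ParquetReaderUtility;
import org.apache.parquet.hadoop.metadata.ParquetMetadata;

class ComplexColumnCheckSketch {
  static boolean projectionHasComplexColumns(ParquetMetadata footer) {
    // Hypothetical projection: one scalar column and one that may be nested or repeated.
    List<SchemaPath> projected = Arrays.asList(
        SchemaPath.getSimplePath("id"),
        SchemaPath.getSimplePath("addresses"));
    return ParquetReaderUtility.containsComplexColumn(footer, projected);
  }
}
```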
public static List<TypeProtos.MajorType> getComplexTypes(List<org.apache.parquet.schema.OriginalType> originalTypes)
Converts a list of OriginalTypes to a list of TypeProtos.MajorTypes. NOTE: the current implementation only handles OriginalType.MAP and OriginalType.LIST, converting them to TypeProtos.MinorType.DICT and TypeProtos.MinorType.LIST respectively. Other original types are converted to null, because there is no definite correspondence between the two (nor a need for one: these types are only used to differentiate between Drill's MAP and DICT types, and arrays thereof, when constructing a TupleSchema).
Parameters:
originalTypes - list of Parquet's types
Returns:
list of null values or major types with minor type TypeProtos.MinorType.DICT or TypeProtos.MinorType.LIST
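A sketch of the conversion just described: MAP and LIST become DICT and LIST major types, anything else becomes null. The input list is illustrative.

```java
import java.util.Arrays;
import java.util.List;
import org.apache.drill.common.types.TypeProtos;
import org.apache.drill.exec.store.parquet.ParquetReaderUtility;
import org.apache.parquet.schema.OriginalType;

class ComplexTypesSketch {
  static void convert() {
    List<OriginalType> originals =
        Arrays.asList(OriginalType.MAP, OriginalType.LIST, OriginalType.UTF8);

    // Expected per the description above: [DICT major type, LIST major type, null].
    List<TypeProtos.MajorType> majors = ParquetReaderUtility.getComplexTypes(originals);
    System.out.println(majors);
  }
}
```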
public static boolean isLogicalListType(org.apache.parquet.schema.GroupType groupType)
Checks whether the group field approximately matches the pattern for Logical Lists:
<list-repetition> group <name> (LIST) { repeated group list { <element-repetition> <element-type> element; } }
(See https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#lists for more details.) Note that the standard field names 'list' and 'element' are intentionally not checked, because Hive lists use the names 'bag' and 'array_element' instead.
Parameters:
groupType - type which may have the LIST original type
public static boolean isLogicalMapType(org.apache.parquet.schema.GroupType groupType)
Checks whether the group field matches the pattern for the Logical Map type:
<map-repetition> group <name> (MAP) { repeated group key_value { required <key-type> key; <value-repetition> <value-type> value; } }
Note that the actual group names are not checked specifically.
Parameters:
groupType - parquet type which may be of MAP type
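A sketch that parses a small Parquet schema with parquet's MessageTypeParser and applies the two group checks above; the schema text is an example that follows the patterns shown.

```java
import org.apache.drill.exec.store.parquet.ParquetReaderUtility;
import org.apache.parquet.schema.GroupType;
import org.apache.parquet.schema.MessageType;
import org.apache.parquet.schema.MessageTypeParser;

class LogicalGroupCheckSketch {
  static void check() {
    // Example schema with one LIST group and one MAP group.
    MessageType schema = MessageTypeParser.parseMessageType(
        "message doc { "
      + "  optional group tags (LIST) { repeated group list { optional binary element (UTF8); } } "
      + "  optional group attrs (MAP) { repeated group key_value { required binary key (UTF8); optional int32 value; } } "
      + "}");

    GroupType tags = schema.getType("tags").asGroupType();
    GroupType attrs = schema.getType("attrs").asGroupType();

    System.out.println(ParquetReaderUtility.isLogicalListType(tags));  // expected: true
    System.out.println(ParquetReaderUtility.isLogicalMapType(attrs));  // expected: true
  }
}
```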
public static TypeProtos.DataMode getDataMode(org.apache.parquet.schema.Type.Repetition repetition)
Converts Parquet's Type.Repetition to Drill's TypeProtos.DataMode.
Parameters:
repetition - repetition to be converted
Copyright © 2022 The Apache Software Foundation. All rights reserved.