public class ParquetReaderUtility extends Object
Modifier and Type | Class and Description |
---|---|
static class | ParquetReaderUtility.DateCorruptionStatus: For most recently created parquet files, we can determine if we have corrupted dates (see DRILL-4203) based on the file metadata. |
static class | ParquetReaderUtility.NanoTimeUtils: Utilities for converting from parquet INT96 binary (Impala, Hive timestamp) to a date-time value. |
Modifier and Type | Field and Description |
---|---|
static String | ALLOWED_DRILL_VERSION_FOR_BINARY |
static long | CORRECT_CORRUPT_DATE_SHIFT: All old parquet files (those without the "is.date.correct=true" or "parquet-writer.version" properties in their metadata) have a corrupt date shift of 4881176L days, i.e. 2 * 2440588L (see the sketch below this table). |
static int | DATE_CORRUPTION_THRESHOLD: The year 5000 (or day 1106685 from the Unix epoch) is chosen as the threshold for auto-detecting date corruption. |
static int | DRILL_WRITER_VERSION_STD_DATE_FORMAT: Version 2 (and later) of the Drill Parquet writer uses the date format described in the Parquet spec. |
static long | JULIAN_DAY_NUMBER_FOR_UNIX_EPOCH: Number of days between the Julian day epoch (January 1, 4713 BC) and the Unix day epoch (January 1, 1970). |
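To make the shift and the threshold concrete, here is a small stand-alone sketch (not Drill code) of the arithmetic these constants imply, assuming autoCorrectCorruptedDate(int) simply subtracts the shift; the class name and the sample value are made up.

```java
// Stand-alone sketch (not Drill code): the arithmetic implied by the constants above.
public class CorruptDateShiftSketch {

  // Twice the Julian day number of the Unix epoch: the shift carried by corrupt values.
  static final long CORRECT_CORRUPT_DATE_SHIFT = 2 * 2440588L;  // 4881176 days

  // Epoch day for the year 5000, used as the auto-detection threshold.
  static final int DATE_CORRUPTION_THRESHOLD = 1_106_685;

  // Presumably what autoCorrectCorruptedDate(int) does: drop the shift.
  static int correct(int corruptedEpochDays) {
    return (int) (corruptedEpochDays - CORRECT_CORRUPT_DATE_SHIFT);
  }

  public static void main(String[] args) {
    int stored = 4_899_438;                       // hypothetical value read from an old file
    if (stored > DATE_CORRUPTION_THRESHOLD) {     // far beyond year 5000 -> assume corrupt
      System.out.println(correct(stored));        // 18262, i.e. 2020-01-01 in epoch days
    }
  }
}
```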
Constructor and Description |
---|
ParquetReaderUtility() |
Modifier and Type | Method and Description |
---|---|
static int | autoCorrectCorruptedDate(int corruptedDate) |
static void | checkDecimalTypeEnabled(OptionManager options) |
static ParquetReaderUtility.DateCorruptionStatus | checkForCorruptDateValuesInStatistics(org.apache.parquet.hadoop.metadata.ParquetMetadata footer, List<SchemaPath> columns, boolean autoCorrectCorruptDates): Detects corrupt date values by looking at the min/max values in the metadata. |
static boolean | containsComplexColumn(org.apache.parquet.hadoop.metadata.ParquetMetadata footer, List<SchemaPath> columns): Checks whether any of the columns in the given list is either nested or repeated. |
static void | correctDatesInMetadataCache(MetadataBase.ParquetTableMetadataBase parquetTableMetadata) |
static ParquetReaderUtility.DateCorruptionStatus | detectCorruptDates(org.apache.parquet.hadoop.metadata.ParquetMetadata footer, List<SchemaPath> columns, boolean autoCorrectCorruptDates): Checks for corrupted dates in a parquet file. |
static Map<String,org.apache.parquet.column.ColumnDescriptor> | getColNameToColumnDescriptorMapping(org.apache.parquet.hadoop.metadata.ParquetMetadata footer): Maps full column paths to all ColumnDescriptors in the file schema. |
static Map<String,org.apache.parquet.format.SchemaElement> | getColNameToSchemaElementMapping(org.apache.parquet.hadoop.metadata.ParquetMetadata footer): Maps full schema paths in the format `a`.`b`.`c` to their respective SchemaElement objects. |
static List<TypeProtos.MajorType> | getComplexTypes(List<org.apache.parquet.schema.OriginalType> originalTypes): Converts a list of OriginalTypes to a list of TypeProtos.MajorTypes. |
static TypeProtos.DataMode | getDataMode(org.apache.parquet.schema.Type.Repetition repetition): Converts Parquet's Type.Repetition to Drill's TypeProtos.DataMode. |
static String | getFullColumnPath(org.apache.parquet.column.ColumnDescriptor column): Generates the full path of the column in the format `a`.`b`.`c`. |
static int | getIntFromLEBytes(byte[] input, int start) |
static TypeProtos.MinorType | getMinorType(org.apache.parquet.schema.PrimitiveType.PrimitiveTypeName type, org.apache.parquet.schema.OriginalType originalType): Builds a minor type using the given OriginalType originalType or PrimitiveTypeName type. |
static TypeProtos.MajorType | getType(org.apache.parquet.schema.PrimitiveType.PrimitiveTypeName type, org.apache.parquet.schema.OriginalType originalType, int precision, int scale): Builds a major type using the given OriginalType originalType or PrimitiveTypeName type. |
static boolean | isLogicalListType(org.apache.parquet.schema.GroupType groupType): Checks whether the group field approximately matches the pattern for Logical Lists. |
static boolean | isLogicalMapType(org.apache.parquet.schema.GroupType groupType): Checks whether the group field matches the pattern for the Logical Map type. |
static void | transformBinaryInMetadataCache(MetadataBase.ParquetTableMetadataBase parquetTableMetadata, ParquetReaderConfig readerConfig): Transforms values for min / max binary statistics to byte arrays. |
public static final long JULIAN_DAY_NUMBER_FOR_UNIX_EPOCH
public static final long CORRECT_CORRUPT_DATE_SHIFT
public static final int DATE_CORRUPTION_THRESHOLD
public static final int DRILL_WRITER_VERSION_STD_DATE_FORMAT
See Also: CORRECT_CORRUPT_DATE_SHIFT
public static final String ALLOWED_DRILL_VERSION_FOR_BINARY
public static void checkDecimalTypeEnabled(OptionManager options)
public static int getIntFromLEBytes(byte[] input, int start)
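The page gives no description for getIntFromLEBytes; as a point of reference, the following sketch shows what a little-endian 4-byte int read from an array looks like. Treat this as an assumption about the method's behavior, implemented here with plain java.nio.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Sketch: a little-endian int read from a byte array, which is presumably the
// behavior behind getIntFromLEBytes(byte[] input, int start).
class LittleEndianIntSketch {
  static int intFromLEBytes(byte[] input, int start) {
    return ByteBuffer.wrap(input, start, 4).order(ByteOrder.LITTLE_ENDIAN).getInt();
  }

  public static void main(String[] args) {
    byte[] bytes = {0x78, 0x56, 0x34, 0x12};                  // little-endian 0x12345678
    System.out.printf("0x%08X%n", intFromLEBytes(bytes, 0));  // prints 0x12345678
  }
}
```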
public static Map<String,org.apache.parquet.format.SchemaElement> getColNameToSchemaElementMapping(org.apache.parquet.hadoop.metadata.ParquetMetadata footer)
Maps full schema paths in the format `a`.`b`.`c` to their respective SchemaElement objects.
Parameters:
footer - Parquet file metadata

public static String getFullColumnPath(org.apache.parquet.column.ColumnDescriptor column)
Generates the full path of the column in the format `a`.`b`.`c`.
Parameters:
column - ColumnDescriptor object

public static Map<String,org.apache.parquet.column.ColumnDescriptor> getColNameToColumnDescriptorMapping(org.apache.parquet.hadoop.metadata.ParquetMetadata footer)
Maps full column paths to all ColumnDescriptors in the file schema.
Parameters:
footer - Parquet file metadata
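A hedged usage sketch for the three path-mapping helpers above. It assumes a ParquetMetadata footer obtained elsewhere (for example with parquet-hadoop's ParquetFileReader); the surrounding class and method names are illustrative.

```java
import java.util.Map;
import org.apache.drill.exec.store.parquet.ParquetReaderUtility;
import org.apache.parquet.column.ColumnDescriptor;
import org.apache.parquet.format.SchemaElement;
import org.apache.parquet.hadoop.metadata.ParquetMetadata;

class ColumnMappingSketch {
  // Illustrative: look up per-column metadata by the `a`.`b`.`c` style path.
  static void printColumnPaths(ParquetMetadata footer) {
    Map<String, ColumnDescriptor> descriptorsByPath =
        ParquetReaderUtility.getColNameToColumnDescriptorMapping(footer);
    Map<String, SchemaElement> schemaElementsByPath =
        ParquetReaderUtility.getColNameToSchemaElementMapping(footer);

    for (Map.Entry<String, ColumnDescriptor> e : descriptorsByPath.entrySet()) {
      // Presumably the same full-path format that getFullColumnPath(ColumnDescriptor) produces.
      String fullPath = ParquetReaderUtility.getFullColumnPath(e.getValue());
      System.out.println(fullPath + " -> " + schemaElementsByPath.get(e.getKey()));
    }
  }
}
```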
public static int autoCorrectCorruptedDate(int corruptedDate)

public static void correctDatesInMetadataCache(MetadataBase.ParquetTableMetadataBase parquetTableMetadata)
public static void transformBinaryInMetadataCache(MetadataBase.ParquetTableMetadataBase parquetTableMetadata, ParquetReaderConfig readerConfig)
Transforms values for min / max binary statistics to byte arrays.
Parameters:
parquetTableMetadata - table metadata that should be corrected
readerConfig - parquet reader config

public static ParquetReaderUtility.DateCorruptionStatus detectCorruptDates(org.apache.parquet.hadoop.metadata.ParquetMetadata footer, List<SchemaPath> columns, boolean autoCorrectCorruptDates)
Checks for corrupted dates in a parquet file.
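A sketch of how a caller might consult detectCorruptDates before decoding DATE columns. The footer and the option value are assumed to come from the surrounding reader, the column name is made up, and printing the returned status stands in for real handling; SchemaPath.getSimplePath is used here to build the column list.

```java
import java.util.Collections;
import java.util.List;
import org.apache.drill.common.expression.SchemaPath;
import org.apache.drill.exec.store.parquet.ParquetReaderUtility;
import org.apache.parquet.hadoop.metadata.ParquetMetadata;

class DateCorruptionCheckSketch {
  static ParquetReaderUtility.DateCorruptionStatus checkDates(ParquetMetadata footer,
                                                              boolean autoCorrectCorruptDates) {
    // Columns the query actually touches; a single (hypothetical) DATE column is assumed here.
    List<SchemaPath> columns =
        Collections.singletonList(SchemaPath.getSimplePath("order_date"));

    ParquetReaderUtility.DateCorruptionStatus status =
        ParquetReaderUtility.detectCorruptDates(footer, columns, autoCorrectCorruptDates);

    // The status tells the reader whether stored DATE values need the
    // CORRECT_CORRUPT_DATE_SHIFT removed while decoding (see autoCorrectCorruptedDate above).
    System.out.println("date corruption status: " + status);
    return status;
  }
}
```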
public static ParquetReaderUtility.DateCorruptionStatus checkForCorruptDateValuesInStatistics(org.apache.parquet.hadoop.metadata.ParquetMetadata footer, List<SchemaPath> columns, boolean autoCorrectCorruptDates)
Detects corrupt date values by looking at the min/max values in the metadata. It is intended for files that may carry the old date format, i.e. where ParquetRecordWriter.WRITER_VERSION_PROPERTY < DRILL_WRITER_VERSION_STD_DATE_FORMAT. This method only checks the first Row Group, because Drill has only ever written a single Row Group per file.
Parameters:
footer - parquet footer
columns - list of column schema paths
autoCorrectCorruptDates - user setting to allow enabling/disabling of auto-correction of corrupt dates. There are some rare cases (storing dates thousands of years into the future, with tools other than Drill writing the files) that would result in the date values being "corrected" into bad values.
public static TypeProtos.MajorType getType(org.apache.parquet.schema.PrimitiveType.PrimitiveTypeName type, org.apache.parquet.schema.OriginalType originalType, int precision, int scale)
Builds a major type using the given OriginalType originalType or PrimitiveTypeName type. For DECIMAL, the returned major type carries the given scale and precision.
Parameters:
type - parquet primitive type
originalType - parquet original type
scale - type scale (used for DECIMAL type)
precision - type precision (used for DECIMAL type)
public static TypeProtos.MinorType getMinorType(org.apache.parquet.schema.PrimitiveType.PrimitiveTypeName type, org.apache.parquet.schema.OriginalType originalType)
Builds a minor type using the given OriginalType originalType or PrimitiveTypeName type.
Parameters:
type - parquet primitive type
originalType - parquet original type
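A sketch of calling the two type-mapping helpers above for a DECIMAL column and a plain INT32 column. The precision and scale values are arbitrary examples, and passing null for a missing original type is an assumption.

```java
import org.apache.drill.common.types.TypeProtos;
import org.apache.drill.exec.store.parquet.ParquetReaderUtility;
import org.apache.parquet.schema.OriginalType;
import org.apache.parquet.schema.PrimitiveType.PrimitiveTypeName;

class TypeMappingSketch {
  static void mapTypes() {
    // DECIMAL stored as fixed-length bytes: the returned major type carries
    // the given precision (18) and scale (2).
    TypeProtos.MajorType decimalType = ParquetReaderUtility.getType(
        PrimitiveTypeName.FIXED_LEN_BYTE_ARRAY, OriginalType.DECIMAL, 18, 2);

    // Plain INT32 with no original type (assumed to be accepted as null here).
    TypeProtos.MinorType intType = ParquetReaderUtility.getMinorType(
        PrimitiveTypeName.INT32, null);

    System.out.println(decimalType.getMinorType() + " / " + intType);
  }
}
```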
public static boolean containsComplexColumn(org.apache.parquet.hadoop.metadata.ParquetMetadata footer, List<SchemaPath> columns)
Checks whether any of the columns in the given list is either nested or repeated.
Parameters:
footer - Parquet file schema
columns - list of query SchemaPath objects
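A small sketch of using containsComplexColumn to ask whether a projection touches nested or repeated data; the column names and the surrounding method are made up.

```java
import java.util.Arrays;
import java.util.List;
import org.apache.drill.common.expression.SchemaPath;
import org.apache.drill.exec.store.parquet.ParquetReaderUtility;
import org.apache.parquet.hadoop.metadata.ParquetMetadata;

class ComplexColumnCheckSketch {
  static boolean projectionHasComplexColumns(ParquetMetadata footer) {
    // Hypothetical projection: one scalar column and one that may be nested or repeated.
    List<SchemaPath> projected = Arrays.asList(
        SchemaPath.getSimplePath("id"),
        SchemaPath.getSimplePath("addresses"));
    return ParquetReaderUtility.containsComplexColumn(footer, projected);
  }
}
```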
public static List<TypeProtos.MajorType> getComplexTypes(List<org.apache.parquet.schema.OriginalType> originalTypes)
Converts a list of OriginalTypes to a list of TypeProtos.MajorTypes. NOTE: the current implementation only handles OriginalType.MAP and OriginalType.LIST, converting them to TypeProtos.MinorType.DICT and TypeProtos.MinorType.LIST respectively. Other original types are converted to null, because there is no definite correspondence between the two (nor a need for one: these types are only used to differentiate between Drill's MAP and DICT types, and arrays thereof, when constructing a TupleSchema).
Parameters:
originalTypes - list of Parquet's types
Returns:
list of null values or major types with minor type TypeProtos.MinorType.DICT or TypeProtos.MinorType.LIST
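A sketch of the conversion just described: MAP and LIST become DICT and LIST major types, anything else becomes null. The input list is illustrative.

```java
import java.util.Arrays;
import java.util.List;
import org.apache.drill.common.types.TypeProtos;
import org.apache.drill.exec.store.parquet.ParquetReaderUtility;
import org.apache.parquet.schema.OriginalType;

class ComplexTypesSketch {
  static void convert() {
    List<OriginalType> originals =
        Arrays.asList(OriginalType.MAP, OriginalType.LIST, OriginalType.UTF8);

    // Expected per the description above: [DICT major type, LIST major type, null].
    List<TypeProtos.MajorType> majors = ParquetReaderUtility.getComplexTypes(originals);
    System.out.println(majors);
  }
}
```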
public static boolean isLogicalListType(org.apache.parquet.schema.GroupType groupType)
Checks whether the group field approximately matches the pattern for Logical Lists:
<list-repetition> group <name> (LIST) { repeated group list { <element-repetition> <element-type> element; } }
(See https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#lists for more details.) Note that the standard field names 'list' and 'element' are intentionally not checked, because Hive lists use the names 'bag' and 'array_element' instead.
Parameters:
groupType - type which may have the LIST original type
public static boolean isLogicalMapType(org.apache.parquet.schema.GroupType groupType)
Checks whether the group field matches the pattern for the Logical Map type:
<map-repetition> group <name> (MAP) { repeated group key_value { required <key-type> key; <value-repetition> <value-type> value; } }
Note that the actual group names are not checked specifically.
Parameters:
groupType - parquet type which may be of MAP type
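A sketch that parses a small Parquet schema with parquet's MessageTypeParser and applies the two group checks above; the schema text is an example that follows the patterns shown.

```java
import org.apache.drill.exec.store.parquet.ParquetReaderUtility;
import org.apache.parquet.schema.GroupType;
import org.apache.parquet.schema.MessageType;
import org.apache.parquet.schema.MessageTypeParser;

class LogicalGroupCheckSketch {
  static void check() {
    // Example schema with one LIST group and one MAP group.
    MessageType schema = MessageTypeParser.parseMessageType(
        "message doc { "
      + "  optional group tags (LIST) { repeated group list { optional binary element (UTF8); } } "
      + "  optional group attrs (MAP) { repeated group key_value { required binary key (UTF8); optional int32 value; } } "
      + "}");

    GroupType tags = schema.getType("tags").asGroupType();
    GroupType attrs = schema.getType("attrs").asGroupType();

    System.out.println(ParquetReaderUtility.isLogicalListType(tags));  // expected: true
    System.out.println(ParquetReaderUtility.isLogicalMapType(attrs));  // expected: true
  }
}
```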
public static TypeProtos.DataMode getDataMode(org.apache.parquet.schema.Type.Repetition repetition)
Converts Parquet's Type.Repetition to Drill's TypeProtos.DataMode.
Parameters:
repetition - repetition to be converted
Copyright © 2022 The Apache Software Foundation. All rights reserved.