Package org.apache.parquet.hadoop
Class ParquetFileWriter
- java.lang.Object
-
- org.apache.parquet.hadoop.ParquetFileWriter
-
public class ParquetFileWriter extends Object
Internal implementation of the Parquet file writer as a block container
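A write proceeds through the start/startBlock/startColumn lifecycle documented below. The following sketch illustrates that sequence under stated assumptions: `file`, `schema`, `descriptor`, the page bytes, encodings and statistics are all prepared by the caller, and the numeric arguments are example values, not recommendations from this class.

```java
// Illustrative lifecycle sketch; all lowercase variables are assumed
// to be prepared elsewhere by the caller.
ParquetFileWriter writer = new ParquetFileWriter(
    file, schema, ParquetFileWriter.Mode.CREATE,
    128 * 1024 * 1024,  // rowGroupSize (example value)
    8 * 1024 * 1024,    // maxPaddingSize (example value)
    64,                 // columnIndexTruncateLength
    64,                 // statisticsTruncateLength
    true);              // pageWriteChecksumEnabled

writer.start();                      // writes the magic bytes
writer.startBlock(recordCount);      // one row group
writer.startColumn(descriptor, valueCount, codecName);
writer.writeDataPage(valueCount, uncompressedPageSize, compressedBytes,
    statistics, rowCount, rlEncoding, dlEncoding, valuesEncoding);
writer.endColumn();
writer.endBlock();
writer.end(Collections.emptyMap()); // writes the footer and closes the file
```

Each startX call must be balanced by its endX call before the next lifecycle step; multiple startBlock/endBlock pairs produce multiple row groups.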
-
-
Nested Class Summary
Nested Classes
static class ParquetFileWriter.Mode
-
Field Summary
Fields
static int CURRENT_VERSION
static String EF_MAGIC_STR
static byte[] EFMAGIC
static byte[] MAGIC
static String MAGIC_STR
protected PositionOutputStream out
static String PARQUET_COMMON_METADATA_FILE
static String PARQUET_METADATA_FILE
-
Constructor Summary
Constructors
ParquetFileWriter(org.apache.hadoop.conf.Configuration configuration, MessageType schema, org.apache.hadoop.fs.Path file)
Deprecated. Will be removed in 2.0.0.
ParquetFileWriter(org.apache.hadoop.conf.Configuration configuration, MessageType schema, org.apache.hadoop.fs.Path file, ParquetFileWriter.Mode mode)
Deprecated. Will be removed in 2.0.0.
ParquetFileWriter(org.apache.hadoop.conf.Configuration configuration, MessageType schema, org.apache.hadoop.fs.Path file, ParquetFileWriter.Mode mode, long rowGroupSize, int maxPaddingSize)
Deprecated. Will be removed in 2.0.0.
ParquetFileWriter(OutputFile file, MessageType schema, ParquetFileWriter.Mode mode, long rowGroupSize, int maxPaddingSize)
Deprecated. Will be removed in 2.0.0.
ParquetFileWriter(OutputFile file, MessageType schema, ParquetFileWriter.Mode mode, long rowGroupSize, int maxPaddingSize, int columnIndexTruncateLength, int statisticsTruncateLength, boolean pageWriteChecksumEnabled)
ParquetFileWriter(OutputFile file, MessageType schema, ParquetFileWriter.Mode mode, long rowGroupSize, int maxPaddingSize, int columnIndexTruncateLength, int statisticsTruncateLength, boolean pageWriteChecksumEnabled, FileEncryptionProperties encryptionProperties)
-
Method Summary
All Methods | Static Methods | Instance Methods | Concrete Methods | Deprecated Methods

void appendColumnChunk(ColumnDescriptor descriptor, SeekableInputStream from, ColumnChunkMetaData chunk, BloomFilter bloomFilter, ColumnIndex columnIndex, OffsetIndex offsetIndex)
void appendFile(org.apache.hadoop.conf.Configuration conf, org.apache.hadoop.fs.Path file)
Deprecated. Will be removed in 2.0.0; use appendFile(InputFile) instead.
void appendFile(InputFile file)
void appendRowGroup(org.apache.hadoop.fs.FSDataInputStream from, BlockMetaData rowGroup, boolean dropColumns)
Deprecated. Will be removed in 2.0.0; use appendRowGroup(SeekableInputStream, BlockMetaData, boolean) instead.
void appendRowGroup(SeekableInputStream from, BlockMetaData rowGroup, boolean dropColumns)
void appendRowGroups(org.apache.hadoop.fs.FSDataInputStream file, List<BlockMetaData> rowGroups, boolean dropColumns)
Deprecated. Will be removed in 2.0.0; use appendRowGroups(SeekableInputStream, List, boolean) instead.
void appendRowGroups(SeekableInputStream file, List<BlockMetaData> rowGroups, boolean dropColumns)
void end(Map<String,String> extraMetaData)
Ends a file once all blocks have been written.
void endBlock()
Ends a block once all column chunks have been written.
void endColumn()
Ends a column (once all repetition levels, definition levels and data have been written).
ParquetMetadata getFooter()
long getNextRowGroupSize()
long getPos()
static ParquetMetadata mergeMetadataFiles(List<org.apache.hadoop.fs.Path> files, org.apache.hadoop.conf.Configuration conf)
Deprecated. Metadata files are not recommended and will be removed in 2.0.0.
static ParquetMetadata mergeMetadataFiles(List<org.apache.hadoop.fs.Path> files, org.apache.hadoop.conf.Configuration conf, KeyValueMetadataMergeStrategy keyValueMetadataMergeStrategy)
Deprecated. Metadata files are not recommended and will be removed in 2.0.0.
void start()
Starts the file.
void startBlock(long recordCount)
Starts a block.
void startColumn(ColumnDescriptor descriptor, long valueCount, org.apache.parquet.hadoop.metadata.CompressionCodecName compressionCodecName)
Starts a column inside a block.
void writeDataPage(int valueCount, int uncompressedPageSize, org.apache.parquet.bytes.BytesInput bytes, Encoding rlEncoding, Encoding dlEncoding, Encoding valuesEncoding)
Deprecated.
void writeDataPage(int valueCount, int uncompressedPageSize, org.apache.parquet.bytes.BytesInput bytes, Statistics statistics, long rowCount, Encoding rlEncoding, Encoding dlEncoding, Encoding valuesEncoding)
Writes a single page.
void writeDataPage(int valueCount, int uncompressedPageSize, org.apache.parquet.bytes.BytesInput bytes, Statistics statistics, Encoding rlEncoding, Encoding dlEncoding, Encoding valuesEncoding)
Deprecated. This method does not support writing column indexes; use writeDataPage(int, int, BytesInput, Statistics, long, Encoding, Encoding, Encoding) instead.
void writeDataPageV2(int rowCount, int nullCount, int valueCount, org.apache.parquet.bytes.BytesInput repetitionLevels, org.apache.parquet.bytes.BytesInput definitionLevels, Encoding dataEncoding, org.apache.parquet.bytes.BytesInput compressedData, int uncompressedDataSize, Statistics<?> statistics)
Writes a single v2 data page.
void writeDictionaryPage(DictionaryPage dictionaryPage)
Writes a dictionary page.
void writeDictionaryPage(DictionaryPage dictionaryPage, BlockCipher.Encryptor headerBlockEncryptor, byte[] AAD)
static void writeMergedMetadataFile(List<org.apache.hadoop.fs.Path> files, org.apache.hadoop.fs.Path outputPath, org.apache.hadoop.conf.Configuration conf)
Deprecated. Metadata files are not recommended and will be removed in 2.0.0.
static void writeMetadataFile(org.apache.hadoop.conf.Configuration configuration, org.apache.hadoop.fs.Path outputPath, List<Footer> footers)
Deprecated. Metadata files are not recommended and will be removed in 2.0.0.
static void writeMetadataFile(org.apache.hadoop.conf.Configuration configuration, org.apache.hadoop.fs.Path outputPath, List<Footer> footers, ParquetOutputFormat.JobSummaryLevel level)
Deprecated. Metadata files are not recommended and will be removed in 2.0.0.
-
-
-
Field Detail
-
PARQUET_METADATA_FILE
public static final String PARQUET_METADATA_FILE
- See Also:
- Constant Field Values
-
MAGIC_STR
public static final String MAGIC_STR
- See Also:
- Constant Field Values
-
MAGIC
public static final byte[] MAGIC
-
EF_MAGIC_STR
public static final String EF_MAGIC_STR
- See Also:
- Constant Field Values
-
EFMAGIC
public static final byte[] EFMAGIC
-
PARQUET_COMMON_METADATA_FILE
public static final String PARQUET_COMMON_METADATA_FILE
- See Also:
- Constant Field Values
-
CURRENT_VERSION
public static final int CURRENT_VERSION
- See Also:
- Constant Field Values
-
out
protected final PositionOutputStream out
-
-
Constructor Detail
-
ParquetFileWriter
@Deprecated public ParquetFileWriter(org.apache.hadoop.conf.Configuration configuration, MessageType schema, org.apache.hadoop.fs.Path file) throws IOException
Deprecated. Will be removed in 2.0.0.
- Parameters:
configuration - Hadoop configuration
schema - the schema of the data
file - the file to write to
- Throws:
IOException - if the file can not be created
-
ParquetFileWriter
@Deprecated public ParquetFileWriter(org.apache.hadoop.conf.Configuration configuration, MessageType schema, org.apache.hadoop.fs.Path file, ParquetFileWriter.Mode mode) throws IOException
Deprecated. Will be removed in 2.0.0.
- Parameters:
configuration - Hadoop configuration
schema - the schema of the data
file - the file to write to
mode - file creation mode
- Throws:
IOException - if the file can not be created
-
ParquetFileWriter
@Deprecated public ParquetFileWriter(org.apache.hadoop.conf.Configuration configuration, MessageType schema, org.apache.hadoop.fs.Path file, ParquetFileWriter.Mode mode, long rowGroupSize, int maxPaddingSize) throws IOException
Deprecated. Will be removed in 2.0.0.
- Parameters:
configuration - Hadoop configuration
schema - the schema of the data
file - the file to write to
mode - file creation mode
rowGroupSize - the row group size
maxPaddingSize - the maximum padding
- Throws:
IOException - if the file can not be created
-
ParquetFileWriter
@Deprecated public ParquetFileWriter(OutputFile file, MessageType schema, ParquetFileWriter.Mode mode, long rowGroupSize, int maxPaddingSize) throws IOException
Deprecated. Will be removed in 2.0.0.
- Parameters:
file - OutputFile to create or overwrite
schema - the schema of the data
mode - file creation mode
rowGroupSize - the row group size
maxPaddingSize - the maximum padding
- Throws:
IOException - if the file can not be created
-
ParquetFileWriter
public ParquetFileWriter(OutputFile file, MessageType schema, ParquetFileWriter.Mode mode, long rowGroupSize, int maxPaddingSize, int columnIndexTruncateLength, int statisticsTruncateLength, boolean pageWriteChecksumEnabled) throws IOException
- Parameters:
file - OutputFile to create or overwrite
schema - the schema of the data
mode - file creation mode
rowGroupSize - the row group size
maxPaddingSize - the maximum padding
columnIndexTruncateLength - the length to which min/max values in column indexes are truncated
statisticsTruncateLength - the length to which min/max values in row-group statistics are truncated
pageWriteChecksumEnabled - whether to write out page-level checksums
- Throws:
IOException - if the file can not be created
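As a sketch, an OutputFile for this constructor can be obtained by wrapping a Hadoop Path with HadoopOutputFile (in org.apache.parquet.hadoop.util); the path and size arguments below are illustrative assumptions, not defaults taken from this class.

```java
// Hedged construction sketch; `schema` is assumed to be built elsewhere.
Configuration conf = new Configuration();
OutputFile out = HadoopOutputFile.fromPath(
    new org.apache.hadoop.fs.Path("/tmp/example.parquet"), conf);
ParquetFileWriter writer = new ParquetFileWriter(
    out, schema, ParquetFileWriter.Mode.OVERWRITE,
    128 * 1024 * 1024,  // rowGroupSize (example value)
    8 * 1024 * 1024,    // maxPaddingSize (example value)
    64, 64,             // column index / statistics truncate lengths
    true);              // pageWriteChecksumEnabled
```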
-
ParquetFileWriter
public ParquetFileWriter(OutputFile file, MessageType schema, ParquetFileWriter.Mode mode, long rowGroupSize, int maxPaddingSize, int columnIndexTruncateLength, int statisticsTruncateLength, boolean pageWriteChecksumEnabled, FileEncryptionProperties encryptionProperties) throws IOException
- Throws:
IOException
-
-
Method Detail
-
start
public void start() throws IOException
Starts the file.
- Throws:
IOException - if there is an error while writing
-
startBlock
public void startBlock(long recordCount) throws IOException
Starts a block.
- Parameters:
recordCount - the record count in this block
- Throws:
IOException - if there is an error while writing
-
startColumn
public void startColumn(ColumnDescriptor descriptor, long valueCount, org.apache.parquet.hadoop.metadata.CompressionCodecName compressionCodecName) throws IOException
Starts a column inside a block.
- Parameters:
descriptor - the column descriptor
valueCount - the value count in this column
compressionCodecName - a compression codec name
- Throws:
IOException - if there is an error while writing
-
writeDictionaryPage
public void writeDictionaryPage(DictionaryPage dictionaryPage) throws IOException
Writes a dictionary page.
- Parameters:
dictionaryPage - the dictionary page
- Throws:
IOException - if there is an error while writing
-
writeDictionaryPage
public void writeDictionaryPage(DictionaryPage dictionaryPage, BlockCipher.Encryptor headerBlockEncryptor, byte[] AAD) throws IOException
- Throws:
IOException
-
writeDataPage
@Deprecated public void writeDataPage(int valueCount, int uncompressedPageSize, org.apache.parquet.bytes.BytesInput bytes, Encoding rlEncoding, Encoding dlEncoding, Encoding valuesEncoding) throws IOException
Deprecated. Writes a single page.
- Parameters:
valueCount - count of values
uncompressedPageSize - the size of the data once uncompressed
bytes - the compressed data for the page, without the header
rlEncoding - encoding of the repetition level
dlEncoding - encoding of the definition level
valuesEncoding - encoding of values
- Throws:
IOException - if there is an error while writing
-
writeDataPage
@Deprecated public void writeDataPage(int valueCount, int uncompressedPageSize, org.apache.parquet.bytes.BytesInput bytes, Statistics statistics, Encoding rlEncoding, Encoding dlEncoding, Encoding valuesEncoding) throws IOException
Deprecated. This method does not support writing column indexes; use writeDataPage(int, int, BytesInput, Statistics, long, Encoding, Encoding, Encoding) instead.
Writes a single page.
- Parameters:
valueCount - count of values
uncompressedPageSize - the size of the data once uncompressed
bytes - the compressed data for the page, without the header
statistics - statistics for the page
rlEncoding - encoding of the repetition level
dlEncoding - encoding of the definition level
valuesEncoding - encoding of values
- Throws:
IOException - if there is an error while writing
-
writeDataPage
public void writeDataPage(int valueCount, int uncompressedPageSize, org.apache.parquet.bytes.BytesInput bytes, Statistics statistics, long rowCount, Encoding rlEncoding, Encoding dlEncoding, Encoding valuesEncoding) throws IOException
Writes a single page.
- Parameters:
valueCount - count of values
uncompressedPageSize - the size of the data once uncompressed
bytes - the compressed data for the page, without the header
statistics - the statistics of the page
rowCount - the number of rows in the page
rlEncoding - encoding of the repetition level
dlEncoding - encoding of the definition level
valuesEncoding - encoding of values
- Throws:
IOException - if any I/O error occurs during writing the file
-
writeDataPageV2
public void writeDataPageV2(int rowCount, int nullCount, int valueCount, org.apache.parquet.bytes.BytesInput repetitionLevels, org.apache.parquet.bytes.BytesInput definitionLevels, Encoding dataEncoding, org.apache.parquet.bytes.BytesInput compressedData, int uncompressedDataSize, Statistics<?> statistics) throws IOException
Writes a single v2 data page.
- Parameters:
rowCount - count of rows
nullCount - count of nulls
valueCount - count of values
repetitionLevels - repetition level bytes
definitionLevels - definition level bytes
dataEncoding - encoding for data
compressedData - compressed data bytes
uncompressedDataSize - the size of uncompressed data
statistics - the statistics of the page
- Throws:
IOException - if any I/O error occurs during writing the file
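Note that in the v2 page format the repetition and definition level bytes are stored uncompressed; only the values section is compressed. A call therefore groups its arguments roughly as follows (all variables are assumed to be prepared by the caller; this is a sketch, not a complete example):

```java
// Hedged sketch of the argument grouping for a v2 data page.
writer.writeDataPageV2(
    rowCount, nullCount, valueCount,
    repetitionLevels, definitionLevels,   // BytesInput, stored uncompressed
    dataEncoding,
    compressedData, uncompressedDataSize, // values section only
    statistics);
```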
-
endColumn
public void endColumn() throws IOException
Ends a column (once all repetition levels, definition levels and data have been written).
- Throws:
IOException - if there is an error while writing
-
endBlock
public void endBlock() throws IOException
Ends a block once all column chunks have been written.
- Throws:
IOException - if there is an error while writing
-
appendFile
@Deprecated public void appendFile(org.apache.hadoop.conf.Configuration conf, org.apache.hadoop.fs.Path file) throws IOException
Deprecated. Will be removed in 2.0.0; use appendFile(InputFile) instead.
- Parameters:
conf - a configuration
file - a file path whose contents will be appended to this file
- Throws:
IOException - if there is an error while reading or writing
-
appendFile
public void appendFile(InputFile file) throws IOException
- Throws:
IOException
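Migrating from the deprecated Configuration/Path overload is a one-line change in most callers; HadoopInputFile (in org.apache.parquet.hadoop.util) is one way to obtain an InputFile, shown here as an assumption:

```java
// Before (deprecated): writer.appendFile(conf, sourcePath);
// After, wrapping the Hadoop Path as an InputFile:
writer.appendFile(HadoopInputFile.fromPath(sourcePath, conf));
```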
-
appendRowGroups
@Deprecated public void appendRowGroups(org.apache.hadoop.fs.FSDataInputStream file, List<BlockMetaData> rowGroups, boolean dropColumns) throws IOException
Deprecated. Will be removed in 2.0.0; use appendRowGroups(SeekableInputStream, List, boolean) instead.
- Parameters:
file - a file stream to read from
rowGroups - row groups to copy
dropColumns - whether to drop columns from the source file that are not in this file's schema
- Throws:
IOException - if there is an error while reading or writing
-
appendRowGroups
public void appendRowGroups(SeekableInputStream file, List<BlockMetaData> rowGroups, boolean dropColumns) throws IOException
- Throws:
IOException
-
appendRowGroup
@Deprecated public void appendRowGroup(org.apache.hadoop.fs.FSDataInputStream from, BlockMetaData rowGroup, boolean dropColumns) throws IOException
Deprecated. Will be removed in 2.0.0; use appendRowGroup(SeekableInputStream, BlockMetaData, boolean) instead.
- Parameters:
from - a file stream to read from
rowGroup - row group to copy
dropColumns - whether to drop columns from the source file that are not in this file's schema
- Throws:
IOException - if there is an error while reading or writing
-
appendRowGroup
public void appendRowGroup(SeekableInputStream from, BlockMetaData rowGroup, boolean dropColumns) throws IOException
- Throws:
IOException
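A merge-style copy over all row groups of a source file might look like the following sketch; it assumes the source shares this writer's schema and that ParquetFileReader and InputFile are available as elsewhere in parquet-hadoop:

```java
// Hedged sketch: copy every row group of `inputFile` into this writer.
try (ParquetFileReader reader = ParquetFileReader.open(inputFile);
     SeekableInputStream from = inputFile.newStream()) {
  for (BlockMetaData block : reader.getFooter().getBlocks()) {
    writer.appendRowGroup(from, block, false); // keep all columns
  }
}
```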
-
appendColumnChunk
public void appendColumnChunk(ColumnDescriptor descriptor, SeekableInputStream from, ColumnChunkMetaData chunk, BloomFilter bloomFilter, ColumnIndex columnIndex, OffsetIndex offsetIndex) throws IOException
- Parameters:
descriptor - the descriptor for the target column
from - a file stream to read from
chunk - the column chunk to be copied
bloomFilter - the bloom filter for this chunk
columnIndex - the column index for this chunk
offsetIndex - the offset index for this chunk
- Throws:
IOException
-
end
public void end(Map<String,String> extraMetaData) throws IOException
Ends a file once all blocks have been written; closes the file.
- Parameters:
extraMetaData - the extra metadata to write in the footer
- Throws:
IOException - if there is an error while writing
-
getFooter
public ParquetMetadata getFooter()
-
mergeMetadataFiles
@Deprecated public static ParquetMetadata mergeMetadataFiles(List<org.apache.hadoop.fs.Path> files, org.apache.hadoop.conf.Configuration conf) throws IOException
Deprecated. Metadata files are not recommended and will be removed in 2.0.0.
Given a list of metadata files, merges them into a single ParquetMetadata. Requires that the schemas be compatible and the extraMetadata be exactly equal.
- Parameters:
files - a list of files to merge metadata from
conf - a configuration
- Returns:
- merged parquet metadata for the files
- Throws:
IOException - if there is an error while writing
-
mergeMetadataFiles
@Deprecated public static ParquetMetadata mergeMetadataFiles(List<org.apache.hadoop.fs.Path> files, org.apache.hadoop.conf.Configuration conf, KeyValueMetadataMergeStrategy keyValueMetadataMergeStrategy) throws IOException
Deprecated. Metadata files are not recommended and will be removed in 2.0.0.
Given a list of metadata files, merges them into a single ParquetMetadata. Requires that the schemas be compatible and the extraMetadata be exactly equal.
- Parameters:
files - a list of files to merge metadata from
conf - a configuration
keyValueMetadataMergeStrategy - strategy to merge values for the same key, if there are multiple
- Returns:
- merged parquet metadata for the files
- Throws:
IOException - if there is an error while writing
-
writeMergedMetadataFile
@Deprecated public static void writeMergedMetadataFile(List<org.apache.hadoop.fs.Path> files, org.apache.hadoop.fs.Path outputPath, org.apache.hadoop.conf.Configuration conf) throws IOException
Deprecated. Metadata files are not recommended and will be removed in 2.0.0.
Given a list of metadata files, merges them into a single metadata file. Requires that the schemas be compatible and the extraMetaData be exactly equal. This is useful when merging two directories of parquet files into a single directory, as long as both directories were written with compatible schemas and equal extraMetaData.
- Parameters:
files - a list of files to merge metadata from
outputPath - path to write merged metadata to
conf - a configuration
- Throws:
IOException - if there is an error while reading or writing
-
writeMetadataFile
@Deprecated public static void writeMetadataFile(org.apache.hadoop.conf.Configuration configuration, org.apache.hadoop.fs.Path outputPath, List<Footer> footers) throws IOException
Deprecated. Metadata files are not recommended and will be removed in 2.0.0.
Writes a _metadata and a _common_metadata file.
- Parameters:
configuration - the configuration to use to get the FileSystem
outputPath - the directory to write the _metadata file to
footers - the list of footers to merge
- Throws:
IOException - if there is an error while writing
-
writeMetadataFile
@Deprecated public static void writeMetadataFile(org.apache.hadoop.conf.Configuration configuration, org.apache.hadoop.fs.Path outputPath, List<Footer> footers, ParquetOutputFormat.JobSummaryLevel level) throws IOException
Deprecated. Metadata files are not recommended and will be removed in 2.0.0.
Writes a _common_metadata file, and optionally a _metadata file, depending on the ParquetOutputFormat.JobSummaryLevel provided.
- Parameters:
configuration - the configuration to use to get the FileSystem
outputPath - the directory to write the _metadata file to
footers - the list of footers to merge
level - level of summary to write
- Throws:
IOException - if there is an error while writing
-
getPos
public long getPos() throws IOException
- Returns:
- the current position in the underlying file
- Throws:
IOException
- if there is an error while getting the current stream's position
-
getNextRowGroupSize
public long getNextRowGroupSize() throws IOException
- Throws:
IOException
-
-