:: DeveloperApi :: Operator that acts as a sink for queries on RDDs and can be used to store the output inside a directory of Parquet files. This operator is similar to Hive's INSERT INTO TABLE operation in the sense that one can choose to either overwrite or append to a directory. Note that consecutive insertions to the same table must have compatible (source) schemas.
WARNING: EXPERIMENTAL! InsertIntoParquetTable with overwrite=false may cause data corruption if multiple users try to append to the same table simultaneously. Inserting into a table that was previously generated by other means (e.g., by creating an HDFS directory and importing Parquet files generated by other tools) may cause unpredictable behaviour and therefore results in a RuntimeException (this is detected only via a filename pattern, so not all such cases are caught).
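A minimal sketch of the overwrite and append behaviour this operator implements, assuming a SQLContext named sqlContext and hypothetical paths and table names (staging_events is a made-up source table); the SQL statements below are the usual way such plans are produced, rather than a direct use of the operator:

```scala
// Register an existing Parquet directory as a table (hypothetical path).
val events = sqlContext.parquetFile("hdfs://namenode/warehouse/events")
events.registerTempTable("events")

// overwrite = true: replace the current contents of the directory.
sqlContext.sql("INSERT OVERWRITE TABLE events SELECT * FROM staging_events")

// overwrite = false: append new files to the directory. As the warning above
// notes, concurrent appends by multiple users may corrupt the table.
sqlContext.sql("INSERT INTO TABLE events SELECT * FROM staging_events")
```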
An alternative to ParquetRelation that plugs in using the data sources API. This class is currently not intended as a full replacement of the Parquet support in Spark SQL, though it is likely that it will eventually subsume the existing physical plan implementation.
Compared with the current implementation, this class has the following notable differences:
Partitioning: Partitions are auto-discovered and must be in the form of directories key=value/ located at path (see the layout sketch after this list). Currently only a single partitioning column is supported and it must be an integer. This class supports both fully self-describing data, which contains the partition key, and data where the partition key is only present in the folder structure; the presence of the partitioning key in the data is also auto-detected. The null partition is not yet supported.
Metadata: The metadata is automatically discovered by reading the first Parquet file present. There is currently no support for working with files that have different schemas. Additionally, when Parquet metadata caching is turned on, the FileStatus objects for all data will be cached to improve the speed of interactive querying. When data is added to a table, the table must be dropped and recreated to pick up any changes.
Statistics: Statistics for the size of the table are automatically populated during metadata discovery.
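A sketch of the partition layout and discovery behaviour described above, using a hypothetical location /data/events and a made-up integer partition column named part:

```scala
// Hypothetical layout under the table's path: a single integer partition
// column encoded as key=value/ directories, discovered automatically:
//
//   /data/events/part=1/part-r-00001.parquet
//   /data/events/part=2/part-r-00001.parquet

// Registering the location through the data sources API (assumes a SQLContext
// named sqlContext).
sqlContext.sql(
  """CREATE TEMPORARY TABLE events
    |USING org.apache.spark.sql.parquet
    |OPTIONS (path "/data/events")""".stripMargin)

// The partition column is usable whether the key is stored in the files or
// only in the folder structure; its presence in the data is auto-detected.
sqlContext.sql("SELECT count(*) FROM events WHERE part = 1").collect()
```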
:: DeveloperApi :: Parquet table scan operator. Imports the file that backs the given org.apache.spark.sql.parquet.ParquetRelation as an RDD[Row].
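A minimal sketch of where this operator shows up, assuming a SQLContext named sqlContext and a hypothetical Parquet file; the scan node can be observed in the physical plan of a query over the relation:

```scala
// Load a Parquet file and register it (hypothetical path and table name).
val people = sqlContext.parquetFile("/data/people.parquet")
people.registerTempTable("people")

val adults = sqlContext.sql("SELECT name FROM people WHERE age > 21")

// The executed plan contains a ParquetTableScan node, which imports the
// backing files as an RDD[Row].
println(adults.queryExecution.executedPlan)
```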
Allows creation of parquet-based tables using the syntax CREATE TEMPORARY TABLE ... USING org.apache.spark.sql.parquet. Currently the only option required is path, which should be the location of a collection of, optionally partitioned, parquet files.
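A minimal sketch of the syntax described above, with a hypothetical path and table name; path is passed through the OPTIONS clause of the data sources API:

```scala
// Create a temporary table backed by a directory of Parquet files
// (assumes a SQLContext named sqlContext).
sqlContext.sql(
  """CREATE TEMPORARY TABLE parquet_people
    |USING org.apache.spark.sql.parquet
    |OPTIONS (path "/data/people")""".stripMargin)

// The temporary table can then be queried like any other table.
sqlContext.sql("SELECT count(*) FROM parquet_people").collect()
```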