package org.apache.spark.sql.parquet

Type Members

  1. class DefaultSource extends RelationProvider

    Allows creation of Parquet-based tables using the syntax CREATE TEMPORARY TABLE ... USING org.apache.spark.sql.parquet. Currently the only required option is path, which should point to a collection of (optionally partitioned) Parquet files.
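
    A minimal usage sketch, assuming a hypothetical Parquet directory /tmp/parquet-data and table name parquetTable (neither appears in the API above):

      import org.apache.spark.SparkContext
      import org.apache.spark.sql.SQLContext

      val sc = new SparkContext("local", "parquet-datasource-example")
      val sqlContext = new SQLContext(sc)

      // Register a temporary table backed by a directory of Parquet files.
      sqlContext.sql(
        """CREATE TEMPORARY TABLE parquetTable
          |USING org.apache.spark.sql.parquet
          |OPTIONS (path '/tmp/parquet-data')
        """.stripMargin)

      // Query it like any other table.
      sqlContext.sql("SELECT * FROM parquetTable LIMIT 10").collect().foreach(println)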

  2. case class InsertIntoParquetTable(relation: ParquetRelation, child: SparkPlan, overwrite: Boolean = false) extends SparkPlan with UnaryNode with SparkHadoopMapReduceUtil with Product with Serializable

    :: DeveloperApi :: Operator that acts as a sink for queries on RDDs and can be used to store the output inside a directory of Parquet files. This operator is similar to Hive's INSERT INTO TABLE operation in the sense that one can choose to either overwrite or append to a directory. Note that consecutive insertions to the same table must have compatible (source) schemas.

    WARNING: EXPERIMENTAL! InsertIntoParquetTable with overwrite=false may cause data corruption if multiple users try to append to the same table simultaneously. Inserting into a table that was previously generated by other means (e.g., by creating an HDFS directory and importing Parquet files generated by other tools) may cause unpredictable behaviour and therefore results in a RuntimeException (this condition is detected only via a filename pattern, so not all cases will be caught).

    Annotations
    @DeveloperApi()
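
    A hedged sketch of the append path this operator serves, using the public SchemaRDD API; the path /tmp/parquet-table, the table name destination, and the Record case class are hypothetical:

      import org.apache.spark.SparkContext
      import org.apache.spark.sql.SQLContext

      case class Record(key: Int, value: String)

      val sc = new SparkContext("local", "insert-into-parquet-example")
      val sqlContext = new SQLContext(sc)
      import sqlContext.createSchemaRDD  // implicit RDD[Product] -> SchemaRDD conversion

      // Create a Parquet table from an initial batch of rows ...
      sc.parallelize(1 to 10).map(i => Record(i, s"val_$i"))
        .saveAsParquetFile("/tmp/parquet-table")
      sqlContext.parquetFile("/tmp/parquet-table").registerTempTable("destination")

      // ... then append a second batch with a compatible schema. This insert is
      // planned as InsertIntoParquetTable with overwrite = false (append).
      sc.parallelize(11 to 20).map(i => Record(i, s"val_$i"))
        .insertInto("destination")
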
  3. case class ParquetRelation2(path: String)(sqlContext: SQLContext) extends CatalystScan with Logging with Product with Serializable

    An alternative to ParquetRelation that plugs in using the data sources API. This class is not currently intended as a full replacement for the Parquet support in Spark SQL, though it is likely that it will eventually subsume the existing physical plan implementation.

    Compared with the current implementation, this class has the following notable differences:

    Partitioning: Partitions are auto-discovered and must be in the form of key=value/ directories located at path (see the layout sketch after this entry). Currently only a single partitioning column is supported, and it must be an integer. This class supports both fully self-describing data, in which the partition key is present in the files, and data where the partition key appears only in the folder structure; the presence of the partitioning key in the data is auto-detected. The null partition is not yet supported.

    Metadata: The metadata is automatically discovered by reading the first Parquet file present. There is currently no support for working with files that have different schemas. Additionally, when Parquet metadata caching is turned on, the FileStatus objects for all data will be cached to improve the speed of interactive querying. When data is added, the table must be dropped and recreated to pick up the changes.

    Statistics: Statistics for the size of the table are automatically populated during metadata discovery.

    Annotations
    @DeveloperApi()
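
    A hedged sketch of the partition-discovery layout described above; the directory names, the table name partitionedTable, and the integer partitioning column key are all hypothetical:

      // Expected on-disk layout (one key=value/ directory per partition):
      //   /tmp/partitioned-parquet/key=1/part-00000.parquet
      //   /tmp/partitioned-parquet/key=2/part-00000.parquet

      import org.apache.spark.SparkContext
      import org.apache.spark.sql.SQLContext

      val sc = new SparkContext("local", "parquet-relation2-example")
      val sqlContext = new SQLContext(sc)

      sqlContext.sql(
        """CREATE TEMPORARY TABLE partitionedTable
          |USING org.apache.spark.sql.parquet
          |OPTIONS (path '/tmp/partitioned-parquet')
        """.stripMargin)

      // The partition column is discovered from the directory names, so it can
      // be used in queries even if it is not stored inside the Parquet files.
      sqlContext.sql("SELECT * FROM partitionedTable WHERE key = 1").collect()
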
  4. case class ParquetTableScan(attributes: Seq[Attribute], relation: ParquetRelation, columnPruningPred: Seq[Expression]) extends SparkPlan with LeafNode with Product with Serializable

    :: DeveloperApi :: Parquet table scan operator. Imports the file that backs the given org.apache.spark.sql.parquet.ParquetRelation as an RDD[Row].
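
    A hedged sketch of a query whose physical plan contains this operator; the path /tmp/parquet-table and the column names key and value are hypothetical, and the Parquet data is assumed to already exist:

      import org.apache.spark.SparkContext
      import org.apache.spark.sql.SQLContext

      val sc = new SparkContext("local", "parquet-scan-example")
      val sqlContext = new SQLContext(sc)

      sqlContext.parquetFile("/tmp/parquet-table").registerTempTable("records")

      // Only the referenced columns are read from the Parquet files; filters may
      // additionally be pushed into the scan when Parquet filter pushdown is enabled.
      val pruned = sqlContext.sql("SELECT value FROM records WHERE key > 5")
      println(pruned.queryExecution.executedPlan)  // plan includes a ParquetTableScan node
      pruned.collect()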
