Class SqlTransform

  • All Implemented Interfaces:
    java.io.Serializable, org.apache.beam.sdk.transforms.display.HasDisplayData

    public abstract class SqlTransform
    extends org.apache.beam.sdk.transforms.PTransform<org.apache.beam.sdk.values.PInput,​org.apache.beam.sdk.values.PCollection<org.apache.beam.sdk.values.Row>>
    SqlTransform is the DSL interface of Beam SQL. It translates a SQL query into a PTransform, so developers can use standard SQL queries in a Beam pipeline.

    Beam SQL DSL usage:

    A typical pipeline with Beam SQL DSL is:

    
     PipelineOptions options = PipelineOptionsFactory.create();
     Pipeline p = Pipeline.create(options);
    
     //create table from TextIO;
     PCollection<Row> inputTableA = p.apply(TextIO.read().from("/my/input/patha")).apply(...);
     PCollection<Row> inputTableB = p.apply(TextIO.read().from("/my/input/pathb")).apply(...);
    
     //run a simple query, and register the output as a table in BeamSql;
     String sql1 = "select MY_FUNC(c1), c2 from PCOLLECTION";
     PCollection<Row> outputTableA = inputTableA.apply(
        SqlTransform
            .query(sql1)
            .registerUdf("MY_FUNC", MY_FUNC.class));
    
     //run a JOIN with one table from TextIO, and one table from another query
     PCollection<Row> outputTableB =
         PCollectionTuple
         .of(new TupleTag<>("TABLE_O_A"), outputTableA)
         .and(new TupleTag<>("TABLE_B"), inputTableB)
             .apply(SqlTransform.query("select * from TABLE_O_A JOIN TABLE_B where ..."));
    
     //output the final result with TextIO
     outputTableB.apply(...).apply(TextIO.write().to("/my/output/path"));
    
     p.run().waitUntilFinish();
     

    A typical pipeline with Beam SQL DDL and DSL is:

    
     PipelineOptions options = PipelineOptionsFactory.create();
     Pipeline p = Pipeline.create(options);
    
     String sql1 = "INSERT INTO pubsub_sink SELECT * FROM pubsub_source";
    
     String ddlSource = "CREATE EXTERNAL TABLE pubsub_source(" +
         "attributes MAP<VARCHAR, VARCHAR>, payload ROW<name VARCHAR, size INTEGER>)" +
         "TYPE pubsub LOCATION 'projects/myproject/topics/topic1'";
    
     String ddlSink = "CREATE EXTERNAL TABLE pubsub_sink(" +
         "attributes MAP<VARCHAR, VARCHAR>, payload ROW<name VARCHAR, size INTEGER>)" +
         "TYPE pubsub LOCATION 'projects/myproject/topics/mytopic'";
    
     p.apply(SqlTransform.query(sql1).withDdlString(ddlSource).withDdlString(ddlSink));
    
     p.run().waitUntilFinish();
     
    See Also:
    Serialized Form
    • Field Summary

      Fields 
      Modifier and Type Field Description
      static java.lang.String PCOLLECTION_NAME  
      • Fields inherited from class org.apache.beam.sdk.transforms.PTransform

        annotations, displayData, name, resourceHints
    • Constructor Summary

      Constructors 
      Constructor Description
      SqlTransform()  
    • Field Detail

      • PCOLLECTION_NAME

        public static final java.lang.String PCOLLECTION_NAME
        See Also:
        Constant Field Values
    • Constructor Detail

      • SqlTransform

        public SqlTransform()
    • Method Detail

      • expand

        public org.apache.beam.sdk.values.PCollection<org.apache.beam.sdk.values.Row> expand​(org.apache.beam.sdk.values.PInput input)
        Specified by:
        expand in class org.apache.beam.sdk.transforms.PTransform<org.apache.beam.sdk.values.PInput,​org.apache.beam.sdk.values.PCollection<org.apache.beam.sdk.values.Row>>
      • query

        public static SqlTransform query​(java.lang.String queryString)
        Returns a SqlTransform representing an equivalent execution plan.

        The SqlTransform can be applied to a PCollection or PCollectionTuple representing all the input tables.

        The PTransform outputs a PCollection of Row.

        If the PTransform is applied to a PCollection, the PCollection is registered as a table named PCOLLECTION.

        If the PTransform is applied to a PCollectionTuple, TupleTag.getId() is used as the name of each corresponding PCollection.

        • The SQL query may use only a subset of the tables from the upstream PCollectionTuple;
        • If the SQL query references a table not included in the upstream PCollectionTuple, an IllegalStateException is thrown during query validation;
        • Tables from the upstream PCollectionTuple are valid only within the scope of the current query call.

        Any available implementation of QueryPlanner can be used as the query planner in SqlTransform. An implementation can be specified globally for the entire pipeline with BeamSqlPipelineOptions.getPlannerName(). The global planner can be overridden per-transform with withQueryPlannerClass(Class).
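
        The table-naming rules above can be sketched as follows. This is a minimal illustration, not a definitive implementation: the schema, the field names c1, c2, c3, the tag id FILTERED, and the class and method names are all assumptions made for the example. Running it also assumes a runner (for instance the direct runner) is on the classpath.

```java
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.extensions.sql.SqlTransform;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.schemas.Schema;
import org.apache.beam.sdk.transforms.Create;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.PCollectionTuple;
import org.apache.beam.sdk.values.Row;
import org.apache.beam.sdk.values.TupleTag;

public class TableNamingSketch {
  // Hypothetical schema, used only for this illustration.
  static final Schema SCHEMA =
      Schema.builder().addInt32Field("c1").addStringField("c2").addDoubleField("c3").build();

  /** Builds both queries on the given pipeline and returns the final result. */
  static PCollection<Row> build(Pipeline p) {
    PCollection<Row> input =
        p.apply(
            Create.of(
                    Row.withSchema(SCHEMA).addValues(1, "row", 1.0).build(),
                    Row.withSchema(SCHEMA).addValues(2, "row", 2.0).build(),
                    Row.withSchema(SCHEMA).addValues(3, "row", 3.0).build())
                .withRowSchema(SCHEMA));

    // Applied to a single PCollection: the input is visible as table PCOLLECTION.
    PCollection<Row> filtered =
        input.apply(
            "FilterRows",
            SqlTransform.query("select c1, c2, c3 from PCOLLECTION where c1 > 1"));

    // Applied to a PCollectionTuple: the TupleTag id ("FILTERED") is the table name.
    return PCollectionTuple.of(new TupleTag<>("FILTERED"), filtered)
        .apply(
            "SumByC2",
            SqlTransform.query("select c2, sum(c3) from FILTERED group by c2"));
  }

  public static void main(String[] args) {
    Pipeline p = Pipeline.create(PipelineOptionsFactory.create());
    build(p);
    p.run().waitUntilFinish();
  }
}
```

        Each query sees only the tables supplied by its own input, so the second query refers to FILTERED rather than PCOLLECTION.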

      • withDefaultTableProvider

        public SqlTransform withDefaultTableProvider​(java.lang.String name,
                                                     TableProvider tableProvider)
      • withNamedParameters

        public SqlTransform withNamedParameters​(java.util.Map<java.lang.String,​?> parameters)
      • withPositionalParameters

        public SqlTransform withPositionalParameters​(java.util.List<?> parameters)
      • withDdlString

        public SqlTransform withDdlString​(java.lang.String ddlString)
      • withAutoLoading

        public SqlTransform withAutoLoading​(boolean autoLoading)
      • registerUdf

        public SqlTransform registerUdf​(java.lang.String functionName,
                                        java.lang.Class<? extends BeamSqlUdf> clazz)
        Registers a UDF used in this query.

        Refer to BeamSqlUdf for details on implementing a UDF in Beam SQL.
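
        A minimal sketch of a class-based UDF, assuming the usual BeamSqlUdf convention of exposing the function through a method named eval. The class CubicInteger, the function name cubic, and the column c1 are hypothetical names chosen for the example.

```java
import org.apache.beam.sdk.extensions.sql.BeamSqlUdf;
import org.apache.beam.sdk.extensions.sql.SqlTransform;

public class UdfSketch {
  /** Hypothetical UDF that cubes an integer; the logic lives in a method named eval. */
  public static class CubicInteger implements BeamSqlUdf {
    public static Integer eval(Integer input) {
      return input * input * input;
    }
  }

  /** Returns a query transform with the UDF registered under the SQL name "cubic". */
  static SqlTransform cubeQuery() {
    return SqlTransform.query("select c1, cubic(c1) as c1_cubed from PCOLLECTION")
        .registerUdf("cubic", CubicInteger.class);
  }
}
```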

      • registerUdf

        public SqlTransform registerUdf​(java.lang.String functionName,
                                        org.apache.beam.sdk.transforms.SerializableFunction sfn)
        Registers a SerializableFunction as a UDF used in this query. Note that the SerializableFunction must have a no-argument constructor.
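
        A minimal sketch of the SerializableFunction variant. The class CubicIntegerFn, the function name cubic, and the column c1 are hypothetical; the class relies on its implicit no-argument constructor.

```java
import org.apache.beam.sdk.extensions.sql.SqlTransform;
import org.apache.beam.sdk.transforms.SerializableFunction;

public class FnUdfSketch {
  /** Hypothetical function that cubes an integer; has an implicit no-arg constructor. */
  public static class CubicIntegerFn implements SerializableFunction<Integer, Integer> {
    @Override
    public Integer apply(Integer input) {
      return input * input * input;
    }
  }

  /** Returns a query transform with the function registered under the SQL name "cubic". */
  static SqlTransform cubeQuery() {
    return SqlTransform.query("select cubic(c1) from PCOLLECTION")
        .registerUdf("cubic", new CubicIntegerFn());
  }
}
```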
      • registerUdaf

        public SqlTransform registerUdaf​(java.lang.String functionName,
                                         org.apache.beam.sdk.transforms.Combine.CombineFn combineFn)
        Registers a Combine.CombineFn as a UDAF used in this query.
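
        A minimal sketch of a UDAF built on Combine.CombineFn. The class SquareSum, the function name squaresum, and the columns c1 and c2 are hypothetical names chosen for the example.

```java
import org.apache.beam.sdk.extensions.sql.SqlTransform;
import org.apache.beam.sdk.transforms.Combine.CombineFn;

public class UdafSketch {
  /** Hypothetical aggregate: the sum of the squares of its integer inputs. */
  public static class SquareSum extends CombineFn<Integer, Integer, Integer> {
    @Override
    public Integer createAccumulator() {
      return 0;
    }

    @Override
    public Integer addInput(Integer accumulator, Integer input) {
      return accumulator + input * input;
    }

    @Override
    public Integer mergeAccumulators(Iterable<Integer> accumulators) {
      int sum = 0;
      for (Integer partial : accumulators) {
        sum += partial;
      }
      return sum;
    }

    @Override
    public Integer extractOutput(Integer accumulator) {
      return accumulator;
    }
  }

  /** Returns a grouped query with the aggregate registered under the SQL name "squaresum". */
  static SqlTransform squareSumQuery() {
    return SqlTransform.query("select c2, squaresum(c1) from PCOLLECTION group by c2")
        .registerUdaf("squaresum", new SquareSum());
  }
}
```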
      • withErrorsTransformer

        public SqlTransform withErrorsTransformer​(org.apache.beam.sdk.transforms.PTransform<org.apache.beam.sdk.values.PCollection<org.apache.beam.sdk.values.Row>,​? extends org.apache.beam.sdk.values.POutput> errorsTransformer)