Class BigQueryIO
- java.lang.Object
  - org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO
public class BigQueryIO extends java.lang.Object
PTransforms for reading and writing BigQuery tables.

Table References
A fully-qualified BigQuery table name consists of three components:

- projectId: the Cloud project id (defaults to GcpOptions.getProject()).
- datasetId: the BigQuery dataset id, unique within a project.
- tableId: a table id, unique within a dataset.
BigQuery table references are stored as a TableReference, which comes from the BigQuery Java Client API. Tables can be referred to as Strings, with or without the projectId. A helper function is provided (BigQueryHelpers.parseTableSpec(String)) that parses the following string forms into a TableReference:

- [project_id]:[dataset_id].[table_id]
- [dataset_id].[table_id]
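For example, a minimal sketch of parsing a table spec with this helper (the table name is illustrative):

TableReference ref =
    BigQueryHelpers.parseTableSpec("my-project:my_dataset.my_table");
// ref.getProjectId() is "my-project", ref.getDatasetId() is "my_dataset",
// and ref.getTableId() is "my_table".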
BigQuery Concepts
Tables have rows (TableRow) and each row has cells (TableCell). A table has a schema (TableSchema), which in turn describes the schema of each cell (TableFieldSchema). The terms field and cell are used interchangeably.

- TableSchema: describes the schema (types and order) for values in each row. It has one attribute, 'fields', which is a list of TableFieldSchema objects.
- TableFieldSchema: describes the schema (type, name) for one field. It has several attributes, including 'name' and 'type'. Common values for the type attribute are: 'STRING', 'INTEGER', 'FLOAT', 'BOOLEAN', 'NUMERIC', 'GEOGRAPHY'. All possible values are described at: https://cloud.google.com/bigquery/docs/reference/standard-sql/data-types
- TableRow: holds all values in a table row. It has one attribute, 'f', which is a list of TableCell instances.
- TableCell: holds the value for one cell (or field). It has one attribute, 'v', which is the value of the table cell.

As of Beam 2.7.0, the NUMERIC data type is supported. This data type supports high-precision decimal numbers (precision of 38 digits, scale of 9 digits). The GEOGRAPHY data type works with Well-Known Text (see https://en.wikipedia.org/wiki/Well-known_text) format for reading and writing to BigQuery. BigQuery IO requires values of BYTES datatype to be encoded using base64 encoding when writing to BigQuery. When bytes are read from BigQuery they are returned as base64-encoded strings.
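As a quick illustration of these classes, here is a minimal sketch that builds a schema and a matching row (the field names are chosen for the example; ImmutableList is Guava's):

TableSchema schema = new TableSchema().setFields(ImmutableList.of(
    new TableFieldSchema().setName("word").setType("STRING"),
    new TableFieldSchema().setName("count").setType("INTEGER")));
TableRow row = new TableRow().set("word", "beam").set("count", 42);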
Reading
Reading from BigQuery is supported by read(SerializableFunction), which parses records in AVRO format into a custom type (see the table below for type conversion) using a specified parse function, and by readTableRows(), which parses them into TableRow; this may be more convenient but has lower performance.

Both functions support reading either from a table or from the result of a query, via BigQueryIO.TypedRead.from(String) and BigQueryIO.TypedRead.fromQuery(java.lang.String) respectively. Exactly one of these must be specified.

If you are reading from an authorized view with BigQueryIO.TypedRead.fromQuery(java.lang.String), you need to use BigQueryIO.TypedRead.withQueryLocation(String) to set the location of the BigQuery job. Otherwise, Beam will try to determine that location by reading the metadata of the dataset that contains the underlying tables. With authorized views, that will result in a 403 error and the query will not be resolved.

Type Conversion Table
BigQuery standard SQL type | Avro type | Java type
BOOLEAN                    | boolean   | Boolean
INT64                      | long      | Long
FLOAT64                    | double    | Double
BYTES                      | bytes     | java.nio.ByteBuffer
STRING                     | string    | CharSequence
DATE                       | int       | Integer
DATETIME                   | string    | CharSequence
TIMESTAMP                  | long      | Long
TIME                       | long      | Long
NUMERIC                    | bytes     | java.nio.ByteBuffer
GEOGRAPHY                  | string    | CharSequence
ARRAY                      | array     | java.util.Collection
STRUCT                     | record    | org.apache.avro.generic.GenericRecord

Example: Reading rows of a table as TableRow.

PCollection<TableRow> weatherData = pipeline.apply(
    BigQueryIO.readTableRows().from("clouddataflow-readonly:samples.weather_stations"));

Example: Reading rows of a table and parsing them into a custom type.

PCollection<WeatherRecord> weatherData = pipeline.apply(
    BigQueryIO
        .read(new SerializableFunction<SchemaAndRecord, WeatherRecord>() {
          public WeatherRecord apply(SchemaAndRecord schemaAndRecord) {
            return new WeatherRecord(...);
          }
        })
        .from("clouddataflow-readonly:samples.weather_stations")
        .withCoder(SerializableCoder.of(WeatherRecord.class)));
Note: When using read(SerializableFunction), you may sometimes need to use BigQueryIO.TypedRead.withCoder(Coder) to specify a Coder for the result type, if Beam fails to infer it automatically.
Example: Reading results of a query as TableRow.

PCollection<TableRow> meanTemperatureData = pipeline.apply(
    BigQueryIO.readTableRows()
        .fromQuery("SELECT year, mean_temp FROM [samples.weather_stations]"));
Users can optionally specify a query priority using BigQueryIO.TypedRead.withQueryPriority(TypedRead.QueryPriority) and a geographic location where the query will be executed using BigQueryIO.TypedRead.withQueryLocation(String). The query location must be specified for jobs that are not executed in the US or EU, or if you are reading from an authorized view. See BigQuery Jobs: query.
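For example, a minimal sketch combining both options (the query is taken from the example above; the location string is illustrative):

PCollection<TableRow> meanTemperatureData = pipeline.apply(
    BigQueryIO.readTableRows()
        .fromQuery("SELECT year, mean_temp FROM [samples.weather_stations]")
        .withQueryPriority(BigQueryIO.TypedRead.QueryPriority.INTERACTIVE)
        .withQueryLocation("asia-northeast1"));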
Writing

To write to a BigQuery table, apply a BigQueryIO.Write transformation. This consumes a PCollection of a user-defined type when using write() (recommended), or a PCollection of TableRows as input when using writeTableRows() (not recommended). When using a user-defined type, one of the following must be provided:

- BigQueryIO.Write.withAvroFormatFunction(SerializableFunction) (recommended) to write data using Avro records.
- BigQueryIO.Write.withAvroWriter(org.apache.beam.sdk.transforms.SerializableFunction<org.apache.avro.Schema, org.apache.avro.io.DatumWriter<T>>) to write Avro data using a user-specified DatumWriter (and format function).
- BigQueryIO.Write.withFormatFunction(SerializableFunction) to write data as JSON-encoded TableRows.
If BigQueryIO.Write.withAvroFormatFunction(SerializableFunction) or BigQueryIO.Write.withAvroWriter(org.apache.beam.sdk.transforms.SerializableFunction<org.apache.avro.Schema, org.apache.avro.io.DatumWriter<T>>) is used, the table schema MUST be specified using one of the BigQueryIO.Write.withJsonSchema(String), BigQueryIO.Write.withJsonSchema(ValueProvider), or BigQueryIO.Write.withSchemaFromView(PCollectionView) methods, or BigQueryIO.Write.to(DynamicDestinations).

class Quote {
  final Instant timestamp;
  final String exchange;
  final String symbol;
  final double price;

  Quote(Instant timestamp, String exchange, String symbol, double price) {
    // initialize all member variables.
  }
}

PCollection<Quote> quotes = ...

quotes.apply(BigQueryIO
    .<Quote>write()
    .to("my-project:my_dataset.my_table")
    .withSchema(new TableSchema().setFields(
        ImmutableList.of(
            new TableFieldSchema().setName("timestamp").setType("TIMESTAMP"),
            new TableFieldSchema().setName("exchange").setType("STRING"),
            new TableFieldSchema().setName("symbol").setType("STRING"),
            new TableFieldSchema().setName("price").setType("FLOAT"))))
    .withFormatFunction(quote -> new TableRow().set(..set the columns..))
    .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_TRUNCATE));
See BigQueryIO.Write for details on how to specify if a write should append to an existing table, replace the table, or verify that the table is empty. Note that the dataset being written to must already exist. Unbounded PCollections can only be written using BigQueryIO.Write.WriteDisposition.WRITE_EMPTY or BigQueryIO.Write.WriteDisposition.WRITE_APPEND.

BigQueryIO supports automatically inferring the BigQuery table schema from the Beam schema on the input PCollection. Beam can also automatically format the input into a TableRow in this case, if no format function is provided. In the above example, the quotes PCollection has a schema that Beam infers from the Quote POJO. So the write could be done more simply as follows:
@DefaultSchema(JavaFieldSchema.class)
class Quote {
  final Instant timestamp;
  final String exchange;
  final String symbol;
  final double price;

  @SchemaCreate
  Quote(Instant timestamp, String exchange, String symbol, double price) {
    // initialize all member variables.
  }
}

PCollection<Quote> quotes = ...

quotes.apply(BigQueryIO
    .<Quote>write()
    .to("my-project:my_dataset.my_table")
    .useBeamSchema()
    .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_TRUNCATE));
Loading historical data into time-partitioned BigQuery tables
To load historical data into a time-partitioned BigQuery table, specify BigQueryIO.Write.withTimePartitioning(com.google.api.services.bigquery.model.TimePartitioning) with a field used for column-based partitioning. For example:

PCollection<Quote> quotes = ...;

quotes.apply(BigQueryIO.write()
    .withSchema(schema)
    .withFormatFunction(quote -> new TableRow()
        .set("timestamp", quote.getTimestamp())
        .set(..other columns..))
    .to("my-project:my_dataset.my_table")
    .withTimePartitioning(new TimePartitioning().setField("time")));
Writing different values to different tables
A common use case is to dynamically generate BigQuery table names based on the current value. To support this, BigQueryIO.Write.to(SerializableFunction) accepts a function mapping the current element to a tablespec. For example, here's code that outputs quotes of different stocks to different tables:

PCollection<Quote> quotes = ...;

quotes.apply(BigQueryIO.write()
    .withSchema(schema)
    .withFormatFunction(quote -> new TableRow()...)
    .to((ValueInSingleWindow<Quote> quote) -> {
        String symbol = quote.getValue().getSymbol();
        return new TableDestination(
            "my-project:my_dataset.quotes_" + symbol, // Table spec
            "Quotes of stock " + symbol // Table description
        );
      }));
Per-table schemas can also be provided using BigQueryIO.Write.withSchemaFromView(org.apache.beam.sdk.values.PCollectionView<java.util.Map<java.lang.String, java.lang.String>>). This allows the schemas to be calculated based on a previous pipeline stage, or statically via a Create transform. This method expects to receive a map-valued PCollectionView, mapping table specifications (project:dataset.table-id) to JSON-formatted TableSchema objects. All destination tables must be present in this map, or the pipeline will fail to create tables. Care should be taken if the map value is based on a triggered aggregation over an unbounded PCollection; the side input will contain the entire history of all table schemas ever generated, which might blow up memory usage. This method can also be useful when writing to a single table, as it allows a previous stage to calculate the schema (possibly based on the full collection of records being written to BigQuery).
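For example, a minimal sketch of withSchemaFromView, assuming the quotes PCollection and Quote class from the examples above and a single destination table (the JSON schema string is illustrative):

PCollectionView<Map<String, String>> schemaView = pipeline
    .apply("SchemaMap", Create.of(KV.of(
        "my-project:my_dataset.my_table",
        "{\"fields\": [{\"name\": \"symbol\", \"type\": \"STRING\"}]}")))
    .apply(View.asMap());

quotes.apply(BigQueryIO.<Quote>write()
    .to("my-project:my_dataset.my_table")
    .withFormatFunction(quote -> new TableRow().set("symbol", quote.symbol))
    .withSchemaFromView(schemaView));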
For the most general form of dynamic table destinations and schemas, look at BigQueryIO.Write.to(DynamicDestinations).

Insertion Method
BigQueryIO.Write supports two methods of inserting data into BigQuery, specified using BigQueryIO.Write.withMethod(org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO.Write.Method). If no method is supplied, then a default method will be chosen based on the input PCollection. See BigQueryIO.Write.Method for more information about the methods. The different insertion methods provide different tradeoffs of cost, quota, and data consistency; please see the BigQuery documentation for more information about these tradeoffs.
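For example, a minimal sketch that forces batch loads, again assuming the quotes PCollection from the examples above:

quotes.apply(BigQueryIO.<Quote>write()
    .to("my-project:my_dataset.my_table")
    .useBeamSchema()
    .withMethod(BigQueryIO.Write.Method.FILE_LOADS));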
Usage with templates

When using read() or readTableRows() in a template, it's required to specify BigQueryIO.Read.withTemplateCompatibility(). Specifying this in a non-template pipeline is not recommended because it has somewhat lower performance.
When using write() or writeTableRows() with batch loads in a template, it is recommended to specify BigQueryIO.Write.withCustomGcsTempLocation(org.apache.beam.sdk.options.ValueProvider<java.lang.String>). Writing to BigQuery via batch loads involves writing temporary files to this location, so the location must be accessible at pipeline execution time. By default, this location is captured at pipeline construction time and may be inaccessible if the template is reused from a different project or at a moment when the original location no longer exists. BigQueryIO.Write.withCustomGcsTempLocation(ValueProvider) allows specifying the location as an argument to the template invocation.
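A minimal sketch, assuming a hypothetical options interface that exposes the temp location as a template runtime parameter:

public interface MyOptions extends PipelineOptions {
  // Hypothetical runtime parameter, supplied at template invocation time.
  ValueProvider<String> getGcsTempLocation();
  void setGcsTempLocation(ValueProvider<String> value);
}

MyOptions options = pipeline.getOptions().as(MyOptions.class);
quotes.apply(BigQueryIO.<Quote>write()
    .to("my-project:my_dataset.my_table")
    .useBeamSchema()
    .withCustomGcsTempLocation(options.getGcsTempLocation()));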
Permissions

Permission requirements depend on the PipelineRunner that is used to execute the pipeline. Please refer to the documentation of the corresponding PipelineRunners for more details.

Please see BigQuery Access Control for security and permission related information specific to BigQuery.
Updates to the I/O connector code
For any significant updates to this I/O connector, please consider involving corresponding code reviewers mentioned here.
Nested Class Summary
Nested Classes:

- static class BigQueryIO.Read: Implementation of read().
- static class BigQueryIO.TypedRead<T>: Implementation of read(SerializableFunction).
- static class BigQueryIO.Write<T>: Implementation of write().
Field Summary

Fields:

- static java.lang.String BIGQUERY_JOB_TEMPLATE: Template for BigQuery jobs created by BigQueryIO.
Method Summary

Methods:

- static BigQueryIO.Read read(): Deprecated. Use read(SerializableFunction) or readTableRows() instead.
- static <T> BigQueryIO.TypedRead<T> read(org.apache.beam.sdk.transforms.SerializableFunction<SchemaAndRecord,T> parseFn): Reads from a BigQuery table or query and returns a PCollection with one element per row of the table or query result, parsed from the BigQuery AVRO format using the specified function.
- static BigQueryIO.TypedRead<com.google.api.services.bigquery.model.TableRow> readTableRows(): Like read(SerializableFunction) but represents each row as a TableRow.
- static BigQueryIO.TypedRead<com.google.api.services.bigquery.model.TableRow> readTableRowsWithSchema(): Like readTableRows() but with Schema support.
- static <T> BigQueryIO.TypedRead<T> readWithDatumReader(org.apache.beam.sdk.io.AvroSource.DatumReaderFactory<T> readerFactory): Reads from a BigQuery table or query and returns a PCollection with one element per row of the table or query result.
- static <T> BigQueryIO.Write<T> write(): A PTransform that writes a PCollection to a BigQuery table.
- static BigQueryIO.Write<com.google.api.services.bigquery.model.TableRow> writeTableRows(): A PTransform that writes a PCollection containing TableRows to a BigQuery table.
Field Detail
BIGQUERY_JOB_TEMPLATE

public static final java.lang.String BIGQUERY_JOB_TEMPLATE

Template for BigQuery jobs created by BigQueryIO. This template is "beam_bq_job_{TYPE}_{JOB_ID}_{STEP}_{RANDOM}", where:

- TYPE represents the BigQuery job type (e.g. extract / copy / load / query)
- JOB_ID is the Beam job name.
- STEP is a UUID representing the Dataflow step that created the BQ job.
- RANDOM is a random string.

NOTE: This job name template does not have backwards compatibility guarantees.

See Also: Constant Field Values
Method Detail
read

@Deprecated
public static BigQueryIO.Read read()

Deprecated. Use read(SerializableFunction) or readTableRows() instead.

readTableRows() does exactly the same as read(); however, read(SerializableFunction) performs better.
readTableRows

public static BigQueryIO.TypedRead<com.google.api.services.bigquery.model.TableRow> readTableRows()

Like read(SerializableFunction) but represents each row as a TableRow.

This method is more convenient to use in some cases, but usually has significantly lower performance than using read(SerializableFunction) directly to parse data into a domain-specific type, due to the overhead of converting the rows to TableRow.
readTableRowsWithSchema

public static BigQueryIO.TypedRead<com.google.api.services.bigquery.model.TableRow> readTableRowsWithSchema()

Like readTableRows() but with Schema support.
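For example, a minimal sketch, assuming the schema-aware Convert.toRows() transform from org.apache.beam.sdk.schemas.transforms (table name taken from the examples above):

PCollection<TableRow> rows = p.apply(
    BigQueryIO.readTableRowsWithSchema()
        .from("clouddataflow-readonly:samples.weather_stations"));
// Because the output carries a Beam Schema, schema-aware transforms apply:
PCollection<Row> beamRows = rows.apply(Convert.toRows());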
read

public static <T> BigQueryIO.TypedRead<T> read(org.apache.beam.sdk.transforms.SerializableFunction<SchemaAndRecord,T> parseFn)

Reads from a BigQuery table or query and returns a PCollection with one element per row of the table or query result, parsed from the BigQuery AVRO format using the specified function.

Each SchemaAndRecord contains a BigQuery TableSchema and a GenericRecord representing the row, indexed by column name. Here is a sample parse function that parses click events from a table.

class ClickEvent { long userId; String url; ... }

p.apply(BigQueryIO.read(new SerializableFunction<SchemaAndRecord, ClickEvent>() {
  public ClickEvent apply(SchemaAndRecord record) {
    GenericRecord r = record.getRecord();
    return new ClickEvent((Long) r.get("userId"), (String) r.get("url"));
  }
}).from("..."));
readWithDatumReader

public static <T> BigQueryIO.TypedRead<T> readWithDatumReader(org.apache.beam.sdk.io.AvroSource.DatumReaderFactory<T> readerFactory)

Reads from a BigQuery table or query and returns a PCollection with one element per row of the table or query result. This API directly deserializes BigQuery AVRO data to the input class, based on the appropriate DatumReader.

class ClickEvent { long userId; String url; ... }

p.apply(BigQueryIO.readWithDatumReader(
    (AvroSource.DatumReaderFactory<ClickEvent>) (writer, reader) ->
        new ReflectDatumReader<>(ReflectData.get().getSchema(ClickEvent.class)))
    .from("..."));
write

public static <T> BigQueryIO.Write<T> write()

A PTransform that writes a PCollection to a BigQuery table. A formatting function must be provided to convert each input element into a TableRow using BigQueryIO.Write.withFormatFunction(SerializableFunction).

In BigQuery, each table has an enclosing dataset. The dataset being written to must already exist.

By default, tables will be created if they do not exist, which corresponds to a BigQueryIO.Write.CreateDisposition.CREATE_IF_NEEDED disposition that matches the default of BigQuery's Jobs API. A schema must be provided (via BigQueryIO.Write.withSchema(TableSchema)), or else the transform may fail at runtime with an IllegalArgumentException.

By default, writes require an empty table, which corresponds to a BigQueryIO.Write.WriteDisposition.WRITE_EMPTY disposition that matches the default of BigQuery's Jobs API.

Here is a sample transform that produces TableRow values containing "word" and "count" columns:

static class FormatCountsFn extends DoFn<KV<String, Long>, TableRow> {
  @ProcessElement
  public void processElement(ProcessContext c) {
    TableRow row = new TableRow()
        .set("word", c.element().getKey())
        .set("count", c.element().getValue().intValue());
    c.output(row);
  }
}
writeTableRows

public static BigQueryIO.Write<com.google.api.services.bigquery.model.TableRow> writeTableRows()

A PTransform that writes a PCollection containing TableRows to a BigQuery table.

It is recommended to instead use write() with BigQueryIO.Write.withFormatFunction(SerializableFunction).
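For example, a minimal sketch (the table spec is illustrative, and schema is assumed to be a TableSchema as in the write() examples above):

PCollection<TableRow> rows = ...;
rows.apply(BigQueryIO.writeTableRows()
    .to("my-project:my_dataset.my_table")
    .withSchema(schema)
    .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND));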