Class BigQueryIO.Write<T>

  • All Implemented Interfaces:
    java.io.Serializable, org.apache.beam.sdk.transforms.display.HasDisplayData
  • Enclosing class:
    BigQueryIO

    public abstract static class BigQueryIO.Write<T>
    extends org.apache.beam.sdk.transforms.PTransform<org.apache.beam.sdk.values.PCollection<T>,​WriteResult>
    Implementation of BigQueryIO.write().
    See Also:
    Serialized Form
    • Constructor Detail

      • Write

        public Write()
    • Method Detail

      • to

        public BigQueryIO.Write<T> to​(com.google.api.services.bigquery.model.TableReference table)
        Writes to the given table, specified as a TableReference.
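
        For illustration, a minimal sketch of writing an existing PCollection of TableRows to a fixed table; the project, dataset, and table names, as well as the rows and tableSchema variables, are hypothetical:

          TableReference table = new TableReference()
              .setProjectId("my-project")
              .setDatasetId("my_dataset")
              .setTableId("quotes");

          rows.apply("WriteToBigQuery",
              BigQueryIO.writeTableRows()
                  .to(table)
                  .withSchema(tableSchema));   // a TableSchema defined elsewhere
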
      • to

        public BigQueryIO.Write<T> to​(org.apache.beam.sdk.options.ValueProvider<java.lang.String> tableSpec)
        Same as to(String), but with a ValueProvider.
      • to

        public BigQueryIO.Write<T> to​(org.apache.beam.sdk.transforms.SerializableFunction<org.apache.beam.sdk.values.ValueInSingleWindow<T>,​TableDestination> tableFunction)
        Writes to the table computed by the given table function for each element. The table is a function of ValueInSingleWindow, so it can be determined by the value or by the window.

        If the function produces destinations configured with clustering fields, ensure that withClustering() is also set so that the clustering configurations get properly encoded and decoded.
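
        As a sketch, a table function that shards output into one table per day based on the element's window; the Event type, table naming scheme, formatFn, and eventSchema are hypothetical:

          events.apply(BigQueryIO.<Event>write()
              .to((ValueInSingleWindow<Event> e) -> {
                  String day = DateTimeFormat.forPattern("yyyyMMdd")
                      .print(((IntervalWindow) e.getWindow()).start());
                  return new TableDestination("my-project:my_dataset.events_" + day, null);
              })
              .withFormatFunction(formatFn)    // a SerializableFunction<Event, TableRow>
              .withSchema(eventSchema));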

      • withFormatFunction

        public BigQueryIO.Write<T> withFormatFunction​(org.apache.beam.sdk.transforms.SerializableFunction<T,​com.google.api.services.bigquery.model.TableRow> formatFunction)
        Formats the user's type into a TableRow to be written to BigQuery.
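
        A sketch of a format function for a hypothetical Quote POJO; the type, its accessors, and the quoteSchema variable are not part of this API:

          quotes.apply(BigQueryIO.<Quote>write()
              .to("my-project:my_dataset.quotes")
              .withSchema(quoteSchema)
              .withFormatFunction(q -> new TableRow()
                  .set("source", q.getSource())
                  .set("quote", q.getText())));
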
      • withFormatRecordOnFailureFunction

        public BigQueryIO.Write<T> withFormatRecordOnFailureFunction​(org.apache.beam.sdk.transforms.SerializableFunction<T,​com.google.api.services.bigquery.model.TableRow> formatFunction)
        If an insert failure occurs, this function is applied to the originally supplied element of type T. The resulting TableRow can then be retrieved via WriteResult.getFailedInsertsWithErr().
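
        For example, the failure formatter can preserve a richer payload than the normal format function for later debugging; this is a sketch, and Quote and its accessors are hypothetical:

          quotes.apply(BigQueryIO.<Quote>write()
              .to("my-project:my_dataset.quotes")
              .withSchema(quoteSchema)
              .withFormatFunction(q -> new TableRow().set("quote", q.getText()))
              .withFormatRecordOnFailureFunction(q -> new TableRow()
                  .set("quote", q.getText())
                  .set("raw_json", q.getRawJson())));   // extra context kept only for failures
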
      • withAvroSchemaFactory

        public BigQueryIO.Write<T> withAvroSchemaFactory​(org.apache.beam.sdk.transforms.SerializableFunction<@Nullable com.google.api.services.bigquery.model.TableSchema,​org.apache.avro.Schema> avroSchemaFactory)
        Uses the specified function to convert a TableSchema to a Schema.

        If not specified, the TableSchema will automatically be converted to an Avro schema.

      • withSchema

        public BigQueryIO.Write<T> withSchema​(org.apache.beam.sdk.options.ValueProvider<com.google.api.services.bigquery.model.TableSchema> schema)
        Same as withSchema(TableSchema) but using a deferred ValueProvider.
      • withJsonSchema

        public BigQueryIO.Write<T> withJsonSchema​(org.apache.beam.sdk.options.ValueProvider<java.lang.String> jsonSchema)
        Same as withJsonSchema(String) but using a deferred ValueProvider.
      • withSchemaFromView

        public BigQueryIO.Write<T> withSchemaFromView​(org.apache.beam.sdk.values.PCollectionView<java.util.Map<java.lang.String,​java.lang.String>> view)
        Allows the schemas for each table to be computed within the pipeline itself.

        The input is a map-valued PCollectionView mapping string tablespecs to JSON-formatted TableSchemas. Tablespecs must be in the same format as taken by to(String).
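
        A sketch of supplying per-table JSON schemas through a side input; the pipeline variable, table specs, JSON schema strings, and the convention of routing on a "table_spec" field in each row are hypothetical:

          PCollectionView<Map<String, String>> schemaView = pipeline
              .apply("SchemaPerTable", Create.of(
                  KV.of("my-project:my_dataset.users", usersJsonSchema),
                  KV.of("my-project:my_dataset.orders", ordersJsonSchema)))
              .apply(View.asMap());

          rows.apply(BigQueryIO.writeTableRows()
              .to((ValueInSingleWindow<TableRow> row) -> new TableDestination(
                  (String) row.getValue().get("table_spec"), null))
              .withSchemaFromView(schemaView));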

      • withTimePartitioning

        public BigQueryIO.Write<T> withTimePartitioning​(com.google.api.services.bigquery.model.TimePartitioning partitioning)
        Allows newly created tables to include a TimePartitioning configuration. Can only be used when writing to a single table. If to(SerializableFunction) or to(DynamicDestinations) is used to write dynamic tables, time partitioning can be directly set in the returned TableDestination.
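
        A sketch of creating a day-partitioned table on a hypothetical event_time column with a 90-day partition expiration:

          rows.apply(BigQueryIO.writeTableRows()
              .to("my-project:my_dataset.events")
              .withSchema(eventSchema)
              .withTimePartitioning(new TimePartitioning()
                  .setType("DAY")
                  .setField("event_time")
                  .setExpirationMs(90L * 24 * 60 * 60 * 1000)));
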
      • withClustering

        public BigQueryIO.Write<T> withClustering()
        Allows writing to clustered tables when to(SerializableFunction) or to(DynamicDestinations) is used. The returned TableDestination objects should specify the clustering fields per table. If writing to a single table, use withClustering(Clustering) instead to pass a Clustering instance that specifies the static clustering fields to use.

        Setting this option enables use of TableDestinationCoderV3 which encodes clustering information. The updated coder is compatible with non-clustered tables, so can be freely set for newly deployed pipelines, but note that pipelines using an older coder must be drained before setting this option, since TableDestinationCoderV3 will not be able to read state written with a previous version.
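
        For a single clustered table, a sketch using withClustering(Clustering); the table, schema, and field names are hypothetical:

          rows.apply(BigQueryIO.writeTableRows()
              .to("my-project:my_dataset.events")
              .withSchema(eventSchema)
              .withTimePartitioning(new TimePartitioning().setType("DAY").setField("event_time"))
              .withClustering(new Clustering().setFields(Arrays.asList("country", "user_id"))));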

      • withTableDescription

        public BigQueryIO.Write<T> withTableDescription​(java.lang.String tableDescription)
        Specifies the table description.
      • withFailedInsertRetryPolicy

        public BigQueryIO.Write<T> withFailedInsertRetryPolicy​(InsertRetryPolicy retryPolicy)
        Specifies a policy for handling failed inserts.

        Currently this is only allowed when writing an unbounded collection to BigQuery. Bounded collections are written using batch load jobs, so we don't get per-element failures. Unbounded collections are written using streaming inserts, so we have access to per-element insert results.
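
        A sketch that retries only transient errors and formats the permanently failed rows; withExtendedErrorInfo() is assumed here as the companion setting that makes getFailedInsertsWithErr() available, and the table name is hypothetical:

          WriteResult result = rows.apply(BigQueryIO.writeTableRows()
              .to("my-project:my_dataset.events")
              .withMethod(BigQueryIO.Write.Method.STREAMING_INSERTS)
              .withExtendedErrorInfo()
              .withFailedInsertRetryPolicy(InsertRetryPolicy.retryTransientErrors()));

          result.getFailedInsertsWithErr()
              .apply("FormatFailures", MapElements.into(TypeDescriptors.strings())
                  .via((BigQueryInsertError err) ->
                      err.getRow() + " failed: " + err.getError()));
              // route the resulting strings to a dead-letter sink of your choice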

      • withoutValidation

        public BigQueryIO.Write<T> withoutValidation()
        Disables BigQuery table validation.
      • withLoadJobProjectId

        public BigQueryIO.Write<T> withLoadJobProjectId​(java.lang.String loadJobProjectId)
        Set the project the BigQuery load job will be initiated from. This is only applicable when the write method is set to BigQueryIO.Write.Method.FILE_LOADS. If omitted, the project of the destination table is used.
      • withLoadJobProjectId

        public BigQueryIO.Write<T> withLoadJobProjectId(org.apache.beam.sdk.options.ValueProvider<java.lang.String> loadJobProjectId)
        Same as withLoadJobProjectId(String) but with a ValueProvider.
      • withTriggeringFrequency

        public BigQueryIO.Write<T> withTriggeringFrequency​(org.joda.time.Duration triggeringFrequency)
        Choose the frequency at which file writes are triggered.

        This is only applicable when the write method is set to BigQueryIO.Write.Method.FILE_LOADS, and only when writing an unbounded PCollection.

        Every triggeringFrequency duration, a BigQuery load job will be generated for all the data written since the last load job. BigQuery has limits on how many load jobs can be triggered per day, so be careful not to set this duration too low, or you may exceed daily quota. Often this is set to 5 or 10 minutes to ensure that the project stays well under the BigQuery quota. See Quota Policy for more information about BigQuery quotas.
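
        A sketch of an unbounded write via periodic load jobs; withNumFileShards is assumed here as the usual companion setting for this mode, and the input, table, and schema are hypothetical:

          unboundedRows.apply(BigQueryIO.writeTableRows()
              .to("my-project:my_dataset.events")
              .withSchema(eventSchema)
              .withMethod(BigQueryIO.Write.Method.FILE_LOADS)
              .withTriggeringFrequency(Duration.standardMinutes(10))
              .withNumFileShards(100));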

      • withCustomGcsTempLocation

        public BigQueryIO.Write<T> withCustomGcsTempLocation​(org.apache.beam.sdk.options.ValueProvider<java.lang.String> customGcsTempLocation)
        Provides a custom location on GCS for storing temporary files to be loaded via BigQuery batch load jobs. See "Usage with templates" in BigQueryIO documentation for discussion.
      • skipInvalidRows

        public BigQueryIO.Write<T> skipInvalidRows()
        Insert all valid rows of a request, even if invalid rows exist. This is only applicable when the write method is set to BigQueryIO.Write.Method.STREAMING_INSERTS. The default value is false, which causes the entire request to fail if any invalid rows exist.
      • ignoreUnknownValues

        public BigQueryIO.Write<T> ignoreUnknownValues()
        Accept rows that contain values that do not match the schema. The unknown values are ignored. Default is false, which treats unknown values as errors.
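
        A sketch combining the two streaming-insert tolerance options above; the input and table name are hypothetical:

          rows.apply(BigQueryIO.writeTableRows()
              .to("my-project:my_dataset.events")
              .withMethod(BigQueryIO.Write.Method.STREAMING_INSERTS)
              .skipInvalidRows()
              .ignoreUnknownValues());
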
      • useAvroLogicalTypes

        public BigQueryIO.Write<T> useAvroLogicalTypes()
        Enables interpreting logical types as their corresponding types (e.g. TIMESTAMP), instead of only using their raw types (e.g. LONG).
      • ignoreInsertIds

        public BigQueryIO.Write<T> ignoreInsertIds()
        Setting this option to true disables insertId based data deduplication offered by BigQuery. For more information, please see https://cloud.google.com/bigquery/streaming-data-into-bigquery#disabling_best_effort_de-duplication.
      • optimizedWrites

        public BigQueryIO.Write<T> optimizedWrites()
        Enables new codepaths that are expected to use fewer resources while writing to BigQuery. Not enabled by default in order to maintain backwards compatibility.
      • useBeamSchema

        @Experimental(SCHEMAS)
        public BigQueryIO.Write<T> useBeamSchema()
        Causes the BigQuery schema to be inferred from the input's Beam schema. If no formatFunction is set, BigQueryIO will automatically convert the input records into TableRows that match the schema.
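
        A sketch assuming purchases is a PCollection of a hypothetical Purchase POJO registered with a Beam schema (for example via @DefaultSchema); the table name is also hypothetical:

          purchases.apply(BigQueryIO.<Purchase>write()
              .to("my-project:my_dataset.purchases")
              .useBeamSchema()
              .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_IF_NEEDED));
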
      • withSuccessfulInsertsPropagation

        public BigQueryIO.Write<T> withSuccessfulInsertsPropagation​(boolean propagateSuccessful)
        If true, enables the propagation of the successfully inserted TableRows on BigQuery as part of the WriteResult object when using BigQueryIO.Write.Method.STREAMING_INSERTS. By default this property is set to true. If a pipeline does not make use of the insert results, it can be set to false, which lets the pipeline discard the inserted TableRows and reclaim worker resources.
      • withAutoSchemaUpdate

        public BigQueryIO.Write<T> withAutoSchemaUpdate​(boolean autoSchemaUpdate)
        If true, enables automatically detecting BigQuery table schema updates. Table schema updates are usually noticed within several minutes. Only supported when using one of the STORAGE_API insert methods.
      • withDeterministicRecordIdFn

        @Experimental
        public BigQueryIO.Write<T> withDeterministicRecordIdFn​(org.apache.beam.sdk.transforms.SerializableFunction<T,​java.lang.String> toUniqueIdFunction)
      • withMaxFilesPerBundle

        public BigQueryIO.Write<T> withMaxFilesPerBundle​(int maxFilesPerBundle)
        Controls how many files a single worker will write concurrently when using BigQuery load jobs, before spilling to a shuffle. When data comes into this transform, it is written to one file per destination per worker. When there are more files than maxFilesPerBundle (default: 20), the data is shuffled (i.e. grouped by destination) and then written to files one at a time per worker. This flag sets the maximum number of files that a single worker can write concurrently before shuffling the data. Use it with caution: setting a high number can increase memory pressure on workers, and setting a low number can slow a pipeline down (due to the need to shuffle data).
      • withMaxBytesPerPartition

        public BigQueryIO.Write<T> withMaxBytesPerPartition​(long maxBytesPerPartition)
        Control how much data will be assigned to a single BigQuery load job. If the amount of data flowing into one BatchLoads partition exceeds this value, that partition will be handled via multiple load jobs.

        The default value (11 TiB) respects BigQuery's maximum size per load job limit and is appropriate for most use cases. Reducing the value of this parameter can improve stability when loading to tables with complex schemas containing thousands of fields.

        See Also:
        BigQuery Load Job Limits
      • withWriteTempDataset

        public BigQueryIO.Write<T> withWriteTempDataset​(java.lang.String writeTempDataset)
        Specifies a temporary dataset. When writing to BigQuery via file loads, BigQueryIO.write() creates temporary tables in a dataset to store staging data from partitions. With this option, you can provide an existing dataset for those temporary tables. BigQueryIO will create the temporary tables in that dataset and will remove them once they are no longer needed. No other tables in the dataset will be modified. Note that the dataset must already exist and your job needs permissions to create and remove tables inside that dataset.
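
        A sketch pointing the staging tables at a pre-existing dataset; the dataset and table names, input, and schema are hypothetical:

          rows.apply(BigQueryIO.writeTableRows()
              .to("my-project:my_dataset.events")
              .withSchema(eventSchema)
              .withWriteTempDataset("temp_staging_dataset"));
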
      • validate

        public void validate​(org.apache.beam.sdk.options.PipelineOptions pipelineOptions)
        Overrides:
        validate in class org.apache.beam.sdk.transforms.PTransform<org.apache.beam.sdk.values.PCollection<T>,​WriteResult>
      • expand

        public WriteResult expand​(org.apache.beam.sdk.values.PCollection<T> input)
        Specified by:
        expand in class org.apache.beam.sdk.transforms.PTransform<org.apache.beam.sdk.values.PCollection<T>,​WriteResult>
      • populateDisplayData

        public void populateDisplayData​(org.apache.beam.sdk.transforms.display.DisplayData.Builder builder)
        Specified by:
        populateDisplayData in interface org.apache.beam.sdk.transforms.display.HasDisplayData
        Overrides:
        populateDisplayData in class org.apache.beam.sdk.transforms.PTransform<org.apache.beam.sdk.values.PCollection<T>,​WriteResult>
      • getTable

        public @Nullable org.apache.beam.sdk.options.ValueProvider<com.google.api.services.bigquery.model.TableReference> getTable()
        Returns the table reference, or null.