Cloud Storage transfers

The BigQuery Data Transfer Service for Cloud Storage allows you to schedule recurring data loads from Cloud Storage to BigQuery.

Before you begin

Before you create a Cloud Storage transfer:

  • Verify that you have completed all actions required in Enabling the BigQuery Data Transfer Service.
  • Retrieve your Cloud Storage URI.
  • Create a BigQuery dataset to store your data.
  • Create the destination table for your transfer and specify the schema definition, as in the example following this list.
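For example, the dataset and destination table could be created with the bq command-line tool. This is a minimal sketch; the dataset, table, and schema names (mydataset, mytable, name:STRING,post_abbr:STRING) are placeholder values, not names from this guide:

  # Create the dataset that will receive the transferred data.
  bq mk --dataset mydataset

  # Create the destination table with an explicit schema.
  bq mk --table mydataset.mytable name:STRING,post_abbr:STRING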

Limitations

Recurring transfers from Cloud Storage to BigQuery are subject to the following limitations:

  • All files matching the patterns defined by either a wildcard or by runtime parameters for your transfer must share the same schema you defined for the destination table, or the transfer will fail. Table schema changes between runs also cause the transfer to fail.
  • Because Cloud Storage objects can be versioned, it's important to note that archived Cloud Storage objects are not supported for BigQuery transfers. Objects must be live to be transferred.
  • Unlike individual loads of data from Cloud Storage to BigQuery, for ongoing transfers you need to create the destination table and its schema in advance of setting up the transfer. BigQuery cannot create the table as part of the recurring data transfer process.
  • Transfers from Cloud Storage set the Write preference parameter to APPEND by default. In this mode, an unmodified file can only be loaded into BigQuery once. If the file's last modification time property is updated, the file will be reloaded.
  • BigQuery Data Transfer Service does not guarantee that all files will be transferred, or transferred only once, if Cloud Storage files are modified mid-transfer.
  • If your dataset's location is set to a value other than US, the regional or multi-regional Cloud Storage bucket must be in the same region as the dataset.
  • BigQuery does not guarantee data consistency for external data sources. Changes to the underlying data while a query is running can result in unexpected behavior.
  • BigQuery does not support Cloud Storage object versioning. If you include a generation number in the Cloud Storage URI, then the load job fails.

  • Depending on the format of your Cloud Storage source data, there may be additional limitations. For more information, see:

    • CSV limitations
    • JSON limitations
    • Datastore export limitations
    • Firestore export limitations
    • Limitations on nested and repeated data
  • Your Cloud Storage bucket must be in a region or multi-region that is compatible with the region or multi-region of the destination dataset in BigQuery. This is known as colocation. See Cloud Storage transfer Data locations for details; a quick way to check both locations is shown after this list.
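As a sketch of that check, you can compare the bucket's location constraint with the dataset's location. The bucket and dataset names (mybucket, mydataset) are placeholders:

  # Show the bucket's location constraint.
  gsutil ls -L -b gs://mybucket

  # Show the dataset's metadata, including its location.
  bq show --format=prettyjson mydataset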

Minimum intervals

  • Source files are picked up for transfer immediately, with no minimum file age.
  • The minimum interval time between recurring transfers is 15 minutes. The default interval for a recurring transfer is every 24 hours.

Required permissions

When you load data into BigQuery, you need permissions that allow you to load data into new or existing BigQuery tables and partitions. If you are loading data from Cloud Storage, you'll also need access to the bucket that contains your data. Ensure that you have the following required permissions:

  • BigQuery: Ensure that the person creating the transfer has the following permissions in BigQuery:

    • bigquery.transfers.update permissions to create the transfer
    • Both bigquery.datasets.get and bigquery.datasets.update permissions on the target dataset

    The bigquery.admin predefined IAM role includes bigquery.transfers.update, bigquery.datasets.update, and bigquery.datasets.get permissions. For more information on IAM roles in BigQuery Data Transfer Service, see the Access control reference.

  • Cloud Storage: storage.objects.get permissions are required on the individual bucket or higher. If you are using a URI wildcard, you must also have storage.objects.list permissions. If you would like to delete the source files after each successful transfer, you also need storage.objects.delete permissions. The storage.objectAdmin predefined IAM role includes all of these permissions. A sketch of granting these roles follows this list.
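As a sketch, the roles mentioned above could be granted with gcloud and gsutil. The project, principal, and bucket names here (my-project, transfer-creator@example.com, mybucket) are placeholders, and your organization may prefer narrower roles than bigquery.admin and storage.objectAdmin:

  # Grant BigQuery permissions to the person creating the transfer.
  gcloud projects add-iam-policy-binding my-project \
      --member='user:transfer-creator@example.com' \
      --role='roles/bigquery.admin'

  # Grant Cloud Storage permissions on the source bucket.
  gsutil iam ch user:transfer-creator@example.com:roles/storage.objectAdmin gs://mybucket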

Setting up a Cloud Storage transfer

To create a Cloud Storage transfer in BigQuery Data Transfer Service:

Console

  1. Go to the BigQuery page in the Cloud Console.

    Go to the BigQuery page

  2. Click Transfers.

  3. Click Create.

  4. On the Create Transfer page:

    • In the Source type section, for Source, choose Cloud Storage.

      Transfer source

    • In the Transfer config name section, for Display name, enter a name for the transfer such as My Transfer. The transfer name can be any value that allows you to easily identify the transfer if you need to modify it later.

      Transfer name

    • In the Schedule options section, for Schedule, leave the default value (Start now) or click Start at a set time.

      • For Repeats, choose an option for how often to run the transfer. The minimum interval is 15 minutes.
        • Daily (default)
        • Weekly
        • Monthly
        • Custom. For Custom Schedule, enter a custom frequency; for example, every day 00:00. See Formatting the schedule.
        • On-demand
      • For Start date and run time, enter the date and time to start the transfer. If you choose Start now, this option is disabled.

        Transfer schedule

    • In the Destination settings section, for Destination dataset, choose the dataset you created to store your data.

      Transfer dataset

    • In the Data source details section:

      • For Destination table, enter the name of your destination table. The destination table must follow the table naming rules. Destination table names also support parameters.
      • For Cloud Storage URI, enter the Cloud Storage URI. Wildcards and parameters are supported.
      • For Write preference, choose:

        • APPEND to append new data to your existing destination table. BigQuery load jobs are triggered with the WRITE_APPEND preference. For more information, see the writeDisposition field details of the JobConfigurationLoad object. The default value for Write preference is APPEND.
        • MIRROR to refresh data in the destination table, to reflect modified data in the source. MIRROR overwrites the destination table with a fresh copy of the data.

      • For Delete source files after transfer, check the box if you want to delete the source files after each successful transfer. Delete jobs are best effort. Delete jobs do not retry if the first attempt to delete the source files fails.

      • In the Transfer Options section:

        • Under All Formats:
          • For Number of errors allowed, enter the maximum number of bad records that BigQuery can ignore when running the job. If the number of bad records exceeds this value, an 'invalid' error is returned in the job result, and the job fails. The default value is 0.
          • (Optional) For Decimal target types, enter a comma-separated list of possible SQL data types that the source decimal values could be converted to. Which SQL data type is selected for conversion depends on the following conditions:
            • The data type selected for conversion will be the first data type in the following list that supports the precision and scale of the source data, in this order: NUMERIC, BIGNUMERIC, and STRING.
            • If none of the listed data types will support the precision and the scale, the data type supporting the widest range in the specified list is selected. If a value exceeds the supported range when reading the source data, an error will be thrown.
            • The data type STRING supports all precision and scale values.
            • If this field is left empty, the data type will default to "NUMERIC,STRING" for ORC, and "NUMERIC" for the other file formats.
            • This field cannot contain duplicate data types.
            • The order of the data types that you list in this field is ignored.
        • Under JSON, CSV:
          • For Ignore unknown values, check the box if you want the transfer to drop data that does not fit the destination table's schema.
        • Under CSV:

          • For Field delimiter, enter the character that separates fields. The default value is a comma.
          • For Quote character, enter the character that is used to quote data sections in a CSV file. The default value is a double-quote (").
          • For Header rows to skip, enter the number of header rows in the source file(s) if you don't want to import them. The default value is 0.
          • For Allow quoted newlines, check the box if you want to allow newlines within quoted fields.
          • For Allow jagged rows, check the box if you want to allow the transfer of rows with missing NULLABLE columns.

      Cloud Storage source details

    • (Optional) In the Notification options section:

      • Click the toggle to enable email notifications. When you enable this option, the transfer administrator receives an email notification when a transfer run fails.
      • For Select a Pub/Sub topic, choose your topic name or click Create a topic. This option configures Pub/Sub run notifications for your transfer.
  5. Click Save.

bq

Enter the bq mk command and supply the transfer creation flag, --transfer_config. The following flags are also required:

  • --data_source
  • --display_name
  • --target_dataset
  • --params

Optional flags:

  • --service_account_name - Specifies a service account to use for Cloud Storage transfer authentication instead of your user account.
bq mk \
--transfer_config \
--project_id=project_id \
--data_source=data_source \
--display_name=name \
--target_dataset=dataset \
--params='parameters' \
--service_account_name=service_account_name

Where:

  • project_id is your project ID. If --project_id isn't supplied to specify a particular project, the default project is used.
  • data_source is the data source — google_cloud_storage.
  • name is the display name for the transfer configuration. The transfer name can be any value that allows you to easily identify the transfer if you need to modify it later.
  • dataset is the target dataset for the transfer configuration.
  • parameters contains the parameters for the created transfer configuration in JSON format. For example: --params='{"param":"param_value"}'.
    • For Cloud Storage, you must supply the data_path_template, the destination_table_name_template and the file_format parameters. data_path_template is the Cloud Storage URI that contains your files to be transferred, which can include one wildcard. The destination_table_name_template is the name of your destination table. For file_format, indicate the type of files you wish to transfer: CSV, JSON, AVRO, PARQUET, or ORC. The default value is CSV.
    • For all file_format values, you can include the optional param max_bad_records. The default value is 0.
    • For all file_format values, you can include the optional param decimal_target_types. decimal_target_types is a comma-separated list of possible SQL data types that the source decimal values could be converted to. If this field is not provided, the data type will default to "NUMERIC,STRING" for ORC, and "NUMERIC" for the other file formats.
    • For the JSON or CSV values in file_format, you can include the optional param ignore_unknown_values. This param will be ignored if you haven't selected CSV or JSON for the file_format.
    • For CSV file_format, you can include the optional param field_delimiter for the character that separates fields. The default value is a comma. This param will be ignored if you haven't selected CSV for the file_format.
    • For CSV file_format, you can include the optional param quote for the character that is used to quote data sections in a CSV file. The default value is a double-quote ("). This param will be ignored if you haven't selected CSV for the file_format.
    • For CSV file_format, you can include the optional param skip_leading_rows to indicate header rows you don't want to import. The default value is 0. This param will be ignored if you haven't selected CSV for the file_format.
    • For CSV file_format, you can include the optional param allow_quoted_newlines if you want to allow newlines within quoted fields. This param will be ignored if you haven't selected CSV for the file_format.
    • For CSV file_format, you can include the optional param allow_jagged_rows if you want to accept rows that are missing trailing optional columns. The missing values will be filled in with NULLs. This param will be ignored if you haven't selected CSV for the file_format.
    • Optional param delete_source_files will delete the source files after each successful transfer. (Delete jobs do not retry if the first attempt to delete the source files fails.) The default value for delete_source_files is false.
  • service_account_name is the service account name used for authenticating your Cloud Storage transfer. The service account should be owned by the same project_id used for creating the transfer and it should have all the required permissions listed above.

For example, the following command creates a Cloud Storage transfer named My Transfer using a data_path_template value of gs://mybucket/myfile/*.csv, target dataset mydataset, and file_format CSV. This example includes non-default values for the optional params associated with the CSV file_format.

The transfer is created in the default project:

bq mk --transfer_config \
--target_dataset=mydataset \
--display_name='My Transfer' \
--params='{"data_path_template":"gs://mybucket/myfile/*.csv", "destination_table_name_template":"MyTable", "file_format":"CSV", "max_bad_records":"1", "ignore_unknown_values":"true", "field_delimiter":"|", "quote":";", "skip_leading_rows":"1", "allow_quoted_newlines":"true", "allow_jagged_rows":"false", "delete_source_files":"true"}' \
--data_source=google_cloud_storage

After running the command, you receive a message like the following:

[URL omitted] Please copy and paste the above URL into your web browser and follow the instructions to retrieve an authentication code.

Follow the instructions and paste the authentication code on the command line.
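After the configuration is created, you can confirm it by listing the transfer configurations for your location. The us value here is only an assumption; use the location of your dataset:

  bq ls --transfer_config --transfer_location=us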

API

Use the projects.locations.transferConfigs.create method and supply an instance of the TransferConfig resource.
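As a rough sketch of a raw REST call, the request might look like the following. The project ID, location, dataset, bucket, and table names are placeholders, and the params mirror the parameters described in the bq tab:

  curl -X POST \
      -H "Authorization: Bearer $(gcloud auth print-access-token)" \
      -H "Content-Type: application/json" \
      "https://bigquerydatatransfer.googleapis.com/v1/projects/my-project/locations/us/transferConfigs" \
      -d '{
        "displayName": "My Transfer",
        "dataSourceId": "google_cloud_storage",
        "destinationDatasetId": "mydataset",
        "schedule": "every 24 hours",
        "params": {
          "data_path_template": "gs://mybucket/myfile/*.csv",
          "destination_table_name_template": "MyTable",
          "file_format": "CSV"
        }
      }'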


Manually triggering a transfer

In addition to automatically scheduled transfers from Cloud Storage, you can manually trigger a transfer to load additional data files.

If the transfer configuration is runtime parameterized, you will need to specify a range of dates for which additional transfers will be started. A command-line sketch of scheduling such a backfill follows the console steps below.

To manually trigger a transfer:

Console

  1. Go to the BigQuery page in the Cloud Console.

    Go to the BigQuery page

  2. Click Data transfers.

  3. Click your transfer.

  4. Click RUN TRANSFER NOW or SCHEDULE BACKFILL (for runtime parameterized transfer configurations).

  5. If applicable, choose the Start date and End date, then click OK to confirm.

    RUN TRANSFER NOW

    For runtime parameterized transfer configurations, you will see date options when you click SCHEDULE BACKFILL.

    SCHEDULE BACKFILL
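As the sketch mentioned above, a backfill over a date range can also be scheduled from the command line with bq mk --transfer_run. The resource name and timestamps below are placeholders; use the full resource name of your own transfer configuration:

  bq mk --transfer_run \
      --start_time='2021-08-19T12:00:00Z' \
      --end_time='2021-08-25T12:00:00Z' \
      projects/my-project/locations/us/transferConfigs/12345678-90ab-cdef-1234-567890abcdef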

What's next

  • Learn about Using runtime parameters in Cloud Storage transfers
  • Learn more about the BigQuery Data Transfer Service