An application that uses Cloud Dataflow and Cloud Build to copy/transfer BigQuery tables between locations/regions.
This project is licensed under the Apache 2.0 license. Feel free to fork, change, or basically do whatever you want with this repo. PRs are more than welcome.
- Java (Cloud Dataflow & BigQuery/GCS APIs)
- Gradle (build)
- Cloud Build (CI/CD)
Unfortunately, there's no easy/native way to copy tables between locations/regions directly in BigQuery. For example, you can't just copy a table from the US to the EU without jumping through a few hoops.
The process is convoluted. For example, to copy a table from the EU to Sydney (see the CLI sketch after this list):
- Export BigQuery table to a bucket located in the EU
- Copy/sync the exported table from the EU bucket to another bucket located in Sydney
- Load the exported data into BigQuery from the Sydney bucket
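For comparison, here's roughly what that manual dance looks like using the `bq` and `gsutil` CLIs. The project, dataset, table, and bucket names are all placeholders:

```bash
# 1. Export the table to a GCS bucket in the same location as the dataset (EU)
bq extract --destination_format=AVRO \
  'my_project:eu_dataset.my_table' \
  'gs://my-eu-bucket/my_table/export-*.avro'

# 2. Copy the exported files to a bucket located in Sydney
gsutil -m cp 'gs://my-eu-bucket/my_table/export-*.avro' 'gs://my-syd-bucket/my_table/'

# 3. Load the files into a dataset located in Sydney
bq load --source_format=AVRO \
  'my_project:syd_dataset.my_table' \
  'gs://my-syd-bucket/my_table/export-*.avro'
```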
You can roll your own solution for this (e.g. bash + gcloud), or there are currently two patterns available that do it for you:
- Use Cloud Composer (Airflow)
- Use Cloud Dataflow (Beam)
This is the repo for the Cloud Dataflow option.
You basically configure some YAML with the name(s) of the table(s) that you want copied between regions, then invoke Cloud Build to build, deploy, and run the application, which handles everything for you. See `config.yaml` for details on how to configure the application. It will create the necessary GCS buckets for you, in the correct locations, and it can also create the BigQuery target dataset if you can't be bothered manually creating it beforehand.
You can configure the job to copy a list of individual tables from one region to another, or to copy an entire dataset between regions. Dataset copies can be performed in one large Dataflow job, or split into multiple smaller ones.
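To give a feel for the shape of the configuration, here's a minimal sketch. The field names below (`copies`, `sourceTable`, `targetTable`) are illustrative assumptions, not the real schema — only `detectSchema` is taken from this README — so check `config.yaml` in the repo for the actual structure:

```yaml
# Hypothetical sketch only -- see config.yaml in the repo for the real schema.
copies:
  - sourceTable: my_project:eu_dataset.my_table   # table in the source region (EU)
    targetTable: my_project:syd_dataset.my_table  # destination in the target region (Sydney)
    detectSchema: true                            # let the app detect the schema automatically
```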
1. Make sure all the relevant APIs are enabled on your GCP project. These include Cloud Dataflow, BigQuery, GCS, and Cloud Build.
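For example, using the gcloud CLI:

```bash
gcloud services enable \
  dataflow.googleapis.com \
  bigquery.googleapis.com \
  storage.googleapis.com \
  cloudbuild.googleapis.com
```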
2. Elevate permissions on the Cloud Build service account that was created for you by Google. It will look something like `<your_project_number>@cloudbuild.gserviceaccount.com`. You can give it only the required permissions for each service, or simply give it the `Project Editor` role if you're comfortable with that.
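For example, to grant it the broad `Project Editor` role (substitute your own project ID and number):

```bash
gcloud projects add-iam-policy-binding <your_project_id> \
  --member='serviceAccount:<your_project_number>@cloudbuild.gserviceaccount.com' \
  --role='roles/editor'
```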
3A. Clone the GitHub repo and make the necessary changes to `config.yaml`.
4A. Finally, run `gcloud builds submit --config=cloudbuild.yaml <path_to_repo>`.
3B. Clone the GitHub repo.
4B. Upload your config to GCS, e.g. `gs://bucket/containing/my/config.yaml`.
5B. Finally, run `gcloud builds submit --config=cloudbuild.yaml --substitutions=_APP_ARGS=--configPath=gs://bucket/containing/my/config.yaml`. Both flows are sketched end to end below.
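For reference, here are both flows as complete command sequences. The repo URL, directory, and bucket path are placeholders:

```bash
# Option A: config lives in the repo
git clone <url_of_this_repo> && cd <repo_dir>
# ...edit config.yaml as needed...
gcloud builds submit --config=cloudbuild.yaml .

# Option B: config lives in GCS
gsutil cp config.yaml gs://bucket/containing/my/config.yaml
gcloud builds submit --config=cloudbuild.yaml \
  --substitutions=_APP_ARGS=--configPath=gs://bucket/containing/my/config.yaml
```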
Complex schemas are not supported, e.g. nested records etc. If you have a complex schema, then create an empty table with that schema in the target dataset, set the `detectSchema` flag to `false` in the YAML config for the appropriate copy, and the application will skip trying to detect the schema.
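For example, you could create the empty target table from a JSON schema file with the `bq` CLI (names are placeholders), and then set `detectSchema` to `false` for that copy in the YAML:

```bash
# Create an empty table in the target dataset with an explicit schema
bq mk --table 'my_project:syd_dataset.my_complex_table' ./schema.json
```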
Sure. Email me at [email protected]