Nfiann-bigquery-cloud-config (#6336)
Showing 4 changed files with 127 additions and 3 deletions.
@@ -52,6 +52,123 @@ As an end user, if your organization has set up BigQuery OAuth, you can link a p

To learn how to optimize performance with data platform-specific configurations in dbt Cloud, refer to [BigQuery-specific configuration](/reference/resource-configs/bigquery-configs).

### Optional configurations

In BigQuery, optional configurations let you tailor settings for tasks such as query priority, dataset location, job timeout, and more. These options give you greater control over how BigQuery functions behind the scenes to meet your requirements.

To customize your optional configurations in dbt Cloud:
1. Click your name at the bottom of the left-hand sidebar in dbt Cloud
2. Select **Your profile** from the menu
3. From there, click **Projects** and select your BigQuery project
4. Go to **Development Connection** and select BigQuery
5. Click **Edit** and then scroll down to **Optional settings**
<Lightbox src="/img/bigquery/bigquery-optional-config.png" width="70%" title="BigQuery optional configuration"/>

The following are the optional configurations you can set in dbt Cloud:
| Configuration | <div style={{width:'250'}}>Information</div> | Type | <div style={{width:'150'}}>Example</div> |
|---------------------------|-----------------------------------------|---------|--------------------|
| [Priority](#priority) | Sets the priority for BigQuery jobs (either `interactive` or queued for `batch` processing) | String | `batch` or `interactive` |
| [Retries](#retries) | Specifies the number of retries for jobs that fail due to temporary issues | Integer | `3` |
| [Location](#location) | Location for creating new datasets | String | `US`, `EU`, `us-west2` |
| [Maximum bytes billed](#maximum-bytes-billed) | Limits the maximum number of bytes that can be billed for a query | Integer | `1000000000` |
| [Execution project](#execution-project) | Specifies the project ID to bill for query execution | String | `my-project-id` |
| [Impersonate service account](#impersonate-service-account) | Allows users authenticated locally to access BigQuery resources under a specified service account | String | `[email protected]` |
| [Job retry deadline seconds](#job-retry-deadline-seconds) | Sets the total number of seconds BigQuery will attempt to retry a job if it fails | Integer | `600` |
| [Job creation timeout seconds](#job-creation-timeout-seconds) | Specifies the maximum timeout for the job creation step | Integer | `120` |
| [Google Cloud Storage bucket](#google-cloud-storage-bucket) | Location for storing objects in Google Cloud Storage | String | `my-bucket` |
| [Dataproc region](#dataproc-region) | Specifies the cloud region for running data processing jobs | String | `US`, `EU`, `asia-northeast1` |
| [Dataproc cluster name](#dataproc-cluster-name) | Assigns a unique identifier to a group of virtual machines in Dataproc | String | `my-cluster` |
<Expandable alt_header="Priority"> | ||
|
||
The `priority` for the BigQuery jobs that dbt executes can be configured with the `priority` configuration in your BigQuery profile. The priority field can be set to one of `batch` or `interactive`. For more information on query priority, consult the [BigQuery documentation](https://cloud.google.com/bigquery/docs/running-queries). | ||
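
If you manage the same setting in code, for example through dbt Cloud extended attributes or a local `profiles.yml`, a minimal sketch might look like the following (placeholder project and dataset names, assuming the standard dbt-bigquery profile fields):

```yaml
# Hypothetical profiles.yml target -- only the keys relevant to this section are shown
my_bigquery_target:
  type: bigquery
  method: oauth
  project: my-project-id   # placeholder GCP project
  dataset: analytics       # placeholder dataset
  priority: batch          # queue jobs for batch processing; use `interactive` to run them immediately
```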

</Expandable>

<Expandable alt_header="Retries">

Retries in BigQuery help to ensure that jobs complete successfully by trying again after temporary failures, making your operations more robust and reliable.
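
As a sketch, the same option is a single key in the profile or extended attributes (the value shown is only an example):

```yaml
retries: 3   # retry a job up to 3 times after transient failures before reporting an error
```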

</Expandable>

<Expandable alt_header="Location">

The `location` of BigQuery datasets can be set using the `location` setting in a BigQuery profile. As per the [BigQuery documentation](https://cloud.google.com/bigquery/docs/locations), `location` may be either a multi-regional location (for example, `EU`, `US`), or a regional location (like `us-west2`).
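
A minimal sketch of the corresponding profile or extended-attributes key, using the example values above:

```yaml
location: EU            # a multi-regional location
# location: us-west2    # ...or a single region instead
```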

</Expandable>

<Expandable alt_header="Maximum bytes billed">

Configuring a `maximum_bytes_billed` value for a BigQuery profile lets you limit how much data your query can process. It's a safeguard to prevent your query from accidentally processing more data than you expect, which could lead to higher costs. Queries executed by dbt will fail if they exceed the configured maximum bytes threshold. This configuration should be supplied as an integer number of bytes.

If your `maximum_bytes_billed` is 1000000000, you would enter that value in the `maximum_bytes_billed` field in dbt Cloud.
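
Expressed as a profile or extended-attributes key, that same limit would look roughly like this:

```yaml
maximum_bytes_billed: 1000000000   # fail any query that would bill more than ~1 GB
```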

</Expandable>

<Expandable alt_header="Execution project">

By default, dbt will use the specified `project`/`database` as both:

1. The location to materialize resources (models, seeds, snapshots, and so on), unless they specify a custom project/database config
2. The GCP project that receives the bill for query costs or slot usage

Optionally, you may specify an execution project to bill for query execution, instead of the project/database where you materialize most resources.
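
As a sketch with hypothetical project IDs, the split between the storage project and the billing project might look like this in a profile or extended attributes:

```yaml
project: my-data-project                # where dbt materializes models, seeds, and snapshots
execution_project: my-billing-project   # GCP project billed for query execution and slot usage
```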

</Expandable>

<Expandable alt_header="Impersonate service account">

This feature allows users who authenticate with local OAuth to access BigQuery resources based on the permissions of a service account.

For a general overview of this process, see the official docs for [Creating Short-lived Service Account Credentials](https://cloud.google.com/iam/docs/create-short-lived-credentials-direct).
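
A minimal sketch, assuming the dbt-bigquery `impersonate_service_account` field and a hypothetical service account email:

```yaml
method: oauth   # authenticate with your local OAuth credentials
impersonate_service_account: dbt-runner@my-project-id.iam.gserviceaccount.com   # hypothetical service account; queries run with its permissions
```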

</Expandable>

<Expandable alt_header="Job retry deadline seconds">

Job retry deadline seconds is the maximum amount of time BigQuery will spend retrying a job before it gives up.
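
As a profile or extended-attributes key, using the example value from the table above:

```yaml
job_retry_deadline_seconds: 600   # stop retrying a failing job after 10 minutes in total
```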

</Expandable>

<Expandable alt_header="Job creation timeout seconds">

Job creation timeout seconds is the maximum time BigQuery will wait to start the job. If the job doesn't start within that time, it times out.
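
And the corresponding key, again with the example value from the table above:

```yaml
job_creation_timeout_seconds: 120   # give up if the job hasn't been created within 2 minutes
```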

</Expandable>
#### Run dbt Python models on Google Cloud Platform

import BigQueryDataproc from '/snippets/_bigquery-dataproc.md';

<BigQueryDataproc />
<Expandable alt_header="Google cloud storage bucket"> | ||
|
||
Everything you store in Cloud Storage must be placed inside a [bucket](https://cloud.google.com/storage/docs/buckets). Buckets help you organize your data and manage access to it. | ||
|
||
</Expandable> | ||
|
||
<Expandable alt_header="Dataproc region"> | ||
|
||
A designated location in the cloud where you can run your data processing jobs efficiently. This region must match the location of your BigQuery dataset if you want to use Dataproc with BigQuery to ensure data doesn't move across regions, which can be inefficient and costly. | ||
|
||
For more information on [Dataproc regions](https://cloud.google.com/bigquery/docs/locations), refer to the BigQuery documentation. | ||
|
||
</Expandable> | ||

<Expandable alt_header="Dataproc cluster name">

The Dataproc cluster name is a unique label you give to your group of virtual machines, which helps you identify and manage your data processing tasks in the cloud. When you integrate Dataproc with BigQuery, you need to provide the cluster name so BigQuery knows which specific set of resources (the cluster) to use for running the data jobs.

See [Dataproc's documentation on creating a cluster](https://cloud.google.com/dataproc/docs/guides/create-cluster) for an overview of how clusters work.

</Expandable>
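
Taken together, a sketch of how the three Python-model settings above might appear as profile or extended-attributes keys (all values are placeholders, assuming the standard dbt-bigquery field names):

```yaml
gcs_bucket: my-bucket               # Cloud Storage bucket used to stage Python model code and data
dataproc_region: us-central1        # region for Dataproc jobs; keep it aligned with your BigQuery dataset location
dataproc_cluster_name: my-cluster   # existing Dataproc cluster that runs the Python models
```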

### Account level connections and credential management

You can re-use connections across multiple projects with [global connections](/docs/cloud/connect-data-platform/about-connections#migration-from-project-level-connections-to-account-level-connections). Connections are attached at the environment level (formerly project level), so you can use multiple connections within a single project (to handle dev, staging, production, etc.).
@@ -147,3 +264,7 @@ For a project, you will first create an environment variable to store the secret
"extended_attributes_id": FFFFF
}'
```
@@ -0,0 +1,3 @@
To run dbt Python models on GCP, dbt uses companion services, Dataproc and Cloud Storage, that offer tight integrations with BigQuery. You may use an existing Dataproc cluster and Cloud Storage bucket, or create new ones:
- https://cloud.google.com/dataproc/docs/guides/create-cluster
- https://cloud.google.com/storage/docs/creating-buckets