Experimental BQ support to run dbt models with ExecutionMode.AIRFLOW_ASYNC
#1230
Merged
51 commits
851564f Draft: dbt compile task (pankajkoti)
9dc2c9c Put compiled files under dag_id folder & refactor few snippets (pankajkoti)
0ce662e Add tests & minor refactorings (pankajkoti)
1b6f57e Apply suggestions from code review (pankajkoti)
cc48161 Install deps for the newly added example DAG (pankajkoti)
1068025 Add docs (pankajkoti)
faa706d Add async run operator (pankajkoti)
0e155e4 Fix remote sql path and async args (pankajastro)
5f1ecaa Fix query (pankajastro)
1278847 🎨 [pre-commit.ci] Auto format from pre-commit.com hooks (pre-commit-ci[bot])
b3d6cf3 Use dbt node's filepath to construct remote path to fetch compiled SQ… (pankajkoti)
78bc069 Merge branch 'main' into execute-async-task (tatiana)
9ca5e85 🎨 [pre-commit.ci] Auto format from pre-commit.com hooks (pre-commit-ci[bot])
99bf7c0 Fix unittests (tatiana)
3aaaf9e Improve code (tatiana)
43158be Working with deferrable=False, not working with deferrable=True (tatiana)
83b1010 Working with deferrable=False, not working with deferrable=True (tatiana)
bd6657a Fix issue when using BQ deferrable operator - it requires location (tatiana)
1195955 Add limitation in docs (pankajastro)
2bdd9bb Add full_refresh as templated field (pankajastro)
4a44603 Add more template fields (pankajastro)
c3c51cb Construct & relay 'dbt dag-task group' identifier to upload & downloa… (pankajkoti)
72c6164 Fix model_name retrieval; get from dbt_node_config (pankajkoti)
e67098e Fix unit tests (pankajkoti)
3e550bf Fix subsequent failing unit tests (pankajkoti)
0730d0f Fix type check failures (pankajkoti)
745768e Add back the deleted sources.yml from jaffle_shop as it has dependenc… (pankajkoti)
43d62ea Install dbt bigquery adapter for running simple_dag_async (pankajkoti)
9656248 Install dbt bigquery adapter in our CI setup scripts (pankajkoti)
a654f49 Update gcp conn in dev/dags/simple_dag_async.py (pankajkoti)
e60ace2 Refactor args in DbtRunAirflowAsyncOperator (tatiana)
7f055bc Use GoogleCloudServiceAccountDictProfileMapping in profilemapping (pankajkoti)
ad057c8 set should_upload_compiled_sql to True (pankajkoti)
a70ca46 Remove async_op_args (tatiana)
7c6a1b2 remove install_deps from DAG (pankajkoti)
64a31d0 Merge branch 'main' into execute-async-task (tatiana)
c1aeff0 Fix test_build_airflow_graph_with_dbt_compile_task by passing needed … (pankajkoti)
02f7985 Specify required project id in the GoogleCloudServiceAccountDictProfi… (pankajkoti)
af454a9 Pass gcp_conn_id to super class init, otherwise it is lost & uses the… (pankajkoti)
9081e6a Adapt manifest DAG to use & adapt to the newer GCP conn secret that i… (pankajkoti)
2dccf84 Release 1.7.0a1 (tatiana)
7adeb99 Retrigger GH actions (tatiana)
7e6de30 temporarily move out simple_dag_async.py (tatiana)
16a87ea Fix CI issue (tatiana)
05db6a0 Fix dbt-compile dependency by using Airflow tasks instead of dbt nodes (pankajkoti)
8fc4ae2 Apply suggestions from code review (pankajkoti)
ea5816b Apply suggestions from code review (pankajkoti)
85f86a4 Add install instruction (pankajastro)
402f823 Add min airflow version in limitation (pankajastro)
621a4de Ignore Async DAG for dbt <=1.5 (pankajastro)
a0cb147 Ignore Async DAG for dbt <=1.5 (pankajastro)
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -1,67 +1,190 @@ | ||
| from __future__ import annotations | ||
|
|
||
| import inspect | ||
| from pathlib import Path | ||
| from typing import TYPE_CHECKING, Any, Sequence | ||
|
|
||
| from airflow.providers.google.cloud.hooks.bigquery import BigQueryHook | ||
| from airflow.providers.google.cloud.operators.bigquery import BigQueryInsertJobOperator | ||
| from airflow.utils.context import Context | ||
|
|
||
| from cosmos import settings | ||
| from cosmos.config import ProfileConfig | ||
| from cosmos.exceptions import CosmosValueError | ||
| from cosmos.operators.base import AbstractDbtBaseOperator | ||
| from cosmos.operators.local import ( | ||
| DbtBuildLocalOperator, | ||
| DbtCompileLocalOperator, | ||
| DbtDocsAzureStorageLocalOperator, | ||
| DbtDocsGCSLocalOperator, | ||
| DbtDocsLocalOperator, | ||
| DbtDocsS3LocalOperator, | ||
| DbtLocalBaseOperator, | ||
| DbtLSLocalOperator, | ||
| DbtRunLocalOperator, | ||
| DbtRunOperationLocalOperator, | ||
| DbtSeedLocalOperator, | ||
| DbtSnapshotLocalOperator, | ||
| DbtSourceLocalOperator, | ||
| DbtTestLocalOperator, | ||
| ) | ||
| from cosmos.settings import remote_target_path, remote_target_path_conn_id | ||
|
|
||
| _SUPPORTED_DATABASES = ["bigquery"] | ||
|
|
||
| class DbtBuildAirflowAsyncOperator(DbtBuildLocalOperator): | ||
| pass | ||
| from abc import ABCMeta | ||
|
|
||
|
|
||
| class DbtLSAirflowAsyncOperator(DbtLSLocalOperator): | ||
| pass | ||
| from airflow.models.baseoperator import BaseOperator | ||
|
|
||
|
|
||
| class DbtSeedAirflowAsyncOperator(DbtSeedLocalOperator): | ||
| pass | ||
|
|
||
|
|
||
| class DbtSnapshotAirflowAsyncOperator(DbtSnapshotLocalOperator): | ||
| pass | ||
|
|
||
|
|
||
| class DbtSourceAirflowAsyncOperator(DbtSourceLocalOperator): | ||
| pass | ||
| class DbtBaseAirflowAsyncOperator(BaseOperator, metaclass=ABCMeta): | ||
| def __init__(self, **kwargs) -> None: # type: ignore | ||
| self.location = kwargs.pop("location") | ||
| self.configuration = kwargs.pop("configuration", {}) | ||
| super().__init__(**kwargs) | ||
|
|
||
|
|
||
| class DbtRunAirflowAsyncOperator(DbtRunLocalOperator): | ||
| class DbtBuildAirflowAsyncOperator(DbtBaseAirflowAsyncOperator, DbtBuildLocalOperator): # type: ignore | ||
| pass | ||
|
|
||
|
|
||
| class DbtTestAirflowAsyncOperator(DbtTestLocalOperator): | ||
| class DbtLSAirflowAsyncOperator(DbtBaseAirflowAsyncOperator, DbtLSLocalOperator): # type: ignore | ||
| pass | ||
|
|
||
|
|
||
| class DbtRunOperationAirflowAsyncOperator(DbtRunOperationLocalOperator): | ||
| class DbtSeedAirflowAsyncOperator(DbtBaseAirflowAsyncOperator, DbtSeedLocalOperator): # type: ignore | ||
| pass | ||
|
|
||
|
|
||
| class DbtDocsAirflowAsyncOperator(DbtDocsLocalOperator): | ||
| class DbtSnapshotAirflowAsyncOperator(DbtBaseAirflowAsyncOperator, DbtSnapshotLocalOperator): # type: ignore | ||
| pass | ||
|
|
||
|
|
||
| class DbtDocsS3AirflowAsyncOperator(DbtDocsS3LocalOperator): | ||
| class DbtSourceAirflowAsyncOperator(DbtBaseAirflowAsyncOperator, DbtSourceLocalOperator): # type: ignore | ||
| pass | ||
|
|
||
|
|
||
| class DbtDocsAzureStorageAirflowAsyncOperator(DbtDocsAzureStorageLocalOperator): | ||
| class DbtRunAirflowAsyncOperator(BigQueryInsertJobOperator): # type: ignore | ||
|
|
||
| template_fields: Sequence[str] = ( | ||
| "full_refresh", | ||
| "project_dir", | ||
| "gcp_project", | ||
| "dataset", | ||
| "location", | ||
| ) | ||
|
|
||
| def __init__( # type: ignore | ||
| self, | ||
| project_dir: str, | ||
| profile_config: ProfileConfig, | ||
| location: str, # This is a mandatory parameter when using BigQueryInsertJobOperator with deferrable=True | ||
| full_refresh: bool = False, | ||
| extra_context: dict[str, object] | None = None, | ||
| configuration: dict[str, object] | None = None, | ||
| **kwargs, | ||
| ) -> None: | ||
| # dbt task param | ||
| self.project_dir = project_dir | ||
| self.extra_context = extra_context or {} | ||
| self.full_refresh = full_refresh | ||
| self.profile_config = profile_config | ||
| if not self.profile_config or not self.profile_config.profile_mapping: | ||
| raise CosmosValueError("Cosmos async support is only available when using ProfileMapping") | ||
|
|
||
| self.profile_type: str = profile_config.get_profile_type() # type: ignore | ||
| if self.profile_type not in _SUPPORTED_DATABASES: | ||
| raise CosmosValueError(f"Async runs are only supported for: {_SUPPORTED_DATABASES}") | ||
|
|
||
| # airflow task param | ||
| self.location = location | ||
| self.configuration = configuration or {} | ||
| self.gcp_conn_id = self.profile_config.profile_mapping.conn_id # type: ignore | ||
| profile = self.profile_config.profile_mapping.profile | ||
| self.gcp_project = profile["project"] | ||
| self.dataset = profile["dataset"] | ||
|
|
||
| # Cosmos attempts to pass many kwargs that BigQueryInsertJobOperator simply does not accept. | ||
| # We need to pop them. | ||
| clean_kwargs = {} | ||
| non_async_args = set(inspect.signature(AbstractDbtBaseOperator.__init__).parameters.keys()) | ||
| non_async_args |= set(inspect.signature(DbtLocalBaseOperator.__init__).parameters.keys()) | ||
| non_async_args -= {"task_id"} | ||
|
|
||
| for arg_key, arg_value in kwargs.items(): | ||
| if arg_key not in non_async_args: | ||
| clean_kwargs[arg_key] = arg_value | ||
|
|
||
| # The following are the minimum required parameters to run BigQueryInsertJobOperator using the deferrable mode | ||
| super().__init__( | ||
| gcp_conn_id=self.gcp_conn_id, | ||
| configuration=self.configuration, | ||
| location=self.location, | ||
| deferrable=True, | ||
| **clean_kwargs, | ||
| ) | ||
|
|
||
| def get_remote_sql(self) -> str: | ||
| if not settings.AIRFLOW_IO_AVAILABLE: | ||
| raise CosmosValueError("Cosmos async support is only available in Airflow 2.8 or later.") | ||
| from airflow.io.path import ObjectStoragePath | ||
|
|
||
| file_path = self.extra_context["dbt_node_config"]["file_path"] # type: ignore | ||
| dbt_dag_task_group_identifier = self.extra_context["dbt_dag_task_group_identifier"] | ||
|
|
||
| remote_target_path_str = str(remote_target_path).rstrip("/") | ||
|
|
||
| if TYPE_CHECKING: | ||
| assert self.project_dir is not None | ||
|
|
||
| project_dir_parent = str(Path(self.project_dir).parent) | ||
| relative_file_path = str(file_path).replace(project_dir_parent, "").lstrip("/") | ||
| remote_model_path = f"{remote_target_path_str}/{dbt_dag_task_group_identifier}/compiled/{relative_file_path}" | ||
|
|
||
| object_storage_path = ObjectStoragePath(remote_model_path, conn_id=remote_target_path_conn_id) | ||
| with object_storage_path.open() as fp: # type: ignore | ||
| return fp.read() # type: ignore | ||
|
|
||
| def drop_table_sql(self) -> None: | ||
| model_name = self.extra_context["dbt_node_config"]["resource_name"] # type: ignore | ||
| sql = f"DROP TABLE IF EXISTS {self.gcp_project}.{self.dataset}.{model_name};" | ||
|
|
||
| hook = BigQueryHook( | ||
| gcp_conn_id=self.gcp_conn_id, | ||
| impersonation_chain=self.impersonation_chain, | ||
| ) | ||
| self.configuration = { | ||
| "query": { | ||
| "query": sql, | ||
| "useLegacySql": False, | ||
| } | ||
| } | ||
| hook.insert_job(configuration=self.configuration, location=self.location, project_id=self.gcp_project) | ||
|
|
||
| def execute(self, context: Context) -> Any | None: | ||
| if not self.full_refresh: | ||
| raise CosmosValueError("Async execution is only supported with full_refresh") | ||
| else: | ||
| # It may be surprising to some, but the dbt-core --full-refresh argument fully drops the table before populating it | ||
| # https://github.com/dbt-labs/dbt-core/blob/5e9f1b515f37dfe6cdae1ab1aa7d190b92490e24/core/dbt/context/base.py#L662-L666 | ||
| # https://docs.getdbt.com/reference/resource-configs/full_refresh#recommendation | ||
| # We're emulating this behaviour here | ||
| self.drop_table_sql() | ||
| sql = self.get_remote_sql() | ||
| model_name = self.extra_context["dbt_node_config"]["resource_name"] # type: ignore | ||
| # prefix explicit create command to create table | ||
| sql = f"CREATE TABLE {self.gcp_project}.{self.dataset}.{model_name} AS {sql}" | ||
| self.configuration = { | ||
| "query": { | ||
| "query": sql, | ||
| "useLegacySql": False, | ||
| } | ||
| } | ||
| return super().execute(context) | ||
|
|
||
|
|
||
| class DbtTestAirflowAsyncOperator(DbtBaseAirflowAsyncOperator, DbtTestLocalOperator): # type: ignore | ||
| pass | ||
|
|
||
|
|
||
| class DbtDocsGCSAirflowAsyncOperator(DbtDocsGCSLocalOperator): | ||
| class DbtRunOperationAirflowAsyncOperator(DbtBaseAirflowAsyncOperator, DbtRunOperationLocalOperator): # type: ignore | ||
| pass | ||
|
|
||
|
|
||
| class DbtCompileAirflowAsyncOperator(DbtCompileLocalOperator): | ||
| class DbtCompileAirflowAsyncOperator(DbtBaseAirflowAsyncOperator, DbtCompileLocalOperator): # type: ignore | ||
| pass | ||
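The kwargs clean-up in `DbtRunAirflowAsyncOperator.__init__` relies on `inspect.signature` to find out which argument names belong to the dbt operators, so they can be dropped before calling the parent constructor. A minimal, self-contained sketch of that idea follows. `BaseDbtOperator`, `TargetOperator` and `filter_kwargs` are hypothetical stand-ins written for illustration; they are not part of Cosmos or Airflow.

```python
import inspect


class BaseDbtOperator:
    """Hypothetical stand-in for an operator with dbt-specific arguments."""

    def __init__(self, task_id, project_dir=None, profile_config=None, vars=None):
        self.task_id = task_id


class TargetOperator:
    """Hypothetical stand-in for an operator that rejects dbt-specific arguments."""

    def __init__(self, task_id, retries=0):
        self.task_id = task_id
        self.retries = retries


def filter_kwargs(kwargs, *source_inits, keep=("task_id",)):
    """Drop every kwarg named in the given __init__ signatures, except those in `keep`."""
    blocked = set()
    for init in source_inits:
        # parameters is an ordered mapping of parameter name -> Parameter
        blocked |= set(inspect.signature(init).parameters.keys())
    blocked -= set(keep)
    return {key: value for key, value in kwargs.items() if key not in blocked}


incoming = {"task_id": "run_model", "project_dir": "/tmp/p", "vars": {}, "retries": 2}
clean = filter_kwargs(incoming, BaseDbtOperator.__init__)
op = TargetOperator(**clean)  # would raise TypeError if project_dir or vars were passed
```

One trade-off of this approach, as far as it can be judged from the sketch: any argument whose name happens to appear in a dbt operator signature is silently dropped, even if the target operator would have accepted it, which is why `task_id` has to be excluded explicitly.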
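The string handling in `get_remote_sql` can be easier to follow in isolation. The sketch below reproduces only the path construction, under the assumption that paths are POSIX-style; the function name `build_remote_model_path`, the bucket, the group identifier and the project location are all made-up values for illustration.

```python
from pathlib import PurePosixPath


def build_remote_model_path(remote_target_path, group_identifier, project_dir, file_path):
    """Mirror the path arithmetic used to locate a compiled SQL file remotely."""
    base = str(remote_target_path).rstrip("/")
    # Strip everything up to (but not including) the project folder name
    project_dir_parent = str(PurePosixPath(project_dir).parent)
    relative_file_path = str(file_path).replace(project_dir_parent, "").lstrip("/")
    return f"{base}/{group_identifier}/compiled/{relative_file_path}"


remote_path = build_remote_model_path(
    "gs://example-bucket/target/",  # hypothetical remote target path
    "my_dag__my_group",  # hypothetical DAG/task-group identifier
    "/usr/local/airflow/dbt/jaffle_shop",
    "/usr/local/airflow/dbt/jaffle_shop/models/customers.sql",
)
print(remote_path)
```

Because the parent of the project directory is removed rather than the project directory itself, the project folder name (`jaffle_shop` here) stays in the remote path.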