Experimental BQ support to run dbt models with `ExecutionMode.AIRFLOW_ASYNC` #1230
Remove print stmt Fix query Fix query Remove oss execute method code
This PR is the groundwork for the implementation of `ExecutionMode.AIRFLOW_ASYNC` (#1120), which, once all other epic tasks are completed, will enable asynchronous execution of dbt resources using Apache Airflow's deferrable operators. As part of this work, this PR introduces a new option to the enum `ExecutionMode`: `AIRFLOW_ASYNC`. When this execution mode is used, Cosmos now creates a setup task that will pre-compile the dbt project SQL and make it available to the remaining dbt tasks. This PR, however, does not yet leverage Airflow's deferrable operators. If users use `ExecutionMode.AIRFLOW_ASYNC` they will actually be running `ExecutionMode.LOCAL` operators with this change. The PR (#1230) has a first experimental version of using deferrable operators for task execution.

## Setup task as the groundwork for a new Execution Mode: `ExecutionMode.AIRFLOW_ASYNC`

- Adds a new operator, `DbtCompileAirflowAsyncOperator`, as a root task (analogous to a setup task) in the DAG. It runs the dbt compile command and uploads the compiled SQL files to a remote storage location. Subsequent tasks fetch these compiled SQL files from the remote storage and run them asynchronously using Airflow's deferrable operators.

## Airflow Configurations

- `remote_target_path`: Introduces a configurable path to store dbt-generated files remotely, supporting any storage scheme that works with Airflow's Object Store (e.g., S3, GCS, Azure Blob).
- `remote_target_path_conn_id`: Allows specifying a custom connection ID for the remote target path, defaulting to the scheme's associated Airflow connection if not set.

## Example DAG for CI Testing

Introduces an example DAG (`simple_dag_async.py`) demonstrating how to use the new execution mode. (As mentioned earlier, with this PR alone the execution still runs like `ExecutionMode.LOCAL` operators.) This DAG is integrated into the CI pipeline to run integration tests and aims at verifying the functionality of `ExecutionMode.AIRFLOW_ASYNC` as and when the implementation gets added, starting with the experimental implementation in #1230.

## Unit & Integration Tests

- Adds comprehensive unit and integration tests to ensure correct behavior.
- Tests include validation for successful uploads, error handling for misconfigured remote paths, and scenarios where `remote_target_path` is not set.

## Documentation

- Adds detailed documentation explaining how to configure and set the `ExecutionMode.AIRFLOW_ASYNC`.

## Scope & Limitations of the feature being introduced

1. This feature is meant to be released as Experimental and is also marked so in the documentation.
2. Currently, it has been scoped for only dbt models to be executed asynchronously (being worked upon in PR #1230), while other resource types would be run synchronously.
3. `BigQuery` will be the only supported target database for this execution mode (being worked upon in PR #1230).

Thus, this PR enhances Cosmos by providing the groundwork for more efficient execution of long-running dbt resources.

## Additional Notes

- This feature is planned to be introduced in Cosmos v1.7.0.

related: #1134
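To make the setup-task idea above a bit more concrete, here is a minimal sketch of how compiled SQL files could be mapped to destinations under `remote_target_path`. It is not the `DbtCompileAirflowAsyncOperator` implementation: `plan_uploads`, the directory layout, and the bucket name are hypothetical, and the actual upload (which would go through Airflow's Object Store) is left out.

```python
import tempfile
from pathlib import Path


def plan_uploads(compiled_dir: Path, remote_target_path: str) -> dict:
    """Map each compiled .sql file to a destination under remote_target_path.

    Hypothetical helper: it only computes destinations, it uploads nothing.
    """
    base = remote_target_path.rstrip("/")
    plan = {}
    for sql_file in sorted(compiled_dir.rglob("*.sql")):
        # Keep the project-relative layout so later tasks can locate their SQL.
        relative = sql_file.relative_to(compiled_dir).as_posix()
        plan[str(sql_file)] = f"{base}/{relative}"
    return plan


# Demo with a throwaway directory standing in for dbt's compiled output.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp) / "compiled"
    (root / "models" / "staging").mkdir(parents=True)
    (root / "models" / "customers.sql").write_text("select 1 as id")
    (root / "models" / "staging" / "stg_orders.sql").write_text("select 2 as id")
    (root / "manifest.json").write_text("{}")  # non-SQL files are skipped
    demo_plan = plan_uploads(root, "gs://hypothetical-bucket/target/")

for destination in sorted(demo_plan.values()):
    print(destination)
```

Keeping the relative layout under the remote prefix means a downstream task only needs the model's relative path and the configured `remote_target_path` to find its SQL.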
On the issue:
We identified that Cosmos was using its own |
tatiana
left a comment
Incredible work, @pankajastro @pankajkoti ! 🎉
To run dbt transformations without dbt while leveraging Airflow asynchronous processing is far from trivial - and this PR is a significant step in the right direction.
Codecov Report
Attention: Patch coverage is
Additional details and impacted files

| | main | #1230 | +/- |
|---|---|---|---|
| Coverage | 95.93% | 95.73% | -0.20% |
| Files | 67 | 67 | |
| Lines | 3885 | 3965 | +80 |
| Hits | 3727 | 3796 | +69 |
| Misses | 158 | 169 | +11 |
I haven't read through the code but this feature looks so awesome - it's something we've been talking about since day 1 so it's incredible to see it come to life!
Thank you so much, @pankajkoti and @tatiana, for taking this to completion. I truly appreciate your support and hard work 🙏
This work has been inspired by the talk "Airflow at Monzo: Evolving our data platform as the bank scales" by @jonathanrainer @ed-sparkes given at Airflow Summit 2023: https://airflowsummit.org/sessions/2023/airflow-at-monzo-evolving-our-data-platform-as-the-bank-scales/.
Enable BQ users to run dbt models (`full_refresh`) asynchronously. This releases the Airflow worker node from waiting while the transformation (I/O) happens in the data warehouse (some of which can take hours, according to customers), increasing the overall Airflow task throughput (more information: https://airflow.apache.org/docs/apache-airflow/stable/authoring-and-scheduling/deferring.html). As part of this change, we introduce the capability of not using the dbt command to run actual SQL transformations. This also avoids creating subprocesses in the worker node (`ExecutionMode.LOCAL` with `InvocationMode.SUBPROCESS` and `ExecutionMode.VIRTUALENV`) or the overhead of creating a Kubernetes Pod to execute the actual dbt command (`ExecutionMode.KUBERNETES`). This can avoid issues related to memory and CPU usage.

This PR takes advantage of an already implemented async operator in the Airflow repo by extending it in the Cosmos async operator. It also utilizes the pre-compiled SQL generated as part of the PR #1224. It downloads the generated SQL from a remote location (S3/GCS), which allows us to decouple from dbt during task execution.
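As a toy analogy for the throughput argument above (not Airflow code), the sketch below uses plain `asyncio` to show why waiting on I/O without blocking helps: three simulated warehouse jobs are awaited concurrently, so the total wall time is close to the longest job instead of the sum. The job names and durations are made up, and `asyncio.sleep` stands in for a long-running transformation.

```python
import asyncio
import time


async def run_deferred_job(name: str, seconds: float) -> str:
    """Stand-in for a warehouse job: awaiting yields control instead of blocking."""
    await asyncio.sleep(seconds)
    return name


async def run_all(jobs: dict) -> list:
    # gather() waits on all jobs at once and returns results in submission order.
    return await asyncio.gather(*(run_deferred_job(n, s) for n, s in jobs.items()))


jobs = {"customers": 0.3, "orders": 0.3, "payments": 0.3}  # hypothetical models
start = time.perf_counter()
finished = asyncio.run(run_all(jobs))
elapsed = time.perf_counter() - start

print(f"finished {finished} in {elapsed:.2f}s (sequential would be ~{sum(jobs.values()):.1f}s)")
```

A blocking worker would hold one slot per job for the full duration; here a single event loop waits on all of them.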
Details
- `get_profile_type` on `ProfileConfig`: This aids in database selection.
- Add `async_op_args`: A high-level parameter to forward arguments to the upstream operator (Airflow operator). (This may change in this PR itself.) The async operator params are processed as kwargs in the `operator_args` parameter.
- `DbtRunAirflowAsyncOperator`: This initializes the Airflow Operator, retrieves the SQL query at task runtime from a remote location, modifies the query as needed, and triggers the upstream execute method.
Limitations
- […] `ExecutionMode.LOCAL`)
- `full_refresh=True` in `operator_args` (which means tables will be dropped before being populated, as implemented in `dbt-core`)
- `ProfileMapping` in `ProfileConfig`, since Cosmos relies on having the connection (credentials) to be able to run the transformation in BQ without `dbt-core`
- `location` in `operator_args` (this is a limitation from the `BigQueryInsertJobOperator` that is being used to implement the native Airflow asynchronous support)
Testing
We have added a new dbt project to the repository to facilitate asynchronous task execution. The goal is to accelerate development without disrupting or requiring fixes for the existing tests. Also, we have added a DAG for end-to-end testing: https://github.com/astronomer/astronomer-cosmos/blob/bd6657a29b111510fc34b2baf0bcc0d65ec0e5b9/dev/dags/simple_dag_async.py
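For readers following the Details and Limitations sections, here is a rough sketch of the "retrieve SQL, modify the query, hand it to the upstream operator" step. It is not the `DbtRunAirflowAsyncOperator` code: `build_bigquery_job` is a hypothetical helper, and wrapping the compiled SQL in `CREATE OR REPLACE TABLE` is only an assumption standing in for the drop-and-repopulate behaviour of a full refresh. The `configuration` shape (`query.query`, `query.useLegacySql`) follows the BigQuery jobs API that `BigQueryInsertJobOperator` accepts.

```python
def build_bigquery_job(compiled_sql: str, project: str, dataset: str,
                       model: str, operator_args: dict) -> dict:
    """Build keyword arguments for a BigQuery insert job (hypothetical helper)."""
    # Mirror the limitations listed in this PR: both settings are required.
    if not operator_args.get("full_refresh"):
        raise ValueError("full_refresh=True must be set in operator_args")
    if not operator_args.get("location"):
        raise ValueError("location must be set in operator_args")

    table = f"`{project}.{dataset}.{model}`"
    select_sql = compiled_sql.strip().rstrip(";")
    # ASSUMPTION: a full refresh is expressed here as CREATE OR REPLACE TABLE.
    query = f"CREATE OR REPLACE TABLE {table} AS {select_sql}"
    return {
        "location": operator_args["location"],
        "configuration": {"query": {"query": query, "useLegacySql": False}},
    }


job = build_bigquery_job(
    "select 1 as id;\n",
    project="my-project",      # hypothetical project
    dataset="my_dataset",      # hypothetical dataset
    model="customers",
    operator_args={"full_refresh": True, "location": "US"},
)
print(job["configuration"]["query"]["query"])
```

Validating `operator_args` up front surfaces a missing `location` or `full_refresh` as a clear error before any job is submitted.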
Configuration
Users need to configure the param below to execute deferrable tasks in Cosmos
Example DAG: https://github.com/astronomer/astronomer-cosmos/blob/bd6657a29b111510fc34b2baf0bcc0d65ec0e5b9/dev/dags/simple_dag_async.py
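As a hedged illustration of the configuration step, the sketch below builds environment variables for the `remote_target_path` and `remote_target_path_conn_id` settings mentioned earlier in this conversation. The `AIRFLOW__{SECTION}__{KEY}` naming is Airflow's standard convention; that these settings live in a `cosmos` section is an assumption here, and `cosmos_async_env`, the scheme check, the bucket, and the connection ID are hypothetical.

```python
from typing import Dict, Optional
from urllib.parse import urlsplit


def cosmos_async_env(remote_target_path: str,
                     conn_id: Optional[str] = None) -> Dict[str, str]:
    """Return env vars for the remote target settings (hypothetical helper)."""
    # ASSUMPTION: the path should carry a storage scheme such as s3:// or gs://.
    if not urlsplit(remote_target_path).scheme:
        raise ValueError("remote_target_path needs a storage scheme, e.g. s3:// or gs://")
    # ASSUMPTION: the settings are read from a [cosmos] Airflow config section.
    env = {"AIRFLOW__COSMOS__REMOTE_TARGET_PATH": remote_target_path}
    if conn_id is not None:
        env["AIRFLOW__COSMOS__REMOTE_TARGET_PATH_CONN_ID"] = conn_id
    return env


demo_env = cosmos_async_env("gs://hypothetical-bucket/target", "hypothetical_gcp_conn")
for key, value in sorted(demo_env.items()):
    print(f"export {key}={value}")
```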
Installation
You can leverage async operator support by installing an additional dependency
Documentation
The PR also documents the limitations and uses of Airflow async execution in Cosmos.
Related Issue(s)
Related to: #1120
Closes: #1134
Breaking Change?
No
Notes
This is an experimental feature, and as such, it may undergo breaking changes. We encourage users to share their experiences and feedback to improve it further.
We'd love support and feedback so we can define the next steps.
Checklist
Credits
This was a result of teamwork and effort:
Co-authored-by: Pankaj Koti <pankajkoti699@gmail.com>
Co-authored-by: Tatiana Al-Chueyr <tatiana.alchueyr@gmail.com>
Future Work
`ExecutionMode.AIRFLOW_ASYNC` interface to incorporate additional database #1238