Skip to content

Refactor AIRFLOW_ASYNC so that the path in the remote object store is specific per DAG run#1741

Merged
pankajkoti merged 29 commits into
astronomer:mainfrom
tuantran0910:remote-storage-path-specific-per-dag-run
May 21, 2025
Merged

Refactor AIRFLOW_ASYNC so that the path in the remote object store is specific per DAG run#1741
pankajkoti merged 29 commits into
astronomer:mainfrom
tuantran0910:remote-storage-path-specific-per-dag-run

Conversation

@tuantran0910
Copy link
Copy Markdown
Contributor

Description

Refactor AIRFLOW_ASYNC so that the path in the remote object store is specific per DAG run. The format of remote model path will be:

# test_cosmos/simple_dag_async/run/jaffle_shop/models/example/my_first_dbt_model.sql
remote_model_path = f"{remote_target_path_str}/{dbt_dag_task_group_identifier}/{run_id}/run/{relative_file_path}"

Related Issue(s)

Closes #1613

Copilot AI review requested due to automatic review settings May 4, 2025 17:40
@dosubot dosubot Bot added the size:L This PR changes 100-499 lines, ignoring generated files. label May 4, 2025
@dosubot dosubot Bot added area:execution Related to the execution environment/mode, like Docker, Kubernetes, Local, VirtualEnv, etc area:testing Related to testing, like unit tests, integration tests, etc labels May 4, 2025
@netlify
Copy link
Copy Markdown

netlify Bot commented May 4, 2025

Deploy Preview for sunny-pastelito-5ecb04 canceled.

Name Link
🔨 Latest commit 4553787
🔍 Latest deploy log https://app.netlify.com/sites/sunny-pastelito-5ecb04/deploys/6817a68b4fad2600085abb3e

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR refactors the AIRFLOW_ASYNC remote model path generation to include the run_id, ensuring that paths in the remote object store are specific per DAG run.

  • Updates file path construction in local and asynchronous operators to incorporate run_id.
  • Adds tests to verify that the run_id is correctly embedded in the constructed paths.
  • Adjusts async context initialization in asynchronous operators to include run_id.

Reviewed Changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 2 comments.

File Description
tests/test_async_remote_target_dir_specific_run_id.py New tests validate that run_id is correctly used in file path building.
cosmos/operators/local.py Updates _construct_dest_file_path to inject run_id into the path.
cosmos/operators/_asynchronous/bigquery.py Modifies remote SQL path construction to include run_id.
cosmos/operators/_asynchronous/init.py Updates async_context creation to include run_id for async operators.

Comment thread cosmos/operators/local.py
Comment thread cosmos/operators/_asynchronous/__init__.py
@netlify
Copy link
Copy Markdown

netlify Bot commented May 4, 2025

Deploy Preview for sunny-pastelito-5ecb04 canceled.

Name Link
🔨 Latest commit 25367d2
🔍 Latest deploy log https://app.netlify.com/projects/sunny-pastelito-5ecb04/deploys/682cbae3f14ccb000847ca0f

Copy link
Copy Markdown
Contributor

@pankajastro pankajastro left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks a lot for the contribution, @tuantran0910! I've triggered the CI for the integration tests.

@codecov
Copy link
Copy Markdown

codecov Bot commented May 5, 2025

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 97.72%. Comparing base (15a8d91) to head (25367d2).
Report is 3 commits behind head on main.

Additional details and impacted files
@@           Coverage Diff           @@
##             main    #1741   +/-   ##
=======================================
  Coverage   97.71%   97.72%           
=======================================
  Files          84       84           
  Lines        5252     5264   +12     
=======================================
+ Hits         5132     5144   +12     
  Misses        120      120           

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@pankajastro
Copy link
Copy Markdown
Contributor

Currently, the teardown task deletes SQL files through multiple calls, and the goal was to reduce the risk of deleting unexpected files when the task runs in parallel. Now that we're uploading based on the run-id, would it be worthwhile to delete the entire run-id folder directly, rather than iterating over each file? What are your thoughts?

def _delete_sql_files(self, tmp_project_dir: Path, resource_type: str) -> None:

This may also apply to the setup task.

@tuantran0910
Copy link
Copy Markdown
Contributor Author

Currently, the teardown task deletes SQL files through multiple calls, and the goal was to reduce the risk of deleting unexpected files when the task runs in parallel. Now that we're uploading based on the run-id, would it be worthwhile to delete the entire run-id folder directly, rather than iterating over each file? What are your thoughts?

def _delete_sql_files(self, tmp_project_dir: Path, resource_type: str) -> None:

This may also apply to the setup task.

Hmm, I think it's a good idea to delete the entire folder run-id for optimizing the workflow. I will implement this later.

@tuantran0910
Copy link
Copy Markdown
Contributor Author

Hi @pankajastro, can you checkout the logic to delete run-id specific directory at my new commit 2bf7258. Thanks a lot :D

@tuantran0910 tuantran0910 requested a review from pankajastro May 5, 2025 16:09
Copy link
Copy Markdown
Collaborator

@tatiana tatiana left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks great, @tuantran0910 ! Thank you very much for improving this part of Cosmos and making it more reliable.

I've given some feedback in-line. Also, it seems there is an integration test failing, could you take a look please.

Comment thread cosmos/operators/local.py Outdated
Comment thread tests/test_async_remote_target_dir_specific_run_id.py Outdated
@tuantran0910
Copy link
Copy Markdown
Contributor Author

Sorry @pankajastro, can you enable the integration tests and approve this PR again ? I have just pushed a new commit to ensure that all the changes are coverage.

Comment thread cosmos/operators/local.py
Copy link
Copy Markdown
Collaborator

@tatiana tatiana left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@tuantran0910 Thank you for all the patience on getting this to work. There is a minor request regarding code coverage - if you could address this as part of this PR, it would be great. If you can't do this before the release of 1.10.1, please, could you do a follow-up PR?

@dosubot dosubot Bot added the lgtm This PR has been approved by a maintainer label May 20, 2025
@tuantran0910
Copy link
Copy Markdown
Contributor Author

@tuantran0910 Thank you for all the patience on getting this to work. There is a minor request regarding code coverage - if you could address this as part of this PR, it would be great. If you can't do this before the release of 1.10.1, please, could you do a follow-up PR?

Hey @tatiana, I will try to push a quick commit to fix the coverage within an hour. Thank you for pointing that.

@tuantran0910
Copy link
Copy Markdown
Contributor Author

tuantran0910 commented May 20, 2025

Hi @tatiana, I have added the test in fe92ef0. Thank you so much :D

Copy link
Copy Markdown
Contributor

@pankajkoti pankajkoti left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for improvising this @tuantran0910 . This is certainly a very valuable contribution 👏🏽

I tested this with an example async DAG running in Airflow UI and it works smoothly as intended for. We will include this fix in the upcoming release 1.10.1.

@pankajkoti pankajkoti merged commit 304e426 into astronomer:main May 21, 2025
92 checks passed
pankajkoti pushed a commit that referenced this pull request May 21, 2025
…is specific per DAG run (#1741)

Refactor `AIRFLOW_ASYNC` so that the path in the remote object store is
specific per DAG run. The format of remote model path will be:

```python
# test_cosmos/simple_dag_async/run/jaffle_shop/models/example/my_first_dbt_model.sql
remote_model_path = f"{remote_target_path_str}/{dbt_dag_task_group_identifier}/{run_id}/run/{relative_file_path}"
```

Closes #1613

(cherry picked from commit 304e426)
pankajkoti added a commit that referenced this pull request May 21, 2025
Bug Fixes

* Fix ``full_refresh`` parameter in ``AIRFLOW_ASYNC``
``ExecutionConfig`` mode by @tuantran0910 in #1738
* Fix dbt ls invocation method log message by @tatiana and @dstandish in
#1749
* Ensure remote target directory is created when copying files when
using local directory by @tuantran0910 and @corsettigyg in #1740
* Support custom ``packages-install-path`` by @tatiana in #1768
* Disable dbt static parser during Airflow task execution using dbt
runner by @pankajkoti and @tatiana in #1760
* Fix ``ExecutionMode.LOCAL`` to leverage
``ProjectConfig.manifest_path`` by @tatiana in #1772
* Refactor ``AIRFLOW_ASYNC`` so that the path in the remote object store
is specific per DAG run by @tuantran0910 in #1741
* Optimise memory usage with optional explicit imports by @pankajkoti
and @tatiana in #1769

Documentation

* Fix documentation rendering for ``use_dataset_airflow3_uri_standard``
by @pankajastro in #1742
* Correct custom callback example by @walter9388 in #1747

Others

* Re-enable integration tests durations to troubleshoot performance
degradation by @tatiana in #1735
* Run listener tests for Airflow 3 by @pankajastro in #1743
* Add Airflow 3 db files to ignore from git tracking by @pankajkoti in
#1755
* Log contents of ``packages.yml`` when
``AIRFLOW__LOGGING__LOGGING_LEVEL=DEBUG`` by @tatiana in #1764
* Fix Airflow dependencies in the CI by @tatiana in #1773
* Pre-commit updates: #1744, #1765, #1770
pankajkoti added a commit that referenced this pull request May 21, 2025
Bug Fixes

* Fix ``full_refresh`` parameter in ``AIRFLOW_ASYNC``
``ExecutionConfig`` mode by @tuantran0910 in #1738
* Fix dbt ls invocation method log message by @tatiana and @dstandish in
#1749
* Ensure remote target directory is created when copying files when
using local directory by @tuantran0910 and @corsettigyg in #1740
* Support custom ``packages-install-path`` by @tatiana in #1768
* Disable dbt static parser during Airflow task execution using dbt
runner by @pankajkoti and @tatiana in #1760
* Fix ``ExecutionMode.LOCAL`` to leverage
``ProjectConfig.manifest_path`` by @tatiana in #1772
* Refactor ``AIRFLOW_ASYNC`` so that the path in the remote object store
is specific per DAG run by @tuantran0910 in #1741
* Optimise memory usage with optional explicit imports by @pankajkoti
and @tatiana in #1769

Documentation

* Fix documentation rendering for ``use_dataset_airflow3_uri_standard``
by @pankajastro in #1742
* Correct custom callback example by @walter9388 in #1747

Others

* Re-enable integration tests durations to troubleshoot performance
degradation by @tatiana in #1735
* Run listener tests for Airflow 3 by @pankajastro in #1743
* Add Airflow 3 db files to ignore from git tracking by @pankajkoti in
#1755
* Log contents of ``packages.yml`` when
``AIRFLOW__LOGGING__LOGGING_LEVEL=DEBUG`` by @tatiana in #1764
* Fix Airflow dependencies in the CI by @tatiana in #1773
* Pre-commit updates: #1744, #1765, #1770


---------

(cherry picked from commit 430be00)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

area:execution Related to the execution environment/mode, like Docker, Kubernetes, Local, VirtualEnv, etc area:testing Related to testing, like unit tests, integration tests, etc lgtm This PR has been approved by a maintainer size:L This PR changes 100-499 lines, ignoring generated files.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

(Bug) Refactor AIRFLOW_ASYNC so that the path in the remote object store is specific per DAG run

5 participants