Skip to content

Refactor dbt ls to run from a temporary directory#414

Merged
tatiana merged 8 commits into
mainfrom
issue-411-dbt-ls-tmp-dir
Aug 14, 2023
Merged

Refactor dbt ls to run from a temporary directory#414
tatiana merged 8 commits into
mainfrom
issue-411-dbt-ls-tmp-dir

Conversation

@tatiana
Copy link
Copy Markdown
Collaborator

@tatiana tatiana commented Jul 28, 2023

As of Cosmos 1.0.0, LoadMode.DBT_LS runs dbt ls from within the original dbt project directory.

The dbt ls outputs files to the directory it's running from unless the environment variables DBT_LOG_PATH and DBT_TARGET_PATH are specified (as of dbt 1.6).

Depending on the deployment, the Airflow worker does not have write permissions to the dbt project directory. This can lead to an error message similar to the following:

20:43:06  Encountered an error:
[Errno 30] Read-only file system: '/usr/local/airflow/dags/dbt_grindr/logs/dbt.log'

This PR changes the behavior of dbt ls to try to make the dbt ls artifacts (logs and target directory) not be written to the original project directory.

In addition to the introduced test, this change was validated using airflow 2.6 and dbt 1.6, by following these steps:
(1) Delete folders logs and target from astronomer-cosmos/dev/dags/dbt/jaffle_shop
(2) Add a breakpoint after stdout, stderr = process.communicate() in dbt/graph.py
(3) Run a DAG that uses astronomer-cosmos/dev/dags/dbt/jaffle_shop, e.g.:

airflow dags test basic_cosmos_dag `date -Iseconds`

(4) When the breakpoint happens, check that no target or logs folder was created after running dbt ls in astronomer-cosmos/dev/dags/dbt/jaffle_shop

A limitation with the current approach is that, although dbt ls is not creating these directories in the given circumstances, if the user is using the local executor or an earlier version of dbt, the files will still be written to the project directory.

Closes: #411

@netlify
Copy link
Copy Markdown

netlify Bot commented Jul 28, 2023

👷 Deploy Preview for amazing-pothos-a3bca0 processing.

Name Link
🔨 Latest commit 7bd4e30
🔍 Latest deploy log https://app.netlify.com/sites/amazing-pothos-a3bca0/deploys/64d6136543a5b3000778ba9e

@tatiana tatiana force-pushed the issue-411-dbt-ls-tmp-dir branch from 2353167 to 5821a9c Compare July 28, 2023 12:00
@codecov
Copy link
Copy Markdown

codecov Bot commented Jul 28, 2023

Codecov Report

Patch coverage: 100.00% and project coverage change: +0.07% 🎉

Comparison is base (1233f9c) 91.43% compared to head (7bd4e30) 91.51%.

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #414      +/-   ##
==========================================
+ Coverage   91.43%   91.51%   +0.07%     
==========================================
  Files          50       50              
  Lines        1752     1768      +16     
==========================================
+ Hits         1602     1618      +16     
  Misses        150      150              
Files Changed Coverage Δ
cosmos/constants.py 100.00% <100.00%> (ø)
cosmos/dbt/graph.py 100.00% <100.00%> (ø)

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@tatiana tatiana marked this pull request as ready for review July 28, 2023 15:58
@tatiana tatiana requested a review from a team as a code owner July 28, 2023 15:58
@tatiana tatiana requested a review from a team July 28, 2023 15:58
Comment thread tests/dbt/test_graph.py
Comment thread cosmos/dbt/graph.py Outdated
Comment thread cosmos/dbt/graph.py
As of Cosmos 1.0.0, `LoadMode.DBT_LS` ran `dbt ls` from within the original dbt project directory.

The `dbt ls` outputs files to the directory it's running from unless the environment variables `DBT_LOG_PATH` and `DBT_TARGET_PATH` are specified.

Depending on the deployment, the Airflow worker does not have write permissions to the dbt project directory. This PR changes the behavior of `dbt ls` to make a copy of the original project directory into a temporary directory and run the command `dbt ls` from there.

Closes: #411
…a separate dir

Unfortunately this does not work in dbt 1.5 or previous versions
@tatiana tatiana force-pushed the issue-411-dbt-ls-tmp-dir branch from e2ecb39 to 4360fd1 Compare August 10, 2023 15:11
@tatiana tatiana requested a review from jlaneve August 10, 2023 15:20
@tatiana
Copy link
Copy Markdown
Collaborator Author

tatiana commented Aug 10, 2023

Thanks, @jlaneve , I think I addressed all the feedback!

Comment thread .pre-commit-config.yaml
@tatiana
Copy link
Copy Markdown
Collaborator Author

tatiana commented Aug 10, 2023

A different approach we could adopt to solve this issue is not to pass project-dir to dbt-ls and to create symbolic links for the original project-dir files to the temp-dir. This way, the target & logs will be written to the temp dir.

Still, this change will not solve the limitation that ATM local operators run things locally. @jlaneve @harels is the scope of this ticket only dbt ls or all local operator executions?

Comment thread cosmos/dbt/graph.py
@tatiana tatiana merged commit 41bf623 into main Aug 14, 2023
@tatiana tatiana deleted the issue-411-dbt-ls-tmp-dir branch August 14, 2023 09:09
tatiana added a commit that referenced this pull request Aug 16, 2023
Feature (pending documentation!)
* Support dbt global flags (via dbt_cmd_global_flags in `operator_args` by @tatiana in #469

Enhancements
* Hide sensitive field when using BigQuery keyfile_dict profile mapping by @jbandoro in #471

Bug fixes
* Fix bug on select node add exclude selector subset ids logic by @jensenity in #463
* Refactor dbt ls to run from a temporary directory, to avoid Read-only file system errors during DAG parsing, by @tatiana in #414

Others
* Docs: Fix RenderConfig load argument by @jbandoro in #466
* Enable CI integration tests from external forks by @tatiana in #458
* Improve CI tests runtime by @tatiana in #457
* Change CI to run coverage after tests pass by @tatiana in #461
* Fix forks code revision in code coverage by @tatiana in #472
* [pre-commit.ci] pre-commit autoupdate by @pre-commit-ci in #467"
i
@tatiana tatiana mentioned this pull request Aug 16, 2023
tatiana added a commit that referenced this pull request Aug 29, 2023
Since Cosmos 1.0, `load_method.DBT_LS` is the default dbt project
parsing method, unless the user gives a manifest.

Using the original dbt project path has been a source of issues when
that path is Read-Only. This issue was faced when running commands that
generate `{project-dir}/target/` and `{project-dir}/logs/`, which was
solved as part of #414.

This issue is particularly problematic if we want to run `dbt deps` from
the original project directory since dbt 1.6 saves adaptors to
`{project_dir}/dbt_packages` unless specified in the user's
`dbt_project.yml`. To our knowledge, dbt currently does not allow users
to define this directory via flags or environment variables, as
discussed in #481.

This change aims to solve these issues, by creating a temporary
directory and creating symbolic links to the original directory.

Finally, during the development of this task, it was observed that when
running dbt ls in a project with `packages.yml`, dbt raises a
'Compilation Error'. Since dbt may raise other errors in stdout, this PR
captures "Errors" more generically - making it more evident potential
issues to the end-users.
@tatiana tatiana mentioned this pull request Sep 6, 2023
tatiana added a commit that referenced this pull request Sep 6, 2023
**Features**

* Support dbt global flags (via dbt_cmd_global_flags in operator_args)
by @tatiana in #469
* Support parsing DAGs when there are no connections by @jlaneve in #489

**Enhancements**

* Hide sensitive field when using BigQuery keyfile_dict profile mapping
by @jbandoro in #471
* Consistent Airflow Dataset URIs, inlets and outlets with `Openlineage
package <https://pypi.org/project/openlineage-integration-common/>`_ by
@tatiana in #485. `Read more
<https://astronomer.github.io/astronomer-cosmos/configuration/lineage.html>`_.
* Refactor ``LoadMethod.DBT_LS`` to run from a temporary directory with
symbolic links by @tatiana in #488
* Run ``dbt deps`` when using ``LoadMethod.DBT_LS`` by @DanMawdsleyBA in
#481
* Update Cosmos log color to purple by @harels in #494
* Change operators to log ``dbt`` commands output as opposed to
recording to XCom by @tatiana in #513

**Bug fixes**

* Fix bug on select node add exclude selector subset ids logic by
@jensenity in #463
* Refactor dbt ls to run from a temporary directory, to avoid Read-only
file system errors during DAG parsing, by @tatiana in #414
* Fix profile_config arg in DbtKubernetesBaseOperator by @david-mag in
#505
* Fix SnowflakePrivateKeyPemProfileMapping private_key reference by
@nacpacheco in #501
* Fix incorrect temporary directory creation in VirtualenvOperator init
by @tatiana in #500
* Fix log propagation issue by @tatiana in #498
* Fix PostgresUserPasswordProfileMapping to retrieve port from
connection by @jlneve in #511

**Others**

* Docs: Fix RenderConfig load argument by @jbandoro in #466
* Enable CI integration tests from external forks by @tatiana in #458
* Improve CI tests runtime by @tatiana in #457
* Change CI to run coverage after tests pass by @tatiana in #461
* Fix forks code revision in code coverage by @tatiana in #472
* [pre-commit.ci] pre-commit autoupdate by @pre-commit-ci in #467
* Drop support to Python 3.7 in the CI test matrix by @harels in #490
* Add Airflow 2.7 to the CI test matrix by @tatiana in #487
* Add MyPy type checks to CI since we exceeded pre-commit disk quota
usage by @tatiana in #510
@dwreeves dwreeves mentioned this pull request Feb 12, 2024
2 tasks
tatiana added a commit that referenced this pull request Apr 16, 2025
…` during DAG parsing (#1668)

Support running `dbt deps` incrementally to pre-calculated
`dbt_packages` during DAG parsing. This was a use case requested by an
Astro customer.

Before this change, Cosmos supported two types of configuration:
* If users choose `RenderConfig. dbt_deps=False` or
`ProjectConfig.install_dbt_deps=False`, Cosmos would create a symbolic
link for the user's pre-defined `dbt_packages` (background: #488, #600,
#730)
* If users choose `RenderConfig. dbt_deps=True` or
`ProjectConfig.install_dbt_deps=True` (default), Cosmos would ignore any
user-predefined ' dbt_packages` and do a run `dbt deps` from scratch
from a temporary folder.

An Astronomer customer requested to reuse the defined initially
`dbt_packages` directory and run `dbt deps` (incrementally).

We do not run dbt commands directly in the original dbt project folder
with Cosmos because some users use read-only filesystems (#414). We also
decided to use symbolic links instead of copying the directory due to
performance issues (#488). Since we did not want to introduce a breaking
change in a minor Cosmos release by changing the existing Cosmos 1.x
behaviour to meet this new use case, this PR supports:
* Copying the dbt deps related files (dbt packages folder and symbolic
link) to the Cosmos temporary folder; and
* Running `dbt deps`.

So this is not a breaking change, users must opt into this behaviour by
either:
- Changing individual `DbtDag` or `DbtTaskGroup` instances, using
`ProjectConfig.copy_dbt_packages=True` (new configuration) and
`RenderConfig. dbt_deps=True` or `ProjectConfig.install_dbt_deps=True`;
or
- Changing the behaviour globally, by setting the Airflow configuration
either via the environment variable
`AIRFLOW__COSMOS__DEFAULT_COPY_DBT_PACKAGES_VALUE=True`. or via the
`airflow.cfg`:
```
[cosmos]
default_copy_dbt_packages_value=True
```

The following two missing parts are being added as part of a separate
PR:
- Mimic this behaviour during task execution;
- Update documentation to be representative of both changes.

Depends on #1669 

Related to: #1630
tatiana added a commit that referenced this pull request Apr 16, 2025
…` during task execution (#1670)

Support running `dbt deps` incrementally to pre-calculated
`dbt_packages` during the task execution. This was a use case requested
by an Astro customer.

Before this change, Cosmos supported two types of configuration:
* If users choose `operator_args={"install_deps": False}` or
`ProjectConfig.install_dbt_deps=False`, Cosmos would create a symbolic
link for the user's pre-defined `dbt_packages` (background: #488, #600,
#730)
* If users choose `operator_args={"install_deps": True}` or
`ProjectConfig.install_dbt_deps=True` (default), Cosmos would ignore any
user-predefined ' dbt_packages` and do a run `dbt deps` from scratch
from a temporary folder.

An Astronomer customer requested to reuse the defined initially
`dbt_packages` directory and run `dbt deps` (incrementally).

# Implementation

We do not run dbt commands directly in the original dbt project folder
with Cosmos because some users use read-only filesystems (#414). We also
decided to use symbolic links instead of copying the directory due to
performance issues (#488). Since we did not want to introduce a breaking
change in a minor Cosmos release by changing the existing Cosmos 1.x
behaviour to meet this new use case, this PR supports:
* Copying the dbt deps related files (dbt packages folder and symbolic
link) to the Cosmos temporary folder; and
* Running `dbt deps`.

So this is not a breaking change, users must opt into this behaviour by
using `ProjectConfig.copy_dbt_packages=True` (new configuration) and
`operator_args={"install_dbt_deps": True}` or `ProjectConfig.
install_dbt_deps =True` and one of the following:
- Changing the operator to receive the argument `copy_dbt_packages=True`
- Changing individual `DbtDag` or `DbtTaskGroup` instances to use the
new configuration `ProjectConfig.copy_dbt_packages=True`
- Changing the behaviour globally, by setting the Airflow configuration
either via the environment variable
`AIRFLOW__COSMOS__DEFAULT_COPY_DBT_PACKAGES_VALUE=True` or via the
`airflow.cfg`:
```
[cosmos]
default_copy_dbt_packages_value=True
```

# How this was tested

To validate the end-to-end behaviour, we run the following dag from
`dev/dags`:
```
airflow dags test dbt_deps_example
```

# Related tickets

This is a follow-up to #1668 and #1669.

I'll make a follow-up PR covering the documentation.

Closes: #1630
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Write logs to a temp dir when using dbt_ls parsing mode

3 participants