Add support for virtual env directory flag#611
Conversation
👷 Deploy Preview for amazing-pothos-a3bca0 processing.
|
64c4b93 to
0364ea3
Compare
ae7d1f5 to
9b94cec
Compare
95cb840 to
9b0e6e3
Compare
06be330 to
26032c9
Compare
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #611 +/- ##
==========================================
- Coverage 93.28% 93.06% -0.23%
==========================================
Files 55 54 -1
Lines 2502 2163 -339
==========================================
- Hits 2334 2013 -321
+ Misses 168 150 -18 ☔ View full report in Codecov by Sentry. |
tatiana
left a comment
There was a problem hiding this comment.
Thank you very much for creating this PR so quickly, @LennartKloppenburg ! This is looking very good.
I added some comments in line, and I have a gut feeling we may need to add some additional tests to cover the possible behaviours in _get_or_create_venv_py_interpreter.
We can aim to release this change as part of 1.2.1 (if we consider it a bugfix) or 1.3 (if we consider it a new feature) 🎉
| self.log.info(f"Checking for venv interpreter: {py_interpreter_path} : {py_interpreter_path.is_file()}") | ||
| if py_interpreter_path.is_file(): | ||
| self.log.info(f"Found Python interpreter in cached virtualenv: `{str(py_interpreter_path)}`") | ||
| return str(py_interpreter_path) |
There was a problem hiding this comment.
In this case, does it still make sense to install any potential dependencies/update them - if there were requirement changes?
There was a problem hiding this comment.
Yeah ideally we'd be able to clean up the virtual env after the DAG run, but for the reasons you mentioned before this can be tricky.
One way to "perhaps" invalidate the virtualenv is to check when it was created and, after say 24 hours or 48 or so, have this operator clean it up and recreate it?
There was a problem hiding this comment.
The time-based approach could lead to some strange scenarios and be tricky to troubleshoot. How feasible would be for us to run a pip install in an existing virtualenv? It should be very quick if it was previously setup, and it would make the operator reliable.
Regarding the cleanup - I know - ideally we'd be able to set the venv only once during the DAG setup and delete during tear down. Unfortunately - to my knowledge - even the latest Airflow (2.7) does not allow us to have a setup/tear down per worker node during the lifecycle of a DAG. But this can be an improvement for the future - in a separate PR!
There was a problem hiding this comment.
Sorry for the late response here!
so the underlying prepare_virtualenv that we are "avoiding" after determining it's already there is imported from Airflow core (airflow.utils.python_virtualenv). That little helper also takes into account the python requirements so if we bypass this helper, we can't inject requirements unless we repeat the logic over here:
...
pip_cmd = None
if requirements is not None and len(requirements) != 0:
pip_cmd = _generate_pip_install_cmd_from_list(venv_directory, requirements, pip_install_options)
if requirements_file_path is not None and requirements_file_path:
pip_cmd = _generate_pip_install_cmd_from_file(
venv_directory, requirements_file_path, pip_install_options
)
if pip_cmd:
execute_in_subprocess(pip_cmd)
...What do you think?
There was a problem hiding this comment.
In our case, we probably would only need to do part of the logic:
if requirements is not None and len(requirements) != 0:
pip_cmd = _generate_pip_install_cmd_from_list(venv_directory, requirements, pip_install_options)
Since we don't support requirements_file_path.
If we don't want to add this call unnecessarily, we'd probably need a pip freeze call - to confirm if the desired dependencies are already installed, which may be more work.
We probably need one or both of these. Otherwise, we're at the risk of an Airflow worker having partial/outdated dependencies that are incompatible with the dependencies the user requested.
I'm in favour of us caching for performance reasons, but we still should aim to have the task being idempotent.
| self._venv_dir = virtualenv_dir | ||
| self._venv_tmp_dir: None | TemporaryDirectory[str] = None | ||
|
|
||
| @cached_property |
There was a problem hiding this comment.
A general thought: do we still want to cache this property? Is there any risk that we could end up caching the incorrect path?
There was a problem hiding this comment.
How is this property cached? If people are debugging or want to pass in more dynamically configured directories, I don't know how this decorator behaves :) Is it per task_id per dagrun_id or is it more persistent?
There was a problem hiding this comment.
The property is cached while the Python process is alive.
384c7b5 to
02dbd9a
Compare
|
@tatiana I've updated the PR with some changes you've requested :) One lingering issue: The issue can then be resolved by retrying these tasks with some retry_delay, which will achieve the same result as "waiting" for the virtual env to be provisioned would have done. |
If execution_config was reused, Cosmos 1.2.2 would raise:
```
astronomer-cosmos/dags/basic_cosmos_task_group.py
Traceback (most recent call last):
File "/Users/tati/Code/cosmos-clean/astronomer-cosmos/venv-38/lib/python3.8/site-packages/airflow/models/dagbag.py", line 343, in parse
loader.exec_module(new_module)
File "<frozen importlib._bootstrap_external>", line 848, in exec_module
File "<frozen importlib._bootstrap>", line 219, in _call_with_frames_removed
File "/Users/tati/Code/cosmos-clean/astronomer-cosmos/dags/basic_cosmos_task_group.py", line 74, in <module>
basic_cosmos_task_group()
File "/Users/tati/Code/cosmos-clean/astronomer-cosmos/venv-38/lib/python3.8/site-packages/airflow/models/dag.py", line 3817, in factory
f(**f_kwargs)
File "/Users/tati/Code/cosmos-clean/astronomer-cosmos/dags/basic_cosmos_task_group.py", line 54, in basic_cosmos_task_group
orders = DbtTaskGroup(
File "/Users/tati/Code/cosmos-clean/astronomer-cosmos/cosmos/airflow/task_group.py", line 26, in __init__
DbtToAirflowConverter.__init__(self, *args, **specific_kwargs(**kwargs))
File "/Users/tati/Code/cosmos-clean/astronomer-cosmos/cosmos/converter.py", line 113, in __init__
raise CosmosValueError(
cosmos.exceptions.CosmosValueError: ProjectConfig.dbt_project_path is mutually exclusive with RenderConfig.dbt_project_path and ExecutionConfig.dbt_project_path.If using RenderConfig.dbt_project_path or ExecutionConfig.dbt_project_path, ProjectConfig.dbt_project_path should be None
```
This has been raised by an Astro customer and our field engineer, who
tried to run: https://github.com/astronomer/cosmos-demo
53c7595 to
3a84aa7
Compare
|
@tatiana Just completed the rebase, saw some artifacts that trip up the tests, will look at those tomorrow :) ! |
62c2a7c to
547b0af
Compare
|
@tatiana When I run the tests locally they pass, maybe I missed something while rebasing? I rebased so much that I no longer know where it was introduced :D |
|
Hi @LennartKloppenburg ! I'm sorry for the massive delay, I've been working on other projects and it has been hard to keep up with everything. I'm planning to get back to this PR next week, so we can try to release it as part of Cosmos 1.5 |
|
Hi @LennartKloppenburg ! I'm very sorry for the very long delay.
If you are happy with the proposed changes, please feel free to incorporate them into your PR. |
## Description
Added `virtualenv_dir` as an option to `ExecutionConfig` which is then
propagated downstream to `DbtVirtualenvBaseOperator`.
The following now happens:
- If the flag is set, the operator will attempt to locate the `venv`'s
`python` binary under the provided `virtualenv_dir`.
- If so, it will conclude that the `venv` exists and continues without
creating a new one.
- If not, it will create a new one at `virtualenv_dir`
- If the flag is not set, simply continue using the temporary directory
solution that was already in place.
## Impact
A very basic test using a local `docker compose` set-up as per the
contribution guide and the
[example_virtualenv](https://github.com/astronomer/astronomer-cosmos/blob/main/dev/dags/example_virtualenv.py)
DAG saw the DAG's runtime go down from **2m31s** to just **32s**. I'd
this improvement to be even more noticeable with more complex graphs and
more python requirements.
## Related Issue(s)
Closes: #610
Partially solves: #1042
Follow up ticket: #1157
## Breaking Change?
None, the flag is optional and is ignored (with a
[warning](https://github.com/astronomer/astronomer-cosmos/compare/main...LennartKloppenburg:astronomer-cosmos:feature/cache-virtualenv?expand=1#diff-61b585fb903927b6868b9626c95e0ec47e3818eb477d795ebd13b0276d4fd76cR125))
when used outside of `VirtualEnv` execution mode.
## Important notice
Most of the changes in this PR were originally implemented in PR #611 by
@LennartKloppenburg. It became stale over the last few months due to
limited maintainer availability. Our sincere apologies to the original
author.
What was accomplished since:
1. Rebased
2. Fixed conflicts
3. Fixed failing tests
4. Introduced new tests
Co-authored-by: Lennart Kloppenburg <lennartkloppenburg@live.nl>
|
We took this to completion in #1079, giving the credits to @LennartKloppenburg and this original PR |
Description
Added
virtualenv_diras an option toExecutionConfigwhich is then propagated downstream toDbtVirtualenvBaseOperator.The following now happens:
venv'spythonbinary under the providedvirtualenv_dir.venvexists and continues without creating a new one.virtualenv_dirImpact
A very basic test using a local
docker composeset-up as per the contribution guide and the example_virtualenv DAG saw the DAG's runtime go down from 2m31s to just 32s. I'd this improvement to be even more noticeable with more complex graphs and more python requirements.Related Issue(s)
Implements #610
Breaking Change?
None, the flag is optional and is ignored (with a warning) when used outside of
VirtualEnvexecution mode.Checklist