Skip to content

Conversation

@potiuk
Copy link
Member

@potiuk potiuk commented Jan 10, 2023

Git Sync and Persistence for DAGs makes very little sense together and is largely misleading our users on what it does.

Git Sync provides atomicity of DAG folder synchronisation via checking out a complete copy of the DAGs folder and swapping symbolic link pointing to it. It does not play well with networked persistence.

It makes it super-easy by users unaware how git-sync and persistence work under-the-hood to walk into several traps:

  • git sync on persistent remote volumes such as EFS generate a LOT of extra traffic due to the way how git sync works (it creates second working folder for dags and replaces symbolic link to folders which effectively forces full sync of whole DAG folder for all involved instances with every commit
  • due to that sync that gets distributed over multiple clients of persistent volumes it looses the atomicity property of git sync and the above case where there are burst of synchronisation betwween multiple nodes, it is very likely to trigger inconsistent DAG parsing
  • the problem amplifies when the network volumes are distributed among multiple nodes and there are some networking limits (for example not provisioned IOPS in EFS). The amount of traffic generated at sync might cause even more inconsistencies - only solvable by paying extra IOPS (where it would not be needed normally)
  • users might be tricked into trying to use gitSync and also update DAGs using persistence (so basically combine the development friendly dag distribution over persistent volumes and production-ready git-sync - without being aware that git-sync will override the manually synced DAGS when swapping the symbolic links

On top of it, the current status is that it does not work. There are several issues where volumes are missing when the combo is used in certain situations and better than fixing those is to disable it.

Closes: #27545
Closes: #27476
Closes: #27080

Related: #27124


^ Add meaningful description above

Read the Pull Request Guidelines for more information.
In case of fundamental code changes, an Airflow Improvement Proposal (AIP) is needed.
In case of a new dependency, check compliance with the ASF 3rd Party License Policy.
In case of backwards incompatible changes please leave a note in a newsfragment file, named {pr_number}.significant.rst or {issue_number}.significant.rst, in newsfragments.

Copy link
Contributor

@dstandish dstandish left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems fine, though being technically breaking probably requires major release? Unless you want to think of it as bugfix.....
@jedcunningham

@potiuk
Copy link
Member Author

potiuk commented Jan 10, 2023

We had a lot of discussions about it with @jedcunningham in related issues and some time ago I wrote the article describing in detail why git-sync and networked persistence do not play along together (and do not serve any useful use case).

https://medium.com/apache-airflow/shared-volumes-in-airflow-the-good-the-bad-and-the-ugly-22e9f681afca

I think we've accumulated enough of the issues from our users that confirm that they REALLY did not want to use persistence and git-sync, and we had a lot of problems which we could attribute to lack of atomicity that is amplified by git-sync + network persistence.

The last straw for me was this discussion https://apache-airflow.slack.com/archives/C027H098M1C/p1673333515991549. where the users wanted to use dags.persistence and dags.gitSync to combine development and production workflows in one (which would not work but the user was unaware of that as they did not know that how git-sync works under the hood it would overwrote the manually updated/copied dags).

All-in-all - I think allowing this combo is actively harmful and we should help our users to make the right decision (and ourselves by not having similar discussions/getting issues created) by disabling the combo.

@potiuk
Copy link
Member Author

potiuk commented Jan 10, 2023

I personally think it is a bugfix (because it does not work anyway :) )

@potiuk potiuk force-pushed the disallow-git-sync-and-persistence branch from 59c15db to 7a969c1 Compare January 10, 2023 08:36
Git Sync and Persistence for DAGs makes very little sense together
and is largely misleading our users on what it does.

Git Sync provides atomicity of DAG folder synchronisation via
checking out a complete copy of the DAGs folder and swapping
symbolic link pointing to it. It does not play well with
networked persistence.

It makes it super-easy by users unaware how git-sync and
persistence work under-the-hood to walk into several traps:

* git sync on persistent remote volumes such as EFS generate a LOT
  of extra traffic due to the way how git sync works (it creates
  second working folder for dags and replaces symbolic link to folders
  which effectively forces full sync of whole DAG folder for all
  involved instances with every commit
* due to that sync that gets distributed over multiple clients of
  persistent volumes it looses the atomicity property of git sync
  and the above case where there are burst of synchronisation betwween
  multiple nodes, it is very likely to trigger inconsistent DAG parsing
* the problem amplifies when the network volumes are distributed among
  multiple nodes and there are some networking limits (for example
  not provisioned IOPS in EFS). The amount of traffic generated at
  sync might cause even more inconsistencies - only solvable by paying
  extra IOPS (where it would not be needed normally)
* users might be tricked into trying to use gitSync and also update
  DAGs using persistence (so basically combine the development friendly
  dag distribution over persistent volumes and production-ready
  git-sync - without being aware that git-sync will override the
  manually synced DAGS when swapping the symbolic links

On top of it, the current status is that it does not work. There
are several issues where volumes are missing when the combo is used
in certain situations and better than fixing those is to disable it.

Closes: apache#27545
Closes: apache#27476
Closes: apache#27080

Related: apache#27124
@potiuk potiuk force-pushed the disallow-git-sync-and-persistence branch from 7a969c1 to f3cfb6e Compare January 10, 2023 21:06
@jedcunningham
Copy link
Member

I think I'd rather warn instead of block (but I'm also a bit worn down on thinking about the combo in our chart, so I certainly won't stand in the way here).

@potiuk
Copy link
Member Author

potiuk commented Jan 11, 2023

I think this is one of those cases that we really balloon a matrix of tests (and complexity of the helm) for very little reason.

Seems that we alrady have a number of tests - now failing) where both flags are set - yet we still do not have the combo right.

After seeing the number of tests failing (28), though, I have a second thought - because apparently this combo was actually quite thoroughly tested before.

I wonder (for my own understanding here) - why the (currently failing) tests were added. What was the reasoning behind having a number of "gitSync": {"enabled": True}, "persistence": {"enabled": True} in the parameterized tests?
Was there any other thought behind it that we knew or expected this combo to be working or was it just - ok let's test all the combinations without much of a thinking whether this combo is actually useful?

I think answer to that should determine whether it makes sense to remove it completely or maybe just warn the users and try to fix the cases we know for now.

@potiuk
Copy link
Member Author

potiuk commented Jan 11, 2023

And yes I am just afraid that we might break things for some users - so far I understood that #27545 #27476 effectively means that it was really difficult to get a prod-ready setting, but After looking closely it's been "freshly" broken by #22913

Also #22913 is triggered by recently added standalone DAG File processor. So my claims that this is not "breaking" is not as bold as it was.

Let me think for a while about it.

@potiuk potiuk marked this pull request as draft January 11, 2023 00:55
@jedcunningham
Copy link
Member

Yes, it did work at one point and the tests were intentionally added (but even then this combo really only made sense with a single scheduler).

@potiuk
Copy link
Member Author

potiuk commented Jan 19, 2023

Still thinking on it :).

@hussein-awala
Copy link
Member

And yes I am just afraid that we might break things for some users

@potiuk I think the user who uses the combo and wants to upgrade to the new chart version could still use it by enabling git-sync, disabling dags persistence and adding the dags volume manually, what do you think?

@Aknc34
Copy link

Aknc34 commented Feb 19, 2023 via email

@potiuk potiuk added the pinned Protect from Stalebot auto closing label Mar 17, 2023
@jedcunningham jedcunningham removed this from the Airflow Helm Chart 1.9.0 milestone Apr 10, 2023
@JMLizano
Copy link

Hi, I was about to create a bug report related to this but then I saw this PR, so probably better to include it here.

The issue appears when activating both dags persistence and git-sync. When you have a DAG inside a package, that is importing some other module from the package, like:

from mypackage.lib.mylib import lib_method

the import will fail at DAG execution time if both git-sync and dag persistence are enabled. The DAG is still displayed in the Web UI, and airflow list it as correctly imported (i.e. it is displayed in the output of airflow dags list).
When you try to execute the DAG, it will fail right away, without any log output since it fails before even starting the execution, but if you check the worker logs you will see a message like this:

Traceback (most recent call last):
  File "/home/airflow/.local/lib/python3.7/site-packages/airflow/models/dagbag.py", line 339, in parse
    loader.exec_module(new_module)
  File "<frozen importlib._bootstrap_external>", line 728, in exec_module
  File "<frozen importlib._bootstrap>", line 219, in _call_with_frames_removed
  File "/opt/airflow/dags/repo/dags/mypackage/mydags/mydag.py", line 6, in <module>
    from mypackage.lib.mylib import lib_method
ModuleNotFoundError: No module named 'mypackage'

If you disable dags persistence then everything works as expected.

I have created a repository that reproduces the issue: https://github.com/JMLizano/explore-airflow-chart-issue

@potiuk
Copy link
Member Author

potiuk commented Jun 27, 2023

Closing in favour of #32181

@potiuk potiuk closed this Jun 27, 2023
@potiuk potiuk deleted the disallow-git-sync-and-persistence branch November 17, 2023 16:07
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

area:helm-chart Airflow Helm Chart pinned Protect from Stalebot auto closing

Projects

None yet

6 participants