[AIRFLOW-3607] Only query DB once per DAG run for TriggerRuleDep #11010
@amichai07 @kaxil @ashb anyone interested in taking a look? I'm porting #4751 to 1.10.* to fix a bad
@turbaszek I am not a release manager, so unfortunately I cannot help you with this. I am focusing on the development of Airflow 2.0. If I can help you with anything else, please let me know.
pls merge
Thanks @turbaszek. I cherry-picked this because the scheduler in 1.10.* has trouble with large DAGs (not that large, just hundreds of tasks in one DAG). It queries the DB too many times and struggles to finish (see flamegraph_before and flamegraph_after in the PR description), which caused us to hit #10790 too. So this cherry-pick is more of a fix than an improvement.
As mentioned in #11119 (comment), we are no longer accepting new features to the 1-10 branch :< I'm sorry. The 2.0 alpha should be released soon 🎉
Hi @turbaszek and @dimberman, I agree that #11119 can be closed since it's a brand new feature. However, this one, #11010, is not a new feature. It's more of a fix to 1.10.*. When DAGs are just a little larger than usual (500 tasks) and a user tries to run dozens of DagRuns at the same time (around 30), airflow-scheduler becomes super slow. This PR is the fix for that, backported to 1.10.*.
@turbaszek I think this change should be in the next release. See: #11780. WDYT?
@yuqian90 Can you rebase it once more, please? Appreciate it.
Done. Thank you!
This is cherry-picked from #4751 for v1-10-test.
Some investigation into the issue reported in #10790 led to the discovery that this loop in scheduler_job.py takes almost 90% of the time in SchedulerJob.process_file() for large DAGs (around 500 tasks). This causes the DagFileProcessor spawned by the scheduler to run slowly. The loop is slow because it creates a new DepContext for every ti, and every DepContext needs to populate its own finished_tasks even though this list is the same for every DagRun.
This is the flamegraph generated by py-spy showing the performance of DagFileProcessor in Airflow 1.10.12 before this PR:
https://raw.githubusercontent.com/yuqian90/airflow/gif_for_demo/airflow/www/static/flamegraph_before.svg
This is the performance after 1.10.12 is patched with this PR:
https://raw.githubusercontent.com/yuqian90/airflow/gif_for_demo/airflow/www/static/flamegraph_after.svg
The nice thing is that #4751 already addressed this issue for the master branch. We just need to cherry-pick it to fix this in 1.10.* with some very minor conflict fixes.
While this PR will not fix every scenario that causes #10790, it does reduce the DagFileProcessor time from around 100s to just about 12s for our use case (a DAG with about 500 tasks, many of which are sensors in reschedule mode with a poke_interval of 60s).
Original commit message in #4751:
This decreases scheduler delay between tasks by about 20% for larger DAGs,
sometimes more for larger or more complex DAGs.
The delay between tasks can be a major issue, especially when we have DAGs with
many subdags. We found that the scheduling process spends plenty of time in
dependency checking. The trigger rule dependency called the DB for each task
instance, so we made it call the DB just once for each dag_run.
(cherry picked from commit 50efda5)
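To make the idea concrete, here is a minimal, self-contained sketch of the "query once per DAG run" pattern described above. It does not use Airflow's real classes: FakeDb, SharedContext, the simplified "all upstream tasks finished" rule, and every function name below are hypothetical stand-ins chosen for illustration. It only shows how sharing one cached finished_tasks list across all task instances in a run brings the number of DB queries down from one per task to one per run.

```python
from typing import Dict, List, Optional

# Hypothetical set of "finished" states, used only by this sketch.
FINISHED_STATES = {"success", "failed", "skipped", "upstream_failed"}


class FakeDb:
    """Stand-in for the metadata DB (not an Airflow class); counts queries."""

    def __init__(self, states_by_run: Dict[str, Dict[str, str]]) -> None:
        self.states_by_run = states_by_run
        self.query_count = 0

    def query_finished_tasks(self, run_id: str) -> List[str]:
        self.query_count += 1
        states = self.states_by_run[run_id]
        return [task for task, state in states.items() if state in FINISHED_STATES]


class SharedContext:
    """Hypothetical context that caches finished tasks for one DAG run."""

    def __init__(self) -> None:
        self.finished_tasks: Optional[List[str]] = None

    def get_finished_tasks(self, db: FakeDb, run_id: str) -> List[str]:
        # Hit the DB only the first time; later callers reuse the cached list.
        if self.finished_tasks is None:
            self.finished_tasks = db.query_finished_tasks(run_id)
        return self.finished_tasks


def deps_met_query_per_task(
    db: FakeDb, run_id: str, upstream: Dict[str, List[str]]
) -> List[str]:
    """Before: every task instance triggers its own DB query."""
    met = []
    for task_id, upstream_ids in upstream.items():
        finished = set(db.query_finished_tasks(run_id))
        if set(upstream_ids) <= finished:
            met.append(task_id)
    return met


def deps_met_query_per_run(
    db: FakeDb, run_id: str, upstream: Dict[str, List[str]]
) -> List[str]:
    """After: one shared context, so the DB is queried once per DAG run."""
    context = SharedContext()
    met = []
    for task_id, upstream_ids in upstream.items():
        finished = set(context.get_finished_tasks(db, run_id))
        if set(upstream_ids) <= finished:
            met.append(task_id)
    return met


# Made-up example data: one run with four tasks.
STATES = {"run_1": {"a": "success", "b": "success", "c": "running", "d": "queued"}}
UPSTREAM = {"a": [], "b": ["a"], "c": ["a", "b"], "d": ["c"]}

if __name__ == "__main__":
    db_before, db_after = FakeDb(STATES), FakeDb(STATES)
    print(deps_met_query_per_task(db_before, "run_1", UPSTREAM), db_before.query_count)
    print(deps_met_query_per_run(db_after, "run_1", UPSTREAM), db_after.query_count)
```

Both functions return the same result; only the query count differs. With N task instances in a run, the first issues N queries and the second issues one, which is where the saving for a DAG with hundreds of tasks would come from.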