[AIRFLOW-3607] collected trigger rule dep check per dag run #4751
Please reference your Jira ticket number in the PR title. It is required for the Jira Git integration to work.
Codecov Report
@@            Coverage Diff             @@
##           master    #4751      +/-   ##
==========================================
+ Coverage   84.81%   85.08%   +0.27%
==========================================
  Files         679      723      +44
  Lines       38493    39558    +1065
==========================================
+ Hits        32646    33658    +1012
- Misses       5847     5900      +53
Trying to think if the existing tests are detailed enough and cover both "branches" (with and without finished_tasks being passed in)
Anyone have thoughts on this?
# see if the task name is in the task upstream for our task
upstream_tasks = [finished_task for finished_task in dep_context.finished_tasks
                  if finished_task.task_id in ti.task.upstream_task_ids]
if upstream_tasks:
I was wondering: before this change we had only a SQL query, but now it's both Python and SQL.
🤔
Yes, because we need to share the SQL results for more than one purpose.
I left the SQL query in for backwards compatibility, but I suppose the Python path is the one that should be used in most places.
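To make the two branches discussed in this thread concrete, here is a minimal sketch, not the PR's actual code. `FinishedTask`, `count_upstream_states`, and `query_fallback` are hypothetical names invented for illustration. It assumes the caller either passes in the list of finished task instances fetched once for the DAG run, or passes nothing and lets the per-task query run.

```python
from collections import Counter, namedtuple

# Hypothetical stand-in for a task instance row: only the fields this sketch needs.
FinishedTask = namedtuple("FinishedTask", ["task_id", "state"])


def count_upstream_states(upstream_task_ids, finished_tasks=None, query_fallback=None):
    """Count finished upstream tasks by state.

    finished_tasks: task instances already fetched once for the whole DAG run
        (the Python branch). When it is None, query_fallback is called instead
        (the per-task SQL branch, kept for backwards compatibility).
    """
    if finished_tasks is None:
        # Legacy branch: one DB round trip for this task instance.
        finished_tasks = query_fallback(upstream_task_ids)
    # Shared branch: filter the already-fetched rows in Python.
    return Counter(
        ft.state for ft in finished_tasks if ft.task_id in upstream_task_ids
    )


# Branch 1: results shared across the DAG run, no query needed.
shared = [
    FinishedTask("A", "success"),
    FinishedTask("B", "failed"),
    FinishedTask("Z", "success"),  # not upstream of our task, so it is ignored
]
shared_counts = count_upstream_states({"A", "B"}, finished_tasks=shared)

# Branch 2: nothing passed in, so the (hypothetical) query callable is used.
fallback_calls = []


def fake_query(task_ids):
    fallback_calls.append(set(task_ids))
    return [FinishedTask("A", "success")]


fallback_counts = count_upstream_states({"A", "B"}, query_fallback=fake_query)
```

The sketch tests `finished_tasks is None` rather than truthiness, so an empty list (a DAG run with nothing finished yet) does not trigger the fallback query. A test suite covering both branches would need at least one case of each kind.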
@amichai07 you have a failure in the tests.
LGTM, only a few comments.
op5 = DummyOperator(task_id='E', trigger_rule=TriggerRule.ONE_FAILED)

op1.set_downstream([op2, op3])  # op1 >> op2, op3
op4.set_upstream([op3, op2])  # op3 >> op4
Suggested change:
- op4.set_upstream([op3, op2])  # op3 >> op4
+ op4.set_upstream([op3, op2])  # (op3, op2) >> op4
I think it is also fine to use the >> operator. You just need to disable the related pylint rule.
Thanks, I will fix it.
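For reference, the `>>` form mentioned in this thread could look roughly like the sketch below. `Op` is a hypothetical stand-in class, not Airflow's operator, so the snippet runs without Airflow installed. The pylint message name is assumed to be `pointless-statement`, since a bare `a >> b` expression contains no call.

```python
class Op:
    """Hypothetical stand-in for an operator; records its upstream task ids."""

    def __init__(self, task_id):
        self.task_id = task_id
        self.upstream_task_ids = set()

    def set_upstream(self, others):
        others = others if isinstance(others, (list, tuple)) else [others]
        for other in others:
            self.upstream_task_ids.add(other.task_id)

    def set_downstream(self, others):
        others = others if isinstance(others, (list, tuple)) else [others]
        for other in others:
            other.set_upstream(self)

    def __rshift__(self, other):
        # self >> other: other runs after self
        self.set_downstream(other)
        return other

    def __rrshift__(self, other):
        # [a, b] >> self: Python falls back here because list has no __rshift__
        self.set_upstream(other)
        return self


op1, op2, op3, op4 = (Op(name) for name in "ABCD")

# Same wiring as the set_downstream / set_upstream calls under review.
op1 >> [op2, op3]  # pylint: disable=pointless-statement
[op3, op2] >> op4  # pylint: disable=pointless-statement
```

Either spelling produces the same graph; the operator form just makes the direction of the edges visible without a trailing comment.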
[AIRFLOW-3607] Only query DB once per DAG run for TriggerRuleDep (apache#4751)
This decreases scheduler delay between tasks by about 20% for larger DAGs, and sometimes by more for even larger or more complex DAGs. The delay between tasks can be a major issue, especially for DAGs with many subdags. We found that the scheduling process spends a lot of time in dependency checking: the trigger rule dependency called the DB once for each task instance, so we made it call the DB just once for each dag_run.
Backports of this commit carry the trailers (cherry picked from commit 50efda5) and (cherry picked from commit cb750c1).
[AIRFLOW-3607] fix scheduler bug related to concurrency and depends on past (apache#7402)
commit 50efda5 introduced a bug that prevents the scheduler from scheduling tasks with the following properties:
* has depends on past set to True
* has custom concurrency limit
[AIRFLOW-3607] Optimize dep checking when depends on past set and concurrency limit (apache#7503)
Jira
Description
We found that the scheduling process spends a lot of time in dependency checking. The trigger rule
dependency called the DB once for each task instance; we changed it to call the DB just once for
each dag_run.
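The difference in DB round trips described above can be sketched with an in-memory SQLite database. This is an illustration only: the `task_instance` schema, the `upstream` mapping, and the `CountingDB` helper are all hypothetical and much simpler than Airflow's real models.

```python
import sqlite3
from collections import Counter

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE task_instance (dag_run TEXT, task_id TEXT, state TEXT)")
conn.executemany(
    "INSERT INTO task_instance VALUES (?, ?, ?)",
    [("run1", "A", "success"), ("run1", "B", "failed"), ("run1", "C", None)],
)

# Hypothetical dependency graph: task_id -> ids of its upstream tasks.
upstream = {"A": set(), "B": {"A"}, "C": {"A", "B"}}


class CountingDB:
    """Thin wrapper that counts how many queries are issued."""

    def __init__(self, connection):
        self.connection = connection
        self.calls = 0

    def query(self, sql, params=()):
        self.calls += 1
        return self.connection.execute(sql, params).fetchall()


def states_per_task(db, dag_run, graph):
    """Before: one query per task instance that has upstream tasks."""
    result = {}
    for task_id, ups in graph.items():
        if not ups:
            result[task_id] = Counter()
            continue
        placeholders = ",".join("?" * len(ups))
        rows = db.query(
            "SELECT state FROM task_instance "
            "WHERE dag_run = ? AND state IN ('success', 'failed') "
            f"AND task_id IN ({placeholders})",
            (dag_run, *sorted(ups)),
        )
        result[task_id] = Counter(state for (state,) in rows)
    return result


def states_per_dag_run(db, dag_run, graph):
    """After: one query per DAG run, then filter the shared rows in Python."""
    finished = db.query(
        "SELECT task_id, state FROM task_instance "
        "WHERE dag_run = ? AND state IN ('success', 'failed')",
        (dag_run,),
    )
    return {
        task_id: Counter(state for tid, state in finished if tid in ups)
        for task_id, ups in graph.items()
    }


db = CountingDB(conn)
per_task = states_per_task(db, "run1", upstream)
per_task_calls = db.calls

db.calls = 0
per_run = states_per_dag_run(db, "run1", upstream)
per_run_calls = db.calls
```

Both functions return the same per-task state counts, but the first issues a query for every task with upstream dependencies while the second issues exactly one, so the gap grows with the number of tasks in the DAG.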
Tests
Commits
Documentation
no need for new docs
Code Quality
flake8
Results
The tests were run on a DAG with many tasks (35 tasks).
The tasks themselves don't run any DB queries.
On local environment
[Figure: before changes]
[Figure: after collecting dep check queries]
Stress test - running the DAG every 10 seconds for an hour:
[Figure: before changes]
[Figure: after]
On production environment
[Figure: before changes]
[Figure: after]
Stress test - running alongside 150 other DAGs:
[Figure: before changes]
[Figure: after]
Edit
We recently ran a stress test to check this change again with version 1.10.4.
We ran the test on a staging environment that resembles production, with one DAG of 49 tasks that starts once a minute:
[Figure: time delay between two tasks in the DAG, plotted over time]
[Figure: DB read IOPS; the difference can explain the change (the times are different because this graph is in UTC)]
Conclusion
The query that we changed did have a dramatic impact on the performance of the scheduler. Reusing DB results decreased the delay notably and gave the system a chance to recover from stress.