Use selectinload in trigger #40487

Merged
merged 3 commits into from
Sep 18, 2024

josephangbc
Contributor

@josephangbc josephangbc commented Jun 28, 2024

closes: #33647

As mentioned by @arunravimv in #33647, we have added this patch to our own Airflow deployment and have noticed improvements in triggerer performance.

Below are the EXPLAIN ANALYZE outputs for the two SQLAlchemy relationship loading strategies:

Triggerer Process

Using joinedload in bulk_fetch method

-> Nested loop left join  (cost=101 rows=95) (actual time=0.22..0.359 rows=3 loops=1)
    -> Nested loop left join  (cost=67.8 rows=95) (actual time=0.21..0.348 rows=3 loops=1)
        -> Nested loop left join  (cost=34.6 rows=95) (actual time=0.204..0.338 rows=3 loops=1)
            -> Filter: (`trigger`.id in (969,968,984))  (cost=1.36 rows=3) (actual time=0.049..0.0565 rows=3 loops=1)
                -> Index range scan on trigger using PRIMARY over (id = 968) OR (id = 969) OR (id = 984)  (cost=1.36 rows=3) (actual time=0.048..0.0545 rows=3 loops=1)
            -> Nested loop inner join  (cost=35.9 rows=31.7) (actual time=0.0915..0.0932 rows=1 loops=3)
                -> Index lookup on task_instance_1 using ti_trigger_id (trigger_id=`trigger`.id)  (cost=8.97 rows=31.7) (actual time=0.0716..0.0731 rows=1 loops=3)
                -> Single-row index lookup on dag_run_1 using dag_run_dag_id_run_id_key (dag_id=task_instance_1.dag_id, run_id=task_instance_1.run_id)  (cost=0.251 rows=1) (actual time=0.0194..0.0195 rows=1 loops=3)
        -> Single-row index lookup on trigger_1 using PRIMARY (id=task_instance_1.trigger_id)  (cost=0.251 rows=1) (actual time=0.00301..0.00305 rows=1 loops=3)
    -> Single-row index lookup on job_1 using PRIMARY (id=trigger_1.triggerer_id)  (cost=0.251 rows=1) (actual time=0.00316..0.00321 rows=1 loops=3)

Using selectinload in bulk_fetch method

-> Nested loop inner join  (cost=5.26 rows=3) (actual time=0.0895..0.362 rows=3 loops=1)
    -> Nested loop left join  (cost=4.21 rows=3) (actual time=0.065..0.313 rows=3 loops=1)
        -> Nested loop left join  (cost=3.16 rows=3) (actual time=0.0521..0.298 rows=3 loops=1)
            -> Index range scan on task_instance using ti_trigger_id over (trigger_id = 968) OR (trigger_id = 969) OR (trigger_id = 984), with index condition: (task_instance.trigger_id in (969,968,984))  (cost=2.11 rows=3) (actual time=0.0399..0.273 rows=3 loops=1)
            -> Single-row index lookup on trigger_1 using PRIMARY (id=task_instance.trigger_id)  (cost=0.283 rows=1) (actual time=0.00755..0.00759 rows=1 loops=3)
        -> Single-row index lookup on job_1 using PRIMARY (id=trigger_1.triggerer_id)  (cost=0.283 rows=1) (actual time=0.00454..0.00458 rows=1 loops=3)
    -> Single-row index lookup on dag_run_1 using dag_run_dag_id_run_id_key (dag_id=task_instance.dag_id, run_id=task_instance.run_id)  (cost=0.283 rows=1) (actual time=0.016..0.016 rows=1 loops=3)

triggerview/list API

Using joined for relationship between TaskInstance and Trigger

-> Sort: `trigger`.id DESC  (actual time=0.0215..0.0215 rows=0 loops=1)
    -> Stream results  (cost=8234 rows=79751) (actual time=0.0169..0.0169 rows=0 loops=1)
        -> Nested loop left join  (cost=8234 rows=79751) (actual time=0.016..0.016 rows=0 loops=1)
            -> Table scan on trigger  (cost=0.35 rows=1) (actual time=0.0152..0.0152 rows=0 loops=1)
            -> Nested loop inner join  (cost=33331 rows=79751) (never executed)
                -> Table scan on dag_run_1  (cost=5419 rows=51596) (never executed)
                -> Filter: (task_instance_1.trigger_id = `trigger`.id)  (cost=0.386 rows=1.55) (never executed)
                    -> Index lookup on task_instance_1 using ti_dag_run (dag_id=dag_run_1.dag_id, run_id=dag_run_1.run_id)  (cost=0.386 rows=1.55) (never executed)

Using selectin for relationship between TaskInstance and Trigger

-> Nested loop inner join  (cost=8.41 rows=8) (actual time=0.0409..0.0409 rows=0 loops=1)
    -> Index range scan on task_instance using ti_trigger_id over (trigger_id = 109) OR (trigger_id = 110) OR (6 more), with index condition: (task_instance.trigger_id in (116,115,114,113,112,111,110,109))  (cost=5.61 rows=8) (actual time=0.0402..0.0402 rows=0 loops=1)
    -> Single-row index lookup on dag_run_1 using dag_run_dag_id_run_id_key (dag_id=task_instance.dag_id, run_id=task_instance.run_id)  (cost=0.263 rows=1) (never executed)

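From the EXPLAIN ANALYZE results above, we can see that selectinload yields a much cheaper query plan for both the triggerer process and the triggerview/list API: the estimated cost drops from 101 to 5.26 for bulk_fetch and from 8234 to 8.41 for the triggerview/list query.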
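To make the difference between the two strategies concrete, here is a minimal sketch of the query shapes they correspond to, hand-written against a hypothetical, heavily simplified schema in SQLite. It is not the Airflow model code or the SQL that SQLAlchemy actually emits; the table layout and the `fetch_joined` / `fetch_selectin` helpers are assumptions made purely for illustration. The joined style fetches everything in one LEFT OUTER JOIN, while the select-in style fetches the parent rows first and then loads the related rows with a second query filtered by `IN (...)` on the indexed foreign key.

```python
import sqlite3

# Hypothetical, simplified schema (assumption: not Airflow's real tables).
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE "trigger" (id INTEGER PRIMARY KEY, classpath TEXT);
    CREATE TABLE task_instance (
        task_id TEXT PRIMARY KEY,
        trigger_id INTEGER REFERENCES "trigger"(id)
    );
    CREATE INDEX ti_trigger_id ON task_instance (trigger_id);
    INSERT INTO "trigger" VALUES (968, 'a'), (969, 'b'), (984, 'c'), (990, 'd');
    INSERT INTO task_instance VALUES ('t1', 968), ('t2', 969), ('t3', 984);
    """
)


def placeholders(values):
    """Build a '?,?,?' placeholder list for an IN (...) clause."""
    return ",".join("?" * len(values))


def fetch_joined(ids):
    """Joined style: one statement, related rows pulled in via LEFT OUTER JOIN."""
    rows = conn.execute(
        'SELECT t.id, ti.task_id FROM "trigger" AS t '
        "LEFT OUTER JOIN task_instance AS ti ON ti.trigger_id = t.id "
        f"WHERE t.id IN ({placeholders(ids)}) ORDER BY t.id",
        ids,
    ).fetchall()
    return {trigger_id: task_id for trigger_id, task_id in rows}


def fetch_selectin(ids):
    """Select-in style: load parents, then load children with a second IN query."""
    parent_ids = [
        row[0]
        for row in conn.execute(
            f'SELECT id FROM "trigger" WHERE id IN ({placeholders(ids)}) ORDER BY id',
            ids,
        )
    ]
    result = {trigger_id: None for trigger_id in parent_ids}
    if parent_ids:
        # Second statement: a range scan on the foreign-key index only.
        for trigger_id, task_id in conn.execute(
            "SELECT trigger_id, task_id FROM task_instance "
            f"WHERE trigger_id IN ({placeholders(parent_ids)})",
            parent_ids,
        ):
            result[trigger_id] = task_id
    return result


ids = [968, 969, 984, 990]
same_result = fetch_joined(ids) == fetch_selectin(ids)
```

Both helpers return the same mapping; the second one trades a single wide join for two narrow statements, which is the shape visible in the selectinload plans above.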

@josephangbc josephangbc force-pushed the use-selectinload-trigger branch 2 times, most recently from 3105adb to 5a85601 Compare June 28, 2024 22:54
@josephangbc josephangbc marked this pull request as ready for review June 28, 2024 23:02
@vincbeck
Contributor

vincbeck commented Jul 8, 2024

The spellcheck is not happy with "selectin". Could you add it to docs/spelling_wordlist.txt please?

@vincbeck
Contributor

vincbeck commented Jul 8, 2024

Looks fantastic otherwise!

@uranusjr
Member

Is there a metric we can write a test against to ensure this query is optimised?

@arunravimv

arunravimv commented Jul 24, 2024

@uranusjr we are not sure how to unit test against database plan cost. We tested this and observed the plan costs in a fully functional environment connected to a MySQL database. If you could give us some guidance on how to model the unit test, that would be helpful. Thank you in advance.
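One possible direction, offered only as a sketch and not as an established Airflow testing pattern: rather than asserting on a plan cost number, a test could assert on the plan shape, for example that the query reaches `task_instance` through the `ti_trigger_id` index. The example below uses SQLite's `EXPLAIN QUERY PLAN` from the Python standard library because it runs anywhere; the simplified table and the helper names `plan_details`, `uses_index` and `make_db` are hypothetical, and MySQL's planner and output format differ, so this does not reproduce the plans shown above.

```python
import sqlite3


def plan_details(conn, sql, params=()):
    """Return the human-readable 'detail' text of each step in SQLite's plan."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    return [row[-1] for row in rows]  # 'detail' is the last column of each row


def uses_index(conn, sql, index_name, params=()):
    """True if any step of the query plan mentions the given index."""
    return any(index_name in detail for detail in plan_details(conn, sql, params))


def make_db(with_index):
    """Create a hypothetical, simplified task_instance table (illustration only)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE task_instance (task_id TEXT, trigger_id INTEGER)")
    if with_index:
        conn.execute("CREATE INDEX ti_trigger_id ON task_instance (trigger_id)")
    return conn


QUERY = "SELECT task_id FROM task_instance WHERE trigger_id IN (?, ?, ?)"
IDS = (968, 969, 984)

indexed = uses_index(make_db(with_index=True), QUERY, "ti_trigger_id", IDS)
unindexed = uses_index(make_db(with_index=False), QUERY, "ti_trigger_id", IDS)
```

A check like this only guards against a regression in plan shape on one backend. It says nothing about real-world latency, and it would not catch problems that depend on table statistics.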

@thirtyseven

Our Airflow cluster is hit pretty badly by this issue. Are there any blockers to merging this? Is there any way we can contribute?

@josephangbc
Contributor Author

@thirtyseven We have been using this fix in our Airflow deployments for a while now and have not seen any issues. We provided the database plan comparison above, but we need some guidance on how to implement unit tests for performance fixes like this. We would like to get this PR reviewed and merged too.

Contributor

@dstandish dstandish left a comment

I am good with this. We don't really have a performance testing framework in OSS. But we have some amount of QA downstream. And I looked at the queries and I'm not surprised that the "before" one causes trouble.

@vincbeck vincbeck merged commit 46b41e3 into apache:main Sep 18, 2024
51 checks passed

boring-cyborg bot commented Sep 18, 2024

Awesome work, congrats on your first merged pull request! You are invited to check our Issue Tracker for additional contributions.

@thirtyseven

Thanks for merging, any chance this can be cherry picked into the next patch release?

@vincbeck
Contributor

vincbeck commented Sep 18, 2024

@ephraimbuddy What do you think? Can we consider it a bug fix? It is a performance improvement fix.

@ephraimbuddy
Contributor

> @ephraimbuddy What do you think? Can we consider it a bug fix? It is a performance improvement fix.

Ok. So, I rushed in and created a new label for things we need to backport to Airflow 2. However, we shouldn't backport this one to Airflow 2. It's an improvement change.

@vincbeck
Contributor

Based on some of the comments, should we consider it a bug fix? #40487 (comment)

@thirtyseven

The issue makes deferred tasks unusable when the task_instance table grows too large, so I'd personally consider it a bug fix.

@ephraimbuddy
Contributor

> Based on some of the comments, should we consider it a bug fix? #40487 (comment)

I'm sceptical about getting it into Airflow 2, even with the comments, but if we feel sure about it, let's backport it. cc @jedcunningham

@jedcunningham
Member

I'd say bugfix for this one. It fixes a real problem vs just optimizing something that isn't problematic.

@robg-eb

robg-eb commented Sep 19, 2024

+1 on requesting this as a bugfix. It has caused our Trigger-based processes to become unusable in production on a few occasions already, and while we do have a workaround (regularly running ANALYZE TABLE task_instance to avoid the problem), it seems it could bite many others as well with large enough numbers of deferred tasks.

@vincbeck
Contributor

@ephraimbuddy Based on the feedback from others, I added back the label "needs backport to 2". How does that work? Do we no longer do the backport ourselves, with the release manager doing it before the next 2.10 release?

@ephraimbuddy
Contributor

Hi @vincbeck, we do the backport. Feel free to backport it. I want to use the label to keep track of things we should have backported. That's the only purpose of the label, and we will use it in the auto-backporting bot, too.

@vincbeck
Contributor

Sounds good! I'll do it :)

vincbeck pushed a commit to aws-mwaa/upstream-to-airflow that referenced this pull request Sep 19, 2024
@vincbeck
Contributor

#42351

vincbeck added a commit that referenced this pull request Sep 19, 2024
(cherry picked from commit 46b41e3)

Co-authored-by: Joseph Ang <[email protected]>
joaopamaral pushed a commit to joaopamaral/airflow that referenced this pull request Oct 21, 2024
@utkarsharma2 utkarsharma2 added this to the Airflow 2.10.3 milestone Oct 23, 2024
@utkarsharma2 utkarsharma2 added the type:bug-fix Changelog: Bug Fixes label Oct 23, 2024
utkarsharma2 pushed a commit that referenced this pull request Oct 23, 2024
(cherry picked from commit 46b41e3)

Co-authored-by: Joseph Ang <[email protected]>
utkarsharma2 pushed a commit that referenced this pull request Oct 24, 2024
(cherry picked from commit 46b41e3)

Co-authored-by: Joseph Ang <[email protected]>
Labels
area:Triggerer type:bug-fix Changelog: Bug Fixes
Development

Successfully merging this pull request may close these issues.

Airflow Triggerer facing frequent restarts
10 participants