Skip pre hash computation for join when input is table scan #20948

Merged
feilong-liu merged 1 commit into prestodb:master from feilong-liu:disable_hashgen
Dec 8, 2023

@feilong-liu feilong-liu commented Sep 23, 2023

Description

Skip hash generation for a join when the input is a table scan, the hash is on a single BIGINT column, and the hash is not reused later.
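
The conditions above can be read as a single predicate. The following is a minimal sketch under stated assumptions, not the code merged in this PR: the class, enums, and parameter names are all hypothetical stand-ins for planner concepts, and the real optimizer works on plan nodes rather than on these simplified values.

```java
import java.util.List;
import java.util.Set;

public class SkipHashGenerationSketch {
    // Hypothetical stand-ins for planner concepts; these are not Presto classes.
    enum InputKind { TABLE_SCAN, FILTER, PROJECT, VALUES }
    enum KeyType { BIGINT, VARCHAR, DOUBLE }

    /**
     * Returns true only when every condition from the description holds:
     * the option is enabled, the join input is a bare table scan, the join
     * key is a single BIGINT column, and nothing above the join asks for
     * the same hash.
     */
    static boolean shouldSkipHashGeneration(
            boolean optionEnabled,
            InputKind joinInput,
            List<KeyType> joinKeyTypes,
            Set<String> hashesRequiredByParent,
            String joinKeyHash) {
        boolean singleBigintKey =
                joinKeyTypes.size() == 1 && joinKeyTypes.get(0) == KeyType.BIGINT;
        boolean reusedLater = hashesRequiredByParent.contains(joinKeyHash);
        return optionEnabled
                && joinInput == InputKind.TABLE_SCAN
                && singleBigintKey
                && !reusedLater;
    }

    public static void main(String[] args) {
        // Table scan input, single BIGINT key, hash not needed upstream.
        System.out.println(shouldSkipHashGeneration(
                true, InputKind.TABLE_SCAN, List.of(KeyType.BIGINT), Set.of(), "hash(k)"));
        // Same join, but a parent operator reuses the hash.
        System.out.println(shouldSkipHashGeneration(
                true, InputKind.TABLE_SCAN, List.of(KeyType.BIGINT), Set.of("hash(k)"), "hash(k)"));
    }
}
```

In this sketch a filter or project above the scan, a non-BIGINT key, a multi-column key, or a hash that a parent reuses each keeps hash generation in place.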

Motivation and Context

We observed in production queries that hash precomputation actually hurts performance (both CPU and latency) in the case described above. This change therefore adds an option to disable hash precomputation for that case.

Impact

CPU and latency improvement for the targeted queries.

Test Plan

Existing unit tests and verifier test

Contributor checklist

  • Please make sure your submission complies with our development, formatting, commit message, and attribution guidelines.
  • PR description addresses the issue accurately and concisely. If the change is non-trivial, a GitHub Issue is referenced.
  • Documented new properties (with its default value), SQL syntax, functions, or other functionality.
  • If release notes are required, they follow the release notes guidelines.
  • Adequate tests were added if applicable.
  • CI passed.

Release Notes

Please follow release notes guidelines and fill in the release notes below.

== RELEASE NOTES ==

General Changes
* Add a session property to skip hash precomputation for a join when the input is a table scan, the hash is on a single BIGINT column, and the hash is not reused later. The behavior is controlled by the session property `skip_hash_generation_for_join_with_table_scan_input`, which is disabled by default.

@feilong-liu feilong-liu requested a review from a team as a code owner September 23, 2023 00:29
@feilong-liu feilong-liu marked this pull request as draft September 23, 2023 00:30
@feilong-liu feilong-liu force-pushed the disable_hashgen branch 5 times, most recently from b183b79 to bfae0e7 Compare October 27, 2023 18:39
@feilong-liu feilong-liu changed the title from "Skip pre hash computation for join if key is single bigint and key hash not reused" to "Skip pre hash computation for join when input is table scan" Oct 27, 2023
@feilong-liu feilong-liu marked this pull request as ready for review October 27, 2023 18:53

@vivek-bharathan vivek-bharathan left a comment

Please add tests showing plan changes

Contributor

I'm curious why we would ever add this hash computation when the parent does not require it. I.e., wouldn't this check simply always be
`return hashComputation.isPresent() && !parentPreference.getHashes().contains(hashComputation.get());`

Contributor Author

This optimization is based on our observation that a TableScan below the join is significantly faster than a ScanProject (where the project exists only for hash generation) when the join key is a BIGINT. We did not observe the same for other cases.

@feilong-liu feilong-liu force-pushed the disable_hashgen branch 2 times, most recently from 30f9a74 to bb8c19c Compare October 30, 2023 22:04
Contributor

What about other operators, like filter/project on top of a table scan, or values?

Contributor Author

In the verifier suite, I didn't observe the same performance improvement for these cases, so I limited the change to the specific case where I see the most significant improvement.

Contributor

This was similar to my question above.
If you look at this comment in the code, it seems to suggest that aggregations in general perform better for BIGINTs without the generated hash. I suspect the same principle applies to joins. I wonder if we are special-casing this too much by adding the TableScanNode check.
Would it be possible to share the benchmarks you are seeing this behavior on?

Contributor Author

> If you look at this comment in the code, it seems to suggest that aggregations in general perform better for BIGINTs without the generated hash.

I see. That is because we have a custom group-by hash, BigintGroupByHash, for GROUP BY on a single BIGINT column, and it does not use an existing precomputed hash. I'm not sure whether joins will show the same pattern, since we currently have no specialized join hash implementation for BIGINT comparable to the one for GROUP BY. So what applies to GROUP BY may not carry over to joins.
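
To illustrate why a hash table specialized for a single BIGINT key has no use for a precomputed hash column, here is a minimal sketch of an open-addressing set keyed directly on `long` values. It is an assumption-laden illustration and not Presto's BigintGroupByHash: the class name, the linear-probing layout, and the mixing function (the MurmurHash3 64-bit finalizer) are all chosen here for the example.

```java
public class BigintKeyTableSketch {
    private final long[] keys;
    private final boolean[] used;
    private final int mask;
    private int size;

    BigintKeyTableSketch(int capacityPowerOfTwo) {
        if (capacityPowerOfTwo < 2 || Integer.bitCount(capacityPowerOfTwo) != 1) {
            throw new IllegalArgumentException("capacity must be a power of two >= 2");
        }
        keys = new long[capacityPowerOfTwo];
        used = new boolean[capacityPowerOfTwo];
        mask = capacityPowerOfTwo - 1;
    }

    // Illustrative 64-bit mixer (MurmurHash3 finalizer); an assumption,
    // not the hash function Presto uses.
    static long mix(long x) {
        x ^= x >>> 33;
        x *= 0xff51afd7ed558ccdL;
        x ^= x >>> 33;
        x *= 0xc4ceb9fe1a85ec53L;
        x ^= x >>> 33;
        return x;
    }

    // Adds a key, hashing the raw long value in place. Returns true if the
    // key was new. No precomputed hash is passed in or consulted.
    boolean add(long key) {
        int pos = (int) mix(key) & mask;
        while (used[pos]) {
            if (keys[pos] == key) {
                return false;
            }
            pos = (pos + 1) & mask; // linear probing
        }
        // Always keep one slot empty so the probe loop terminates.
        if (size == keys.length - 1) {
            throw new IllegalStateException("table full");
        }
        used[pos] = true;
        keys[pos] = key;
        size++;
        return true;
    }

    int size() {
        return size;
    }
}
```

Because `add` derives the slot from the key itself, any hash column computed upstream would go unused by a table like this; whether a join lookup could benefit in the same way is the open question raised in the comment above.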

> Would it be possible to share the benchmarks you are seeing this behavior on?

The benchmarks are based on production queries and cannot be shared.
But the queries that improve most are those with a table scan as the input source of the join. The plan changed from join <- ScanProject to join <- TableScan, and the biggest savings come from changing ScanProject to TableScan, especially when the input table is huge. That is why I want to specialize this optimization for this case.
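
The plan-shape change described here can be sketched as a toy function that renders the join's input with and without hash generation. This is only an illustration: the string format, the `hash(...)` column label, and the class and method names are hypothetical and do not reproduce Presto's actual plan output.

```java
public class JoinInputPlanSketch {
    // Renders the input side of a join for a given table and join key.
    // With hash generation, the scan is fused with a projection that adds
    // a hash column; without it, the join reads from the bare table scan.
    static String joinInput(String table, String key, boolean skipHashGeneration) {
        if (skipHashGeneration) {
            return "TableScan[" + table + "]";
        }
        return "ScanProject[" + table + ", " + key + ", hash(" + key + ")]";
    }

    static String joinPlan(String table, String key, boolean skipHashGeneration) {
        return "join <- " + joinInput(table, key, skipHashGeneration);
    }

    public static void main(String[] args) {
        System.out.println(joinPlan("orders", "orderkey", false));
        System.out.println(joinPlan("orders", "orderkey", true));
    }
}
```

The second form carries one fewer column per row and avoids the per-row projection, which is consistent with the savings being largest on huge input tables.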

Contributor

Fair enough. We can always relax this constraint in the future if needed

@feilong-liu
Contributor Author

> Please add tests showing plan changes

Added unit plan test

@feilong-liu feilong-liu requested review from vivek-bharathan and removed request for vivek-bharathan November 9, 2023 19:04
@feilong-liu feilong-liu requested a review from mlyublena December 7, 2023 04:58
@feilong-liu feilong-liu merged commit d9fc66d into prestodb:master Dec 8, 2023
@feilong-liu feilong-liu deleted the disable_hashgen branch December 8, 2023 22:47
@wanglinsong wanglinsong mentioned this pull request Feb 12, 2024
64 tasks