Skip pre hash computation for join when input is table scan #20948
feilong-liu merged 1 commit into prestodb:master
Conversation
vivek-bharathan
left a comment
Please add tests showing plan changes
I'm curious why we would ever add this hash computation when the parent does not require it. I.e. wouldn't this check simply always be
"return hashComputation.isPresent() && !parentPreference.getHashes().contains(hashComputation.get());"
This optimization is based on our observation that a TableScan below a join is significantly faster than ScanProject (here the project is for hash generation) for a BIGINT join key. We do not observe the same for other cases.
What about other operators like filter/project on top of table scan, or values?
In the verifier suite, I didn't observe the same performance improvement for those cases, hence I limit this to the specific case where I see the most significant performance improvement.
This was similar to my question above.
If you look at this comment in the code, it seems to suggest that aggregations in general perform better for BIGINTs without the generated hash. I suspect the same principle applies to joins. I wonder if we are special-casing this too much by adding the TableScanNode check.
Would it be possible to share the benchmarks you are seeing this behavior on?
If you look at this comment in the code, it seems to suggest that aggregations in general perform better for BIGINTs without the generated hash.
I see; this is because we have a custom group-by hash table, BigintGroupByHash, for group by on a single BIGINT column, which does not use an existing pre-computed hash. I'm not sure whether join sees the same pattern, but currently we do not have a specialized join hash implementation for BIGINT like we do for group by, so what applies to group by may not carry over to join here.
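To illustrate the point above: a specialized single-BIGINT hash table can hash the raw `long` key directly, so a separately precomputed hash column is never read. The sketch below is hypothetical (`hashBigint`, `probeSpecialized`, and the mixing constant are illustrative, not Presto's actual code); it only shows why the extra hash column is dead weight on the specialized path.

```java
// Hypothetical sketch: why a precomputed hash column is wasted work when a
// specialized single-BIGINT hash table (like BigintGroupByHash) is in play.
public class BigintHashSketch {
    // Simple 64-bit mix standing in for a real bigint hash function.
    static long hashBigint(long value) {
        long h = value * 0x9E3779B97F4A7C15L;
        return h ^ (h >>> 32);
    }

    // Specialized path: hashes the raw key itself; the precomputed hash
    // column is carried along but never consulted.
    static long probeSpecialized(long key, long precomputedHash) {
        return hashBigint(key); // precomputedHash is dead weight here
    }

    public static void main(String[] args) {
        long key = 42L;
        long precomputed = hashBigint(key); // cost paid upstream for nothing
        System.out.println(probeSpecialized(key, precomputed) == hashBigint(key));
    }
}
```

A generic join, by contrast, has no such specialized path today, so whether the precomputed hash helps or hurts is an empirical question, which is what the benchmarking discussion below addresses.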
Would it be possible to share the benchmarks you are seeing this behavior on?
The benchmarks are based on production queries and cannot be shared.
But the queries that improve most are those with a table scan as the input source of a join. The plan changes from join <- ScanProject to join <- TableScan, and the biggest savings come from replacing ScanProject with TableScan, especially when the input table is huge. This is why I want to specialize for this case in this optimization.
Fair enough. We can always relax this constraint in the future if needed
Added unit plan test
Description
Skip hash generation for a join when the input is a table scan and the hash is on a single BIGINT column that is not reused later.
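The skip condition in this description can be sketched as a predicate over the plan. This is a minimal illustration, not the actual optimizer code: the stub node classes and the `skipHashPrecomputation` helper are hypothetical, and real planner types carry far more state.

```java
import java.util.List;

// Hypothetical sketch of the skip condition described above: the join input
// must be a bare table scan, the join key a single BIGINT, and the computed
// hash must not be reused by any parent node.
public class SkipHashSketch {
    interface PlanNode {}
    static class TableScanNode implements PlanNode {}
    static class ProjectNode implements PlanNode {}

    static boolean skipHashPrecomputation(PlanNode joinInput,
                                          List<String> joinKeyTypes,
                                          boolean hashReusedByParent) {
        return joinInput instanceof TableScanNode   // bare scan, no project below the join
                && joinKeyTypes.size() == 1         // single join key
                && joinKeyTypes.get(0).equals("bigint")
                && !hashReusedByParent;             // no parent wants the hash
    }

    public static void main(String[] args) {
        // Qualifies: table scan input, single BIGINT key, hash not reused.
        System.out.println(skipHashPrecomputation(new TableScanNode(), List.of("bigint"), false));
        // Does not qualify: a project already sits between the scan and the join.
        System.out.println(skipHashPrecomputation(new ProjectNode(), List.of("bigint"), false));
    }
}
```

When the predicate holds, the plan keeps the shape join <- TableScan instead of join <- ScanProject, which is where the discussion above reports the savings.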
Motivation and Context
We observed in production queries that hash precomputation actually hurts performance (both CPU and latency) for the case described above. Hence, add an option to disable hash precomputation for it.
Impact
CPU and latency improvement for the targeted queries.
Test Plan
Existing unit tests and verifier test
Contributor checklist
Release Notes
Please follow release notes guidelines and fill in the release notes below.