Vectorize Hashing in FlatGroupByHash#19302
Conversation
test-other-modules failure is unrelated. I just fixed that in master. Sorry :) |
09ac658 to 68f7df6 (force-pushed)
No worries, thanks for the heads-up; just rebased to pick up the fix.
Started benchmark workflow for this PR.
3bbddf6 to 9fd7533 (force-pushed)
Started benchmark workflow for this PR.
Benchmark Summary: Top 5 duration differences
9fd7533 to 8df5f3a (force-pushed)
martint left a comment:
Squash the first three commits, as they are all essentially incremental parts of the same change.
core/trino-main/src/main/java/io/trino/operator/FlatHashStrategyCompiler.java
8df5f3a to 1a75e64 (force-pushed)
Please note that we added a release notes entry for the performance improvement, @pettyjamesm, for future reference and as an example.
Description
Implements column-wise hash calculations in FlatHashCompiler and changes FlatGroupByHash to use it to implement a batched approach to first computing a range of position hashes and then attempting to insert those positions using the hashes that were precomputed.
By calculating position hashes in a columnar traversal, we can avoid repeated expensive bounds checking per access, allow the JIT to emit more efficient unrolled loops, and access memory in a way that is friendlier and more predictable for CPU caches.
Additionally, when attempting to insert the positions into the FlatGroupByHash after precomputing the hash codes, we can start loading the relevant portion of the hash table into memory sooner, since we aren't intermixing the hash computation with the table's memory accesses.
Since re-hashing is still performed row-at-a-time, the performance improvement is most significant when the number of groups is small and when the number of columns in the hash calculation is larger.
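The two-phase scheme described above can be sketched as follows. This is an illustrative example only, not Trino's actual FlatGroupByHash or FlatHashStrategyCompiler code: the class name, method names, and the particular hash mixing constants are all hypothetical, and the real implementation generates this code via bytecode compilation over typed blocks rather than plain long arrays.

```java
// Illustrative sketch (hypothetical names, not Trino's actual code) of the
// two-phase approach: first compute all position hashes column-by-column,
// then probe the hash table using the precomputed hashes.
public class ColumnarHashBatch
{
    // Mix one column's values into the running per-position hashes.
    // Iterating a single flat array per column keeps the inner loop
    // branch-free and easy for the JIT to unroll.
    static void combineColumnHashes(long[] hashes, long[] columnValues)
    {
        for (int position = 0; position < hashes.length; position++) {
            long valueHash = Long.rotateLeft(columnValues[position] * 0xC2B2AE3D27D4EB4FL, 31) * 0x9E3779B97F4A7C15L;
            hashes[position] = hashes[position] * 31 + valueHash;
        }
    }

    // Phase 1: columnar traversal producing one hash per input position.
    static long[] computeHashes(long[][] columns, int positionCount)
    {
        long[] hashes = new long[positionCount];
        for (long[] column : columns) {
            combineColumnHashes(hashes, column);
        }
        return hashes;
    }

    public static void main(String[] args)
    {
        long[][] columns = {
                {1, 2, 1},    // column A values for positions 0..2
                {10, 20, 10}, // column B values for positions 0..2
        };
        long[] hashes = computeHashes(columns, 3);
        // Phase 2 (not shown): probe the group-by table with the ready-made
        // hashes, so table memory loads aren't interleaved with hashing.
        for (long hash : hashes) {
            System.out.println(Long.toHexString(hash));
        }
    }
}
```

The key point is the loop shape in phase 1: each inner loop touches exactly one column array, so bounds checks and value loads are predictable, and positions 0 and 2 (identical rows in this toy input) deterministically produce the same hash for the phase-2 probe.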
BenchmarkGroupByHash.addPages results generally improve between 5% and 50%, depending on the scenario.
Release notes
(x) This is not user-visible or is docs only, and no release notes are required.
( ) Release notes are required. Please propose a release note for me.
( ) Release notes are required, with the following suggested text: