Single column bigint join round 2 #13432
Conversation

@arhimondr Can you check if this branch passes the verification that failed in #13380?
sopel39 left a comment:
Does it still give good benchmarking results?
private final int mask;
private final int[] key;
private final long[] values;
This array is redundant with the one in pagesHashStrategy. I think we can get rid of the array in pagesHashStrategy.
The PagesHash is still used to produce the output page in the appendTo method, so we can get rid of it only if it does not exist in the output. Working on that. I hope that this can be merged as it is and then we can work on further reducing the memory footprint.
> I hope that this can be merged as it is and then we can work on further reducing the memory footprint.
I think we can merge it only if we use the bigint path on small hash tables. Otherwise there will still be increased memory usage.
Benchmarks are still running, but common sense tells me that the results are going to be worse. However, after adding batched execution it should get better again.
Force-pushed 01dfbf2 to 8b91c80
I've added a cut-off. If the number of positions in a single …
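The cut-off can be sketched as a small selection routine. This is a hedged illustration, not Trino's actual API: only the 1 << 20 limit and the fallback to the default implementation come from this thread; every name below is hypothetical.

```java
// Hypothetical sketch of the cut-off decision discussed in this PR: small build
// sides with a single bigint join channel get the specialized hash, everything
// else falls back to the default implementation.
public final class PagesHashFactory {
    // Mirrors the 1 << 20 limit discussed later in this review
    private static final int BIGINT_SINGLE_CHANNEL_MAX_POSITIONS = 1 << 20;

    public static String choosePagesHash(int joinChannelCount, boolean joinChannelIsBigint, int positionCount) {
        if (joinChannelCount == 1 && joinChannelIsBigint
                && positionCount <= BIGINT_SINGLE_CHANNEL_MAX_POSITIONS) {
            return "BigintPagesHash";
        }
        return "DefaultPagesHash";
    }

    public static void main(String[] args) {
        System.out.println(choosePagesHash(1, true, 1000));      // BigintPagesHash
        System.out.println(choosePagesHash(1, true, 2_000_000)); // DefaultPagesHash
    }
}
```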
OK, so it's tabled after #13352.

This PR is a prerequisite for #13352.

We can also concat those two PRs and merge them as one.
Force-pushed 8b91c80 to e90bb50
The cut-off now makes the join use the DefaultPagesHash, which means some perf gain is going to be visible after merging this PR.
sopel39 left a comment:
@arhimondr would you be able to test this version?
Force-pushed e90bb50 to fae76c4
lukasz-stec left a comment:
Mostly LGTM. There is one potential issue with the generated join classes cache.
// This implementation assumes:
// - There is only one join channel and it is of type bigint
// - arrays used in the hash are always a power of 2.
Suppose we want to preserve as much memory as possible. In that case, we could use an approach similar to io.trino.operator.HashGenerator#getPartition to find the hash table slot and not rely on the power of two hash table size.
Outside of scope here. The BigintPagesHash is a clone of the default one. Bigger changes may land in subsequent PRs.
BTW this PR is definitely not about saving memory. Saving memory on hash tables will usually result in performance regressions.
I made some benchmarks and there are two conclusions:
- Calculating the hash bucket like in HashGenerator#getPartition is slightly slower than a simple bit mask. Slightly, but it is visible in benchmarks.
- Having a too-big hash table instead of a perfectly sized one increases performance, simply because the average load is usually smaller than the max load factor. So this is a simple tradeoff between performance and memory, and there is no incentive to change it to match the load factor perfectly.

I do like this way of thinking though
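The two bucket-selection strategies benchmarked here can be sketched as follows. This is a hedged illustration: the modulo variant is only similar in spirit to io.trino.operator.HashGenerator#getPartition, not its actual code.

```java
// Illustrative comparison of bucket selection: a power-of-two table allows a
// single AND instruction, while an arbitrarily sized table needs a modulo.
public final class BucketSelection {
    // Power-of-two table: one bitwise AND
    static int maskBucket(long hash, int tableSize) {
        return (int) (hash & (tableSize - 1)); // requires tableSize to be a power of 2
    }

    // Arbitrary table size: clear the sign bit, then take a modulo (slower but size-flexible)
    static int moduloBucket(long hash, int tableSize) {
        return (int) ((hash & Long.MAX_VALUE) % tableSize);
    }

    public static void main(String[] args) {
        long hash = 0x9E3779B97F4A7C15L;
        System.out.println(maskBucket(hash, 1024));
        System.out.println(moduloBucket(hash, 1000));
    }
}
```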
private final int mask;
private final int[] key;
private final long[] values;
Why is key singular and values plural?
Good question. I'll make a separate PR to fix this in both implementations once this lands.
// index pages
for (int position = 0; position < stepSize; position++) {
    int realPosition = position + stepBeginPosition;
At least to me, the name position reads as a Block position. What do you think about renaming position to batchIndex and realPosition to addressIndex?
core/trino-main/src/main/java/io/trino/operator/join/JoinHashSupplier.java (outdated; resolved)
arhimondr left a comment:
Is this PR still being worked on? Please let me know when it's ready to run a set of test queries. However, since there's a limit now, I'm pretty confident the memory footprint shouldn't change significantly.
// This implementation assumes arrays used in the hash are always a power of 2
- public final class PagesHash
+ public interface PagesHash
An interface call might prevent methods from being inlined. I wonder what the thought process is around that?
Those classes are isolated per query and only one implementation is used, so with any luck the JIT will easily inline it.
 * This value is purposefully identical to that of IncrementalLoadFactorHashArraySizeSupplier#THRESHOLD_50,
 * as higher load factor means more excessive memory consumption
 */
private static final int BIGINT_SINGLE_CHANNEL_MAX_ADDRESSES = 1 << 20; // 1048576
So in theory the memory utilization shouldn't increase by more than 8MB, right?
I wonder how difficult would it be to avoid storing values twice (once in a block and once in an array)?
Actually a bit less. We stored a single-byte hash per hash bucket before this PR. Now we store 8 bytes per actual position. The load factor for < 1M positions is 0.5, meaning an average load of 0.375. So we swapped x bytes for 8 * x * 0.375 = 3 * x bytes.
With the number of positions > 1M, the load factor is 0.75, which translates to 4.5 * x bytes.
The previous version that failed the verifier added a constant 8 bytes, regardless of the load factor.
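The arithmetic above can be sanity-checked. Assumptions taken from the comment itself: 1 byte per bucket before the PR, 8 bytes per occupied position after it, and an average load of 0.75 × the max load factor because power-of-two sizing leaves the table between half-full and full.

```java
// Quick check of the memory arithmetic in the comment above (assumptions as stated
// in the lead-in; this is a back-of-the-envelope model, not Trino code).
public final class MemoryMath {
    static double newBytesPerOldByte(double maxLoadFactor) {
        double averageLoad = 0.75 * maxLoadFactor; // midpoint between half-full and full
        return 8 * averageLoad;                    // 8 bytes per position, per old 1-byte bucket
    }

    public static void main(String[] args) {
        System.out.println(newBytesPerOldByte(0.5));  // 3.0 -> the "3 * x bytes" case (< 1M positions)
        System.out.println(newBytesPerOldByte(0.75)); // 4.5 -> the "4.5 * x bytes" case (> 1M positions)
    }
}
```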
> I wonder how difficult would it be to avoid storing values twice (once in a block and once in an array)?
This is, unfortunately, really tricky, since the value blocks are used by other things like filters and sorting and the addressing convention is consistent across all channels, not only joined ones.
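The shared addressing convention mentioned here can be sketched as a long that packs a page index and a position, similar in spirit to Trino's SyntheticAddress. A hedged illustration, with hypothetical names: the point is only that filters, sorting, and output all decode the same packed address, so the value blocks cannot simply be dropped.

```java
// Illustrative sketch of a packed build-side address: (pageIndex, position) in one long.
public final class Addresses {
    static long encode(int pageIndex, int position) {
        return (((long) pageIndex) << 32) | (position & 0xFFFFFFFFL);
    }

    static int decodePageIndex(long address) {
        return (int) (address >>> 32);
    }

    static int decodePosition(long address) {
        return (int) address;
    }

    public static void main(String[] args) {
        long address = encode(3, 42);
        System.out.println(decodePageIndex(address) + ", " + decodePosition(address)); // 3, 42
    }
}
```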
Force-pushed fae76c4 to 63f70f5
Sorry about the mess. The benchmark results started going crazy at some point and I am still trying to figure out the cause.
After running a gazillion benchmarks I finally figured out why the benchmark results are so bad. Isolating both implementations of …
@arhimondr Can you verify the second-to-last commit, ae20b9f?
Force-pushed 63f70f5 to 7d6bde3
LookupSourceSupplier.class,
JoinHashSupplier.class,
JoinHash.class,
PagesHash.class,
Why is this needed? Or maybe, where does it start to slow down without it? This is JIT ABC to optimize classes that use a single implementation of an interface. If this doesn't work at some level, it breaks our assumptions about Java perf.
I don't know. Just the fact that it works is enough for me.
Force-pushed 7d6bde3 to a0b6aa3
benchmark-bigint-join-tpcds.pdf
Force-pushed a0b6aa3 to 4d35836
I added PartitionedLookupSource to the isolated classes and the regression went away. The gain is minimal compared to master, but there is no regression.
Force-pushed 4d35836 to b9856da
Force-pushed b9856da to 50263d8
Did some (hopefully) finishing touches:
The gains (for unpartitioned ORC) are:
Many small, cosmetic comments will be addressed in a follow-up PR.
core/trino-main/src/main/java/io/trino/operator/IncrementalLoadFactorHashArraySizeSupplier.java (outdated; resolved)
core/trino-main/src/main/java/io/trino/operator/join/BigintPagesHash.java (outdated; resolved)
Force-pushed 50263d8 to 63e5e61
In the majority of cases the join is on a single bigint column. This commit introduces a specific code path that handles only single-column bigint joins. This way we can skip the logic behind value comparisons. The new class holds bigint values in a long[] array, which provides far superior value-comparison performance. It comes, however, with higher memory consumption. That is why a limit on the number of build-side positions is introduced, beyond which the old implementation is chosen.
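The idea in this commit message can be sketched as a minimal open-addressing hash over a long[] with a power-of-two table, so probe-side comparison is a primitive == instead of a Block comparison. This is an illustrative sketch only; the real BigintPagesHash also tracks addresses, nulls, duplicate keys, and uses a stronger hash.

```java
import java.util.Arrays;

// Minimal illustration of a single-bigint join hash: keys in a plain long[] with
// linear probing over a power-of-two table. Not Trino's actual implementation.
public final class BigintHashSketch {
    private final int mask;
    private final long[] values; // build-side join keys, indexed by position
    private final int[] key;     // hash table: slot -> build position, -1 = empty

    BigintHashSketch(long[] buildValues) {
        // power-of-two capacity at load factor <= 0.5
        int capacity = 1;
        while (capacity < buildValues.length * 2) {
            capacity <<= 1;
        }
        mask = capacity - 1;
        values = buildValues;
        key = new int[capacity];
        Arrays.fill(key, -1);
        for (int position = 0; position < buildValues.length; position++) {
            int slot = hashPosition(buildValues[position]);
            while (key[slot] != -1) {
                slot = (slot + 1) & mask; // linear probing
            }
            key[slot] = position;
        }
    }

    // returns the build-side position of the probe value, or -1 if absent
    int getJoinPosition(long probeValue) {
        int slot = hashPosition(probeValue);
        while (key[slot] != -1) {
            if (values[key[slot]] == probeValue) {
                return key[slot]; // primitive comparison, no Block access needed
            }
            slot = (slot + 1) & mask;
        }
        return -1;
    }

    private int hashPosition(long value) {
        long h = value * 0x9E3779B97F4A7C15L; // cheap multiplicative mix
        return (int) ((h >>> 32) & mask);
    }

    public static void main(String[] args) {
        BigintHashSketch hash = new BigintHashSketch(new long[] {10, 20, 30});
        System.out.println(hash.getJoinPosition(20)); // 1
        System.out.println(hash.getJoinPosition(99)); // -1
    }
}
```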
Force-pushed 63e5e61 to 1653c92
@arhimondr do we need to re-run your tests on this? Otherwise it looks good to land to me.
@raunaqmorarka The test run came out clean, no out-of-memory failures.
sopel39 left a comment:
nit: a kill switch would still be nice
Description
This is the next approach to #13178, which was reverted due to #13380.
This time the memory consumption is not increased as significantly. The additionally allocated long array is indexed by page positions instead of hash buckets, making it significantly smaller. Benchmarks show 3x less of an increase in peak memory usage of TPC benchmarks compared to #13178.
The performance gain should be smaller, because the CPU's out-of-order execution is less possible with this approach. However, after merging #13352 it should not matter.
For reviewers: The only change between this PR and #13178 is the addressing of the values array in the BigintPagesHash class.

improvement
core query engine
Increase the performance of join on single bigint column
Related issues, pull requests, and links
Documentation
(x) No documentation is needed.
( ) Sufficient documentation is included in this PR.
( ) Documentation PR is available with #prnumber.
( ) Documentation issue #issuenumber is filed, and can be handled later.
Release notes
( ) No release notes entries required.
(x) Release notes entries required with the following suggested text: