
Single column bigint join round 2 #13432

Merged
raunaqmorarka merged 3 commits into trinodb:master from
skrzypo987:skrzypo/095-bigint-join-round-2
Aug 19, 2022

Conversation

@skrzypo987
Member

@skrzypo987 skrzypo987 commented Aug 1, 2022

Description

This is the next approach to #13178, which has been reverted due to #13380.
This time the memory consumption is not increased as significantly. The additionally allocated long array is indexed by page positions instead of hash buckets, making it significantly smaller. Benchmarks show a 3x smaller increase in peak memory usage of the TPC benchmarks compared to #13178.
The performance gain should be smaller, because this approach leaves less room for the CPU's out-of-order execution. However, after #13352 is merged it should not matter.

For reviewers: The only change between this PR and #13178 is the addressing of the values array in the BigintPagesHash class.
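To make the addressing change concrete, here is a minimal, hypothetical sketch (class name, hashing, and sizing are illustrative — this is not Trino's actual BigintPagesHash): the values array is sized by the number of input positions, while the key array maps the larger power-of-two bucket space onto positions.

```java
import java.util.Arrays;

// Illustrative sketch only: values[] is indexed by position (small),
// key[] is indexed by hash bucket (power of two, at most half full here).
final class BigintPositionIndex
{
    private final int mask;       // bucketCount - 1, bucketCount is a power of two
    private final int[] key;      // bucket -> position, -1 when empty
    private final long[] values;  // position -> join-key value

    BigintPositionIndex(long[] joinKeys)
    {
        int buckets = Integer.highestOneBit(Math.max(4, joinKeys.length * 2 - 1)) * 2;
        mask = buckets - 1;
        key = new int[buckets];
        Arrays.fill(key, -1);
        values = joinKeys.clone();  // one entry per position, not per bucket
        for (int position = 0; position < joinKeys.length; position++) {
            int bucket = bucket(joinKeys[position]);
            while (key[bucket] != -1) {
                bucket = (bucket + 1) & mask;  // linear probing
            }
            key[bucket] = position;
        }
    }

    // returns the matching position, or -1 if the probe value is absent
    int getPosition(long probe)
    {
        int bucket = bucket(probe);
        while (key[bucket] != -1) {
            int position = key[bucket];
            if (values[position] == probe) {  // indexed by position, not bucket
                return position;
            }
            bucket = (bucket + 1) & mask;
        }
        return -1;
    }

    private int bucket(long value)
    {
        long h = value * 0x9E3779B97F4A7C15L;  // cheap multiplicative mixing
        return (int) (h >>> 32) & mask;
    }
}
```

With round 1's bucket-indexed layout, the long array would have had `buckets` entries; here it has only `joinKeys.length`, which is what shrinks the peak memory increase.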

Is this change a fix, improvement, new feature, refactoring, or other?

improvement

Is this a change to the core query engine, a connector, client library, or the SPI interfaces? (be specific)

core query engine

How would you describe this change to a non-technical end user or system administrator?

Improves the performance of joins on a single bigint column

Related issues, pull requests, and links

Documentation

(x) No documentation is needed.
( ) Sufficient documentation is included in this PR.
( ) Documentation PR is available with #prnumber.
( ) Documentation issue #issuenumber is filed, and can be handled later.

Release notes

( ) No release notes entries required.
(x) Release notes entries required with the following suggested text:

# Section
* Improve performance of joins over a single BIGINT column

@cla-bot cla-bot bot added the cla-signed label Aug 1, 2022
@skrzypo987 skrzypo987 requested a review from sopel39 August 1, 2022 04:55
@skrzypo987
Member Author

@arhimondr Can you check if this branch passes the verification that failed in #13380?

Member

@sopel39 sopel39 left a comment


Does it still give good benchmarking results?


private final int mask;
private final int[] key;
private final long[] values;
Member


this array is redundant to the one in pagesHashStrategy. I think we can get rid of the array in pagesHashStrategy.

Member Author


The PagesHash is still used to produce the output page in the appendTo method, so we can get rid of it only if it does not exist in the output. Working on that. I hope that this can be merged as it is, and then we can work on further reducing the memory footprint.

Member


I hope that this can be merged as it is and then we can work on further reducing the memory footprint.

I think we can merge it only if we use bigint path on small hash tables. Otherwise there will still be increased memory usage.

@skrzypo987
Member Author

Does it still give good benchmarking results?

Benchmarks are still running, but common sense tells me that the results are going to be worse. However, after adding batched execution it should get better again.

@skrzypo987 skrzypo987 force-pushed the skrzypo/095-bigint-join-round-2 branch from 01dfbf2 to 8b91c80 Compare August 3, 2022 09:52
@skrzypo987
Member Author

I've added a cut-off. If the number of positions in a single PagesHash exceeds 2^19, values are not stored in a separate array.
The good news is that the memory consumption remains low, probably even slightly lower than previously.
The bad news is that for every position there is an additional branch, which destroys any perf gain observed before.
The good news is that this branch is no longer a problem once we introduce batch processing (#13352). However, this PR by itself does not present any gain.
@sopel39

@sopel39
Member

sopel39 commented Aug 3, 2022

Ok, so it's tabled after #13352

@skrzypo987
Member Author

This PR is a prerequisite for #13352
So we have a chicken and an egg problem

@skrzypo987
Member Author

We can also concat those two PRs and merge as one

@skrzypo987 skrzypo987 force-pushed the skrzypo/095-bigint-join-round-2 branch from 8b91c80 to e90bb50 Compare August 3, 2022 13:50
@skrzypo987
Member Author

The cut-off now makes the join use the DefaultPagesHash, which means some perf gain is going to be visible after merging this PR
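The fallback being described could be sketched roughly as follows (the class, method, and constant names here are made up for illustration; the real factory code and threshold constant in Trino differ):

```java
// Hypothetical sketch of the cut-off: below the position threshold the
// specialized bigint hash is built; above it, the default implementation
// is used so memory consumption does not grow.
final class PagesHashChooser
{
    // mirrors the 2^19 cut-off mentioned in the discussion; illustrative only
    static final int MAX_BIGINT_POSITIONS = 1 << 19;

    static String choose(int positionCount, boolean singleBigintChannel)
    {
        if (singleBigintChannel && positionCount <= MAX_BIGINT_POSITIONS) {
            return "BigintPagesHash";
        }
        return "DefaultPagesHash";
    }
}
```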

Member

@sopel39 sopel39 left a comment


@arhimondr would you be able to test this version?

@skrzypo987 skrzypo987 force-pushed the skrzypo/095-bigint-join-round-2 branch from e90bb50 to fae76c4 Compare August 8, 2022 06:27
Member

@lukasz-stec lukasz-stec left a comment


Mostly LGTM.
There is one potential issue with the generated join classes cache.


// This implementation assumes:
// -There is only one join channel and it is of type bigint
// -arrays used in the hash are always a power of 2.
Member


Suppose we want to save as much memory as possible. In that case, we could use an approach similar to io.trino.operator.HashGenerator#getPartition to find the hash table slot and not rely on a power-of-two hash table size.

Member Author


Outside of scope here. The BigintPagesHash is a clone of the default one. Bigger changes may land in subsequent PRs.
BTW, this PR is definitely not about saving memory. Saving memory on hash tables will usually result in performance regressions.

Member Author


I made some benchmarks and there are two conclusions:

  • Calculating the hash bucket like in HashGenerator#getPartition is slightly slower than a simple bit mask. Slightly, but it is visible in benchmarks.
  • Having a hash table that is too big, instead of one that is perfectly sized, increases performance simply because the average load is usually smaller than the maximum load factor. So this is a simple tradeoff between performance and memory, and there is no incentive to change it to match the load factor perfectly.

I do like this way of thinking though.
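The two slot-mapping strategies under discussion can be sketched like this (general technique only — the exact Trino code in HashGenerator#getPartition may differ): a bit mask requires a power-of-two table, while the multiply-shift "fast range" reduction works for any table size at the cost of an extra multiply.

```java
// Illustrative sketch of the two bucket-mapping strategies.
final class SlotMapping
{
    // power-of-two table: mask off the low bits of the hash
    static int maskSlot(long hash, int mask)
    {
        return (int) hash & mask;
    }

    // arbitrary table size: scale a 32-bit hash onto [0, tableSize)
    // without a division ("fast range" reduction)
    static int fastRangeSlot(long hash, int tableSize)
    {
        return (int) ((Integer.toUnsignedLong(Long.hashCode(hash)) * tableSize) >>> 32);
    }
}
```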


private final int mask;
private final int[] key;
private final long[] values;
Member


why key is in singular and values in plural form?

Member Author


Good question. I'll make a separate PR to fix this in both implementations once this lands


// index pages
for (int position = 0; position < stepSize; position++) {
int realPosition = position + stepBeginPosition;
Member


At least to me, the name position collocates with Block position. What do you think about renaming position to batchIndex and realPosition to addressIndex?

Member Author


Again, out of scope

Contributor

@arhimondr arhimondr left a comment


Is this PR still being worked on? Please let me know when it's ready and I'll run a set of test queries. However, since there's a limit now, I'm pretty confident the memory footprint shouldn't change significantly.


// This implementation assumes arrays used in the hash are always a power of 2
public final class PagesHash
public interface PagesHash
Contributor


An interface call might prevent methods from being inlined. I wonder what the thought process around that is?

Member Author


Those classes are isolated per query and only one implementation is used so with any luck JIT will easily inline it.

* This value is purposefully identical to that of IncrementalLoadFactorHashArraySizeSupplier#THRESHOLD_50,
* as a higher load factor means excessive memory consumption
*/
private static final int BIGINT_SINGLE_CHANNEL_MAX_ADDRESSES = 1 << 20; // 1048576
Contributor


So in theory the memory utilization shouldn't increase by more than 8MB, right?

I wonder how difficult would it be to avoid storing values twice (once in a block and once in an array)?

Member Author


Actually a bit less. We stored a single byte of hash per hash bucket before this PR. Now we store 8 bytes per actual position. The load factor for < 1M positions is 0.5, meaning an average load of 0.375. So we swapped x bytes for 8 * x * 0.375 = 3 * x bytes.
With the number of positions > 1M, the load factor is 0.75, which translates to 4.5 * x bytes.
The previous version that failed the verifier added a constant 8 bytes per hash bucket, regardless of the load factor.
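As a sanity check, the arithmetic above can be reproduced with a tiny hypothetical helper (the assumption, taken from the numbers in the comment, is that average load ≈ 0.75 × the maximum load factor):

```java
// Hypothetical helper reproducing the memory arithmetic above; not Trino code.
final class JoinMemoryEstimate
{
    // extra bytes per hash bucket contributed by 8-byte long values,
    // given the average fraction of buckets that are actually occupied
    static double addedBytesPerBucket(double averageLoad)
    {
        return 8.0 * averageLoad;
    }
}
```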

Member Author


I wonder how difficult would it be to avoid storing values twice (once in a block and once in an array)?

This is, unfortunately, really tricky, since the value blocks are used by other things like filters and sorting and the addressing convention is consistent across all channels, not only joined ones.

@skrzypo987 skrzypo987 force-pushed the skrzypo/095-bigint-join-round-2 branch from fae76c4 to 63f70f5 Compare August 9, 2022 04:37
@skrzypo987
Member Author

Is this PR still being worked on?

Sorry about the mess. The benchmark results started going crazy at some point and I am still trying to figure out the cause.
I will ping you once it is ready

Member

@sopel39 sopel39 left a comment


Do you have newest benchmarks with lowered limit?

@skrzypo987
Member Author

Do you have newest benchmarks with lowered limit?

After running a gazillion benchmarks I finally figured out why the benchmarks are so bad. Isolating both implementations of PagesHash prevents inlining, and performance regresses significantly. Unfortunately, we do not know how many positions will be used at the time of isolation, so this totally ruins the cutoff feature as it is now.

@skrzypo987
Member Author

@arhimondr Can you verify the second-to-last commit, ae20b9f?
This is the one without the cut-off, but it should still use less memory than the one that got reverted.

@skrzypo987 skrzypo987 force-pushed the skrzypo/095-bigint-join-round-2 branch from 63f70f5 to 7d6bde3 Compare August 10, 2022 06:15
LookupSourceSupplier.class,
JoinHashSupplier.class,
JoinHash.class,
PagesHash.class,
Member


Why is this needed? Or rather, where does it start to slow down without it? Optimizing call sites that use a single implementation of an interface is JIT ABC. If that doesn't work at some level, it breaks our assumptions about Java performance.

Member Author


I don't know. Just the fact that it works is enough for me

@skrzypo987 skrzypo987 force-pushed the skrzypo/095-bigint-join-round-2 branch from 7d6bde3 to a0b6aa3 Compare August 10, 2022 14:48
@skrzypo987
Member Author

benchmark-bigint-join-tpcds.pdf
The tpcds benchmark finished. The version without the limit is slightly faster (~1%), but the limit 1M commit regresses about 5% cpu time.
I am going to cherry-pick the queries that suck and try to find a common pattern

@skrzypo987 skrzypo987 force-pushed the skrzypo/095-bigint-join-round-2 branch from a0b6aa3 to 4d35836 Compare August 11, 2022 16:28
@skrzypo987
Member Author

I added PartitionedLookupSource to isolated classes and the regression went away. The gain is minimal compared to master but there is no regression.
benchmark-bigint-join-tpcds.pdf

@skrzypo987 skrzypo987 force-pushed the skrzypo/095-bigint-join-round-2 branch from 4d35836 to b9856da Compare August 11, 2022 17:26
@skrzypo987 skrzypo987 force-pushed the skrzypo/095-bigint-join-round-2 branch from b9856da to 50263d8 Compare August 17, 2022 05:00
@skrzypo987
Member Author

Did some finishing (hopefully) touches:

  • The threshold is fixed, no config option anymore
  • Last two commits are squashed. The limit is introduced along with the actual change

The gains (for unpart orc) are:

  • tpch 6% cpu time gain
  • tpcds < 1% cpu time gain

Many small, cosmetic comments will be addressed in the follow-up PR

Member

@raunaqmorarka raunaqmorarka left a comment


lgtm % comments

@skrzypo987 skrzypo987 force-pushed the skrzypo/095-bigint-join-round-2 branch from 50263d8 to 63e5e61 Compare August 17, 2022 05:27
In the majority of cases the join is on a single bigint column. This commit
introduces a specific code path that will handle only single column bigint
joins. This way we can skip the logic behind value comparisons.

The new class holds bigint values in a long[] array. This provides far superior
value comparison performance. It comes, however, with a higher memory consumption.
That is why the limit of probe side positions is introduced, beyond which the old
implementation is chosen.
@skrzypo987 skrzypo987 force-pushed the skrzypo/095-bigint-join-round-2 branch from 63e5e61 to 1653c92 Compare August 17, 2022 08:42
@raunaqmorarka
Member

@arhimondr do we need to re-run your tests on this? Otherwise it looks good to land to me.

@arhimondr
Contributor

@raunaqmorarka The test run came out clean, no out of memory failures.

Member

@sopel39 sopel39 left a comment


nit: kill switch would still be nice

@raunaqmorarka raunaqmorarka merged commit 1b6724a into trinodb:master Aug 19, 2022
@github-actions github-actions bot added this to the 394 milestone Aug 19, 2022
