
Bigint group by hash optimizations #12095

Merged
sopel39 merged 6 commits into trinodb:master from skrzypo987:skrzypo/068-bigint-group-by-hash-optimizations on Apr 28, 2022

Conversation

@skrzypo987 (Member) commented Apr 22, 2022

Description

A set of small optimizations for the BigintGroupByHash class.

Is this change a fix, improvement, new feature, refactoring, or other?

improvement, refactoring

Is this a change to the core query engine, a connector, client library, or the SPI interfaces? (be specific)

core query engine

How would you describe this change to a non-technical end user or system administrator?

Nothing changes for a non-technical user.

Related issues, pull requests, and links

Documentation

(x) No documentation is needed.
( ) Sufficient documentation is included in this PR.
( ) Documentation PR is available with #prnumber.
( ) Documentation issue #issuenumber is filed, and can be handled later.

Release notes

( ) No release notes entries required.
( ) Release notes entries required with the following suggested text:

# Section
* Improve performance of bigint aggregation

@skrzypo987 force-pushed the skrzypo/068-bigint-group-by-hash-optimizations branch from 9939a98 to ec1b447 on April 22, 2022 09:51
@cla-bot added the cla-signed label Apr 22, 2022
@skrzypo987 force-pushed the skrzypo/068-bigint-group-by-hash-optimizations branch from ec1b447 to 7c9ce26 on April 23, 2022 12:10
@radek-kondziolka (Contributor):

Could you attach some macro-benchmark results?

Contributor:

Why do we do it? We iterate by values instead of "groups". Then we have one more sequential random access on line 314?

Member Author:

I don't follow. "Sequential random access" - sequential or random?

Contributor:

I mean random. I don't understand why we changed it.

Member Author:

The previous approach iterated over groups and then jumped around both the groupIds and values arrays, which was completely random access. Now we iterate over those two arrays in sequential order. newGroupIds will also be traversed quasi-sequentially, since it follows the same order as groupIds - it is ordered by hash.
Reach out to me offline if you need a more detailed description.
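
To make the access pattern concrete, here is a minimal sketch of a rehash that walks the hash-table arrays in bucket order. It is not the actual Trino code; the names and the placeholder hash function are assumptions for illustration only.

```java
import java.util.Arrays;

// Hypothetical sketch, not the real BigintGroupByHash code.
final class RehashSketch
{
    // Rehash by scanning the old hash-table arrays in bucket order, instead of
    // iterating over group ids and jumping around memory.
    static void rehash(long[] values, int[] groupIds, long[] newValues, int[] newGroupIds)
    {
        int newMask = newGroupIds.length - 1;                        // capacities are powers of two
        Arrays.fill(newGroupIds, -1);
        for (int bucket = 0; bucket < groupIds.length; bucket++) {   // sequential scan
            int groupId = groupIds[bucket];
            if (groupId == -1) {
                continue;                                            // empty bucket
            }
            long value = values[bucket];                             // sequential read
            int newBucket = mix(value) & newMask;
            while (newGroupIds[newBucket] != -1) {                   // linear probing
                newBucket = (newBucket + 1) & newMask;
            }
            newValues[newBucket] = value;                            // writes follow hash order,
            newGroupIds[newBucket] = groupId;                        // hence quasi-sequentially
        }
    }

    // Placeholder hash mix; the real implementation uses a murmur-style function.
    private static int mix(long value)
    {
        long h = value * 0x9E3779B97F4A7C15L;
        return (int) (h ^ (h >>> 32)) & Integer.MAX_VALUE;
    }
}
```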

Contributor:

Yeah, I get it.

@radek-kondziolka (Contributor) commented Apr 25, 2022:

Does that change make it possible for some query to fail now that did not fail before (when spill to disk is enabled)?
I mean the situation where we have to process the last page and that page contains 100k rows that are all going to be assigned to one group.

Member Author:

It is possible that a query that was already extremely close to failing would fail here and not before.
But the case where:
1. there are a lot of groups,
2. but the last page contains only one (and is non-null),
is quite unusual.
Also, pages with 100k records defeat the purpose of a page. We sometimes produce 8k-row pages in the ORC reader, but that is the maximum.
The initial max fill is around 12k, so with a small number of groups this won't trigger a resize.

Contributor:

It was just an example. Obviously we do not need such a specific case to reveal that issue. Actually, the last page can contain far fewer rows - like 2k, or 300, or fewer. And it is not required that the last page contains only one group. The last page can contain a lot of groups as well and the query could still fail (whereas it succeeded before).

Member Author:

I think the requirements for that case are less probable than you think.
First of all, in order to run out of memory the query needs to be big.
So for a big enough query we have a case where:

  • There are a lot of groups (so the hash table is big enough). Let's assume 8M buckets (tpch/q18).
  • The last page contains one or more groups that have not been seen before. Given that billions of incoming rows have already been processed, this is highly unlikely.
  • The few added groups hit exactly the current limit. Given that the hash table has 4M buckets, there is roughly a 1-in-4M chance of that per new group.
  • Increasing the hash table size takes an additional 4M * 12B = 48MB (rough arithmetic sketched below). This number then needs to hit exactly the memory limit. If the query is big enough to spawn millions of groups, the node/cluster probably has hundreds of GB of memory.

So, as you can see, the probability of hitting such an issue is minimal.
If you have a more likely example, please share it here.
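
For reference, a back-of-the-envelope version of that last bullet, assuming (for illustration only) that each bucket holds an 8-byte long value and a 4-byte int group id:

```java
// Hypothetical arithmetic only; the 12-byte bucket layout is an assumption.
final class ResizeCostSketch
{
    static long resizeCostBytes(long addedBuckets)
    {
        long bytesPerBucket = Long.BYTES + Integer.BYTES;   // 8-byte value + 4-byte group id = 12 bytes
        return addedBuckets * bytesPerBucket;               // 4_000_000 * 12 = 48_000_000 bytes ≈ 48 MB
    }
}
```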

Contributor:

I can imagine that the probability of such a case is minimal. Rather, the question for me is: is it acceptable in comparison to the performance gain? Very possibly yes - we can accept it. It is a kind of tradeoff (for me).

@sopel39 (Member) left a comment:

lgtm % comments, up to "Replace block builder with a simple long[]"

Member:

This is more expensive than ensureCapacity, as ensureCapacity only adds segments. Can it be slower sometimes in practice? Multi-channel still has io.trino.operator.MultiChannelGroupByHash#groupAddressByGroupId.

Member Author:

In theory I am sure there is a case where this is slower, but for our data I don't think so.
Copying memory is extremely fast and this is amortized O(1). There is also only one allocation, compared to one per 1024 values in a BigArray.
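
To illustrate the trade-off (a sketch with hypothetical names, not the actual code): a flat long[] that doubles on demand pays one allocation plus a copy per resize, which is amortized O(1) per appended value, whereas a segmented big array allocates a new 1024-element segment at a time and never copies.

```java
import java.util.Arrays;

// Hypothetical sketch: a flat array that doubles when full. The occasional
// Arrays.copyOf is fast and its cost is amortized O(1) per added value.
final class LongBuffer
{
    private long[] values = new long[1024];
    private int size;

    void add(long value)
    {
        if (size == values.length) {
            values = Arrays.copyOf(values, values.length * 2);   // single allocation + copy
        }
        values[size++] = value;
    }

    long get(int index)
    {
        return values[index];
    }
}
```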

skrzypo987 added 4 commits April 26, 2022 06:57
Instead of iterating over groups and jumping all over the hash table,
we iterate over the hash table itself, minimizing random memory access.
The change is done only in BigintGroupByHash; MultiChannelGroupByHash
is already properly implemented.

Before
BenchmarkGroupByHashOnSimulatedData.groupBy             0   BIGINT_1M_GROUPS         ADD  avgt   30  26,855 ± 0,359  ns/op
BenchmarkGroupByHashOnSimulatedData.groupBy             0  BIGINT_10M_GROUPS         ADD  avgt   30  79,430 ± 0,678  ns/op
After
BenchmarkGroupByHashOnSimulatedData.groupBy             0   BIGINT_1M_GROUPS         ADD  avgt   30  26,559 ± 0,612  ns/op
BenchmarkGroupByHashOnSimulatedData.groupBy             0  BIGINT_10M_GROUPS         ADD  avgt   30  73,690 ± 0,855  ns/op
The hash table has at most 2^30 positions, so an ordinary array is sufficient.
This also slightly reduces the memory footprint.
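
A minimal sketch of what "a simple long[]" means here, assuming a plain values array indexed by group id and Trino's BIGINT type for writing output; the names and structure are illustrative, not the actual class:

```java
import io.trino.spi.block.BlockBuilder;

import static io.trino.spi.type.BigintType.BIGINT;

// Hypothetical sketch: group values kept in a plain long[] (the table never exceeds
// 2^30 positions, so an ordinary array is enough) and written straight to the output.
final class GroupValues
{
    private final long[] valuesByGroupId;   // one entry per group id

    GroupValues(long[] valuesByGroupId)
    {
        this.valuesByGroupId = valuesByGroupId;
    }

    void appendValueTo(int groupId, BlockBuilder blockBuilder)
    {
        BIGINT.writeLong(blockBuilder, valuesByGroupId[groupId]);
    }
}
```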

Before
BenchmarkGroupByHashOnSimulatedData.groupBy             0     BIGINT_2_GROUPS         ADD  avgt   10   3,965 ± 0,182  ns/op
BenchmarkGroupByHashOnSimulatedData.groupBy             0     BIGINT_2_GROUPS  GET_GROUPS  avgt   10   8,022 ± 0,364  ns/op
BenchmarkGroupByHashOnSimulatedData.groupBy             0    BIGINT_10_GROUPS         ADD  avgt   10   3,778 ± 0,086  ns/op
BenchmarkGroupByHashOnSimulatedData.groupBy             0    BIGINT_10_GROUPS  GET_GROUPS  avgt   10   7,681 ± 0,127  ns/op
BenchmarkGroupByHashOnSimulatedData.groupBy             0    BIGINT_1K_GROUPS         ADD  avgt   10   4,628 ± 0,187  ns/op
BenchmarkGroupByHashOnSimulatedData.groupBy             0    BIGINT_1K_GROUPS  GET_GROUPS  avgt   10   7,372 ± 0,117  ns/op
BenchmarkGroupByHashOnSimulatedData.groupBy             0   BIGINT_10K_GROUPS         ADD  avgt   10  11,239 ± 0,791  ns/op
BenchmarkGroupByHashOnSimulatedData.groupBy             0   BIGINT_10K_GROUPS  GET_GROUPS  avgt   10  15,104 ± 0,237  ns/op
BenchmarkGroupByHashOnSimulatedData.groupBy             0  BIGINT_100K_GROUPS         ADD  avgt   10  10,455 ± 0,175  ns/op
BenchmarkGroupByHashOnSimulatedData.groupBy             0  BIGINT_100K_GROUPS  GET_GROUPS  avgt   10  14,039 ± 0,333  ns/op
BenchmarkGroupByHashOnSimulatedData.groupBy             0    BIGINT_1M_GROUPS         ADD  avgt   10  28,611 ± 1,289  ns/op
BenchmarkGroupByHashOnSimulatedData.groupBy             0    BIGINT_1M_GROUPS  GET_GROUPS  avgt   10  34,819 ± 2,511  ns/op
BenchmarkGroupByHashOnSimulatedData.groupBy             0   BIGINT_10M_GROUPS         ADD  avgt   10  71,855 ± 2,965  ns/op
BenchmarkGroupByHashOnSimulatedData.groupBy             0   BIGINT_10M_GROUPS  GET_GROUPS  avgt   10  79,189 ± 1,162  ns/op
After
BenchmarkGroupByHashOnSimulatedData.groupBy             0     BIGINT_2_GROUPS         ADD  avgt   10   3,841 ± 0,143  ns/op
BenchmarkGroupByHashOnSimulatedData.groupBy             0     BIGINT_2_GROUPS  GET_GROUPS  avgt   10   4,940 ± 0,150  ns/op
BenchmarkGroupByHashOnSimulatedData.groupBy             0    BIGINT_10_GROUPS         ADD  avgt   10   3,628 ± 0,101  ns/op
BenchmarkGroupByHashOnSimulatedData.groupBy             0    BIGINT_10_GROUPS  GET_GROUPS  avgt   10   4,866 ± 0,070  ns/op
BenchmarkGroupByHashOnSimulatedData.groupBy             0    BIGINT_1K_GROUPS         ADD  avgt   10   4,898 ± 0,369  ns/op
BenchmarkGroupByHashOnSimulatedData.groupBy             0    BIGINT_1K_GROUPS  GET_GROUPS  avgt   10   6,218 ± 0,574  ns/op
BenchmarkGroupByHashOnSimulatedData.groupBy             0   BIGINT_10K_GROUPS         ADD  avgt   10  10,671 ± 0,364  ns/op
BenchmarkGroupByHashOnSimulatedData.groupBy             0   BIGINT_10K_GROUPS  GET_GROUPS  avgt   10  13,108 ± 0,266  ns/op
BenchmarkGroupByHashOnSimulatedData.groupBy             0  BIGINT_100K_GROUPS         ADD  avgt   10  10,968 ± 0,535  ns/op
BenchmarkGroupByHashOnSimulatedData.groupBy             0  BIGINT_100K_GROUPS  GET_GROUPS  avgt   10  12,169 ± 0,096  ns/op
BenchmarkGroupByHashOnSimulatedData.groupBy             0    BIGINT_1M_GROUPS         ADD  avgt   10  27,862 ± 1,444  ns/op
BenchmarkGroupByHashOnSimulatedData.groupBy             0    BIGINT_1M_GROUPS  GET_GROUPS  avgt   10  31,153 ± 1,053  ns/op
BenchmarkGroupByHashOnSimulatedData.groupBy             0   BIGINT_10M_GROUPS         ADD  avgt   10  71,691 ± 2,855  ns/op
BenchmarkGroupByHashOnSimulatedData.groupBy             0   BIGINT_10M_GROUPS  GET_GROUPS  avgt   10  76,061 ± 2,208  ns/op
For simplicity and a tiny performance gain.
@skrzypo987 force-pushed the skrzypo/068-bigint-group-by-hash-optimizations branch from 7c9ce26 to 3ae7e75 on April 26, 2022 05:14
@skrzypo987 (Member Author):

Comments addressed

Member:

It seems it mostly helps for:

BenchmarkGroupByHashOnSimulatedData.groupBy             0   BIGINT_10K_GROUPS         ADD  avgt   10  10,671 ± 0,364  ns/op
BenchmarkGroupByHashOnSimulatedData.groupBy             0   BIGINT_10K_GROUPS  GET_GROUPS  avgt   10  13,108 ± 0,266  ns/op

Why not the other cases?
Does this make a difference in practice?

Member Author:

It's a slight change; I don't think we should expect huge gains here. It's more of a simplification than a performance trick.

@skrzypo987 force-pushed the skrzypo/068-bigint-group-by-hash-optimizations branch from 3ae7e75 to 577b991 on April 27, 2022 05:20
@skrzypo987 requested a review from sopel39 April 27, 2022 11:01
@skrzypo987 force-pushed the skrzypo/068-bigint-group-by-hash-optimizations branch from 577b991 to 256abfc on April 27, 2022 12:24
Previously the hash table capacity was checked for every row to see whether a rehash
was needed. Now the input page is split into batches, and it is assumed that every
row in the batch will create a new group (which is rarely the case), so rehashing
is done in advance, before the batch is processed.
This may slightly increase the memory footprint for a small number of groups, but
there is a tiny performance gain because the capacity is not checked for every row.

Before:
BenchmarkGroupByHashOnSimulatedData.groupBy             0     BIGINT_2_GROUPS  GET_GROUPS  avgt   10   4,647 ± 0,082  ns/op
BenchmarkGroupByHashOnSimulatedData.groupBy             0    BIGINT_10_GROUPS  GET_GROUPS  avgt   10   4,737 ± 0,035  ns/op
BenchmarkGroupByHashOnSimulatedData.groupBy             0    BIGINT_1K_GROUPS  GET_GROUPS  avgt   10   5,663 ± 0,286  ns/op
BenchmarkGroupByHashOnSimulatedData.groupBy             0   BIGINT_10K_GROUPS  GET_GROUPS  avgt   10  12,073 ± 0,429  ns/op
BenchmarkGroupByHashOnSimulatedData.groupBy             0  BIGINT_100K_GROUPS  GET_GROUPS  avgt   10  12,098 ± 0,117  ns/op
BenchmarkGroupByHashOnSimulatedData.groupBy             0    BIGINT_1M_GROUPS  GET_GROUPS  avgt   10  28,623 ± 0,812  ns/op
BenchmarkGroupByHashOnSimulatedData.groupBy             0   BIGINT_10M_GROUPS  GET_GROUPS  avgt   10  71,115 ± 0,812  ns/op
BenchmarkGroupByHashOnSimulatedData.groupBy            .5     BIGINT_2_GROUPS  GET_GROUPS  avgt   10   7,293 ± 0,044  ns/op
BenchmarkGroupByHashOnSimulatedData.groupBy            .5    BIGINT_10_GROUPS  GET_GROUPS  avgt   10   7,136 ± 0,054  ns/op
BenchmarkGroupByHashOnSimulatedData.groupBy            .5    BIGINT_1K_GROUPS  GET_GROUPS  avgt   10   8,012 ± 0,212  ns/op
BenchmarkGroupByHashOnSimulatedData.groupBy            .5   BIGINT_10K_GROUPS  GET_GROUPS  avgt   10  10,844 ± 0,059  ns/op
BenchmarkGroupByHashOnSimulatedData.groupBy            .5  BIGINT_100K_GROUPS  GET_GROUPS  avgt   10  10,973 ± 0,070  ns/op
BenchmarkGroupByHashOnSimulatedData.groupBy            .5    BIGINT_1M_GROUPS  GET_GROUPS  avgt   10  23,575 ± 0,243  ns/op
BenchmarkGroupByHashOnSimulatedData.groupBy            .5   BIGINT_10M_GROUPS  GET_GROUPS  avgt   10  41,599 ± 0,754  ns/op
After:
BenchmarkGroupByHashOnSimulatedData.groupBy             0     BIGINT_2_GROUPS  GET_GROUPS  avgt   10   3,982 ± 0,119  ns/op
BenchmarkGroupByHashOnSimulatedData.groupBy             0    BIGINT_10_GROUPS  GET_GROUPS  avgt   10   4,140 ± 0,036  ns/op
BenchmarkGroupByHashOnSimulatedData.groupBy             0    BIGINT_1K_GROUPS  GET_GROUPS  avgt   10   5,321 ± 0,018  ns/op
BenchmarkGroupByHashOnSimulatedData.groupBy             0   BIGINT_10K_GROUPS  GET_GROUPS  avgt   10  11,894 ± 0,088  ns/op
BenchmarkGroupByHashOnSimulatedData.groupBy             0  BIGINT_100K_GROUPS  GET_GROUPS  avgt   10  11,794 ± 0,065  ns/op
BenchmarkGroupByHashOnSimulatedData.groupBy             0    BIGINT_1M_GROUPS  GET_GROUPS  avgt   10  29,032 ± 0,911  ns/op
BenchmarkGroupByHashOnSimulatedData.groupBy             0   BIGINT_10M_GROUPS  GET_GROUPS  avgt   10  69,878 ± 0,659  ns/op
BenchmarkGroupByHashOnSimulatedData.groupBy            .5     BIGINT_2_GROUPS  GET_GROUPS  avgt   10   7,280 ± 0,078  ns/op
BenchmarkGroupByHashOnSimulatedData.groupBy            .5    BIGINT_10_GROUPS  GET_GROUPS  avgt   10   7,199 ± 0,084  ns/op
BenchmarkGroupByHashOnSimulatedData.groupBy            .5    BIGINT_1K_GROUPS  GET_GROUPS  avgt   10   8,161 ± 0,092  ns/op
BenchmarkGroupByHashOnSimulatedData.groupBy            .5   BIGINT_10K_GROUPS  GET_GROUPS  avgt   10  10,999 ± 0,079  ns/op
BenchmarkGroupByHashOnSimulatedData.groupBy            .5  BIGINT_100K_GROUPS  GET_GROUPS  avgt   10  10,727 ± 0,107  ns/op
BenchmarkGroupByHashOnSimulatedData.groupBy            .5    BIGINT_1M_GROUPS  GET_GROUPS  avgt   10  23,582 ± 0,503  ns/op
BenchmarkGroupByHashOnSimulatedData.groupBy            .5   BIGINT_10M_GROUPS  GET_GROUPS  avgt   10  40,936 ± 0,211  ns/op
@skrzypo987 force-pushed the skrzypo/068-bigint-group-by-hash-optimizations branch from 256abfc to 73c0289 on April 28, 2022 07:16
BIGINT_100K_GROUPS(new ChannelDefinition(ColumnType.BIGINT, 100_000)),
BIGINT_1M_GROUPS(new ChannelDefinition(ColumnType.BIGINT, 1_000_000)),
BIGINT_10M_GROUPS(new ChannelDefinition(ColumnType.BIGINT, 10_000_000)),
BIGINT_2_GROUPS_1_SMALL_DICTIONARY(new ChannelDefinition(ColumnType.BIGINT, 2, 1, 50)),
Member:

nit: add a textual description to these enums (can be a separate PR). It's hard to understand what they benchmark after some time.

implements GroupByHash
{
private static final int INSTANCE_SIZE = ClassLayout.parseClass(BigintGroupByHash.class).instanceSize();
private static final int BATCH_SIZE = 1024;
Member:

nit: this is bigger than the expected number of groups for io.trino.sql.planner.LocalExecutionPlanner.Visitor#visitTableWriter and io.trino.sql.planner.LocalExecutionPlanner.Visitor#visitTableFinish (200).

Member Author:

The expected number of groups is hardcoded as 10000, so I don't think that matters.

@sopel39 merged commit 75dd718 into trinodb:master Apr 28, 2022
@sopel39 mentioned this pull request Apr 28, 2022
@github-actions added this to the 379 milestone Apr 28, 2022