Bigint group by hash optimizations #12095
Conversation
9939a98 to ec1b447
ec1b447 to 7c9ce26
Could you attach some macro-benchmark results?
Why do we do it? We iterate by values instead of "groups". Then we have one more sequential random access at line 314?
I don't follow. "Sequential random access": sequential or random?
I mean random. I cannot understand why we are changing it.
The previous approach iterated over a group and then jumped around both the groupIds and values arrays, which was completely random.
Now we iterate over those two arrays in sequential order. Also, newGroupIds will be traversed quasi-sequentially since it follows the same order as groupIds, which is ordered by hash.
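A rough sketch of the two access patterns (hypothetical, simplified names; not the actual Trino code):

class RehashSketch
{
    // Old table (power-of-two capacity); groupIds == -1 marks an empty bucket.
    final long[] values;
    final int[] groupIds;
    // New, larger table being filled.
    final long[] newValues;
    final int[] newGroupIds;

    RehashSketch(int oldCapacity, int newCapacity)
    {
        values = new long[oldCapacity];
        groupIds = new int[oldCapacity];
        newValues = new long[newCapacity];
        newGroupIds = new int[newCapacity];
        java.util.Arrays.fill(groupIds, -1);
        java.util.Arrays.fill(newGroupIds, -1);
    }

    // Before: iterate by group id and chase each group back to its bucket,
    // so reads of values/groupIds land at effectively random positions.
    void insertByGroup(int groupCount, int[] bucketByGroupId)
    {
        for (int groupId = 0; groupId < groupCount; groupId++) {
            int bucket = bucketByGroupId[groupId];   // random access
            insert(values[bucket], groupId);         // random access
        }
    }

    // After: walk the old table front to back; values[i] and groupIds[i] are read
    // sequentially, and newGroupIds is filled quasi-sequentially because insertion
    // follows the old hash order.
    void insertByBucket()
    {
        for (int bucket = 0; bucket < values.length; bucket++) {
            int groupId = groupIds[bucket];
            if (groupId != -1) {                     // skip empty buckets
                insert(values[bucket], groupId);
            }
        }
    }

    private void insert(long value, int groupId)
    {
        int bucket = Long.hashCode(value) & (newValues.length - 1);   // capacity is a power of two
        while (newGroupIds[bucket] != -1) {                           // linear probing
            bucket = (bucket + 1) & (newValues.length - 1);
        }
        newValues[bucket] = value;
        newGroupIds[bucket] = groupId;
    }
}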
Reach out to me offline if you need a more robust description.
Does that change make it possible that some query fails now whereas it did not fail before (when spill to disk is enabled)?
I mean the situation where we have to process the last page and that page contains 100k rows that are all going to be assigned to one group.
It is possible that a query that was already extremely close to failing would fail here and not before.
But the case where:
1. there are a lot of groups,
2. yet the last page contains only one group (and it is non-null),
is quite unusual.
Also, a page with 100k records defeats the purpose of a page. We sometimes produce 8k-row pages in the ORC reader, but that is the maximum.
The initial max fill is around 12k, so with a small number of groups this won't trigger a resize.
It was just an example. Obviously we do not need such a specific case to reveal the issue. The last page can contain far fewer rows, like 2k or 300 or less. And the last page does not have to contain a single group; it can contain many groups and the query could still fail (whereas it succeeded before).
I think the requirements for that case are less likely than you think.
First of all, in order to run out of memory the query needs to be big.
So for a big enough query we need all of the following:
- There are a lot of groups (so that the hash table is big). Let's assume 8M buckets (tpch/q18).
- The last page contains one or more groups that have not been seen before. Given that billions of incoming rows have already been processed, this is highly unlikely.
- The few added groups hit exactly the current resize threshold. Given that the hash table has 4M buckets, there is roughly a 1-in-4M chance of that per new group.
- Increasing the hash table size takes an additional 4M * 12B = 48MB. This number then needs to hit the memory limit exactly. If the query is big enough to spawn millions of groups, the node/cluster probably has hundreds of GB of memory.
So, as you can see, the probability of hitting such an issue is minimal.
If you have a more likely example, please share it here.
I can imagine that the probability of such a case is minimal. Rather, the question for me is: is that acceptable compared to the performance gain? Very possibly yes, we can accept it. It is a tradeoff (for me).
sopel39
left a comment
lgtm % comments up to Replace block builder with a simple long[]
core/trino-main/src/main/java/io/trino/operator/BigintGroupByHash.java
core/trino-main/src/main/java/io/trino/operator/BigintGroupByHash.java
This is more expensive than ensureCapacity as ensureCapacity only adds segments. Can it be slower sometimes in practice? Multi channel still has io.trino.operator.MultiChannelGroupByHash#groupAddressByGroupId
In theory I am sure there is a case when this is slower. But for our data I don't think so.
Copying memory is insanely fast and this is amortized O(1). There is also only one allocation compared to one per 1024 values in BigArray.
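For context, a minimal sketch of the tradeoff under discussion, assuming simplified names rather than the actual LongBigArray/BigintGroupByHash code:

import java.util.Arrays;

class GrowthSketch
{
    static final int SEGMENT_SIZE = 1024;

    // Flat array: on overflow, copy everything into a single, larger array.
    // Each element is copied a bounded number of times overall (capacity doubles),
    // so growth is amortized O(1) per element, with one allocation per doubling.
    static long[] growFlat(long[] array, int requiredCapacity)
    {
        if (requiredCapacity <= array.length) {
            return array;
        }
        int newCapacity = Math.max(array.length, 1);
        while (newCapacity < requiredCapacity) {
            newCapacity *= 2;
        }
        return Arrays.copyOf(array, newCapacity);
    }

    // Segmented "big array": never copies existing data, but allocates one new
    // segment per 1024 values, and every read needs two dereferences.
    static long[][] growSegmented(long[][] segments, int requiredCapacity)
    {
        int requiredSegments = (requiredCapacity + SEGMENT_SIZE - 1) / SEGMENT_SIZE;
        long[][] grown = Arrays.copyOf(segments, Math.max(segments.length, requiredSegments));
        for (int i = 0; i < grown.length; i++) {
            if (grown[i] == null) {
                grown[i] = new long[SEGMENT_SIZE];   // one allocation per 1024 values
            }
        }
        return grown;
    }
}

The flat variant pays a copy on growth but reads with a single dereference; the segmented variant avoids copying existing data but pays an extra indirection on every access.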
Instead of iterating over groups and jumping all over the hash table, we iterate over the hash table itself, minimizing random memory access. The change is done only in BigintGroupByHash; MultiChannelGroupByHash is already properly implemented.

Before:
BenchmarkGroupByHashOnSimulatedData.groupBy 0 BIGINT_1M_GROUPS ADD avgt 30 26,855 ± 0,359 ns/op
BenchmarkGroupByHashOnSimulatedData.groupBy 0 BIGINT_10M_GROUPS ADD avgt 30 79,430 ± 0,678 ns/op

After:
BenchmarkGroupByHashOnSimulatedData.groupBy 0 BIGINT_1M_GROUPS ADD avgt 30 26,559 ± 0,612 ns/op
BenchmarkGroupByHashOnSimulatedData.groupBy 0 BIGINT_10M_GROUPS ADD avgt 30 73,690 ± 0,855 ns/op
The hash table is at most 2^30 positions, so an ordinary array will be sufficient. This also slightly reduces memory footprint.

Before:
BenchmarkGroupByHashOnSimulatedData.groupBy 0 BIGINT_2_GROUPS ADD avgt 10 3,965 ± 0,182 ns/op
BenchmarkGroupByHashOnSimulatedData.groupBy 0 BIGINT_2_GROUPS GET_GROUPS avgt 10 8,022 ± 0,364 ns/op
BenchmarkGroupByHashOnSimulatedData.groupBy 0 BIGINT_10_GROUPS ADD avgt 10 3,778 ± 0,086 ns/op
BenchmarkGroupByHashOnSimulatedData.groupBy 0 BIGINT_10_GROUPS GET_GROUPS avgt 10 7,681 ± 0,127 ns/op
BenchmarkGroupByHashOnSimulatedData.groupBy 0 BIGINT_1K_GROUPS ADD avgt 10 4,628 ± 0,187 ns/op
BenchmarkGroupByHashOnSimulatedData.groupBy 0 BIGINT_1K_GROUPS GET_GROUPS avgt 10 7,372 ± 0,117 ns/op
BenchmarkGroupByHashOnSimulatedData.groupBy 0 BIGINT_10K_GROUPS ADD avgt 10 11,239 ± 0,791 ns/op
BenchmarkGroupByHashOnSimulatedData.groupBy 0 BIGINT_10K_GROUPS GET_GROUPS avgt 10 15,104 ± 0,237 ns/op
BenchmarkGroupByHashOnSimulatedData.groupBy 0 BIGINT_100K_GROUPS ADD avgt 10 10,455 ± 0,175 ns/op
BenchmarkGroupByHashOnSimulatedData.groupBy 0 BIGINT_100K_GROUPS GET_GROUPS avgt 10 14,039 ± 0,333 ns/op
BenchmarkGroupByHashOnSimulatedData.groupBy 0 BIGINT_1M_GROUPS ADD avgt 10 28,611 ± 1,289 ns/op
BenchmarkGroupByHashOnSimulatedData.groupBy 0 BIGINT_1M_GROUPS GET_GROUPS avgt 10 34,819 ± 2,511 ns/op
BenchmarkGroupByHashOnSimulatedData.groupBy 0 BIGINT_10M_GROUPS ADD avgt 10 71,855 ± 2,965 ns/op
BenchmarkGroupByHashOnSimulatedData.groupBy 0 BIGINT_10M_GROUPS GET_GROUPS avgt 10 79,189 ± 1,162 ns/op

After:
BenchmarkGroupByHashOnSimulatedData.groupBy 0 BIGINT_2_GROUPS ADD avgt 10 3,841 ± 0,143 ns/op
BenchmarkGroupByHashOnSimulatedData.groupBy 0 BIGINT_2_GROUPS GET_GROUPS avgt 10 4,940 ± 0,150 ns/op
BenchmarkGroupByHashOnSimulatedData.groupBy 0 BIGINT_10_GROUPS ADD avgt 10 3,628 ± 0,101 ns/op
BenchmarkGroupByHashOnSimulatedData.groupBy 0 BIGINT_10_GROUPS GET_GROUPS avgt 10 4,866 ± 0,070 ns/op
BenchmarkGroupByHashOnSimulatedData.groupBy 0 BIGINT_1K_GROUPS ADD avgt 10 4,898 ± 0,369 ns/op
BenchmarkGroupByHashOnSimulatedData.groupBy 0 BIGINT_1K_GROUPS GET_GROUPS avgt 10 6,218 ± 0,574 ns/op
BenchmarkGroupByHashOnSimulatedData.groupBy 0 BIGINT_10K_GROUPS ADD avgt 10 10,671 ± 0,364 ns/op
BenchmarkGroupByHashOnSimulatedData.groupBy 0 BIGINT_10K_GROUPS GET_GROUPS avgt 10 13,108 ± 0,266 ns/op
BenchmarkGroupByHashOnSimulatedData.groupBy 0 BIGINT_100K_GROUPS ADD avgt 10 10,968 ± 0,535 ns/op
BenchmarkGroupByHashOnSimulatedData.groupBy 0 BIGINT_100K_GROUPS GET_GROUPS avgt 10 12,169 ± 0,096 ns/op
BenchmarkGroupByHashOnSimulatedData.groupBy 0 BIGINT_1M_GROUPS ADD avgt 10 27,862 ± 1,444 ns/op
BenchmarkGroupByHashOnSimulatedData.groupBy 0 BIGINT_1M_GROUPS GET_GROUPS avgt 10 31,153 ± 1,053 ns/op
BenchmarkGroupByHashOnSimulatedData.groupBy 0 BIGINT_10M_GROUPS ADD avgt 10 71,691 ± 2,855 ns/op
BenchmarkGroupByHashOnSimulatedData.groupBy 0 BIGINT_10M_GROUPS GET_GROUPS avgt 10 76,061 ± 2,208 ns/op
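A minimal sketch of the idea, assuming hypothetical names: a table capped at 2^30 buckets fits in a single int-indexed Java array (which can hold up to about 2^31 - 1 elements), so one flat long[] can back the whole table.

class FlatTableSketch
{
    private static final int MAX_CAPACITY = 1 << 30;    // assumed hash table size cap

    private final long[] values;

    FlatTableSketch(int capacity)
    {
        if (capacity > MAX_CAPACITY) {
            throw new IllegalArgumentException("capacity exceeds 2^30");
        }
        this.values = new long[capacity];                // single allocation
    }

    long get(int bucket)
    {
        return values[bucket];                           // one dereference, no segment lookup
    }

    void set(int bucket, long value)
    {
        values[bucket] = value;
    }
}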
For simplicity and a tiny performance gain.
7c9ce26 to 3ae7e75
Comments addressed
core/trino-main/src/main/java/io/trino/operator/BigintGroupByHash.java
core/trino-main/src/main/java/io/trino/operator/BigintGroupByHash.java
It seems it mostly helps for:
BenchmarkGroupByHashOnSimulatedData.groupBy 0 BIGINT_10K_GROUPS ADD avgt 10 10,671 ± 0,364 ns/op
BenchmarkGroupByHashOnSimulatedData.groupBy 0 BIGINT_10K_GROUPS GET_GROUPS avgt 10 13,108 ± 0,266 ns/op
Why not the other cases?
Does this make a difference in practice?
It's a slight change. I don't think we should expect huge gains here. It's a simplification rather than a performance trick.
3ae7e75 to 577b991
577b991 to 256abfc
Previously the hash table capacity was checked for every row to see whether a rehash was needed. Now the input page is split into batches; it is assumed that every row in a batch will create a new group (which is rarely the case) and rehashing is done in advance, before the batch is processed. This may slightly increase the memory footprint for a small number of groups, but there is a tiny performance gain as the capacity is not checked for every row.

Before:
BenchmarkGroupByHashOnSimulatedData.groupBy 0 BIGINT_2_GROUPS GET_GROUPS avgt 10 4,647 ± 0,082 ns/op
BenchmarkGroupByHashOnSimulatedData.groupBy 0 BIGINT_10_GROUPS GET_GROUPS avgt 10 4,737 ± 0,035 ns/op
BenchmarkGroupByHashOnSimulatedData.groupBy 0 BIGINT_1K_GROUPS GET_GROUPS avgt 10 5,663 ± 0,286 ns/op
BenchmarkGroupByHashOnSimulatedData.groupBy 0 BIGINT_10K_GROUPS GET_GROUPS avgt 10 12,073 ± 0,429 ns/op
BenchmarkGroupByHashOnSimulatedData.groupBy 0 BIGINT_100K_GROUPS GET_GROUPS avgt 10 12,098 ± 0,117 ns/op
BenchmarkGroupByHashOnSimulatedData.groupBy 0 BIGINT_1M_GROUPS GET_GROUPS avgt 10 28,623 ± 0,812 ns/op
BenchmarkGroupByHashOnSimulatedData.groupBy 0 BIGINT_10M_GROUPS GET_GROUPS avgt 10 71,115 ± 0,812 ns/op
BenchmarkGroupByHashOnSimulatedData.groupBy .5 BIGINT_2_GROUPS GET_GROUPS avgt 10 7,293 ± 0,044 ns/op
BenchmarkGroupByHashOnSimulatedData.groupBy .5 BIGINT_10_GROUPS GET_GROUPS avgt 10 7,136 ± 0,054 ns/op
BenchmarkGroupByHashOnSimulatedData.groupBy .5 BIGINT_1K_GROUPS GET_GROUPS avgt 10 8,012 ± 0,212 ns/op
BenchmarkGroupByHashOnSimulatedData.groupBy .5 BIGINT_10K_GROUPS GET_GROUPS avgt 10 10,844 ± 0,059 ns/op
BenchmarkGroupByHashOnSimulatedData.groupBy .5 BIGINT_100K_GROUPS GET_GROUPS avgt 10 10,973 ± 0,070 ns/op
BenchmarkGroupByHashOnSimulatedData.groupBy .5 BIGINT_1M_GROUPS GET_GROUPS avgt 10 23,575 ± 0,243 ns/op
BenchmarkGroupByHashOnSimulatedData.groupBy .5 BIGINT_10M_GROUPS GET_GROUPS avgt 10 41,599 ± 0,754 ns/op

After:
BenchmarkGroupByHashOnSimulatedData.groupBy 0 BIGINT_2_GROUPS GET_GROUPS avgt 10 3,982 ± 0,119 ns/op
BenchmarkGroupByHashOnSimulatedData.groupBy 0 BIGINT_10_GROUPS GET_GROUPS avgt 10 4,140 ± 0,036 ns/op
BenchmarkGroupByHashOnSimulatedData.groupBy 0 BIGINT_1K_GROUPS GET_GROUPS avgt 10 5,321 ± 0,018 ns/op
BenchmarkGroupByHashOnSimulatedData.groupBy 0 BIGINT_10K_GROUPS GET_GROUPS avgt 10 11,894 ± 0,088 ns/op
BenchmarkGroupByHashOnSimulatedData.groupBy 0 BIGINT_100K_GROUPS GET_GROUPS avgt 10 11,794 ± 0,065 ns/op
BenchmarkGroupByHashOnSimulatedData.groupBy 0 BIGINT_1M_GROUPS GET_GROUPS avgt 10 29,032 ± 0,911 ns/op
BenchmarkGroupByHashOnSimulatedData.groupBy 0 BIGINT_10M_GROUPS GET_GROUPS avgt 10 69,878 ± 0,659 ns/op
BenchmarkGroupByHashOnSimulatedData.groupBy .5 BIGINT_2_GROUPS GET_GROUPS avgt 10 7,280 ± 0,078 ns/op
BenchmarkGroupByHashOnSimulatedData.groupBy .5 BIGINT_10_GROUPS GET_GROUPS avgt 10 7,199 ± 0,084 ns/op
BenchmarkGroupByHashOnSimulatedData.groupBy .5 BIGINT_1K_GROUPS GET_GROUPS avgt 10 8,161 ± 0,092 ns/op
BenchmarkGroupByHashOnSimulatedData.groupBy .5 BIGINT_10K_GROUPS GET_GROUPS avgt 10 10,999 ± 0,079 ns/op
BenchmarkGroupByHashOnSimulatedData.groupBy .5 BIGINT_100K_GROUPS GET_GROUPS avgt 10 10,727 ± 0,107 ns/op
BenchmarkGroupByHashOnSimulatedData.groupBy .5 BIGINT_1M_GROUPS GET_GROUPS avgt 10 23,582 ± 0,503 ns/op
BenchmarkGroupByHashOnSimulatedData.groupBy .5 BIGINT_10M_GROUPS GET_GROUPS avgt 10 40,936 ± 0,211 ns/op
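Roughly, the batched capacity check described above could look like the following sketch (hypothetical, simplified names and table layout; not the actual BigintGroupByHash code):

import java.util.Arrays;

class BatchedCapacitySketch
{
    private static final int BATCH_SIZE = 1024;
    private static final float MAX_FILL_RATIO = 0.75f;

    private long[] values = new long[1 << 14];     // power-of-two capacity
    private int[] groupIds = new int[1 << 14];     // -1 marks an empty bucket
    private int groupCount;

    BatchedCapacitySketch()
    {
        Arrays.fill(groupIds, -1);
    }

    void addPage(long[] page)
    {
        for (int start = 0; start < page.length; start += BATCH_SIZE) {
            int end = Math.min(start + BATCH_SIZE, page.length);
            // Pessimistically assume every row in the batch creates a new group and
            // rehash up front, so the inner loop never checks capacity per row.
            ensureCapacity(groupCount + (end - start));
            for (int i = start; i < end; i++) {
                putIfAbsent(page[i]);
            }
        }
    }

    private void ensureCapacity(int expectedGroups)
    {
        while (expectedGroups > (int) (values.length * MAX_FILL_RATIO)) {
            rehash(values.length * 2);
        }
    }

    private void rehash(int newCapacity)
    {
        long[] oldValues = values;
        int[] oldGroupIds = groupIds;
        values = new long[newCapacity];
        groupIds = new int[newCapacity];
        Arrays.fill(groupIds, -1);
        // Walk the old table sequentially (see the earlier access-pattern discussion).
        for (int i = 0; i < oldValues.length; i++) {
            if (oldGroupIds[i] != -1) {
                insert(oldValues[i], oldGroupIds[i]);
            }
        }
    }

    private void putIfAbsent(long value)
    {
        int bucket = Long.hashCode(value) & (values.length - 1);
        while (groupIds[bucket] != -1) {
            if (values[bucket] == value) {
                return;                              // existing group
            }
            bucket = (bucket + 1) & (values.length - 1);
        }
        values[bucket] = value;
        groupIds[bucket] = groupCount++;             // new group; capacity was reserved per batch
    }

    private void insert(long value, int groupId)
    {
        int bucket = Long.hashCode(value) & (values.length - 1);
        while (groupIds[bucket] != -1) {
            bucket = (bucket + 1) & (values.length - 1);
        }
        values[bucket] = value;
        groupIds[bucket] = groupId;
    }
}

With a 16384-bucket starting table and a 0.75 fill ratio, the initial max fill works out to 12288, which matches the "around 12k" figure mentioned earlier in the discussion.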
256abfc to 73c0289
BIGINT_100K_GROUPS(new ChannelDefinition(ColumnType.BIGINT, 100_000)),
BIGINT_1M_GROUPS(new ChannelDefinition(ColumnType.BIGINT, 1_000_000)),
BIGINT_10M_GROUPS(new ChannelDefinition(ColumnType.BIGINT, 10_000_000)),
BIGINT_2_GROUPS_1_SMALL_DICTIONARY(new ChannelDefinition(ColumnType.BIGINT, 2, 1, 50)),
nit: add a textual description to these enums (can be a separate PR). It's hard to understand what they benchmark after some time.
implements GroupByHash
{
    private static final int INSTANCE_SIZE = ClassLayout.parseClass(BigintGroupByHash.class).instanceSize();
    private static final int BATCH_SIZE = 1024;
nit: this is bigger than the expected number of groups for io.trino.sql.planner.LocalExecutionPlanner.Visitor#visitTableWriter and io.trino.sql.planner.LocalExecutionPlanner.Visitor#visitTableFinish (200).
The expected number of groups is hardcoded as 10000, so I don't think that matters.
Description
A set of small optimizations for the BigintGroupByHash class
improvement, refactoring
core query engine
Nothing changes for a non-technical user
Related issues, pull requests, and links
Documentation
(x) No documentation is needed.
( ) Sufficient documentation is included in this PR.
( ) Documentation PR is available with #prnumber.
( ) Documentation issue #issuenumber is filed, and can be handled later.
Release notes
( ) No release notes entries required.
( ) Release notes entries required with the following suggested text: