Bigint group by hash optimizations #12095
Conversation
9939a98 to ec1b447
ec1b447 to 7c9ce26
Could you attach some macro-benchmark results?
Why do we do it? We iterate by values instead of "groups". Then we have one more sequential random access at line 314?
I don't follow. "Sequential random access": sequential or random?
I mean random. I cannot understand why we are changing it.
The previous approach iterated over a group and then jumped around both the groupIds and values arrays, which was completely random.
Now we iterate over those two arrays in sequential order. Also, newGroupIds will be traversed quasi-sequentially since it follows the same order as groupIds, which is ordered by hash.
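A rough sketch of the two access patterns (hypothetical, simplified names; not the actual Trino code):

class RehashSketch
{
    // Old table (power-of-two capacity); groupIds == -1 marks an empty bucket.
    final long[] values;
    final int[] groupIds;
    // New, larger table being filled.
    final long[] newValues;
    final int[] newGroupIds;

    RehashSketch(int oldCapacity, int newCapacity)
    {
        values = new long[oldCapacity];
        groupIds = new int[oldCapacity];
        newValues = new long[newCapacity];
        newGroupIds = new int[newCapacity];
        java.util.Arrays.fill(groupIds, -1);
        java.util.Arrays.fill(newGroupIds, -1);
    }

    // Before: iterate by group id and chase each group back to its bucket,
    // so reads of values/groupIds land at effectively random positions.
    void insertByGroup(int groupCount, int[] bucketByGroupId)
    {
        for (int groupId = 0; groupId < groupCount; groupId++) {
            int bucket = bucketByGroupId[groupId];   // random access
            insert(values[bucket], groupId);         // random access
        }
    }

    // After: walk the old table front to back; values[i] and groupIds[i] are read
    // sequentially, and newGroupIds is filled quasi-sequentially because insertion
    // follows the old hash order.
    void insertByBucket()
    {
        for (int bucket = 0; bucket < values.length; bucket++) {
            int groupId = groupIds[bucket];
            if (groupId != -1) {                     // skip empty buckets
                insert(values[bucket], groupId);
            }
        }
    }

    private void insert(long value, int groupId)
    {
        int bucket = Long.hashCode(value) & (newValues.length - 1);   // capacity is a power of two
        while (newGroupIds[bucket] != -1) {                           // linear probing
            bucket = (bucket + 1) & (newValues.length - 1);
        }
        newValues[bucket] = value;
        newGroupIds[bucket] = groupId;
    }
}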
Reach out to me offline if you need a more robust description.
Does that change make it possible that some query fails now whereas it did not fail before (when spill to disk is enabled)?
I mean the situation where we have to process the last page and that page contains 100k rows that are all going to be assigned to one group.
It is possible that a query that was already extremely close to failing would fail here and not before.
But the case where:
1. there are a lot of groups,
2. yet the last page contains only one group (and it is non-null),
is quite unusual.
Also, a page with 100k records defeats the purpose of a page. We sometimes produce 8k-row pages in the ORC reader, but that is the maximum.
The initial max fill is around 12k, so with a small number of groups this won't trigger a resize.
It was just an example. Obviously we do not need such a specific case to reveal the issue. The last page can contain far fewer rows, like 2k or 300 or less. And the last page does not have to contain a single group; it can contain many groups and the query could still fail (whereas it succeeded before).
I think the requirements for that case are less likely than you think.
First of all, in order to run out of memory the query needs to be big.
So for a big enough query we need all of the following:
- There are a lot of groups (so that the hash table is big). Let's assume 8M buckets (tpch/q18).
- The last page contains one or more groups that have not been seen before. Given that billions of incoming rows have already been processed, this is highly unlikely.
- The few added groups hit exactly the current resize threshold. Given that the hash table has 4M buckets, there is roughly a 1-in-4M chance of that per new group.
- Increasing the hash table size takes an additional 4M * 12B = 48MB. This number then needs to hit the memory limit exactly. If the query is big enough to spawn millions of groups, the node/cluster probably has hundreds of GB of memory.
So, as you can see, the probability of hitting such an issue is minimal.
If you have a more likely example, please share it here.
I can imagine that the probability of such a case is minimal. Rather, the question for me is: is that acceptable compared to the performance gain? Very possibly yes, we can accept it. It is a tradeoff (for me).
sopel39
left a comment
lgtm % comments up to Replace block builder with a simple long[]
core/trino-main/src/main/java/io/trino/operator/BigintGroupByHash.java
core/trino-main/src/main/java/io/trino/operator/BigintGroupByHash.java
This is more expensive than ensureCapacity as ensureCapacity only adds segments. Can it be slower sometimes in practice? Multi channel still has io.trino.operator.MultiChannelGroupByHash#groupAddressByGroupId
In theory I am sure there is a case when this is slower. But for our data I don't think so.
Copying memory is insanely fast and this is amortized O(1). There is also only one allocation compared to one per 1024 values in BigArray.
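For context, a minimal sketch of the tradeoff under discussion, assuming simplified names rather than the actual LongBigArray/BigintGroupByHash code:

import java.util.Arrays;

class GrowthSketch
{
    static final int SEGMENT_SIZE = 1024;

    // Flat array: on overflow, copy everything into a single, larger array.
    // Each element is copied a bounded number of times overall (capacity doubles),
    // so growth is amortized O(1) per element, with one allocation per doubling.
    static long[] growFlat(long[] array, int requiredCapacity)
    {
        if (requiredCapacity <= array.length) {
            return array;
        }
        int newCapacity = Math.max(array.length, 1);
        while (newCapacity < requiredCapacity) {
            newCapacity *= 2;
        }
        return Arrays.copyOf(array, newCapacity);
    }

    // Segmented "big array": never copies existing data, but allocates one new
    // segment per 1024 values, and every read needs two dereferences.
    static long[][] growSegmented(long[][] segments, int requiredCapacity)
    {
        int requiredSegments = (requiredCapacity + SEGMENT_SIZE - 1) / SEGMENT_SIZE;
        long[][] grown = Arrays.copyOf(segments, Math.max(segments.length, requiredSegments));
        for (int i = 0; i < grown.length; i++) {
            if (grown[i] == null) {
                grown[i] = new long[SEGMENT_SIZE];   // one allocation per 1024 values
            }
        }
        return grown;
    }
}

The flat variant pays a copy on growth but reads with a single dereference; the segmented variant avoids copying existing data but pays an extra indirection on every access.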
Instead of iterating over groups and jumping all over the hash table, we iterate over the hash table itself, minimizing random memory access. The change is done only in BigintGroupByHash; MultiChannelGroupByHash is already properly implemented.

Before:
BenchmarkGroupByHashOnSimulatedData.groupBy 0 BIGINT_1M_GROUPS ADD avgt 30 26,855 ± 0,359 ns/op
BenchmarkGroupByHashOnSimulatedData.groupBy 0 BIGINT_10M_GROUPS ADD avgt 30 79,430 ± 0,678 ns/op

After:
BenchmarkGroupByHashOnSimulatedData.groupBy 0 BIGINT_1M_GROUPS ADD avgt 30 26,559 ± 0,612 ns/op
BenchmarkGroupByHashOnSimulatedData.groupBy 0 BIGINT_10M_GROUPS ADD avgt 30 73,690 ± 0,855 ns/op
The hash table is at most 2^30 positions, so an ordinary array will be sufficient. This also slightly reduces memory footprint.

Before:
BenchmarkGroupByHashOnSimulatedData.groupBy 0 BIGINT_2_GROUPS ADD avgt 10 3,965 ± 0,182 ns/op
BenchmarkGroupByHashOnSimulatedData.groupBy 0 BIGINT_2_GROUPS GET_GROUPS avgt 10 8,022 ± 0,364 ns/op
BenchmarkGroupByHashOnSimulatedData.groupBy 0 BIGINT_10_GROUPS ADD avgt 10 3,778 ± 0,086 ns/op
BenchmarkGroupByHashOnSimulatedData.groupBy 0 BIGINT_10_GROUPS GET_GROUPS avgt 10 7,681 ± 0,127 ns/op
BenchmarkGroupByHashOnSimulatedData.groupBy 0 BIGINT_1K_GROUPS ADD avgt 10 4,628 ± 0,187 ns/op
BenchmarkGroupByHashOnSimulatedData.groupBy 0 BIGINT_1K_GROUPS GET_GROUPS avgt 10 7,372 ± 0,117 ns/op
BenchmarkGroupByHashOnSimulatedData.groupBy 0 BIGINT_10K_GROUPS ADD avgt 10 11,239 ± 0,791 ns/op
BenchmarkGroupByHashOnSimulatedData.groupBy 0 BIGINT_10K_GROUPS GET_GROUPS avgt 10 15,104 ± 0,237 ns/op
BenchmarkGroupByHashOnSimulatedData.groupBy 0 BIGINT_100K_GROUPS ADD avgt 10 10,455 ± 0,175 ns/op
BenchmarkGroupByHashOnSimulatedData.groupBy 0 BIGINT_100K_GROUPS GET_GROUPS avgt 10 14,039 ± 0,333 ns/op
BenchmarkGroupByHashOnSimulatedData.groupBy 0 BIGINT_1M_GROUPS ADD avgt 10 28,611 ± 1,289 ns/op
BenchmarkGroupByHashOnSimulatedData.groupBy 0 BIGINT_1M_GROUPS GET_GROUPS avgt 10 34,819 ± 2,511 ns/op
BenchmarkGroupByHashOnSimulatedData.groupBy 0 BIGINT_10M_GROUPS ADD avgt 10 71,855 ± 2,965 ns/op
BenchmarkGroupByHashOnSimulatedData.groupBy 0 BIGINT_10M_GROUPS GET_GROUPS avgt 10 79,189 ± 1,162 ns/op

After:
BenchmarkGroupByHashOnSimulatedData.groupBy 0 BIGINT_2_GROUPS ADD avgt 10 3,841 ± 0,143 ns/op
BenchmarkGroupByHashOnSimulatedData.groupBy 0 BIGINT_2_GROUPS GET_GROUPS avgt 10 4,940 ± 0,150 ns/op
BenchmarkGroupByHashOnSimulatedData.groupBy 0 BIGINT_10_GROUPS ADD avgt 10 3,628 ± 0,101 ns/op
BenchmarkGroupByHashOnSimulatedData.groupBy 0 BIGINT_10_GROUPS GET_GROUPS avgt 10 4,866 ± 0,070 ns/op
BenchmarkGroupByHashOnSimulatedData.groupBy 0 BIGINT_1K_GROUPS ADD avgt 10 4,898 ± 0,369 ns/op
BenchmarkGroupByHashOnSimulatedData.groupBy 0 BIGINT_1K_GROUPS GET_GROUPS avgt 10 6,218 ± 0,574 ns/op
BenchmarkGroupByHashOnSimulatedData.groupBy 0 BIGINT_10K_GROUPS ADD avgt 10 10,671 ± 0,364 ns/op
BenchmarkGroupByHashOnSimulatedData.groupBy 0 BIGINT_10K_GROUPS GET_GROUPS avgt 10 13,108 ± 0,266 ns/op
BenchmarkGroupByHashOnSimulatedData.groupBy 0 BIGINT_100K_GROUPS ADD avgt 10 10,968 ± 0,535 ns/op
BenchmarkGroupByHashOnSimulatedData.groupBy 0 BIGINT_100K_GROUPS GET_GROUPS avgt 10 12,169 ± 0,096 ns/op
BenchmarkGroupByHashOnSimulatedData.groupBy 0 BIGINT_1M_GROUPS ADD avgt 10 27,862 ± 1,444 ns/op
BenchmarkGroupByHashOnSimulatedData.groupBy 0 BIGINT_1M_GROUPS GET_GROUPS avgt 10 31,153 ± 1,053 ns/op
BenchmarkGroupByHashOnSimulatedData.groupBy 0 BIGINT_10M_GROUPS ADD avgt 10 71,691 ± 2,855 ns/op
BenchmarkGroupByHashOnSimulatedData.groupBy 0 BIGINT_10M_GROUPS GET_GROUPS avgt 10 76,061 ± 2,208 ns/op
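A minimal sketch of the idea, assuming hypothetical names: a table capped at 2^30 buckets fits in a single int-indexed Java array (which can hold up to about 2^31 - 1 elements), so one flat long[] can back the whole table.

class FlatTableSketch
{
    private static final int MAX_CAPACITY = 1 << 30;    // assumed hash table size cap

    private final long[] values;

    FlatTableSketch(int capacity)
    {
        if (capacity > MAX_CAPACITY) {
            throw new IllegalArgumentException("capacity exceeds 2^30");
        }
        this.values = new long[capacity];                // single allocation
    }

    long get(int bucket)
    {
        return values[bucket];                           // one dereference, no segment lookup
    }

    void set(int bucket, long value)
    {
        values[bucket] = value;
    }
}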
For simplicity and a tiny performance gain.
7c9ce26 to 3ae7e75
Comments addressed
core/trino-main/src/main/java/io/trino/operator/BigintGroupByHash.java
core/trino-main/src/main/java/io/trino/operator/BigintGroupByHash.java
It seems it mostly helps for:
BenchmarkGroupByHashOnSimulatedData.groupBy 0 BIGINT_10K_GROUPS ADD avgt 10 10,671 ± 0,364 ns/op
BenchmarkGroupByHashOnSimulatedData.groupBy 0 BIGINT_10K_GROUPS GET_GROUPS avgt 10 13,108 ± 0,266 ns/op
Why not the other cases?
Does this make a difference in practice?
It's a slight change. I don't think we should expect huge gains here. It's a simplification rather than a performance trick.
3ae7e75 to 577b991
577b991 to 256abfc
Previously the hash table capacity was checked for every row to see whether a rehash was needed. Now the input page is split into batches; it is assumed that every row in a batch will create a new group (which is rarely the case) and rehashing is done in advance, before the batch is processed. This may slightly increase the memory footprint for a small number of groups, but there is a tiny performance gain as the capacity is not checked for every row.

Before:
BenchmarkGroupByHashOnSimulatedData.groupBy 0 BIGINT_2_GROUPS GET_GROUPS avgt 10 4,647 ± 0,082 ns/op
BenchmarkGroupByHashOnSimulatedData.groupBy 0 BIGINT_10_GROUPS GET_GROUPS avgt 10 4,737 ± 0,035 ns/op
BenchmarkGroupByHashOnSimulatedData.groupBy 0 BIGINT_1K_GROUPS GET_GROUPS avgt 10 5,663 ± 0,286 ns/op
BenchmarkGroupByHashOnSimulatedData.groupBy 0 BIGINT_10K_GROUPS GET_GROUPS avgt 10 12,073 ± 0,429 ns/op
BenchmarkGroupByHashOnSimulatedData.groupBy 0 BIGINT_100K_GROUPS GET_GROUPS avgt 10 12,098 ± 0,117 ns/op
BenchmarkGroupByHashOnSimulatedData.groupBy 0 BIGINT_1M_GROUPS GET_GROUPS avgt 10 28,623 ± 0,812 ns/op
BenchmarkGroupByHashOnSimulatedData.groupBy 0 BIGINT_10M_GROUPS GET_GROUPS avgt 10 71,115 ± 0,812 ns/op
BenchmarkGroupByHashOnSimulatedData.groupBy .5 BIGINT_2_GROUPS GET_GROUPS avgt 10 7,293 ± 0,044 ns/op
BenchmarkGroupByHashOnSimulatedData.groupBy .5 BIGINT_10_GROUPS GET_GROUPS avgt 10 7,136 ± 0,054 ns/op
BenchmarkGroupByHashOnSimulatedData.groupBy .5 BIGINT_1K_GROUPS GET_GROUPS avgt 10 8,012 ± 0,212 ns/op
BenchmarkGroupByHashOnSimulatedData.groupBy .5 BIGINT_10K_GROUPS GET_GROUPS avgt 10 10,844 ± 0,059 ns/op
BenchmarkGroupByHashOnSimulatedData.groupBy .5 BIGINT_100K_GROUPS GET_GROUPS avgt 10 10,973 ± 0,070 ns/op
BenchmarkGroupByHashOnSimulatedData.groupBy .5 BIGINT_1M_GROUPS GET_GROUPS avgt 10 23,575 ± 0,243 ns/op
BenchmarkGroupByHashOnSimulatedData.groupBy .5 BIGINT_10M_GROUPS GET_GROUPS avgt 10 41,599 ± 0,754 ns/op

After:
BenchmarkGroupByHashOnSimulatedData.groupBy 0 BIGINT_2_GROUPS GET_GROUPS avgt 10 3,982 ± 0,119 ns/op
BenchmarkGroupByHashOnSimulatedData.groupBy 0 BIGINT_10_GROUPS GET_GROUPS avgt 10 4,140 ± 0,036 ns/op
BenchmarkGroupByHashOnSimulatedData.groupBy 0 BIGINT_1K_GROUPS GET_GROUPS avgt 10 5,321 ± 0,018 ns/op
BenchmarkGroupByHashOnSimulatedData.groupBy 0 BIGINT_10K_GROUPS GET_GROUPS avgt 10 11,894 ± 0,088 ns/op
BenchmarkGroupByHashOnSimulatedData.groupBy 0 BIGINT_100K_GROUPS GET_GROUPS avgt 10 11,794 ± 0,065 ns/op
BenchmarkGroupByHashOnSimulatedData.groupBy 0 BIGINT_1M_GROUPS GET_GROUPS avgt 10 29,032 ± 0,911 ns/op
BenchmarkGroupByHashOnSimulatedData.groupBy 0 BIGINT_10M_GROUPS GET_GROUPS avgt 10 69,878 ± 0,659 ns/op
BenchmarkGroupByHashOnSimulatedData.groupBy .5 BIGINT_2_GROUPS GET_GROUPS avgt 10 7,280 ± 0,078 ns/op
BenchmarkGroupByHashOnSimulatedData.groupBy .5 BIGINT_10_GROUPS GET_GROUPS avgt 10 7,199 ± 0,084 ns/op
BenchmarkGroupByHashOnSimulatedData.groupBy .5 BIGINT_1K_GROUPS GET_GROUPS avgt 10 8,161 ± 0,092 ns/op
BenchmarkGroupByHashOnSimulatedData.groupBy .5 BIGINT_10K_GROUPS GET_GROUPS avgt 10 10,999 ± 0,079 ns/op
BenchmarkGroupByHashOnSimulatedData.groupBy .5 BIGINT_100K_GROUPS GET_GROUPS avgt 10 10,727 ± 0,107 ns/op
BenchmarkGroupByHashOnSimulatedData.groupBy .5 BIGINT_1M_GROUPS GET_GROUPS avgt 10 23,582 ± 0,503 ns/op
BenchmarkGroupByHashOnSimulatedData.groupBy .5 BIGINT_10M_GROUPS GET_GROUPS avgt 10 40,936 ± 0,211 ns/op
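Roughly, the batched capacity check described above could look like the following sketch (hypothetical, simplified names and table layout; not the actual BigintGroupByHash code):

import java.util.Arrays;

class BatchedCapacitySketch
{
    private static final int BATCH_SIZE = 1024;
    private static final float MAX_FILL_RATIO = 0.75f;

    private long[] values = new long[1 << 14];     // power-of-two capacity
    private int[] groupIds = new int[1 << 14];     // -1 marks an empty bucket
    private int groupCount;

    BatchedCapacitySketch()
    {
        Arrays.fill(groupIds, -1);
    }

    void addPage(long[] page)
    {
        for (int start = 0; start < page.length; start += BATCH_SIZE) {
            int end = Math.min(start + BATCH_SIZE, page.length);
            // Pessimistically assume every row in the batch creates a new group and
            // rehash up front, so the inner loop never checks capacity per row.
            ensureCapacity(groupCount + (end - start));
            for (int i = start; i < end; i++) {
                putIfAbsent(page[i]);
            }
        }
    }

    private void ensureCapacity(int expectedGroups)
    {
        while (expectedGroups > (int) (values.length * MAX_FILL_RATIO)) {
            rehash(values.length * 2);
        }
    }

    private void rehash(int newCapacity)
    {
        long[] oldValues = values;
        int[] oldGroupIds = groupIds;
        values = new long[newCapacity];
        groupIds = new int[newCapacity];
        Arrays.fill(groupIds, -1);
        // Walk the old table sequentially (see the earlier access-pattern discussion).
        for (int i = 0; i < oldValues.length; i++) {
            if (oldGroupIds[i] != -1) {
                insert(oldValues[i], oldGroupIds[i]);
            }
        }
    }

    private void putIfAbsent(long value)
    {
        int bucket = Long.hashCode(value) & (values.length - 1);
        while (groupIds[bucket] != -1) {
            if (values[bucket] == value) {
                return;                              // existing group
            }
            bucket = (bucket + 1) & (values.length - 1);
        }
        values[bucket] = value;
        groupIds[bucket] = groupCount++;             // new group; capacity was reserved per batch
    }

    private void insert(long value, int groupId)
    {
        int bucket = Long.hashCode(value) & (values.length - 1);
        while (groupIds[bucket] != -1) {
            bucket = (bucket + 1) & (values.length - 1);
        }
        values[bucket] = value;
        groupIds[bucket] = groupId;
    }
}

With a 16384-bucket starting table and a 0.75 fill ratio, the initial max fill works out to 12288, which matches the "around 12k" figure mentioned earlier in the discussion.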
256abfc to 73c0289
BIGINT_100K_GROUPS(new ChannelDefinition(ColumnType.BIGINT, 100_000)),
BIGINT_1M_GROUPS(new ChannelDefinition(ColumnType.BIGINT, 1_000_000)),
BIGINT_10M_GROUPS(new ChannelDefinition(ColumnType.BIGINT, 10_000_000)),
BIGINT_2_GROUPS_1_SMALL_DICTIONARY(new ChannelDefinition(ColumnType.BIGINT, 2, 1, 50)),
nit: add a textual description to these enums (can be a separate PR). It's hard to understand what they benchmark after some time.
implements GroupByHash
{
    private static final int INSTANCE_SIZE = ClassLayout.parseClass(BigintGroupByHash.class).instanceSize();
    private static final int BATCH_SIZE = 1024;
nit: this is bigger than the expected number of groups for io.trino.sql.planner.LocalExecutionPlanner.Visitor#visitTableWriter and io.trino.sql.planner.LocalExecutionPlanner.Visitor#visitTableFinish (200).
The expected number of groups is hardcoded as 10000, so I don't think that matters.
Description
A set of small optimizations for the BigintGroupByHash class
improvement, refactoring
core query engine
Nothing changes for a non-technical user
Related issues, pull requests, and links
Documentation
(x) No documentation is needed.
( ) Sufficient documentation is included in this PR.
( ) Documentation PR is available with #prnumber.
( ) Documentation issue #issuenumber is filed, and can be handled later.
Release notes
( ) No release notes entries required.
( ) Release notes entries required with the following suggested text: