Row wise group by on fixed width types #10706
lukasz-stec wants to merge 15 commits into trinodb:master
Conversation
You can extract the first commit to a separate PR and merge it.
ok, I'm gonna prepare the PR
Get rid of the static import for create
I'd rather throw exception here
This is used to check whether we can use the new GroupByHash implementation, so it is control flow in the normal case; I think Optional is better for this than an exception.
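As a sketch of the Optional-based control flow described above (the class name, method, and width check below are illustrative assumptions, not the PR's actual API):

```java
import java.util.Optional;

// Hypothetical factory: returns Optional.empty() when the row-wise
// implementation cannot handle the given channel widths, so the caller can
// fall back to another GroupByHash without exception-driven control flow.
public class FixedWidthGroupByHashFactory
{
    public static Optional<String> tryCreate(int[] channelWidthsBytes)
    {
        for (int width : channelWidthsBytes) {
            if (width <= 0 || width > 16) {
                // unsupported (non-fixed-width) channel: signal via Optional
                return Optional.empty();
            }
        }
        return Optional.of("row-wise hash for " + channelWidthsBytes.length + " channels");
    }
}
```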
Just use long[] and then construct a block directly
I wanted to be consistent with MultiChannelGroupByHash but I agree it makes sense to use array directly.
Will try to do it.
It's harder to change if it's hardcoded. E.g. if the generated classes are not compiling for some reason (like API change) it's easy to delete and regenerate them if they are not referenced directly.
It's also easy to make a hard-to-spot mistake in hand-written code here.
I don't think 20 is needed. I've never seen a group by using more than a few columns.
Unless we want to generate it with a bytecode generator.
tpcds has queries with group by 8 - 10 columns e.g. q24. I think 20 is a reasonable number. It covers most of the cases and it's not too big (e.g. it's not much runtime overhead).
Have you considered aligning values to cache lines? That may (or may not) speed up things significantly
I haven't tried it. It's a good idea to test it.
For entries smaller than a cache line, it would make sure we access only one cache line per entry instead of possibly two. We may access multiple entries per lookup in case of a collision, so this is not a clear win.
Obviously, this would increase memory usage, but it may be a good trade-off.
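A minimal sketch of what cache-line alignment could look like; the 64-byte line size and the rounding helper are assumptions for illustration, not code from the PR:

```java
// Round each entry size up to a multiple of the (assumed) 64-byte cache line
// so that a single entry never straddles two lines. This trades extra memory
// for at most one cache-line load per entry access.
public class CacheLinePadding
{
    private static final int CACHE_LINE_BYTES = 64;

    public static int paddedEntrySize(int entrySizeBytes)
    {
        // round up to the next multiple of 64
        return (entrySizeBytes + CACHE_LINE_BYTES - 1) / CACHE_LINE_BYTES * CACHE_LINE_BYTES;
    }
}
```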
This can be done faster but I guess it's not a hot code path
Well, it's kind of in the hot path because it's used during rehash (i.e. when the current table is too small and we have to extend it).
How can this be sped up?
Anywhere between Arrays.fill and System.arraycopy of already filled parts.
But I don't think it's worth the effort
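The System.arraycopy variant hinted at above can be sketched like this (a generic doubling fill, not the PR's code): fill one element, then repeatedly copy the already-filled prefix onto the unfilled tail, doubling the copied span each pass.

```java
// Fill an array by copying already-filled parts with System.arraycopy,
// doubling the span each iteration; O(log n) arraycopy calls instead of
// n element writes.
public class DoublingFill
{
    public static void fill(long[] array, long value)
    {
        if (array.length == 0) {
            return;
        }
        array[0] = value;
        int filled = 1;
        while (filled < array.length) {
            int toCopy = Math.min(filled, array.length - filled);
            System.arraycopy(array, 0, array, filled, toCopy);
            filled += toCopy;
        }
    }
}
```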
Force-pushed from 8dcda03 to 87953e0
lukasz-stec left a comment:
I addressed some comments, plus:
- extracted GroupByHashFactory commit
- added dict and RLE support
- added VarHandleFastByteBuffer
Your benchmark PDF is missing results for partitioned/unpartitioned data (for such a big change) and peak memory metrics.
partitioned is there. I will run unpartitioned.
I updated the partitioned benchmark with peak memory statistics.
unpartitioned benchmark results added.
"enhanced" is not descriptive enough.
This can be done in a separate PR
See if you can do that using instanceof FixedWidthType and getFixedSize method
Isn't every type so far fixed size?
What about other operators? There is a reason for decoupling data (block) type and operators based on that data.
Should any type have a specific equals operator, this will fail.
Do these caches actually help?
Without it, we would start with interpreted code for every aggregation, every split, even every page. This would both slow the query down and increase pressure on the JIT. I did not benchmark it, though.
I understand that the other FastByteBuffer will be deleted?
yes, possibly we could have another one (off-heap) if we need to handle large arrays
This implementation limits the possible number of groups. If the aggregation uses several long fields, an entry can easily reach 100 bytes per row, making this array unusable above ~20M groups (2^31 bytes / 100 bytes).
that's a valid point.
This is not as bad as it sounds: for partial aggregation we will never hit 20M groups because of the memory limit, and the final step is partitioned, so the total number of groups is roughly groups per single hash table × number of nodes × number of cores (e.g. ~3B for 6 nodes with 32 cores).
Also, 100 bytes corresponds to 8 bigints, so it is not the common case.
That said, it needs to be handled. One way is to switch to OffHeapFastByteBuffer once we hit this limit (we would then have to manage off-heap memory). Another is to switch to the old columnar implementation.
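The arithmetic behind this limit, under the assumption of a single on-heap byte array capped at 2^31 bytes:

```java
// Assumed capacity model: one on-heap byte[] holds all entries, so the
// maximum group count is the 2^31-byte array limit divided by the entry size.
// At 100 bytes per entry this gives ~21.5M groups, matching the ~20M estimate.
public class GroupLimit
{
    public static long maxGroups(long entrySizeBytes)
    {
        return (1L << 31) / entrySizeBytes;
    }
}
```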
@lukasz-stec how come the peak memory didn't go up (or even dropped)? Are we correctly accounting memory in this PR?
@sopel39 I suspect that peak memory is not "caused" by the
This uses an additional column to decide which set of columns the final aggregation should use
Extend BenchmarkPartitionedOutputOperator.pollute with row-wise code path pollution
BIGINT.getLong is called with multiple Block types, i.e. not only LongArrayBlock but also DictionaryBlock and RunLengthEncodedBlock. This sometimes confuses the JIT into making a virtual call in PrecomputedHashGenerator, even if PrecomputedHashGenerator is only called with LongArrayBlock and the call to PrecomputedHashGenerator is inlined.
To avoid block type pollution, use manual dispatch instead of class isolation.
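A sketch of the manual-dispatch idea from this commit message: branch on the concrete block type explicitly so each call site stays monomorphic for the JIT, instead of relying on a megamorphic virtual call. The Block hierarchy below is a simplified stand-in for Trino's real block types, not the actual API.

```java
public class ManualDispatch
{
    public interface Block
    {
        long getLong(int position);
    }

    public static final class LongArrayBlock
            implements Block
    {
        private final long[] values;

        public LongArrayBlock(long[] values)
        {
            this.values = values;
        }

        @Override
        public long getLong(int position)
        {
            return values[position];
        }
    }

    public static final class RunLengthEncodedBlock
            implements Block
    {
        private final long value;

        public RunLengthEncodedBlock(long value)
        {
            this.value = value;
        }

        @Override
        public long getLong(int position)
        {
            return value;
        }
    }

    // Manual dispatch: each instanceof branch contains a monomorphic call
    // site, so the type profile of one branch cannot pollute the others.
    public static long getLong(Block block, int position)
    {
        if (block instanceof LongArrayBlock) {
            return ((LongArrayBlock) block).getLong(position);
        }
        if (block instanceof RunLengthEncodedBlock) {
            return ((RunLengthEncodedBlock) block).getLong(position);
        }
        return block.getLong(position);
    }
}
```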
Force-pushed from 87953e0 to f185141
I ran a benchmark for this (tpch/tpcds orc part sf1K) on top of the latest master that includes adaptive partial aggregation.
I think we can start with having
@lukasz-stec Rather than generating source code, we can use
There are pros and cons to both approaches. It's way easier to read, understand, and profile generated source code.
👋 @lukasz-stec @sopel39 .. is this still being worked on or can we close this PR?
Let's close it. Project hummingbird is going in this direction anyway. |

This adds another GroupByHash implementation that works only on multiple fixed-width types and can store hash table state in a row-wise manner.
Hash table implementation
I tested two hash table implementations. One stores almost everything in one big array (SingleTableHashTableData); the second stores only group ids in the hash table and the group values in a separate array (SeparateTableHashTableData). The single table is better for CPU because it does only one random read per row during hash table population (putIfAbsent). The separate table is better for memory utilization (it does not waste memory on empty hash table slots) but does at least two random memory reads (one to get the groupId from the hash table, a second to get the values for that group to compare with the current row). For this reason, SingleTableHashTableData is used. SeparateTableHashTableData could potentially be used to store variable-width data (VARCHARs).
Memory layout per hash table entry looks like this:
Code generation
To avoid virtual method calls (and to have more independent code), the current implementation uses one-time source code generation plus classloader isolation of multiple classes, instead of runtime bytecode generation.
This is mainly to improve readability and maintainability. It's far easier to manually improve the generated code and then potentially change the generator. It's also easier to understand the code and analyze its performance in tools like a profiler.
The main issue is that class isolation is complicated now. I think it's possible to improve it.
Tpch/tpcds benchmarks
There is a slight (~2%) but stable CPU improvement (i.e. for queries that improve, the improvement is much bigger than the variability).
Partitioned orc sf1000
fixed-width-varhandle-single-table_with_mem_stats.pdf
Unpartitioned orc sf1000
fixed-width-varhandle-single-table_unpart.pdf
JMH benchmark results
Generally a win across the board, but the significant improvement is for cases with multiple columns and many groups (i.e. the hash table does not fit in L3).
Cases with 1 column are here for illustration only, as a single BIGINT column is handled differently anyway (BigintGroupByHash).
Experiments
The best gains from this change are for queries that group by many columns with the number of groups in the tens of thousands.
It would be even better for a larger number of groups, but the current partial aggregation memory limit (16MB) makes this not as good as it could be.
Below is a simple group by on 8 BIGINT columns with 25K groups. It shows a ~30% CPU drop and a ~25% wall clock duration drop.
trino:tpch_sf1000_orc_part> set session use_enhanced_group_by=false;
SET SESSION
trino:tpch_sf1000_orc_part> EXPLAIN ANALYZE VERBOSE
                         -> select orderkey % 100000, orderkey % 100000 + 1, orderkey % 100000 + 2, orderkey % 100000 + 3, orderkey % 100000 + 4, orderkey % 100000 + 5, orderkey % 100000 + 6, orderkey % 100000 + 7, count(*)
                         -> from hive.tpch_sf1000_orc.lineitem
                         -> group by orderkey % 100000, orderkey % 100000 + 1, orderkey % 100000 + 2, orderkey % 100000 + 3, orderkey % 100000 + 4, orderkey % 100000 + 5, orderkey % 100000 + 6, orderkey % 100000 + 7;
...
Query 20220201_161708_00117_du6m2, FINISHED, 7 nodes
http://localhost:8081/ui/query.html?20220201_161708_00117_du6m2
Splits: 3,260 total, 3,260 done (100.00%)
CPU Time: 2871.9s total, 2.09M rows/s, 1.77MB/s, 48% active
Per Node: 24.4 parallelism, 51M rows/s, 43.1MB/s
Parallelism: 170.8
Peak Memory: 79.2MB
16.81 [6B rows, 4.96GB] [357M rows/s, 302MB/s]

trino:tpch_sf1000_orc_part> set session use_enhanced_group_by=true;
SET SESSION
trino:tpch_sf1000_orc_part> EXPLAIN ANALYZE VERBOSE
                         -> select orderkey % 100000, orderkey % 100000 + 1, orderkey % 100000 + 2, orderkey % 100000 + 3, orderkey % 100000 + 4, orderkey % 100000 + 5, orderkey % 100000 + 6, orderkey % 100000 + 7, count(*)
                         -> from hive.tpch_sf1000_orc.lineitem
                         -> group by orderkey % 100000, orderkey % 100000 + 1, orderkey % 100000 + 2, orderkey % 100000 + 3, orderkey % 100000 + 4, orderkey % 100000 + 5, orderkey % 100000 + 6, orderkey % 100000 + 7;
...
Query 20220201_161740_00119_du6m2, FINISHED, 7 nodes
http://localhost:8081/ui/query.html?20220201_161740_00119_du6m2
Splits: 3,260 total, 3,260 done (100.00%)
CPU Time: 2065.2s total, 2.91M rows/s, 2.46MB/s, 47% active
Per Node: 23.7 parallelism, 68.7M rows/s, 58.1MB/s
Parallelism: 165.6
Peak Memory: 264MB
12.47 [6B rows, 4.96GB] [481M rows/s, 407MB/s]