Push DictionaryBlock through remote partitioned exchange #14937

Merged
raunaqmorarka merged 2 commits into trinodb:master from starburstdata:ls/051-poo-dictionary-support
Apr 18, 2023

Conversation

@lukasz-stec (Member) commented Nov 7, 2022

Description

Dictionary-encoded blocks are currently flattened by the partitioned exchange operator. This prevents dictionary-based optimizations from taking advantage of the encoded blocks (or results in additional overhead).
This PR adds support for passing dictionary-encoded blocks through the partitioned exchange for the case where the same dictionary (the same Java object) is used by subsequent blocks sent through PartitionedOutputOperator.
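The condition above can be sketched as an identity check on the dictionary object. This is a minimal hypothetical sketch (illustrative names, not the actual Trino code): the dictionary encoding survives the exchange only while consecutive blocks reference the exact same dictionary instance.

```java
import java.util.List;

// Hypothetical sketch of the same-dictionary condition described above;
// names are illustrative, not the actual Trino implementation.
public class DictionaryPassthroughSketch
{
    // A stand-in for DictionaryBlock: ids pointing into a shared dictionary.
    record DictBlock(Object dictionary, int[] ids) {}

    // The dictionary encoding can be preserved only if every block references
    // the exact same dictionary instance (reference equality, not equals()),
    // mirroring "the same Java object" in the description.
    static boolean canPushDictionary(List<DictBlock> blocks)
    {
        if (blocks.isEmpty()) {
            return false;
        }
        Object first = blocks.get(0).dictionary();
        for (DictBlock block : blocks) {
            if (block.dictionary() != first) {
                return false; // different instance: fall back to flat blocks
            }
        }
        return true;
    }

    public static void main(String[] args)
    {
        Object sharedDictionary = new String[] {"AUTOMOBILE", "BUILDING"};
        DictBlock a = new DictBlock(sharedDictionary, new int[] {0, 1, 0});
        DictBlock b = new DictBlock(sharedDictionary, new int[] {1, 1});
        // Equal content but a different instance breaks the passthrough.
        DictBlock c = new DictBlock(new String[] {"AUTOMOBILE", "BUILDING"}, new int[] {0});

        if (!canPushDictionary(List.of(a, b))) {
            throw new AssertionError("same instance should allow passthrough");
        }
        if (canPushDictionary(List.of(a, c))) {
            throw new AssertionError("different instances must fall back to flat blocks");
        }
        System.out.println("ok");
    }
}
```

Reference equality is deliberate here: comparing dictionary contents on every block would add exactly the kind of per-block overhead the optimization is trying to avoid.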

Beyond possible CPU optimizations, transmitting DictionaryBlocks over the network is more efficient than transmitting flat blocks.
As an example, for a query on the tpch schema (sf10) encoded in ORC files:

explain analyze verbose select mktsegment from customer c, nation n where c.nationkey = n.nationkey;

mktsegment is dictionary encoded.

When we look at the customer table scan, we see:

 Fragment 2 [SOURCE]                                                                                                                                                                                                     
     CPU: 308.06ms, Scheduled: 368.72ms, Blocked 0.00ns (Input: 0.00ns, Output: 0.00ns), Input: 1500000 rows (18.62MB); per task: avg.: 500000.00 std.dev.: 35568.30, Output: 1500000 rows (45.78MB)                     
     Output buffer active time: 385.00ms, buffer utilization distribution (%): {p01=0.00, p05=0.00, p10=0.00, p25=0.00, p50=0.00, p75=0.01, p90=2.97, p95=3.04, p99=3.10, max=6.15}                                      
     Output layout: [nationkey, mktsegment, $hashvalue_6]                                                                                                                                                                
     Output partitioning: HASH [nationkey][$hashvalue_6]                                                                                                                                                                 
     ScanFilterProject[table = hive:tpch:customer, dynamicFilters = {"nationkey" = #df_332}]                                                                                                                             
         Layout: [nationkey:bigint, mktsegment:varchar(10), $hashvalue_6:bigint]                                                                                                                                         
         Estimates: {rows: 1500000 (45.78MB), cpu: 32.90M, memory: 0B, network: 0B}/{rows: 1500000 (45.78MB), cpu: 32.90M, memory: 0B, network: 0B}/{rows: 1500000 (45.78MB), cpu: 45.78M, memory: 0B, network: 0B}      
         CPU: 309.00ms (59.54%), Scheduled: 368.00ms (60.83%), Blocked: 0.00ns (0.00%), Output: 1500000 rows (45.78MB)                                                                                                   
         connector metrics:                                                                                                                                                                                              
           'OrcReaderCompressionFormat_ZLIB' = LongCount{total=75868244}                                                                                                                                                 
           'Physical input read time' = {duration=5.19ms}                                                                                                                                                                
         metrics:                                                                                                                                                                                                        
           'Blocked time distribution (s)' = {count=3.00, p01=0.00, p05=0.00, p10=0.00, p25=0.00, p50=0.00, p75=0.00, p90=0.00, p95=0.00, p99=0.00, min=0.00, max=0.00}                                                  
           'CPU time distribution (s)' = {count=3.00, p01=0.06, p05=0.06, p10=0.06, p25=0.06, p50=0.06, p75=0.07, p90=0.07, p95=0.07, p99=0.07, min=0.06, max=0.07}                                                      
           'Input block types' = {}                                                                                                                                                                                      
           'Input rows distribution' = {count=3.00, p01=467180.00, p05=467180.00, p10=467180.00, p25=467180.00, p50=483398.00, p75=549422.00, p90=549422.00, p95=549422.00, p99=549422.00, min=467180.00, max=549422.00} 
           'Output block types' = {DictionaryBlock=LongCount{total=312}, LongArrayBlock=LongCount{total=922}, VariableWidthBlock=LongCount{total=149}}                                                                   
           'Projection CPU time' = {duration=612.80us}                                                                                                                                                                   
           'Scheduled time distribution (s)' = {count=3.00, p01=0.07, p05=0.07, p10=0.07, p25=0.07, p50=0.07, p75=0.07, p90=0.07, p95=0.07, p99=0.07, min=0.07, max=0.07}                                                
         Input avg.: 500000.00 rows, Input std.dev.: 7.11%                                                                                                                                                               
         $hashvalue_6 := combine_hash(bigint '0', COALESCE("$operator$hash_code"("nationkey"), 0))                                                                                                                       
         nationkey := nationkey:bigint:REGULAR                                                                                                                                                                           
         mktsegment := mktsegment:varchar(10):REGULAR                                                                                                                                                                    
         Input: 1500000 rows (18.62MB), Filtered: 0.00%                                                                                                                                                                  
         Dynamic filters:                                                                                                                                                                                                
             - df_332, [ SortedRangeSet[type=bigint, ranges=25, {[0], ..., [24]}] ], collection time=31.86ms    

and with this optimization:

 Fragment 2 [SOURCE]                                                                                                                                                                                                     
     CPU: 294.82ms, Scheduled: 317.47ms, Blocked 0.00ns (Input: 0.00ns, Output: 0.00ns), Input: 1500000 rows (18.62MB); per task: avg.: 500000.00 std.dev.: 35568.30, Output: 1500000 rows (34.43MB)                     
     Output buffer active time: 342.36ms, buffer utilization distribution (%): {p01=0.00, p05=0.00, p10=0.00, p25=0.00, p50=0.00, p75=0.06, p90=3.40, p95=3.58, p99=3.94, max=7.09}                                      
     Output layout: [nationkey, mktsegment, $hashvalue_6]                                                                                                                                                                
     Output partitioning: HASH [nationkey][$hashvalue_6]                                                                                                                                                                 
     ScanFilterProject[table = hive:tpch:customer, dynamicFilters = {"nationkey" = #df_332}]                                                                                                                             
         Layout: [nationkey:bigint, mktsegment:varchar(10), $hashvalue_6:bigint]                                                                                                                                         
         Estimates: {rows: 1500000 (45.78MB), cpu: 32.90M, memory: 0B, network: 0B}/{rows: 1500000 (45.78MB), cpu: 32.90M, memory: 0B, network: 0B}/{rows: 1500000 (45.78MB), cpu: 45.78M, memory: 0B, network: 0B}      
         CPU: 294.00ms (59.51%), Scheduled: 317.00ms (59.36%), Blocked: 0.00ns (0.00%), Output: 1500000 rows (34.43MB)                                                                                                   
         connector metrics:                                                                                                                                                                                              
           'OrcReaderCompressionFormat_ZLIB' = LongCount{total=75868244}                                                                                                                                                 
           'Physical input read time' = {duration=15.62ms}                                                                                                                                                               
         metrics:                                                                                                                                                                                                        
           'Blocked time distribution (s)' = {count=3.00, p01=0.00, p05=0.00, p10=0.00, p25=0.00, p50=0.00, p75=0.00, p90=0.00, p95=0.00, p99=0.00, min=0.00, max=0.00}                                                  
           'CPU time distribution (s)' = {count=3.00, p01=0.06, p05=0.06, p10=0.06, p25=0.06, p50=0.06, p75=0.06, p90=0.06, p95=0.06, p99=0.06, min=0.06, max=0.06}                                                      
           'Input block types' = {}                                                                                                                                                                                      
           'Input rows distribution' = {count=3.00, p01=467180.00, p05=467180.00, p10=467180.00, p25=467180.00, p50=483398.00, p75=549422.00, p90=549422.00, p95=549422.00, p99=549422.00, min=467180.00, max=549422.00} 
           'Output block types' = {DictionaryBlock=LongCount{total=312}, LongArrayBlock=LongCount{total=922}, VariableWidthBlock=LongCount{total=149}}                                                                   
           'Projection CPU time' = {duration=498.71us}                                                                                                                                                                   
           'Scheduled time distribution (s)' = {count=3.00, p01=0.06, p05=0.06, p10=0.06, p25=0.06, p50=0.06, p75=0.07, p90=0.07, p95=0.07, p99=0.07, min=0.06, max=0.07}                                                
         Input avg.: 500000.00 rows, Input std.dev.: 7.11%                                                                                                                                                               
         $hashvalue_6 := combine_hash(bigint '0', COALESCE("$operator$hash_code"("nationkey"), 0))                                                                                                                       
         nationkey := nationkey:bigint:REGULAR                                                                                                                                                                           
         mktsegment := mktsegment:varchar(10):REGULAR                                                                                                                                                                    
         Input: 1500000 rows (18.62MB), Filtered: 0.00%                                                                                                                                                                  
         Dynamic filters:                                                                                                                                                                                                
             - df_332, [ SortedRangeSet[type=bigint, ranges=25, {[0], ..., [24]}] ], collection time=33.37ms 

So Output: 1500000 rows (45.78MB) goes down to Output: 1500000 rows (34.43MB).

Non-technical explanation

Increase dictionary-encoded block usage in the engine.

Release notes

(x) This is not user-visible or docs only and no release notes are required.
( ) Release notes are required, please propose a release note for me.
( ) Release notes are required, with the following suggested text:

# Section
* Fix some things. ({issue}`issuenumber`)

@cla-bot cla-bot bot added the cla-signed label Nov 7, 2022
@lukasz-stec lukasz-stec force-pushed the ls/051-poo-dictionary-support branch from 5124e8a to fea1f25 Compare November 8, 2022 20:37
@lukasz-stec lukasz-stec marked this pull request as ready for review November 14, 2022 16:44
Member:

Why not use IntArrayList?

Member Author:

I need to calculate getRetainedSizeInBytes. For that, the actual size of the backing array is needed, and IntArrayList does not expose that information.
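The concern above can be illustrated with a small sketch (hypothetical names and an assumed instance-overhead constant, not Trino's actual code): retained-size accounting must be driven by the allocated capacity of the backing array, not the logical element count.

```java
import static java.lang.Math.max;

// Hypothetical sketch of why a plain int[] is kept directly: retained-size
// accounting needs the length of the backing array, not just the logical
// size. Names and the overhead constant are illustrative assumptions.
public class IntBufferSketch
{
    private static final long INSTANCE_SIZE = 16; // assumed object overhead, for illustration

    private int[] values = new int[4];
    private int size;

    void add(int value)
    {
        if (size == values.length) {
            // Grow by roughly 1.5x when full.
            values = java.util.Arrays.copyOf(values, max(4, values.length + (values.length >> 1)));
        }
        values[size++] = value;
    }

    // Retained size reflects values.length (allocated capacity), which can be
    // larger than the number of elements actually stored.
    long getRetainedSizeInBytes()
    {
        return INSTANCE_SIZE + (long) values.length * Integer.BYTES;
    }

    public static void main(String[] args)
    {
        IntBufferSketch buffer = new IntBufferSketch();
        for (int i = 0; i < 5; i++) {
            buffer.add(i);
        }
        // After growing past 4 elements, capacity (and retained size) exceeds
        // what the logical size alone would suggest.
        if (buffer.getRetainedSizeInBytes() <= INSTANCE_SIZE + 5L * Integer.BYTES) {
            throw new AssertionError("retained size should reflect allocated capacity");
        }
        System.out.println("ok");
    }
}
```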

@lukasz-stec (Member Author) left a review:

CA


@lukasz-stec lukasz-stec requested a review from Dith3r November 15, 2022 09:21
Member:

Rename this to DictionaryAwarePositionsAppender. We usually just have DictionaryAwareXXX, which handles both RLE and dicts.

Member Author:

I would leave the name unchanged. Unnesting is a more important function of this class. Handling dictionaries is an additional feature.

Member:

> Unnesting is a more important function of this class.

Unnesting became almost irrelevant at this point, since now RLE and dicts cannot be nested.

Member:

Keep state top-level, as in io.trino.operator.output.RleAwarePositionsAppender.

Member Author:

Since this class now has two responsibilities (unnesting and building a dictionary), it's more readable to separate them. It also makes it easier to reset the state of the appender.

@sopel39 (Member) commented Nov 15, 2022:

> Since this class has now two responsibilities (unnesting and building a dictionary)

This class doesn't really unnest much now. RleAwarePositionsAppender could do:

        if (source instanceof RunLengthEncodedBlock) {
            delegate.appendRle(((RunLengthEncodedBlock) source).getValue(), positions.size());
        }

itself at this point, so UnnestingPositionsAppender would be all about dictionaries.

Member Author:

RleAwarePositionsAppender is not always there. The unnesting part actually makes sure only flat blocks are passed down to the flat appenders.
Maybe we should merge UnnestingPositionsAppender and RleAwarePositionsAppender into BlockTypeAwarePositionsAppender? Although I fear it would make the code messier.

Member:

> RleAwarePositionsAppender is not always there.

Why wouldn't it be always there?

Member Author:

It's not needed if the type is not comparable.

@sopel39 (Member) commented Nov 15, 2022:

> it's not needed if the type is not comparable

I don't think it's worth the extra complexity, since there are not many types like that.

However, you could then have a minimal UnnestingPositionsAppender without RLE or dictionary builder support.

I don't think mixing the current UnnestingPositionsAppender with dictionary awareness is needed.

@lukasz-stec (Member Author) left a review:

CA



@lukasz-stec lukasz-stec requested a review from sopel39 November 15, 2022 10:13
@lukasz-stec lukasz-stec force-pushed the ls/051-poo-dictionary-support branch from ba0e02c to a754634 Compare November 15, 2022 16:18
@lukasz-stec lukasz-stec marked this pull request as draft November 15, 2022 16:19
@lukasz-stec (Member Author):

Working on a performance regression, so converted temporarily to a draft.

@lukasz-stec lukasz-stec force-pushed the ls/051-poo-dictionary-support branch 9 times, most recently from a2e3297 to afe4bd5 Compare March 30, 2023 14:21
@lukasz-stec lukasz-stec force-pushed the ls/051-poo-dictionary-support branch 3 times, most recently from cbff50c to 41bf198 Compare April 3, 2023 15:16
@lukasz-stec lukasz-stec force-pushed the ls/051-poo-dictionary-support branch from 41bf198 to a731d5f Compare April 4, 2023 10:52
@lukasz-stec lukasz-stec marked this pull request as ready for review April 4, 2023 20:40
@lukasz-stec lukasz-stec force-pushed the ls/051-poo-dictionary-support branch from a731d5f to b4514ae Compare April 5, 2023 06:08
@lukasz-stec (Member Author):

Rebased on master with the OrcReader#MAX_BATCH_SIZE change.

Member:

Why is the dictionary flushed in every case here, but in the other append it is flushed only for non-dictionary cases?

Member Author:

I think we flush in every case other than when we have the same dictionary (dictionaryBlockBuilder.canAppend, to be precise).
We could also do it here as well, at the cost of !closed && (dictionary == this.dictionary || this.dictionary == null) for every row. I will try to benchmark how much this impacts the performance of row-by-row processing.

The DictionaryBlock is pushed through only if all the input blocks are DictionaryBlocks and use the same instance of DictionaryBlock.dictionary.
This limits the negative impact of using dictionaries due to megamorphic calls while still getting the benefit of transporting dictionary blocks over the network.
@lukasz-stec lukasz-stec force-pushed the ls/051-poo-dictionary-support branch from b4514ae to 478b4f5 Compare April 5, 2023 07:44
@lukasz-stec (Member Author) left a review:

Comments answered and addressed.


@lukasz-stec (Member Author):

CI failed with fatal: unable to access 'https://github.com/airlift/jvmkill/': The requested URL returned error: 429 🤷

        else {
            newSize = initialEntryCount;
        }
        newSize = Math.max(newSize, capacity);
Member:

Is there any possibility that newSize will be lower than capacity?

Member Author:

Well, yes: capacity can be bigger than initialEntryCount, or bigger than 1.5 * dictionaryIds.length (see calculateNewArraySize).

Member:

Why not just do int newSize = calculateNewArraySize(max(initialEntryCount, capacity, dictionaryIds.length))?

Of course, this will work differently than now. If capacity is 100, it will create an array of size 150.

Member Author:

We do not want to go over initialEntryCount, as the appender can be "pre-sized" in #reset.
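The sizing discussion above can be sketched as follows. This is a hypothetical reconstruction under assumptions (illustrative names, and an assumed 1.5x growth policy for calculateNewArraySize), not the PR's actual code: the new size comes from either geometric growth or the pre-sized initial entry count, then is clamped from below by the requested capacity.

```java
import static java.lang.Math.max;

// Hypothetical sketch of the array sizing logic discussed above.
// Names and the 1.5x growth factor are assumptions for illustration.
public class ArraySizingSketch
{
    // Mimics a calculateNewArraySize-style 1.5x growth policy (assumed).
    static int calculateNewArraySize(int currentLength)
    {
        return currentLength + (currentLength >> 1);
    }

    static int newSize(int currentLength, int initialEntryCount, int requiredCapacity, boolean growing)
    {
        int newSize;
        if (growing) {
            newSize = calculateNewArraySize(currentLength);
        }
        else {
            // First allocation honors the pre-sized initial entry count.
            newSize = initialEntryCount;
        }
        // requiredCapacity may exceed both the initial entry count and the
        // grown length, so clamp from below.
        return max(newSize, requiredCapacity);
    }

    public static void main(String[] args)
    {
        // Pre-sized initial entry count wins over a small required capacity.
        if (newSize(0, 1024, 100, false) != 1024) {
            throw new AssertionError();
        }
        // A large required capacity wins over geometric growth.
        if (newSize(64, 1024, 200, true) != 200) {
            throw new AssertionError();
        }
        System.out.println("ok");
    }
}
```

Folding everything into one calculateNewArraySize(max(...)) call, as suggested above, would over-allocate when capacity alone drives the size (100 would become 150), which is why the clamp is applied after choosing the base size.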

@sopel39 (Member) commented Apr 13, 2023:

@lukasz-stec what are we waiting for here?

@lukasz-stec (Member Author):

Stable benchmarks, mainly to see the impact of dictionary block flattening.

@raunaqmorarka raunaqmorarka merged commit 53f0dc7 into trinodb:master Apr 18, 2023
@raunaqmorarka raunaqmorarka deleted the ls/051-poo-dictionary-support branch April 18, 2023 08:48
@github-actions github-actions bot added this to the 414 milestone Apr 18, 2023