
Mitigate Writer skewness when writing partitioned data with preferred partitioning enabled#14718

Merged
arhimondr merged 2 commits into trinodb:master from gaurav8297:gaurav8297/partition_scale_writer_new
Nov 19, 2022

Conversation

@gaurav8297
Member

@gaurav8297 gaurav8297 commented Oct 23, 2022

Description

Issue: #13379

Problem

  • Improve the performance of partitioned writes, specifically when writers/partitions are skewed.

  • Currently, preferred partitioning only kicks in when statistics are available and the estimated number of partitions is greater than preferred-write-partitioning-min-number-of-partitions (defaults to 50). However, statistics are not always present, in which case partitioned writes go through an inefficient route. With scaling, we can enable preferred partitioning for any number of partitions and therefore no longer have to rely on statistics.
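To illustrate the pre-existing behavior described above, here is a minimal sketch of the decision rule: preferred partitioning is chosen only when partition-count statistics exist and exceed the threshold. The class and method names are illustrative, not Trino's actual API.

```java
import java.util.Optional;

public class PreferredPartitioningDecision
{
    // Mirrors preferred-write-partitioning-min-number-of-partitions (default 50)
    private static final int MIN_PARTITIONS = 50;

    public static boolean usePreferredPartitioning(Optional<Double> estimatedPartitionCount)
    {
        // Without statistics the planner cannot prove the threshold is met,
        // so it falls back to the unpartitioned (round-robin) write path.
        return estimatedPartitionCount
                .map(count -> count >= MIN_PARTITIONS)
                .orElse(false);
    }

    public static void main(String[] args)
    {
        System.out.println(usePreferredPartitioning(Optional.empty()));    // false: no stats
        System.out.println(usePreferredPartitioning(Optional.of(8.0)));    // false: below threshold
        System.out.println(usePreferredPartitioning(Optional.of(2000.0))); // true
    }
}
```

With local scaling, this stats-dependent gate is no longer needed to avoid the skew penalty.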

Benchmarks

Cluster with 6 worker nodes

1.) Single partition (260M rows)

  • Without preferred partitioning: 1:31 mins
  • With preferred partitioning (before): 18:34 mins
  • With preferred partitioning (after): 2:55 mins

2.) 8 partitions (600M rows)

  • Without preferred partitioning: 2:11 mins
  • With preferred partitioning (before): 6:23 mins
  • With preferred partitioning (after): 1:48 mins

3.) 2000+ partitions with almost no skewness (600M rows)

  • Without preferred partitioning: 13:23 mins
  • With preferred partitioning (before): 11:32 mins
  • With preferred partitioning (after): 10:55 mins

4.) 2000+ partitions with 6 skewed partitions (2.74B rows)

  • Without preferred partitioning: 22:18 mins (400GB peak memory)
  • With preferred partitioning (before): 55:26 mins (76.1GB peak memory)
  • With preferred partitioning (after): 20:33 mins (79.5GB peak memory)

In experiments 3 and 4, the finishing time is also substantial (almost +9 mins) and is included in the measurements.

Non-technical explanation

Release notes

( ) This is not user-visible or docs only and no release notes are required.
( ) Release notes are required, please propose a release note for me.
( ) Release notes are required, with the following suggested text:

# Section
* Fix some things. ({issue}`issuenumber`)

@gaurav8297 gaurav8297 changed the title Local scale writers for partitioned data Scale partitions to multiple writers locally Oct 23, 2022
Member


nit: consider introducing builder

Member


The name is a bit generic. Could you also add a javadoc?

Contributor


It feels like this granularity is a bit of an overkill. With a task concurrency of 32, the implementation would have to rebalance 4096 partitions and keep state for each partition in memory. Generally, only a few partitions are skewed. I wonder if we should start with something lower, maybe 8? (32 * 8 = 256)

Member Author


With 8 or some other low value, a few small partitions could end up being scaled along with skewed ones because they belong to the same bucket. But that might be okay since 256 is per node. WDYT? @sopel39
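The trade-off discussed above can be sketched as follows: rebalancing state is kept per bucket rather than per partition, so any partitions that hash to the same bucket are scaled together. This is an illustrative sketch with assumed names, not Trino's code.

```java
public class PartitionBuckets
{
    // Maps a partition's hash to one of (writerCount * bucketsPerWriter)
    // rebalancing buckets, e.g. 32 writers * 8 buckets = 256 total.
    public static int bucket(int partitionHash, int writerCount, int bucketsPerWriter)
    {
        int totalBuckets = writerCount * bucketsPerWriter;
        // floorMod keeps the result non-negative even for negative hashes
        return Math.floorMod(partitionHash, totalBuckets);
    }

    public static void main(String[] args)
    {
        // With only 256 buckets, an unrelated small partition can collide
        // with a skewed one and get scaled alongside it.
        System.out.println(bucket("huge_partition".hashCode(), 32, 8));
        System.out.println(bucket("tiny_partition".hashCode(), 32, 8));
    }
}
```

A larger multiplier reduces collisions at the cost of more per-partition state per node.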

@gaurav8297 gaurav8297 marked this pull request as ready for review October 28, 2022 05:41
@gaurav8297 gaurav8297 requested a review from sopel39 October 28, 2022 05:51
@gaurav8297 gaurav8297 requested review from Dith3r and kabunchi October 30, 2022 16:09
@gaurav8297 gaurav8297 requested review from arhimondr and kabunchi and removed request for kabunchi November 2, 2022 10:28
@gaurav8297 gaurav8297 changed the title Scale partitions to multiple writers locally Mitigate Writer skewness when writing partitioned data with preferred partitioning enabled Nov 2, 2022
@sopel39
Member

sopel39 commented Nov 2, 2022

Up to "Introduce ScaleWriterPartitioningHandle", lgtm % comments

@gaurav8297
Member Author

gaurav8297 commented Nov 7, 2022

> Do you know the reason of it?

I think there are two reasons:

  • No global scaling (major): We don't have global scaling for preferred partitioning yet. However, with the Tardigrade skewness work, this problem might get solved.
  • Local scaling speed (minor): With preferred partitioning we scale every 100 MB, whereas on the unpartitioned route we do it on every page.
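The "scale every 100 MB" trigger mentioned in the second point can be sketched as a simple written-bytes threshold check. Names and the exact threshold handling are assumptions for illustration, not the actual Trino implementation.

```java
public class ScalingTrigger
{
    // Rebalancing is only considered after roughly this much new data
    // has been written since the last rebalance (~100 MB here).
    private static final long REBALANCE_THRESHOLD_BYTES = 100L * 1024 * 1024;

    private long bytesAtLastRebalance;

    public boolean shouldRebalance(long physicalWrittenBytes)
    {
        if (physicalWrittenBytes - bytesAtLastRebalance >= REBALANCE_THRESHOLD_BYTES) {
            bytesAtLastRebalance = physicalWrittenBytes;
            return true;
        }
        return false;
    }
}
```

Checking per page instead (as on the unpartitioned route) reacts faster to skew but spends more time in the rebalancing logic.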

@gaurav8297 gaurav8297 requested a review from sopel39 November 8, 2022 10:51
sopel39 previously approved these changes Nov 8, 2022
Member

@sopel39 sopel39 left a comment


lgtm % Add scaleWriters flag in PartitioningHandle

@sopel39 sopel39 dismissed their stale review November 8, 2022 12:55

lgtm until Add scaleWriters flag in PartitioningHandle

Member


MAX_PARTITIONS_TO_REBALANCE_PER_WRITER * 32 = 4096, which means that in the worst case writerPartitionIdToRowCount will be as big as the page.

Consider reversing the relationship, e.g: partitionRebalancer fetching writerPartitionIdToRowCount from ScaleWriterPartitioningExchanger instead.

Member Author


I thought about this, but it just makes the code more complex. I wonder if it's worth the effort.

Member Author

@gaurav8297 gaurav8297 Nov 10, 2022


I reversed the relationship. PTAL

@gaurav8297
Member Author

> @gaurav8297 Have you captured the CPU utilization as well by any chance?

In Trino, the CPU time is roughly the same for all related queries. To give a better picture, I created this doc containing CPU utilization from the AWS console:

https://docs.google.com/document/d/1Bg5z-EzavtkXnBfhNLdJE3ngXRgrYrmmTVRJwiMhpTE/edit?usp=sharing

cc @arhimondr @sopel39

@gaurav8297 gaurav8297 requested a review from sopel39 November 10, 2022 10:58
@gaurav8297
Member Author

@arhimondr @sopel39 PTAL again

Contributor

@arhimondr arhimondr left a comment


LGTM % a couple of nits

@gaurav8297
Member Author

@sopel39 PTAL again

Member

@sopel39 sopel39 left a comment


lgtm % comments

Member

@sopel39 sopel39 left a comment


% comments

Member


nit: it might be clearer if you had a field like lastScaleUpPhysicalWrittenBytes and didn't set physicalWrittenBytesAtLastRebalance to 0

First figure out if there is skewness across writers on a node, then find the biggest partitions in the skewed writers and scale them across the writers on the lower end, i.e. those that have written the smallest amount of data.

Scaling only happens if the skewness is above 70% and the partition to be scaled has written at least writerMinSize since the last scale-up.
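The policy described above can be sketched as follows. This is a minimal, simplified illustration with assumed names (the real logic lives in Trino's exchanger/rebalancer classes): find the most- and least-loaded writers, check the 70% skewness threshold, and pick the biggest eligible partition on the overloaded writer to also route to the underloaded one.

```java
public class SkewRebalanceSketch
{
    private static final double SKEWNESS_THRESHOLD = 0.7;

    // Returns the index of the biggest partition on the most-loaded writer
    // if skewness exceeds the threshold and that partition has written at
    // least writerMinSize since the last scale-up; otherwise -1.
    public static int pickPartitionToScale(long[] writerBytes, long[][] partitionBytesPerWriter, long writerMinSize)
    {
        int maxWriter = 0;
        int minWriter = 0;
        for (int i = 0; i < writerBytes.length; i++) {
            if (writerBytes[i] > writerBytes[maxWriter]) {
                maxWriter = i;
            }
            if (writerBytes[i] < writerBytes[minWriter]) {
                minWriter = i;
            }
        }
        if (writerBytes[maxWriter] == 0) {
            return -1;
        }
        double skewness = (writerBytes[maxWriter] - writerBytes[minWriter]) / (double) writerBytes[maxWriter];
        if (skewness <= SKEWNESS_THRESHOLD) {
            return -1;
        }
        // Pick the biggest partition on the skewed writer that has written
        // enough data (writerMinSize) to be worth scaling out.
        long[] partitions = partitionBytesPerWriter[maxWriter];
        int biggest = -1;
        for (int p = 0; p < partitions.length; p++) {
            if (partitions[p] >= writerMinSize && (biggest == -1 || partitions[p] > partitions[biggest])) {
                biggest = p;
            }
        }
        // The chosen partition would then additionally be routed to the
        // least-loaded writer (minWriter).
        return biggest;
    }
}
```

The writerMinSize guard prevents thrashing: a partition is only scaled out again after it has produced enough new data since the previous scale-up.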