
Mitigate Writer skewness when writing partitioned data with preferred partitioning enabled#14718

Merged
arhimondr merged 2 commits into trinodb:master from gaurav8297:gaurav8297/partition_scale_writer_new
Nov 19, 2022

Conversation

@gaurav8297
Member

@gaurav8297 gaurav8297 commented Oct 23, 2022

Description

Issue: #13379

Problem

  • Improve the performance of partitioned writes, specifically when writers/partitions are skewed.

  • Currently, preferred partitioning only kicks in when statistics are available and the estimated number of partitions is greater than preferred-write-partitioning-min-number-of-partitions (defaults to 50). However, statistics are not always present, in which case partitioned writes go through an inefficient route. With scaling, we can enable preferred partitioning for any number of partitions and therefore no longer have to rely on statistics.
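To illustrate the pre-existing behavior described above, here is a minimal sketch of the decision rule: preferred partitioning is chosen only when partition-count statistics exist and exceed the threshold. The class and method names are illustrative, not Trino's actual API.

```java
import java.util.Optional;

public class PreferredPartitioningDecision
{
    // Mirrors preferred-write-partitioning-min-number-of-partitions (default 50)
    private static final int MIN_PARTITIONS = 50;

    public static boolean usePreferredPartitioning(Optional<Double> estimatedPartitionCount)
    {
        // Without statistics the planner cannot prove the threshold is met,
        // so it falls back to the unpartitioned (round-robin) write path.
        return estimatedPartitionCount
                .map(count -> count >= MIN_PARTITIONS)
                .orElse(false);
    }

    public static void main(String[] args)
    {
        System.out.println(usePreferredPartitioning(Optional.empty()));    // false: no stats
        System.out.println(usePreferredPartitioning(Optional.of(8.0)));    // false: below threshold
        System.out.println(usePreferredPartitioning(Optional.of(2000.0))); // true
    }
}
```

With local scaling, this stats-dependent gate is no longer needed to avoid the skew penalty.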

Benchmarks

Cluster with 6 worker nodes

1.) Single partition (260M rows)

  • Without preferred partitioning: 1:31 mins
  • With preferred partitioning (before): 18:34 mins
  • With preferred partitioning (after): 2:55 mins

2.) 8 partitions (600M rows)

  • Without preferred partitioning: 2:11 mins
  • With preferred partitioning (before): 6:23 mins
  • With preferred partitioning (after): 1:48 mins

3.) 2000+ partitions with almost no skewness (600M rows)

  • Without preferred partitioning: 13:23 mins
  • With preferred partitioning (before): 11:32 mins
  • With preferred partitioning (after): 10:55 mins

4.) 2000+ partitions with 6 skewed partitions (2.74B rows)

  • Without preferred partitioning: 22:18 mins (400GB peak memory)
  • With preferred partitioning (before): 55:26 mins (76.1GB peak memory)
  • With preferred partitioning (after): 20:33 mins (79.5GB peak memory)

In experiments 3 and 4, the finishing time is also substantial (almost +9 mins) and is included in the measurements.

Non-technical explanation

Release notes

( ) This is not user-visible or docs only and no release notes are required.
( ) Release notes are required, please propose a release note for me.
( ) Release notes are required, with the following suggested text:

# Section
* Fix some things. ({issue}`issuenumber`)

@gaurav8297 gaurav8297 changed the title Local scale writers for partitioned data Scale partitions to multiple writers locally Oct 23, 2022
Member


nit: consider introducing builder

Member


The name is a bit generic. Could you also add a javadoc?

Contributor


It feels like this granularity is a bit of an overkill. With a task concurrency of 32, the implementation would have to rebalance 4096 partitions and keep state for each partition in memory. Generally, only a few partitions are skewed. I wonder if we should start with something lower, maybe 8? (32 * 8 = 256)

Member Author


With 8 or some other low value, a few small partitions could end up being scaled along with skewed ones because they belong to the same bucket. But that might be okay since 256 is per node. WDYT? @sopel39
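The trade-off discussed above can be sketched as follows: rebalancing state is kept per bucket rather than per partition, so any partitions that hash to the same bucket are scaled together. This is an illustrative sketch with assumed names, not Trino's code.

```java
public class PartitionBuckets
{
    // Maps a partition's hash to one of (writerCount * bucketsPerWriter)
    // rebalancing buckets, e.g. 32 writers * 8 buckets = 256 total.
    public static int bucket(int partitionHash, int writerCount, int bucketsPerWriter)
    {
        int totalBuckets = writerCount * bucketsPerWriter;
        // floorMod keeps the result non-negative even for negative hashes
        return Math.floorMod(partitionHash, totalBuckets);
    }

    public static void main(String[] args)
    {
        // With only 256 buckets, an unrelated small partition can collide
        // with a skewed one and get scaled alongside it.
        System.out.println(bucket("huge_partition".hashCode(), 32, 8));
        System.out.println(bucket("tiny_partition".hashCode(), 32, 8));
    }
}
```

A larger multiplier reduces collisions at the cost of more per-partition state per node.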

@gaurav8297 gaurav8297 marked this pull request as ready for review October 28, 2022 05:41
@gaurav8297 gaurav8297 requested a review from sopel39 October 28, 2022 05:51
@gaurav8297 gaurav8297 requested review from Dith3r and kabunchi October 30, 2022 16:09
@gaurav8297 gaurav8297 requested review from arhimondr and kabunchi and removed request for kabunchi November 2, 2022 10:28
@gaurav8297 gaurav8297 changed the title Scale partitions to multiple writers locally Mitigate Writer skewness when writing partitioned data with preferred partitioning enabled Nov 2, 2022
@sopel39
Member

sopel39 commented Nov 2, 2022

Up to "Introduce ScaleWriterPartitioningHandle", lgtm % comments

@gaurav8297
Member Author

gaurav8297 commented Nov 7, 2022

> Do you know the reason of it?

I think there are two reasons:

  • No global scaling (major): We don't have global scaling for preferred partitioning yet. However, with the Tardigrade skewness work, this problem might get solved.
  • Local scaling speed (minor): With preferred partitioning we scale every 100 MB, whereas on the unpartitioned route we do it on every page.
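The "scale every 100 MB" trigger mentioned in the second point can be sketched as a simple written-bytes threshold check. Names and the exact threshold handling are assumptions for illustration, not the actual Trino implementation.

```java
public class ScalingTrigger
{
    // Rebalancing is only considered after roughly this much new data
    // has been written since the last rebalance (~100 MB here).
    private static final long REBALANCE_THRESHOLD_BYTES = 100L * 1024 * 1024;

    private long bytesAtLastRebalance;

    public boolean shouldRebalance(long physicalWrittenBytes)
    {
        if (physicalWrittenBytes - bytesAtLastRebalance >= REBALANCE_THRESHOLD_BYTES) {
            bytesAtLastRebalance = physicalWrittenBytes;
            return true;
        }
        return false;
    }
}
```

Checking per page instead (as on the unpartitioned route) reacts faster to skew but spends more time in the rebalancing logic.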

@gaurav8297 gaurav8297 requested a review from sopel39 November 8, 2022 10:51
sopel39 previously approved these changes Nov 8, 2022
Member

@sopel39 sopel39 left a comment


lgtm % Add scaleWriters flag in PartitioningHandle

@sopel39 sopel39 dismissed their stale review November 8, 2022 12:55

lgtm until Add scaleWriters flag in PartitioningHandle

Member


MAX_PARTITIONS_TO_REBALANCE_PER_WRITER * 32 = 4096, which means that in the worst case writerPartitionIdToRowCount will be as big as the page.

Consider reversing the relationship, e.g: partitionRebalancer fetching writerPartitionIdToRowCount from ScaleWriterPartitioningExchanger instead.

Member Author


I thought about this, but it just makes the code more complex. I wonder if it's worth the effort.

Member Author

@gaurav8297 gaurav8297 Nov 10, 2022


I reversed the relationship. PTAL

@gaurav8297
Member Author

> @gaurav8297 Have you captured the CPU utilization as well by any chance?

In Trino, the CPU time is roughly the same for all related queries. To give a better picture, I created this doc containing CPU utilization from the AWS console:

https://docs.google.com/document/d/1Bg5z-EzavtkXnBfhNLdJE3ngXRgrYrmmTVRJwiMhpTE/edit?usp=sharing

cc @arhimondr @sopel39

@gaurav8297 gaurav8297 requested a review from sopel39 November 10, 2022 10:58
@gaurav8297
Member Author

@arhimondr @sopel39 PTAL again

Contributor

@arhimondr arhimondr left a comment


LGTM % a couple of nits

@gaurav8297
Member Author

@sopel39 PTAL again

Member

@sopel39 sopel39 left a comment


lgtm % comments

Member

@sopel39 sopel39 left a comment


% comments

Member


nit: it might be clearer if you had a field like lastScaleUpPhysicalWrittenBytes and didn't set physicalWrittenBytesAtLastRebalance to 0

First figure out if there is skewness across writers on a node, then find the biggest partitions in the skewed writers and scale them across the writers on the lower end, i.e. those that have written the smallest amount of data.

Scaling only happens if the skewness is above 70% and the partition to be scaled has written at least writerMinSize since the last scale-up.
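The policy described above can be sketched as follows. This is a minimal, simplified illustration with assumed names (the real logic lives in Trino's exchanger/rebalancer classes): find the most- and least-loaded writers, check the 70% skewness threshold, and pick the biggest eligible partition on the overloaded writer to also route to the underloaded one.

```java
public class SkewRebalanceSketch
{
    private static final double SKEWNESS_THRESHOLD = 0.7;

    // Returns the index of the biggest partition on the most-loaded writer
    // if skewness exceeds the threshold and that partition has written at
    // least writerMinSize since the last scale-up; otherwise -1.
    public static int pickPartitionToScale(long[] writerBytes, long[][] partitionBytesPerWriter, long writerMinSize)
    {
        int maxWriter = 0;
        int minWriter = 0;
        for (int i = 0; i < writerBytes.length; i++) {
            if (writerBytes[i] > writerBytes[maxWriter]) {
                maxWriter = i;
            }
            if (writerBytes[i] < writerBytes[minWriter]) {
                minWriter = i;
            }
        }
        if (writerBytes[maxWriter] == 0) {
            return -1;
        }
        double skewness = (writerBytes[maxWriter] - writerBytes[minWriter]) / (double) writerBytes[maxWriter];
        if (skewness <= SKEWNESS_THRESHOLD) {
            return -1;
        }
        // Pick the biggest partition on the skewed writer that has written
        // enough data (writerMinSize) to be worth scaling out.
        long[] partitions = partitionBytesPerWriter[maxWriter];
        int biggest = -1;
        for (int p = 0; p < partitions.length; p++) {
            if (partitions[p] >= writerMinSize && (biggest == -1 || partitions[p] > partitions[biggest])) {
                biggest = p;
            }
        }
        // The chosen partition would then additionally be routed to the
        // least-loaded writer (minWriter).
        return biggest;
    }
}
```

The writerMinSize guard prevents thrashing: a partition is only scaled out again after it has produced enough new data since the previous scale-up.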