Local scale writers for partitioned data#14140
gaurav8297 wants to merge 6 commits into trinodb:master from gaurav8297:gaurav8297/partition_scale_writer
Conversation
core/trino-main/src/main/java/io/trino/SystemSessionProperties.java
core/trino-main/src/main/java/io/trino/operator/PartitionedTableWriterOperator.java
core/trino-main/src/main/java/io/trino/operator/exchange/ScaleWriterPartitioningExchanger.java
core/trino-main/src/main/java/io/trino/operator/exchange/ScaleWriterPartitioningExchanger.java
core/trino-main/src/main/java/io/trino/operator/PartitionedTableWriterOperator.java
Benchmarks (more than 2x improvement):
Before (scaling disabled):
After (scaling enabled):
Pass partition channel types directly to LocalExchange instead of all types and then filtering inside the constructor.
core/trino-main/src/main/java/io/trino/operator/PartitionFunctionFactory.java
core/trino-main/src/main/java/io/trino/operator/DriverContext.java
core/trino-main/src/main/java/io/trino/operator/OperatorContext.java
How is this value related to scaleWritersMaxWriterCount?
core/trino-main/src/main/java/io/trino/operator/PartitionedTableWriterOperator.java
This is not very clear. What's the difference between artificial partitions and standard partitions?
Also, the actual partitions created (in the case of Hive and Iceberg) are based on the values in the partitioning columns. I don't think this config influences that, right?
There is a lot of overlap in functionality and a lot of code copied from TableWriterOperator. Can this be somehow reused? This could improve the readability of this class by removing non-essential functionality.
Consider adding WriterUtils class for shared code
core/trino-main/src/main/java/io/trino/operator/exchange/LocalExchange.java
core/trino-main/src/main/java/io/trino/operator/exchange/ScaleWriterPartitioningExchanger.java
nit: I would drop bucketSize and use positionsList.size() directly, the name is confusing.
core/trino-main/src/main/java/io/trino/operator/exchange/ScaleWriterPartitioningExchanger.java
core/trino-main/src/main/java/io/trino/operator/exchange/ScaleWriterPartitioningExchanger.java
core/trino-main/src/main/java/io/trino/operator/exchange/ScaleWriterPartitioningExchanger.java
nit: you could add a validation that there indeed aren't more operators producing stats
core/trino-main/src/main/java/io/trino/operator/DriverContext.java
core/trino-main/src/main/java/io/trino/operator/OperatorContext.java
core/trino-main/src/main/java/io/trino/operator/OperatorContext.java
Consider putting all insert operators in a separate package.
core/trino-main/src/main/java/io/trino/operator/PartitionedTableWriterOperator.java
core/trino-main/src/main/java/io/trino/operator/PartitionedTableWriterOperator.java
core/trino-main/src/main/java/io/trino/operator/PipelineContext.java
nit: consider testing with system partitioning
core/trino-spi/src/main/java/io/trino/spi/connector/ConnectorPartitioningHandle.java
core/trino-main/src/main/java/io/trino/operator/exchange/ScaleWriterPartitioningExchanger.java
core/trino-main/src/main/java/io/trino/operator/exchange/ScaleWriterPartitioningExchanger.java
Why do we start here from -1 and not from 0? Then adding 1 before the modulo would not be required.
"then adding 1 before modulo would not be required" — what do you mean?
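The two variants under discussion produce the same round-robin sequence; a minimal sketch (all names here are illustrative, not the PR's actual code) shows why starting from 0 and incrementing after use makes the `+ 1` before the modulo unnecessary:

```java
import java.util.Arrays;

public class RoundRobinSketch
{
    // Variant from the PR: the counter starts at -1, so +1 is needed before the modulo.
    public static int[] startFromMinusOne(int writerCount, int picks)
    {
        int index = -1;
        int[] chosen = new int[picks];
        for (int i = 0; i < picks; i++) {
            index = (index + 1) % writerCount;
            chosen[i] = index;
        }
        return chosen;
    }

    // Reviewer's suggestion: start from 0 and increment after use; no +1 needed.
    public static int[] startFromZero(int writerCount, int picks)
    {
        int index = 0;
        int[] chosen = new int[picks];
        for (int i = 0; i < picks; i++) {
            chosen[i] = index % writerCount;
            index++;
        }
        return chosen;
    }

    public static void main(String[] args)
    {
        // Both variants yield the same sequence: 0, 1, 2, 0, 1
        assert Arrays.equals(startFromMinusOne(3, 5), startFromZero(3, 5));
        System.out.println(Arrays.toString(startFromZero(3, 5)));
    }
}
```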
core/trino-main/src/main/java/io/trino/operator/exchange/ScaleWriterPartitioningExchanger.java
How will this detect that we have a heavily used partition? Same with physicalWrittenBytes >= writerMinSize * maxWriterCount: after some time (and written size) we will just start more writers whenever the memory buffers are filled.
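To make the concern concrete, here is a hedged sketch of a scaling condition of the shape quoted in the comment. The names (physicalWrittenBytes, writerMinSize) follow the comment; the buffer-fullness check and the exact predicate are assumptions, not the PR's actual logic:

```java
public class ScalingSketch
{
    // Scale up only when buffers are full, we are below the writer cap,
    // and enough data has been written to justify the current writer count.
    public static boolean shouldScaleUp(
            long physicalWrittenBytes,
            long writerMinSize,
            int currentWriterCount,
            int maxWriterCount,
            boolean buffersFull)
    {
        return buffersFull
                && currentWriterCount < maxWriterCount
                && physicalWrittenBytes >= writerMinSize * currentWriterCount;
    }

    public static void main(String[] args)
    {
        long writerMinSize = 32L * 1024 * 1024; // e.g. 32MB per writer
        // Not enough data written yet for 2 writers: do not scale.
        assert !shouldScaleUp(writerMinSize, writerMinSize, 2, 8, true);
        // Enough data and buffers are full: scale up.
        assert shouldScaleUp(2 * writerMinSize, writerMinSize, 2, 8, true);
    }
}
```

The reviewer's point is that a predicate like this reacts to total written bytes and buffer pressure, not to per-partition skew, so it would eventually add writers regardless of which partition is hot.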
core/trino-main/src/main/java/io/trino/operator/exchange/ScaleWriterPartitioningExchanger.java
Is this related to a count? It is more like getting the next writer id/pointer.
Is it a general rule that we use Optional.isPresent with a ternary instead of map/orElseGet?
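For readers unfamiliar with the style question, a small illustrative sketch of the two equivalent forms being compared (the method names and example values here are made up):

```java
import java.util.Optional;

public class OptionalStyleSketch
{
    // Style 1: isPresent() plus a ternary.
    public static int ternaryStyle(Optional<String> value)
    {
        return value.isPresent() ? value.get().length() : 0;
    }

    // Style 2: map/orElseGet, which avoids the explicit get() call.
    public static int mapStyle(Optional<String> value)
    {
        return value.map(String::length).orElseGet(() -> 0);
    }

    public static void main(String[] args)
    {
        assert ternaryStyle(Optional.of("abc")) == mapStyle(Optional.of("abc"));
        assert ternaryStyle(Optional.empty()) == 0;
        assert mapStyle(Optional.empty()) == 0;
    }
}
```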
The only difference with respect to TableWriterOperator is that it creates a separate page sink per partition and reports partition-level physicalWrittenBytes, which can be used for scaling skewed partitions at the local exchange.
This new method helps the engine identify whether writer scaling per partition is allowed.
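A hedged sketch of the idea described above: track written bytes per partition so that skewed partitions can be detected and scaled. All names here are illustrative, not the PR's actual API:

```java
import java.util.HashMap;
import java.util.Map;

public class PartitionWriterSketch
{
    // One counter per partition, standing in for one page sink per partition.
    private final Map<Integer, Long> writtenBytesPerPartition = new HashMap<>();

    public void recordWrite(int partitionId, long bytes)
    {
        writtenBytesPerPartition.merge(partitionId, bytes, Long::sum);
    }

    // A skewed partition is one whose written bytes exceed some threshold,
    // making it a candidate for scaling out to more writers.
    public boolean isSkewed(int partitionId, long threshold)
    {
        return writtenBytesPerPartition.getOrDefault(partitionId, 0L) >= threshold;
    }

    public static void main(String[] args)
    {
        PartitionWriterSketch sketch = new PartitionWriterSketch();
        sketch.recordWrite(0, 100);
        sketch.recordWrite(0, 400);
        sketch.recordWrite(1, 50);
        assert sketch.isSkewed(0, 500);
        assert !sketch.isSkewed(1, 500);
    }
}
```

Reporting these per-partition byte counts to the local exchange is what lets the exchanger route a hot partition to additional writers, which a single aggregate physicalWrittenBytes counter cannot do.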
@@ -88,7 +88,7 @@ public LocalExchange(
        int defaultConcurrency,
Refactor LocalExchange -> Pass partition channel types directly to LocalExchange
import static java.util.Objects.requireNonNull;
import static java.util.concurrent.TimeUnit.NANOSECONDS;

public abstract class AbstractTableWriterOperator
Add a commit that renames the current operator first, then add a commit that makes it abstract.
    updateWrittenBytes();
}

protected abstract List<ListenableFuture<?>> writePage(Page page);
Group all abstract protected methods below the public methods.
        bucketToPartition);
}

public Function<Page, Page> createPartitionPagePreparer(PartitioningHandle partitioning, List<Integer> partitionChannels)
public class SelectChannels
        implements PartitionFunction
{
    public SelectChannels(PartitionFunction delegate, PartitioningHandle partitioning, List<Integer> partitionChannels)
    {
        ..
    }
}
}

// Specifies if writing to partition has to be performed by a single writer instance
default boolean isSingleWriterPerPartition()
The partitioning handle is insert-agnostic. After some thought, it's better to put it in ConnectorMetadata:
boolean ConnectorMetadata#isSingleWriterPerPartition(ConnectorSession session, ConnectorPartitioningHandle partitioningHandle)
public static final String SCALE_WRITERS = "scale_writers";
public static final String TASK_SCALE_WRITERS_ENABLED = "task_scale_writers_enabled";
public static final String TASK_SCALE_WRITERS_MAX_WRITER_COUNT = "task_scale_writers_max_writer_count";
public static final String TASK_SCALE_WRITERS_PARTITION_COUNT = "task_scale_writers_partition_count";
I think it's very technical. Let's just keep it as config (maybe) or just some high enough number (like 10k).
        return physicalWrittenBytes + value;
    }));
}
return result;
return ImmutableMap.copyOf(...)
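The suggestion is to return an immutable snapshot rather than the mutable accumulator. The reviewer means Guava's ImmutableMap.copyOf; this sketch uses the JDK's Map.copyOf to illustrate the same idea without the Guava dependency (method names here are illustrative):

```java
import java.util.HashMap;
import java.util.Map;

public class ImmutableReturnSketch
{
    // Return an immutable copy so callers cannot mutate internal state.
    public static Map<String, Long> aggregate(Map<String, Long> mutableResult)
    {
        return Map.copyOf(mutableResult);
    }

    public static void main(String[] args)
    {
        Map<String, Long> result = new HashMap<>();
        result.put("writer-0", 100L);
        Map<String, Long> snapshot = aggregate(result);
        try {
            snapshot.put("writer-1", 200L); // throws: the snapshot is immutable
            assert false;
        }
        catch (UnsupportedOperationException expected) {
            // mutation is rejected, as intended
        }
    }
}
```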
Closing this in favour of #14718
The approach and problem are documented here: #13379
Note: A few tests are still pending, which I'm working on.
Description
improvement
core query engine
Increase the performance of partitioned writes in the presence of data skew.
Related issues, pull requests, and links
Documentation
( ) No documentation is needed.
( ) Sufficient documentation is included in this PR.
( ) Documentation PR is available with #prnumber.
( ) Documentation issue #issuenumber is filed, and can be handled later.
Release notes
( ) No release notes entries required.
( ) Release notes entries required with the following suggested text: