Skip to content

Local scale writers for partitioned data#14140

Closed
gaurav8297 wants to merge 6 commits intotrinodb:masterfrom
gaurav8297:gaurav8297/partition_scale_writer
Closed

Local scale writers for partitioned data#14140
gaurav8297 wants to merge 6 commits intotrinodb:masterfrom
gaurav8297:gaurav8297/partition_scale_writer

Conversation

@gaurav8297
Copy link
Copy Markdown
Member

@gaurav8297 gaurav8297 commented Sep 15, 2022

The approach and problem are documented here: #13379

Note: Few tests are still pending which I'm working on

Description

Is this change a fix, improvement, new feature, refactoring, or other?

improvement

Is this a change to the core query engine, a connector, client library, or the SPI interfaces? (be specific)

core query engine

How would you describe this change to a non-technical end user or system administrator?

Increase the performance of partitioned writes with skewness.

Related issues, pull requests, and links

Documentation

( ) No documentation is needed.
( ) Sufficient documentation is included in this PR.
( ) Documentation PR is available with #prnumber.
( ) Documentation issue #issuenumber is filed, and can be handled later.

Release notes

( ) No release notes entries required.
( ) Release notes entries required with the following suggested text:

# Section
* Fix some things. ({issue}`issuenumber`)

@cla-bot cla-bot bot added the cla-signed label Sep 15, 2022
@gaurav8297 gaurav8297 changed the title Local scale writers for partitioned data [WIP] Local scale writers for partitioned data Sep 15, 2022
@gaurav8297 gaurav8297 changed the title [WIP] Local scale writers for partitioned data Local scale writers for partitioned data Sep 27, 2022
Copy link
Copy Markdown
Member

@lukasz-stec lukasz-stec left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

some initial comments

@gaurav8297
Copy link
Copy Markdown
Member Author

gaurav8297 commented Sep 28, 2022

Benchmarks (More than 2x improvements)

session:
preferred_write_partitioning_min_number_of_partitions=1
task_writer_count=2
task_scale_writers_min_writer_count=2
task_scale_writers_max_writer_count=8
task_scale_writers_partition_count=128

lineitem_test is partitioned over shipmode column which contains 7 big partitions.

Before (scaling disabled):

trino:insert_demo> Insert into lineitem_test select orderkey, partkey, suppkey, linenumber, quantity, extendedprice, discount, tax, returnflag, linestatus, commitdate, rec
eiptdate, shipinstruct, shipmode FROM tpch_sf100_orc_part.lineitem;
INSERT: 600037902 rows

Query 20220928_230049_00011_6j66t, FINISHED, 7 nodes
Splits: 2,763 total, 2,763 done (100.00%)
6:00 [600M rows, 15.6GB] [1.66M rows/s, 44.5MB/s]

After (scaling enabled):

trino:insert_demo> Insert into lineitem_test select orderkey, partkey, suppkey, linenumber, quantity, extendedprice, discount, tax, returnflag, linestatus, commitdate, rec
eiptdate, shipinstruct, shipmode FROM tpch_sf100_orc_part.lineitem;
INSERT: 600037902 rows

Query 20220928_225723_00008_6j66t, FINISHED, 7 nodes
Splits: 2,799 total, 2,799 done (100.00%)
2:45 [600M rows, 15.6GB] [3.63M rows/s, 96.9MB/s]

@gaurav8297 gaurav8297 marked this pull request as ready for review September 28, 2022 23:22
Pass partition channel types directly to LocalExchange
instead of all types and then filtering inside the
constructor.
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How this value is related to scaleWritersMaxWriterCount?

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not very clear. What's the difference between artificial partitions and standard partitions?
Also, actual partitions created are (in the case of hive and iceberg) based on the values in partitioning columns. I don't think this config influences that, right?

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is a lot of overlap in functionality and a lot of code copied from TableWriterOperator. Can this be somehow reused? This could improve the readability of this class by removing non-essential functionality.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Consider adding WriterUtils class for shared code

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: I would drop bucketSize and use positionsList.size() directly, the name is confusing.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: you could add a validation that there indeed aren't more operators producing stats

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

consider putting all insert operators in separate package.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: consider testing with system partitioning

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why we start here from -1 and not from 0? then adding 1 before modulo would not be required.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

then adding 1 before modulo would not be required

WDYM?

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How this will detect that we have heavy used partition? same with physicalWrittenBytes >= writerMinSize * maxWriterCount after some time (and size) we will just start writers if we have memory buffers filled.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is this related to count? it is more like getting next worker id/pointer

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this general rule that we are using Optional isPresent with ternary instead of we use map/orElseGet?

The only difference wrt to TableWriterOperator is that
it creates a separate page sink per partition and reports
the partition level physicalWrittenBytes which can be used
for scaling skewed partitions at local exchange.
This new method helps the engine to identify whether
writer scaling per partition is allowed.
@@ -88,7 +88,7 @@ public LocalExchange(
int defaultConcurrency,
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Refactor LocalExchange -> Pass partition channel types directly to LocalExchange

import static java.util.Objects.requireNonNull;
import static java.util.concurrent.TimeUnit.NANOSECONDS;

public abstract class AbstractTableWriterOperator
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add commit that renames current operator first, then add a commit that makes it abstract

updateWrittenBytes();
}

protected abstract List<ListenableFuture<?>> writePage(Page page);
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

group all abstract protected methods below public methods

bucketToPartition);
}

public Function<Page, Page> createPartitionPagePreparer(PartitioningHandle partitioning, List<Integer> partitionChannels)
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

public class SelectChannels
  implements PartitionFunction {
  public SelectChannels(PartitionFunction delegate, PartitioningHandle partitioning, List<Integer> partitionChannels) {
    ..
  }
}

}

// Specifies if writing to partition has to be performed by a single writer instance
default boolean isSingleWriterPerPartition()
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

partitioning handle is insert agnostic. After some thought, it's better to put it in ConnectorMetadata:

boolean ConnectorMetadata#isSingleWriterPerPartition(ConnectorSession session, ConnectorPartitioningHandle partitioningHandle)

public static final String SCALE_WRITERS = "scale_writers";
public static final String TASK_SCALE_WRITERS_ENABLED = "task_scale_writers_enabled";
public static final String TASK_SCALE_WRITERS_MAX_WRITER_COUNT = "task_scale_writers_max_writer_count";
public static final String TASK_SCALE_WRITERS_PARTITION_COUNT = "task_scale_writers_partition_count";
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's very technical. Let's just keep it as config (maybe) or just some high enough number (like 10k).

return physicalWrittenBytes + value;
}));
}
return result;
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

return ImmutableMap.copyOf(...)

return physicalWrittenBytes + value;
}));
}
return result;
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

return ImmutableMap.copyOf(...)

@gaurav8297
Copy link
Copy Markdown
Member Author

gaurav8297 commented Oct 23, 2022

Closing this in favour of #14718

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Development

Successfully merging this pull request may close these issues.

4 participants