Attach weights to IcebergSplits #12579
Conversation
                        true))
                .add(doubleProperty(
                        SPLIT_WEIGHT_MAX,
                        "split weight max",
I don't understand the meaning of this session property.
Is it for benchmarking only, and can it be removed?
I think we can likely remove it. Maybe @pettyjamesm can you weigh in on why there's the minimum_assigned_split_weight session property in Hive?
In the Hive implementation, the minimum split weight is a control mechanism to avoid over-queueing really small files, where the computed weight proportional to the target split size would be very small (eg: 1KB files with a split size of 64MB).
In those scenarios, the scheduler would be allowed to assign a huge number of splits to worker split queues, which could make the task update request JSON payload huge and/or skew task completion time unevenly across workers. In practice, splits have a certain fixed overhead regardless of input bytes that needs to be accounted for, so I added a minimum weight setting to address that.
I didn't have a use-case for allowing splits to be "larger" than standard, and didn't want to regress queries with, for example, "unsplittable" GZIP compressed text files larger than the target split size, so in the trino-hive implementation I chose to clamp the maximum split weight to SplitWeight.standard() (ie: proportion=1.0). The code in trino-main and trino-spi will support larger than standard weights, but I'm not aware of a use-case where it makes sense to queue significantly fewer splits than the scheduler is configured to support.
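The clamping scheme described above can be sketched as follows. This is an illustrative sketch only, not the actual trino-hive code; the method and parameter names are made up for the example:

```java
public class ClampedSplitWeight
{
    // minimumWeight plays the role of Hive's minimum_assigned_split_weight;
    // all names here are illustrative, not the trino-hive implementation
    public static double weightFor(long splitBytes, long targetSplitBytes, double minimumWeight)
    {
        double proportional = (double) splitBytes / targetSplitBytes;
        // clamp to [minimumWeight, 1.0]: tiny files cannot flood the queue,
        // and oversized (eg: unsplittable GZIP) files stay at standard weight
        return Math.min(1.0, Math.max(minimumWeight, proportional));
    }

    public static void main(String[] args)
    {
        // a 1KB file against a 64MB target would otherwise weigh ~0.000015
        System.out.println(weightFor(1024, 64L << 20, 0.05));       // 0.05
        System.out.println(weightFor(128L << 20, 64L << 20, 0.05)); // 1.0
    }
}
```

With a minimum of 0.05, the scheduler can queue at most 20x more splits than it would with all-standard weights.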
Makes sense. Would users actually know when to modify the session property or do you think we can get the same effect with a hard-coded minimum?
I tested experimentally with some scan-heavy queries over JSON files in S3 (eg: select count(*) from table) and saw the point of diminishing returns around the current default minimum value of 0.05, or a 20x increase in queue depth. It's possible that someone running in a different environment (maybe HDFS storage nodes with local SSDs?) could see further improvements by setting a lower value than I could measure, so I left it configurable just in case.
In those scenario's, the scheduler would be allowed to assign a huge number of splits to worker split queues which could make the task update request JSON payload huge
this should be taken care of by the engine, not the connector
splits have a certain fixed overhead regardless of input bytes that needed to be accounted for, so I added a minimum weight setting to address that.
good point, so we should model a split weight as constant + bytes
Would users actually know when to modify the session property or do you think we can get the same effect with a hard-coded minimum?
from the expert conversation above, I wouldn't be able to set these properties reasonably.
I think we should have some constant factor instead.
so we should model a split weight as constant + bytes
Turns out that's what Iceberg does for Spark: https://github.com/apache/iceberg/blame/9ab94f87de036c9cd91cf8353906a576b4a516ff/core/src/main/java/org/apache/iceberg/util/TableScanUtil.java#L70
I didn't account for delete files either. Splits of the same size could read a lot more data if they have delete files to read. I'm not sure if that's worth accounting for yet
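A "constant + bytes" model, in the spirit of the linked Iceberg TableScanUtil code, could look like the sketch below. The open-file-cost constant here is an assumption for illustration, not a value taken from either codebase:

```java
public class ConstantPlusBytesWeight
{
    // hypothetical fixed per-split overhead, expressed in "equivalent bytes";
    // the 4MB value is illustrative, not a constant from Iceberg or Trino
    private static final long OPEN_FILE_COST_BYTES = 4L << 20;

    public static double weightFor(long splitBytes, long targetSplitBytes)
    {
        // every split pays a constant cost plus a size-proportional cost,
        // so even empty files never round down to a near-zero weight
        return (double) (OPEN_FILE_COST_BYTES + splitBytes) / targetSplitBytes;
    }

    public static void main(String[] args)
    {
        long target = 64L << 20;
        System.out.println(weightFor(0, target));         // 0.0625: floor set by the constant
        System.out.println(weightFor(60L << 20, target)); // 1.0
    }
}
```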
this should be taken care of by the engine, not the connector
The connector is supposed to implement the weighting scheme reasonably, using whatever heuristics and control bounds make sense for that connector. In the case of hive and iceberg, there is this risk without a minimum weight; for other connectors, the splits themselves might actually benefit from allowing the scheduler to assign 100x more splits that are extremely cheap to process and serialize as JSON than the splits that connector considers "standard".
@pettyjamesm a connector doesn't know engine's overhead of sending and processing a split, so cannot determine whether "100x more" is ok
The connector should know what the JSON payload overhead of its own split type is on a per-split basis, as well as the overhead associated with processing a given split if it involves things like, eg: opening a connection to S3 or issuing a query to a MySQL database. It doesn't seem unreasonable to me that connectors be expected to implement split weighting responsibly if they choose to implement it at all.
plugin/trino-iceberg/src/main/java/io/trino/plugin/iceberg/IcebergSplit.java
@Override
public SplitWeight getSplitWeight()
{
    return splitWeight;
SplitWeight.fromProportion(length)
besides being simpler, it's better to avoid a new field, so that it's clear what the weight is, and that it's reasonable and consistent with the split's other state
We could try SplitWeight.fromRawValue(length) but that comes with the implication that 100 is the value of a "standard size split" which isn't right for length. I think if we do that the weights will imply that regular size splits are very large and may be scheduled slower. I haven't tested it though.
My understanding from skimming the code is that if all splits return fromRawValue(100), or fromProportion(1.0) splits will be scheduled exactly how they are now, which is what we want for full size splits.
That's correct. The internal representation is an integer value, normalized so that a single standard split weight is 100. This is done to avoid floating point error accumulation. This is considered to be an implementation detail, and we made the choice based on the initial PR feedback to expose a way for connectors to express split weights relative to 1.0, in case someone felt the need to adjust the representation of a "standard" weight for higher granularity, eg: normalize the "standard" weight to 1,000.
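The normalization can be made concrete with a small sketch. The constant mirrors the description above (a standard split has a raw value of 100); this is not the trino-spi source. Summing many small proportions as doubles accumulates floating point error, while summing their integer raw values stays exact:

```java
public class RawWeightDemo
{
    private static final int STANDARD_RAW_VALUE = 100; // raw 100 == proportion 1.0

    public static long toRawValue(double proportion)
    {
        return Math.round(proportion * STANDARD_RAW_VALUE);
    }

    public static void main(String[] args)
    {
        // ten splits each weighing 0.1 should total exactly one standard split
        double doubleSum = 0;
        long rawSum = 0;
        for (int i = 0; i < 10; i++) {
            doubleSum += 0.1;
            rawSum += toRawValue(0.1);
        }
        System.out.println(doubleSum == 1.0);             // false: floating point drift
        System.out.println(rawSum == STANDARD_RAW_VALUE); // true: integer sum is exact
    }
}
```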
To address the initial feedback point, you need to know the split's length in bytes relative to the configured target split size, which means that you would need to compute the value and store it as a field unless the target split size was also hard-coded.
I now understand this isn't a good idea.
To address the initial feedback point, you need to know the split's length in bytes relative to the configured target split size
that's a weird concept from the SPI perspective, since "configured target split size" is not an SPI concept itself
@pettyjamesm if a connector uses fromProportion(d), is the relation between d and 1.0 really important?
how?
(and where is it documented? :) )
A split with a weight of 1.0 is "standard". If all splits are standard weight, then the scheduler will place exactly as many splits per node and per task as the scheduler config specifies. If all splits have a weight of 0.5, then the scheduler is allowed to place 2x as many splits per node and task, and with all splits weighing 2.0, half as many.
It's not explicitly documented, although the PrestoDB PR which merged after the Trino PR did include references to this behavior in the NodeScheduler Properties Reference. We should probably port those documentation changes.
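The arithmetic in that explanation can be sketched directly. The per-node budget name below is made up for illustration; the real scheduler configuration properties differ:

```java
public class SchedulerCapacitySketch
{
    // hypothetical per-node budget, expressed in units of standard (1.0) split weight
    public static long splitsPerNode(int maxStandardSplitsPerNode, double perSplitWeight)
    {
        // the node keeps accepting splits until their combined weight reaches the budget
        return (long) Math.floor(maxStandardSplitsPerNode / perSplitWeight);
    }

    public static void main(String[] args)
    {
        System.out.println(splitsPerNode(100, 1.0)); // 100: exactly as configured
        System.out.println(splitsPerNode(100, 0.5)); // 200: 2x as many light splits
        System.out.println(splitsPerNode(100, 2.0)); // 50: half as many heavy splits
    }
}
```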
We should probably port those documentation changes.
thanks, please do
The Iceberg planTasks method buckets small files together into combined scan tasks, but these combined tasks are not used. Instead just plan individual FileScanTasks with the target size.
Could task.length() / tableScan.targetSplitSize() yield a value > 1.0? If so, you'll probably want to wrap that with Math.min(1.0, (double) task.length() / tableScan.targetSplitSize())
Looks like if the file format defines split offsets those are used instead, and I guess they could be bigger than the target split size.
So if a split is larger than standard, we want to cap that at 1.0 rather than accounting for the larger size in the queue?
That was the choice I went with in the case of trino-hive, since I didn't want to regress queries. At some point, you do have to set a cap on the upper bound, or you could have splits that never get scheduled because there's never enough room in a worker queue to accept them. You could choose to go maybe as high as 2.0 if you think that's worthwhile, but the safer bet is definitely 1.0. I don't know enough about how tableScan.targetSize() and split generation in general work in Iceberg to know whether it's a real concern.
is capping to 1.0 required by SplitWeight?
the class doesn't convey that
There is no requirement for split weights to be capped at 1.0 and larger weights are allowed if the connector wishes to express that fewer splits should be allowed to be placed per node and task. I chose not to experiment too heavily with "larger than standard" weights in the Hive connector implementation, but there wasn't any good reason to prevent other connectors from choosing to do so if it made sense for their context. It's possible that even with the Hive connector, there are scenarios that could benefit- but the information available to make a decision like that is fairly limited and I didn't pursue experimenting with it.
larger weights are allowed if the connector wishes to express that fewer splits should be allowed to be placed per node and task.
I think connector shouldn't make such a decision. After all, it doesn't control the nodes and the task scheduler and it doesn't know queue lengths, etc.
in fact, this discussion shows that this is (a) an expert-level thing to tune and (b) tuning this is a problem that blurs the engine/connector separation. Things may need to be configured on both sides for optimal behavior.
I think connector shouldn't make such a decision. After all, it doesn't control the nodes and the task scheduler and it doesn't know queue lengths, etc.
The connector doesn't need to actively participate in real time with the scheduler and know things like queue lengths, that was kind of the point for going with the "weight based" approach.
in fact, this discussion shows that this is (a) an expert-level thing to tune
I'm not so sure that's true, because "tuning" the value would always require experimentation to find the point at which workers are no longer bottlenecked on not having enough splits.
(b) tuning this is a problem that blurs engine/connector separation. Things may need to be configured on both sides for optimal behavior.
I agree that the implementation is extending the connector's ability to influence the split scheduler, but that's by design. The risk of being misconfigured, however, is fairly small. There's a wide region of "harmless suboptimality" where either a) split weights are too high, and performance could be better if workers didn't spend idle cycles waiting for the next batch of splits to be assigned, or b) split weights are too low, and the split queues are deeper than they need to be to stay busy. In practice, the "catastrophically misconfigured" scenario requires a connector to be extremely aggressive in its weight assignments, which would be a bug that connectors should be expected to fix and can be mitigated by disabling weighted scheduling within that connector if need be.
pettyjamesm left a comment
The split weight commit LGTM
Capture in a code comment why 0.05 is the default.
Add a similar comment in Hive.
I don't think we should have a session property for this. Per my understanding, this is a safety toggle.
In particular, a user may set the value to 0.0 (or close to 0.0), destabilizing the cluster.
(btw if you convince me this stays, it needs a validator as in )
I think we should remove this.
This PR #12656 removes this from hive.
@alexjo2144 we can go ahead with this PR, and I can update my PR to remove it from both
use config's default (new IcebergConfig().get...)
The weight is equal to the split size divided by the target split size.
With config properties changing, we need to modify our docs. @alexjo2144 are you able to do this, or would you like me to take the lead? cc @findepi
Thanks for the reminder. I can put a PR up with the doc change
Description
Performance improvements for queries against Iceberg tables with small files.
Performance improvement
Iceberg connector with all catalogs
Improve query execution time when the table contains many small files
Related issues, pull requests, and links
Documentation
( ) No documentation is needed.
( ) Sufficient documentation is included in this PR.
( ) Documentation PR is available with #prnumber.
( ) Documentation issue #issuenumber is filed, and can be handled later.
Release notes
(x) No release notes entries required.
( ) Release notes entries required with the following suggested text: