Implement runtime adaptive partitioning for FTE #18349
losipiuk merged 5 commits into trinodb:master
Conversation
Force-pushed 01856c3 to 942e9cc
Without reading deeply, it looks like the repartitioning logic should be extracted to a separate method updateStagesPartitioning, and then called either from optimizePlan (preferably) or from schedule (if for some reason it needs to be there).
Hmm, that means I need to call isReadyForExecution twice. Not sure what is more important --- code organization or efficiency.
...-main/src/main/java/io/trino/execution/scheduler/EventDrivenFaultTolerantQueryScheduler.java (resolved)
It is not obvious to me why partitioned sources are always empty - can you add an explanatory comment in the code?
Partitioned sources will always be empty --- but remote partitioned sources are not. The input to the newly inserted stage will only be from the runtimeAdaptiveRemoteSourceNode.
@sopel39 can you verify if we can always just insert an extra stage which does the additional partitioning? Are we sure that the partitioningScheme taken from the sourceFragment will be able to operate on the output of the sourceFragment?
@losipiuk I'm not sure what this code is about. An extra stage will always hurt performance. Is it maybe possible to improve the existing operators in the source stage instead?
@sopel39: updated the commit message. Basically, we found that performance suffers when the number of partitions is large, and we see that 200 partitions is enough for most FTE cases. So small queries will just use 200 partitions; for huge queries, if we see that the size of an intermediate stage is large, then we change the partition count (and therefore need to insert extra shuffle stages).
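The decision described above --- start with a small partition count and only grow it when a stage's observed input is large --- can be sketched roughly as follows. This is an illustrative toy, not Trino's actual implementation; the class name, constants, and thresholds are all made up for the example.

```java
// Hypothetical sketch: pick a partition count so each partition processes a
// bounded amount of data, starting from a small default. Names and thresholds
// are illustrative, not Trino's.
public class AdaptivePartitionCountSketch {
    static final int DEFAULT_PARTITION_COUNT = 200;      // enough for most queries
    static final int MAX_PARTITION_COUNT = 1000;         // upper bound after adaptation
    static final long TOTAL_TARGET_BYTES = 500L << 30;   // assumed comfortable total at 200 partitions

    static int partitionCountFor(long estimatedInputBytes) {
        long perPartitionTarget = TOTAL_TARGET_BYTES / DEFAULT_PARTITION_COUNT;
        // ceil-divide the estimated input by the per-partition target
        long needed = (estimatedInputBytes + perPartitionTarget - 1) / perPartitionTarget;
        return (int) Math.min(MAX_PARTITION_COUNT, Math.max(DEFAULT_PARTITION_COUNT, needed));
    }

    public static void main(String[] args) {
        System.out.println(partitionCountFor(1L << 30));    // small query stays at the default
        System.out.println(partitionCountFor(2000L << 30)); // huge query gets a larger count
    }
}
```

When the chosen count differs from the current one, the scheduler would then need to insert the extra shuffle stages discussed above before the affected stage starts.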
I would suggest having a separate test which enforces that. It could be the whole AbstractTestAggregations.java, and maybe more, but with a different engine configuration.
Make sure to add assertions which verify that runtime repartitioning actually happens, so we know that the test actually provides coverage.
I would suggest having a separate test which enforces that
Yes, I can do that. Probably for both AbstractTestAggregations.java and AbstractTestJoins.java.
Make sure to add assertions which verify that runtime repartitioning actually happens, so we know that the test actually provides coverage.
Hmm, I don't see an easy way to do this.
core/trino-main/src/main/java/io/trino/SystemSessionProperties.java (resolved)
Force-pushed e7a8198 to 27dcf7f
Force-pushed 95902ab to 08d0879
...-main/src/main/java/io/trino/execution/scheduler/EventDrivenFaultTolerantQueryScheduler.java (resolved)
Force-pushed fa224fd to a98cb7f
core/trino-main/src/main/java/io/trino/execution/QueryManagerConfig.java (resolved)
Let's change the signature of private SubPlan optimizePlan(SubPlan plan) to
Optional<SubPlan> optimizePlan()
We already exploit the fact that the current presorted plan is cached in planInTopologicalOrder. With the current code structure it is not obvious that the plan passed as an argument and the plan shared in state match each other.
Hmm I don't quite understand. When will we return Optional.empty()?
I was thinking of returning Optional.empty() if the plan should not change. It requires some call-flow refactoring. Can be a followup.
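The proposed signature change can be illustrated with a toy sketch. SubPlan here is a stand-in record, not Trino's real class, and the threshold and field names are invented for the example; the point is only that the method reads the cached plan from state and returns Optional.empty() when nothing should change.

```java
// Toy sketch of the proposed Optional<SubPlan> optimizePlan() signature.
// SubPlan, planInTopologicalOrder, and the threshold are illustrative stand-ins.
import java.util.Optional;

public class OptimizePlanSketch {
    record SubPlan(int partitionCount) {}

    private SubPlan cachedPlan = new SubPlan(200); // plan shared in scheduler state
    private long observedInputBytes;

    // Empty result means "plan unchanged" --- callers can no longer confuse
    // an argument plan with the cached one.
    Optional<SubPlan> optimizePlan() {
        if (observedInputBytes <= 100L << 30) {
            return Optional.empty(); // no repartitioning needed
        }
        return Optional.of(new SubPlan(1000)); // rewritten plan with a larger count
    }

    public static void main(String[] args) {
        OptimizePlanSketch scheduler = new OptimizePlanSketch();
        scheduler.observedInputBytes = 1L << 30;
        System.out.println(scheduler.optimizePlan().isPresent());
        scheduler.observedInputBytes = 200L << 30;
        System.out.println(scheduler.optimizePlan().isPresent());
    }
}
```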
isReadyForExecutionCache.computeIfAbsent(...) ?
We probably can put the logic of how the cache is used into isReadyForExecution itself and just use the method. And clear the cache on each scheduler call.
isReadyForExecutionCache.computeIfAbsent(...) ?
The cache is fresh at the beginning.
We probably can put the logic of how the cache is used into isReadyForExecution itself and just use the method.
I took a look and found that it only complicates the logic of isReadyForExecution. The current version is more concise.
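The computeIfAbsent pattern discussed in this thread, with the cache cleared at the start of each scheduling loop, might look like the following toy. The stage-id type, the readiness predicate, and the counter are all illustrative stand-ins, not Trino code.

```java
// Toy sketch of a per-scheduling-loop readiness cache. The predicate and
// stage-id representation are placeholders, not Trino's actual logic.
import java.util.HashMap;
import java.util.Map;

public class ReadinessCacheSketch {
    private final Map<Integer, Boolean> isReadyForExecutionCache = new HashMap<>();
    private int expensiveChecks; // counts how often the real check runs

    boolean isReadyForExecution(int stageId) {
        // computeIfAbsent runs the expensive check at most once per loop
        return isReadyForExecutionCache.computeIfAbsent(stageId, id -> {
            expensiveChecks++;
            return id % 2 == 0; // placeholder readiness predicate
        });
    }

    void beginSchedulerIteration() {
        isReadyForExecutionCache.clear(); // cache is fresh at the beginning of each loop
    }

    public static void main(String[] args) {
        ReadinessCacheSketch s = new ReadinessCacheSketch();
        s.beginSchedulerIteration();
        s.isReadyForExecution(4); // computed
        s.isReadyForExecution(4); // served from the cache
        System.out.println(s.expensiveChecks);
    }
}
```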
Is the assumption that a stage which requires repartitioning will never be started speculatively?
OK, so this review comment is not consistent with the code :) If a stage has already started, then we will not be able to insert repartition stages before it.
Yeah - I know - that is why I ask if that is not a problem. That is, can we end up not being able to fix the query (bump the partition count) because some stage already started executing speculatively?
That's not possible, since we always check the input data size of a stage before it's started, and this applies to speculatively started stages as well.
You are probably asking about stages whose parents were speculatively started. In that case we use the extrapolated estimated output data size to determine whether we want to change the partition count at runtime.
...-main/src/main/java/io/trino/execution/scheduler/EventDrivenFaultTolerantQueryScheduler.java (resolved)
Hmm.
Can it be that we have something like
root
|
X (output partitioning: FIXED_BROADCAST_DISTRIBUTION)
|
...
|
Y
|
Z (output partitioning: HASH)
if Z's output is large now, then processing of Y may fail.
If such a tree shape is not possible, can we validate that it actually does not happen with an assertion?
If the case you mentioned above happens, how can the planner determine that the final output can be broadcast? My guess is it's determined based on input data sizes and filter ratio, which makes your suspicion an impossible scenario.
Of course, we can always be conservative here and apply runtime adaptive partitioning recursively.
Yeah - it feels not very probable - but I also don't see a big gain in special-casing this. If you want to keep the code, please explain in a code comment the rationale and why we believe it is OK to do so.
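The behavior under discussion --- recursing through the fragment tree but not descending into subtrees whose output is broadcast --- can be sketched as a toy traversal. Fragment, the enum, and the visited-order tracking are stand-ins invented for the example, not the RuntimeAdaptivePartitioningRewriter's real types; the sketch just makes losipiuk's question concrete, since Z below the broadcast fragment X is never visited.

```java
// Toy sketch: recursively override the partition count, skipping any subtree
// whose root has broadcast output partitioning. All types are illustrative.
import java.util.List;

public class BroadcastSkipSketch {
    enum Partitioning { HASH, BROADCAST }
    record Fragment(String id, Partitioning outputPartitioning, List<Fragment> children) {}

    static void overridePartitionCount(Fragment fragment, int newCount, StringBuilder visited) {
        if (fragment.outputPartitioning() == Partitioning.BROADCAST) {
            return; // skip the whole broadcast subtree
        }
        visited.append(fragment.id()); // here the real rewriter would apply newCount
        for (Fragment child : fragment.children()) {
            overridePartitionCount(child, newCount, visited);
        }
    }

    public static void main(String[] args) {
        Fragment z = new Fragment("Z", Partitioning.HASH, List.of());
        Fragment x = new Fragment("X", Partitioning.BROADCAST, List.of(z));
        Fragment root = new Fragment("R", Partitioning.HASH, List.of(x));
        StringBuilder visited = new StringBuilder();
        overridePartitionCount(root, 1000, visited);
        System.out.println(visited); // Z under the broadcast X is not rewritten
    }
}
```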
core/trino-main/src/main/java/io/trino/sql/planner/RuntimeAdaptivePartitioningRewriter.java (resolved)
...-main/src/main/java/io/trino/execution/scheduler/EventDrivenFaultTolerantQueryScheduler.java (resolved)
nit: is this test specifically about testing the skipping of broadcast subtrees, or does it generally test the rewriter mechanism? Looks like the latter. Change the name?
It's actually both. For testCreateTableAs it's the same; it also generally tests the rewriter mechanism.
...lerant-tests/src/test/java/io/trino/faulttolerant/TestOverridePartitionCountRecursively.java (resolved)
...-main/src/main/java/io/trino/execution/scheduler/EventDrivenFaultTolerantQueryScheduler.java (resolved)
.../trino-testing/src/main/java/io/trino/testing/FaultTolerantExecutionConnectorTestHelper.java (resolved)
Having a high number of partitions hurts the performance of FTE quite seriously, which presents a tradeoff between scalability and performance. This PR tries to address this problem by calculating the total input size of each stage at runtime and changing the number of partitions at runtime by inserting extra shuffles into upstream stages. With this change, users should be able to get the best of both worlds --- the majority of queries can finish efficiently with a small partition count, and huge queries will still be able to finish with a larger partition count.
`getStageInfo` is called asynchronously, so it's possible that the plan we get is a stale one even though the stages have been updated. To fix the race, we just switch the order, so that the plan will not be staler than the stageInfos.
Force-pushed a98cb7f to cb7936a
LGTM. Some questions:
This looks like it needs a release note. Can you propose one? @linzebing cc @losipiuk
@colebow: discussing offline.
TODO:
- EventDrivenFaultTolerantQueryScheduler to separate classes/files
- overridePartitionCountRecursively
- isReadyForExecution calls in a scheduling loop (probably won't do in this PR. Tradeoff between code organization and efficiency. Can't see an elegant way to avoid this)