Implement runtime adaptive partitioning for FTE #18349

Merged
losipiuk merged 5 commits into trinodb:master from linzebing:runtime-adaptive-v2
Aug 17, 2023

Conversation

@linzebing
Member

@linzebing linzebing commented Jul 19, 2023

Having a high number of partitions hurts the performance of FTE quite seriously, which presents a tradeoff between scalability and performance. This PR tries to address this problem by calculating the total input size of each stage at runtime and changing the number of partitions at runtime by inserting extra shuffles into upstream stages.

With this change, users should be able to get the best of both worlds --- the majority of queries can finish efficiently with a small partition count, while huge queries will still be able to finish with a larger partition count.
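The runtime decision described above can be sketched as follows. All names and thresholds here are hypothetical illustrations, not the actual scheduler code:

```java
// Hypothetical sketch of runtime adaptive partitioning: start every query
// with a small partition count and bump it only when a stage's total input
// size crosses a threshold. All names and numbers are illustrative.
public class AdaptivePartitioning {
    static final int DEFAULT_PARTITION_COUNT = 200;  // enough for most queries
    static final int MAX_PARTITION_COUNT = 1000;     // cap for huge queries
    static final long GROWTH_THRESHOLD_BYTES = 50L * 1024 * 1024 * 1024; // assumed 50 GB

    // Decide the partition count for a stage given its observed input size.
    static int choosePartitionCount(long totalInputBytes) {
        if (totalInputBytes <= GROWTH_THRESHOLD_BYTES) {
            return DEFAULT_PARTITION_COUNT;
        }
        // Scale the partition count with input size, but never exceed the cap.
        long scaled = DEFAULT_PARTITION_COUNT * (totalInputBytes / GROWTH_THRESHOLD_BYTES + 1);
        return (int) Math.min(scaled, MAX_PARTITION_COUNT);
    }

    public static void main(String[] args) {
        System.out.println(choosePartitionCount(10L << 30));  // small query, prints 200
        System.out.println(choosePartitionCount(500L << 30)); // huge query, prints 1000
    }
}
```

Once the scheduler decides on a larger count for a stage, already-planned upstream stages no longer produce matching partitions, which is why extra shuffle stages have to be inserted.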

TODO:

  • Verify this works for all query shapes of TPC-DS and TPC-H
  • Extract the logic in EventDrivenFaultTolerantQueryScheduler to separate classes/files
  • Add tests for overridePartitionCountRecursively
  • Avoid duplicate isReadyForExecution calls in a scheduling loop (probably won't do in this PR; it is a tradeoff between code organization and efficiency, and I can't see an elegant way to avoid this)
  • Verify for TPC-DS sf10000 ETL suite

@cla-bot cla-bot bot added the cla-signed label Jul 19, 2023
@linzebing linzebing force-pushed the runtime-adaptive-v2 branch from 01856c3 to 942e9cc Compare July 19, 2023 17:46
Comment on lines 805 to 843
Member

@losipiuk losipiuk Jul 19, 2023

Without reading deeply, it looks like the repartitioning logic should be extracted to a separate method updateStagesPartitioning, and then called either from optimizePlan (preferably) or from schedule (if for some reason it needs to be there).

Member Author

Hmm, that means I need to call isReadyForExecution twice. Not sure which is more important --- code organization or efficiency.

Member

It is not obvious to me why partitioned sources are always empty --- can you add an explanatory comment in the code?

Member Author

partitioned sources will always be empty --- but remote partitioned sources are not. The input to the runtimeAdaptiveRemoteSourceNode will only be from the 'runtimeAdaptiveRemoteSourceNode'

Member

@sopel39 can you verify whether we can always just insert an extra stage which does the additional partitioning? Are we sure that the partitioningScheme taken from sourceFragment will be able to operate on the output of the sourceFragment?

Member

@losipiuk I'm not sure what this code is about. An extra stage will always hurt performance. Would it be possible instead to improve existing operators in the source stage?

Member Author

@sopel39: updated the commit message. Basically, we found that performance suffers when the number of partitions is large, and we see that 200 partitions will be enough for most FTE cases. So small queries will just use 200 partitions; for huge queries, if we see that the size of an intermediate stage is large, we change the partition count (and therefore need to insert extra shuffle stages).

Member

I would suggest having a separate test which enforces that. It could be the whole AbstractTestAggregations.java, and maybe more, but with a different engine configuration.

Make sure to add assertions which verify that runtime repartitioning actually happens, so we know the test actually provides coverage.

Member Author

I would suggest having a separate test which enforces that

Yes, I can do that. Probably for both AbstractTestAggregations.java and AbstractTestJoins.java.

Make sure to add assertions which verify that runtime repartitioning actually happens, so we know the test actually provides coverage.

Hmm, I don't see an easy way to do this

@linzebing linzebing force-pushed the runtime-adaptive-v2 branch 9 times, most recently from e7a8198 to 27dcf7f Compare July 27, 2023 13:32
@linzebing linzebing marked this pull request as ready for review July 27, 2023 13:44
@linzebing linzebing force-pushed the runtime-adaptive-v2 branch 5 times, most recently from 95902ab to 08d0879 Compare July 31, 2023 00:49
@linzebing linzebing changed the title from Implement runtime adaptive partitioning to Implement runtime adaptive partitioning for FTE Jul 31, 2023
@linzebing linzebing force-pushed the runtime-adaptive-v2 branch from fa224fd to a98cb7f Compare August 7, 2023 18:38
Member

Let's change the signature of private SubPlan optimizePlan(SubPlan plan) to
Optional<SubPlan> optimizePlan()

We already exploit the fact that the current presorted plan is cached in planInTopologicalOrder. With the current code structure it is not obvious that the plan passed as an argument and the plan shared in state match each other.

Member Author

Hmm, I don't quite understand. When would we return Optional.empty()?

Member

I was thinking of returning Optional.empty() if the plan should not change. It requires some call-flow refactoring. Can be a follow-up.
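A minimal sketch of the Optional-returning shape discussed here, with hypothetical names (the real method operates on SubPlan inside the scheduler):

```java
import java.util.Optional;

// Illustrative sketch of the Optional-returning shape discussed above: the
// optimizer reads the cached plan from its own state and returns
// Optional.empty() when nothing changed. Names here are hypothetical.
public class PlanOptimizerSketch {
    private String cachedPlan = "plan-v1";

    Optional<String> optimizePlan(boolean needsRepartitioning) {
        if (!needsRepartitioning) {
            return Optional.empty(); // plan unchanged; caller keeps cached state
        }
        cachedPlan = cachedPlan + "+shuffle";
        return Optional.of(cachedPlan);
    }

    public static void main(String[] args) {
        PlanOptimizerSketch optimizer = new PlanOptimizerSketch();
        System.out.println(optimizer.optimizePlan(false).isPresent()); // prints false
        System.out.println(optimizer.optimizePlan(true).get());        // prints plan-v1+shuffle
    }
}
```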

Member

isReadyForExecutionCache.computeIfAbsent(...) ?

Member

We can probably put the logic of how the cache is used into isReadyForExecution itself and just use the method, clearing the cache on each scheduler call.

Member Author

isReadyForExecutionCache.computeIfAbsent(...) ?

Cache is fresh at the beginning.

Member Author

We probably can put the logic of how the cache is used into isReadyForExecution itself and just use the method.

I took a look and found that it only complicates the logic of isReadyForExecution. The current version is more concise.
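For illustration, the per-loop caching discussed in this thread can be sketched like this (hypothetical class and names, not the actual scheduler code):

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Predicate;

// Illustrative sketch of the caching discussed above: memoize a per-stage
// readiness check for one scheduling loop, clearing the cache at the start
// of each loop. Class and method names are hypothetical.
public class ReadinessCache {
    private final Map<Integer, Boolean> isReadyForExecutionCache = new HashMap<>();
    private final Predicate<Integer> expensiveCheck;

    ReadinessCache(Predicate<Integer> expensiveCheck) {
        this.expensiveCheck = expensiveCheck;
    }

    // Called at the start of each scheduling loop so cached results never go stale.
    void startSchedulingLoop() {
        isReadyForExecutionCache.clear();
    }

    boolean isReadyForExecution(int stageId) {
        // computeIfAbsent runs the expensive check at most once per loop per stage
        return isReadyForExecutionCache.computeIfAbsent(stageId, expensiveCheck::test);
    }

    public static void main(String[] args) {
        int[] calls = {0};
        ReadinessCache cache = new ReadinessCache(id -> { calls[0]++; return id % 2 == 0; });
        cache.startSchedulingLoop();
        cache.isReadyForExecution(4);
        cache.isReadyForExecution(4); // served from the cache
        System.out.println(calls[0]); // prints 1
    }
}
```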

Member

Is the assumption that a stage which requires repartitioning will never be started speculatively?

Member Author

OK, so this review comment is not consistent with the code :) If a stage has already started, then we will not be able to insert repartition stages before it.

Member

Yeah, I know --- that is why I am asking whether it is a problem. That is, can we end up not being able to fix the query (bump the partition count) because some stage already started executing speculatively?

Member Author

That's not possible, since we always check the input data size of a stage before it is started, and this applies to speculatively started stages as well.

You are probably asking about stages whose parents were speculatively started. In that case we use the extrapolated estimated output data size to determine whether we want to change the partition count at runtime.
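The extrapolation mentioned above could be sketched as a simple linear scale-up; this linear model is an illustrative assumption, not the exact estimator used by the scheduler:

```java
// Sketch of the extrapolation mentioned above: if a speculatively started
// stage has finished only some of its tasks, estimate its total output by
// scaling the observed output linearly. This linear model is an assumption
// for illustration, not the exact estimator.
public class OutputSizeEstimator {
    static long extrapolateOutputBytes(long observedOutputBytes, int finishedTasks, int totalTasks) {
        if (finishedTasks == 0) {
            return 0; // nothing observed yet, so no estimate is possible
        }
        // Assume the remaining tasks produce output at the same average rate.
        return observedOutputBytes / finishedTasks * totalTasks;
    }

    public static void main(String[] args) {
        // 2 of 10 tasks finished and produced 100 bytes -> estimate 500 bytes total
        System.out.println(extrapolateOutputBytes(100, 2, 10)); // prints 500
    }
}
```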

Member

Hmm.

Can it be that we have something like

root
|  
X (output partitioning: FIXED_BROADCAST_DISTRIBUTION)
|
...
|
Y 
|
Z (output partitioning: HASH)

If the output of Z is large now, then processing of Y may fail.

If such a tree shape is not possible, can we validate that it actually does not happen with an assertion?
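The suggested assertion could be sketched as a tree walk over a simplified fragment model; all types here are hypothetical, not Trino's plan model:

```java
import java.util.List;

// Sketch of the suggested assertion: walk a simplified fragment tree and
// verify that no hash-partitioned fragment sits underneath a broadcast
// fragment. All types here are hypothetical, not Trino's plan model.
public class BroadcastSubtreeCheck {
    enum Partitioning { HASH, BROADCAST, SINGLE }

    record Fragment(Partitioning outputPartitioning, List<Fragment> children) {}

    // Returns true if no HASH fragment appears anywhere below a BROADCAST one.
    static boolean noHashUnderBroadcast(Fragment fragment) {
        if (fragment.outputPartitioning() == Partitioning.BROADCAST) {
            return subtreeHasNoHash(fragment);
        }
        return fragment.children().stream().allMatch(BroadcastSubtreeCheck::noHashUnderBroadcast);
    }

    private static boolean subtreeHasNoHash(Fragment fragment) {
        return fragment.children().stream().allMatch(child ->
                child.outputPartitioning() != Partitioning.HASH && subtreeHasNoHash(child));
    }

    public static void main(String[] args) {
        Fragment z = new Fragment(Partitioning.HASH, List.of());
        Fragment x = new Fragment(Partitioning.BROADCAST, List.of(z));
        Fragment root = new Fragment(Partitioning.SINGLE, List.of(x));
        System.out.println(noHashUnderBroadcast(root)); // prints false: suspicious shape detected
    }
}
```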

Member Author

If the case you mentioned above happens, how can the planner determine that the final output can be broadcast? My guess is that it's determined based on input data sizes and filter ratios, which makes the scenario you suspect impossible.

Of course, we can always be conservative here and apply runtime adaptive partitioning recursively.

Member

Yeah, it feels not very probable, but I also don't see a big gain from special-casing this. If you want to keep the code, please explain in a code comment the rationale and why we believe it is OK to do so.

Member Author

Added a comment.

Member

nit: is this test specifically about skipping broadcast subtrees, or does it generally test the rewriter mechanism? It looks like the latter. Change the name?

Member Author

It's actually both. For testCreateTableAs it's the same: it also generally tests the rewriter mechanism.

`getStageInfo` is called asynchronously, so it's possible that the plan we get is a stale one even though the stages have been updated. To fix the race, we simply switch the order, so that the plan will not be staler than the stageInfos.
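The ordering argument in this commit message can be illustrated with a toy version-stamp example (names are hypothetical): if writers update the plan before the stage infos, readers that load stage infos first can never see a plan staler than the stage infos they read.

```java
// Toy illustration of the ordering fix described in the commit message
// above: writers update the plan before the stage infos, readers load the
// stage infos before the plan, so a reader's plan is never staler than its
// stage infos. Names are hypothetical.
public class SnapshotOrdering {
    private volatile long planVersion;
    private volatile long stageInfoVersion;

    void update(long version) {
        planVersion = version;      // writer: plan first...
        stageInfoVersion = version; // ...then stage infos
    }

    long[] snapshot() {
        long stageInfos = stageInfoVersion; // reader: stage infos first...
        long plan = planVersion;            // ...then plan
        return new long[] {plan, stageInfos};
    }

    public static void main(String[] args) {
        SnapshotOrdering state = new SnapshotOrdering();
        state.update(5);
        long[] snap = state.snapshot();
        // Invariant: the plan version is at least as new as the stage-info version.
        System.out.println(snap[0] >= snap[1]); // prints true
    }
}
```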
@linzebing linzebing force-pushed the runtime-adaptive-v2 branch from a98cb7f to cb7936a Compare August 16, 2023 19:30
@losipiuk
Member

losipiuk commented Aug 17, 2023

LGTM.

Some questions:

  • Are newly created stages visible in WebUI right now?
  • Also what about query completion event?
  • And output of EXPLAIN ANALYZE?

@losipiuk losipiuk merged commit 041a7ad into trinodb:master Aug 17, 2023
@github-actions github-actions bot added this to the 424 milestone Aug 17, 2023
@colebow
Member

colebow commented Aug 17, 2023

This looks like it needs a release note. Can you propose one? @linzebing

cc @losipiuk

@linzebing
Member Author

@colebow : discussing offline.

@gaurav8297 gaurav8297 mentioned this pull request Aug 29, 2024

6 participants