Fault tolerant scheduler 2.0 #14205
Conversation
On top of #14072, still WIP
Force-pushed from 4482925 to 03f90be
Force-pushed from 03f90be to aa023f0
Benchmark results (detailed): https://gist.github.com/arhimondr/02ccc06a4f145fcd5f47b91ff068c013
Force-pushed from aa023f0 to 214cad4
Rebased
Force-pushed from a3474c9 to 6f3f2e7
Ready for review
Force-pushed from 6f3f2e7 to 56deff7
...-main/src/main/java/io/trino/execution/scheduler/EventDrivenFaultTolerantQueryScheduler.java (outdated, resolved)
It does not look like you need this interface. You only need the Callback for tests.
It also looks awkward to have Callback internal to EventDrivenTaskSource, as at the interface level the two are not related at all.
I would suggest just leaving Callback as an interface; it can be moved to top level or to the StageTaskSource implementation.
Initially I was trying to model it after the existing TaskSource. But indeed, I don't think the EventDrivenTaskSourceFactory and EventDrivenTaskSource interfaces are needed. Going to remove them and rename:
StageEventDrivenTaskSourceFactory -> EventDrivenTaskSourceFactory
StageTaskSource -> EventDrivenTaskSource
Also going to move the Callback to the EventDrivenTaskSource and move the EventDrivenTaskSource out of the StageEventDrivenTaskSourceFactory (basically reducing the number of nested classes)
nit: I'd suggest marking the constructor and update as public to clearly mark what the public interface of the class is
(relevant to other classes too)
I think that's the only relevant class after moving classes around and removing the unnecessary interfaces. Please let me know if I missed anything.
...trino-main/src/main/java/io/trino/execution/scheduler/StageEventDrivenTaskSourceFactory.java (outdated, resolved)
verify(partitionAssignment.getAssignedDataSizeInBytes() > 0)? Or is there a chance it would not hold if a split reports an empty size?
Maybe for connectors which misbehave and report zero-size splits we should also mark the assignment as full based on split count?
Maybe for connectors which misbehave and report zero-size splits we should also mark as full based on splits count?
Totally forgot to implement that. We do indeed have the fault_tolerant_execution_max_task_split_count property. Implemented.
Another, less straightforward problem is around another session property I forgot about: fault_tolerant_execution_min_task_split_count.
This property is there to ensure that enough splits are assigned to a single task for it to utilize thread-level parallelism. Usually, when the file format is "splittable", it doesn't really matter. However, for non-splittable formats, where only a single split is generated for an entire file, it seems like a good idea to provide enough splits for a single task to utilize all available threads.
One goal I was trying to achieve with ArbitraryDistributionSplitAssigner was to remove the Arbitrary / Source distribution duality (they are in essence the same). However, fault_tolerant_execution_min_task_split_count only makes sense for table scan splits and doesn't make much sense for RemoteSplits (which can provide parallelism even within a single split). Currently, in ArbitraryDistributionSplitAssigner, I'm trying to make as little difference as possible between a RemoteSplit and a ConnectorSplit. Implementing fault_tolerant_execution_min_task_split_count would most certainly make that more difficult. I'm a little bit on the fence about whether we really want fault_tolerant_execution_min_task_split_count, or whether we should consider non-splittable formats a niche use case.
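The "mark as full based on split count" guard discussed above can be sketched as follows. This is a hypothetical, simplified illustration (the names are not the actual Trino code): an assignment is considered full when either the assigned data size reaches the target or the split count hits the configured maximum, which protects against connectors that report zero-size splits.

```java
// Illustrative sketch, not the actual ArbitraryDistributionSplitAssigner code.
public final class AssignmentFullness
{
    private AssignmentFullness() {}

    // "Full" when either the size target or the split-count cap is reached.
    // The split-count cap is what saves us when splits report zero size.
    public static boolean isFull(
            long assignedDataSizeInBytes,
            int assignedSplitCount,
            long targetTaskSizeInBytes,
            int maxTaskSplitCount)
    {
        return assignedDataSizeInBytes >= targetTaskSizeInBytes
                || assignedSplitCount >= maxTaskSplitCount;
    }
}
```

With this shape, a stream of zero-size splits still closes the assignment once `maxTaskSplitCount` (the `fault_tolerant_execution_max_task_split_count` value) is reached.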
...trino-main/src/main/java/io/trino/execution/scheduler/StageEventDrivenTaskSourceFactory.java (outdated, resolved)
...trino-main/src/main/java/io/trino/execution/scheduler/StageEventDrivenTaskSourceFactory.java (outdated, resolved)
...trino-main/src/main/java/io/trino/execution/scheduler/StageEventDrivenTaskSourceFactory.java (outdated, resolved)
Also maybe split the method into two - one for creating the SplitLoader and another for starting it up with top-level exception handling.
The indentation depth is below my comfort level right now.
startIfNotStarted?
We usually call it just start in other places. Do you think this one should be different?
I think (though maybe this is not really the case) that typically a start() method would throw if the object being started is already started, not just ignore the call.
Hmm, we can do that. I don't know why I implemented it to simply ignore the call. I don't think it is ever called more than once.
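The convention the reviewer describes can be sketched like this (a minimal, hypothetical example, not the actual EventDrivenTaskSource code): start() throws on a repeated call instead of silently ignoring it, which surfaces lifecycle bugs early.

```java
// Illustrative sketch of a start() that fails fast on double-start.
public class StartableSource
{
    private boolean started;

    public synchronized void start()
    {
        if (started) {
            // fail fast instead of silently ignoring the second call
            throw new IllegalStateException("already started");
        }
        started = true;
        // ... kick off split loading here ...
    }

    public synchronized boolean isStarted()
    {
        return started;
    }
}
```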
...trino-main/src/main/java/io/trino/execution/scheduler/StageEventDrivenTaskSourceFactory.java (outdated, resolved)
hmm - the test logic is not simpler than the logic in the tested code. I wonder if it is possible to make the assertions more explicit and not bloat this?
While the assignment algorithm is relatively straightforward, the different sequences of interaction are what I wanted to test. The idea is to implement the algorithm in a straightforward way and make sure that different sequences of interaction with the assigner produce the same result.
...-main/src/main/java/io/trino/execution/scheduler/EventDrivenFaultTolerantQueryScheduler.java (outdated, resolved)
losipiuk left a comment
Lots of dumb questions. Sorry
should this be called getRootStageInfo? (I know the current naming matches the QueryScheduler interface)
It is a tree internally, including stage infos for all stages. It is more of a QueryInfo at this point. Not sure getRootStageInfo wouldn't be interpreted as "stage info for the root stage only".
Those stats are aggregate stats across all stages
...-main/src/main/java/io/trino/execution/scheduler/EventDrivenFaultTolerantQueryScheduler.java (outdated, resolved)
...-main/src/main/java/io/trino/execution/scheduler/EventDrivenFaultTolerantQueryScheduler.java (outdated, resolved)
...-main/src/main/java/io/trino/execution/scheduler/EventDrivenFaultTolerantQueryScheduler.java (outdated, resolved)
...-main/src/main/java/io/trino/execution/scheduler/EventDrivenFaultTolerantQueryScheduler.java (outdated, resolved)
I am too dumb to follow this one.
Yeah, probably the beefiest method of the entire scheduler. Essentially, it schedules a task.
And to schedule a task you need to have splits and output selectors.
This method obtains splits either from an open descriptor (if a descriptor is still being built) or from a sealed descriptor stored in the task descriptor storage. It also merges in output selectors.
Open to suggestions on how to make it more readable.
Actually it is not that bad on the second go. What you could do is extract:

    private Set<PlanNodeId> getRemoteSourceIds()
    {
        // this can be cached
        Set<PlanNodeId> remoteSourceIds = new HashSet<>();
        for (RemoteSourceNode remoteSource : stage.getFragment().getRemoteSourceNodes()) {
            remoteSourceIds.add(remoteSource.getId());
        }
        return remoteSourceIds;
    }

    private Map<PlanNodeId, ExchangeSourceOutputSelector> getMergedSourceOutputSelectors()
    {
        ImmutableMap.Builder<PlanNodeId, ExchangeSourceOutputSelector> outputSelectors = ImmutableMap.builder();
        for (RemoteSourceNode remoteSource : stage.getFragment().getRemoteSourceNodes()) {
            ExchangeSourceOutputSelector mergedSelector = null;
            for (PlanFragmentId sourceFragmentId : remoteSource.getSourceFragmentIds()) {
                ExchangeSourceOutputSelector sourceFragmentSelector = sourceOutputSelectors.get(sourceFragmentId);
                if (sourceFragmentSelector == null) {
                    continue;
                }
                if (mergedSelector == null) {
                    mergedSelector = sourceFragmentSelector;
                }
                else {
                    mergedSelector = mergedSelector.merge(sourceFragmentSelector);
                }
            }
            if (mergedSelector != null) {
                outputSelectors.put(remoteSource.getId(), mergedSelector);
            }
        }
        return outputSelectors.buildOrThrow();
    }

Then you will have just

    Set<PlanNodeId> remoteSourceIds = getRemoteSourceIds();
    Map<PlanNodeId, ExchangeSourceOutputSelector> outputSelectors = getMergedSourceOutputSelectors();

in schedule().
Maybe you could also build some common interface adapter over TaskDescriptor and OpenTaskDescriptor with methods:

    public ListMultimap<PlanNodeId, Split> getSplits();
    public boolean wasNoMoreSplits(PlanNodeId remoteSourcePlanNodeId);

then you could mostly unify the handling of both. But not sure if that is worth the fuss.
Great suggestions. Refactored, it seems way more readable now. Please take a look.
Also, what is the case where a task completes but the descriptor is not created yet? LIMIT?
In this implementation we don't wait for the descriptor to be created before scheduling. It is possible that a task finishes (as you mentioned, in the LIMIT case) before the task descriptor is sealed. In such a case we don't need to store it: since the task is already finished, there will be no retry.
In such case we don't need to store it, as since the task is already finished there will be no retry
Makes sense. Do we also drop sealed task descriptors from storage when a task completes successfully?
Yes, see StagePartition#taskFinished
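The descriptor lifecycle discussed here can be sketched as follows. This is a hedged, simplified illustration (not the actual StagePartition or TaskDescriptorStorage code): a sealed descriptor is kept only while a retry might need it, and is dropped once the task finishes successfully.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative sketch of the retain-until-finished descriptor lifecycle.
public class TaskDescriptorStorageSketch
{
    private final Map<Integer, String> descriptors = new HashMap<>();

    // Called when a descriptor is sealed; it must be retained for potential retries.
    public void sealed(int partitionId, String descriptor)
    {
        descriptors.put(partitionId, descriptor);
    }

    // Called when a task finishes successfully; no retry is possible,
    // so the descriptor can be dropped to free memory.
    public void taskFinished(int partitionId)
    {
        descriptors.remove(partitionId);
    }

    public boolean hasDescriptor(int partitionId)
    {
        return descriptors.containsKey(partitionId);
    }
}
```

The same logic also explains the earlier point: if the task already finished before the descriptor was sealed, there is simply nothing worth storing.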
...-main/src/main/java/io/trino/execution/scheduler/EventDrivenFaultTolerantQueryScheduler.java (outdated, resolved)
...-main/src/main/java/io/trino/execution/scheduler/EventDrivenFaultTolerantQueryScheduler.java (outdated, resolved)
Thanks for the review. Went through the first section of comments. Going to continue tomorrow.
Force-pushed from 56deff7 to cc44e88
Force-pushed from cc44e88 to 756a430
@losipiuk Updated
...-main/src/main/java/io/trino/execution/scheduler/EventDrivenFaultTolerantQueryScheduler.java (outdated, resolved)
Not sure about this one. Could we get some adoption while keeping it false for a release or two?
I ran multiple rounds of full-scale testing and it seems to work fine. I would probably leave it on by default while keeping the old implementation as a fallback option if something goes wrong.
bail out quickly if failure != null?
These checks are important. I'm trying to verify that sources are getting closed in case of a failure.
Force-pushed from 756a430 to d2c7e1e
Updated
core/trino-main/src/main/java/io/trino/execution/QueryManagerConfig.java (outdated, resolved)
That's what auto-format does for me. Is it different on your end?
I'm lost in this part of the logic. Can you elaborate?
When a HashDistributionSplitAssigner is created, an input size estimate is provided (see Map<PlanNodeId, OutputDataSizeEstimate> outputDataSizeEstimates). Based on the estimates provided, the HashDistributionSplitAssigner assigns output partitions to tasks (to avoid small tasks). In the previous version (with a full barrier) this was done based on information obtained from the ExchangeSourceHandles. However, if speculative execution is allowed, it must be done based on estimates, as the ExchangeSourceHandles may not yet be available.
My question is more like: if we have a fixed number of partitions, shouldn't we simply try to distribute the data evenly among the assignments? What's the sense of trying to respect targetPartitionSizeInBytes?
It could work, but you still need the same input size statistics. Otherwise you do not know how many partitions you should group together to be handled by a single task.
@losipiuk: in the final result there is no information around size stats. It's a mapping from outputPartitionId to TaskPartitions. So I'm just more confused now...
The targetPartitionSizeInBytes is needed to avoid creating tiny partitions. For example, if the total data size is only 1GB, it should be enough to create a single partition mapping all the output partitions to a single task.
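The grouping behavior described above can be sketched as follows. This is a hypothetical, simplified version (the real HashDistributionSplitAssigner is more involved): consecutive output partitions are accumulated into one task partition until the estimated size reaches targetPartitionSizeInBytes, so small data sets collapse into a single task.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch of size-based grouping of output partitions into task partitions.
public final class PartitionGrouping
{
    private PartitionGrouping() {}

    public static List<List<Integer>> group(long[] estimatedPartitionSizes, long targetPartitionSizeInBytes)
    {
        List<List<Integer>> groups = new ArrayList<>();
        List<Integer> current = new ArrayList<>();
        long currentSize = 0;
        for (int partitionId = 0; partitionId < estimatedPartitionSizes.length; partitionId++) {
            current.add(partitionId);
            currentSize += estimatedPartitionSizes[partitionId];
            // close the group once it is big enough to justify a task
            if (currentSize >= targetPartitionSizeInBytes) {
                groups.add(current);
                current = new ArrayList<>();
                currentSize = 0;
            }
        }
        if (!current.isEmpty()) {
            groups.add(current); // trailing partitions that never reached the target
        }
        return groups;
    }
}
```

With a 1GB total and a 1GB target this yields a single group, matching the example above.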
What I see is that we output the map with partitionIds pointing to a new TaskPartition(), so I'm not sure we are achieving the purposes stated above. Am I missing something obvious?
I have added the following code to print out the returned result:

    Map<Integer, TaskPartition> resultToBePrinted = result.buildOrThrow();
    resultToBePrinted.forEach((partitionId, taskPartition) -> {
        System.out.println(partitionId);
        if (taskPartition.isIdAssigned()) {
            System.out.println(taskPartition.getId());
        }
        else {
            System.out.println("Not assigned");
        }
    });

I'm seeing:

    0
    Not assigned
    1
    Not assigned
    2
    Not assigned

So I'm not sure this code piece has any effect at all.
Oh, I see.
TaskPartition is a placeholder. Basically, what this algorithm does is group certain output partitions to be processed by certain task partitions.
For example:
- Output partitions 1, 2, 3 must be processed by a separate task
- Output partitions 4, 5 must be processed by a different task
- Output partitions 6, 7 must be processed by yet another task
However, we are trying to avoid assigning a concrete numeric id to a task at this step. The problem is that the output data size is only an estimate, and in reality not all the tasks may have data to process.
For example, when reading a bucketed table, what we know is that there are 1000 buckets. So we assign 1000 task partitions, one for each bucket. But then it is possible that data is missing for a certain bucket, which would create a confusing hole in task ids (you may end up with tasks 1.0.0, 1.1.0, 1.5.0, missing the 1.2.0, 1.3.0, 1.4.0 tasks). Assigning numeric ids lazily avoids such gaps.
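The lazy-id idea described above can be illustrated with a minimal sketch (hypothetical, not the actual TaskPartition class): the partition starts as a placeholder and only receives a numeric id later, so buckets with no data never consume an id and no gaps appear in the task id sequence.

```java
// Illustrative sketch of a TaskPartition placeholder with lazy id assignment.
public class TaskPartitionSketch
{
    private static final int UNASSIGNED = -1;
    private int id = UNASSIGNED;

    public boolean isIdAssigned()
    {
        return id != UNASSIGNED;
    }

    // Called only once the task actually has data to process,
    // typically with the next value of a sequential counter.
    public void assignId(int id)
    {
        if (isIdAssigned()) {
            throw new IllegalStateException("id already assigned");
        }
        this.id = id;
    }

    public int getId()
    {
        if (!isIdAssigned()) {
            throw new IllegalStateException("id not assigned");
        }
        return id;
    }
}
```

This also explains the "Not assigned" output in the debug snippet above: at grouping time the placeholders legitimately have no ids yet.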
Force-pushed from d2c7e1e to 4a2e852
Updated
Yeah, basically I just wanted to show where the plan can be mutated. The adaptive planner will come later.
...-main/src/main/java/io/trino/execution/scheduler/EventDrivenFaultTolerantQueryScheduler.java (outdated, resolved)
...-main/src/main/java/io/trino/execution/scheduler/EventDrivenFaultTolerantQueryScheduler.java (outdated, resolved)
...-main/src/main/java/io/trino/execution/scheduler/EventDrivenFaultTolerantQueryScheduler.java (outdated, resolved)
What's the sense of having the WAKE_UP event?
It is used as a generic event to wake up the scheduler (for example when something happened but no state modification is needed).
Then we don't need to schedule anything in this case? It doesn't feel like we need to schedule anything here.
This is currently used to notify the scheduler that a node has been acquired. There's no extra information that has to be passed to the scheduler through the event; it just lets the scheduler know that further progress can be made.
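The WAKE_UP pattern discussed above can be sketched like this. This is a hedged, simplified illustration (not the actual scheduler code): the event loop drains a queue, real events carry state changes, and a dedicated WAKE_UP sentinel carries no payload at all; its only job is to unblock the loop so it re-evaluates scheduling state (e.g. after a node was acquired).

```java
import java.util.Queue;

// Illustrative sketch of an event loop with a payload-free WAKE_UP sentinel.
public final class WakeUpSketch
{
    public interface Event {}

    public enum Sentinel implements Event
    {
        WAKE_UP
    }

    private WakeUpSketch() {}

    // Drains the queue and returns how many "real" events were processed.
    public static int drainAndCount(Queue<Event> queue)
    {
        int processed = 0;
        Event event;
        while ((event = queue.poll()) != null) {
            if (event != Sentinel.WAKE_UP) {
                processed++; // a real event would mutate scheduler state here
            }
            // WAKE_UP needs no handling: unblocking the loop is its whole purpose
        }
        return processed;
    }
}
```

In the real scheduler the loop would block on the queue and then run a scheduling pass after each drain, so even a bare WAKE_UP results in a fresh attempt to make progress.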
Move TableInfo extraction to TableInfo to make it reusable
The new scheduler allows changing the query plan dynamically during execution and supports speculative execution, as well as providing a single view into the query task queue, allowing a priority to be set for any given task
Force-pushed from 4a2e852 to 4e09b35
Updated
Description
This PR lays the foundation for future advancements in fault-tolerant execution. The proposed structure of the scheduler is aimed at making it possible to implement:
The scheduler is implemented as an event loop to minimize the need for synchronization and allow developers to reason about scheduling as a single-threaded process
Non-technical explanation
N/A
TODO
Release notes
(x) This is not user-visible or docs only and no release notes are required.
( ) Release notes are required, please propose a release note for me.
( ) Release notes are required, with the following suggested text: