Implement task level retries by arhimondr · Pull Request #9818 · trinodb/trino

arhimondr · 2021-10-29T16:18:53Z

No description provided.

linzebing · 2021-10-29T23:27:22Z

core/trino-main/src/main/java/io/trino/execution/StageState.java

I feel this state is not necessary and adds confusion --- even for UI display. If the stage has finished its tasks, then it's finished; in the case of a retry, we should either show it's RUNNING again, or we make it explicit saying that it's a retry.

Suggested change

- * Stage is finished running existing tasks but more tasks could be scheduled in the future.
+ * Stage has finished running existing tasks but more tasks could be scheduled in the future.

If the stage has finished its tasks, then it's finished; in the case of a retry, we should either show it's RUNNING again, or we make it explicit saying that it's a retry.

That's a good point. I also don't really like the "PENDING" state. The problem is that we need a terminal state: a state indicating that no more tasks can be scheduled in a given stage. It is needed to create the final stage info, which summarizes the runtime statistics for that stage. I was thinking about something like COMPLETED / FINISHED, but the semantic difference between these two words is too subtle, and I thought it might introduce even more confusion.

core/trino-main/src/main/java/io/trino/execution/TaskManager.java

linzebing · 2021-10-29T23:31:11Z

core/trino-main/src/main/java/io/trino/execution/TaskManager.java

This can be declared as void

I was trying to be consistent with the other two methods, cancelTask and abortTask

linzebing · 2021-10-29T23:41:34Z

core/trino-main/src/main/java/io/trino/execution/scheduler/FixedBucketNodeMap.java

Why do we need ImmutableSet.copyOf here? I don't see a potential race condition.

This is to find the number of unique nodes. The immutability doesn't have any special meaning here; it is just a convenient way to construct a set. Ideally it should be refactored so that the number of unique nodes is provided explicitly, as a constructor parameter, but that refactor wasn't trivial. I decided to delay it since the impact of this de-duplication is not that high (it is only called once per stage, and fixed mappings are a rare corner case).

This would be more idiomatic for that purpose:

bucketToNode.stream().distinct().count();
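For illustration only, the two ways of counting unique nodes discussed in this thread can be compared in a small sketch. The class name UniqueNodeCount is hypothetical, plain strings stand in for the node type, and a JDK HashSet stands in for ImmutableSet.copyOf so that the example has no dependencies.

```java
import java.util.HashSet;
import java.util.List;

// Hypothetical sketch: strings stand in for the node objects held by the bucket map.
class UniqueNodeCount
{
    // Copy into a set and take its size (the shape of the ImmutableSet.copyOf approach).
    static int countViaSet(List<String> bucketToNode)
    {
        return new HashSet<>(bucketToNode).size();
    }

    // Stream-based alternative from the review; note that count() returns a long.
    static long countViaStream(List<String> bucketToNode)
    {
        return bucketToNode.stream().distinct().count();
    }

    public static void main(String[] args)
    {
        List<String> bucketToNode = List.of("node-1", "node-2", "node-1", "node-3");
        System.out.println(countViaSet(bucketToNode));
        System.out.println(countViaStream(bucketToNode));
    }
}
```

Both variants rely on equals and hashCode of the element type, so they should agree whenever those are implemented consistently.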

linzebing · 2021-10-29T23:44:10Z

core/trino-main/src/main/java/io/trino/execution/scheduler/ScaledWriterScheduler.java

Given that we call scheduleTask with the last two parameters as ImmutableMultimap.of() very often, does it make sense to create an overloaded version of scheduleTask?

I was thinking about that. There's a subtle trade-off though. Overloads tend to add complexity, as now you need to think if there's any semantic difference between two overloads and what that semantic difference might be. Another trade-off is that overloads add extra overhead when trying to search through the code where the method is used. You need to check each overload separately.

core/trino-main/src/main/java/io/trino/server/TaskResource.java

.../trino-main/src/main/java/io/trino/server/testing/shuffle/LocalFileSystemShuffleService.java

core/trino-spi/src/main/java/io/trino/spi/shuffle/ShuffleService.java

core/trino-spi/src/main/java/io/trino/spi/shuffle/Shuffle.java

core/trino-spi/src/main/java/io/trino/spi/shuffle/ShuffleInput.java

.../trino-main/src/main/java/io/trino/server/testing/shuffle/LocalFileSystemShuffleService.java

linzebing · 2021-11-02T03:03:13Z

core/trino-main/src/main/java/io/trino/server/remotetask/HttpRemoteTask.java

Why does this need to be synchronized now?

This method has been made public, so it can now be called by any thread. Synchronization must therefore be enforced (as with cancel and abort).
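A minimal sketch of the reasoning above, not the actual HttpRemoteTask code: once a state-changing method becomes public, it has to take the same lock as the other public transitions. The class RemoteTaskSketch, its State enum, and the method name fail are assumptions made for this example.

```java
// Hypothetical sketch of a task whose public state transitions share one monitor.
class RemoteTaskSketch
{
    enum State { RUNNING, CANCELED, ABORTED, FAILED }

    private State state = State.RUNNING;
    private Throwable failureCause;

    public synchronized void cancel()
    {
        transition(State.CANCELED, null);
    }

    public synchronized void abort()
    {
        transition(State.ABORTED, null);
    }

    // Assumed to be the newly public method; any thread may call it,
    // so it is synchronized like cancel() and abort().
    public synchronized void fail(Throwable cause)
    {
        transition(State.FAILED, cause);
    }

    // Callers must hold the monitor; the first terminal transition wins.
    private void transition(State newState, Throwable cause)
    {
        if (state != State.RUNNING) {
            return;
        }
        state = newState;
        failureCause = cause;
    }

    public synchronized State getState()
    {
        return state;
    }

    public synchronized Throwable getFailureCause()
    {
        return failureCause;
    }
}
```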

core/trino-main/src/test/java/io/trino/operator/TestingExchangeClientBuffer.java

linzebing · 2021-11-02T21:34:28Z

core/trino-main/src/main/java/io/trino/operator/DeduplicationExchangeClientBuffer.java

+    }
+
+    @Override
+    public synchronized void noMoreTasks()


Can retry tasks still be added after noMoreTasks gets called? Or will it only be called after all tasks finish/fail?

If it's the former, then I don't see why we need this method; if it's the latter, why call checkInputFinished in taskFailed/taskFinished?

Can retry tasks still be added after noMoreTasks gets called?

No. After this method is called no new tasks can be added.

if it's the latter, why call checkInputFinished in taskFailed/taskFinished?

noMoreTasks might be called when no future retries are anticipated (when retries are disabled or when the number of attempts is exhausted), yet some tasks might still be running at that point.
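The interplay described in this thread can be sketched as follows. This is a simplified, hypothetical model rather than the DeduplicationExchangeClientBuffer implementation: it ignores output deduplication and treats a failed task the same as a finished one. The input is considered finished only when noMoreTasks has been called and no task is still running, which is why the check runs from all three callbacks.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical, simplified model of the task bookkeeping discussed above.
class TaskTrackingBufferSketch
{
    private final Set<String> runningTasks = new HashSet<>();
    private boolean noMoreTasks;
    private boolean inputFinished;

    public synchronized void addTask(String taskId)
    {
        if (noMoreTasks) {
            throw new IllegalStateException("no more tasks can be added");
        }
        runningTasks.add(taskId);
    }

    public synchronized void taskFinished(String taskId)
    {
        runningTasks.remove(taskId);
        checkInputFinished();
    }

    public synchronized void taskFailed(String taskId)
    {
        runningTasks.remove(taskId);
        checkInputFinished();
    }

    // May be called while tasks are still running, so completion is re-checked
    // here as well as in taskFinished and taskFailed.
    public synchronized void noMoreTasks()
    {
        noMoreTasks = true;
        checkInputFinished();
    }

    private void checkInputFinished()
    {
        if (noMoreTasks && runningTasks.isEmpty()) {
            inputFinished = true;
        }
    }

    public synchronized boolean isInputFinished()
    {
        return inputFinished;
    }
}
```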

linzebing · 2021-11-02T22:40:16Z

core/trino-main/src/main/java/io/trino/execution/scheduler/BatchTaskScheduler.java

+            Multimap<PlanNodeId, Split> tableScanSplits = batchTask.getSplits();
+            Multimap<PlanNodeId, Split> remoteSplits = createRemoteSplits(batchTask.getShuffleInputs());
+
+            Multimap<PlanNodeId, Split> taskSplits = ImmutableListMultimap.<PlanNodeId, Split>builder()


For my understanding, one of tableScanSplits and remoteSplits will be empty because a batch task either reads from table scans or shuffle files.

In the case of a bucketed or colocated join, it can read both: a table (hence the splits) and a remote exchange (if one of the tables is not bucketed and must be repartitioned).
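As a rough sketch of why both inputs are merged into one collection, the following example combines table scan splits and remote splits keyed by plan node. It is not the BatchTaskScheduler code: the class TaskSplitsSketch is hypothetical, strings stand in for PlanNodeId and Split, and a JDK map of lists stands in for the Guava multimap used in the diff.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: plan node ids and splits are modeled as strings.
class TaskSplitsSketch
{
    static Map<String, List<String>> combine(
            Map<String, List<String>> tableScanSplits,
            Map<String, List<String>> remoteSplits)
    {
        Map<String, List<String>> taskSplits = new LinkedHashMap<>();
        addAll(taskSplits, tableScanSplits);
        addAll(taskSplits, remoteSplits);
        return taskSplits;
    }

    // Appends every split of the source to the list kept for its plan node.
    private static void addAll(Map<String, List<String>> target, Map<String, List<String>> source)
    {
        source.forEach((planNodeId, splits) ->
                target.computeIfAbsent(planNodeId, key -> new ArrayList<>()).addAll(splits));
    }
}
```

For a bucketed join, one entry would come from the table scan and another from the remote exchange, and either map may be empty for other task shapes.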

core/trino-main/src/main/java/io/trino/execution/scheduler/SqlQueryScheduler.java

linzebing · 2021-11-03T17:39:18Z

core/trino-main/src/main/java/io/trino/execution/scheduler/SqlQueryScheduler.java

+                StageId stageId = stageExecution.getStageId();
+                allStages.put(stageId, stageExecution);
+                if (fragment.getPartitioning().isCoordinatorOnly()) {
+                    coordinatorStagesInTopologicalOrder.add(stageExecution);


Actually, this is reverse topological order. Not sure if we can find better naming here to avoid confusion.

Yeah, it's a little unintuitive. But the plan is structured top-down: the root node has references to its children. Since we are following the reference direction, it is still a topological order (even though the plan leaves come last).
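To make the ordering argument concrete, here is a small sketch under the assumption that fragments form a tree whose edges point from parent to child. The class StageOrderSketch and its Fragment type are hypothetical. Visiting a node before its children yields an order in which every node precedes everything it references, which is a topological order with respect to those edges, with the leaves last.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of a top-down traversal over a tree of plan fragments.
class StageOrderSketch
{
    static final class Fragment
    {
        final String name;
        final List<Fragment> children;

        Fragment(String name, List<Fragment> children)
        {
            this.name = name;
            this.children = children;
        }
    }

    // Returns fragment names so that each parent appears before its children.
    static List<String> topDownOrder(Fragment root)
    {
        List<String> order = new ArrayList<>();
        visit(root, order);
        return order;
    }

    private static void visit(Fragment fragment, List<String> order)
    {
        order.add(fragment.name);
        for (Fragment child : fragment.children) {
            visit(child, order);
        }
    }
}
```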

core/trino-main/src/main/java/io/trino/execution/scheduler/BatchTaskScheduler.java

linzebing · 2021-11-03T19:22:38Z

core/trino-main/src/main/java/io/trino/execution/scheduler/BatchTaskScheduler.java

+                            log.warn(failureInfo.toException(), "Task failed: %s", taskId);
+                            ErrorCode errorCode = failureInfo.getErrorCode();
+                            if (remainingRetryAttempts > 0 && (errorCode == null || errorCode.getType() != USER_ERROR)) {
+                                remainingRetryAttempts--;


So remainingRetryAttempts is on a stage-basis (instead of task-basis)

Currently it is stage-based, but I'm not sure that's the right strategy. We should discuss what the right strategy is.
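The condition in the diff above can be read as a per-stage retry budget. The sketch below is an assumption-laden illustration, not the scheduler code: the class StageRetryBudgetSketch is hypothetical, and its local ErrorType enum only stands in for the engine's error classification. Every task failure in the stage draws from the same counter, and user errors are never retried.

```java
// Hypothetical sketch of a retry budget shared by all tasks of one stage.
class StageRetryBudgetSketch
{
    // Local stand-in for the engine's error classification.
    enum ErrorType { USER_ERROR, INTERNAL_ERROR, INSUFFICIENT_RESOURCES, EXTERNAL }

    private int remainingRetryAttempts;

    StageRetryBudgetSketch(int maxRetryAttempts)
    {
        this.remainingRetryAttempts = maxRetryAttempts;
    }

    // errorType may be null when the failure carries no error code;
    // as in the diff, such failures are treated as retryable.
    synchronized boolean tryRetry(ErrorType errorType)
    {
        if (remainingRetryAttempts > 0 && errorType != ErrorType.USER_ERROR) {
            remainingRetryAttempts--;
            return true;
        }
        return false;
    }

    synchronized int getRemainingRetryAttempts()
    {
        return remainingRetryAttempts;
    }
}
```

A per-task strategy would keep one such counter per task (or per partition) instead of one per stage, which is the alternative raised in the question.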

core/trino-main/src/main/java/io/trino/execution/scheduler/StreamingStageExecution.java

core/trino-main/src/main/java/io/trino/execution/scheduler/ResultsConsumer.java

core/trino-main/src/main/java/io/trino/execution/scheduler/StreamingStageExecution.java

core/trino-main/src/main/java/io/trino/execution/scheduler/SqlQueryScheduler.java

core/trino-spi/src/main/java/io/trino/spi/exchange/Exchange.java

core/trino-main/src/main/java/io/trino/execution/scheduler/BatchTask.java

core/trino-main/src/main/java/io/trino/server/testing/exchange/LocalFileSystemExchangeSink.java

...trino-main/src/main/java/io/trino/server/testing/exchange/LocalFileSystemExchangeSource.java

core/trino-spi/src/main/java/io/trino/spi/exchange/ExchangeSourceSplitter.java

core/trino-main/src/main/java/io/trino/execution/scheduler/StageTaskSourceFactory.java

core/trino-main/src/main/java/io/trino/exchange/ExchangeManagerRegistry.java

core/trino-main/src/main/java/io/trino/metadata/HandleJsonModule.java

core/trino-main/src/main/java/io/trino/exchange/ExchangeManagerRegistry.java

martint · 2022-01-14T18:17:36Z

core/trino-main/src/main/java/io/trino/exchange/ExchangeManagerRegistry.java

Currently it is inconsistent across the code base. But there are 260 matches when I search for Logger log and only 33 when I search for Logger LOG, so it looks like Logger log is preferred.

Yes, I'm aware of the inconsistency. But it should be LOG per convention since it's a static final variable. We should not contribute to the unconventional usage.

Yeah, the Oracle code conventions say: "The names of variables declared class constants and of ANSI constants should be all uppercase with words separated by underscores ("_")." See https://www.oracle.com/java/technologies/javase/codeconventions-namingconventions.html

However whether a Logger can be considered a constant is controversial. Generally constants are expected to be inherently immutable objects. Declaring something as private static final does not make something a constant. The declaration may also be used for static fields assigned once and accessible only from within a class.

For example, the Google Java Style Guide says:

Constant names use CONSTANT_CASE: all uppercase letters, with each word separated from the next by a single underscore. But what is a constant, exactly?
Constants are static final fields whose contents are deeply immutable and whose methods have no detectable side effects. This includes primitives, Strings, immutable types, and immutable collections of immutable types. If any of the instance's observable state can change, it is not a constant. Merely intending to never mutate the object is not enough.

They also explicitly list examples that shouldn't be considered constants, and a logger declaration is one of them:

static final Logger logger = Logger.getLogger(MyClass.getName());

https://google.github.io/styleguide/javaguide.html#s5.2.4-constant-names

I don't have a particularly strong opinion here. Either should do as long as it is used consistently. In the current state, Logger log is more consistent with the rest of the code base.

If you feel strongly about it, I can do the rename. Regardless, we should probably follow up with a PR that renames all the other places to make it consistent and adds a style check rule.

martint · 2022-01-14T18:29:19Z

core/trino-spi/src/main/java/io/trino/spi/exchange/Exchange.java

How does "the implementation" know if an attempt succeeded or failed?

core/trino-spi/src/main/java/io/trino/spi/exchange/ExchangeSink.java

core/trino-spi/src/main/java/io/trino/spi/exchange/ExchangeSource.java

martint · 2022-01-14T18:44:14Z

core/trino-spi/src/main/java/io/trino/spi/exchange/ExchangeSourceHandle.java

What's the purpose of these? Every object implements these methods, so they don't impose any requirement or constraint on implementations.

Unfortunately, it is impossible to enforce the presence of these methods at compile time. The idea behind declaring them explicitly in this interface is to make them harder to miss for anyone implementing it. Though yeah, it doesn't provide any strong guarantees.
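The point about missing compile-time enforcement can be shown with a short sketch. All type names here are hypothetical and unrelated to the real ExchangeSourceHandle. An interface may redeclare equals and hashCode, but a class that forgets to override them still compiles, because it inherits the identity-based versions from Object.

```java
import java.util.Objects;

// Hypothetical interface that redeclares equals and hashCode as a reminder to implementers.
interface SourceHandleSketch
{
    boolean equals(Object other);

    int hashCode();
}

// Implements value-based equality, as the redeclaration intends.
final class FileSourceHandleSketch implements SourceHandleSketch
{
    private final String path;

    FileSourceHandleSketch(String path)
    {
        this.path = Objects.requireNonNull(path, "path is null");
    }

    @Override
    public boolean equals(Object other)
    {
        return other instanceof FileSourceHandleSketch
                && path.equals(((FileSourceHandleSketch) other).path);
    }

    @Override
    public int hashCode()
    {
        return path.hashCode();
    }
}

// Still compiles: the methods inherited from Object satisfy the interface.
final class ForgetfulHandleSketch implements SourceHandleSketch
{
}
```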

core/trino-spi/src/main/java/io/trino/spi/exchange/ExchangeSourceSplitter.java

core/trino-main/src/main/java/io/trino/server/testing/exchange/LocalFileSystemExchange.java

martint · 2022-01-14T19:53:40Z

core/trino-main/src/main/java/io/trino/operator/ExchangeOperator.java

This could be called just exchangeClientSupplier.

While we don't have anything named "client" for external exchange, IMO it still improves readability, as it emphasizes that this supplies a client specifically for direct exchange.

core/trino-main/src/main/java/io/trino/operator/ExchangeOperator.java

martint · 2022-01-14T21:11:32Z

core/trino-main/src/main/java/io/trino/exchange/ExchangeManagerRegistry.java

Yes, I'm aware of the inconsistency. But it should be LOG per convention since it's a static final variable. We should not contribute to the unconventional usage.

core/trino-main/src/main/java/io/trino/split/RemoteSplit.java

core/trino-main/src/main/java/io/trino/execution/buffer/ExternalExchangeOutputBuffer.java

core/trino-main/src/main/java/io/trino/execution/buffer/LazyOutputBuffer.java

core/trino-main/src/main/java/io/trino/execution/buffer/ExternalExchangeOutputBuffer.java

core/trino-main/src/main/java/io/trino/SystemSessionProperties.java

core/trino-main/src/main/java/io/trino/execution/scheduler/FaultTolerantStageScheduler.java

core/trino-main/src/main/java/io/trino/execution/ExecutionFailureInfo.java

core/trino-main/src/main/java/io/trino/split/RemoteSplit.java

martint · 2022-01-20T01:07:51Z

core/trino-main/src/main/java/io/trino/execution/scheduler/FixedBucketNodeMap.java

This would be more idiomatic for that purpose:

bucketToNode.stream().distinct().count();

core/trino-main/src/main/java/io/trino/execution/scheduler/FaultTolerantStageScheduler.java

martint · 2022-01-20T01:21:25Z

core/trino-main/src/main/java/io/trino/execution/scheduler/FaultTolerantStageScheduler.java

+            cancelRunningTasks(abort);
+            cancelBlockedFuture();
+            releaseAcquiredNode();
+            closeTaskSource();
+            closeSinkExchange();


Only if a failure in closing would affect the query results (e.g., incomplete results). Otherwise, we should just log an error and ignore it.
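A hedged sketch of the "log and ignore" option from the comment above. The class BestEffortCloseSketch is hypothetical and uses System.err in place of a real logger. Each resource is closed independently, so one failing close does not prevent the remaining cleanup steps from running.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of best-effort cleanup: failures are reported, not propagated.
class BestEffortCloseSketch
{
    // Closes every resource and returns the failures that were logged and ignored.
    static List<Throwable> closeAll(List<AutoCloseable> resources)
    {
        List<Throwable> ignored = new ArrayList<>();
        for (AutoCloseable resource : resources) {
            try {
                resource.close();
            }
            catch (Exception e) {
                // Stand-in for a logger call; the failure does not stop the loop.
                System.err.println("Error closing resource: " + e);
                ignored.add(e);
            }
        }
        return ignored;
    }
}
```

If a close failure could leave the results incomplete, the caller would instead need to fail the query with the collected exception rather than ignore it.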

core/trino-main/src/main/java/io/trino/execution/scheduler/FaultTolerantStageScheduler.java

core/trino-main/src/main/java/io/trino/execution/scheduler/FixedCountNodeAllocator.java

core/trino-main/src/main/java/io/trino/execution/scheduler/NodeRequirements.java

core/trino-main/src/main/java/io/trino/execution/scheduler/SqlQueryScheduler.java

Streaming upload to S3 allocates a 16MB buffer (by default) for each output stream. Failure recovery tests create a table partitioned into ~60 partitions. Since at least one file must be created for each partition, the engine has to allocate ~1GB of buffer space (60 x 16MB). These buffer allocations push the memory reservation beyond the maximum heap size.

cla-bot bot added the cla-signed label Oct 29, 2021

arhimondr force-pushed the implement-task-level-retries branch from eb068b3 to 5ae17d5 Compare October 29, 2021 19:16

arhimondr requested a review from linzebing October 29, 2021 19:23

arhimondr force-pushed the implement-task-level-retries branch from 5ae17d5 to 10eaae8 Compare November 1, 2021 21:31

linzebing reviewed Nov 1, 2021

arhimondr force-pushed the implement-task-level-retries branch 3 times, most recently from 5cc34ca to ea58f80 Compare November 2, 2021 18:18

linzebing reviewed Nov 3, 2021

arhimondr force-pushed the implement-task-level-retries branch from ea58f80 to d2490c5 Compare November 4, 2021 00:18

arhimondr mentioned this pull request Nov 4, 2021

Fix various issues discovered while working on failure recovery #9861

Merged

martint reviewed Nov 4, 2021

core/trino-main/src/main/java/io/trino/execution/scheduler/StreamingStageExecution.java

martint reviewed Nov 4, 2021

core/trino-main/src/main/java/io/trino/execution/scheduler/ResultsConsumer.java

martint reviewed Nov 4, 2021

core/trino-main/src/main/java/io/trino/execution/scheduler/StreamingStageExecution.java

martint reviewed Nov 4, 2021

arhimondr force-pushed the implement-task-level-retries branch 2 times, most recently from f716aff to 01b40d3 Compare November 4, 2021 07:08

linzebing reviewed Nov 4, 2021

core/trino-main/src/main/java/io/trino/server/testing/exchange/LocalFileSystemExchangeSink.java

arhimondr force-pushed the implement-task-level-retries branch 2 times, most recently from 1016d56 to 2a9c5fa Compare November 5, 2021 01:29

arhimondr mentioned this pull request Nov 5, 2021

Implement full query retries #9361

Merged

arhimondr force-pushed the implement-task-level-retries branch from 2a9c5fa to 149fafb Compare November 5, 2021 18:30

linzebing reviewed Nov 9, 2021

...trino-main/src/main/java/io/trino/server/testing/exchange/LocalFileSystemExchangeSource.java

arhimondr mentioned this pull request Nov 10, 2021

Support Failure Recovery #9101

Closed

31 tasks

arhimondr force-pushed the implement-task-level-retries branch from 149fafb to 390c682 Compare November 15, 2021 17:10

linzebing reviewed Nov 17, 2021

core/trino-spi/src/main/java/io/trino/spi/exchange/ExchangeSourceSplitter.java

linzebing reviewed Nov 17, 2021

core/trino-main/src/main/java/io/trino/execution/scheduler/StageTaskSourceFactory.java

losipiuk reviewed Nov 17, 2021

core/trino-main/src/main/java/io/trino/exchange/ExchangeManagerRegistry.java

losipiuk reviewed Nov 17, 2021

core/trino-main/src/main/java/io/trino/metadata/HandleJsonModule.java

losipiuk reviewed Nov 17, 2021

core/trino-main/src/main/java/io/trino/exchange/ExchangeManagerRegistry.java

arhimondr force-pushed the implement-task-level-retries branch from 54fb567 to 6d4a9b3 Compare January 13, 2022 22:29

github-actions bot added the tests:hive label Jan 13, 2022

arhimondr force-pushed the implement-task-level-retries branch from 6d4a9b3 to 0cf6b73 Compare January 14, 2022 18:34

martint reviewed Jan 14, 2022

arhimondr force-pushed the implement-task-level-retries branch 3 times, most recently from 3b607d7 to 15ce61f Compare January 19, 2022 18:43

martint reviewed Jan 19, 2022

martint reviewed Jan 20, 2022

core/trino-main/src/main/java/io/trino/execution/ExecutionFailureInfo.java

core/trino-main/src/main/java/io/trino/split/RemoteSplit.java

martint reviewed Jan 20, 2022

losipiuk mentioned this pull request Jan 20, 2022

Support async commit for ExchangeSink #10699

Merged

martint reviewed Jan 20, 2022

core/trino-main/src/main/java/io/trino/execution/scheduler/SqlQueryScheduler.java

arhimondr added 12 commits January 20, 2022 15:17

Rename ExchangeClient into DirectExchangeClient

a1c8cea

Add spooling exchange interface

542aaee

Add reference implementation of spooling exchange plugin

2dad449

Integrate spooling exchange with exchange operator

c5e5cf4

Integrate spooling exchange with output buffers

9f6eff2

Support task level retries in DeduplicationDirectExchangeBuffer

d7b9a3f

Rename TaskSource to SplitAssignment

4149c9d

Implement task level failure recovery

b51399c

Add integration tests for fault tolerant execution

357f9de

Simplify assertions in TestDirectExchangeClient

9672c54

Change table name for testTargetMaxFileSizePartitioned

991145a

To avoid a clash when both testTargetMaxFileSizePartitioned and testTargetMaxFileSize are executed concurrently

arhimondr force-pushed the implement-task-level-retries branch from 15ce61f to 991145a Compare January 20, 2022 21:46

martint merged commit f81af8f into trinodb:master Jan 21, 2022

github-actions bot added this to the 369 milestone Jan 21, 2022

This was referenced Jan 21, 2022

Add Trino 369 release notes #10553

Merged

Release notes for 369 #10552

Closed
