Conversation

@rschlussel rschlussel commented Nov 18, 2019

Addresses #13438

== RELEASE NOTES ==

General Changes
* Add support for retrying failed stages from a materialized point instead of failing the entire query. The number of retries allowed can be configured using the configuration property max-stage-retries and the session property max_stage_retries. The default value is zero. To take advantage of this feature, exchange_materialization_strategy must be set to 'ALL'.

* Add configuration property use-legacy-scheduler and session property use_legacy_scheduler to use a version of the query scheduler from before the refactorings that enable full stage retries. The default value is false. This is a temporary property to provide an easy way to roll back in case of bugs in the new scheduler, and it will be removed in a couple of releases once we have confidence in the new scheduler's stability.
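
For illustration, a minimal sketch of setting these in a test session, assuming Presto's testSessionBuilder() helper; the property names come from the notes above and the values are arbitrary:

    import com.facebook.presto.Session;

    import static com.facebook.presto.testing.TestingSession.testSessionBuilder;

    // Values are illustrative; max_stage_retries defaults to 0 (no retries).
    Session session = testSessionBuilder()
            .setSystemProperty("max_stage_retries", "2")
            .setSystemProperty("exchange_materialization_strategy", "ALL")
            .setSystemProperty("use_legacy_scheduler", "false")
            .build();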

@wenleix (Contributor) left a comment

Skimmed over first 4 commits. (Up to "Enable failure detector in TestingPrestoServer")

Contributor:

wow :). Is this an often-used pattern in execution? (Looks like it was built for the optimizer~)

Contributor:

This comment has been there for 6 years 😮

@rschlussel

Some TODOs from an in-person conversation with @arhimondr (in addition to #13730):

  1. Get rid of the concept of a tentative failure
  2. Try to avoid synchronizing
  3. Use a multimap for stageExecutions to keep track of all stage attempts
  4. Make sure nothing fails when we close the schedulers from failed tasks (due to races in the scheduler loop still calling them)

@wenleix (Contributor) left a comment

"Support retries of streaming sections" .

Skimmed. I need to take a more detailed look when I am clear-headed. But in general looks neat 😄 .

One question is about the many fields added to SqlQueryScheduler. Their purpose is really to create SqlStageExecution. I am wondering if it makes sense to abstract out something like a SqlStageExecutionFactory to take responsibility for SqlStageExecution creation. This would make a clear separation between:

  • Stage schedule (done by SqlQueryScheduler)
  • Stage execution creation (done by SqlStageExecutionFactory, which is a field in SqlQueryScheduler)

Before this PR, we didn't separate stage scheduling from stage execution creation. That was kind of OK because a stage execution was only created once, in the constructor of SqlQueryScheduler. But now that SqlStageExecution will be re-created for retries, it seems to make sense to abstract out a SqlStageExecutionFactory. A rough sketch of the idea follows.
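
For what it's worth, a hypothetical sketch of that separation; the factory class and its wiring here are illustrative, not the actual Presto code:

    import static java.util.Objects.requireNonNull;

    // SqlQueryScheduler decides *when* to (re)create a stage execution;
    // the factory owns the dependencies and knows *how* to create one.
    public class SqlStageExecutionFactory
    {
        private final RemoteTaskFactory remoteTaskFactory; // example dependency moved off SqlQueryScheduler

        public SqlStageExecutionFactory(RemoteTaskFactory remoteTaskFactory)
        {
            this.remoteTaskFactory = requireNonNull(remoteTaskFactory, "remoteTaskFactory is null");
        }

        public SqlStageExecution createStageExecution(StageId stageId, int attemptNumber)
        {
            // creation logic that previously lived in SqlQueryScheduler's
            // constructor moves here, so a retry can simply call this again
            throw new UnsupportedOperationException("sketch only");
        }
    }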

@rschlussel

Thanks @wenleix for the review! I like the idea of a StageExecutionFactory. I wouldn't worry about reviewing this too closely now (though major comments are appreciated), since some details have changed while I've been working on @arhimondr's comments (I haven't updated the PR because I haven't finished yet). Could you review #13730 instead, since I'm building on top of it now? That PR, as requested by Andrii, delays all of the stage creation until it's ready to execute, instead of just the scheduler creation as I do here.

@linux-foundation-easycla

linux-foundation-easycla bot commented Nov 22, 2019

CLA Check
The committers are authorized under a signed CLA.

@rschlussel force-pushed the stage-retries branch 2 times, most recently from 8a89604 to ad5815d on November 22, 2019 at 01:06
@rschlussel

TODO:

  • Make the test not flaky, and turn off bucketed retry for the stage retry test
  • Fix the UI

@rschlussel force-pushed the stage-retries branch 11 times, most recently from dc3d63b to 403ad13 on November 25, 2019 at 23:24
@rschlussel

I've extracted some things out of "Add support for retrying streaming sections" into their own commits. It's still a complex commit, but hopefully easier to review.

@shixuan-fan (Contributor) left a comment

First commit to "Add configuration property max-stage-retries" LGTM

@shixuan-fan (Contributor) left a comment

"Remove unused code from TestStageExecutionStateMachine" 👍
"Enable failure detector in tests" 👍

@shixuan-fan (Contributor) left a comment

"Extract SectionStageExecutionFactory from SqlQueryScheduler" LGTM. Didn't verify every line for code moves but it looks good when skimming.

Contributor:

Totally unrelated to this PR. Since createSqlQueryScheduler() is a factory method, and we already passed stateMachine in, this seems kinda redundant.

@shixuan-fan (Contributor) left a comment

"Remove id from StageExecutionInfo"

@shixuan-fan (Contributor) left a comment

"Introduce method for empty StageExecutionInfo" and "Reorder fields and methods in SqlQueryScheduler" LGTM

@shixuan-fan (Contributor) left a comment

"Add support for retrying streaming sections" I think it looks correct, but maybe I should pair review with @wenleix to make sure :D

Contributor:

Maybe we should have an enum to indicate whether we are creating a query runner for bucket recoverability or stage recoverability. It is not obvious that when materialized is true, we actually turn off bucket recoverability.

Contributor:

abort() returning false to mean "already aborted" does not seem intuitive; my intuition is that returning false means the abort call failed (I should have commented on the commit when it was introduced). Maybe we should have an isAborted() in SectionExecution?

Contributor Author:

We want the isAborted check and the abort() call to be atomic so that we only increment the retries once. This is the same as StateMachine's transitionToAborted, which returns false if the state was already a done state.
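
In other words, a minimal sketch of the pattern (illustrative, not the actual code):

    // abort() reports whether *this* call performed the transition, so the
    // done-check and the abort are one atomic step under the lock.
    public synchronized boolean abort()
    {
        if (state.isDone()) {
            return false; // already aborted/finished by someone else
        }
        state = StageExecutionState.ABORTED;
        return true;
    }

    // caller: only the call that won the race increments the retry count
    if (sectionExecution.abort()) {
        retryCount.incrementAndGet();
    }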

rschlussel and others added 7 commits January 29, 2020 10:22
Disabled for raptor tests because they currently fail with the failure
detector enabled, and for jmx tests because they rely on having a
consistent number of nodes.

Make the failure detector decay exponential, so we can configure it
to make the recoverable exocution tests more stable.

It's not used, and we don't want to create it when we create an empty
executionInfo when there haven't been any attempts yet.

When we lazily create StageExecutions, we'll need to generate empty
stageExecutionInfos for stages that don't have any executions yet.
@rschlussel

Commit "Add exponential decay config to failure detector" LGTM with a high level question.

Why configure it to be exponential decay make recoverable execution tests more stable? :) -- and is exponential decay reasonable used as production config?

Also there is a typo: "recoverable exocution " ;)

It definitely makes the test more stable. @arhimondr can you answer the question?

@rschlussel

The test was accidentally running with the legacy scheduler; this should be fixed now.

@wenleix wenleix requested a review from arhimondr January 29, 2020 18:27
@tdcmeehan (Contributor) left a comment

All previous commits LGTM ✅

"SELECT orderkey, day(shipdate) FROM lineitem WHERE orderkey % 31 <> 0 UNION ALL " +
"SELECT orderkey, day(commitdate) FROM lineitem WHERE orderkey % 31 <> 0 UNION ALL " +
"SELECT orderkey, day(receiptdate) FROM lineitem WHERE orderkey % 31 <> 0");
"SELECT orderkey, day(commitdate) FROM lineitem WHERE orderkey % 31 <> 0 UNION ALL " +
Contributor:

Super tiny nit, we can remove these formatting-only changes

@rschlussel

"Make FixedSourcePartitionedScheduler more thread safe": LGTM.

Why this issue never happens before? -- Does it only become an issue when there is retry?

yes. because with retries we close schedulers of sections that we abort due to failure, so some of the schedulers we close might still be running. Previously we relied on the fact that scheduling would also end whenever there was a failure and everything would get closed then.

@rschlussel

Quoting an earlier review question:

    For commit "Add hacky query monitor to find bugs in scheduler", do we intend to merge it into the codebase? :)

What do reviewers think? Basically the query monitor says that if no sections are in the running state for more than a minute, we assume something has gone wrong with the scheduler and fail the query. If we think it's okay for production, I'll remove the word "hacky" from the commit; otherwise I'll remove the commit altogether. I didn't see any failures due to this when running in the verifier, but I'm not sure if we'll get false positives if there's a full GC or something.
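
For reference, a hedged sketch of what such a monitor could look like; the names and wiring here are illustrative, not the actual commit:

    import java.util.concurrent.Executors;
    import java.util.concurrent.ScheduledExecutorService;
    import java.util.concurrent.TimeUnit;
    import java.util.concurrent.atomic.AtomicLong;

    // updated whenever any section is observed in the RUNNING state
    AtomicLong lastProgressNanos = new AtomicLong(System.nanoTime());

    ScheduledExecutorService monitor = Executors.newSingleThreadScheduledExecutor();
    monitor.scheduleWithFixedDelay(() -> {
        long idleNanos = System.nanoTime() - lastProgressNanos.get();
        if (idleNanos > TimeUnit.MINUTES.toNanos(1)) {
            // assume the scheduler is stuck and fail the query
            queryStateMachine.transitionToFailed(
                    new IllegalStateException("no sections have been running for over a minute"));
        }
    }, 10, 10, TimeUnit.SECONDS);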

@tdcmeehan (Contributor) left a comment

Make FixedSourcePartitionedScheduler more thread safe

Contributor:

Can't we still use the List interface?

Contributor:

I don't think you need volatile here, but you should annotate it with @GuardedBy("this")
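
That is (the field name is illustrative; the annotation is javax.annotation.concurrent.GuardedBy):

    @GuardedBy("this")
    private boolean scheduling; // read and written only inside synchronized (this) blocks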

Contributor:

Hmm... could we do something like just not call .clear() on the list instead? (Why do we clear() it?)

@tdcmeehan (Contributor) commented Jan 30, 2020

If the intent is to short-circuit the loop where we break if cancelled, isn't that the point of the cancelled flag? And it won't work, because the iterator is merrily iterating over the old, un-cleared version of the list. I missed the .remove() below; it makes sense now.

Contributor:

Should this be called closed?

@tdcmeehan (Contributor) commented Jan 30, 2020

I wonder if we can simplify everything here by just making cancelled (closed) an AtomicBoolean (to prevent double close), removing the clear() method invocation, and removing the synchronized blocks.
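
A sketch of that simplification, assuming the scheduling loop checks the flag itself (names illustrative, not the actual code):

    import java.util.concurrent.atomic.AtomicBoolean;

    private final AtomicBoolean closed = new AtomicBoolean();

    @Override
    public void close()
    {
        if (!closed.compareAndSet(false, true)) {
            return; // compareAndSet makes close() idempotent, preventing double close
        }
        for (SourceScheduler scheduler : sourceSchedulers) { // sourceSchedulers: the existing list of child schedulers
            scheduler.close();
        }
        // the list is deliberately not clear()ed; the scheduling loop checks
        // `closed` and exits instead of iterating a concurrently mutated list
    }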

Contributor Author:

Good question. I'm actually not sure why it was being cleared in the first place. My only thought is so that the schedulers get garbage collected. Do you think that if we don't clear the list, the objects would hang around too long?

Also, we do need to exit the scheduler loop after closing because otherwise we hit "HiveSplitSource is already closed" errors.

Contributor:

I got confused by the indentation in GH; makes sense. I thought the synchronized block didn't include the remove.


@wenleix (Contributor) commented Feb 6, 2020

Speaking of the query monitor, as pointed out by @rschlussel :

    What do reviewers think? Basically the query monitor says that if no sections are in the running state for more than a minute, we assume something has gone wrong with the scheduler and fail the query. If we think it's okay for production, I'll remove the word "hacky" from the commit; otherwise I'll remove the commit altogether. I didn't see any failures due to this when running in the verifier, but I'm not sure if we'll get false positives if there's a full GC or something.

It sounds like it has some value. So maybe we can keep it behind (another) config property? -- and maybe we can enable it by default in the verifier to help catch potential issues, and eventually enable it in prod?

What do you think? @arhimondr , @tdcmeehan , @shixuan-fan ?

rschlussel and others added 6 commits February 6, 2020 17:53
Fix 2 issues related to closing the scheduler during scheduling:

1) ConcurrentModificationException from the splitSources iterator
2) "HiveSplitSource is already closed" error from trying to schedule a
HiveSplitSource that's been closed

Organize methods by the order they are used in, and group fields thematically.

Add retriedCpuTime to track the CPU time spent on stages that eventually
fail and get retried. This CPU time isn't tracked by the regular cpuTime.

Additionally, include failed tasks in the retriedCpuTime even though that
time is also tracked by the total CPU time.

Co-authored-by: Shixuan Fan <[email protected]>
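
For context on the first fix in the "Fix 2 issues" commit above, a sketch of the kind of race it guards against (illustrative, not the actual Presto code): mutating a plain list from one thread while another thread iterates it throws ConcurrentModificationException; doing the iteration, the Iterator.remove(), and any other mutation under one lock avoids that.

    import java.util.ArrayList;
    import java.util.Iterator;
    import java.util.List;

    private final List<SplitSource> splitSources = new ArrayList<>();

    private void pruneFinishedSources()
    {
        synchronized (this) {
            Iterator<SplitSource> iterator = splitSources.iterator();
            while (iterator.hasNext()) {
                if (iterator.next().isFinished()) {
                    // remove via the iterator, under the same lock as other
                    // mutations, so no ConcurrentModificationException occurs
                    iterator.remove();
                }
            }
        }
    }
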
@arhimondr (Member) left a comment

LGTM % nits

Member:

It feels like this class wasn't designed to be multi-threaded, and it doesn't have to be.

Instead of closing the scheduler from a different thread in SectionExecution#abort, only the SqlStageExecution has to be transitioned to FAILED there.

Then the same thread that does the scheduling (in SqlQueryScheduler#schedule) can check whether the stage is in a done state (e.g. FAILED) and close the scheduler if so.
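
A sketch of that single-threaded pattern (method names illustrative):

    // in SectionExecution#abort: only transition state; no cross-thread close
    public void abort()
    {
        stageExecution.abort(); // moves the stage to a done state (e.g. ABORTED/FAILED)
    }

    // in the scheduling loop (SqlQueryScheduler#schedule), the same thread
    // that schedules also closes:
    if (stageExecution.getState().isDone()) {
        stageScheduler.close();
        continue;
    }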

worker2.stopResponding();

-assertEquals(result.get(60, SECONDS).getUpdateCount(), OptionalLong.of(expectedUpdateCount));
+assertEquals(result.get(1000, SECONDS).getUpdateCount(), OptionalLong.of(expectedUpdateCount));
Member:

Question: that's ~15 minutes. Why do we need such a large timeout? I remember running these tests, and they were finishing within ~20 seconds. Has that changed?

Contributor Author:

Hasn't changed. The 15 minutes there is unintentional. I can fix it in a follow-up.

@rschlussel

I removed the commit that checks for scheduler bugs and will submit it as a separate PR. @arhimondr, I'll address your comments in a follow-up pull request, as it requires more testing to ensure that changing when we close the scheduler won't introduce other problems.

@rschlussel rschlussel merged commit 4c2010e into prestodb:master Feb 7, 2020
@caithagoras mentioned this pull request Feb 20, 2020