Conversation

@rschlussel rschlussel commented Nov 18, 2019

Addresses #13438

== RELEASE NOTES ==

General Changes
* Add support for retrying failed stages from a materialized point instead of failing the entire query. The number of retries allowed can be configured using the configuration property max-stage-retries and the session property max_stage_retries. The default value is zero. To take advantage of this feature, exchange_materialization_strategy must be set to 'ALL'.

* Add configuration property use-legacy-scheduler and session property use_legacy_scheduler to use a version of the query scheduler from before the refactorings that enable full stage retries. The default value is false. This is a temporary property to provide an easy way to roll back in case of bugs in the new scheduler, and it will be removed in a couple of releases once we have confidence in the new scheduler's stability.
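
For illustration, a minimal sketch of setting these in a test session, assuming Presto's testSessionBuilder() helper; the property names come from the notes above and the values are arbitrary:

    import com.facebook.presto.Session;

    import static com.facebook.presto.testing.TestingSession.testSessionBuilder;

    // Values are illustrative; max_stage_retries defaults to 0 (no retries).
    Session session = testSessionBuilder()
            .setSystemProperty("max_stage_retries", "2")
            .setSystemProperty("exchange_materialization_strategy", "ALL")
            .setSystemProperty("use_legacy_scheduler", "false")
            .build();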

@wenleix (Contributor) left a comment

Skimmed over first 4 commits. (Up to "Enable failure detector in TestingPrestoServer")

Contributor:

wow :). Is this an often-used pattern in execution? (Looks like it was built for the optimizer~)

Contributor:

This comment has been there for 6 years 😮

@rschlussel

Some TODOs from an in-person conversation with @arhimondr (in addition to #13730):

  1. Get rid of the concept of a tentative failure
  2. Try to avoid synchronizing
  3. Use a multimap for stageExecutions to keep track of all stage attempts
  4. Make sure nothing fails when we close the schedulers from failed tasks (due to races in the scheduler loop still calling them)

@wenleix (Contributor) left a comment

"Support retries of streaming sections" .

Skimmed. I need to take a more detailed look when I am clear-headed. But in general looks neat 😄 .

One question is about the many fields added to SqlQueryScheduler. Their purpose is really to create SqlStageExecution. I am wondering if it makes sense to abstract out something like a SqlStageExecutionFactory to take responsibility for SqlStageExecution creation. This would make a clear separation between:

  • Stage schedule (done by SqlQueryScheduler)
  • Stage execution creation (done by SqlStageExecutionFactory, which is a field in SqlQueryScheduler)

Before this PR, we didn't separate stage scheduling from stage execution creation. That was kind of OK because a stage execution was only created once, in the constructor of SqlQueryScheduler. But now that SqlStageExecution will be re-created for retries, it seems to make sense to abstract out a SqlStageExecutionFactory. A rough sketch of the idea follows.
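
For what it's worth, a hypothetical sketch of that separation; the factory class and its wiring here are illustrative, not the actual Presto code:

    import static java.util.Objects.requireNonNull;

    // SqlQueryScheduler decides *when* to (re)create a stage execution;
    // the factory owns the dependencies and knows *how* to create one.
    public class SqlStageExecutionFactory
    {
        private final RemoteTaskFactory remoteTaskFactory; // example dependency moved off SqlQueryScheduler

        public SqlStageExecutionFactory(RemoteTaskFactory remoteTaskFactory)
        {
            this.remoteTaskFactory = requireNonNull(remoteTaskFactory, "remoteTaskFactory is null");
        }

        public SqlStageExecution createStageExecution(StageId stageId, int attemptNumber)
        {
            // creation logic that previously lived in SqlQueryScheduler's
            // constructor moves here, so a retry can simply call this again
            throw new UnsupportedOperationException("sketch only");
        }
    }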

@rschlussel

Thanks @wenleix for the review! I like the idea of a StageExecutionFactory. I wouldn't worry about reviewing this too closely now (though major comments are appreciated), since some details have changed while I've been working on @arhimondr's comments (I haven't updated the PR because I haven't finished yet). Could you review #13730 instead, since I'm building on top of it now? That PR, as requested by Andrii, delays all of the stage creation until it's ready to execute, instead of just the scheduler creation as I do here.

@linux-foundation-easycla

linux-foundation-easycla bot commented Nov 22, 2019

CLA Check
The committers are authorized under a signed CLA.

@rschlussel force-pushed the stage-retries branch 2 times, most recently from 8a89604 to ad5815d on November 22, 2019 at 01:06
@rschlussel

TODO:

  • Make the test not flaky, and turn off bucketed retry for the stage retry test
  • Fix the UI

@rschlussel force-pushed the stage-retries branch 11 times, most recently from dc3d63b to 403ad13 on November 25, 2019 at 23:24
@rschlussel

I've extracted some things out of "Add support for retrying streaming sections" into their own commits. It's still a complex commit, but hopefully easier to review.

@shixuan-fan (Contributor) left a comment

First commit to "Add configuration property max-stage-retries" LGTM

@shixuan-fan (Contributor) left a comment

"Remove unused code from TestStageExecutionStateMachine" 👍
"Enable failure detector in tests" 👍

@shixuan-fan (Contributor) left a comment

"Extract SectionStageExecutionFactory from SqlQueryScheduler" LGTM. Didn't verify every line for code moves but it looks good when skimming.

Contributor:

Totally unrelated to this PR. Since createSqlQueryScheduler() is a factory method, and we already passed stateMachine in, this seems kinda redundant.

@shixuan-fan (Contributor) left a comment

"Remove id from StageExecutionInfo"

@shixuan-fan (Contributor) left a comment

"Introduce method for empty StageExecutionInfo" and "Reorder fields and methods in SqlQueryScheduler" LGTM

@shixuan-fan (Contributor) left a comment

"Add support for retrying streaming sections" I think it looks correct, but maybe I should pair review with @wenleix to make sure :D

Contributor:

Maybe we should have an enum to indicate whether we are creating a query runner for bucket recoverability or stage recoverability. It is not obvious that when materialized is true, we actually turn off bucket recoverability.

Contributor:

abort() returning false to mean "already aborted" does not seem intuitive; my intuition is that returning false means the abort call failed (I should have commented on the commit when it was introduced). Maybe we should have an isAborted() in SectionExecution?

Contributor Author:

We want the isAborted check and the abort() call to be atomic so that we only increment the retries once. This is the same as StateMachine's transitionToAborted, which returns false if the state was already a done state.
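
In other words, a minimal sketch of the pattern (illustrative, not the actual code):

    // abort() reports whether *this* call performed the transition, so the
    // done-check and the abort are one atomic step under the lock.
    public synchronized boolean abort()
    {
        if (state.isDone()) {
            return false; // already aborted/finished by someone else
        }
        state = StageExecutionState.ABORTED;
        return true;
    }

    // caller: only the call that won the race increments the retry count
    if (sectionExecution.abort()) {
        retryCount.incrementAndGet();
    }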

rschlussel and others added 7 commits January 29, 2020 10:22
Disabled for raptor tests because they currently fail with the failure
detector enabled, and for jmx tests because they rely on having a
consistent number of nodes.

Make the failure detector decay exponential, so we can configure it
to make the recoverable exocution tests more stable.

It's not used, and we don't want to create it when we create an empty
executionInfo when there haven't been any attempts yet.

When we lazily create StageExecutions, we'll need to generate empty
stageExecutionInfos for stages that don't have any executions yet.
@rschlussel

Commit "Add exponential decay config to failure detector" LGTM with a high level question.

Why configure it to be exponential decay make recoverable execution tests more stable? :) -- and is exponential decay reasonable used as production config?

Also there is a typo: "recoverable exocution " ;)

It definitely makes the test more stable. @arhimondr can you answer the question?

@rschlussel

The test was accidentally running with the legacy scheduler; this should be fixed now.

@wenleix wenleix requested a review from arhimondr January 29, 2020 18:27
@tdcmeehan (Contributor) left a comment

All previous commits LGTM ✅

"SELECT orderkey, day(shipdate) FROM lineitem WHERE orderkey % 31 <> 0 UNION ALL " +
"SELECT orderkey, day(commitdate) FROM lineitem WHERE orderkey % 31 <> 0 UNION ALL " +
"SELECT orderkey, day(receiptdate) FROM lineitem WHERE orderkey % 31 <> 0");
"SELECT orderkey, day(commitdate) FROM lineitem WHERE orderkey % 31 <> 0 UNION ALL " +
Contributor:

Super tiny nit, we can remove these formatting-only changes

@rschlussel

"Make FixedSourcePartitionedScheduler more thread safe": LGTM.

Why this issue never happens before? -- Does it only become an issue when there is retry?

yes. because with retries we close schedulers of sections that we abort due to failure, so some of the schedulers we close might still be running. Previously we relied on the fact that scheduling would also end whenever there was a failure and everything would get closed then.

@rschlussel

Quoting an earlier review question:

    For commit "Add hacky query monitor to find bugs in scheduler", do we intend to merge it into the codebase? :)

What do reviewers think? Basically the query monitor says that if no sections are in the running state for more than a minute, we assume something has gone wrong with the scheduler and fail the query. If we think it's okay for production, I'll remove the word "hacky" from the commit; otherwise I'll remove the commit altogether. I didn't see any failures due to this when running in the verifier, but I'm not sure if we'll get false positives if there's a full GC or something.
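
For reference, a hedged sketch of what such a monitor could look like; the names and wiring here are illustrative, not the actual commit:

    import java.util.concurrent.Executors;
    import java.util.concurrent.ScheduledExecutorService;
    import java.util.concurrent.TimeUnit;
    import java.util.concurrent.atomic.AtomicLong;

    // updated whenever any section is observed in the RUNNING state
    AtomicLong lastProgressNanos = new AtomicLong(System.nanoTime());

    ScheduledExecutorService monitor = Executors.newSingleThreadScheduledExecutor();
    monitor.scheduleWithFixedDelay(() -> {
        long idleNanos = System.nanoTime() - lastProgressNanos.get();
        if (idleNanos > TimeUnit.MINUTES.toNanos(1)) {
            // assume the scheduler is stuck and fail the query
            queryStateMachine.transitionToFailed(
                    new IllegalStateException("no sections have been running for over a minute"));
        }
    }, 10, 10, TimeUnit.SECONDS);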

@tdcmeehan (Contributor) left a comment

Make FixedSourcePartitionedScheduler more thread safe

Contributor:

Can't we still use the List interface?

Contributor:

I don't think you need volatile here, but you should annotate it with @GuardedBy("this")
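
That is (the field name is illustrative; the annotation is javax.annotation.concurrent.GuardedBy):

    @GuardedBy("this")
    private boolean scheduling; // read and written only inside synchronized (this) blocks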

Contributor:

Hmm... could we do something like just not call .clear() on the list instead? (Why do we clear() it?)

@tdcmeehan (Contributor) commented Jan 30, 2020

If the intent is to short-circuit the loop where we break if cancelled, isn't that the point of the cancelled flag? And it won't work, because the iterator is merrily iterating over the old, un-cleared version of the list. I missed the .remove() below; it makes sense now.

Contributor:

Should this be called closed?

@tdcmeehan (Contributor) commented Jan 30, 2020

I wonder if we can simplify everything here by just making cancelled (closed) an AtomicBoolean (to prevent double close), removing the clear() method invocation, and removing the synchronized blocks.
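
A sketch of that simplification, assuming the scheduling loop checks the flag itself (names illustrative, not the actual code):

    import java.util.concurrent.atomic.AtomicBoolean;

    private final AtomicBoolean closed = new AtomicBoolean();

    @Override
    public void close()
    {
        if (!closed.compareAndSet(false, true)) {
            return; // compareAndSet makes close() idempotent, preventing double close
        }
        for (SourceScheduler scheduler : sourceSchedulers) { // sourceSchedulers: the existing list of child schedulers
            scheduler.close();
        }
        // the list is deliberately not clear()ed; the scheduling loop checks
        // `closed` and exits instead of iterating a concurrently mutated list
    }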

Contributor Author:

Good question. I'm actually not sure why it was being cleared in the first place. My only thought is so that the schedulers get garbage collected. Do you think that if we don't clear the list, the objects would hang around too long?

Also, we do need to exit the scheduler loop after closing because otherwise we hit "HiveSplitSource is already closed" errors.

Contributor:

I got confused by the indentation in GH; makes sense. I thought the synchronized block didn't include the remove.


@wenleix (Contributor) commented Feb 6, 2020

Speaking of the query monitor, as pointed out by @rschlussel :

    What do reviewers think? Basically the query monitor says that if no sections are in the running state for more than a minute, we assume something has gone wrong with the scheduler and fail the query. If we think it's okay for production, I'll remove the word "hacky" from the commit; otherwise I'll remove the commit altogether. I didn't see any failures due to this when running in the verifier, but I'm not sure if we'll get false positives if there's a full GC or something.

It sounds like it has some value. So maybe we can keep it behind (another) config property? -- and maybe we can enable it by default in the verifier to help catch potential issues, and eventually enable it in prod?

What do you think? @arhimondr , @tdcmeehan , @shixuan-fan ?

rschlussel and others added 6 commits February 6, 2020 17:53
Fix 2 issues related to closing the scheduler during scheduling:

1) ConcurrentModificationException from the splitSources iterator
2) "HiveSplitSource is already closed" error from trying to schedule a
HiveSplitSource that's been closed

Organize methods by the order they are used in, and group fields thematically.

Add retriedCpuTime to track the CPU time spent on stages that eventually
fail and get retried. This CPU time isn't tracked by the regular cpuTime.

Additionally, include failed tasks in the retriedCpuTime even though that
time is also tracked by the total CPU time.

Co-authored-by: Shixuan Fan <[email protected]>
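
For context on the first fix in the "Fix 2 issues" commit above, a sketch of the kind of race it guards against (illustrative, not the actual Presto code): mutating a plain list from one thread while another thread iterates it throws ConcurrentModificationException; doing the iteration, the Iterator.remove(), and any other mutation under one lock avoids that.

    import java.util.ArrayList;
    import java.util.Iterator;
    import java.util.List;

    private final List<SplitSource> splitSources = new ArrayList<>();

    private void pruneFinishedSources()
    {
        synchronized (this) {
            Iterator<SplitSource> iterator = splitSources.iterator();
            while (iterator.hasNext()) {
                if (iterator.next().isFinished()) {
                    // remove via the iterator, under the same lock as other
                    // mutations, so no ConcurrentModificationException occurs
                    iterator.remove();
                }
            }
        }
    }
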
@arhimondr (Member) left a comment

LGTM % nits

Member:

It feels like this class wasn't designed to be multi-threaded, and it doesn't have to be.

Instead of closing the scheduler from a different thread in SectionExecution#abort, only the SqlStageExecution has to be transitioned to FAILED there.

Then the same thread that does the scheduling (in SqlQueryScheduler#schedule) can check whether the stage is in a done state (e.g. FAILED) and close the scheduler if so.
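
A sketch of that single-threaded pattern (method names illustrative):

    // in SectionExecution#abort: only transition state; no cross-thread close
    public void abort()
    {
        stageExecution.abort(); // moves the stage to a done state (e.g. ABORTED/FAILED)
    }

    // in the scheduling loop (SqlQueryScheduler#schedule), the same thread
    // that schedules also closes:
    if (stageExecution.getState().isDone()) {
        stageScheduler.close();
        continue;
    }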

worker2.stopResponding();

-assertEquals(result.get(60, SECONDS).getUpdateCount(), OptionalLong.of(expectedUpdateCount));
+assertEquals(result.get(1000, SECONDS).getUpdateCount(), OptionalLong.of(expectedUpdateCount));
Member:

Question: that's ~15 minutes. Why do we need such a large timeout? I remember running these tests, and they were finishing within ~20 seconds. Has that changed?

Contributor Author:

Hasn't changed. The 15 minutes there is unintentional. I can fix it in a follow-up.

@rschlussel

I removed the commit that checks for scheduler bugs and will submit it as a separate PR. @arhimondr, I'll address your comments in a follow-up pull request, as it requires more testing to ensure that changing when we close the scheduler won't introduce other problems.

@rschlussel rschlussel merged commit 4c2010e into prestodb:master Feb 7, 2020
@caithagoras mentioned this pull request Feb 20, 2020