[SPARK-14357] [CORE] Properly handle the root cause being a commit denied exception #12228
|
Should the matching for FetchFailedException, TaskKilledException & InterruptedException get similar treatment? |
|
ok to test |
case CausedBy(e) => e
}
// Assert
can you remove all these comments and blank lines? They're not super necessary
Sure thing, thanks.
|
Looks great. Representing it in |
|
Test build #55288 has finished for PR 12228 at commit
|
|
Flaky test? The second test that is running has passed that one. |
|
Test build #55293 has finished for PR 12228 at commit
|
|
Jenkins, retest this please. |
|
@andrewor14, think it's safe to just retest? I don't think the fatal error was because of these changes. I don't have the permissions to start a test run. |
execBackend.statusUpdate(taskId, TaskState.KILLED, ser.serialize(TaskKilled))
case cDE: CommitDeniedException =>
case CausedBy(cDE: CommitDeniedException) =>
It's tidy, but seems like overkill if this is the only place in the code that checks if "T, or its immediate cause, is a FooException". case t: Throwable if t.getCause.isInstanceOf[CommitDeniedException] => is sufficient to handle the additional case, at the cost of repeating the line of code.
Alternatively, if there are really a few instances of this pattern in the code that we can clean up with this pattern, then it seems worthwhile.
@srowen this is useful in other places too, so I don't think it's overkill. E.g. HiveExternalCatalog has some similar logic that could benefit from something like this in the future. What you suggested handles only 1 level of nesting and is less robust. I would prefer to leave this the way it is.
This also only handles 1 level, note. If it can be reused to improve other code that would be convincing.
it's recursive right? We call unapply inside unapply.
Oh I missed that, nice one. Yeah I agree this is a good one then, certainly if it can replace similar patterns elsewhere.
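For readers following the thread, here is a minimal sketch of the difference being discussed: a guard that inspects only the immediate cause versus an extractor whose unapply calls itself until it reaches the root cause. This is an illustration, not necessarily the code merged in this PR. DeniedException is a hypothetical stand-in for CommitDeniedException so that the sketch runs without Spark, and the names CausedByDemo, matchesOneLevel and matchesRootCause are invented for the example.

```scala
// Hypothetical stand-in for Spark's CommitDeniedException (assumption for this sketch).
class DeniedException(msg: String) extends Exception(msg)

// Extractor that yields the root cause of a Throwable.
// unapply calls itself on the cause, so any depth of nesting is unwrapped.
object CausedBy {
  def unapply(e: Throwable): Option[Throwable] =
    Option(e.getCause).flatMap(cause => unapply(cause)).orElse(Some(e))
}

object CausedByDemo {
  // Checks the throwable itself and its immediate cause only.
  def matchesOneLevel(t: Throwable): Boolean = t match {
    case _: DeniedException => true
    // isInstanceOf on a null cause is false, so no null check is needed here.
    case e if e.getCause.isInstanceOf[DeniedException] => true
    case _ => false
  }

  // Checks the root cause, however deeply it is wrapped.
  def matchesRootCause(t: Throwable): Boolean = t match {
    case CausedBy(_: DeniedException) => true
    case _ => false
  }

  def main(args: Array[String]): Unit = {
    val denied = new DeniedException("commit denied")
    val nested = new RuntimeException("outer", new RuntimeException("middle", denied))
    println(matchesOneLevel(nested))  // false: the denial is two levels down
    println(matchesRootCause(nested)) // true
  }
}
```

Note that the recursion in this sketch is not tail-recursive and assumes the cause chain is finite and acyclic.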
|
retest this please |
|
Oh never mind, looks like Jenkins is broken at the moment: https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Compile/job/spark-master-compile-maven-hadoop-2.3/. I think @JoshRosen is fixing it ATM. |
|
Fixed. retest this please |
|
Test build #55381 has finished for PR 12228 at commit
|
|
Merging into master, 1.6 and 1.5. |
…ied exception
## What changes were proposed in this pull request?
When deciding whether a CommitDeniedException caused a task to fail, consider the root cause of the Exception.
## How was this patch tested?
Added a test suite for the component that extracts the root cause of the error.
Made a distribution after cherry-picking this commit to branch-1.6 and used to run our Spark application that would quite often fail due to the CommitDeniedException.
Author: Jason Moore <[email protected]>
Closes #12228 from jasonmoore2k/SPARK-14357.
(cherry picked from commit 22014e6)
Signed-off-by: Andrew Or <[email protected]>
|
SGTM; I checked for other instances of this pattern and I don't see any, which is I suppose good news (didn't miss any) and bad (unfortunately don't get to simplify anything else with this whole new class)
…ied exception
## What changes were proposed in this pull request?
When deciding whether a CommitDeniedException caused a task to fail, consider the root cause of the Exception.
## How was this patch tested?
Added a test suite for the component that extracts the root cause of the error.
Made a distribution after cherry-picking this commit to branch-1.6 and used to run our Spark application that would quite often fail due to the CommitDeniedException.
Author: Jason Moore <[email protected]>
Closes apache#12228 from jasonmoore2k/SPARK-14357.
(cherry picked from commit 22014e6)
Signed-off-by: Andrew Or <[email protected]>
(cherry picked from commit 7a02c44)
|
I'm starting to think this change wasn't great to make. I'm now seeing speculated tasks sometimes getting into a cycle of running, being (legitimately) denied from committing their tasks, retrying, being denied, retrying, etc. And with the change I made in this PR, this retrying now has no limit. Which is really bad (jobs potentially running forever). Whereas previously they would eventually fail (still not great). I think the ideal behavior would be to have the speculated task instead be considered a success if the commit is denied legitimately (because another task for the same partition has already completed), rather than considered a failure and retried without limit (to only fail again). Any thoughts? |
|
You know more about the semantics than I do, but does this change really alter the behavior in this way? It seems like it just made something caused by a CDE be handled like a CDE. Are you suggesting just reverting this, or some other change?
|
Previously, code was in place to treat a CDE in a particular way, but it wasn't sufficient (it didn't consider that the CDE was actually nested inside a SparkException at the point it was being handled). This PR fixed that, but it seems to have highlighted a problem with the originally intended behavior (in the worst case it can cause an infinite loop). Reverting my change could return it to a state where speculation would often cause a job to be aborted (but at least finish), which is still not great. Alternatively, I'm thinking we could put a task that receives a CDE from the driver into TaskState.FINISHED or some other state to indicate that the task shouldn't be resubmitted by the TaskScheduler. I'd probably need some opinions on whether there are other consequences of doing something like this. It might be worth moving this discussion to a JIRA ticket. I'll fire one up.
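To make the three behaviors in this comment concrete, here is a toy model, not Spark's scheduler code. It assumes a task whose every attempt is denied the commit, and counts how many attempts are launched before the loop stops on its own. The policy names, attemptsUntilStop, and the safetyCap parameter are all hypothetical and exist only so the sketch terminates. The value 4 for maxFailures is chosen to mirror the default of spark.task.maxFailures.

```scala
// Hypothetical policies for handling a denied commit (names invented for this sketch).
sealed trait DeniedPolicy
case object CountAsFailure extends DeniedPolicy  // before this PR: a wrapped CDE counted as a failure
case object RetryUncounted extends DeniedPolicy  // after this PR: retried without counting
case object TreatAsFinished extends DeniedPolicy // proposal: do not resubmit the task

object DeniedRetrySketch {
  // Number of attempts launched for a task whose every attempt is denied the commit.
  // Returns None when safetyCap is reached, meaning the loop would not stop by itself.
  def attemptsUntilStop(policy: DeniedPolicy, maxFailures: Int, safetyCap: Int): Option[Int] = {
    var launched = 0
    var countedFailures = 0
    while (launched < safetyCap) {
      launched += 1
      policy match {
        case TreatAsFinished =>
          return Some(launched)
        case CountAsFailure =>
          countedFailures += 1
          if (countedFailures >= maxFailures) return Some(launched)
        case RetryUncounted =>
          () // nothing is counted, so nothing ever stops the retries
      }
    }
    None
  }

  def main(args: Array[String]): Unit = {
    println(attemptsUntilStop(CountAsFailure, 4, 1000))  // Some(4): job aborts after four failures
    println(attemptsUntilStop(RetryUncounted, 4, 1000))  // None: only the safety cap ends the loop
    println(attemptsUntilStop(TreatAsFinished, 4, 1000)) // Some(1): no resubmission
  }
}
```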
|
I've opened this ticket: https://issues.apache.org/jira/browse/SPARK-14915 |
What changes were proposed in this pull request?
When deciding whether a CommitDeniedException caused a task to fail, consider the root cause of the Exception.
How was this patch tested?
Added a test suite for the component that extracts the root cause of the error.
Made a distribution after cherry-picking this commit to branch-1.6 and used to run our Spark application that would quite often fail due to the CommitDeniedException.