Conversation

@jasonmoore2k
Contributor

What changes were proposed in this pull request?

When deciding whether a CommitDeniedException caused a task to fail, consider the root cause of the Exception.

How was this patch tested?

Added a test suite for the component that extracts the root cause of the error.
Made a distribution after cherry-picking this commit to branch-1.6 and used it to run our Spark application, which would quite often fail due to the CommitDeniedException.
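
For readers following along, a minimal sketch of the extractor this PR introduces (the shape is inferred from the review snippets quoted below, so the actual Spark source may differ in detail):

```scala
object CausedBy {
  // Follow getCause recursively; return the exception itself when it has
  // no cause. unapply always returns Some, so CausedBy(e) always matches
  // and binds the root cause.
  def unapply(e: Throwable): Option[Throwable] =
    Option(e.getCause).flatMap(cause => unapply(cause)).orElse(Some(e))
}
```

With this in scope, a handler can write `case CausedBy(cde: CommitDeniedException) =>` and catch a commit denial even when it arrives wrapped in another exception.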

@jasonmoore2k
Contributor Author

Should the matching for FetchFailedException, TaskKilledException & InterruptedException get similar treatment?

@andrewor14
Contributor

ok to test

case CausedBy(e) => e
}

// Assert
Contributor


can you remove all these comments and blank lines? They're not super necessary

Contributor Author


Sure thing, thanks.

@andrewor14
Contributor

Looks great. Representing it in unapply is quite elegant.

@SparkQA

SparkQA commented Apr 8, 2016

Test build #55288 has finished for PR 12228 at commit 64a4499.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@jasonmoore2k
Contributor Author

Flaky test? The second build that's running has already passed that test.

@SparkQA

SparkQA commented Apr 8, 2016

Test build #55293 has finished for PR 12228 at commit d5ebd55.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@jasonmoore2k
Contributor Author

Jenkins, retest this please.

@jasonmoore2k
Contributor Author

@andrewor14, do you think it's safe to just retest? I don't think the fatal error was caused by these changes. I don't have permission to start a test run.

execBackend.statusUpdate(taskId, TaskState.KILLED, ser.serialize(TaskKilled))

-  case cDE: CommitDeniedException =>
+  case CausedBy(cDE: CommitDeniedException) =>
Member


It's tidy, but it seems like overkill if this is the only place in the code that checks whether "T, or its immediate cause, is a FooException". `case t: Throwable if t.getCause.isInstanceOf[CommitDeniedException] =>` is sufficient to handle the additional case, at the cost of repeating the line of code.

Alternatively, if there really are a few instances of this pattern in the code that we can clean up this way, then it seems worthwhile.
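
As an aside, the one-level guard suggested here can be sketched as follows; DeniedException is a hypothetical stand-in, since Spark's CommitDeniedException is internal to Spark:

```scala
// Hypothetical stand-in for Spark's internal CommitDeniedException.
class DeniedException(msg: String) extends Exception(msg)

def handle(t: Throwable): String = t match {
  // In Scala, null.isInstanceOf[...] is simply false, so a missing
  // cause is handled safely by this guard.
  case e: Throwable if e.getCause.isInstanceOf[DeniedException] => "denied"
  case _ => "other"
}

// One level of nesting matches; deeper nesting does not:
handle(new RuntimeException(new DeniedException("x")))   // "denied"
handle(new RuntimeException(
  new RuntimeException(new DeniedException("x"))))       // "other"
```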

Contributor


@srowen this is useful in other places too, so I don't think it's overkill. E.g. HiveExternalCatalog has some similar logic that could benefit from something like this in the future. What you suggested handles only 1 level of nesting and is less robust. I would prefer to leave this the way it is.

Member


Note that this also only handles one level. If it can be reused to improve other code, that would be convincing.

Contributor


It's recursive, right? We call unapply inside unapply.
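
A quick illustration of that recursion, reusing the CausedBy sketch shown after the PR description (standard JDK exceptions used for the nesting):

```scala
// unapply recurses through getCause, so arbitrarily deep nesting
// unwraps to the root cause, not just one level.
val nested = new RuntimeException(
  new IllegalStateException(new UnsupportedOperationException("root")))

nested match {
  case CausedBy(e: UnsupportedOperationException) =>
    println(s"root cause: ${e.getMessage}")  // prints: root cause: root
  case _ =>
    println("no match")
}
```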

Member


Oh I missed that, nice one. Yeah I agree this is a good one then, certainly if it can replace similar patterns elsewhere.

@andrewor14
Contributor

retest this please

@andrewor14
Contributor

Oh never mind, looks like Jenkins is broken at the moment: https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Compile/job/spark-master-compile-maven-hadoop-2.3/. I think @JoshRosen is fixing it ATM.

@andrewor14
Contributor

Fixed. retest this please

@SparkQA

SparkQA commented Apr 8, 2016

Test build #55381 has finished for PR 12228 at commit d5ebd55.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@andrewor14
Contributor

Merging into master, 1.6 and 1.5.

asfgit pushed a commit that referenced this pull request Apr 10, 2016
…ied exception

## What changes were proposed in this pull request?

When deciding whether a CommitDeniedException caused a task to fail, consider the root cause of the Exception.

## How was this patch tested?

Added a test suite for the component that extracts the root cause of the error.
Made a distribution after cherry-picking this commit to branch-1.6 and used it to run our Spark application, which would quite often fail due to the CommitDeniedException.

Author: Jason Moore <[email protected]>

Closes #12228 from jasonmoore2k/SPARK-14357.

(cherry picked from commit 22014e6)
Signed-off-by: Andrew Or <[email protected]>
@asfgit closed this in 22014e6 on Apr 10, 2016
@srowen
Member

srowen commented Apr 10, 2016

SGTM; I checked for other instances of this pattern and I don't see any, which is I suppose good news (we didn't miss any) and bad news (unfortunately we don't get to simplify anything else with this whole new class).

zzcclp pushed a commit to zzcclp/spark that referenced this pull request Apr 11, 2016
…ied exception

Author: Jason Moore <[email protected]>

Closes apache#12228 from jasonmoore2k/SPARK-14357.

(cherry picked from commit 22014e6)
Signed-off-by: Andrew Or <[email protected]>
(cherry picked from commit 7a02c44)
@jasonmoore2k
Contributor Author

@andrewor14 @srowen

I'm starting to think this change wasn't great to make. I'm now seeing speculated tasks sometimes get into a cycle of running, being (legitimately) denied from committing, retrying, being denied again, and so on. And with the change I made in this PR, this retrying now has no limit, which is really bad: jobs can potentially run forever. Previously they would at least eventually fail (still not great).

I think the ideal behavior would be for the speculated task to be considered a success if the commit is denied legitimately (because another task for the same partition has already completed), rather than being considered a failure and retried without limit (only to fail again). Any thoughts?

@srowen
Member

srowen commented Apr 26, 2016

You know more about the semantics than I do, but does this change alter the behavior in that way? It seems it just made something caused by a CDE be handled like a CDE. Are you suggesting reverting this, or some other change?

@jasonmoore2k
Contributor Author

Previously, code was in place to treat a CDE in a particular way, but it wasn't sufficient: it didn't account for the CDE being nested inside a SparkException at the point where it was being handled. This PR fixed that, but it seems to have exposed a problem with the originally intended behavior (the worst case is an infinite loop).

Reverting my change could return it to a state where speculation would often cause a job to be aborted (but at least finish), which is still not great.

Alternatively, I'm thinking we could put a task that receives a CDE from the driver into TaskState.FINISHED, or some other state indicating that the task shouldn't be resubmitted by the TaskScheduler. I'd appreciate opinions on whether doing something like this has other consequences.
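
Purely as a hypothetical sketch of that proposal (this is not Spark code: the fragment echoes the statusUpdate snippet quoted earlier in this thread, and reason stands in for whatever commit-denied TaskEndReason the executor builds):

```scala
case CausedBy(cDE: CommitDeniedException) =>
  // Proposed: report a terminal, non-failure state so the TaskScheduler
  // does not resubmit the task. The code as merged reports a failure
  // here, which is what allows the unbounded retry loop described above.
  execBackend.statusUpdate(taskId, TaskState.FINISHED, ser.serialize(reason))
```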

It might be worth moving this discussion to a JIRA ticket. I'll fire one up.

@jasonmoore2k
Contributor Author

I've opened this ticket: https://issues.apache.org/jira/browse/SPARK-14915
