@JoshRosen
Contributor

This is a backport of #8544 to branch-1.3 for inclusion in 1.3.2.

…mitCoordinator

When speculative execution is enabled, consider a scenario where the authorized committer of a particular output partition fails during the OutputCommitter.commitTask() call. In this case, the OutputCommitCoordinator is supposed to release that committer's exclusive commit lock once the task fails. However, because of an identifier mismatch (the task attempt number was used in one place and the task attempt id in another), the lock is never released, and Spark goes into an infinite retry loop.
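The failure mode described above can be illustrated with a toy model. This is only a sketch, not Spark's OutputCommitCoordinator: the class `ToyCommitCoordinator`, the helper `retry_can_commit`, and the concrete values (attempt number 0, task attempt id 7) are all hypothetical, and Python is used purely for brevity even though the patched code is Scala. The model keys the commit lock by attempt number, then shows what happens when a failure is reported with the other identifier.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class ToyCommitCoordinator:
    """Toy model of a per-partition commit lock (hypothetical, not Spark's class)."""

    # partition -> attempt number of the attempt currently holding the lock
    authorized: Dict[int, int] = field(default_factory=dict)

    def can_commit(self, partition: int, attempt_number: int) -> bool:
        holder = self.authorized.get(partition)
        if holder is None:
            # Nobody holds the lock yet: the first asker becomes the committer.
            self.authorized[partition] = attempt_number
            return True
        return holder == attempt_number

    def task_failed(self, partition: int, attempt: int) -> None:
        # Release the lock only if the failed attempt matches the holder.
        if self.authorized.get(partition) == attempt:
            del self.authorized[partition]


def retry_can_commit(report_with_attempt_id: bool) -> bool:
    """Return whether a retry may commit after the first committer fails."""
    coordinator = ToyCommitCoordinator()
    # Assumed example values: first attempt of the task, with an unrelated id.
    partition, attempt_number, task_attempt_id = 0, 0, 7
    assert coordinator.can_commit(partition, attempt_number)

    # The mismatch: the lock was taken with the attempt number, but the
    # failure may be reported with the task attempt id instead.
    reported = task_attempt_id if report_with_attempt_id else attempt_number
    coordinator.task_failed(partition, reported)

    # The retry (attempt number 1) now asks for permission to commit.
    return coordinator.can_commit(partition, attempt_number + 1)


if __name__ == "__main__":
    print("retry allowed, failure reported by id:    ", retry_can_commit(True))
    print("retry allowed, failure reported by number:", retry_can_commit(False))
```

In this model, reporting the failure with the id leaves the lock held by an attempt that no longer exists, so every retry is denied; reporting it with the attempt number releases the lock and the retry can commit.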

This bug was masked by the OutputCommitCoordinator's lack of end-to-end tests (the current tests rely heavily on mocks). Another contributing factor is that several similarly named identifiers share a data type but have different semantics (e.g. attemptNumber and taskAttemptId), and inconsistent variable naming makes them hard to tell apart.
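One general way to guard against this kind of confusion is to give each identifier its own wrapper type, so that passing one where the other is expected fails loudly. The sketch below is an illustration of that idea only; it is not what this patch does, and the names `AttemptNumber`, `TaskAttemptId`, and `release_lock` are hypothetical. Python is used for brevity.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AttemptNumber:
    """Per-task retry counter: 0 for the first attempt, 1 for the next, and so on."""

    value: int


@dataclass(frozen=True)
class TaskAttemptId:
    """Identifier that is unique across task attempts (assumed semantics)."""

    value: int


def release_lock(holder: AttemptNumber, failed: AttemptNumber) -> bool:
    """Return True if the failed attempt is the lock holder and should release it."""
    if not isinstance(failed, AttemptNumber):
        # With plain integers this mistake would silently compare unequal.
        raise TypeError(f"expected AttemptNumber, got {type(failed).__name__}")
    return holder == failed


if __name__ == "__main__":
    print(release_lock(AttemptNumber(0), AttemptNumber(0)))
    try:
        release_lock(AttemptNumber(0), TaskAttemptId(0))
    except TypeError as err:
        print("rejected:", err)
```

With bare integers, the mixed-up call would just return a wrong answer; with distinct types it is rejected at the call site, which makes the mistake visible in tests.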

This patch adds a regression test and fixes this bug by always using task attempt numbers throughout this code.

Author: Josh Rosen <[email protected]>

Closes apache#8544 from JoshRosen/SPARK-10381.

(cherry picked from commit 38700ea)
Signed-off-by: Josh Rosen <[email protected]>
@SparkQA

SparkQA commented Sep 17, 2015

Test build #42561 has finished for PR 8790 at commit bb34d15.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Sep 17, 2015

Test build #42569 has finished for PR 8790 at commit dd615db.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@JoshRosen
Contributor Author

Jenkins, retest this please.

Contributor

Maybe we should look at the code to confirm what "attempt" means here. I found https://github.com/apache/spark/blob/branch-1.3/core/src/main/scala/org/apache/spark/scheduler/TaskSetManager.scala#L451-L452, and it looks like it is indeed attemptNumber.

@yhuai
Contributor

yhuai commented Sep 18, 2015

LGTM. We can merge it once jenkins is good.

@SparkQA

SparkQA commented Sep 18, 2015

Test build #42696 timed out for PR 8790 at commit dd615db after a configured wait of 120m.

@JoshRosen
Contributor Author

Jenkins, retest this please.

@SparkQA

SparkQA commented Sep 19, 2015

Test build #42704 has started for PR 8790 at commit dd615db.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Sep 19, 2015

Test build #42705 has finished for PR 8790 at commit dd615db.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@yhuai
Contributor

yhuai commented Sep 19, 2015

Jenkins, retest this please.

@SparkQA

SparkQA commented Sep 19, 2015

Test build #42715 has finished for PR 8790 at commit dd615db.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@JoshRosen
Contributor Author

Jenkins, retest this please.

@SparkQA

SparkQA commented Sep 21, 2015

Test build #42763 timed out for PR 8790 at commit dd615db after a configured wait of 120m.

@marmbrus
Contributor

test this please

@JoshRosen
Contributor Author

It looks like all of the tests passed on this last run; it just timed out during MiMa checks while fetching some old versions of dependencies. The tests that failed in earlier runs are known to be flaky. Given this, I'm going to merge this now.

asfgit pushed a commit that referenced this pull request Sep 22, 2015
…mitCoordinator (branch-1.3 backport)

This is a backport of #8544 to `branch-1.3` for inclusion in 1.3.2.

Author: Josh Rosen <[email protected]>

Closes #8790 from JoshRosen/SPARK-10381-1.3.
@JoshRosen JoshRosen closed this Sep 22, 2015
@JoshRosen JoshRosen deleted the SPARK-10381-1.3 branch September 22, 2015 20:37
@SparkQA

SparkQA commented Sep 22, 2015

Test build #42850 has finished for PR 8790 at commit dd615db.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/42850/


5 participants