Skip to content

[SPARK-25910][CORE] accumulator updates from previous stage attempt should not log error#22923

Closed
cloud-fan wants to merge 2 commits intoapache:masterfrom
cloud-fan:late-task
Closed

[SPARK-25910][CORE] accumulator updates from previous stage attempt should not log error#22923
cloud-fan wants to merge 2 commits intoapache:masterfrom
cloud-fan:late-task

Conversation

@cloud-fan
Copy link
Copy Markdown
Contributor

What changes were proposed in this pull request?

For shuffle map stages, we may have multiple attempts, while only the latest attempt is active. However, the scheduler still accepts successful tasks from previous attempts, to speed up the execution.

Each stage attempt has a StageInfo instance, which contains TaskMetrics. TaskMetrics has a bunch of accumulators to track the metrics like CPU time, etc. However, a stage only keeps the StageInfo of the latest attempt, which means the StageInfo of previous attempts will be GCed, and their accumulators of TaskMetrics will be cleaned.

This causes a problem: When the scheduler accepts a successful task from a previous attempt, and tries to update accumulators, we may fail to get the accumulators from AccumulatorContext, as they are already cleaned. And we may hit error log like

18/10/21 15:30:24 INFO ContextCleaner: Cleaned accumulator 2868 (name: internal.metrics.executorDeserializeTime)
18/10/21 15:30:24 ERROR DAGScheduler: Failed to update accumulators for task 7927
org.apache.spark.SparkException: attempted to access non-existent accumulator 2868
at org.apache.spark.scheduler.DAGScheduler$$anonfun$updateAccumulators$1.apply(DAGScheduler.scala:1267)
...

This PR proposes a simple fix: When the scheduler receives successful tasks from previous attempts, don't update accumulators. Accumulators of previous stage attemps are not tracked anymore, so we don't need to update them.

How was this patch tested?

a new test.

@cloud-fan
Copy link
Copy Markdown
Contributor Author

cc @vanzin @zsxwing @jiangxb1987

@SparkQA
Copy link
Copy Markdown

SparkQA commented Nov 1, 2018

Test build #98355 has finished for PR 22923 at commit 4d9cbe0.

  • This patch fails SparkR unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon
Copy link
Copy Markdown
Member

retest this please

@SparkQA
Copy link
Copy Markdown

SparkQA commented Nov 3, 2018

Test build #98416 has finished for PR 22923 at commit 4d9cbe0.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Copy link
Copy Markdown
Contributor

@jiangxb1987 jiangxb1987 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@zsxwing
Copy link
Copy Markdown
Member

zsxwing commented Nov 5, 2018

We need to always update user accumulators. Right now such task metrics just cause some annoying error logs, seems not worth to fix.

@cloud-fan
Copy link
Copy Markdown
Contributor Author

We need to always update user accumulators

Ah that's a good point.

I'm going to close it. I missed one thing: the AppStatusListener will keep the StageInfo instance until all tasks of that stage attempt is finished. See #22209

So this bug should already have been fixed.

@cloud-fan cloud-fan closed this Nov 5, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants