
[SPARK-24415][Core] Fixed the aggregated stage metrics by retaining stage objects in liveStages until all tasks are complete#22209

Closed
ankuriitg wants to merge 3 commits into apache:master from ankuriitg:ankurgupta/SPARK-24415

Conversation

@ankuriitg
Contributor

@ankuriitg ankuriitg commented Aug 23, 2018

What changes were proposed in this pull request?

The problem occurs because the stage object is removed from liveStages in
AppStatusListener.onStageCompletion. Because of this, any onTaskEnd event
received after the onStageCompletion event does not update the stage metrics.

The fix is to retain stage objects in liveStages until all tasks are complete.
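The mechanics of the fix can be sketched as a small standalone model (hypothetical names, not the actual AppStatusListener code): a stage's entry survives its completion event until the count of outstanding tasks drops to zero, so a late onTaskEnd still updates the aggregated metrics.

```scala
import scala.collection.mutable

case class StageId(id: Int, attempt: Int)

class MiniListener {
  // stage -> number of tasks started but not yet finished
  val liveStages = mutable.HashMap[StageId, Int]()
  // stage -> aggregated task run time
  val metrics = mutable.HashMap[StageId, Long]()
  private val completedStages = mutable.HashSet[StageId]()

  def onTaskStart(stage: StageId): Unit =
    liveStages(stage) = liveStages.getOrElse(stage, 0) + 1

  def onTaskEnd(stage: StageId, runTime: Long): Unit = {
    // A task-end event may arrive after onStageCompleted (e.g. on a
    // stage failure); the stage entry must still be live so its
    // aggregated metrics get updated.
    metrics(stage) = metrics.getOrElse(stage, 0L) + runTime
    liveStages(stage) = liveStages.getOrElse(stage, 1) - 1
    maybeRemove(stage)
  }

  def onStageCompleted(stage: StageId): Unit = {
    completedStages += stage
    maybeRemove(stage)
  }

  // Only drop the stage once it is complete AND no tasks are pending.
  private def maybeRemove(stage: StageId): Unit =
    if (completedStages(stage) && liveStages.getOrElse(stage, 0) <= 0) {
      liveStages.remove(stage)
    }
}
```

With this shape, neither event ordering (task end before or after stage completion) loses metrics, which is the behavior the unit test below exercises.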

How was this patch tested?

  1. Fixed the reproducible example posted in the JIRA
  2. Added unit test

@ankuriitg
Contributor Author

@vanzin @squito

@vanzin
Contributor

vanzin commented Aug 23, 2018

PR title should describe the fix, not the problem. The problem is already described on the bug.

@vanzin
Contributor

vanzin commented Aug 23, 2018

ok to test

@tgravescs
Contributor

Just a note: the JIRA I filed for this was actually affecting running jobs as well (https://issues.apache.org/jira/browse/SPARK-24415).

I think we have seen a similar issue on the history server as well, though, so perhaps it's related. I haven't looked at the code here yet.

@ankuriitg ankuriitg changed the title [SPARK-24415][Core] HistoryServer does not display metrics from tasks… [SPARK-24415][Core] Fixed the issue where stage page aggregated executor metrics are incorrect on stage failure Aug 23, 2018
@ankuriitg
Contributor Author

Thanks for your comment @tgravescs. This fix is indeed for the SparkUI (running jobs) issue. It does not fix it for history server yet.

@ankuriitg
Contributor Author

Just checked: this fixes the Spark History Server issue (SPARK-24539) as well. I just needed to restart the SHS in order for my changes to take effect.

Contributor

removal from iterator should always happen

Contributor Author

For already completed stages, we will leave the removal of stage to happen in either onTaskEnd or onStageCompleted event. This ensures that stage metrics are updated even when onJobEnd event is received before onTaskEnd event.

Contributor

I am not sure I follow - if that is the case, why are we doing this for active stages here ? onStageCompleted/onTaskEnd would be fired for active stages as well.

Contributor

Btw, there is an existing bug that we are not updating pool, etc which we do in onStageCompleted ...

Contributor Author

I think the assumption here is that we will always receive the onStageCompleted event before the onJobEnd event. If that does not occur for some reason, then any active stages are marked as skipped.
I don't know of a scenario where the onStageCompleted event is not received before the onJobEnd event (or not received at all). Let me look further into it. Additionally, I will also fix the bug for updating the pool.

Contributor

This can happen when events get dropped ...
Spark makes a best-effort attempt to deliver events in order, but when events get dropped, the UI becomes inconsistent. I assume this might be an effort to recover in that case? Any thoughts @tgravescs, @vanzin ?

Contributor

In your question, this == this PR?

If so, no, that's not what it's fixing. Task end events can "naturally" arrive after the stage end event in the case of a stage failure, and this code was missing that case.

When event drops occur, a lot of things get out of sync, and this change wouldn't fix that. It perhaps could make it a little worse: if a task end event does not arrive, then maybe with this change the stage will never be actually removed from the live stages map. Not sure how easy it would be to recover from that though, since dropped events could probably cause other sorts of leaks in this class too, but I also feel that's a separate issue.

(Also, hopefully, dropped events for this listener should be less common in 2.3 after the listener bus changes.)

Contributor

To clarify, I was referring to 'this' being the job end event received before the stage end (for a stage which is part of a job).

I was not referring to task end events (those can come in after stage or job ends).

Thanks for clarifying @vanzin ... given the snippet is not trying to recover from event drops, I am wondering why "non"-skipped stages would even be in the list: I would expect all of them to be skipped?

Contributor

They would be in the list if the task end event arrives late, right? (Haven't really re-read the code to be sure.)

Unless it's guaranteed that the task end event will arrive before the job end event, unlike the case for the stage end one.

Contributor Author

This code now only handles the scenario where the onStageCompleted event is dropped (not received). If we don't want to handle that scenario, then we can remove this part of the code altogether.

@SparkQA

SparkQA commented Aug 23, 2018

Test build #95183 has finished for PR 22209 at commit cd1e856.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@ankuriitg ankuriitg force-pushed the ankurgupta/SPARK-24415 branch from cd1e856 to 55637c3 Compare August 24, 2018 16:42
@vanzin
Contributor

vanzin commented Aug 24, 2018

Sorry, but this is my biggest pet peeve. Your PR title still only explains the problem. "Fix problem X" does not describe the fix.

Contributor

@vanzin vanzin left a comment

Looks good, just some minor things.

Contributor

"update metrics for tasks that finish late". Same thing, much shorter. Could also add the bug number to the test name.

Contributor

This is the same comment as above; either remove this one or update each comment to say where the check is occurring. (I prefer the former.)

Contributor

last = removeStage.

@ankuriitg ankuriitg changed the title [SPARK-24415][Core] Fixed the issue where stage page aggregated executor metrics are incorrect on stage failure [SPARK-24415][Core] Fixed the aggregated stage metrics by retaining stage objects in liveStages until all tasks are complete Aug 24, 2018
@ankuriitg ankuriitg force-pushed the ankurgupta/SPARK-24415 branch from 55637c3 to 69497a2 Compare August 24, 2018 18:07
@SparkQA

SparkQA commented Aug 24, 2018

Test build #95218 has finished for PR 22209 at commit 55637c3.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@ankuriitg ankuriitg force-pushed the ankurgupta/SPARK-24415 branch from 69497a2 to 70678dc Compare August 24, 2018 21:07
@SparkQA

SparkQA commented Aug 24, 2018

Test build #95223 has finished for PR 22209 at commit 69497a2.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@vanzin
Contributor

vanzin commented Aug 24, 2018

retest this please

@SparkQA

SparkQA commented Aug 25, 2018

Test build #95229 has finished for PR 22209 at commit 70678dc.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Aug 25, 2018

Test build #95233 has finished for PR 22209 at commit 70678dc.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@ankuriitg ankuriitg force-pushed the ankurgupta/SPARK-24415 branch from 70678dc to 0552af0 Compare August 27, 2018 18:38
Contributor

So I went back and took a closer look and I think this isn't entirely correct (and wasn't entirely correct before either).

If I remember the semantics correctly, the stage should be skipped if it is part of the job's stages, and is in the pending state when the job finishes.

If it's in the active state, it should not be marked as skipped. If you do that, the update to the skipped tasks (in L358) will most certainly be wrong.

So if the state is still active here, it means some event was missed. The best we can do in that case, I think, is remove it from the live stages list and update the pool data, and that's it.

On a related note, if the "onStageSubmitted" event is missed, the stage will remain in the "pending" state even if tasks start on it. Perhaps that could also be added to the "onTaskStart" handler, just to be sure the stage is marked as active.

Contributor Author

I changed the commit to mark a stage as skipped only if it is in the Pending state.

I did not update "onTaskStart" to handle a missed "onStageSubmitted" event, as that can be part of a different review, if required.

Contributor

I think that there are still two things here:

  • according to code in DAGScheduler#handleTaskCompletion, it seems like it's possible for a job to be marked finished before all task end events arrive. The job is marked finished as soon as all partitions are computed, so if you have e.g. speculative tasks you may miss things by removing the stage here.

  • re-reading the code again, the pool only really needs to be updated in the pending case, since other cases are handled in onStageCompleted already.

This will cause the leak that I mentioned before, though, where missing events make things remain in the live lists forever. But that's already a problem with this code, and we should look at all of those as a separate task.

So I think it's more correct to only do anything for pending stages here.

Contributor Author

Done. Updating the stage in onJobEnd event only if it is in Pending status.
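The onJobEnd behavior settled on above (touch only stages still in the Pending state, count them as skipped, and leave active stages for their own completion events) might look roughly like this simplified sketch; the names here are illustrative, not the real AppStatusListener code.

```scala
import scala.collection.mutable

object StageStatus extends Enumeration {
  val Pending, Active = Value
}

class JobEndHandler {
  val liveStages = mutable.HashMap[Int, StageStatus.Value]()
  var skippedStages = 0

  // When a job ends, only stages still PENDING are marked skipped and
  // dropped; ACTIVE stages are left alone, since their own
  // onStageCompleted / late onTaskEnd events will remove them
  // (that removal is the fix in this PR).
  def onJobEnd(stageIds: Seq[Int]): Unit =
    stageIds.foreach { id =>
      if (liveStages.get(id).contains(StageStatus.Pending)) {
        skippedStages += 1
        liveStages.remove(id)
      }
    }
}
```

This avoids the problem vanzin pointed out: marking an active stage as skipped would corrupt the skipped-task counts, so active stages are simply left in the live set.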

Contributor

Since you're touching this: .foreach { _ =>
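For readers unfamiliar with the idiom being suggested: `Option.foreach { _ => ... }` runs a side effect only when the option is defined, ignoring the wrapped value. A minimal standalone illustration (not the PR's actual code):

```scala
var jobCleaned = false

// Defined option: the block runs once, the wrapped value is ignored.
val maybeJob: Option[Int] = Some(42)
maybeJob.foreach { _ =>
  jobCleaned = true
}

// Empty option: the block never runs.
val missing: Option[Int] = None
missing.foreach { _ =>
  sys.error("never reached")
}
```

This is generally preferred over `if (opt.isDefined) { ... }` or a `match` when only the presence of the value matters.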

@SparkQA

SparkQA commented Aug 27, 2018

Test build #95305 has finished for PR 22209 at commit 0552af0.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

[SPARK-24415][Core] Fixed the aggregated stage metrics by retaining stage objects in liveStages until all tasks are complete

The problem occurs because the stage object is removed from liveStages in
AppStatusListener.onStageCompletion. Because of this, any onTaskEnd event
received after the onStageCompletion event does not update the stage metrics.

The fix is to retain stage objects in liveStages until all tasks are complete.

Testing Done:
1. Fixed the reproducible example posted in the JIRA
2. Added unit test
3. Fixed the flaky test in UISeleniumSuite
@ankuriitg ankuriitg force-pushed the ankurgupta/SPARK-24415 branch from 0552af0 to 76f1801 Compare August 28, 2018 15:50
@SparkQA

SparkQA commented Aug 28, 2018

Test build #95358 has finished for PR 22209 at commit 76f1801.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@bersprockets
Contributor

Looks like the test failed due to https://issues.apache.org/jira/browse/SPARK-23622

@bersprockets
Contributor

retest this please

@SparkQA

SparkQA commented Aug 29, 2018

Test build #95389 has finished for PR 22209 at commit dccbf36.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Aug 29, 2018

Test build #95382 has finished for PR 22209 at commit 76f1801.

  • This patch fails from timeout after a configured wait of `400m`.
  • This patch merges cleanly.
  • This patch adds no public classes.

@bersprockets
Contributor

retest this please

Contributor

@vanzin vanzin left a comment

Found a small thing, plus a couple of nits.

@SparkQA

SparkQA commented Aug 29, 2018

Test build #95431 has finished for PR 22209 at commit dccbf36.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link
Copy Markdown

SparkQA commented Aug 30, 2018

Test build #95444 has finished for PR 22209 at commit b164bb1.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@vanzin
Contributor

vanzin commented Sep 5, 2018

Merging to master.

@asfgit asfgit closed this in 39a02d8 Sep 5, 2018
@squito
Contributor

squito commented Sep 5, 2018

Any reason not to merge to 2.3? It's a bug in 2.3.

@vanzin
Contributor

vanzin commented Sep 5, 2018

I guess it should be ok. Just trying to be conservative and not inadvertently making the branch less stable in the middle of an rc cycle...

@vanzin
Contributor

vanzin commented Sep 5, 2018

(Feel free to do it if you think it should be there, btw.)

@tgravescs
Contributor

It would be nice to put this into 2.3 as well. I realize we are close to an RC, but I don't think we should stop backporting fixes, since I don't expect 2.4 to be stable for a while. If we do stop for this RC, we should have some way to track fixes to pull back in after the RC. Thoughts?

@tgravescs
Contributor

pulled into 2.3

asfgit pushed a commit that referenced this pull request Sep 7, 2018
[SPARK-24415][Core] Fixed the aggregated stage metrics by retaining stage objects in liveStages until all tasks are complete


Closes #22209 from ankuriitg/ankurgupta/SPARK-24415.

Authored-by: ankurgupta <ankur.gupta@cloudera.com>
Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>
(cherry picked from commit 39a02d8)
Signed-off-by: Thomas Graves <tgraves@apache.org>
otterc pushed a commit to linkedin/spark that referenced this pull request Mar 22, 2023
[SPARK-24415][Core] Fixed the aggregated stage metrics by retaining stage objects in liveStages until all tasks are complete


Closes apache#22209 from ankuriitg/ankurgupta/SPARK-24415.

Authored-by: ankurgupta <ankur.gupta@cloudera.com>
Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>
(cherry picked from commit 39a02d8)

Ref: LIHADOOP-41272

RB=1447258
BUG=LIHADOOP-41272
G=superfriends-reviewers
R=fli,mshen,yezhou,edlu
A=fli