
[SPARK-24415][Core] Fixed the aggregated stage metrics by retaining stage objects in liveStages until all tasks are complete#22209

Closed
ankuriitg wants to merge 3 commits into apache:master from ankuriitg:ankurgupta/SPARK-24415

Conversation

@ankuriitg
Contributor

@ankuriitg ankuriitg commented Aug 23, 2018

What changes were proposed in this pull request?

The problem occurs because the stage object is removed from liveStages in
AppStatusListener.onStageCompletion. Because of this, any onTaskEnd event
received after the onStageCompletion event does not update the stage metrics.

The fix is to retain stage objects in liveStages until all tasks are complete.
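The mechanics of the fix can be sketched as a small standalone model (hypothetical names, not the actual AppStatusListener code): a stage's entry survives its completion event until the count of outstanding tasks drops to zero, so a late onTaskEnd still updates the aggregated metrics.

```scala
import scala.collection.mutable

case class StageId(id: Int, attempt: Int)

class MiniListener {
  // stage -> number of tasks started but not yet finished
  val liveStages = mutable.HashMap[StageId, Int]()
  // stage -> aggregated task run time
  val metrics = mutable.HashMap[StageId, Long]()
  private val completedStages = mutable.HashSet[StageId]()

  def onTaskStart(stage: StageId): Unit =
    liveStages(stage) = liveStages.getOrElse(stage, 0) + 1

  def onTaskEnd(stage: StageId, runTime: Long): Unit = {
    // A task-end event may arrive after onStageCompleted (e.g. on a
    // stage failure); the stage entry must still be live so its
    // aggregated metrics get updated.
    metrics(stage) = metrics.getOrElse(stage, 0L) + runTime
    liveStages(stage) = liveStages.getOrElse(stage, 1) - 1
    maybeRemove(stage)
  }

  def onStageCompleted(stage: StageId): Unit = {
    completedStages += stage
    maybeRemove(stage)
  }

  // Only drop the stage once it is complete AND no tasks are pending.
  private def maybeRemove(stage: StageId): Unit =
    if (completedStages(stage) && liveStages.getOrElse(stage, 0) <= 0) {
      liveStages.remove(stage)
    }
}
```

With this shape, neither event ordering (task end before or after stage completion) loses metrics, which is the behavior the unit test below exercises.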

How was this patch tested?

  1. Fixed the reproducible example posted in the JIRA
  2. Added unit test

@ankuriitg
Contributor Author

@vanzin @squito

@vanzin
Contributor

vanzin commented Aug 23, 2018

PR title should describe the fix, not the problem. The problem is already described on the bug.

@vanzin
Contributor

vanzin commented Aug 23, 2018

ok to test

@tgravescs
Contributor

Just a note: the JIRA I filed for this was actually affecting running jobs as well (https://issues.apache.org/jira/browse/SPARK-24415).

I think we have seen a similar issue on the history server as well, though, so perhaps it's related. I haven't looked at the code here yet.

@ankuriitg ankuriitg changed the title [SPARK-24415][Core] HistoryServer does not display metrics from tasks… [SPARK-24415][Core] Fixed the issue where stage page aggregated executor metrics are incorrect on stage failure Aug 23, 2018
@ankuriitg
Contributor Author

Thanks for your comment @tgravescs. This fix is indeed for the SparkUI (running jobs) issue. It does not fix it for history server yet.

@ankuriitg
Contributor Author

Just checked: this fixes the Spark History Server issue (SPARK-24539) as well. I just needed to restart the SHS in order for my changes to take effect.

Contributor

removal from iterator should always happen

Contributor Author

For already completed stages, we will leave the removal of stage to happen in either onTaskEnd or onStageCompleted event. This ensures that stage metrics are updated even when onJobEnd event is received before onTaskEnd event.

Contributor

I am not sure I follow - if that is the case, why are we doing this for active stages here ? onStageCompleted/onTaskEnd would be fired for active stages as well.

Contributor

Btw, there is an existing bug that we are not updating pool, etc which we do in onStageCompleted ...

Contributor Author

I think the assumption here is that we will always receive the onStageCompleted event before the onJobEnd event. If that does not occur for some reason, then any active stages are marked as skipped.
I don't know of a scenario where the onStageCompleted event is not received before the onJobEnd event (or not received at all). Let me look further into it. Additionally, I will also fix the bug for updating the pool.

Contributor

This can happen when events get dropped ...
Spark makes a best-effort attempt to deliver events in order, but when events get dropped, the UI becomes inconsistent. I assume this might be an effort to recover in that case? Any thoughts @tgravescs, @vanzin ?

Contributor

In your question, this == this PR?

If so, no, that's not what it's fixing. Task end events can "naturally" arrive after the stage end event in the case of a stage failure, and this code was missing that case.

When event drops occur, a lot of things get out of sync, and this change wouldn't fix that. It perhaps could make it a little worse: if a task end event does not arrive, then maybe with this change the stage will never be actually removed from the live stages map. Not sure how easy it would be to recover from that though, since dropped events could probably cause other sorts of leaks in this class too, but I also feel that's a separate issue.

(Also, hopefully, dropped events for this listener should be less common in 2.3 after the listener bus changes.)

Contributor

To clarify, I was referring to 'this' being the job end event received before the stage end (for a stage which is part of a job).

I was not referring to task end events (those can come in after stage or job ends).

Thanks for clarifying @vanzin ... given the snippet is not trying to recover from event drops, I am wondering why "non"-skipped stages would even be in the list: I would expect all of them to be skipped?

Contributor

They would be in the list if the task end event arrives late, right? (Haven't really re-read the code to be sure.)

Unless it's guaranteed that the task end event will arrive before the job end event, unlike the case for the stage end one.

Contributor Author

This code now only handles the scenario where the onStageCompleted event is dropped (not received). If we don't want to handle that scenario, then we can remove this part of the code altogether.

@SparkQA

SparkQA commented Aug 23, 2018

Test build #95183 has finished for PR 22209 at commit cd1e856.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@ankuriitg ankuriitg force-pushed the ankurgupta/SPARK-24415 branch from cd1e856 to 55637c3 Compare August 24, 2018 16:42
@vanzin
Contributor

vanzin commented Aug 24, 2018

Sorry, but this is my biggest pet peeve. Your PR title still only explains the problem. "Fix problem X" does not describe the fix.

Contributor

@vanzin vanzin left a comment

Looks good, just some minor things.

Contributor

"update metrics for tasks that finish late". Same thing, much shorter. Could also add the bug number to the test name.

Contributor

This is the same comment as above; either remove this one or update each comment to say where the check is occurring. (I prefer the former.)

Contributor

last = removeStage.

@ankuriitg ankuriitg changed the title [SPARK-24415][Core] Fixed the issue where stage page aggregated executor metrics are incorrect on stage failure [SPARK-24415][Core] Fixed the aggregated stage metrics by retaining stage objects in liveStages until all tasks are complete Aug 24, 2018
@ankuriitg ankuriitg force-pushed the ankurgupta/SPARK-24415 branch from 55637c3 to 69497a2 Compare August 24, 2018 18:07
@SparkQA

SparkQA commented Aug 24, 2018

Test build #95218 has finished for PR 22209 at commit 55637c3.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@ankuriitg ankuriitg force-pushed the ankurgupta/SPARK-24415 branch from 69497a2 to 70678dc Compare August 24, 2018 21:07
@SparkQA

SparkQA commented Aug 24, 2018

Test build #95223 has finished for PR 22209 at commit 69497a2.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@vanzin
Contributor

vanzin commented Aug 24, 2018

retest this please

@SparkQA

SparkQA commented Aug 25, 2018

Test build #95229 has finished for PR 22209 at commit 70678dc.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Aug 25, 2018

Test build #95233 has finished for PR 22209 at commit 70678dc.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@ankuriitg ankuriitg force-pushed the ankurgupta/SPARK-24415 branch from 70678dc to 0552af0 Compare August 27, 2018 18:38
Contributor

So I went back and took a closer look and I think this isn't entirely correct (and wasn't entirely correct before either).

If I remember the semantics correctly, the stage should be skipped if it is part of the job's stages, and is in the pending state when the job finishes.

If it's in the active state, it should not be marked as skipped. If you do that, the update to the skipped tasks (in L358) will most certainly be wrong.

So if the state is still active here, it means some event was missed. The best we can do in that case, I think, is remove it from the live stages list and update the pool data, and that's it.

On a related note, if the "onStageSubmitted" event is missed, the stage will remain in the "pending" state even if tasks start on it. Perhaps that could also be added to the "onTaskStart" handler, just to be sure the stage is marked as active.

Contributor Author

I changed the commit to mark a stage as skipped only if it is in the Pending state.

I did not update "onTaskStart" to handle a missed "onStageSubmitted" event, as that can be part of a different review, if required.

Contributor

I think that there are still two things here:

  • according to code in DAGScheduler#handleTaskCompletion, it seems like it's possible for a job to be marked finished before all task end events arrive. The job is marked finished as soon as all partitions are computed, so if you have e.g. speculative tasks you may miss things by removing the stage here.

  • re-reading the code again, the pool only really needs to be updated in the pending case, since other cases are handled in onStageCompleted already.

This will cause the leak that I mentioned before, though, where missing events make things remain in the live lists forever. But that's already a problem with this code, and we should look at all of those as a separate task.

So I think it's more correct to only do anything for pending stages here.

Contributor Author

Done. Updating the stage in onJobEnd event only if it is in Pending status.
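The onJobEnd behavior settled on above (touch only stages still in the Pending state, count them as skipped, and leave active stages for their own completion events) might look roughly like this simplified sketch; the names here are illustrative, not the real AppStatusListener code.

```scala
import scala.collection.mutable

object StageStatus extends Enumeration {
  val Pending, Active = Value
}

class JobEndHandler {
  val liveStages = mutable.HashMap[Int, StageStatus.Value]()
  var skippedStages = 0

  // When a job ends, only stages still PENDING are marked skipped and
  // dropped; ACTIVE stages are left alone, since their own
  // onStageCompleted / late onTaskEnd events will remove them
  // (that removal is the fix in this PR).
  def onJobEnd(stageIds: Seq[Int]): Unit =
    stageIds.foreach { id =>
      if (liveStages.get(id).contains(StageStatus.Pending)) {
        skippedStages += 1
        liveStages.remove(id)
      }
    }
}
```

This avoids the problem vanzin pointed out: marking an active stage as skipped would corrupt the skipped-task counts, so active stages are simply left in the live set.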

Contributor

Since you're touching this: .foreach { _ =>
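For readers unfamiliar with the idiom being suggested: `Option.foreach { _ => ... }` runs a side effect only when the option is defined, ignoring the wrapped value. A minimal standalone illustration (not the PR's actual code):

```scala
var jobCleaned = false

// Defined option: the block runs once, the wrapped value is ignored.
val maybeJob: Option[Int] = Some(42)
maybeJob.foreach { _ =>
  jobCleaned = true
}

// Empty option: the block never runs.
val missing: Option[Int] = None
missing.foreach { _ =>
  sys.error("never reached")
}
```

This is generally preferred over `if (opt.isDefined) { ... }` or a `match` when only the presence of the value matters.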

@SparkQA

SparkQA commented Aug 27, 2018

Test build #95305 has finished for PR 22209 at commit 0552af0.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

[SPARK-24415][Core] Fixed the aggregated stage metrics by retaining stage objects in liveStages until all tasks are complete

The problem occurs because the stage object is removed from liveStages in
AppStatusListener.onStageCompletion. Because of this, any onTaskEnd event
received after the onStageCompletion event does not update the stage metrics.

The fix is to retain stage objects in liveStages until all tasks are complete.

Testing Done:
1. Fixed the reproducible example posted in the JIRA
2. Added unit test
3. Fixed the flaky test in UISeleniumSuite
@ankuriitg ankuriitg force-pushed the ankurgupta/SPARK-24415 branch from 0552af0 to 76f1801 Compare August 28, 2018 15:50
@SparkQA

SparkQA commented Aug 28, 2018

Test build #95358 has finished for PR 22209 at commit 76f1801.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@bersprockets
Contributor

Looks like the test failed due to https://issues.apache.org/jira/browse/SPARK-23622

@bersprockets
Contributor

retest this please

@SparkQA

SparkQA commented Aug 29, 2018

Test build #95389 has finished for PR 22209 at commit dccbf36.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Aug 29, 2018

Test build #95382 has finished for PR 22209 at commit 76f1801.

  • This patch fails from timeout after a configured wait of `400m`.
  • This patch merges cleanly.
  • This patch adds no public classes.

@bersprockets
Contributor

retest this please

Contributor

@vanzin vanzin left a comment

Found a small thing, plus a couple of nits.

@SparkQA

SparkQA commented Aug 29, 2018

Test build #95431 has finished for PR 22209 at commit dccbf36.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link
Copy Markdown

SparkQA commented Aug 30, 2018

Test build #95444 has finished for PR 22209 at commit b164bb1.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@vanzin
Contributor

vanzin commented Sep 5, 2018

Merging to master.

@asfgit asfgit closed this in 39a02d8 Sep 5, 2018
@squito
Contributor

squito commented Sep 5, 2018

Any reason not to merge to 2.3? It's a bug in 2.3.

@vanzin
Contributor

vanzin commented Sep 5, 2018

I guess it should be ok. Just trying to be conservative and not inadvertently making the branch less stable in the middle of an rc cycle...

@vanzin
Contributor

vanzin commented Sep 5, 2018

(Feel free to do it if you think it should be there, btw.)

@tgravescs
Contributor

It would be nice to put this into 2.3 as well. I realize we are close to an RC, but I don't think we should stop backporting fixes, since I don't expect 2.4 to be stable for a while. If we do stop for this RC, we should have some way to track fixes to pull back in after the RC. Thoughts?

@tgravescs
Contributor

pulled into 2.3

asfgit pushed a commit that referenced this pull request Sep 7, 2018
[SPARK-24415][Core] Fixed the aggregated stage metrics by retaining stage objects in liveStages until all tasks are complete


Closes #22209 from ankuriitg/ankurgupta/SPARK-24415.

Authored-by: ankurgupta <ankur.gupta@cloudera.com>
Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>
(cherry picked from commit 39a02d8)
Signed-off-by: Thomas Graves <tgraves@apache.org>
otterc pushed a commit to linkedin/spark that referenced this pull request Mar 22, 2023
[SPARK-24415][Core] Fixed the aggregated stage metrics by retaining stage objects in liveStages until all tasks are complete


Closes apache#22209 from ankuriitg/ankurgupta/SPARK-24415.

Authored-by: ankurgupta <ankur.gupta@cloudera.com>
Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>
(cherry picked from commit 39a02d8)

Ref: LIHADOOP-41272

RB=1447258
BUG=LIHADOOP-41272
G=superfriends-reviewers
R=fli,mshen,yezhou,edlu
A=fli