[SPARK-41683][CORE] Fix issue of getting incorrect property numActiveStages in jobs API by kuwii · Pull Request #39190 · apache/spark

kuwii · 2022-12-23T06:00:57Z

What changes were proposed in this pull request?

Update the onJobEnd method of AppStatusListener to remove the logic that decrements job.activeStages for each pending stage.
Add a unit test to verify that numActiveStages in the jobs data is correct.

Why are the changes needed?

The activeStages property of LiveJob is updated when:

A job is started: activeStages = 0
A stage is submitted: activeStages += 1
A stage is completed: activeStages -= 1
A job is ended: activeStages -= 1 for each pending stage

According to the implementation of AppStatusListener and LiveStage:

When a job is created, all of its stages in job info will be created with state set to pending without updating activeStages.
When a stage is submitted, its state will be immediately set to active with activeStages increased by 1.

Pending stages therefore never contribute to activeStages, so when a job ends, activeStages shouldn't be decreased by 1 for each pending stage.

Here's an example:

Job 0 starts with stages 0, 1, 2
Stage 0 submitted
Stage 0 completed
Job 0 ends

In this case, when job 0 ends, its numActiveStages will be -2, which is obviously incorrect.
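
The sequence above can be replayed with a small standalone model of the counter. This is only a sketch in plain Python: the function `replay`, its flag, and the event tuples are hypothetical and do not mirror the actual Scala code in AppStatusListener. They only encode the four update rules listed earlier, with a switch for the job-end decrement that this PR removes.

```python
def replay(events, decrement_pending_on_job_end):
    """Replay events for one job and return its final activeStages count.

    Hypothetical model of the counter described in this PR, not Spark code.
    """
    active_stages = 0
    stage_state = {}
    for event in events:
        kind = event[0]
        if kind == "job_start":
            # All stages of the job start as pending; the counter is untouched.
            active_stages = 0
            stage_state = {stage_id: "pending" for stage_id in event[1]}
        elif kind == "stage_submitted":
            stage_state[event[1]] = "active"
            active_stages += 1
        elif kind == "stage_completed":
            stage_state[event[1]] = "complete"
            active_stages -= 1
        elif kind == "job_end" and decrement_pending_on_job_end:
            # Old behaviour: one decrement per stage that is still pending.
            pending = sum(1 for s in stage_state.values() if s == "pending")
            active_stages -= pending
    return active_stages


events = [
    ("job_start", [0, 1, 2]),
    ("stage_submitted", 0),
    ("stage_completed", 0),
    ("job_end",),
]

old_result = replay(events, decrement_pending_on_job_end=True)
new_result = replay(events, decrement_pending_on_job_end=False)
print(old_result, new_result)  # -2 0
```

With the decrement enabled, stages 1 and 2 are still pending at job end and each subtracts 1 from a counter they never incremented, which gives -2. With it disabled, the counter stays at 0.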

Does this PR introduce any user-facing change?

In the jobs API, the numActiveStages property will be different if a job has pending stages when it ends. In these cases the number was previously incorrect. This PR fixes it.

How was this patch tested?

This PR adds a unit test for the example mentioned above, to make sure numActiveStages is 0 instead of -2.

kuwii · 2022-12-23T06:14:28Z

Related Change: #22209
Kindly ping @ankuriitg @vanzin

AmplabJenkins · 2022-12-24T17:30:01Z

Can one of the admins verify this patch?

mridulm · 2022-12-27T18:30:59Z

+CC @thejdeep

kuwii · 2023-01-04T06:33:05Z

Kindly ping @ankuriitg @vanzin @mridulm @thejdeep
Could you please help to take a look at this PR? Thanks.

VindhyaG · 2023-01-11T12:56:00Z

Hi, this impacts the jobs API, so this is a user-facing change, right?

kuwii · 2023-01-12T06:22:29Z

> Hi, this impacts the jobs API, so this is a user-facing change, right?

@VindhyaG Thanks for the comment. I've updated the PR description.

kuwii · 2023-01-19T05:25:47Z

Hi @srowen, could you please help to take a look at this PR? Thanks.

srowen · 2023-01-19T14:26:02Z

It makes sense to me. I don't know a lot about this code, so hesitate to review it. Does this only affect display metrics? I'm just wondering why it hadn't caused a problem before. Maybe it's always been a cosmetic issue, that only arises when a job is cancelled with pending stages or something?

srowen · 2023-01-19T14:26:31Z

Or maybe more to the point, do you have a concrete example of how this arises in Spark?

kuwii · 2023-01-20T11:41:43Z

@srowen We found this issue in some of our Spark applications. Here's the event log of an example, which can be loaded through the history server:
application_1671519030791_0001_1.zip

In /api/v1/applications/application_1671519030791_0001/1/jobs, numActiveStages of jobs 3, 4, 5, and 8 is less than 0.
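
A quick way to spot affected jobs is to scan the jobs endpoint for negative counts. The sketch below is an illustration, not part of this PR: the helper names are hypothetical, the base URL has to be supplied by the caller, and the sample list is made-up data in the shape of the API response rather than real output. `jobId` and `numActiveStages` are the fields read from each job object.

```python
import json
from urllib.request import urlopen


def jobs_with_negative_active_stages(jobs):
    """Return the sorted ids of jobs whose numActiveStages is below zero."""
    return sorted(job["jobId"] for job in jobs if job.get("numActiveStages", 0) < 0)


def fetch_jobs(base_url, app_id, attempt_id=None):
    """Fetch the job list for an application (hypothetical helper)."""
    path = f"{base_url}/api/v1/applications/{app_id}"
    if attempt_id is not None:
        path += f"/{attempt_id}"
    with urlopen(f"{path}/jobs") as response:
        return json.load(response)


# Illustrative payload only; a real call would be something like
# fetch_jobs("http://localhost:18080", "application_1671519030791_0001", "1").
sample_jobs = [
    {"jobId": 0, "numActiveStages": 0},
    {"jobId": 1, "numActiveStages": -1},
]
affected = jobs_with_negative_active_stages(sample_jobs)
print(affected)  # [1]
```

Keeping the filter separate from the HTTP call means the check can be run against a saved JSON response as well as a live server.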

srowen · 2023-01-20T13:30:11Z

Yeah but do you know how it happens, or have a theory? Just want to see if the change seems to match with some theory of how it arises. Or does this change definitely change the output above?

kuwii · 2023-01-20T15:44:49Z

I'm not familiar with how Spark creates and runs jobs and stages for a query, but I think it is related to the case described in this PR. I can reproduce it locally with Spark on YARN using this code:

from pyspark import SparkConf, SparkContext
from pyspark.sql import SQLContext

# Create the session through SQLContext, as in the original reproduction.
conf = SparkConf().setAppName('test')
sc = SparkContext(conf=conf)
spark = SQLContext(sc).sparkSession

# A single count() is enough to produce the negative numActiveStages.
spark.range(1, 100).count()

Executing count creates 2 jobs: job 0 with stage 0, and job 1 with stages 1 and 2.

For some reason, stage 1 is always skipped; it is never even submitted.

This is the case described in the PR description. Because of the incorrect logic for updating numActiveStages, it ends up as -1 in the jobs API. This PR fixes it.

srowen · 2023-01-20T16:54:34Z

FWIW, this part was last changed in https://issues.apache.org/jira/browse/SPARK-24415 to fix a different bug (CC @ankuriitg). It might be worth re-running the simple example there to see if this retains the 'fix', but evidently the tests added in that old change still pass here.

While I'm always wary of touching this core code and I myself don't know it well, this seems fairly convincing.

kuwii · 2023-01-21T09:36:04Z

I tried the example code in the JIRA, and it is not affected by this change. The tasks shown in the stage are the same before and after this change.

numActiveStages in that example is also -1. I think the reason we didn't notice it is that the property currently seems to be available only in the jobs REST API, not in the web UI.

I've checked the comments about these lines in that PR. The code here handles stage metrics when the onStageCompleted event is dropped somehow. But as mentioned in this PR, I think the logic that reduces activeStages when handling the onJobEnd event is incorrect and should be removed.

srowen · 2023-01-21T15:28:08Z

Merged to master

mridulm · 2023-01-22T03:54:30Z

Late LGTM.
Thanks for fixing this @kuwii !
Thanks for merging it @srowen :-)

kuwii added 3 commits December 23, 2022 10:17

do not reduce job.activeStages

b478e43

add ut

f2ff451

fix scalastyle

c81ceb2

github-actions bot added the CORE label Dec 23, 2022

kuwii changed the title ~~[SPARK-41683][CORE][UI] Fix issue of getting incorrect property numActiveStages in jobs API~~ [SPARK-41683][CORE] Fix issue of getting incorrect property numActiveStages in jobs API Dec 23, 2022

srowen closed this in e969bb2 Jan 21, 2023

kuwii deleted the dev/numActiveStages branch January 25, 2023 09:51

5 participants
