Skip to content

[SPARK-41683][CORE] Fix issue of getting incorrect property numActiveStages in jobs API#39190

Closed
kuwii wants to merge 3 commits intoapache:masterfrom
kuwii:dev/numActiveStages
Closed

[SPARK-41683][CORE] Fix issue of getting incorrect property numActiveStages in jobs API#39190
kuwii wants to merge 3 commits intoapache:masterfrom
kuwii:dev/numActiveStages

Conversation

@kuwii
Copy link
Copy Markdown
Contributor

@kuwii kuwii commented Dec 23, 2022

What changes were proposed in this pull request?

  • Update onJobEnd method of AppStatusListener, removing the logic of reducing job.activeStages for each pending stage.
  • Add UT to verify whether numActiveStages of jobs data is correct.

Why are the changes needed?

For property activeStages of LiveJob, it is updated when:

  • A job is started: activeStages = 0
  • A stage is submitted: activeStages += 1
  • A stage is completed: activeStages -= 1
  • A job is ended: activeStages -= 1 for each pending stages

According to the implementation of AppStatusListener and LiveStage:

  • When a job is created, all of its stages in job info will be created with state set to pending without updating activeStages.
  • When a stage is submitted, its state will be immediately set to active with activeStages increased by 1.

So for pending stages, they won't affect activeStages. Therefore, when job is ended, activeStages shouldn't be decreased by 1 for each pending stage.

Here's an example:

  • Job 0 starts with stage 0, 1, 2
  • Stage 0 submitted
  • Stage 0 completed
  • Job 0 ends

In this case, when job 0 ends, its numActiveStages will be -2, which is obviously incorrect.

Does this PR introduce any user-facing change?

For jobs API, property activeStages will be different if a job has pending stages when it ends. In these cases, previously the number is incorrect. This PR fixes it.

How was this patch tested?

This PR adds a UT of the example mentioned above, to make sure numActiveStages should be 0 instead of -2.

@github-actions github-actions bot added the CORE label Dec 23, 2022
@kuwii kuwii changed the title [SPARK-41683][CORE][UI] Fix issue of getting incorrect property numActiveStages in jobs API [SPARK-41683][CORE] Fix issue of getting incorrect property numActiveStages in jobs API Dec 23, 2022
@kuwii
Copy link
Copy Markdown
Contributor Author

kuwii commented Dec 23, 2022

Related Change: #22209
Kindly ping @ankuriitg @vanzin

@AmplabJenkins
Copy link
Copy Markdown

Can one of the admins verify this patch?

@mridulm
Copy link
Copy Markdown
Contributor

mridulm commented Dec 27, 2022

+CC @thejdeep

@kuwii
Copy link
Copy Markdown
Contributor Author

kuwii commented Jan 4, 2023

Kindly ping @ankuriitg @vanzin @mridulm @thejdeep
Could you please help to take a look at this PR? Thanks.

@VindhyaG
Copy link
Copy Markdown
Contributor

Hi. this impacts Jobs API so this is a user facing change right?

@kuwii
Copy link
Copy Markdown
Contributor Author

kuwii commented Jan 12, 2023

Hi. this impacts Jobs API so this is a user facing change right?

@VindhyaG Thanks for the comment. I've updated the PR description.

@kuwii
Copy link
Copy Markdown
Contributor Author

kuwii commented Jan 19, 2023

Hi @srowen, could you please help to take a look at this PR? Thanks.

@srowen
Copy link
Copy Markdown
Member

srowen commented Jan 19, 2023

It makes sense to me. I don't know a lot about this code, so hesitate to review it. Does this only affect display metrics? I'm just wondering why it hadn't caused a problem before. Maybe it's always been a cosmetic issue, that only arises when a job is cancelled with pending stages or something?

@srowen
Copy link
Copy Markdown
Member

srowen commented Jan 19, 2023

Or maybe more to the point, do you have a concrete example of how this arises in Spark?

@kuwii
Copy link
Copy Markdown
Contributor Author

kuwii commented Jan 20, 2023

@srowen We found this issue in some of Spark applications. Here's the event log of an example, which can be loaded through history server:
application_1671519030791_0001_1.zip

In /api/v1/applications/application_1671519030791_0001/1/jobs, numActiveStages of job 3, 4, 5, 8 are less than 0.
image

@srowen
Copy link
Copy Markdown
Member

srowen commented Jan 20, 2023

Yeah but do you know how it happens, or have a theory? Just want to see if the change seems to match with some theory of how it arises. Or does this change definitely change the output above?

@kuwii
Copy link
Copy Markdown
Contributor Author

kuwii commented Jan 20, 2023

I'm not familiar with how Spark creates and runs jobs and stages for a query, but I think it may be related to this case. I can reproduce this locally using Spark on Yarn mode with this code:

from pyspark import SparkConf, SparkContext
from pyspark.sql import SQLContext
from pyspark.sql.functions import countDistinct, col, count, when
import time

conf = SparkConf().setAppName('test')
sc = SparkContext(conf = conf)
spark = SQLContext(sc).sparkSession

spark.range(1, 100).count()

The execution for count creates 2 jobs: job 0 with stage 0 and job 1 with stage 1, 2.

image

Because of some logic, stage 1 will always be skipped, not even submitted.

image

This is the case that is mentioned in the PR's description. And because the incorrect logic of updating numActiveStages, it will be -1 in jobs API. This PR can fix it.

image

@srowen
Copy link
Copy Markdown
Member

srowen commented Jan 20, 2023

FWIW, this part was last changed in https://issues.apache.org/jira/browse/SPARK-24415 to fix a different bug (CC @ankuriitg ) It might be worth re-running the simple example there to see if this retains the 'fix', but, evidently the tests added in that old change still pass here.

While I'm always wary of touching this core code and I myself don't know it well, this seems fairly convincing.

@kuwii
Copy link
Copy Markdown
Contributor Author

kuwii commented Jan 21, 2023

Tried the example code in the JIRA, and it is not affected by this change. Tasks showed in the stage are the same before and after this change.

image

Also, numActiveStages of that example is also -1. I think the reason we didn't notice it is because currently the property seems to be only available in jobs REST API, not web UI.

image

I've checked comments about these lines in that PR. Code here is for handling stages metrics when onStageCompleted event is dropped somehow. But as mentioned in this PR, I think the logic to reduce activeStages here is incorrect, which should be removed when handling onJobEnd event.

image

@srowen srowen closed this in e969bb2 Jan 21, 2023
@srowen
Copy link
Copy Markdown
Member

srowen commented Jan 21, 2023

Merged to master

@mridulm
Copy link
Copy Markdown
Contributor

mridulm commented Jan 22, 2023

Late LGTM.
Thanks for fixing this @kuwii !
Thanks for merging it @srowen :-)

@kuwii kuwii deleted the dev/numActiveStages branch January 25, 2023 09:51
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants