[SPARK-26399][CORE] Add new stage-level REST APIs and parameters to get stage level executor peak metrics distribution #31001
…et stage level executor peak metrics distribution
FYI @ron8hu @gengliangwang @maropu @warrenzhu25 @dongjoon-hyun: since some of the logic is the same as in #29247, I reused that code and will add @warrenzhu25 as a co-author.

@AngersZhuuuu Many Spark users like to look at the executorMetricsDistribution information in the web UI as well, so it is a good idea to keep the feature's web UI and REST API consistent. You could display a "Metrics Distribution for Executors" table immediately below the "Summary metrics for completed tasks" table. For consistency, its default columns in the web UI could also be Min, 25th percentile, Median, 75th percentile, and Max.

Yeah, they should be consistent. Maybe we need a new ticket to add this to the web UI page.
def executorSummary(
    @PathParam("stageId") stageId: Int,
    @PathParam("stageAttemptId") stageAttemptId: Int,
    @DefaultValue("0.05,0.25,0.5,0.75,0.95")
Change the default value to 0.0,0.25,0.5,0.75,1.0. In a parallel system, the duration of a stage is often determined by the slowest task or executor. To monitor or debug a skew issue, the maximum (the 100th percentile) is more useful than the 95th percentile.
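To make the skew argument concrete, here is a minimal sketch. It assumes a simple nearest-rank lookup over sorted values, which is an illustration only and not necessarily how Spark computes its distributions. The object name, the helper names, and the executor numbers are all hypothetical.

```scala
object QuantileSkewSketch {
  // Nearest-rank style lookup on already-sorted values.
  // Assumption for illustration; not taken from this PR.
  def quantile(sorted: IndexedSeq[Double], q: Double): Double = {
    require(sorted.nonEmpty, "need at least one value")
    require(q >= 0.0 && q <= 1.0, s"quantile out of range: $q")
    // Clamp so that q = 1.0 maps to the last (maximum) element.
    val idx = math.min((q * sorted.length).toLong, sorted.length - 1L).toInt
    sorted(idx)
  }

  // Returns one value per requested quantile.
  def distribution(values: Seq[Double], quantiles: Seq[Double]): Seq[Double] = {
    val sorted = values.sorted.toIndexedSeq
    quantiles.map(q => quantile(sorted, q))
  }

  def main(args: Array[String]): Unit = {
    // Hypothetical peak memory (MB) for 40 executors: 39 balanced, 1 skewed.
    val peaks = Seq.fill(39)(100.0) :+ 900.0
    println(distribution(peaks, Seq(0.05, 0.25, 0.5, 0.75, 0.95)))
    println(distribution(peaks, Seq(0.0, 0.25, 0.5, 0.75, 1.0)))
  }
}
```

With a single skewed executor out of 40, the 0.95 quantile still reports a balanced value, while the 1.0 quantile surfaces the outlier.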
I used this default value because the existing taskSummary API uses it:
spark/core/src/main/scala/org/apache/spark/status/api/v1/StagesResource.scala
Lines 72 to 78 in f54d430
@GET
@Path("{stageId: \\d+}/{stageAttemptId: \\d+}/taskSummary")
def taskSummary(
    @PathParam("stageId") stageId: Int,
    @PathParam("stageAttemptId") stageAttemptId: Int,
    @DefaultValue("0.05,0.25,0.5,0.75,0.95") @QueryParam("quantiles") quantileString: String)
: TaskMetricDistributions = withUI { ui =>
Similar APIs need to stay consistent too.
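For readers unfamiliar with the `quantiles` parameter, the following is a rough sketch of how a comma-separated quantile string such as the default above might be parsed and validated. The object and method names are hypothetical, and returning an `Either` is a choice made for this example rather than a description of what the endpoint does on bad input.

```scala
import scala.util.Try

object QuantileParamSketch {
  // Default shared by the taskSummary endpoint quoted above.
  val DefaultQuantiles = "0.05,0.25,0.5,0.75,0.95"

  // Parses "0.05,0.25,..." into doubles in [0, 1].
  // Left carries a human-readable error; Right carries the parsed values.
  def parseQuantiles(quantileString: String): Either[String, Seq[Double]] = {
    val parts = quantileString.split(",").map(_.trim).toSeq
    val parsed = parts.map(s => s -> Try(s.toDouble).toOption)
    val notANumber = parsed.collectFirst {
      case (s, None) => s"not a double: '$s'"
    }
    val outOfRange = parsed.collectFirst {
      case (s, Some(q)) if q.isNaN || q < 0.0 || q > 1.0 => s"out of range [0, 1]: $s"
    }
    // First error wins; otherwise return every parsed value in order.
    notANumber.orElse(outOfRange).toLeft(parsed.flatMap(_._2))
  }
}
```

Both default sets discussed in this thread, `0.05,0.25,0.5,0.75,0.95` and `0.0,0.25,0.5,0.75,1.0`, pass this validation, so switching between them would only change the default string.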
I see. You want to keep it consistent.
Hmm, I found that the quantiles used in the UI and in the REST API are not the same.
Kubernetes integration test starting

Kubernetes integration test status success

Test build #133614 has finished for PR 31001 at commit

retest this please

Test build #133646 has finished for PR 31001 at commit
What changes were proposed in this pull request?
Add a REST API that lets users get the stage-level executor peak metrics distribution.
Example:
?quantiles=0.01,0.5,0.99
Why are the changes needed?
It helps Spark users debug and monitor bottlenecks in a stage.
Does this PR introduce any user-facing change?
Usage is as described in the first section.
/applications/[app-id]/stages/[stage-id]/[stage-attempt-id]/executorMetricsDistribution
Summary peak executor metrics of all executors in the given stage attempt.
?quantiles summarize the metrics with the given quantiles.
Example: ?quantiles=0.01,0.5,0.99
How was this patch tested?
Added UT
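As a usage illustration for the endpoint described under the user-facing change, here is a small sketch that builds a request URL for a stage-level endpoint with a `quantiles` query parameter. The object and function names are hypothetical. The base URL, application id, and stage ids in the test are placeholders. The final path segment is passed in as a parameter rather than hard-coded, since this description is the only source for the new endpoint's name.

```scala
import java.net.URI

object StageEndpointUrlSketch {
  // Builds <base>/applications/<app>/stages/<stage>/<attempt>/<endpoint>?quantiles=...
  // `endpoint` is the last path segment, for example the existing "taskSummary".
  def stageEndpoint(
      baseUrl: String,
      appId: String,
      stageId: Int,
      stageAttemptId: Int,
      endpoint: String,
      quantiles: Seq[Double]): URI = {
    // Omit the query string entirely so the server-side default applies.
    val query =
      if (quantiles.isEmpty) "" else quantiles.mkString("?quantiles=", ",", "")
    val base = baseUrl.stripSuffix("/")
    URI.create(s"$base/applications/$appId/stages/$stageId/$stageAttemptId/$endpoint$query")
  }
}
```

Passing an empty quantile list leaves the choice to the server's default, which is the value debated in the review comments above.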