@AngersZhuuuu
Contributor

What changes were proposed in this pull request?

Add a REST API that lets users get the stage-level distribution of executor peak metrics.

  • /applications/<application_id>/stages/<stage_id>/<stage_attempt_id>/executorMetricsDistribution : distribution of the peak values of executor metrics across the executors of the stage, followed by the peak values of executor metrics for the stage as a whole
  • /applications/<application_id>/stages/<stage_id>/<stage_attempt_id>/executorMetricsDistribution?quantiles=0.25,0.5,0.75 : summarize the metrics with the given quantiles.
    Example: ?quantiles=0.01,0.5,0.99
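
For illustration, the endpoint paths above could be assembled with a small helper like the one below. This is only a sketch: `ExecutorMetricsDistributionUrl` and `build` are hypothetical names that are not part of this PR, and the API root passed in (for example `http://localhost:4040/api/v1` for a running application) is an assumption about the deployment.

```scala
object ExecutorMetricsDistributionUrl {
  // Hypothetical helper, not part of Spark: builds the endpoint URL
  // described in this PR. When no quantiles are given, the query string
  // is omitted and the server-side default applies.
  def build(
      apiRoot: String,
      appId: String,
      stageId: Int,
      stageAttemptId: Int,
      quantiles: Seq[Double] = Seq.empty): String = {
    val path =
      s"$apiRoot/applications/$appId/stages/$stageId/$stageAttemptId/executorMetricsDistribution"
    if (quantiles.isEmpty) path
    else path + "?quantiles=" + quantiles.mkString(",")
  }
}
```

Calling `build` with `Seq(0.01, 0.5, 0.99)` yields a URL ending in `executorMetricsDistribution?quantiles=0.01,0.5,0.99`, which matches the example above.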

Why are the changes needed?

It helps Spark users debug and monitor bottlenecks in a stage.

Does this PR introduce any user-facing change?

Yes. Usage is as described in the first section:

/applications/[app-id]/stages/[stage-id]/[stage-attempt-id]/executorMetricsDistribution : summary of the peak executor metrics of all executors in the given stage attempt.
?quantiles : summarize the metrics with the given quantiles.
Example: ?quantiles=0.01,0.5,0.99

How was this patch tested?

Added unit tests.

…et stage level executor peak metrics distribution
@AngersZhuuuu
Contributor Author

FYI @ron8hu @gengliangwang @maropu @warrenzhu25 @dongjoon-hyun

Since some of the logic is the same as in #29247, I reused that code and will add @warrenzhu25 as a co-author.

@ron8hu
Contributor

ron8hu commented Jan 4, 2021

@AngersZhuuuu Many Spark users like to look at the executorMetricsDistribution information in the web UI as well, so it is a good idea to keep the feature's web UI and REST API consistent. You could display a "Metrics Distribution for Executors" table immediately below the "Summary metrics for completed tasks" table. For consistency, its default values in the web UI could also be Min, 25th percentile, Median, 75th percentile, and Max.

@AngersZhuuuu
Contributor Author

Yes, they should be consistent. Maybe we need a new ticket to add this to the web UI page.

def executorSummary(
@PathParam("stageId") stageId: Int,
@PathParam("stageAttemptId") stageAttemptId: Int,
@DefaultValue("0.05,0.25,0.5,0.75,0.95")
Contributor

Change the default value to 0.0,0.25,0.5,0.75,1.0. In a parallel system, the duration of a stage is often determined by the slowest task or executor. To monitor or debug a skew issue, the maximum (the 100th percentile) is more useful than the 95th percentile.
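
To make the skew argument concrete, here is a minimal sketch with made-up numbers. `QuantileSketch` is a hypothetical helper that uses a simple index-based quantile over sorted values; it is an assumption for illustration and not necessarily how Spark computes these summaries.

```scala
object QuantileSketch {
  // Hypothetical helper: picks the value at index floor(q * n) in the
  // sorted input, clamped to the last element so that q = 1.0 is the max.
  def quantiles(values: Seq[Double], qs: Seq[Double]): Seq[Double] = {
    require(values.nonEmpty, "need at least one value")
    val sorted = values.sorted.toIndexedSeq
    qs.map { q =>
      val idx = math.min((q * sorted.length).toLong, sorted.length - 1L).toInt
      sorted(idx)
    }
  }
}

object SkewExample {
  // Made-up data: 99 executors peak at 100.0 and one skewed executor at 900.0.
  val peaks: Seq[Double] = Seq.fill(99)(100.0) :+ 900.0

  // The 0.95 quantile lands on an ordinary executor, while 1.0 exposes
  // the skewed one.
  val summary: Seq[Double] = QuantileSketch.quantiles(peaks, Seq(0.95, 1.0))
}
```

With this data, the single outlier is invisible at the 0.95 quantile and only shows up when 1.0 is requested.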

Contributor Author

I used this default because the existing taskSummary endpoint uses the same one:

@GET
@Path("{stageId: \\d+}/{stageAttemptId: \\d+}/taskSummary")
def taskSummary(
@PathParam("stageId") stageId: Int,
@PathParam("stageAttemptId") stageAttemptId: Int,
@DefaultValue("0.05,0.25,0.5,0.75,0.95") @QueryParam("quantiles") quantileString: String)
: TaskMetricDistributions = withUI { ui =>

Similar APIs need to stay consistent too.
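
As a rough sketch of what either endpoint has to do with the `quantiles` query string, the helper below splits it and validates each entry. `QuantileParam.parse` is a hypothetical name and the `Either`-based error handling is an assumption for illustration; the real endpoints may report bad parameters differently.

```scala
import scala.util.Try

object QuantileParam {
  // Hypothetical helper: parses a comma-separated quantile string such as
  // "0.05,0.25,0.5,0.75,0.95" and rejects entries that are not numbers
  // in the range [0, 1].
  def parse(quantileString: String): Either[String, Seq[Double]] = {
    val parts = quantileString.split(",").map(_.trim).toSeq
    val parsed: Seq[Either[String, Double]] = parts.map { s =>
      Try(s.toDouble).toOption
        .filter(q => q >= 0.0 && q <= 1.0)
        .toRight(s)
    }
    parsed.collectFirst { case Left(bad) => s"invalid quantile: '$bad'" } match {
      case Some(err) => Left(err)
      case None => Right(parsed.collect { case Right(q) => q })
    }
  }
}
```

Because both `taskSummary` and the new endpoint accept the same string format, sharing one parsing path like this would be one way to keep their behavior aligned.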

Contributor

I see. You want to keep it consistent.

Contributor Author

Hmm, I found that the quantiles used in the UI and in the REST API are not the same.

@SparkQA

SparkQA commented Jan 5, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/38203/

@SparkQA

SparkQA commented Jan 5, 2021

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/38203/

@SparkQA

SparkQA commented Jan 5, 2021

Test build #133614 has finished for PR 31001 at commit f54d430.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AngersZhuuuu
Contributor Author

retest this please

@SparkQA

SparkQA commented Jan 5, 2021

Test build #133646 has finished for PR 31001 at commit f54d430.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.
