Skip to content

Conversation

@edwinalu
Copy link
Contributor

What changes were proposed in this pull request?

Collect and show the new executor memory metrics for each stage, to provide more information on how memory is used per stage. Peak values for metrics are show for each stage. For executor summaries for each stage, the peak values per executor are also shown.

How was this patch tested?

Added new unit tests.

Please review http://spark.apache.org/contributing.html before opening a pull request.

@felixcheung
Copy link
Member

ok to test

@SparkQA
Copy link

SparkQA commented Dec 18, 2018

Test build #100275 has finished for PR 23340 at commit d77612c.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@squito
Copy link
Contributor

squito commented Dec 18, 2018

Jenkins, retest this please

Copy link
Contributor

@squito squito left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you update the PR description to make it clear this is only updating the REST endpoints for now? As you refer to 'showing' the metrics, it sounds like they're visible in the UI.

val activeStages = liveStages.values().asScala
.filter(_.status == v1.StageStatus.ACTIVE)
activeStages.foreach { stage =>
stage.peakExecutorMetrics.compareAndUpdatePeakValues(updates)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do you need a maybeUpdate(stage, now) after this?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, thanks for catching this.

Copy link
Contributor

@rezasafi rezasafi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you very much for the pr.


Option(liveStages.get((executorMetrics.stageId, executorMetrics.stageAttemptId)))
.foreach { stage =>
stage.peakExecutorMetrics.compareAndUpdatePeakValues(executorMetrics.executorMetrics)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: Is it possible to change the name of the parameter or the field member? executorMetrics.executorMetrics feels kind of weird

Copy link
Contributor Author

@edwinalu edwinalu Dec 18, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about "event" for the parameter name -- this would be more inline with the other methods.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah, sounds good to me.

var metrics = createMetrics(default = 0L)

// peak values for executor level metrics
var peakExecutorMetrics = new ExecutorMetrics()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can this also be a val like the one in line 387?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, this should be val.

@SparkQA
Copy link

SparkQA commented Dec 18, 2018

Test build #100289 has finished for PR 23340 at commit d77612c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Dec 19, 2018

Test build #100294 has finished for PR 23340 at commit 42599fe.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

}
// check if there is a new peak value for any of the executor level memory metrics,
// while reading from the log. SparkListenerStageExecutorMetrics are only processed
// when reading logs.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I need a little refresher here -- does this comment also apply to the new code you're adding above?

and why again does this only apply for when we're reading from the event logs (maybe this comment should be updated to point to whatever is happening for a live app)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, comments are still applicable, although for stage level. I'll move the comments around, and and add a pointer to onExecutorMetricsUpdate for the live app case.

@SparkQA
Copy link

SparkQA commented Dec 20, 2018

Test build #100324 has finished for PR 23340 at commit 3dd8424.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Apr 23, 2019

Test build #104844 has finished for PR 23340 at commit 6d6f64c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@edwinalu
Copy link
Contributor Author

@squito and @rezasafi , could you please take another look? Thanks for your review and comments.

@github-actions
Copy link

github-actions bot commented Jan 3, 2020

We're closing this PR because it hasn't been updated in a while.
This isn't a judgement on the merit of the PR in any way. It's just
a way of keeping the PR queue manageable.

If you'd like to revive this PR, please reopen it!

@github-actions github-actions bot added the Stale label Jan 3, 2020
@github-actions github-actions bot closed this Jan 4, 2020
@itamarst
Copy link

Ping—any chance someone could review this patch? These are useful metrics.

@stackedsax
Copy link

@squito @rezasafi any chance either of you could review this patch?

@dongjoon-hyun
Copy link
Member

This is reopened according to the community request. I also removed the label, Stale.

@AmplabJenkins
Copy link

Can one of the admins verify this patch?

@stackedsax
Copy link

@edwinalu thanks for getting this started so long ago. Looks like there are some conflicts after so much time has passed. Do you have time/interest to continue working on this or would you like some help getting this sorted out?

@edwinalu
Copy link
Contributor Author

edwinalu commented Jul 1, 2020

@stackedsax , I don't have time to continue working on this unfortunately. If you or someone else is able to help with adding stage level metrics, that would be great. Please let me know if there's any questions about the original PR.

@imback82
Copy link
Contributor

imback82 commented Jul 2, 2020

@stackedsax, do you plan to work on this? Otherwise, I can start looking into this next week. Thanks!

@itamarst
Copy link

itamarst commented Jul 2, 2020

@imback82 I'll be taking this up if you don't (on behalf of @stackedsax), but since I am not a Spark contributor at the moment it'd probably be a lot faster and easier if you did it. So please do look into it—thank you!

@imback82
Copy link
Contributor

imback82 commented Jul 3, 2020

Thanks @itamarst. I started looking into this and will update this thread.

@imback82
Copy link
Contributor

imback82 commented Jul 7, 2020

I created a PR here: #29020. I will also ask a reviewing committer to put @edwinalu as a co-author if that's the practice when a PR is "inherited". Thanks!

gengliangwang pushed a commit that referenced this pull request Aug 4, 2020
… API

### What changes were proposed in this pull request?

Note that this PR is forked from #23340 originally written by edwinalu.

This PR proposes to expose the peak executor metrics at the stage level via the REST APIs:
* `/applications/<application_id>/stages/`: peak values of executor metrics for **each stage**
* `/applications/<application_id>/stages/<stage_id>/< stage_attempt_id >`: peak values of executor metrics for **each executor** for the stage, followed by peak values of executor metrics for the stage

### Why are the changes needed?

The stage level peak executor metrics can help better understand your application's resource utilization.

### Does this PR introduce _any_ user-facing change?

1. For the `/applications/<application_id>/stages/` API, you will see the following new info for **each stage**:
```JSON
  "peakExecutorMetrics" : {
    "JVMHeapMemory" : 213367864,
    "JVMOffHeapMemory" : 189011656,
    "OnHeapExecutionMemory" : 0,
    "OffHeapExecutionMemory" : 0,
    "OnHeapStorageMemory" : 2133349,
    "OffHeapStorageMemory" : 0,
    "OnHeapUnifiedMemory" : 2133349,
    "OffHeapUnifiedMemory" : 0,
    "DirectPoolMemory" : 282024,
    "MappedPoolMemory" : 0,
    "ProcessTreeJVMVMemory" : 0,
    "ProcessTreeJVMRSSMemory" : 0,
    "ProcessTreePythonVMemory" : 0,
    "ProcessTreePythonRSSMemory" : 0,
    "ProcessTreeOtherVMemory" : 0,
    "ProcessTreeOtherRSSMemory" : 0,
    "MinorGCCount" : 13,
    "MinorGCTime" : 115,
    "MajorGCCount" : 4,
    "MajorGCTime" : 339
  }
```

2. For the `/applications/<application_id>/stages/<stage_id>/<stage_attempt_id>` API, you will see the following new info for **each executor** under `executorSummary`:
```JSON
  "peakMemoryMetrics" : {
    "JVMHeapMemory" : 0,
    "JVMOffHeapMemory" : 0,
    "OnHeapExecutionMemory" : 0,
    "OffHeapExecutionMemory" : 0,
    "OnHeapStorageMemory" : 0,
    "OffHeapStorageMemory" : 0,
    "OnHeapUnifiedMemory" : 0,
    "OffHeapUnifiedMemory" : 0,
    "DirectPoolMemory" : 0,
    "MappedPoolMemory" : 0,
    "ProcessTreeJVMVMemory" : 0,
    "ProcessTreeJVMRSSMemory" : 0,
    "ProcessTreePythonVMemory" : 0,
    "ProcessTreePythonRSSMemory" : 0,
    "ProcessTreeOtherVMemory" : 0,
    "ProcessTreeOtherRSSMemory" : 0,
    "MinorGCCount" : 0,
    "MinorGCTime" : 0,
    "MajorGCCount" : 0,
    "MajorGCTime" : 0
  }
```
, and the following at the stage level:
```JSON
"peakExecutorMetrics" : {
    "JVMHeapMemory" : 213367864,
    "JVMOffHeapMemory" : 189011656,
    "OnHeapExecutionMemory" : 0,
    "OffHeapExecutionMemory" : 0,
    "OnHeapStorageMemory" : 2133349,
    "OffHeapStorageMemory" : 0,
    "OnHeapUnifiedMemory" : 2133349,
    "OffHeapUnifiedMemory" : 0,
    "DirectPoolMemory" : 282024,
    "MappedPoolMemory" : 0,
    "ProcessTreeJVMVMemory" : 0,
    "ProcessTreeJVMRSSMemory" : 0,
    "ProcessTreePythonVMemory" : 0,
    "ProcessTreePythonRSSMemory" : 0,
    "ProcessTreeOtherVMemory" : 0,
    "ProcessTreeOtherRSSMemory" : 0,
    "MinorGCCount" : 13,
    "MinorGCTime" : 115,
    "MajorGCCount" : 4,
    "MajorGCTime" : 339
  }
```

### How was this patch tested?

Added tests.

Closes #29020 from imback82/metrics.

Lead-authored-by: Terry Kim <[email protected]>
Co-authored-by: edwinalu <[email protected]>
Signed-off-by: Gengliang Wang <[email protected]>
@stackedsax
Copy link

@imback82 Looks like #29020 is merged into 3.1.0. Should this issue get closed now, or is it being left open for someone to do the same in 2.x versions of Spark?

@imback82
Copy link
Contributor

This can be closed now. We can use changes in #29020 if backporting to 2.4 is needed. (@dongjoon-hyun / @gengliangwang can confirm, but I don't think this will be backported to 2.4.)

@github-actions
Copy link

We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!

@github-actions github-actions bot added the Stale label Nov 20, 2020
@github-actions github-actions bot closed this Nov 21, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

10 participants