
[SPARK-25228][CORE] Add executor CPU time metric. #22218

Closed
wants to merge 7 commits

Conversation

LucaCanali
Contributor

What changes were proposed in this pull request?

Add a new metric to measure the executor's process (JVM) CPU time.

How was this patch tested?

Manually tested on a Spark cluster (see SPARK-25228 for an example screenshot).

// The value is returned in nanoseconds; the method returns -1 if this operation is not supported.
val osMXBean = ManagementFactory.getOperatingSystemMXBean.asInstanceOf[OperatingSystemMXBean]
metricRegistry.register(MetricRegistry.name("executorCPUTime" ), new Gauge[Long] {
  override def getValue: Long = osMXBean.getProcessCpuTime()
})
Member

@maropu maropu Aug 25, 2018

Is this metric useful for users? Isn't the task CPU time metric enough?

Contributor Author

I believe the proposed metric tracking the executor CPU time is useful: it adds information and convenience on top of the task CPU metric implemented in SPARK-22190. A couple of considerations from recent findings and experimentation support this:

  • the process CPU time contains all the CPU consumed by the JVM, notably including the CPU consumed by garbage collection, which can be important in some cases and definitely something we want to measure and analyze
  • the CPU time collected from the tasks is "harder to consume" in a dashboard as the CPU value is only updated at the end of the successful execution of the task, which makes it harder to handle for a dashboard in case of long-running tasks. In contrast, the executor process CPU time "dropwizard gauge" gives an up-to-date value of the CPU consumed by the executor at any time as it takes it directly from the OS.
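To illustrate the second point, here is a minimal, self-contained sketch of such a gauge being polled at an arbitrary moment. The `Gauge` trait below is a hypothetical stand-in for `com.codahale.metrics.Gauge`, so the snippet runs without the Dropwizard dependency:

```scala
import java.lang.management.ManagementFactory
import com.sun.management.OperatingSystemMXBean

// Hypothetical stand-in for com.codahale.metrics.Gauge, to keep the sketch self-contained.
trait Gauge[T] { def getValue: T }

val osMXBean = ManagementFactory.getOperatingSystemMXBean.asInstanceOf[OperatingSystemMXBean]

val processCpuGauge = new Gauge[Long] {
  // Reads straight from the OS, so the value is current whenever it is polled,
  // unlike task CPU metrics, which are only updated when a task finishes.
  override def getValue: Long = osMXBean.getProcessCpuTime()
}

// A dashboard sink can poll at any time, independently of task boundaries.
val before = processCpuGauge.getValue
val after = processCpuGauge.getValue
```

Note that `getProcessCpuTime` itself returns -1 on JVMs where the operation is not supported, which is exactly the portability concern raised in the review below.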

import java.util.concurrent.ThreadPoolExecutor

import scala.collection.JavaConverters._

import com.codahale.metrics.{Gauge, MetricRegistry}
import com.sun.management.OperatingSystemMXBean
Member

Is this com.sun class going to be available in all JDKs? Thinking of OpenJDK and IBM JDKs

Member

Good point.
This class cannot be loaded at least on IBM JDK as reported here.

Contributor Author

Indeed this is a very good point that I had overlooked. I have now checked directly, and this appears to work OK on OpenJDK (and on the Oracle JVM, of course). In addition, I tested manually with IBM JDK (IBM J9 VM, Java 1.8.0_181), where one would indeed suspect incompatibilities, and surprisingly this appears to work in that case too. I believe this may come from recent work by IBM to make com.ibm.lang.management.OperatingSystemMXBean.getProcessCpuTime compatible with com.sun.management.OperatingSystemMXBean.getProcessCpuTime? See also this link

I guess that if this is confirmed, we should be fine with a large fraction of the commonly used JDKs. In addition, we could handle the exception in case getProcessCpuTime is not available on a particular platform where the executor is running, for example returning the value -1 for this gauge in that case. Any thoughts/suggestions on this proposal?

Member

I think it's safest to use a little reflection here to make sure this doesn't cause the whole app to crash every time.
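A hedged sketch of what such a reflective call could look like (the helper name is illustrative, not from the PR): the method is looked up by name, so no `com.sun.management` class is referenced at compile time, and any failure, including the method being absent or inaccessible, falls back to -1:

```scala
import java.lang.management.ManagementFactory
import scala.util.control.NonFatal

// Look up getProcessCpuTime reflectively so that no com.sun.management class
// is referenced at compile time; return -1 on any JVM where this fails.
def processCpuTimeViaReflection(): Long = {
  try {
    val bean = ManagementFactory.getOperatingSystemMXBean
    val method = bean.getClass.getMethod("getProcessCpuTime")
    method.setAccessible(true)
    method.invoke(bean).asInstanceOf[Long]
  } catch {
    case NonFatal(_) => -1L
  }
}
```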

Contributor Author

I have refactored the code with a different approach using the MBeanServer, which should address the comments about the availability of com.sun.management.OperatingSystemMXBean across different JDKs.

@LucaCanali
Contributor Author

I have refactored the code now using the MBeanServer, which should address the comments about the availability of com.sun.management.OperatingSystemMXBean across different JDKs.

// Returns the JVM process CPU time, or -1 if the value is not available.
// Uses proprietary extensions such as com.sun.management.OperatingSystemMXBean or
// com.ibm.lang.management.OperatingSystemMXBean, if available.
def tryToGetJVMProcessCpuTime(): Long = {
Member

Can you just inline this method below?

if (attribute != null) {
attribute.asInstanceOf[Long]
}
else {
Member

Nit: pull onto previous line

@LucaCanali
Contributor Author

I have implemented the changes from the latest comments, namely inlining the method.

@@ -73,6 +75,29 @@ class ExecutorSource(threadPool: ThreadPoolExecutor, executorId: String) extends
registerFileSystemStat(scheme, "write_ops", _.getWriteOps(), 0)
}

/** Dropwizard metrics gauge measuring the executor's process CPU time.
Member

Nit: the comments should begin on the next line. But this is scaladoc syntax, and inside a code block, normally we just use // block comments.

* It will use proprietary extensions as com.sun.management.OperatingSystemMXBean or
* com.ibm.lang.management.OperatingSystemMXBean if available
*/
val mBean: MBeanServer = ManagementFactory.getPlatformMBeanServer
Member

The problem here is that these become fields in the parent object. These should go inside the new Gauge... { I think?

-1L
}
} catch {
case _ : Exception => -1L
Member

case NonFatal(_) => -1?

*/
val mBean: MBeanServer = ManagementFactory.getPlatformMBeanServer
val name = new ObjectName("java.lang", "type", "OperatingSystem")
metricRegistry.register(MetricRegistry.name("executorCPUTime" ), new Gauge[Long] {
Member

A little confused with the existing cpuTime. How about jvmCpuTime?

Member

nit: name("executorCPUTime" ) -> name("executorCPUTime")

if (attribute != null) {
attribute.asInstanceOf[Long]
} else {
-1L
Member

Any reason to return -1 instead of 0?

Contributor Author

I took the idea from com.sun.management.OperatingSystemMXBean.getProcessCpuTime; according to the documentation: "Returns: the CPU time used by the process in nanoseconds, or -1 if this operation is not supported."
I think it makes sense to return an invalid value such as -1L if something goes wrong while gathering the CPU time, so that the error condition is evident to the end user of the metric. Returning 0 is also possible, of course.

Member

ok, thanks.

@LucaCanali
Contributor Author

I have implemented the changes as from the latest comments by @maropu and @srowen

// com.ibm.lang.management.OperatingSystemMXBean, if available.
metricRegistry.register(MetricRegistry.name("jvmCpuTime"), new Gauge[Long] {
override def getValue: Long = {
val mBean: MBeanServer = ManagementFactory.getPlatformMBeanServer
Member

To be clear, I actually meant to put these inside the anonymous Gauge instance but outside the method, so as to compute them once. That said, I doubt there is much overhead here: getting the bean just returns a field, although constructing the ObjectName is a little non-trivial. I suppose metrics are computed infrequently, so this doesn't matter much.
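A sketch of that placement, with `mBean` and `name` as fields of the anonymous Gauge instance so they are computed once at registration time rather than on every read (again using a hypothetical stand-in for the Dropwizard `Gauge` trait to keep the snippet self-contained):

```scala
import java.lang.management.ManagementFactory
import javax.management.{MBeanServer, ObjectName}
import scala.util.control.NonFatal

// Hypothetical stand-in for com.codahale.metrics.Gauge.
trait Gauge[T] { def getValue: T }

val jvmCpuTimeGauge = new Gauge[Long] {
  // Fields of the anonymous instance: computed once, not on every getValue call.
  val mBean: MBeanServer = ManagementFactory.getPlatformMBeanServer
  val name = new ObjectName("java.lang", "type", "OperatingSystem")

  override def getValue: Long = {
    try {
      mBean.getAttribute(name, "ProcessCpuTime").asInstanceOf[Long]
    } catch {
      case NonFatal(_) => -1L  // attribute not available on this JVM
    }
  }
}
```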

@LucaCanali
Contributor Author

Thanks @srowen

// The CPU time value is returned in nanoseconds.
// It will use proprietary extensions such as com.sun.management.OperatingSystemMXBean or
// com.ibm.lang.management.OperatingSystemMXBean, if available.
metricRegistry.register(MetricRegistry.name("jvmCpuTime"), new Gauge[Long] {
Member

So this isn't exposed except through dropwizard... not plumbed through to the driver too like some of the metrics below? just checking that this is all that needs to happen, that the metric can be used by external users but is not otherwise touched by Spark.

Contributor Author

Indeed, this is exposed only through the Dropwizard metrics system and not used otherwise in the Spark code. Another point worth mentioning is that currently the executorSource is not registered when running in local mode.
On a related topic (although maybe for a more general discussion than the scope of this PR), I was wondering if it would make sense to introduce a few SparkConf properties to switch on/off certain families of (Dropwizard) metrics, as the list of available metrics is becoming long in recent versions.

override def getValue: Long = {
try {
val attribute = mBean.getAttribute(name, "ProcessCpuTime")
if (attribute != null) {
Member

Contributor Author

Indeed good point. I'll remove this additional check for null value.

Member

I personally don't mind the defensive checks, because who knows what to really expect from these implementations? but this is OK by me. In case of a bad impl this would still return -1.

@SparkQA

SparkQA commented Sep 3, 2018

Test build #4330 has finished for PR 22218 at commit e72966e.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@maropu
Member

maropu commented Sep 3, 2018

retest this please

@SparkQA

SparkQA commented Sep 4, 2018

Test build #4331 has finished for PR 22218 at commit e72966e.

  • This patch passes all tests.
  • This patch does not merge cleanly.
  • This patch adds no public classes.

@srowen
Member

srowen commented Sep 5, 2018

Merged to master

5 participants