
Conversation

@dongjoon-hyun
Member

dongjoon-hyun commented Oct 8, 2019

What changes were proposed in this pull request?

SPARK-29064 introduced PrometheusResource to expose ExecutorSummary. This PR aims to make that endpoint more Prometheus-friendly by using Prometheus labels.

Why are the changes needed?

BEFORE

metrics_app_20191008151432_0000_driver_executor_rddBlocks_Count 0
metrics_app_20191008151432_0000_driver_executor_memoryUsed_Count 0
metrics_app_20191008151432_0000_driver_executor_diskUsed_Count 0

AFTER

$ curl -s http://localhost:4040/metrics/executors/prometheus/ | head -n3
metrics_executor_rddBlocks_Count{application_id="app-20191008151625-0000", application_name="Spark shell", executor_id="driver"} 0
metrics_executor_memoryUsed_Count{application_id="app-20191008151625-0000", application_name="Spark shell", executor_id="driver"} 0
metrics_executor_diskUsed_Count{application_id="app-20191008151625-0000", application_name="Spark shell", executor_id="driver"} 0

Does this PR introduce any user-facing change?

No, but Prometheus understands the new format and displays the metrics more intelligently.

![ui](https://user-images.githubusercontent.com/9700541/66438279-1756f900-e9e1-11e9-91c7-c04c6ce9172f.png)

How was this patch tested?

Manually.

SETUP

$ sbin/start-master.sh
$ sbin/start-slave.sh spark://`hostname`:7077
$ bin/spark-shell --master spark://`hostname`:7077 --conf spark.ui.prometheus.enabled=true

RESULT

$ curl -s http://localhost:4040/metrics/executors/prometheus/ | head -n3
metrics_executor_rddBlocks_Count{application_id="app-20191008151625-0000", application_name="Spark shell", executor_id="driver"} 0
metrics_executor_memoryUsed_Count{application_id="app-20191008151625-0000", application_name="Spark shell", executor_id="driver"} 0
metrics_executor_diskUsed_Count{application_id="app-20191008151625-0000", application_name="Spark shell", executor_id="driver"} 0
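
For reference, here is a minimal, self-contained sketch of the label-rendering approach (the ExecutorSample case class, render helper, and sample values below are illustrative stand-ins, not the actual PrometheusResource code):

```
// Sketch only: mirrors how the new endpoint renders one metric line per value,
// with the application/executor identity carried in Prometheus labels.
object PrometheusLabelSketch {
  // Hypothetical stand-in for the fields read from ExecutorSummary.
  final case class ExecutorSample(id: String, rddBlocks: Long, memoryUsed: Long, diskUsed: Long)

  def render(appId: String, appName: String, executors: Seq[ExecutorSample]): String = {
    val prefix = "metrics_executor_"
    val sb = new StringBuilder
    executors.foreach { executor =>
      // Every executor shares the same metric names; the identity moves into labels.
      val labels = Seq(
        "application_id" -> appId,
        "application_name" -> appName,
        "executor_id" -> executor.id
      ).map { case (k, v) => s"""$k="$v"""" }.mkString("{", ", ", "}")
      sb.append(s"${prefix}rddBlocks_Count$labels ${executor.rddBlocks}\n")
      sb.append(s"${prefix}memoryUsed_Count$labels ${executor.memoryUsed}\n")
      sb.append(s"${prefix}diskUsed_Count$labels ${executor.diskUsed}\n")
    }
    sb.toString
  }

  def main(args: Array[String]): Unit =
    print(render("app-20191008151625-0000", "Spark shell", Seq(ExecutorSample("driver", 0L, 0L, 0L))))
}
```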

@dongjoon-hyun
Member Author

Hi, @gatorsmile.
This is an improvement to the executor metrics that goes beyond the spark.metrics.namespace conf.

@SparkQA

SparkQA commented Oct 9, 2019

Test build #111920 has finished for PR 26060 at commit 9076922.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun
Member Author

Hi, @srowen, @dbtsai, @HyukjinKwon.
Could you review this PR, please?

@yuecong
Contributor

yuecong commented Oct 9, 2019

@dongjoon-hyun Thanks for fixing this.
I have several questions on this.

  1. Short-lived metrics
    As Prometheus uses a pull model, how do you recommend people use these metrics for executors that get shut down immediately? Also, how will this work for short-lived Spark applications (e.g. shorter than one Prometheus scrape interval, usually 30s)?
    Check this blog about short-lived metrics for Prometheus.

  2. High cardinality
    It looks like you are using app_id as one of the labels, which will increase the cardinality of the Prometheus metrics. See more information about Prometheus's cardinality issue here, as well as this doc.

Suppose a user points a central Prometheus server at their Spark applications with this PR. Each Spark application exposes, say, N metrics (10) across M executors (20) on average, i.e. roughly 200 time series per application run. Since app_id changes on every run, the old series never disappear, so over time this adds up to millions or even billions of series. This causes a heavy load for a traditional Prometheus server. There are several solutions (M3, Cortex, Thanos) that address this, but we should make the cardinality implications clear for users of such metrics.

It would be great if we gave some guidance on how users should consume these metrics in practice, especially on how to handle short-lived metrics and high-cardinality metrics.

"executor_id" -> executor.id
).map { case (k, v) => s"""$k="$v"""" }.mkString("{", ", ", "}")
sb.append(s"${prefix}rddBlocks_Count$labels ${executor.rddBlocks}\n")
sb.append(s"${prefix}memoryUsed_Count$labels ${executor.memoryUsed}\n")
Member

Not related to this PR, but why do they all end with _Count? For rddBlocks it is fine, but for some it seems unsuitable, like memoryUsed_Count.

Member Author

Thank you for the review, @viirya. Yes, of course, we can rename them freely because we are starting to support them natively.

@dongjoon-hyun
Member Author

dongjoon-hyun commented Oct 9, 2019

Hi, @yuecong. Thank you for the review.

  1. That was true in the old Prometheus plugin. So, Apache Spark 3.0.0 exposes these Prometheus metrics on the driver port instead of the executor ports. In other words, you are referring to executors rather than the driver. Do you have a short-lived Spark driver that dies within 30s?

As Prometheus uses a pull model, how do you recommend people use these metrics for executors that get shut down immediately? Also, how will this work for short-lived Spark applications (e.g. shorter than one Prometheus scrape interval, usually 30s)?

  2. Please see this PR's description. The metric name is unique with cardinality 1 by using labels, e.g. metrics_executor_rddBlocks_Count{application_id="app-20191008151625-0000"

It looks like you are using app_id as one of the labels, which will increase the cardinality of the Prometheus metrics.

I don't think you mean that the Prometheus dimension (label) feature is itself high-cardinality.

@yuecong
Contributor

yuecong commented Oct 9, 2019

  1. That was true in the old Prometheus plugin. So, Apache Spark 3.0.0 exposes this Prometheus metric on the driver port, instead of the executor port

This is awesome!

@yuecong
Contributor

yuecong commented Oct 9, 2019

  2. Please see this PR's description. The metric name is unique with cardinality 1 by using labels, e.g. metrics_executor_rddBlocks_Count{application_id="app-20191008151625-0000"

For Prometheus, not just the metric name but also the labels count toward cardinality. Inside the Prometheus server's TSDB, the combination of metric name and labels is actually the key.
See the details here.

CAUTION: Remember that every unique combination of key-value label pairs represents a new time series, which can dramatically increase the amount of data stored. Do not use labels to store dimensions with high cardinality (many different label values), such as user IDs, email addresses, or other unbounded sets of values.

@dongjoon-hyun
Member Author

@yuecong, that is a general issue with Apache Spark monitoring rather than with this PR, isn't it? So, I have three questions for you.

  1. Do you use a custom Sink to monitor Apache Spark?
  2. Do you collect only cluster-wide metrics?
  3. Is it helpful for monitoring long-running apps like Structured Streaming?

@yuecong
Contributor

yuecong commented Oct 9, 2019

Do the metrics for a Spark application disappear after the application finishes? I guess the answer is no. If the driver keeps all the metrics for all the Spark applications it runs, won't this cause high memory usage on the driver? Do you have any tests on this?

@dongjoon-hyun
Member Author

dongjoon-hyun commented Oct 9, 2019

This PR doesn't collect new metrics; it only exposes the existing ones. So, the following is not about this PR. If you have a concern about the Apache Spark driver, you can file a new issue for that.

If the driver keeps all the metrics for all the Spark applications it runs,

Second, on the Prometheus side, you can use the Prometheus TTL feature, @yuecong. Have you tried that?

@yuecong
Contributor

yuecong commented Oct 9, 2019

That is a general issue with Apache Spark monitoring rather than with this PR, isn't it? So, I have three questions for you.

  1. Do you use a custom Sink to monitor Apache Spark?
  2. Do you collect only cluster-wide metrics?
  3. Is it helpful for monitoring long-running apps like Structured Streaming?

I agree that this is a general challenge for Apache Spark monitoring with a normal Prometheus server. I would just suggest making the high cardinality clear. Maybe this is orthogonal to your PR; just my two cents. People use a highly scalable Prometheus backend (e.g. M3, Cortex, etc.) to handle Spark metrics.

Also, if we could have a custom exporter that lets users push metrics to a distributed time-series database or a pub/sub system (e.g. Kafka), that could address this high-cardinality issue as well.

@yuecong
Contributor

yuecong commented Oct 9, 2019

Second, you can use the Prometheus TTL feature, @yuecong. Have you tried that?

Could you share a link for this one? We are not using a TTL feature for Prometheus yet.

@dongjoon-hyun
Member Author

Sorry for the misleading naming. I meant the following.

@yuecong
Contributor

yuecong commented Oct 9, 2019

Thanks for the pointer. We are using a retention period, for sure. But that does not help expire specific metrics once they are in the Prometheus server. Also, removing metrics from the Prometheus registry on the client side is a challenge as well. Again, these challenges become more critical with high-cardinality metrics. This is why the Prometheus community does not suggest using unbounded values for a label.

@dongjoon-hyun
Member Author

BTW, the following covers some gigantic clusters, but not all cases. There is a different and cheaper approach like federation. We are using federation, which fits our environment much better.

People use a highly scalable Prometheus backend (e.g. M3, Cortex, etc.) to handle Spark metrics.

@yuecong
Contributor

yuecong commented Oct 9, 2019

BTW, the following covers some gigantic clusters, but not all cases. There is a different and cheaper approach like federation. We are using federation, which fits our environment much better.

I would like to hear more about this. Our experience with federation is that if we do not filter out some metrics and just federate metrics from each child server, the federation Prometheus becomes the bottleneck. Federation provides a way to gather metrics from other Prometheus servers, but it does not solve the scalability issues.
How many metrics do you have in your federation Prometheus?

@dongjoon-hyun
Member Author

dongjoon-hyun commented Oct 9, 2019

Nope. Why would you collect everything? It's up to your configuration.

Back to the beginning, I fully understand your cluster's underlying issues.

However, none of them blocks Apache Spark from supporting Prometheus metrics natively.

  1. First, you can keep using your existing solution if you have one (spark.ui.prometheus.enabled is also false by default).
  2. Second, your claims are too general. Not every user has that kind of gigantic cluster. Although a few big customers have some, there are also many small satellite clusters.

I'm not sure about your metrics. Could you share with us the size of your clusters, the number of apps, and the number of metrics? Does it run on Apache Spark?

@dongjoon-hyun
Member Author

I don't see any numbers from you so far here. :)

@dongjoon-hyun
Member Author

Gentle ping, @gatorsmile, @srowen, @dbtsai, @viirya, @yuecong.

Brief Summary

  1. Apache Spark has been providing a REST API. The new Prometheus servlet only exposes the same information. If there is a scalability issue inside Apache Spark, we should fix it for both the existing REST API and this endpoint. That is orthogonal to this PR.
  2. Prometheus (performance or architectural) issues are completely irrelevant to this PR.

@srowen
Member

srowen commented Oct 9, 2019

(I don't have any opinion on this - don't know Prometheus)

val prefix = "metrics_executor_"
val labels = Seq(
"application_id" -> store.applicationInfo.id,
"application_name" -> store.applicationInfo.name,
Member

I am not sure if application name is needed, because you have application id already.

Member Author

This is the only human-readable field for distinguishing the jobs.

"application_name" -> store.applicationInfo.name,
"executor_id" -> executor.id
).map { case (k, v) => s"""$k="$v"""" }.mkString("{", ", ", "}")
sb.append(s"${prefix}rddBlocks_Count$labels ${executor.rddBlocks}\n")
Member

@viirya Oct 9, 2019

Actually, the prefix is fixed? So they are now the same metrics, and application id and executor id are now labels on them?

Does it have a bad impact on metrics usage later? Because now all applications are recorded under the same metrics. I am not sure how Prometheus processes this, but naturally I'd think Prometheus needs to search for a specified application id among the metrics of all applications.

Previously you had the appId and executor id in the metric name.

Member Author

Right. The redundant information is moved to labels.

Actually, the prefix is fixed? So they are now the same metrics, and application id and executor id are now labels on them?

No. The Prometheus query language supports handling them individually.

Does it have a bad impact on metrics usage later? Because now all applications are recorded under the same metrics.
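
For example, a PromQL label matcher (illustrative query, reusing the application id from the description above) selects one application's series directly:

```
metrics_executor_memoryUsed_Count{application_id="app-20191008151625-0000"}
```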

Member

No. The Prometheus query language supports handling them individually.

Yes. But what I am wondering is: now all numbers from all applications are recorded under the same metric. To retrieve the number for a specified application, doesn't Prometheus need to search for it among all applications' metric numbers?

Member

I may misunderstand Prometheus's approach. If so, then this might not be a problem.

Member Author

For that Prometheus question, different labels mean different time-series in Prometheus.

Prometheus fundamentally stores all data as time series: streams of timestamped values belonging to the same metric and the same set of labeled dimensions.

Here is the reference for the details.
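
For instance, the two samples below (illustrative values and application ids) share a metric name but carry different label sets, so Prometheus stores them as two distinct time series:

```
metrics_executor_memoryUsed_Count{application_id="app-A", application_name="Spark shell", executor_id="driver"} 1024
metrics_executor_memoryUsed_Count{application_id="app-B", application_name="Spark shell", executor_id="driver"} 2048
```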

@HyukjinKwon
Member

Seems fine.

@dongjoon-hyun
Member Author

Thank you all, @yuecong, @viirya, @srowen, @HyukjinKwon.
Merged to master.

dongjoon-hyun deleted the SPARK-29400 branch October 10, 2019 15:47
atronchi pushed a commit to atronchi/spark that referenced this pull request Oct 23, 2019
@RamakrishnaHande

I see that this PR is closed, but https://issues.apache.org/jira/browse/SPARK-29400 says the issue is resolved. Is that true?

@srowen
Member

srowen commented Jan 18, 2023

Right, the change was merged several years ago?
