
Conversation

@dongjoon-hyun
Member

dongjoon-hyun commented Oct 8, 2019

What changes were proposed in this pull request?

SPARK-29064 introduced PrometheusResource to expose ExecutorSummary. This PR aims to make that endpoint more Prometheus-friendly by using Prometheus labels.

Why are the changes needed?

BEFORE

metrics_app_20191008151432_0000_driver_executor_rddBlocks_Count 0
metrics_app_20191008151432_0000_driver_executor_memoryUsed_Count 0
metrics_app_20191008151432_0000_driver_executor_diskUsed_Count 0

AFTER

$ curl -s http://localhost:4040/metrics/executors/prometheus/ | head -n3
metrics_executor_rddBlocks_Count{application_id="app-20191008151625-0000", application_name="Spark shell", executor_id="driver"} 0
metrics_executor_memoryUsed_Count{application_id="app-20191008151625-0000", application_name="Spark shell", executor_id="driver"} 0
metrics_executor_diskUsed_Count{application_id="app-20191008151625-0000", application_name="Spark shell", executor_id="driver"} 0

Does this PR introduce any user-facing change?

No, but Prometheus understands the new format and displays the metrics more intelligently.

![ui](https://user-images.githubusercontent.com/9700541/66438279-1756f900-e9e1-11e9-91c7-c04c6ce9172f.png)

How was this patch tested?

Manually.

SETUP

$ sbin/start-master.sh
$ sbin/start-slave.sh spark://`hostname`:7077
$ bin/spark-shell --master spark://`hostname`:7077 --conf spark.ui.prometheus.enabled=true

RESULT

$ curl -s http://localhost:4040/metrics/executors/prometheus/ | head -n3
metrics_executor_rddBlocks_Count{application_id="app-20191008151625-0000", application_name="Spark shell", executor_id="driver"} 0
metrics_executor_memoryUsed_Count{application_id="app-20191008151625-0000", application_name="Spark shell", executor_id="driver"} 0
metrics_executor_diskUsed_Count{application_id="app-20191008151625-0000", application_name="Spark shell", executor_id="driver"} 0
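
For reference, here is a minimal, self-contained sketch of the label-rendering approach (the ExecutorSample case class, render helper, and sample values below are illustrative stand-ins, not the actual PrometheusResource code):

```
// Sketch only: mirrors how the new endpoint renders one metric line per value,
// with the application/executor identity carried in Prometheus labels.
object PrometheusLabelSketch {
  // Hypothetical stand-in for the fields read from ExecutorSummary.
  final case class ExecutorSample(id: String, rddBlocks: Long, memoryUsed: Long, diskUsed: Long)

  def render(appId: String, appName: String, executors: Seq[ExecutorSample]): String = {
    val prefix = "metrics_executor_"
    val sb = new StringBuilder
    executors.foreach { executor =>
      // Every executor shares the same metric names; the identity moves into labels.
      val labels = Seq(
        "application_id" -> appId,
        "application_name" -> appName,
        "executor_id" -> executor.id
      ).map { case (k, v) => s"""$k="$v"""" }.mkString("{", ", ", "}")
      sb.append(s"${prefix}rddBlocks_Count$labels ${executor.rddBlocks}\n")
      sb.append(s"${prefix}memoryUsed_Count$labels ${executor.memoryUsed}\n")
      sb.append(s"${prefix}diskUsed_Count$labels ${executor.diskUsed}\n")
    }
    sb.toString
  }

  def main(args: Array[String]): Unit =
    print(render("app-20191008151625-0000", "Spark shell", Seq(ExecutorSample("driver", 0L, 0L, 0L))))
}
```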

@dongjoon-hyun
Member Author

Hi, @gatorsmile.
This is an improvement to the executor metrics that goes beyond the spark.metrics.namespace conf.

@SparkQA

SparkQA commented Oct 9, 2019

Test build #111920 has finished for PR 26060 at commit 9076922.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun
Member Author

Hi, @srowen, @dbtsai, @HyukjinKwon.
Could you review this PR, please?

@yuecong
Contributor

yuecong commented Oct 9, 2019

@dongjoon-hyun Thanks for fixing this.
I have several questions on this.

  1. Short-lived metrics
    As Prometheus uses a pull model, how do you recommend people use these metrics for executors that get shut down immediately? Also, how will this work for short-lived Spark applications (e.g. shorter than one Prometheus scrape interval, usually 30s)?
    Check this blog about short-lived metrics for Prometheus.

  2. High cardinality
    It looks like you are using app_id as one of the labels, which will increase the cardinality of the Prometheus metrics. See more information about Prometheus's cardinality issue here, as well as this doc.

Suppose a user points a central Prometheus server at their Spark applications with this PR. Each Spark application exposes, say, N metrics (10) across M executors (20) on average, i.e. roughly 200 time series per application run. Since app_id changes on every run, the old series never disappear, so over time this adds up to millions or even billions of series. This causes a heavy load for a traditional Prometheus server. There are several solutions (M3, Cortex, Thanos) that address this, but we should make the cardinality implications clear for users of such metrics.

It would be great if we gave some guidance on how users should consume these metrics in practice, especially on how to handle short-lived metrics and high-cardinality metrics.

"executor_id" -> executor.id
).map { case (k, v) => s"""$k="$v"""" }.mkString("{", ", ", "}")
sb.append(s"${prefix}rddBlocks_Count$labels ${executor.rddBlocks}\n")
sb.append(s"${prefix}memoryUsed_Count$labels ${executor.memoryUsed}\n")
Member

Not related to this PR, but why do they all end with _Count? For rddBlocks it is fine, but for some it seems unsuitable, like memoryUsed_Count.

Member Author

Thank you for the review, @viirya. Yes, of course, we can rename them freely because we are starting to support them natively.

@dongjoon-hyun
Member Author

dongjoon-hyun commented Oct 9, 2019

Hi, @yuecong. Thank you for the review.

  1. That was true in the old Prometheus plugin. So, Apache Spark 3.0.0 exposes these Prometheus metrics on the driver port instead of the executor ports. In other words, you are referring to executors rather than the driver. Do you have a short-lived Spark driver that dies within 30s?

As Prometheus uses a pull model, how do you recommend people use these metrics for executors that get shut down immediately? Also, how will this work for short-lived Spark applications (e.g. shorter than one Prometheus scrape interval, usually 30s)?

  2. Please see this PR's description. The metric name is unique with cardinality 1 by using labels, e.g. metrics_executor_rddBlocks_Count{application_id="app-20191008151625-0000"

It looks like you are using app_id as one of the labels, which will increase the cardinality of the Prometheus metrics.

I don't think you mean that the Prometheus dimension (label) feature is itself high-cardinality.

@yuecong
Contributor

yuecong commented Oct 9, 2019

  1. That was true in the old Prometheus plugin. So, Apache Spark 3.0.0 exposes this Prometheus metric on the driver port, instead of the executor port

This is awesome!

@yuecong
Contributor

yuecong commented Oct 9, 2019

  2. Please see this PR's description. The metric name is unique with cardinality 1 by using labels, e.g. metrics_executor_rddBlocks_Count{application_id="app-20191008151625-0000"

For Prometheus, not just the metric name but also the labels count toward cardinality. Inside the Prometheus server's TSDB, the combination of metric name and labels is actually the key.
See the details here.

CAUTION: Remember that every unique combination of key-value label pairs represents a new time series, which can dramatically increase the amount of data stored. Do not use labels to store dimensions with high cardinality (many different label values), such as user IDs, email addresses, or other unbounded sets of values.

@dongjoon-hyun
Member Author

@yuecong, that is a general issue with Apache Spark monitoring rather than with this PR, isn't it? So, I have three questions for you.

  1. Do you use a custom Sink to monitor Apache Spark?
  2. Do you collect only cluster-wide metrics?
  3. Is it helpful for monitoring long-running apps like Structured Streaming?

@yuecong
Contributor

yuecong commented Oct 9, 2019

Do the metrics for a Spark application disappear after the application finishes? I guess the answer is no. If the driver keeps all the metrics for all the Spark applications it runs, won't this cause high memory usage on the driver? Do you have any tests on this?

@dongjoon-hyun
Member Author

dongjoon-hyun commented Oct 9, 2019

This PR doesn't collect new metrics; it only exposes the existing ones. So, the following is not about this PR. If you have a concern about the Apache Spark driver, you can file a new issue for that.

If the driver keeps all the metrics for all the Spark applications it runs,

Second, on the Prometheus side, you can use the Prometheus TTL feature, @yuecong. Have you tried that?

@yuecong
Contributor

yuecong commented Oct 9, 2019

That is a general issue with Apache Spark monitoring rather than with this PR, isn't it? So, I have three questions for you.

  1. Do you use a custom Sink to monitor Apache Spark?
  2. Do you collect only cluster-wide metrics?
  3. Is it helpful for monitoring long-running apps like Structured Streaming?

I agree that this is a general challenge for Apache Spark monitoring with a normal Prometheus server. I would just suggest making the high cardinality clear. Maybe this is orthogonal to your PR; just my two cents. People use a highly scalable Prometheus backend (e.g. M3, Cortex, etc.) to handle Spark metrics.

Also, if we could have a custom exporter that lets users push metrics to a distributed time-series database or a pub/sub system (e.g. Kafka), that could address this high-cardinality issue as well.

@yuecong
Contributor

yuecong commented Oct 9, 2019

Second, you can use the Prometheus TTL feature, @yuecong. Have you tried that?

Could you share a link for this one? We are not using a TTL feature for Prometheus yet.

@dongjoon-hyun
Member Author

Sorry for the misleading naming. I meant the following.

@yuecong
Contributor

yuecong commented Oct 9, 2019

Thanks for the pointer. We are using a retention period, for sure. But that does not help expire specific metrics once they are in the Prometheus server. Also, removing metrics from the Prometheus registry on the client side is a challenge as well. Again, these challenges become more critical with high-cardinality metrics. This is why the Prometheus community does not suggest using unbounded values for a label.

@dongjoon-hyun
Member Author

BTW, the following covers some gigantic clusters, but not all cases. There is a different and cheaper approach like federation. We are using federation, which fits our environment much better.

People use a highly scalable Prometheus backend (e.g. M3, Cortex, etc.) to handle Spark metrics.

@yuecong
Contributor

yuecong commented Oct 9, 2019

BTW, the following covers some gigantic clusters, but not all cases. There is a different and cheaper approach like federation. We are using federation, which fits our environment much better.

I would like to hear more about this. Our experience with federation is that if we do not filter out some metrics and just federate metrics from each child server, the federation Prometheus becomes the bottleneck. Federation provides a way to gather metrics from other Prometheus servers, but it does not solve the scalability issues.
How many metrics do you have in your federation Prometheus?

@dongjoon-hyun
Member Author

dongjoon-hyun commented Oct 9, 2019

Nope. Why would you collect everything? It's up to your configuration.

Back to the beginning, I fully understand your cluster's underlying issues.

However, none of them blocks Apache Spark from supporting Prometheus metrics natively.

  1. First, you can keep using your existing solution if you have one (spark.ui.prometheus.enabled is also false by default).
  2. Second, your claims are too general. Not every user has that kind of gigantic cluster. Although a few big customers have some, there are also many small satellite clusters.

I'm not sure about your metrics. Could you share with us the size of your clusters, the number of apps, and the number of metrics? Does it run on Apache Spark?

@dongjoon-hyun
Member Author

I don't see any numbers from you so far here. :)

@dongjoon-hyun
Member Author

Gentle ping, @gatorsmile, @srowen, @dbtsai, @viirya, @yuecong.

Brief Summary

  1. Apache Spark has been providing a REST API. The new Prometheus servlet only exposes the same information. If there is a scalability issue inside Apache Spark, we should fix it for both the existing REST API and this endpoint. That is orthogonal to this PR.
  2. Prometheus (performance or architectural) issues are completely irrelevant to this PR.

@srowen
Member

srowen commented Oct 9, 2019

(I don't have any opinion on this - don't know Prometheus)

val prefix = "metrics_executor_"
val labels = Seq(
"application_id" -> store.applicationInfo.id,
"application_name" -> store.applicationInfo.name,
Member

I am not sure if application name is needed, because you have application id already.

Member Author

This is the only human-readable field for distinguishing the jobs.

"application_name" -> store.applicationInfo.name,
"executor_id" -> executor.id
).map { case (k, v) => s"""$k="$v"""" }.mkString("{", ", ", "}")
sb.append(s"${prefix}rddBlocks_Count$labels ${executor.rddBlocks}\n")
Member

@viirya Oct 9, 2019

Actually, the prefix is fixed? So they are now the same metrics, and application id and executor id are now labels on them?

Does it have a bad impact on metrics usage later? Because now all applications are recorded under the same metrics. I am not sure how Prometheus processes this, but naturally I'd think Prometheus needs to search for a specified application id among the metrics of all applications.

Previously you had the appId and executor id in the metric name.

Member Author

Right. The redundant information is moved to labels.

Actually, the prefix is fixed? So they are now the same metrics, and application id and executor id are now labels on them?

No. The Prometheus query language supports handling them individually.

Does it have a bad impact on metrics usage later? Because now all applications are recorded under the same metrics.
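
For example, a PromQL label matcher (illustrative query, reusing the application id from the description above) selects one application's series directly:

```
metrics_executor_memoryUsed_Count{application_id="app-20191008151625-0000"}
```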

Member

No. The Prometheus query language supports handling them individually.

Yes. But what I am wondering is: now all numbers from all applications are recorded under the same metric. To retrieve the number for a specified application, doesn't Prometheus need to search for it among all applications' metric numbers?

Member

I may misunderstand Prometheus's approach. If so, then this might not be a problem.

Member Author

For that Prometheus question, different labels mean different time-series in Prometheus.

Prometheus fundamentally stores all data as time series: streams of timestamped values belonging to the same metric and the same set of labeled dimensions.

Here is the reference for the details.
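
For instance, the two samples below (illustrative values and application ids) share a metric name but carry different label sets, so Prometheus stores them as two distinct time series:

```
metrics_executor_memoryUsed_Count{application_id="app-A", application_name="Spark shell", executor_id="driver"} 1024
metrics_executor_memoryUsed_Count{application_id="app-B", application_name="Spark shell", executor_id="driver"} 2048
```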

@HyukjinKwon
Member

Seems fine.

@dongjoon-hyun
Member Author

Thank you all, @yuecong, @viirya, @srowen, @HyukjinKwon.
Merged to master.

dongjoon-hyun deleted the SPARK-29400 branch October 10, 2019 15:47
atronchi pushed a commit to atronchi/spark that referenced this pull request Oct 23, 2019
@RamakrishnaHande

I see that this PR is closed, but https://issues.apache.org/jira/browse/SPARK-29400 says the issue is resolved. Is that true?

@srowen
Member

srowen commented Jan 18, 2023

Right, the change was merged several years ago?
