
[native] Add prometheus-histogram-metrics-collection-enabled config#25065

Closed
xin-zhang2 wants to merge 1 commit into prestodb:master from xin-zhang2:histograms_flag

Conversation

Contributor

@xin-zhang2 xin-zhang2 commented May 7, 2025

Description

Fixed #25058

Motivation and Context

Impact

Test Plan

Contributor checklist

  • Please make sure your submission complies with our contributing guide, in particular code style and commit standards.
  • PR description addresses the issue accurately and concisely. If the change is non-trivial, a GitHub Issue is referenced.
  • Documented new properties (with its default value), SQL syntax, functions, or other functionality.
  • If release notes are required, they follow the release notes guidelines.
  • Adequate tests were added if applicable.
  • CI passed.

Release Notes

Please follow release notes guidelines and fill in the release notes below.

== RELEASE NOTES ==

General Changes
* ... 
* ... 

Hive Connector Changes
* ... 
* ... 

If release note is NOT required, use:

== NO RELEASE NOTE ==

@xin-zhang2 xin-zhang2 requested review from a team, elharo and steveburnett as code owners May 7, 2025 19:07
@xin-zhang2 xin-zhang2 requested a review from jaystarshot May 7, 2025 19:07
@prestodb-ci prestodb-ci added the from:IBM PR from IBM label May 7, 2025
@prestodb-ci prestodb-ci requested review from a team, jp-sivaprasad and nishithakbhaskaran and removed request for a team May 7, 2025 19:07
@xin-zhang2 xin-zhang2 requested a review from majetideepak May 7, 2025 19:09
Contributor

@steveburnett steveburnett left a comment

Thanks for the doc! Just two formatting nits.

@xin-zhang2 xin-zhang2 marked this pull request as draft May 7, 2025 20:00
@xin-zhang2 xin-zhang2 force-pushed the histograms_flag branch 3 times, most recently from a07211b to 658e167 Compare May 7, 2025 21:05
@xin-zhang2 xin-zhang2 marked this pull request as ready for review May 8, 2025 07:57
@prestodb-ci prestodb-ci requested a review from a team May 8, 2025 07:57
steveburnett previously approved these changes May 8, 2025
Contributor

@steveburnett steveburnett left a comment

LGTM! (docs)

Pulled the updated branch and ran a new local doc build. Thanks!

Contributor

@aditi-pandit aditi-pandit left a comment

Thanks @xin-zhang2. I have a few comments.

int64_t min,
int64_t max,
const std::vector<int32_t>& pcts) const {
if (!enableHistogramMetricCollection) {
Contributor

Do you really want to disable registering the metrics as well? We could disable only the adding of metric values, to keep this simple for the future.

Contributor Author

Thanks for the review.
I’m inclined to disable both. If we only disable adding metric values, those metrics would still appear in the API output, which could be confusing — they’d show up in the response but wouldn’t actually be collected.
Do you think it would make more sense to only disable the metric collection?

Collaborator

We should disable registering as well. We should add a VLOG(1) that this histogram metric is disabled.

Contributor

My thinking is that users should be aware that the histogram metric exists but that we are not collecting it because of performance concerns. It's okay if it's not populated in graphs that are exposed by tooling.

If a user wants to see the values, they can turn it on as a session property.

If we disable registering, then a server restart is needed each time we want to turn the histogram collection on or off.

Collaborator

I am not sure if we want to enable/disable this at the session level. The downside of always enabling is that histograms will always be returned empty, and it might be confusing in certain settings. This fix is temporary. We need to fix the root cause of the performance degradation with histograms.

Contributor

It will be good to investigate the performance degradation with histograms and be committed to fixing it. If we are not able to fix it, then it would be better to expose the histogram metrics to the user as empty and turn them on/off for debugging.
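
The two options debated in this thread (skip only the value updates, or skip registration as well) can be contrasted with a toy reporter. This is a minimal sketch under stated assumptions: `ToyHistogramReporter`, its `Gate` enum, and every method name are hypothetical and are not the PR's `PrometheusStatsReporter` API.

```cpp
#include <cstddef>
#include <cstdint>
#include <map>
#include <string>
#include <vector>

// Hypothetical stand-in for a stats reporter, used only to compare the
// two gating strategies discussed in the review.
class ToyHistogramReporter {
 public:
  enum class Gate { kNone, kValuesOnly, kRegistrationAndValues };

  explicit ToyHistogramReporter(Gate gate) : gate_(gate) {}

  // Returns true if the histogram was registered.
  bool registerHistogram(const std::string& name) {
    if (gate_ == Gate::kRegistrationAndValues) {
      return false;  // The metric never appears in the exported output.
    }
    samples_.emplace(name, std::vector<int64_t>{});
    return true;
  }

  void addValue(const std::string& name, int64_t value) {
    if (gate_ != Gate::kNone) {
      return;  // Value collection is off in both gated modes.
    }
    auto it = samples_.find(name);
    if (it != samples_.end()) {
      it->second.push_back(value);
    }
  }

  // Names that would show up in the exported output.
  std::vector<std::string> exportedNames() const {
    std::vector<std::string> names;
    for (const auto& entry : samples_) {
      names.push_back(entry.first);
    }
    return names;
  }

  std::size_t sampleCount(const std::string& name) const {
    auto it = samples_.find(name);
    return it == samples_.end() ? 0 : it->second.size();
  }

 private:
  Gate gate_;
  std::map<std::string, std::vector<int64_t>> samples_;
};
```

With `Gate::kValuesOnly` the metric is still listed but stays empty, which is the behavior one side of the thread finds confusing. With `Gate::kRegistrationAndValues` the metric disappears entirely, and in this toy model the gate is fixed at construction, which loosely mirrors the restart concern raised above.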

static constexpr std::string_view kEnableRuntimeMetricsCollection{
"runtime-metrics-collection-enabled"};

static constexpr std::string_view kEnableRuntimeHistogramMetricsCollection{
Contributor

Should we explicitly use prometheus in the name, since this config is only for Prometheus stats?

Contributor Author

This config is only for PrometheusStatsReporter and it makes sense to use prometheus in the name.
My thinking is that since PrometheusStatsReporter is the only implementation of StatReporter in Presto when the metrics collection is enabled, we can avoid referencing prometheus in Configs.h that is in the common directory.

Contributor

@xin-zhang2 : Meta has their own implementation of metrics collection.

Though you have a valid point about the name in Configs.h.

Let's keep it as is.

Collaborator

We should add Prometheus as the issue is specific to Prometheus. We are only disabling histogram metrics for Prometheus.
We could add class PrometheusConfig : public ConfigBase { to make this clean.

Contributor

It would be good not to have too many config objects, as configs have to be copied in QueryContextManager.

Though I also lean toward having prometheus in the name.

Collaborator

as configs have to be copied in QueryContextManager

We don't have to in this case.

When enabled and Presto C++ workers interact with the S3 filesystem, additional runtime metrics are collected.
For a detailed list of these metrics, see `S3 FileSystem <https://facebookincubator.github.io/velox/monitoring/metrics.html#s3-filesystem>`_.

``runtime-histogram-metrics-collection-enabled``
Collaborator

The issue is specific to prometheus. Meta's implementation does not have this issue.
Let's name this prometheus-histogram-metrics-collection-enabled. Move the documentation close to prometheus.

Enables collection of worker level metrics.

``runtime-histogram-metrics-collection-enabled``
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Collaborator

Do we not document prometheus here?

* **Type:** ``boolean``
* **Default value:** ``false``

Enable histogram metrics collection. This property is only valid when ``runtime-metrics-collection-enabled`` is set to ``true``.
Collaborator

Clarify that there can be a significant performance degradation when enabled.

if (enableRuntimeMetricsCollection) {
configProperties = format("%s%n" +
"runtime-metrics-collection-enabled=true%n", configProperties);
configProperties = format("%s%n" +
Contributor

This code is used for all the tests and QueryRunners. Why are you enabling histogram metrics collection here? It might be better to disable it here as well.

Contributor Author

@xin-zhang2 xin-zhang2 May 27, 2025

My thinking is that these are for non-production environments, so I kept the behavior unchanged from before we introduced the new config.
Do you think it would be better to disable it? enableRuntimeMetricsCollection is now passed as a function parameter, and we could add another parameter to the function for the histogram metrics.


```
runtime-metrics-collection-enabled=true
runtime-histogram-metrics-collection-enabled=true
```
Contributor

Do we suggest turning this on here? It might be better to turn it off.

Contributor Author

This doc is meant to show how to enable the metrics collection when building Prestissimo from source, so I guess it makes sense to set it to true here.

* **Type:** ``boolean``
* **Default value:** ``false``

Enable histogram metrics collection. This property is only valid when ``runtime-metrics-collection-enabled`` is set to ``true``.
Contributor

Would be great to have examples of the current histogram metrics that will not be collected as well.

http-server.http.port=7777
shutdown-onset-sec=1
runtime-metrics-collection-enabled=true
runtime-histogram-metrics-collection-enabled=true
Contributor

This needn't be added here either. We could continue with the default, which disables histogram metrics collection.

Contributor Author

This properties file is used by the QueryRunners for non-production environments, so I left its behavior unchanged from before the introduction of the new config.

@xin-zhang2
Contributor Author

Made some changes based on the review comments.

  1. Renamed runtime-histogram-metrics-collection-enabled to prometheus-histogram-metrics-collection-enabled, and moved it to a new class PrometheusConfig under the runtime-metrics directory to keep the common directory clean. The PrometheusConfig also reads from config.properties.
  2. Added a new parameter enablePrometheusHistogramMetricsCollection to the createNativeQueryRunner function and set it to true only when enableRuntimeMetricsCollection is also true.
  3. Removed the new config from presto-native-execution/etc/config.properties, so the Prometheus histogram metrics are disabled by default when running PrestoMain.cpp.
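
The config lookup described in the list above might look roughly like the following. This is only a sketch under stated assumptions: `ToyPrometheusConfig` and `shouldCollectHistograms` are hypothetical names, the property map stands in for a parsed config.properties, and nothing here reflects the actual `ConfigBase` interface. The property name and its `false` default come from this PR's documentation.

```cpp
#include <string>
#include <unordered_map>
#include <utility>

// Hypothetical, simplified stand-in for the PrometheusConfig class
// mentioned above; the real class is not shown in this conversation.
class ToyPrometheusConfig {
 public:
  static constexpr const char* kHistogramMetricsCollectionEnabled =
      "prometheus-histogram-metrics-collection-enabled";

  explicit ToyPrometheusConfig(
      std::unordered_map<std::string, std::string> properties)
      : properties_(std::move(properties)) {}

  // Defaults to false when the property is absent, matching the
  // documented default value.
  bool histogramMetricsCollectionEnabled() const {
    auto it = properties_.find(kHistogramMetricsCollectionEnabled);
    return it != properties_.end() && it->second == "true";
  }

 private:
  std::unordered_map<std::string, std::string> properties_;
};

// The histogram flag only takes effect when runtime metrics collection
// is enabled as well.
inline bool shouldCollectHistograms(
    bool runtimeMetricsCollectionEnabled,
    const ToyPrometheusConfig& config) {
  return runtimeMetricsCollectionEnabled &&
      config.histogramMetricsCollectionEnabled();
}
```

Keeping the property in its own small class is one way to avoid mentioning Prometheus in the shared config header, which was the concern raised earlier in the review.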

Currently the registration and collection are both disabled. I'm not sure whether we need a session property to enable toggling it at runtime.

#23338 mentioned that a metric updated in the query path would degrade the performance. So I suspect we run into the same issue because kMetricTableScanBatchProcessTimeMs is updated in getOutput() of TableScan (https://github.com/facebookincubator/velox/pull/12759/files#diff-b04232f155cc15faca93b69cf34dd80e29b534a4e54dd774bf8bdbedbfc66d6bR286).
If this is the root cause, it should be resolved by #24716.

@majetideepak @aditi-pandit

@aditi-pandit aditi-pandit changed the title [native] Add runtime-histogram-metrics-collection-enabled config [native] Add prometheus-histogram-metrics-collection-enabled config May 29, 2025
int64_t max,
const std::vector<int32_t>& pcts) const {
if (!enablePrometheusHistogramMetricCollection) {
VLOG(1) << "Prometheus histogram metrics collection is disabled";
Contributor

Please add the name of the disabled metric in the log message.

Contributor Author

The config will disable all histogram metrics, so it might not be necessary to list each of them individually in the log.
Currently there are 47 histogram metrics defined across presto_cpp and velox.
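
For reference, the reviewer's request could be met with a message that carries the metric name. The helper below is a hypothetical sketch, not code from the PR: `disabledHistogramMessage` and `registerHistogramOrLog` are invented names, and `std::clog` stands in for `VLOG(1)` so that the example has no glog dependency.

```cpp
#include <iostream>
#include <string>
#include <string_view>

// Hypothetical helper: builds the text that a VLOG(1) call could emit
// for a single skipped histogram metric.
inline std::string disabledHistogramMessage(std::string_view metricName) {
  std::string message =
      "Prometheus histogram metrics collection is disabled; "
      "skipping registration of ";
  message.append(metricName);
  return message;
}

// Hypothetical registration entry point. Returns true when the caller
// should go on to register the metric.
inline bool registerHistogramOrLog(
    bool histogramCollectionEnabled,
    std::string_view metricName) {
  if (!histogramCollectionEnabled) {
    std::clog << disabledHistogramMessage(metricName) << '\n';
    return false;
  }
  return true;
}
```

Because the check sits in the registration path, one line would be logged per registered histogram, which is the verbosity trade-off the author's reply points at.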

private PrestoNativeQueryRunnerUtils() {}

public static QueryRunner createQueryRunner(boolean addStorageFormatToPath, boolean isCoordinatorSidecarEnabled, boolean enableRuntimeMetricsCollection, boolean enableSsdCache)
public static QueryRunner createQueryRunner(boolean addStorageFormatToPath, boolean isCoordinatorSidecarEnabled, boolean enableRuntimeMetricsCollection, boolean enablePrometheusHistogramMetricsCollection, boolean enableSsdCache)
Contributor

Please wait for #25120, which adds builder-style usage for the QueryRunner constructor, and update the methods accordingly.

Contributor Author

@xin-zhang2 xin-zhang2 May 30, 2025

Sure, thanks for letting me know. Yes, the builder pattern will definitely make the creation of the QueryRunner easier. I’ll update the code once that PR is merged.

Contributor

@xin-zhang2 : That PR is merged. Please update this code.

Co-authored-by: Steve Burnett <burnett@pobox.com>
@xin-zhang2
Contributor Author

Closing, as the issue has been fixed by #24716.

@xin-zhang2 xin-zhang2 closed this Jul 2, 2025

Labels

from:IBM PR from IBM

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Native] Prometheus histogram metrics can cause significant performance degradation

5 participants