[native] Add prometheus-histogram-metrics-collection-enabled config#25065
[native] Add prometheus-histogram-metrics-collection-enabled config#25065xin-zhang2 wants to merge 1 commit intoprestodb:masterfrom
Conversation
96a0153 to
a5c9b54
Compare
steveburnett
left a comment
There was a problem hiding this comment.
Thanks for the doc! Just two formatting nits.
a07211b to
658e167
Compare
steveburnett
left a comment
There was a problem hiding this comment.
LGTM! (docs)
Pull updated branch, new local doc build. Thanks!
aditi-pandit
left a comment
There was a problem hiding this comment.
Thanks @xin-zhang2. Have few comments.
| int64_t min, | ||
| int64_t max, | ||
| const std::vector<int32_t>& pcts) const { | ||
| if (!enableHistogramMetricCollection) { |
There was a problem hiding this comment.
Do you really want to disable registering the metrics as well ? We could only disable adding metric values to keep this simple for the future.
There was a problem hiding this comment.
Thanks for the review.
I’m inclined to disable both. If we only disable adding metric values, those metrics would still appear in the API output, which could be confusing — they’d show up in the response but wouldn’t actually be collected.
Do you think it would make more sense to only disable the metric collection?
There was a problem hiding this comment.
We should disable registering as well. We should add a VLOG(1) that this histogram metric is disabled.
There was a problem hiding this comment.
My thinking is that users should be aware that the histogram metric exists but we are not collecting it on account of performance concerns. Its okay if its not populated in graphs that are exposed by tooling.
If a user feels that they want to see the values, then they can turn it on as a session property.
If we disable registering then a server restart is needed each time we want to turn on/off the histrogram collection.
There was a problem hiding this comment.
I am not sure if we want to enable/disable this at the session level. The downside of always enabling is that histograms will always be returned empty, and it might be confusing in certain settings. This fix is temporary. We need to fix the root cause of the performance degradation with histograms.
There was a problem hiding this comment.
It will be good to investigate the performance degradation with histograms and be committed to fixing it. If we are not able to fix it, then it would be better to expose the histogram metrics to the user as empty and turn them on/off for debugging.
| static constexpr std::string_view kEnableRuntimeMetricsCollection{ | ||
| "runtime-metrics-collection-enabled"}; | ||
|
|
||
| static constexpr std::string_view kEnableRuntimeHistogramMetricsCollection{ |
There was a problem hiding this comment.
Should we explicitly use prometheus in the same since this config is only for Prometheus stats ?
There was a problem hiding this comment.
This config is only for PrometheusStatsReporter and it makes sense to use prometheus in the name.
My thinking is that since PrometheusStatsReporter is the only implementation of StatReporter in Presto when the metrics collection is enabled, we can avoid referencing prometheus in Configs.h that is in the common directory.
There was a problem hiding this comment.
@xin-zhang2 : Meta has their own implementation of metrics collection.
Though you have a valid point about the name in Configs.h.
Lets keep it as is.
There was a problem hiding this comment.
We should add Prometheus as the issue is specific to Prometheus. We are only disabling histogram metrics for Prometheus.
We could add class PrometheusConfig : public ConfigBase { to make this clean.
There was a problem hiding this comment.
It would be good not to have too many config objects, as configs have to be copied in QueryContextManager
Though I also lean that having prometheus in the name is better.
There was a problem hiding this comment.
as configs have to be copied in QueryContextManager
We don't have to in this case.
| int64_t min, | ||
| int64_t max, | ||
| const std::vector<int32_t>& pcts) const { | ||
| if (!enableHistogramMetricCollection) { |
There was a problem hiding this comment.
We should disable registering as well. We should add a VLOG(1) that this histogram metric is disabled.
| When enabled and Presto C++ workers interact with the S3 filesystem, additional runtime metrics are collected. | ||
| For a detailed list of these metrics, see `S3 FileSystem <https://facebookincubator.github.io/velox/monitoring/metrics.html#s3-filesystem>`_. | ||
|
|
||
| ``runtime-histogram-metrics-collection-enabled`` |
There was a problem hiding this comment.
The issue is specific to prometheus. Meta's implementation does not have this issue.
Let's name this prometheus-histogram-metrics-collection-enabled. Move the documentation close to prometheus.
| Enables collection of worker level metrics. | ||
|
|
||
| ``runtime-histogram-metrics-collection-enabled`` | ||
| ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ |
There was a problem hiding this comment.
Do we not document prometheus here?
| * **Type:** ``boolean`` | ||
| * **Default value:** ``false`` | ||
|
|
||
| Enable histogram metrics collection. This property is only valid when ``runtime-metrics-collection-enabled`` is set to ``true``. No newline at end of file |
There was a problem hiding this comment.
Clarify that there can be a significant performance degradation when enabled.
| if (enableRuntimeMetricsCollection) { | ||
| configProperties = format("%s%n" + | ||
| "runtime-metrics-collection-enabled=true%n", configProperties); | ||
| configProperties = format("%s%n" + |
There was a problem hiding this comment.
This code is used for all the tests and QueryRunners...Why are you enabling histogram metrics collection here ? It might be better to disable here as well.
There was a problem hiding this comment.
My thinking is that these are for non-production environemnt, so I kept the behavior unchanged from before we introduced the new config.
Do you think it would be better to disable it? enableRuntimeMetricsCollection is now passed as a function parameter, and we could add another parameter to the function for the histogram metrics.
presto-native-execution/README.md
Outdated
|
|
||
| ``` | ||
| runtime-metrics-collection-enabled=true | ||
| runtime-histogram-metrics-collection-enabled=true |
There was a problem hiding this comment.
Do we suggest turning this on here ? It might be better to turn off.
There was a problem hiding this comment.
This doc is meant to show how to enable the metrics collection when building Prestissimo from source, so I guess it make sense to set it to true here.
| * **Type:** ``boolean`` | ||
| * **Default value:** ``false`` | ||
|
|
||
| Enable histogram metrics collection. This property is only valid when ``runtime-metrics-collection-enabled`` is set to ``true``. No newline at end of file |
There was a problem hiding this comment.
Would be great to have examples of the current histogram metrics that will not be collected as well.
| http-server.http.port=7777 | ||
| shutdown-onset-sec=1 | ||
| runtime-metrics-collection-enabled=true | ||
| runtime-histogram-metrics-collection-enabled=true |
There was a problem hiding this comment.
This needn't be added here as well. We could continue with the default to disable histogram metrics collection.
There was a problem hiding this comment.
This properties file is used by the QueryRunners for non-production environment, so I left its behavior unchanged from before the introduction of the new config.
| int64_t min, | ||
| int64_t max, | ||
| const std::vector<int32_t>& pcts) const { | ||
| if (!enableHistogramMetricCollection) { |
There was a problem hiding this comment.
My thinking is that users should be aware that the histogram metric exists but we are not collecting it on account of performance concerns. Its okay if its not populated in graphs that are exposed by tooling.
If a user feels that they want to see the values, then they can turn it on as a session property.
If we disable registering then a server restart is needed each time we want to turn on/off the histrogram collection.
| static constexpr std::string_view kEnableRuntimeMetricsCollection{ | ||
| "runtime-metrics-collection-enabled"}; | ||
|
|
||
| static constexpr std::string_view kEnableRuntimeHistogramMetricsCollection{ |
There was a problem hiding this comment.
It would be good not to have too many config objects, as configs have to be copied in QueryContextManager
Though I also lean that having prometheus in the name is better.
|
Made some changes based on the review comments.
Currently the registration and collection are both disabled. Not sure if we need a session property enable toggling it at runtime. #23338 mentioned that a metric updated in the query path would degrade the performance. So I suspect we run into the same issue because kMetricTableScanBatchProcessTimeMs is updated in getOutput() of TableScan (https://github.com/facebookincubator/velox/pull/12759/files#diff-b04232f155cc15faca93b69cf34dd80e29b534a4e54dd774bf8bdbedbfc66d6bR286). |
| int64_t max, | ||
| const std::vector<int32_t>& pcts) const { | ||
| if (!enablePrometheusHistogramMetricCollection) { | ||
| VLOG(1) << "Prometheus histogram metrics collection is disabled"; |
There was a problem hiding this comment.
Please add the name of the disabled metric in the log message.
There was a problem hiding this comment.
The config will disable all histogram metrics, so it might not be necessary to list each of them individually in the log.
Currently there are 47 histogram metrics defined across presto_cpp and velox.
| private PrestoNativeQueryRunnerUtils() {} | ||
|
|
||
| public static QueryRunner createQueryRunner(boolean addStorageFormatToPath, boolean isCoordinatorSidecarEnabled, boolean enableRuntimeMetricsCollection, boolean enableSsdCache) | ||
| public static QueryRunner createQueryRunner(boolean addStorageFormatToPath, boolean isCoordinatorSidecarEnabled, boolean enableRuntimeMetricsCollection, boolean enablePrometheusHistogramMetricsCollection, boolean enableSsdCache) |
There was a problem hiding this comment.
Please wait for #25120 that adds builder style usage for QueryRunner constructor and update the methods accordingly.
There was a problem hiding this comment.
Sure, thanks for letting me know. Yes, the builder pattern will definitely make the creation of the QueryRunner easier. I’ll update the code once that PR is merged.
There was a problem hiding this comment.
@xin-zhang2 : That PR is merged. Please update this code.
Co-authored-by: Steve Burnett <burnett@pobox.com>
f50e346 to
deb8093
Compare
|
Close as the issue has been fixed by #24716. |
Description
Fixed #25058
Motivation and Context
Impact
Test Plan
Contributor checklist
Release Notes
Please follow release notes guidelines and fill in the release notes below.
If release note is NOT required, use: