[WIP] Expose Rest API to return metrics in prometheus format#21599
[WIP] Expose Rest API to return metrics in prometheus format#21599karteekmurthys wants to merge 6 commits intomasterfrom
Conversation
|
bc7f6a3 to
87e94ee
Compare
|
Ahana is not a great name for this. How about PrometheusStatsReporter? |
87e94ee to
1cabfe5
Compare
| } | ||
|
|
||
| void PrestoServer::reportHealthMetrics(proxygen::ResponseHandler* downstream) { | ||
| auto reporter = std::dynamic_pointer_cast<ahana::PrometheusStatsReporter>( |
There was a problem hiding this comment.
Remove the ahana namespace.
|
|
||
| namespace ahana { | ||
|
|
||
| class AhanaStatsReporterTest : public testing::Test { |
There was a problem hiding this comment.
Remove the use of the ahana namespace and use PrometheusStatsReporterTest as the name of the test instead.
|
|
||
| #include "PrometheusStatsReporter.h" | ||
|
|
||
| namespace ahana { |
There was a problem hiding this comment.
This class should be under facebook::presto namespace.
| const std::vector<int32_t>& /* pcts */) const override {} | ||
|
|
||
| void addMetricValue(const std::string& key, size_t value = 1) | ||
| const override{}; |
There was a problem hiding this comment.
Why is this an empty method ? We should do the same logic as the other addMetricValue methods.
bdff158 to
13e58ac
Compare
|
@karteekmurthys : Please add some documentation. The doc should have information about:
In addition, I feel that Prometheus StatsReporter should be added based on some config (and not default). |
|
@karteekmurthys let me know when you have added the documentation and I'll review it then. If I can help you get the docs drafted into review-ready status, just ask! |
@steveburnett thanks! I am currently drafting it. Will reach out to you soon. cc: @aditi-pandit |
7b8851f to
b01736d
Compare
daca1b2 to
0effee9
Compare
13e58ac to
4dc5230
Compare
6e46e4d to
725c364
Compare
There was a problem hiding this comment.
Have you considered using https://github.com/jupp0r/prometheus-cpp or some opensource implementation for stats collection? If not any specific reason?
There was a problem hiding this comment.
The data model is not too complex to rely on a library at this point. We want to keep external dependencies minimal as well. If we discover that writing serialization ourselves is getting complex we may consider using the library.
The library has additional support to create exporter as well as a push support to a gateway.
void RegisterCollectable(const std::weak_ptr<Collectable>& collectable,
const std::string& uri = std::string("/metrics"));
void RegisterAuth(
std::function<bool(const std::string&, const std::string&)> authCB,
const std::string& realm = "Prometheus-cpp Exporter",
const std::string& uri = std::string("/metrics"));
I don't think we need these capabilities at this point.
There was a problem hiding this comment.
On the other hand, it supports all metrics, is production tested and widely used (though like an unofficial library)
There was a problem hiding this comment.
I have created a draft PR - #21753, we believe that using the library is the fastest way to production since its production tested. Not sure how to integrate
There was a problem hiding this comment.
Using the library seems better if it is production-ready and widely used.
This can be an optional dependency that we can turn on/off at compile time.
I can help with the dependency management.
There was a problem hiding this comment.
@karteekmurthys I believe Jay already has a PR in progress here #21753
We should base our work on that.
You can help with the dependency management in that PR.
There was a problem hiding this comment.
@jaystarshot I took a shot at using prometheus-cpp in this PR itself. I have added a compile time settings to include prometheus-directory. Added some unit tests. I will further modify the circle ci configs to run unit tests for this change. Feel free to use this PR for your testing. I will be opening an RFC soon to get this merged with Presto.
There was a problem hiding this comment.
Have you tried this in local? I don't think this will work since there is no exposer defined. My PR #21753 shows how to do that. Can you share the RFC with me too? I can also input from my end thanks
There was a problem hiding this comment.
We don't need an Exposer. We already expose a REST API, check PrestoServer.cpp changes in this PR.
You can configure your Prometheus server to scrape the enpoint: /v1/info/health/metrics exposed by the worker.
Or you can configure whatever intermediary process to scrape from worker endpoint.
Missed adding the test results in previous comment, here it is: prometheus-format-metrics.txt
There was a problem hiding this comment.
I see thanks! I think having an exposer is better since it separates the http server and makes it configurable based on the number of threads etc. In future we can also move this easily to a sidecar. We can sync offline regarding the RFC too if you have a draft prepared already
725c364 to
892525f
Compare
4dc5230 to
e16ad47
Compare
e16ad47 to
a4dc38c
Compare
892525f to
571c2c8
Compare
There was a problem hiding this comment.
let's move the Docker/Jenkins changes to its own PR.
d0e3abb to
663fe40
Compare
076f229 to
f1fb59f
Compare
There was a problem hiding this comment.
@karteekmurthys Karteek, let me add this config in a separte PR. Inside Meta there is a small wrapper on PrestoServer that does following:
if (gflag enabled):
initialize stats reporting
initialize presto server
I think we can use the config and remove the internal flag altogether. Let me introduce this flag in a PR and modify internal code. You can rebase on top of that.
There was a problem hiding this comment.
@karteekmurthys Here is the small PR
#21745
Once we merge it, I will refactor internal logic to use the cofnig property.
1f43cb8 to
0046c30
Compare
0046c30 to
01bb923
Compare
01bb923 to
3ce8d6c
Compare
|
|
||
| private: | ||
| /// Mapping of registered stats key to StatType. | ||
| mutable std::unordered_map<std::string, facebook::velox::StatType> |
There was a problem hiding this comment.
Use mutable folly::ConcurrentHashMap
There was a problem hiding this comment.
we normally use folly::Synchronized<> to do this. Lmk revisit this after finishing the RFC. I wanted to unblock any testing you wanted to do, so focussed on getting a working version. I need to clean up the tests as well.
| /// A mapping from stats key of type COUNT to value. | ||
| mutable std::unordered_map<std::string, size_t> metricsMap_; | ||
| // Mutex to control access to registeredStats_ and metricMap_ members. | ||
| mutable std::mutex mutex_; |
| proxygen::HTTPMessage* /*message*/, | ||
| const std::vector<std::unique_ptr<folly::IOBuf>>& /*body*/, | ||
| proxygen::ResponseHandler* downstream) { | ||
| server->reportHealthMetrics(downstream); |
There was a problem hiding this comment.
Why do we need this? If we just define an exposer that will create another http server on /metrics or /v1/metrics? We can also define the number of threads to be used by the new http server
There was a problem hiding this comment.
That is exactly what we are trying to avoid. We didn't find a strong need to start a new server sharing space with our worker containers. Reduces the overhead of maintaining and launching another process. Why do you want multiple threads to be spawned by the exposer? are you expecting high traffic.
CMIW, prometheus server is configured to periodically call the scrape endpoint. So, we can expect 1 HTTP request at X seconds interval. If you are expecting huge traffic, then it is not recommended to share space with worker instance.
There was a problem hiding this comment.
I think isolation is also one point, we do not want to degrade the presto process as much as we can
There was a problem hiding this comment.
@amitkdutta : What approach did Meta take for this ? Does your metric collection use any endpoint in Presto worker process itself, or did you start another server ?
|
Moved it here: #22360 (review) |
Description
This PR is still a work-in-progress.
There are 2 approaches implemented in this PR:
This PR implements the BaseStatsReporter interface as StatsReporterImpl. The current design of StatsReporterImpl holds the metrics in-memory and exposes a method getMetrics(MetricsSerializer) that serializes the metrics with the serializer passed down to the function call. MetricsSerializer is an interface that users can implement to customize metric serialization. This approach doesn't support Histogram
Implements PrometheusReporter using the prometheus-cpp library. The library has interfaces to create all metric types (COUNTER, GAUGE, Histograms and Summaries). The library exposes a Registry which can be used to hold the references to the counters. Also, a textSerializer interface that can convert the metrics into Prometheus Data Model.
We may chose one of the above or both.
To the outside world prestissimo exposes a rest API
/v1/info/health/metrics, which in turn uses StatsReporterImpl::getMetrics(MetricsSerializer).The Prometheus server must be configured to regularly probe this Prestissimo rest API to collect metrics.
Test Plan
Added a unit test class for StatsReporterImpl and also PrometheusReporter.
Also, Tested REST API locally and results are attached here.
prometheus-format-metrics.txt using prometheus-cpp library
Contributor checklist