Streams: two additional Prometheus metrics for connections #10275
Conversation
This has conflicts with
great work :)
@@ -414,6 +428,8 @@ label(M) when is_map(M) ->
        end, <<>>, M);
label(#resource{virtual_host = VHost, kind = exchange, name = Name}) ->
    <<"vhost=\"", VHost/binary, "\",exchange=\"", Name/binary, "\"">>;
label({#resource{virtual_host = VHost, kind = queue, name = Name}, P, _}) when is_pid(P) ->
    <<"vhost=\"", VHost/binary, "\",queue=\"", Name/binary, "\",channel=\"", (iolist_to_binary(pid_to_list(P)))/binary, "\"">>;
Recently another PR, #9656, was merged which adds escaping to label values. I think you need to rebase to latest main and add escaping to this line as well (to the vhost and queue name values).
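As a rough illustration of what this suggestion amounts to (a sketch only; the actual helper introduced by #9656 may have a different name and live in a different module), the label clause from the diff could run the vhost and queue name through an escaper like this:

```erlang
%% Sketch: Prometheus text-format label values must escape backslash,
%% double-quote and newline. escape_label_value/1 is a hypothetical
%% helper standing in for the escaping added by #9656.
escape_label_value(Bin) when is_binary(Bin) ->
    << <<(escape_char(C))/binary>> || <<C>> <= Bin >>.

escape_char($\\) -> <<"\\\\">>;
escape_char($")  -> <<"\\\"">>;
escape_char($\n) -> <<"\\n">>;
escape_char(C)   -> <<C>>.

%% The clause from the diff, with the two user-controlled values
%% escaped (the pid string contains no characters needing escapes):
label({#resource{virtual_host = VHost, kind = queue, name = Name}, P, _}) when is_pid(P) ->
    <<"vhost=\"", (escape_label_value(VHost))/binary,
      "\",queue=\"", (escape_label_value(Name))/binary,
      "\",channel=\"", (iolist_to_binary(pid_to_list(P)))/binary, "\"">>.
```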
Thanks @gomoripeti, I completely missed this!
@@ -694,7 +743,9 @@ accumulate_count_and_sum(Value, {Count, Sum}) ->

empty(T) when T == channel_queue_exchange_metrics; T == channel_process_metrics; T == queue_consumer_count ->
    {T, 0};
empty(T) when T == connection_coarse_metrics; T == auth_attempt_metrics; T == auth_attempt_detailed_metrics ->
empty(T) when T == connection_coarse_metrics; T == rabbit_stream_publisher_created;
Similarly, `rabbit_stream_publisher_created` is not needed here.
@@ -512,6 +528,35 @@ get_data(channel_metrics = Table, false, _, _) ->
    [{Table, [{consumer_count, A1}, {messages_unacknowledged, A2}, {messages_unconfirmed, A3},
              {messages_uncommitted, A4}, {acks_uncommitted, A5}, {prefetch_count, A6},
              {global_prefetch_count, A7}]}];
get_data(rabbit_stream_publisher_created = Table, false, _, _) ->
I think there is some rebase or copy-paste issue here. This function clause is not necessary; the next one covers publisher metrics.
There are valid test failures. We are working on fixing them...
After @gomoripeti's push above, I believe this PR is ready for review, as all Prometheus-related tests are passing and conflicts are resolved. (It looks like the failing tests are related to MQTT_V5.)
Hi! This looks very good, thank you for implementing it. What do you think of the idea to expose the current segment count as well?
Exposing the current segment count sounds OK to me.
That sounds good! Perhaps I could look into that in a different PR once this gets merged :)
The force-push was a rebase to latest main. google-github-actions/auth fails in CI.
CI fails because external PRs do not have access to the necessary secrets. We will run the tests locally.
I cannot rebase this PR onto main without resolving a boatload of conflicts. Specifically, I have given up on the 7th. It's probably based on a revision that's a month or so out of date.
Indeed, https://github.com/cloudamqp/rabbitmq-server/ still uses
Rebasing it, then rebasing this PR onto the result, will conflict with #11431, but that should be it.
I used the GitHub UI button "Rebase" previously; sorry if it caused some trouble. Now I have manually rebased to latest main @ rabbitmq/rabbitmq-server and fixed the conflict with #11431.
Is this PR still active, and can it be accepted? If not, can I create a new PR for the same purpose? I plan to use stream queues and really need to monitor consumer lag. If this PR is accepted, I am willing to create a new PR to collect this metric (consumer lag) in Datadog.
Thanks for the ping. I guess this PR accidentally got forgotten by the Core Team. I rebased to latest main.
I noticed that the current implementation uses vhost, queue, and connection PID as labels. Instead of, or in addition to, the PID, it would be better to include publisher_id/subscription_id as a label on the per-object metrics. I'll work on this.
Huh, anyway. David is right, the cardinality is going to be too big.
Thanks for the valuable input. As streams are long-lived, would per-stream metrics (no connection and pub/sub id labels) be acceptable?
Yeah, I would start with describing user stories around these new metrics. I like max_offset_lag per stream, since I can imagine setting an alert on it. The rest I don't know, so how would they be used? As for metrics granularity ("per-object", etc.), would it make more sense to have only reasonable aggregates in Prometheus, and refer to the Management UI for granularity when an event occurred?
Exposing per-object Prometheus metrics for streams is acceptable from my point of view, similar to how exposing per-object metrics for quorum queues is acceptable. These are usually rather long-lived entities. On the other hand, a connection identified by its Pid is what I consider short-lived.
I think the urge to add high-cardinality metrics to the Prometheus endpoint comes from the feeling that metrics in the mgmt plugin will be deprecated and removed eventually. (I think it is still not clear how and to what extent this will happen. As a user, I really like the graphs with some history on a per-connection or per-queue basis. As a maintainer or contributor it is quite a complexity, so I agree it should be replaced somehow.) For use cases I think
We don't have a clear idea of what we want to do about management metrics at this point, so they are unlikely to be removed anytime soon. Prometheus was considered as an alternative way to get the metrics out, but this cardinality issue makes it a bad candidate. So we are still at square one in that regard: we want something better than management metrics currently are, but we don't know how to go about it.
Removal of management UI metrics is not something we have discussed in many months, not something planned for the foreseeable future, and the last time it was discussed, it was a discussion about undeprecating it.
So what I can do is add metrics which are sums per stream:
and a max per stream for
Or, if there is uncertainty whether the former are useful enough, those can be postponed, and only add
I would start with max lag first.
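To make the per-stream max aggregation concrete, here is a minimal sketch. The table layout is an assumption for illustration (rows keyed by `{QueueName, ConnectionPid, SubscriptionId}` with an `offset_lag` entry in a property list); the real consumer metrics table in the PR may differ.

```erlang
%% Hedged sketch: fold over a (hypothetical) consumer metrics ETS table
%% and keep the maximum offset_lag seen per stream. Table layout is
%% assumed, not taken from the PR.
max_offset_lag_per_stream(Table) ->
    ets:foldl(
      fun({{QName, _ConnPid, _SubId}, Props}, Acc) ->
              Lag = proplists:get_value(offset_lag, Props, 0),
              maps:update_with(QName, fun(Cur) -> max(Cur, Lag) end, Lag, Acc)
      end, #{}, Table).
```

Each map entry would then be emitted as one `max_offset_lag` sample, labelled only with the stream's vhost and queue name, which keeps cardinality bounded by the number of streams.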
I updated this PR to only include the max_offset_lag metric.
It's not recompiled. You need to remove the test on TEST and instead add a comment just above this function's export that says so. Functions that are hidden behind TEST in rabbit are often a misuse of test builds that we have noticed and begun correcting.
So that they can be used from multiple test suites.
Supports both per stream (detailed) and aggregated (metrics) values.
The application is not always recompiled which causes tests to fail because they cannot call `serial_number:usort/1`.
Rebased to latest main. This is ready for another round of review. The failing test cases seem unrelated to me.
I have received no objections to this PR from other core team members.
This seems safe to backport to
#12765 is in, the core team is now discussing whether it is something we should backport to
Proposed Changes
This PR adds support for exposing metrics from the `rabbit_stream_consumer_created` and `rabbit_stream_publisher_created` ETS tables through Prometheus, via more user-friendly named metrics endpoints. Based off the conversation in a previous PR: #3043.

The change is requested to ensure customers using stream connections can scrape metrics from the Prometheus endpoints.
Types of Changes

What types of changes does your code introduce to this project?
Put an `x` in the boxes that apply.

Checklist

Put an `x` in the boxes that apply. You can also fill these out after creating the PR.
If you're unsure about any of them, don't hesitate to ask on the mailing list.
We're here to help!
This is simply a reminder of what we are going to look for before merging your code.

CONTRIBUTING.md document

Further Comments
If the PR is accepted, I think this should also be pushed to the documentation on the rabbitmq-website. Do I create a PR to that repo once this is merged, or is there another procedure to follow?
Thanks!