
Streams: two additional Prometheus metrics for connections #10275

Conversation

markus812498
Contributor

Proposed Changes

This PR adds support for metrics from the rabbit_stream_consumer_created and rabbit_stream_publisher_created ETS tables to be exposed through the Prometheus endpoints under more user-friendly metric names. Based on the conversation in a previous PR: #3043

The change was requested to ensure that customers using stream connections can scrape these metrics from the Prometheus endpoints.
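
As a rough illustration of the mechanism only (not the PR's actual code): the collector can fold over one of these ETS tables and emit one value per publisher. The table layout, key shape, and the published field shown below are assumptions made for the sketch.

stream_publisher_published() ->
    %% Emit one {StreamResource, PublishedCount} pair per publisher entry.
    ets:foldl(
      fun({{QueueRes, _ConnPid, _PubId}, Props}, Acc) ->
              Published = proplists:get_value(published, Props, 0),
              [{QueueRes, Published} | Acc]
      end, [], rabbit_stream_publisher_created).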

Types of Changes

What types of changes does your code introduce to this project?
Put an x in the boxes that apply

  • Bug fix (non-breaking change which fixes issue #NNNN)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause an observable behavior change in existing systems)
  • Documentation improvements (corrections, new content, etc)
  • Cosmetic change (whitespace, formatting, etc)
  • Build system and/or CI

Checklist

Put an x in the boxes that apply.
You can also fill these out after creating the PR.
If you're unsure about any of them, don't hesitate to ask on the mailing list.
We're here to help!
This is simply a reminder of what we are going to look for before merging your code.

Further Comments

If this PR is accepted, I think this should also be pushed to the documentation on the rabbitmq-website. Do I create a PR against that repo once this is merged, or is there another procedure to follow?

Thanks!

@markus812498 markus812498 marked this pull request as ready for review January 4, 2024 01:55
@michaelklishin
Member

This has conflicts with main and needs rebasing.

Contributor

@gomoripeti gomoripeti left a comment

great work :)

@@ -414,6 +428,8 @@ label(M) when is_map(M) ->
end, <<>>, M);
label(#resource{virtual_host = VHost, kind = exchange, name = Name}) ->
<<"vhost=\"", VHost/binary, "\",exchange=\"", Name/binary, "\"">>;
label({#resource{virtual_host = VHost, kind = queue, name = Name}, P, _}) when is_pid(P) ->
<<"vhost=\"", VHost/binary, "\",queue=\"", Name/binary, "\"","channel=\"",(iolist_to_binary(pid_to_list(P)))/binary, "\"">>;
Contributor

Recently another PR, #9656, was merged, which adds escaping to label values. I think you need to rebase onto latest main and add escaping to this line as well (to the vhost and queue name values).
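
For illustration, the escaped clause could look roughly like the sketch below, assuming #9656 introduced a helper along the lines of escape_label_value/1 (the helper name is an assumption here, not necessarily the actual function from that PR). The sketch also adds the comma that appears to be missing between the queue and channel labels in the line above.

label({#resource{virtual_host = VHost, kind = queue, name = Name}, P, _}) when is_pid(P) ->
    %% Escape user-controlled values so quotes or backslashes in vhost or
    %% queue names cannot corrupt the Prometheus exposition format.
    <<"vhost=\"", (escape_label_value(VHost))/binary,
      "\",queue=\"", (escape_label_value(Name))/binary,
      "\",channel=\"", (iolist_to_binary(pid_to_list(P)))/binary, "\"">>;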

Contributor Author

Thanks @gomoripeti, I completely missed this!

@@ -694,7 +743,9 @@ accumulate_count_and_sum(Value, {Count, Sum}) ->

empty(T) when T == channel_queue_exchange_metrics; T == channel_process_metrics; T == queue_consumer_count ->
{T, 0};
-empty(T) when T == connection_coarse_metrics; T == auth_attempt_metrics; T == auth_attempt_detailed_metrics ->
+empty(T) when T == connection_coarse_metrics; T == rabbit_stream_publisher_created;
Contributor

Similarly, rabbit_stream_publisher_created is not needed here.

@@ -512,6 +528,35 @@ get_data(channel_metrics = Table, false, _, _) ->
[{Table, [{consumer_count, A1}, {messages_unacknowledged, A2}, {messages_unconfirmed, A3},
{messages_uncommitted, A4}, {acks_uncommitted, A5}, {prefetch_count, A6},
{global_prefetch_count, A7}]}];
get_data(rabbit_stream_publisher_created = Table, false, _, _) ->
Contributor

I think there is some rebase or copy-paste issue here. This function clause is not necessary; the next one covers publisher metrics.

@markus812498 markus812498 force-pushed the expose_stream_offset_lag_metric_to_prometheus branch 4 times, most recently from ce61937 to d50dd43 Compare January 7, 2024 23:42
@gomoripeti gomoripeti force-pushed the expose_stream_offset_lag_metric_to_prometheus branch from d50dd43 to 051c917 Compare January 8, 2024 19:06
@gomoripeti
Contributor

There are valid test failures. We are working on fixing them...

@gomoripeti gomoripeti force-pushed the expose_stream_offset_lag_metric_to_prometheus branch from 051c917 to 66ec336 Compare January 9, 2024 16:24
@markus812498
Contributor Author

After @gomoripeti's push above, I believe this PR is ready for review: all Prometheus-related tests pass and the conflicts are resolved. (The remaining failing tests look related to MQTT_V5.)

@luos
Contributor

luos commented Feb 13, 2024

Hi!

This looks very good, thank you for implementing it.

What do you think of the idea of also exposing the segment count for a stream? The rough stream size could then be calculated as max-segment-size * segments. (Maybe in a separate PR.)
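
For example (hypothetical numbers, just to illustrate the calculation): with a max-segment-size of 500 MB and 12 segments on disk, the estimated stream size would be roughly 500 MB * 12 = 6 GB.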

@michaelklishin
Member

Exposing the current segment count sounds OK to me.

@markus812498
Contributor Author

That sounds good! Perhaps I could look into that in a separate PR once this one gets merged :)

@markus812498 markus812498 force-pushed the expose_stream_offset_lag_metric_to_prometheus branch from bbf276e to b59730d Compare February 26, 2024 03:41
@gomoripeti gomoripeti force-pushed the expose_stream_offset_lag_metric_to_prometheus branch from b59730d to 2affbe8 Compare June 21, 2024 07:19
@gomoripeti
Contributor

The force-push was a rebase onto latest main. google-github-actions/auth fails in CI.

@michaelklishin
Member

CI fails because external PRs do not have access to the necessary secrets. We will run the tests locally.

Member

@michaelklishin michaelklishin left a comment

I cannot rebase this PR onto main without resolving a boatload of conflicts.

Specifically, I have given up on the 7th conflict.

It's probably based on a revision that's a month or so out of date.

@michaelklishin
Member

Indeed, https://github.com/cloudamqp/rabbitmq-server/ still uses master as its default branch, and https://github.com/cloudamqp/rabbitmq-server/tree/main specifically is from May 15th.

@michaelklishin
Member

Rebasing cloudamqp/rabbitmq-server@main onto rabbitmq/rabbitmq-server@main is a fast-forward.

Then rebasing this PR onto the result will conflict with #11431, but that should be it.

@gomoripeti gomoripeti force-pushed the expose_stream_offset_lag_metric_to_prometheus branch from 2affbe8 to 82eaa32 Compare June 24, 2024 09:49
@gomoripeti
Contributor

I used the GitHub UI's "Rebase" button previously; sorry if it caused some trouble.

Now I manually rebased onto latest main @ rabbitmq/rabbitmq-server and fixed the conflict with #11431.

@pad-master82

Is this PR still active, and can it be accepted? If not, can I create a new PR for the same purpose? I plan to use stream queues and really need to monitor consumer lag. If this PR is accepted, I am willing to create a new PR to collect this metric (consumer lag) in Datadog.

@gomoripeti gomoripeti force-pushed the expose_stream_offset_lag_metric_to_prometheus branch from 82eaa32 to 289029f Compare October 30, 2024 13:11
@gomoripeti
Contributor

Thanks for the ping. I guess this PR was accidentally forgotten by the Core Team. I rebased onto latest main.

@gomoripeti
Contributor

I noticed that the current implementation uses vhost, queue, and connection PID as labels. Instead of, or in addition to, the PID it would be better to include publisher_id/subscription_id as a label on the per-object metrics. I'll work on this.

@michaelklishin michaelklishin changed the title Added support for exposing Stream Connection metrics Streams: two additional Prometheus metrics for connections Oct 30, 2024
@ikavgo ikavgo closed this Nov 5, 2024
@ikavgo ikavgo reopened this Nov 5, 2024
@ikavgo
Contributor

ikavgo commented Nov 5, 2024

Huh, anyway. David is right, the cardinality is going to be too big.

@gomoripeti
Contributor

Thanks for the valuable input. As streams are long-lived, would per-stream metrics (no connection and pub/sub id labels) be acceptable?

  • I realise that consumer offset does not make sense aggregated
  • It would be good to get some info on offset lag; it is definitely useful to see whether it is zero or not. The sum of offset_lag might not be the best aggregation, as it can be higher when there are a lot of consumers. Maybe "max offset_lag" per stream, and a max per node? (a gauge; see the sketch after this list)
  • For published, confirmed and publish errors, the sum per stream sounds useful. (But these counters have the usual problem that if a publisher goes away, they decrease)
  • The total number of messages consumed per stream might make sense (although with the non-destructive nature of streams this might be less meaningful). (It also has the problem that if a consumer goes away, this counter decreases)
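
A rough sketch of how such a per-stream max could be computed from the consumer table (the table name comes from this PR, but the key shape and the offset_lag field are assumptions for illustration, not the actual implementation):

max_offset_lag_per_stream() ->
    %% Fold over all consumer entries, keeping the highest offset lag
    %% seen per stream; the result maps stream resource -> max lag.
    ets:foldl(
      fun({{QueueRes, _ConnPid, _SubId}, Props}, Acc) ->
              Lag = proplists:get_value(offset_lag, Props, 0),
              maps:update_with(QueueRes, fun(Max) -> max(Max, Lag) end, Lag, Acc)
      end, #{}, rabbit_stream_consumer_created).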

@ikavgo
Contributor

ikavgo commented Nov 5, 2024

Yeah, I would start with describing user stories around these new metrics. I like max_offset_lag per stream, since I can imagine setting an alert on it. The rest I don't know.

So how would they be used?

As for metrics granularity ("per-object", etc.), would it make more sense to have only reasonable aggregates in Prometheus and refer to the Management UI for granularity when an event occurred?

@ansd
Member

ansd commented Nov 5, 2024

As streams are long lived would per-stream (no connection and pub/sub id labels) metrics be acceptable?

Exposing per-object Prometheus metrics for streams is acceptable from my point of view, similar to how exposing per-object metrics for quorum queues is acceptable. These are usually rather long-lived entities. On the other hand, a connection identified by its Pid is what I consider short-lived.

@gomoripeti
Contributor

I think the urge to add high-cardinality metrics to the Prometheus endpoint comes from the feeling that metrics in the management plugin will eventually be deprecated and removed. (I think it is still not clear how and to what extent this will happen. As a user, I really like the graphs with some history on a per-connection or per-queue basis. As a maintainer or contributor it is quite a lot of complexity, so I agree it should be replaced somehow.)

For use cases, I think stream_publisher_published_total and stream_consumer_consumed_total have the same role as queue_exchange_messages_published_total and channel_messages_delivered_total: first we can look at global messages received and delivered to get a sense of node traffic, then drill down to answer which were the busiest streams during a traffic spike (in the past, i.e. historic data).

@lhoguin
Contributor

lhoguin commented Nov 5, 2024

We don't have a clear idea of what we want to do about management metrics at this point, so they are unlikely to be removed anytime soon. Prometheus was considered to be an alternative way to get the metrics out, but this cardinality issue makes it a bad candidate. So we are still at square one in that regard: we want something better than what management metrics currently are, but we don't know how to go about it.

@michaelklishin
Member

Removal of management UI metrics is not something we have discussed in many months, nor something planned for the foreseeable future; the last time it came up, the discussion was about undeprecating them.

@gomoripeti
Contributor

So what I can do is add metrics which are summed per stream:

  • stream_publisher_published_total
  • stream_publisher_confirmed_total
  • stream_publisher_error_messages
  • stream_consumer_consumed_total

and a per-stream max for

  • stream_consumer_max_offset_lag

Or, if there is uncertainty about whether the former are useful enough, those can be postponed, and only stream_consumer_max_offset_lag added for now?
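
For illustration, assuming the plugin's usual rabbitmq_ prefix and per-stream vhost/queue labels (both assumptions here, as the final naming was still under discussion), a scrape might then contain a line along these lines (the value is hypothetical):

rabbitmq_stream_consumer_max_offset_lag{vhost="/",queue="my-stream"} 42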

@deadtrickster
Contributor

I would start with max lag first.

@gomoripeti gomoripeti force-pushed the expose_stream_offset_lag_metric_to_prometheus branch from 142a219 to 6aae7be Compare November 8, 2024 19:24
@gomoripeti
Contributor

I updated this PR to only include the max_offset_lag metric.
But I still have to fix the tests (issues with dependencies between apps).

@gomoripeti
Contributor

The amqp10_common test fails consistently with the error below.
serial_number:usort/1 is only exported if compiled with the TEST macro.
I wonder if the failure is because, now that amqp10_common is a dependency of rabbitmq_ct_helpers, it is first compiled without TEST and then not recompiled with TEST defined. I am not sure if this is how erlang.mk works, and if so, how to work around it.

serial_number_SUITE > test_usort
    #1. {error,
            {undef,
                [{serial_number,usort,[[]],[]},
                 {serial_number_SUITE,test_usort,1,
                     [{file,"serial_number_SUITE.erl"},{line,65}]},

@lhoguin
Contributor

lhoguin commented Nov 12, 2024

It's not recompiled. You need to remove the check on TEST and instead add a comment just above this function's export that says %% For tests.

Functions that are hidden behind TEST in rabbit are often a misuse of test builds that we have noticed and begun correcting.
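
Concretely, the change to serial_number.erl would look roughly like this sketch of the pattern described above (not the exact diff):

%% Before: the export is only visible in TEST builds, so callers
%% compiled without TEST hit an undef error.
-ifdef(TEST).
-export([usort/1]).
-endif.

%% After: always exported, with a marker comment for reviewers.
%% For tests.
-export([usort/1]).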

gomoripeti and others added 4 commits November 19, 2024 22:27
So that they can be used from multiple test suites.
Supports both per stream (detailed) and aggregated (metrics) values.
The application is not always recompiled which causes tests to fail
because they cannot call `serial_number:usort/1`.
@gomoripeti gomoripeti force-pushed the expose_stream_offset_lag_metric_to_prometheus branch from c0418ba to 05a3733 Compare November 19, 2024 21:28
@gomoripeti
Contributor

Rebased onto latest main. This is ready for another round of review. The failing test cases seem unrelated to me.

@michaelklishin
Member

#12765

@michaelklishin
Member

I have received no objections to this PR from other core team members.

@michaelklishin
Member

This seems safe to backport to v4.0.x, too. It's a new metric in a node-local plugin.

@michaelklishin
Member

#12765 is in; the core team is now discussing whether it is something we should backport to v4.0.x.

michaelklishin added a commit that referenced this pull request Nov 20, 2024
@michaelklishin michaelklishin added this to the 4.1.0 milestone Nov 20, 2024