
Streams: two additional Prometheus metrics for connections #10275

Conversation

markus812498
Contributor

Proposed Changes

This PR adds support for metrics from the rabbit_stream_consumer_created and rabbit_stream_publisher_created ETS tables to be exposed through the Prometheus endpoints under more user-friendly metric names. Based on the conversation in a previous PR: #3043

The change was requested to ensure that customers using stream connections can scrape these metrics from the Prometheus endpoints.
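
As a rough illustration of the mechanism only (not the PR's actual code): the collector can fold over one of these ETS tables and emit one value per publisher. The table layout, key shape, and the published field shown below are assumptions made for the sketch.

stream_publisher_published() ->
    %% Emit one {StreamResource, PublishedCount} pair per publisher entry.
    ets:foldl(
      fun({{QueueRes, _ConnPid, _PubId}, Props}, Acc) ->
              Published = proplists:get_value(published, Props, 0),
              [{QueueRes, Published} | Acc]
      end, [], rabbit_stream_publisher_created).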

Types of Changes

What types of changes does your code introduce to this project?
Put an x in the boxes that apply

  • Bug fix (non-breaking change which fixes issue #NNNN)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause an observable behavior change in existing systems)
  • Documentation improvements (corrections, new content, etc)
  • Cosmetic change (whitespace, formatting, etc)
  • Build system and/or CI

Checklist

Put an x in the boxes that apply.
You can also fill these out after creating the PR.
If you're unsure about any of them, don't hesitate to ask on the mailing list.
We're here to help!
This is simply a reminder of what we are going to look for before merging your code.

Further Comments

If this PR is accepted, I think this should also be pushed to the documentation on the rabbitmq-website. Do I create a PR against that repo once this is merged, or is there another procedure to follow?

Thanks!

@markus812498 markus812498 marked this pull request as ready for review January 4, 2024 01:55
@michaelklishin
Member

This has conflicts with main and needs rebasing.

Contributor

@gomoripeti gomoripeti left a comment

great work :)

@@ -414,6 +428,8 @@ label(M) when is_map(M) ->
end, <<>>, M);
label(#resource{virtual_host = VHost, kind = exchange, name = Name}) ->
<<"vhost=\"", VHost/binary, "\",exchange=\"", Name/binary, "\"">>;
label({#resource{virtual_host = VHost, kind = queue, name = Name}, P, _}) when is_pid(P) ->
<<"vhost=\"", VHost/binary, "\",queue=\"", Name/binary, "\"","channel=\"",(iolist_to_binary(pid_to_list(P)))/binary, "\"">>;
Contributor

Recently another PR, #9656, was merged, which adds escaping to label values. I think you need to rebase onto latest main and add escaping to this line as well (to the vhost and queue name values).
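
For illustration, the escaped clause could look roughly like the sketch below, assuming #9656 introduced a helper along the lines of escape_label_value/1 (the helper name is an assumption here, not necessarily the actual function from that PR). The sketch also adds the comma that appears to be missing between the queue and channel labels in the line above.

label({#resource{virtual_host = VHost, kind = queue, name = Name}, P, _}) when is_pid(P) ->
    %% Escape user-controlled values so quotes or backslashes in vhost or
    %% queue names cannot corrupt the Prometheus exposition format.
    <<"vhost=\"", (escape_label_value(VHost))/binary,
      "\",queue=\"", (escape_label_value(Name))/binary,
      "\",channel=\"", (iolist_to_binary(pid_to_list(P)))/binary, "\"">>;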

Contributor Author

Thanks @gomoripeti, I completely missed this!

@@ -694,7 +743,9 @@ accumulate_count_and_sum(Value, {Count, Sum}) ->

empty(T) when T == channel_queue_exchange_metrics; T == channel_process_metrics; T == queue_consumer_count ->
{T, 0};
-empty(T) when T == connection_coarse_metrics; T == auth_attempt_metrics; T == auth_attempt_detailed_metrics ->
+empty(T) when T == connection_coarse_metrics; T == rabbit_stream_publisher_created;
Contributor

Similarly, rabbit_stream_publisher_created is not needed here.

@@ -512,6 +528,35 @@ get_data(channel_metrics = Table, false, _, _) ->
[{Table, [{consumer_count, A1}, {messages_unacknowledged, A2}, {messages_unconfirmed, A3},
{messages_uncommitted, A4}, {acks_uncommitted, A5}, {prefetch_count, A6},
{global_prefetch_count, A7}]}];
get_data(rabbit_stream_publisher_created = Table, false, _, _) ->
Contributor

I think there is some rebase or copy-paste issue here. This function clause is not necessary; the next one covers publisher metrics.

@markus812498 markus812498 force-pushed the expose_stream_offset_lag_metric_to_prometheus branch 4 times, most recently from ce61937 to d50dd43 Compare January 7, 2024 23:42
@gomoripeti gomoripeti force-pushed the expose_stream_offset_lag_metric_to_prometheus branch from d50dd43 to 051c917 Compare January 8, 2024 19:06
@gomoripeti
Contributor

There are valid test failures. We are working on fixing them...

@gomoripeti gomoripeti force-pushed the expose_stream_offset_lag_metric_to_prometheus branch from 051c917 to 66ec336 Compare January 9, 2024 16:24
@markus812498
Contributor Author

After @gomoripeti's push above, I believe this PR is ready for review: all Prometheus-related tests pass and the conflicts are resolved. (The remaining failing tests look related to MQTT_V5.)

@luos
Contributor

luos commented Feb 13, 2024

Hi!

This looks very good, thank you for implementing it.

What do you think of the idea of also exposing the segment count for a stream? The rough stream size could then be calculated as max-segment-size * segments. (Maybe in a separate PR.)
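
For example (hypothetical numbers, just to illustrate the calculation): with a max-segment-size of 500 MB and 12 segments on disk, the estimated stream size would be roughly 500 MB * 12 = 6 GB.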

@michaelklishin
Member

Exposing the current segment count sounds OK to me.

@markus812498
Contributor Author

That sounds good! Perhaps I could look into that in a separate PR once this one gets merged :)

@markus812498 markus812498 force-pushed the expose_stream_offset_lag_metric_to_prometheus branch from bbf276e to b59730d Compare February 26, 2024 03:41
@gomoripeti gomoripeti force-pushed the expose_stream_offset_lag_metric_to_prometheus branch from b59730d to 2affbe8 Compare June 21, 2024 07:19
@gomoripeti
Contributor

The force-push was a rebase onto latest main. google-github-actions/auth fails in CI.

@michaelklishin
Member

CI fails because external PRs do not have access to the necessary secrets. We will run the tests locally.

Member

@michaelklishin michaelklishin left a comment

I cannot rebase this PR onto main without resolving a boatload of conflicts.

Specifically, I have given up on the 7th conflict.

It's probably based on a revision that's a month or so out of date.

@michaelklishin
Member

Indeed, https://github.com/cloudamqp/rabbitmq-server/ still uses master as its default branch, and https://github.com/cloudamqp/rabbitmq-server/tree/main specifically is from May 15th.

@michaelklishin
Member

Rebasing cloudamqp/rabbitmq-server@main onto rabbitmq/rabbitmq-server@main is a fast-forward.

Then rebasing this PR onto the result will conflict with #11431, but that should be it.

@gomoripeti gomoripeti force-pushed the expose_stream_offset_lag_metric_to_prometheus branch from 2affbe8 to 82eaa32 Compare June 24, 2024 09:49
@gomoripeti
Contributor

I used the GitHub UI's "Rebase" button previously; sorry if it caused some trouble.

Now I manually rebased onto latest main @ rabbitmq/rabbitmq-server and fixed the conflict with #11431.

@pad-master82

Is this PR still active, and can it be accepted? If not, can I create a new PR for the same purpose? I plan to use stream queues and really need to monitor consumer lag. If this PR is accepted, I am willing to create a new PR to collect this metric (consumer lag) in Datadog.

@gomoripeti gomoripeti force-pushed the expose_stream_offset_lag_metric_to_prometheus branch from 82eaa32 to 289029f Compare October 30, 2024 13:11
@gomoripeti
Contributor

Thanks for the ping. I guess this PR was accidentally forgotten by the Core Team. I rebased onto latest main.

@gomoripeti
Contributor

I noticed that the current implementation uses vhost, queue, and connection PID as labels. Instead of, or in addition to, the PID it would be better to include publisher_id/subscription_id as a label on the per-object metrics. I'll work on this.

@michaelklishin michaelklishin changed the title Added support for exposing Stream Connection metrics Streams: two additional Prometheus metrics for connections Oct 30, 2024
@ikavgo ikavgo closed this Nov 5, 2024
@ikavgo ikavgo reopened this Nov 5, 2024
@ikavgo
Contributor

ikavgo commented Nov 5, 2024

Huh, anyway. David is right, the cardinality is going to be too big.

@gomoripeti
Contributor

Thanks for the valuable input. As streams are long-lived, would per-stream metrics (no connection and pub/sub id labels) be acceptable?

  • I realise that consumer offset does not make sense aggregated
  • It would be good to get some info on offset lag; it is definitely useful to see whether it is zero or not. The sum of offset_lag might not be the best aggregation, as it can be higher when there are a lot of consumers. Maybe "max offset_lag" per stream, and a max per node? (a gauge; see the sketch after this list)
  • For published, confirmed and publish errors, the sum per stream sounds useful. (But these counters have the usual problem that if a publisher goes away, they decrease)
  • The total number of messages consumed per stream might make sense (although with the non-destructive nature of streams this might be less meaningful). (It also has the problem that if a consumer goes away, this counter decreases)
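
A rough sketch of how such a per-stream max could be computed from the consumer table (the table name comes from this PR, but the key shape and the offset_lag field are assumptions for illustration, not the actual implementation):

max_offset_lag_per_stream() ->
    %% Fold over all consumer entries, keeping the highest offset lag
    %% seen per stream; the result maps stream resource -> max lag.
    ets:foldl(
      fun({{QueueRes, _ConnPid, _SubId}, Props}, Acc) ->
              Lag = proplists:get_value(offset_lag, Props, 0),
              maps:update_with(QueueRes, fun(Max) -> max(Max, Lag) end, Lag, Acc)
      end, #{}, rabbit_stream_consumer_created).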

@ikavgo
Contributor

ikavgo commented Nov 5, 2024

Yeah, I would start with describing user stories around these new metrics. I like max_offset_lag per stream, since I can imagine setting an alert on it. The rest I don't know.

So how would they be used?

As for metrics granularity ("per-object", etc.), would it make more sense to have only reasonable aggregates in Prometheus and refer to the Management UI for granularity when an event occurred?

@ansd
Member

ansd commented Nov 5, 2024

As streams are long lived would per-stream (no connection and pub/sub id labels) metrics be acceptable?

Exposing per-object Prometheus metrics for streams is acceptable from my point of view, similar to how exposing per-object metrics for quorum queues is acceptable. These are usually rather long-lived entities. On the other hand, a connection identified by its Pid is what I consider short-lived.

@gomoripeti
Contributor

I think the urge to add high-cardinality metrics to the Prometheus endpoint comes from the feeling that metrics in the management plugin will eventually be deprecated and removed. (I think it is still not clear how and to what extent this will happen. As a user, I really like the graphs with some history on a per-connection or per-queue basis. As a maintainer or contributor it is quite a lot of complexity, so I agree it should be replaced somehow.)

For use cases, I think stream_publisher_published_total and stream_consumer_consumed_total have the same role as queue_exchange_messages_published_total and channel_messages_delivered_total: first we can look at global messages received and delivered to get a sense of node traffic, then drill down to answer which were the busiest streams during a traffic spike (in the past, i.e. historic data).

@lhoguin
Contributor

lhoguin commented Nov 5, 2024

We don't have a clear idea of what we want to do about management metrics at this point, so they are unlikely to be removed anytime soon. Prometheus was considered to be an alternative way to get the metrics out, but this cardinality issue makes it a bad candidate. So we are still at square one in that regard: we want something better than what management metrics currently are, but we don't know how to go about it.

@michaelklishin
Member

Removal of management UI metrics is not something we have discussed in many months, nor something planned for the foreseeable future; the last time it came up, the discussion was about undeprecating them.

@gomoripeti
Contributor

So what I can do is add metrics which are summed per stream:

  • stream_publisher_published_total
  • stream_publisher_confirmed_total
  • stream_publisher_error_messages
  • stream_consumer_consumed_total

and a per-stream max for

  • stream_consumer_max_offset_lag

Or, if there is uncertainty about whether the former are useful enough, those can be postponed, and only stream_consumer_max_offset_lag added for now?
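
For illustration, assuming the plugin's usual rabbitmq_ prefix and per-stream vhost/queue labels (both assumptions here, as the final naming was still under discussion), a scrape might then contain a line along these lines (the value is hypothetical):

rabbitmq_stream_consumer_max_offset_lag{vhost="/",queue="my-stream"} 42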

@deadtrickster
Contributor

I would start with max lag first.

@gomoripeti gomoripeti force-pushed the expose_stream_offset_lag_metric_to_prometheus branch from 142a219 to 6aae7be Compare November 8, 2024 19:24
@gomoripeti
Contributor

I updated this PR to only include the max_offset_lag metric.
But I still have to fix the tests (issues with dependencies between apps).

@gomoripeti
Contributor

The amqp10_common test fails consistently with the error below.
serial_number:usort/1 is only exported if compiled with the TEST macro.
I wonder if the failure is because, now that amqp10_common is a dependency of rabbitmq_ct_helpers, it is first compiled without TEST and then not recompiled with TEST defined. I am not sure if this is how erlang.mk works, and if so, how to work around it.

serial_number_SUITE > test_usort
    #1. {error,
            {undef,
                [{serial_number,usort,[[]],[]},
                 {serial_number_SUITE,test_usort,1,
                     [{file,"serial_number_SUITE.erl"},{line,65}]},

@lhoguin
Contributor

lhoguin commented Nov 12, 2024

It's not recompiled. You need to remove the check on TEST and instead add a comment just above this function's export that says %% For tests.

Functions that are hidden behind TEST in rabbit are often a misuse of test builds that we have noticed and begun correcting.
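
Concretely, the change to serial_number.erl would look roughly like this sketch of the pattern described above (not the exact diff):

%% Before: the export is only visible in TEST builds, so callers
%% compiled without TEST hit an undef error.
-ifdef(TEST).
-export([usort/1]).
-endif.

%% After: always exported, with a marker comment for reviewers.
%% For tests.
-export([usort/1]).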

gomoripeti and others added 4 commits November 19, 2024 22:27
So that they can be used from multiple test suites.
Supports both per stream (detailed) and aggregated (metrics) values.
The application is not always recompiled which causes tests to fail
because they cannot call `serial_number:usort/1`.
@gomoripeti gomoripeti force-pushed the expose_stream_offset_lag_metric_to_prometheus branch from c0418ba to 05a3733 Compare November 19, 2024 21:28
@gomoripeti
Contributor

Rebased onto latest main. This is ready for another round of review. The failing test cases seem unrelated to me.

@michaelklishin
Member

#12765

@michaelklishin
Member

I have received no objections to this PR from other core team members.

@michaelklishin
Member

This seems safe to backport to v4.0.x, too. It's a new metric in a node-local plugin.

@michaelklishin
Member

#12765 is in; the core team is now discussing whether it is something we should backport to v4.0.x.

michaelklishin added a commit that referenced this pull request Nov 20, 2024
@michaelklishin michaelklishin added this to the 4.1.0 milestone Nov 20, 2024