Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add server role to channel metrics to check cardinality #114

Merged
merged 1 commit into from
Apr 28, 2020

Conversation

AaronFriel
Copy link
Contributor

When reading stats like pending count and max in flight it is helpful to filter by
server role != Follower, because followers will report a max in flight but a pending
count of 0.

Example: to determine queue saturation, I would query the following:

sum(nss_chan_subs_pending_count) by (channel, queue, durable_name)
  / sum(nss_chan_subs_max_inflight) by (channel, queue, durable_name)

But the sum on the right hand side will be too large - Follower nodes
report the same value as the Leader.

Say I have 4 subcribers, each with a max in flight of 10, and a "true"
pending count of 10 messages (divided however), and a cluster size of
5 nodes.

The real "saturation" value should be 25%, (10 pending / (10 max in flight * 4 subscribers))

However with no way to filter out the follower nodes the saturation is reported
as 5%. The numerator is still 10, but the denominator will be 10 * 4 * 5, the
true max in flight multiplied by subscribers multiplied by servers reporting the
max in flight.

@AaronFriel
Copy link
Contributor Author

Hey @wallyqs @derekcollison would it be possible to get a quick review and release on this adding labels? It would help unblock some metrics we use internally and alerts. For the time being I'm just going to hardcode a constant factor.

It would also fix weird examples like this one: https://docs.nats.io/nats-streaming-concepts/monitoring#msgs-sec-vs-pending-on-channel

sum(rate(nss_chan_msgs_total{channel="foo"}[5m])) by (channel) / 3
sum(nss_chan_subs_pending_count{channel="foo"}) by (channel) / 3

The former should probably be max(rate(nss_chan_msgs_total{channel="foo"}[5m])) and the latter should be sum(nss_chan_subs_pending_count{channel="foo",role="Leader"}).

Well, actually that's kind of a weird metric because the pending count caps out at the max in flight, so I'd defer to the metric in my commit message for saturation.

Copy link
Member

@wallyqs wallyqs left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the PR Aaron :) Left one comment

@@ -297,18 +297,23 @@ func (nc *channelsCollector) Describe(ch chan<- *prometheus.Desc) {
func (nc *channelsCollector) Collect(ch chan<- prometheus.Metric) {
for _, server := range nc.servers {
var resp Channelsz
var serverResp StreamingServerz
if err := getMetricURL(nc.httpClient, server.URL, &resp); err != nil {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why is this extra getMetricURL added? seems it is doing the same as the call below?

Copy link
Contributor Author

@AaronFriel AaronFriel Apr 26, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, that's the wrong way to do this. I was copying a pattern elsewhere to get the serversz response but the URL is wrong I can see.

How would you suggest getting the server role into the Collect function here? This function is called with an array of servers via channelsCollector, so for each server I need to actually get a single metric from a different URL.

  1. Use some URL parsing to substitute channelsz with serversz and get one additional data point from the prior role?
  2. If both serversz and channelsz are collected, run the former first and have it somehow return the role to apply to all of the latter. (This seems messy because both collectors are called with arrays of servers, not individual servers.)
  3. ???

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I implemented in my new commit option #1, seeing no way without a major refactor to implement #2

When reading stats like pending count and max in flight it is helpful to filter by
server role != Follower, because followers will report a max in flight but a pending
count of 0.

Example: to determine queue saturation, I would query the following:

```
sum(nss_chan_subs_pending_count) by (channel, queue, durable_name)
  / sum(nss_chan_subs_max_inflight) by (channel, queue, durable_name)
```

But the sum on the right hand side will be too large - Follower nodes
report the same value as the Leader.

Say I have 4 subcribers, each with a max in flight of 10, and a "true"
pending count of 10 messages (divided however), and a cluster size of
5 nodes.

The real "saturation" value should be 25%, (10 pending / (10 max in flight * 4 subscribers))

However with no way to filter out the follower nodes the saturation is reported
as 5%. The numerator is still 10, but the denominator will be `10 * 4 * 5`, the
true max in flight multiplied by subscribers multiplied by servers reporting the
max in flight.
@AaronFriel
Copy link
Contributor Author

@wallyqs Do those recent changes make sense given the structure of the code?

@wallyqs
Copy link
Member

wallyqs commented Apr 28, 2020

Thanks @AaronFriel for the contribution and the feedback, I get the idea but will apply a few changes before merging in a bit.

@wallyqs
Copy link
Member

wallyqs commented Apr 28, 2020

Made a few fixes with the ordering of the labels: fd48eba

Now it looks like this for example:

# HELP nss_chan_bytes_total Total of bytes
# TYPE nss_chan_bytes_total gauge
nss_chan_bytes_total{channel="foo",server_id="http://127.0.0.1:8222",server_role="Follower"} 62
# HELP nss_chan_last_seq Last seq
# TYPE nss_chan_last_seq gauge
nss_chan_last_seq{channel="foo",server_id="http://127.0.0.1:8222",server_role="Follower"} 1
# HELP nss_chan_msgs_total Total of messages
# TYPE nss_chan_msgs_total gauge
nss_chan_msgs_total{channel="foo",server_id="http://127.0.0.1:8222",server_role="Follower"} 1

@wallyqs wallyqs merged commit 46e6775 into nats-io:master Apr 28, 2020
@wallyqs
Copy link
Member

wallyqs commented Apr 28, 2020

Merged with a few fixes, thanks!

@AaronFriel
Copy link
Contributor Author

Thanks, this was my first Go commit on an OSS project so I'm glad I grok your changes! Go error handling is... interesting!

Looking forward to this being released.

@wallyqs
Copy link
Member

wallyqs commented Apr 28, 2020

@AaronFriel excellent first contribution! 🎉 I will try to release now (release v0.6.2), feel free to send PR as well for the updates on the prometheus docs as well :) Thanks.

@wallyqs
Copy link
Member

wallyqs commented Apr 29, 2020

Hi @AaronFriel this has been released now: https://github.com/nats-io/prometheus-nats-exporter/releases/tag/v0.6.2

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants