Add server role to channel metrics to check cardinality #114

AaronFriel · 2020-04-25T21:11:29Z

When reading stats like pending count and max in flight it is helpful to filter by
server role != Follower, because followers will report a max in flight but a pending
count of 0.

Example: to determine queue saturation, I would query the following:

sum(nss_chan_subs_pending_count) by (channel, queue, durable_name)
  / sum(nss_chan_subs_max_inflight) by (channel, queue, durable_name)

But the sum on the right hand side will be too large - Follower nodes
report the same value as the Leader.

Say I have 4 subcribers, each with a max in flight of 10, and a "true"
pending count of 10 messages (divided however), and a cluster size of
5 nodes.

The real "saturation" value should be 25%, (10 pending / (10 max in flight * 4 subscribers))

However with no way to filter out the follower nodes the saturation is reported
as 5%. The numerator is still 10, but the denominator will be 10 * 4 * 5, the
true max in flight multiplied by subscribers multiplied by servers reporting the
max in flight.

AaronFriel · 2020-04-25T21:18:20Z

Hey @wallyqs @derekcollison would it be possible to get a quick review and release on this adding labels? It would help unblock some metrics we use internally and alerts. For the time being I'm just going to hardcode a constant factor.

It would also fix weird examples like this one: https://docs.nats.io/nats-streaming-concepts/monitoring#msgs-sec-vs-pending-on-channel

sum(rate(nss_chan_msgs_total{channel="foo"}[5m])) by (channel) / 3
sum(nss_chan_subs_pending_count{channel="foo"}) by (channel) / 3

The former should probably be max(rate(nss_chan_msgs_total{channel="foo"}[5m])) and the latter should be sum(nss_chan_subs_pending_count{channel="foo",role="Leader"}).

Well, actually that's kind of a weird metric because the pending count caps out at the max in flight, so I'd defer to the metric in my commit message for saturation.

wallyqs

Thanks for the PR Aaron :) Left one comment

wallyqs · 2020-04-25T22:31:40Z

collector/streaming.go

@@ -297,18 +297,23 @@ func (nc *channelsCollector) Describe(ch chan<- *prometheus.Desc) {
 func (nc *channelsCollector) Collect(ch chan<- prometheus.Metric) {
 	for _, server := range nc.servers {
 		var resp Channelsz
+		var serverResp StreamingServerz
+		if err := getMetricURL(nc.httpClient, server.URL, &resp); err != nil {


why is this extra getMetricURL added? seems it is doing the same as the call below?

Hmm, that's the wrong way to do this. I was copying a pattern elsewhere to get the serversz response but the URL is wrong I can see.

How would you suggest getting the server role into the Collect function here? This function is called with an array of servers via channelsCollector, so for each server I need to actually get a single metric from a different URL.

Use some URL parsing to substitute channelsz with serversz and get one additional data point from the prior role?

If both serversz and channelsz are collected, run the former first and have it somehow return the role to apply to all of the latter. (This seems messy because both collectors are called with arrays of servers, not individual servers.)

???

I implemented in my new commit option #1, seeing no way without a major refactor to implement #2

When reading stats like pending count and max in flight it is helpful to filter by server role != Follower, because followers will report a max in flight but a pending count of 0. Example: to determine queue saturation, I would query the following: ``` sum(nss_chan_subs_pending_count) by (channel, queue, durable_name) / sum(nss_chan_subs_max_inflight) by (channel, queue, durable_name) ``` But the sum on the right hand side will be too large - Follower nodes report the same value as the Leader. Say I have 4 subcribers, each with a max in flight of 10, and a "true" pending count of 10 messages (divided however), and a cluster size of 5 nodes. The real "saturation" value should be 25%, (10 pending / (10 max in flight * 4 subscribers)) However with no way to filter out the follower nodes the saturation is reported as 5%. The numerator is still 10, but the denominator will be `10 * 4 * 5`, the true max in flight multiplied by subscribers multiplied by servers reporting the max in flight.

AaronFriel · 2020-04-28T16:45:03Z

@wallyqs Do those recent changes make sense given the structure of the code?

wallyqs · 2020-04-28T17:04:59Z

Thanks @AaronFriel for the contribution and the feedback, I get the idea but will apply a few changes before merging in a bit.

wallyqs · 2020-04-28T17:53:47Z

Made a few fixes with the ordering of the labels: fd48eba

Now it looks like this for example:

# HELP nss_chan_bytes_total Total of bytes
# TYPE nss_chan_bytes_total gauge
nss_chan_bytes_total{channel="foo",server_id="http://127.0.0.1:8222",server_role="Follower"} 62
# HELP nss_chan_last_seq Last seq
# TYPE nss_chan_last_seq gauge
nss_chan_last_seq{channel="foo",server_id="http://127.0.0.1:8222",server_role="Follower"} 1
# HELP nss_chan_msgs_total Total of messages
# TYPE nss_chan_msgs_total gauge
nss_chan_msgs_total{channel="foo",server_id="http://127.0.0.1:8222",server_role="Follower"} 1

wallyqs · 2020-04-28T20:28:41Z

Merged with a few fixes, thanks!

AaronFriel · 2020-04-28T21:14:51Z

Thanks, this was my first Go commit on an OSS project so I'm glad I grok your changes! Go error handling is... interesting!

Looking forward to this being released.

wallyqs · 2020-04-28T21:25:01Z

@AaronFriel excellent first contribution! 🎉 I will try to release now (release v0.6.2), feel free to send PR as well for the updates on the prometheus docs as well :) Thanks.

wallyqs · 2020-04-29T01:05:27Z

Hi @AaronFriel this has been released now: https://github.com/nats-io/prometheus-nats-exporter/releases/tag/v0.6.2

wallyqs reviewed Apr 25, 2020

View reviewed changes

AaronFriel force-pushed the channelsz-role branch from de55e53 to 46e6775 Compare April 26, 2020 00:52

wallyqs merged commit 46e6775 into nats-io:master Apr 28, 2020

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add server role to channel metrics to check cardinality #114

Add server role to channel metrics to check cardinality #114

AaronFriel commented Apr 25, 2020

AaronFriel commented Apr 25, 2020

wallyqs left a comment

wallyqs Apr 25, 2020

AaronFriel Apr 26, 2020 •

edited

Loading

AaronFriel Apr 26, 2020

AaronFriel commented Apr 28, 2020

wallyqs commented Apr 28, 2020

wallyqs commented Apr 28, 2020

wallyqs commented Apr 28, 2020

AaronFriel commented Apr 28, 2020

wallyqs commented Apr 28, 2020

wallyqs commented Apr 29, 2020

Add server role to channel metrics to check cardinality #114

Add server role to channel metrics to check cardinality #114

Conversation

AaronFriel commented Apr 25, 2020

AaronFriel commented Apr 25, 2020

wallyqs left a comment

Choose a reason for hiding this comment

wallyqs Apr 25, 2020

Choose a reason for hiding this comment

AaronFriel Apr 26, 2020 • edited Loading

Choose a reason for hiding this comment

AaronFriel Apr 26, 2020

Choose a reason for hiding this comment

AaronFriel commented Apr 28, 2020

wallyqs commented Apr 28, 2020

wallyqs commented Apr 28, 2020

wallyqs commented Apr 28, 2020

AaronFriel commented Apr 28, 2020

wallyqs commented Apr 28, 2020

wallyqs commented Apr 29, 2020

AaronFriel Apr 26, 2020 •

edited

Loading