-
Notifications
You must be signed in to change notification settings - Fork 140
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add server role to channel metrics to check cardinality #114
Conversation
Hey @wallyqs @derekcollison would it be possible to get a quick review and release on this adding labels? It would help unblock some metrics we use internally and alerts. For the time being I'm just going to hardcode a constant factor. It would also fix weird examples like this one: https://docs.nats.io/nats-streaming-concepts/monitoring#msgs-sec-vs-pending-on-channel
The former should probably be Well, actually that's kind of a weird metric because the pending count caps out at the max in flight, so I'd defer to the metric in my commit message for saturation. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the PR Aaron :) Left one comment
collector/streaming.go
Outdated
@@ -297,18 +297,23 @@ func (nc *channelsCollector) Describe(ch chan<- *prometheus.Desc) { | |||
func (nc *channelsCollector) Collect(ch chan<- prometheus.Metric) { | |||
for _, server := range nc.servers { | |||
var resp Channelsz | |||
var serverResp StreamingServerz | |||
if err := getMetricURL(nc.httpClient, server.URL, &resp); err != nil { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why is this extra getMetricURL
added? seems it is doing the same as the call below?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, that's the wrong way to do this. I was copying a pattern elsewhere to get the serversz response but the URL is wrong I can see.
How would you suggest getting the server role into the Collect function here? This function is called with an array of servers via channelsCollector, so for each server I need to actually get a single metric from a different URL.
- Use some URL parsing to substitute
channelsz
withserversz
and get one additional data point from the prior role? - If both
serversz
andchannelsz
are collected, run the former first and have it somehow return the role to apply to all of the latter. (This seems messy because both collectors are called with arrays of servers, not individual servers.) - ???
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When reading stats like pending count and max in flight it is helpful to filter by server role != Follower, because followers will report a max in flight but a pending count of 0. Example: to determine queue saturation, I would query the following: ``` sum(nss_chan_subs_pending_count) by (channel, queue, durable_name) / sum(nss_chan_subs_max_inflight) by (channel, queue, durable_name) ``` But the sum on the right hand side will be too large - Follower nodes report the same value as the Leader. Say I have 4 subcribers, each with a max in flight of 10, and a "true" pending count of 10 messages (divided however), and a cluster size of 5 nodes. The real "saturation" value should be 25%, (10 pending / (10 max in flight * 4 subscribers)) However with no way to filter out the follower nodes the saturation is reported as 5%. The numerator is still 10, but the denominator will be `10 * 4 * 5`, the true max in flight multiplied by subscribers multiplied by servers reporting the max in flight.
de55e53
to
46e6775
Compare
@wallyqs Do those recent changes make sense given the structure of the code? |
Thanks @AaronFriel for the contribution and the feedback, I get the idea but will apply a few changes before merging in a bit. |
Made a few fixes with the ordering of the labels: fd48eba Now it looks like this for example:
|
Merged with a few fixes, thanks! |
Thanks, this was my first Go commit on an OSS project so I'm glad I grok your changes! Go error handling is... interesting! Looking forward to this being released. |
@AaronFriel excellent first contribution! 🎉 I will try to release now (release v0.6.2), feel free to send PR as well for the updates on the prometheus docs as well :) Thanks. |
Hi @AaronFriel this has been released now: https://github.com/nats-io/prometheus-nats-exporter/releases/tag/v0.6.2 |
When reading stats like pending count and max in flight it is helpful to filter by
server role != Follower, because followers will report a max in flight but a pending
count of 0.
Example: to determine queue saturation, I would query the following:
But the sum on the right hand side will be too large - Follower nodes
report the same value as the Leader.
Say I have 4 subcribers, each with a max in flight of 10, and a "true"
pending count of 10 messages (divided however), and a cluster size of
5 nodes.
The real "saturation" value should be 25%, (10 pending / (10 max in flight * 4 subscribers))
However with no way to filter out the follower nodes the saturation is reported
as 5%. The numerator is still 10, but the denominator will be
10 * 4 * 5
, thetrue max in flight multiplied by subscribers multiplied by servers reporting the
max in flight.