Emit version metric from SPIRE Server and Agent on startup#1208
Emit version metric from SPIRE Server and Agent on startup#1208evan2645 merged 8 commits intospiffe:masterfrom
Conversation
This also modifies the binary stamping to include the binary SHA in the version if its not exactly a git tag. Signed-off-by: Tyler Dixon <tylerd@uber.com>
Some metrics systems reject values that appear auto-generated. Abbreviating the SHA1 Git commit hash in the version string is a reasonable workaround for this. GitHub defaults to abbreviating the commit hash to 7 characters in its UI in seemingly all repos it hosts. For this reason, chose to abbreviate the hash to the first 7 characters for easier correlation between SPIRE versions and the commit on which that version is based. Signed-off-by: Ryan Turner <turner@uber.com>
0f12ba0 to
ff2bc17
Compare
Signed-off-by: Ryan Turner <turner@uber.com>
Signed-off-by: Ryan Turner <turner@uber.com>
pkg/common/telemetry/version.go
Outdated
| ) | ||
|
|
||
| func EmitVersion(m Metrics) { | ||
| m.SetGaugeWithLabels([]string{"version"}, 1, []Label{ |
There was a problem hiding this comment.
It feels a little odd to have a metric with constant value, where the name (and usefulness) of the metric is solely fulfilled by a label. What are your thoughts on making this a counter with a name like "agent_boot" and "server_boot" or something along those lines?
There was a problem hiding this comment.
I suppose just boot or started could suffice too since we already prefix with agent or server.
There was a problem hiding this comment.
Yeah I get the confusion, the value here is a little moot. I'm not sure a counter like you're suggesting is any clearer since the counter value would be reset anytime the server or agent was restarted, effectively making it a constant value of 1 for each active instance. Perhaps a middle-ground solution of calling this metric "{agent,server}_boot" and keeping it as a gauge with value 1 would be clearer? It seems a little more intuitive to me at least that a gauge (as opposed to a counter) might have a constant value for an extended period of time.
There was a problem hiding this comment.
The value-prop of a value of 1 is that you can use an aggregation function (such as sum) to get the count of instances, rather than "counting" the series.
Off the top of my head I'm not sure there exists a query language (such as promql, graphite, m3ql, etc.) that allows you to "count" series grouped by tags. For example, if I have the ability to query global metrics, and I want a count of running version by-deployment-area I think I need to use a sum / groupBy function, rather than a "count of series".
There was a problem hiding this comment.
I'm not sure a counter like you're suggesting is any clearer since the counter value would be reset anytime the server or agent was restarted, effectively making it a constant value of 1 for each active instance.
Hmm... it's my impression that the emitted value of a counter is the amount to increment by, rather than the absolute value. A little googling seems to support that, but maybe there is something I'm missing?
The value-prop of a value of 1 is that you can use an aggregation function (such as sum) to get the count of instances, rather than "counting" the series.
Heh... the value I see in a counter is that you can easily compute the start/restart rate across a set of hosts. I don't think that can be done currently? Seems like it would be useful.
Off the top of my head I'm not sure there exists a query language (such as promql, graphite, m3ql, etc.) that allows you to "count" series grouped by tags. For example, if I have the ability to query global metrics, and I want a count of running version by-deployment-area I think I need to use a sum / groupBy function, rather than a "count of series".
I see what you're saying... In the graphite docs, they have both a groupByTags function and countSeries function which looks like they can be combined to do something like this. That being said, the DataDog docs recommend a gauge for this use case. I guess I could go either way.
There was a problem hiding this comment.
None of the statsd docs seems to indicate that you must re-emit a gauge.
Gauges are a constant data type. They are not subject to averaging, and they don’t change unless you change them. That is, once you set a gauge value, it will be a flat line on the graph until you change it again.
There was a problem hiding this comment.
Thanks for all this supporting information @ZymoticB
I've taken to emitting "milliseconds since start" to monitor for crash loops.
That sounds much more effective than what I was suggesting.
With regards to emission frequency, I suppose we can always switch the behavior to emit more than once at some point in the future without affecting much (if we find that we need to).
There was a problem hiding this comment.
Makes sense, @ZymoticB . I think we are all in agreement that uptime is a really useful metric to emit. I can change this to be a gauge metric called "uptime" with a value of "milliseconds since startup" using the approach you suggested.
There was a problem hiding this comment.
I'm a bit confused by @rturner3 's comment so I want to clarify.
The suggestion to emit milliseconds since start to monitor for crash loops was in response to @evan2645 's request for a counter that could be used to monitor starts / stops. I agree we all see value in an uptime metric.
That said, IMO we should still have a gauge that emits version so that we can see "how many running instances are running version X", I don't think it should be a single uptime gauge.
There was a problem hiding this comment.
Thanks for clarifying, @ZymoticB . I think this thread diverged into multiple directions and the linear nature of GitHub conversations had me a little confused as to what exactly was being proposed for this PR.
I can create a separate issue for the uptime metric and keep this PR focused to just the version metric, which is currently implemented with a gauge.
Signed-off-by: Ryan Turner <turner@uber.com>
Signed-off-by: Ryan Turner <turner@uber.com>
Signed-off-by: Ryan Turner <turner@uber.com>
7dfc04e to
2972e93
Compare
Pull Request check list
Affected functionality
Adds new
versionmetric in SPIRE Server and AgentDescription of change
Upon startup of SPIRE Server and Agent, emits a new
versionmetric, which is a gauge with a constant value of 1 and aversionlabel with the version string of the component. The version string is built in the format <Major>.<Minor>.<Patch>-dev-<First 7 chars of SHA1 Git commit hash> . This is helpful in visualizing rollout of new versions of SPIRE Server and Agent within a deployment environment.