Emit version metric from SPIRE Server and Agent on startup by rturner3 · Pull Request #1208 · spiffe/spire

rturner3 · 2019-10-21T17:05:44Z

Pull Request check list

Commit conforms to CONTRIBUTING.md?
Proper tests/regressions included?
Documentation updated?

Affected functionality
Adds new version metric in SPIRE Server and Agent

Description of change
Upon startup of SPIRE Server and Agent, emits a new version metric, which is a gauge with a constant value of 1 and a version label with the version string of the component. The version string is built in the format <Major>.<Minor>.<Patch>-dev-<First 7 chars of SHA1 Git commit hash> . This is helpful in visualizing rollout of new versions of SPIRE Server and Agent within a deployment environment.

This also modifies the binary stamping to include the binary SHA in the version if its not exactly a git tag. Signed-off-by: Tyler Dixon <tylerd@uber.com>

Some metrics systems reject values that appear auto-generated. Abbreviating the SHA1 Git commit hash in the version string is a reasonable workaround for this. GitHub defaults to abbreviating the commit hash to 7 characters in its UI in seemingly all repos it hosts. For this reason, chose to abbreviate the hash to the first 7 characters for easier correlation between SPIRE versions and the commit on which that version is based. Signed-off-by: Ryan Turner <turner@uber.com>

tools/external/go.mod

go.sum

pkg/common/version/version.go

Signed-off-by: Ryan Turner <turner@uber.com>

go.sum

pkg/common/version/version.go

Signed-off-by: Ryan Turner <turner@uber.com>

evan2645

nice - this will be super useful. left a couple small comments - not sure if @azdagron and @ZymoticB are good with the rest or not

evan2645 · 2019-10-22T22:09:58Z

pkg/common/telemetry/version.go

+)
+
+func EmitVersion(m Metrics) {
+	m.SetGaugeWithLabels([]string{"version"}, 1, []Label{


It feels a little odd to have a metric with constant value, where the name (and usefulness) of the metric is solely fulfilled by a label. What are your thoughts on making this a counter with a name like "agent_boot" and "server_boot" or something along those lines?

I suppose just boot or started could suffice too since we already prefix with agent or server.

Yeah I get the confusion, the value here is a little moot. I'm not sure a counter like you're suggesting is any clearer since the counter value would be reset anytime the server or agent was restarted, effectively making it a constant value of 1 for each active instance. Perhaps a middle-ground solution of calling this metric "{agent,server}_boot" and keeping it as a gauge with value 1 would be clearer? It seems a little more intuitive to me at least that a gauge (as opposed to a counter) might have a constant value for an extended period of time.

The value-prop of a value of 1 is that you can use an aggregation function (such as sum) to get the count of instances, rather than "counting" the series.

Off the top of my head I'm not sure there exists a query language (such as promql, graphite, m3ql, etc.) that allows you to "count" series grouped by tags. For example, if I have the ability to query global metrics, and I want a count of running version by-deployment-area I think I need to use a sum / groupBy function, rather than a "count of series".

I'm not sure a counter like you're suggesting is any clearer since the counter value would be reset anytime the server or agent was restarted, effectively making it a constant value of 1 for each active instance.

Hmm... it's my impression that the emitted value of a counter is the amount to increment by, rather than the absolute value. A little googling seems to support that, but maybe there is something I'm missing?

The value-prop of a value of 1 is that you can use an aggregation function (such as sum) to get the count of instances, rather than "counting" the series.

Heh... the value I see in a counter is that you can easily compute the start/restart rate across a set of hosts. I don't think that can be done currently? Seems like it would be useful.

Off the top of my head I'm not sure there exists a query language (such as promql, graphite, m3ql, etc.) that allows you to "count" series grouped by tags. For example, if I have the ability to query global metrics, and I want a count of running version by-deployment-area I think I need to use a sum / groupBy function, rather than a "count of series".

I see what you're saying... In the graphite docs, they have both a groupByTags function and countSeries function which looks like they can be combined to do something like this. That being said, the DataDog docs recommend a gauge for this use case. I guess I could go either way.

None of the statsd docs seems to indicate that you must re-emit a gauge.

Gauges are a constant data type. They are not subject to averaging, and they don’t change unless you change them. That is, once you set a gauge value, it will be a flat line on the graph until you change it again.

https://statsd.readthedocs.io/en/v3.2.1/types.html#gauges

Thanks for all this supporting information @ZymoticB

I've taken to emitting "milliseconds since start" to monitor for crash loops.

That sounds much more effective than what I was suggesting.

With regards to emission frequency, I suppose we can always switch the behavior to emit more than once at some point in the future without affecting much (if we find that we need to).

Makes sense, @ZymoticB . I think we are all in agreement that uptime is a really useful metric to emit. I can change this to be a gauge metric called "uptime" with a value of "milliseconds since startup" using the approach you suggested.

I'm a bit confused by @rturner3 's comment so I want to clarify.

The suggestion to emit milliseconds since start to monitor for crash loops was in response to @evan2645 's request for a counter that could be used to monitor starts / stops. I agree we all see value in an uptime metric.

That said, IMO we should still have a gauge that emits version so that we can see "how many running instances are running version X", I don't think it should be a single uptime gauge.

Thanks for clarifying, @ZymoticB . I think this thread diverged into multiple directions and the linear nature of GitHub conversations had me a little confused as to what exactly was being proposed for this PR.

I can create a separate issue for the uptime metric and keep this PR focused to just the version metric, which is currently implemented with a gauge.

pkg/common/version/version.go

Signed-off-by: Ryan Turner <turner@uber.com>

go.mod

Signed-off-by: Ryan Turner <turner@uber.com>

Add version metric to spire-server and spire-agent

fb82ac8

This also modifies the binary stamping to include the binary SHA in the version if its not exactly a git tag. Signed-off-by: Tyler Dixon <tylerd@uber.com>

rturner3 requested review from MarcosDY, amartinezfayo, azdagron, drrt, evan2645, marcosy and martincapello as code owners October 21, 2019 17:05

rturner3 force-pushed the td-spire-version branch from 0f12ba0 to ff2bc17 Compare October 21, 2019 17:06

ZymoticB reviewed Oct 21, 2019

View reviewed changes

tools/external/go.mod Show resolved Hide resolved

go.sum Outdated Show resolved Hide resolved

azdagron reviewed Oct 21, 2019

View reviewed changes

pkg/common/version/version.go Outdated Show resolved Hide resolved

Addressing review comments

067e2c6

Signed-off-by: Ryan Turner <turner@uber.com>

zmt reviewed Oct 21, 2019

View reviewed changes

go.sum Show resolved Hide resolved

zmt reviewed Oct 21, 2019

View reviewed changes

pkg/common/version/version.go Outdated Show resolved Hide resolved

ZymoticB reviewed Oct 21, 2019

View reviewed changes

pkg/common/version/version.go Outdated Show resolved Hide resolved

Use default "unk" value when githash is not set

2fd52c8

Signed-off-by: Ryan Turner <turner@uber.com>

evan2645 reviewed Oct 22, 2019

View reviewed changes

Ryan Turner added 2 commits October 22, 2019 18:25

Shorten git commit hash in rev-parse command

79e5ef6

Signed-off-by: Ryan Turner <turner@uber.com>

Renaming metric from "version" to "started"

03276eb

Signed-off-by: Ryan Turner <turner@uber.com>

evan2645 reviewed Oct 25, 2019

View reviewed changes

go.mod Show resolved Hide resolved

Undoing go.mod changes

2972e93

Signed-off-by: Ryan Turner <turner@uber.com>

rturner3 force-pushed the td-spire-version branch from 7dfc04e to 2972e93 Compare October 30, 2019 19:50

evan2645 approved these changes Oct 30, 2019

View reviewed changes

Merge branch 'master' into td-spire-version

f206480

evan2645 merged commit 725fed5 into spiffe:master Oct 30, 2019

amartinezfayo mentioned this pull request Jan 6, 2021

Report uptime as metrics for spire server and agent with config #2032

Merged

3 tasks

Conversation

rturner3 commented Oct 21, 2019

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

evan2645 left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants