
Emit version metric from SPIRE Server and Agent on startup#1208

Merged
evan2645 merged 8 commits into spiffe:master from ZymoticB:td-spire-version
Oct 30, 2019

Conversation

@rturner3 (Collaborator)

Pull Request check list

  • Commit conforms to CONTRIBUTING.md?
  • Proper tests/regressions included?
  • Documentation updated?

Affected functionality
Adds new version metric in SPIRE Server and Agent

Description of change
Upon startup, SPIRE Server and Agent emit a new version metric: a gauge with a constant value of 1 and a version label carrying the component's version string. The version string is built in the format <Major>.<Minor>.<Patch>-dev-<first 7 characters of the SHA1 Git commit hash>. This helps visualize the rollout of new SPIRE Server and Agent versions within a deployment environment.

This also modifies the binary stamping to include the
binary SHA in the version when the build is not exactly a Git tag.

Signed-off-by: Tyler Dixon <tylerd@uber.com>
Some metrics systems reject values that appear auto-generated.
Abbreviating the SHA1 Git commit hash in the version string is a
reasonable workaround for this. GitHub defaults to abbreviating the
commit hash to 7 characters in its UI in seemingly all repos it hosts.
For this reason, we chose to abbreviate the hash to its first 7 characters,
making it easier to correlate a SPIRE version with the commit it is based on.
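The version-string scheme described above can be sketched in Go. The constants below are hypothetical stand-ins for SPIRE's build-time stamped variables, and `versionString` is an illustrative helper, not the PR's actual implementation:

```go
package main

import "fmt"

// Hypothetical stand-ins for values stamped into the binary at build time.
const (
	version = "0.9.0"
	gitHash = "a1b2c3d4e5f67890a1b2c3d4e5f67890a1b2c3d4" // full SHA1 commit hash
)

// versionString mimics the scheme for non-tagged builds: append "-dev-"
// plus the first 7 characters of the commit hash.
func versionString() string {
	return fmt.Sprintf("%s-dev-%s", version, gitHash[:7])
}

func main() {
	fmt.Println(versionString()) // prints 0.9.0-dev-a1b2c3d
}
```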

Signed-off-by: Ryan Turner <turner@uber.com>
@evan2645 (Member) left a comment


nice - this will be super useful. left a couple small comments - not sure if @azdagron and @ZymoticB are good with the rest or not

```go
)

func EmitVersion(m Metrics) {
	m.SetGaugeWithLabels([]string{"version"}, 1, []Label{
```
Member:

It feels a little odd to have a metric with constant value, where the name (and usefulness) of the metric is solely fulfilled by a label. What are your thoughts on making this a counter with a name like "agent_boot" and "server_boot" or something along those lines?

Member:

I suppose just boot or started could suffice too since we already prefix with agent or server.

Collaborator (Author):

Yeah, I get the confusion; the value here is a little moot. I'm not sure a counter like you're suggesting is any clearer, since the counter value would be reset anytime the server or agent was restarted, effectively making it a constant value of 1 for each active instance. Perhaps a middle-ground solution of calling this metric "{agent,server}_boot" and keeping it as a gauge with value 1 would be clearer? It seems a little more intuitive to me, at least, that a gauge (as opposed to a counter) might hold a constant value for an extended period of time.

Contributor:

The value-prop of a value of 1 is that you can use an aggregation function (such as sum) to get the count of instances, rather than "counting" the series.

Off the top of my head I'm not sure there exists a query language (such as promql, graphite, m3ql, etc.) that allows you to "count" series grouped by tags. For example, if I have the ability to query global metrics, and I want a count of running version by-deployment-area I think I need to use a sum / groupBy function, rather than a "count of series".
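The aggregation being described can be illustrated with a small Go sketch. The `sample` type and `countByVersion` function below are purely illustrative: they model scraped gauge series (one per instance, constant value 1) and show how a sum-by-label aggregation (e.g. PromQL's `sum by (version) (...)`) yields an instance count per version:

```go
package main

import "fmt"

// sample is a hypothetical scraped data point: one gauge series per
// instance, each with a constant value of 1 and a version label.
type sample struct {
	instance string
	version  string
	value    float64
}

// countByVersion mimics a sum/groupBy aggregation: summing the constant
// 1s grouped by the version label yields the number of running
// instances per version, with no "count of series" operation needed.
func countByVersion(samples []sample) map[string]float64 {
	counts := make(map[string]float64)
	for _, s := range samples {
		counts[s.version] += s.value
	}
	return counts
}

func main() {
	samples := []sample{
		{"server-1", "0.9.0", 1},
		{"server-2", "0.9.0", 1},
		{"server-3", "0.9.1", 1},
	}
	fmt.Println(countByVersion(samples)) // map[0.9.0:2 0.9.1:1]
}
```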

Member:

> I'm not sure a counter like you're suggesting is any clearer since the counter value would be reset anytime the server or agent was restarted, effectively making it a constant value of 1 for each active instance.

Hmm... it's my impression that the emitted value of a counter is the amount to increment by, rather than the absolute value. A little googling seems to support that, but maybe there is something I'm missing?

> The value-prop of a value of 1 is that you can use an aggregation function (such as sum) to get the count of instances, rather than "counting" the series.

Heh... the value I see in a counter is that you can easily compute the start/restart rate across a set of hosts. I don't think that can be done currently? Seems like it would be useful.

> Off the top of my head I'm not sure there exists a query language (such as promql, graphite, m3ql, etc.) that allows you to "count" series grouped by tags. For example, if I have the ability to query global metrics, and I want a count of running version by-deployment-area I think I need to use a sum / groupBy function, rather than a "count of series".

I see what you're saying... In the graphite docs, they have both a groupByTags function and a countSeries function, which look like they can be combined to do something like this. That being said, the DataDog docs recommend a gauge for this use case. I guess I could go either way.

Contributor:

None of the statsd docs seem to indicate that you must re-emit a gauge.

> Gauges are a constant data type. They are not subject to averaging, and they don’t change unless you change them. That is, once you set a gauge value, it will be a flat line on the graph until you change it again.

https://statsd.readthedocs.io/en/v3.2.1/types.html#gauges

Member:

Thanks for all this supporting information @ZymoticB

> I've taken to emitting "milliseconds since start" to monitor for crash loops.

That sounds much more effective than what I was suggesting.

With regards to emission frequency, I suppose we can always switch the behavior to emit more than once at some point in the future without affecting much (if we find that we need to).

Collaborator (Author):

Makes sense, @ZymoticB . I think we are all in agreement that uptime is a really useful metric to emit. I can change this to be a gauge metric called "uptime" with a value of "milliseconds since startup" using the approach you suggested.
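The uptime idea discussed here can be sketched in Go. The `startTime` variable and `uptimeMillis` helper below are illustrative, not SPIRE's actual implementation; a real emitter would report this value periodically as a gauge:

```go
package main

import (
	"fmt"
	"time"
)

// startTime records when the process started; a hypothetical stand-in
// for state captured during server/agent initialization.
var startTime = time.Now()

// uptimeMillis returns milliseconds elapsed since process start. Emitted
// periodically as a gauge, repeated resets to zero (a sawtooth on the
// graph) would reveal a crash loop.
func uptimeMillis() float64 {
	return float64(time.Since(startTime)) / float64(time.Millisecond)
}

func main() {
	time.Sleep(10 * time.Millisecond)
	fmt.Printf("uptime: %.0f ms\n", uptimeMillis())
}
```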

Contributor:

I'm a bit confused by @rturner3's comment so I want to clarify.

The suggestion to emit milliseconds since start to monitor for crash loops was in response to @evan2645's request for a counter that could be used to monitor starts/stops. I agree we all see value in an uptime metric.

That said, IMO we should still have a gauge that emits the version so that we can see "how many running instances are running version X"; I don't think it should be collapsed into a single uptime gauge.

Collaborator (Author):

Thanks for clarifying, @ZymoticB . I think this thread diverged into multiple directions and the linear nature of GitHub conversations had me a little confused as to what exactly was being proposed for this PR.

I can create a separate issue for the uptime metric and keep this PR focused on just the version metric, which is currently implemented with a gauge.

Ryan Turner added 2 commits October 22, 2019 18:25
Signed-off-by: Ryan Turner <turner@uber.com>


5 participants