# Semantic conventions for Uptime Monitoring #185
> uptime is reported as a gauge with the value of the number of seconds that the process has been up. This is written as a gauge because users want the actual number of seconds since the last restart to satisfy the use cases above. Sums are not a good fit for these use cases because most metric backends default cumulative monotonic sums to rate calculations, and have overflow handling that is undesired for this use case.
>
> Sums report a total value that has accumulated over a time window; it is valid, for instance, to subtract the current value of a cumulative sum and reset the start timestamp to now. (OpenTelemetry's Prometheus receiver does this, for instance.) An intended use case of a sum is to produce a meaningful value when aggregating away labels using sum. Such aggregations are not meaningful in the above use cases.
One thing implied here is that summing across processes doesn't add together the seconds since last restart of each process in a meaningful way. Aggregating counters requires time windows to be aligned, and that alignment changes the value in the sum to cover the new time window, which doesn't preserve the actual seconds since last restart as stated in the first paragraph of this section.
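For context, here is a minimal sketch of the gauge shape being discussed, using the OpenTelemetry Python API (the instrument name is illustrative, not the OTEP's final convention):

```python
import time

from opentelemetry import metrics
from opentelemetry.metrics import CallbackOptions, Observation

_started = time.monotonic()  # reference point; resets when the process restarts

def observe_uptime(options: CallbackOptions):
    # Gauge semantics: report the current value, seconds since last restart.
    yield Observation(time.monotonic() - _started)

meter = metrics.get_meter("uptime-example")
meter.create_observable_gauge(
    "process.uptime",  # illustrative name, not the OTEP's final convention
    callbacks=[observe_uptime],
    unit="s",
    description="Seconds since last restart",
)
```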
> ## Trade-offs and mitigations
>
> The biggest tradeoff here is defining `uptime` metrics as non-montonic sums vs. either pure gauge or non-montonic sums. The fundmental question here is whether default sum-based aggregation is meaningful for this metric, in addition to the default-query-capabilities of common backends for cumulative sums. The proposal trades-off allowing an external observer to monitor uptime (with resets) in addition to common assumptions on querying rates for cumulative sums.
nit, small typo on non-montonic
also fundmental => fundamental
> | Name | Description | Units | Instrument Type |
> | ---- | ----------- | ----- | --------------- |
> | *.uptime | Seconds since last restart | s | Asynchronous Gauge |
> | *.health | Availability flag | 1 | Asynchronous Gauge |
> | *.restart_count | Number of restarts | 1 | Asynchronous Counter |
There is already a `restart_count` attribute defined for resources: https://github.com/open-telemetry/opentelemetry-specification/blob/a25d5f03ab58ecf88c09f635df97d2328b5ba237/specification/resource/semantic_conventions/k8s.md#container

It would be useful to explain how these two are related (if they are).
From a metric standpoint, having `restart_count` as a resource attribute is really bad. I'm surprised no one commented on that, but I think it should be dropped, as it violates resource identity and causes high cardinality.

The TL;DR, though, is that `restart_count` is something you'd want to alert on, and for metric systems that means it needs to be a metric. They'd likely be the same value, but one is a resource attribute and the other is a metric data point.
> From a metric standpoint, having `restart_count` as a resource attribute is really bad. I'm surprised no one commented on that, but I think it should be dropped, as it violates resource identity and causes high cardinality.

It was discussed, see open-telemetry/opentelemetry-specification#1945 (review). The conclusion was that for a k8s container in a pod it is an identifying attribute, not a metric.
The Kubernetes restart count is not actually part of a unique-across-time identifier: https://kubernetes.io/docs/reference/generated/kubernetes-api/v1.21/#containerstatus-v1-core

> The number of times the container has been restarted, currently based on the number of dead containers that have not yet been removed. Note that this is calculated from dead containers. But those containers are subject to garbage collection. This value will get capped at 5 by GC.

Arguably that means it's not even useful as a metric, but it's certainly not a unique identifier.
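For illustration, if `restart_count` were kept as a metric rather than a resource attribute, the asynchronous-counter shape from the table above might look like this sketch in the OpenTelemetry Python API (the name and the stubbed source of the count are assumptions):

```python
from opentelemetry import metrics
from opentelemetry.metrics import CallbackOptions, Observation

def read_restart_count() -> int:
    # Hypothetical stand-in: a real implementation would query the
    # orchestrator (e.g. the kubelet) for the cumulative restart total.
    return 0

def observe_restarts(options: CallbackOptions):
    yield Observation(read_restart_count())

meter = metrics.get_meter("uptime-example")
meter.create_observable_counter(
    "process.restart_count",  # illustrative name
    callbacks=[observe_restarts],
    unit="1",
    description="Number of restarts",
)
```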
> | Name | Description | Units | Instrument Type |
> | ---- | ----------- | ----- | --------------- |
> | *.uptime | Seconds since last restart | s | Asynchronous Gauge |
> | *.health | Availability flag | 1 | Asynchronous Gauge |
The paragraph above nicely lays out the drawbacks of "health" as a metric. As opposed to that, I like Kubernetes's approach. "Liveliness" and "readyness" are easier to define precisely: "liveliness" means "I am alive, let me run" (and the opposite of that means "I am in trouble, need help, restart me, do something"), and "readyness" means "I can now accept a workload".

I have a hard time assigning a similarly precise meaning to the "health" metric. What does "availability" mean?

Given this, should we perhaps avoid "health" as a metric altogether and instead use "ready" and "lively" (or whatever the names) metrics?
This is meant to be the `readyness` flag. We could split this into two, with `readyness` being recommended and `liveliness` being optional.
I like the naming of `readyness` (`readiness`?) and `liveliness`, but if we're going down the road of multiple types, there actually could be an arbitrary number of health statuses. For example, a backend could serve internal and external clients, and be unready for external clients but ready for internal clients. Can we just call it `readyness` and say "add an attribute if there are multiple distinct ready states"?
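To make that suggestion concrete, here is a minimal sketch using the OpenTelemetry Python API (the metric name, attribute key, and the hardcoded ready flags are all hypothetical; real probes would compute them):

```python
from opentelemetry import metrics
from opentelemetry.metrics import CallbackOptions, Observation

# Hypothetical per-audience ready flags; real readiness checks would go here.
READY = {"internal": True, "external": False}

def observe_ready(options: CallbackOptions):
    for audience, ready in READY.items():
        # One series per distinct ready state, distinguished by attribute.
        yield Observation(1 if ready else 0, {"audience": audience})

meter = metrics.get_meter("uptime-example")
meter.create_observable_gauge(
    "process.ready",  # illustrative name
    callbacks=[observe_ready],
    unit="1",
    description="Readiness flag (1 = ready)",
)
```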
> | Name | Description | Units | Instrument Type |
> | ---- | ----------- | ----- | --------------- |
> | *.uptime | Seconds since last restart | s | Asynchronous Gauge |
The general recommendations appear to say that a `.time` suffix needs a dot. Should this be `*.up.time` if we follow the recommendation?
It seems `*.time` is normally used for things that have an additive property.
> ## Motivation
>
> Why should we make this change? What new value would it bring? What use cases does it enable?
These are just questions and not "motivation".
I think this is just a copy of the template: https://github.com/open-telemetry/oteps/blob/main/0000-template.md

@jsuereth probably forgot to update this section :-)
I did, will update later in the week (a bit overloaded today)
I'd like to consider an alternative not mentioned in this document, and I'm not sure where to propose it. Instead of two metrics, "health" and "uptime", I propose a single non-monotonic Sum named "alive" with value 1. This data type requires that a start time be included with the measurement, unlike Gauge. The difference between the start time and the measurement time is the process uptime. I have made this proposal already in connection with open-telemetry/opentelemetry-specification#1078, where I pointed out that we can implement service discovery in a push-based metrics system by joining this "alive" metric with information retrieved by service discovery.
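For illustration, a rough sketch of that alternative with the OpenTelemetry Python API (the name `process.alive` follows the comment above and is not standardized; note that it is the SDK, not the callback, that attaches the start timestamp this idea relies on):

```python
from opentelemetry import metrics
from opentelemetry.metrics import CallbackOptions, Observation

def observe_alive(options: CallbackOptions):
    # Constant 1 while the process lives. The SDK attaches the stream's
    # start timestamp, so uptime = observation time - start time.
    yield Observation(1)

meter = metrics.get_meter("uptime-example")
meter.create_observable_up_down_counter(
    "process.alive",  # name taken from the comment above; not standardized
    callbacks=[observe_alive],
    unit="1",
    description="1 while the process is alive",
)
```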
@jmacd Commented offline, but recording here for posterity.

From a pure collection standpoint, I like a lot of what this brings, however I think we need to take an end-to-end focus. Specifically: "Can I write a query / dashboard / alert to solve the stated use cases?" AFAICT, with known backends/query languages (Prometheus, Graphite, etc.) it's hard to pull the data back out, specifically the "Seconds since start" value in PromQL. We should make sure we have an answer to that.
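(For comparison: Prometheus-native setups usually recover uptime as `time() - process_start_time_seconds`, which works only because the start time is exported as its own gauge; the start timestamp of a cumulative sum is not directly addressable in PromQL.)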
OpenTelemetry is now working on defining how the health status should be reported via metrics: open-telemetry/oteps#185. I am removing it from here so that if the metric is added we can use that. In the unlikely event the metric is not added, we can think about whether we really need it as a field in OpAMP.
@jsuereth how important/relevant is this OTEP? Please assign an appropriate priority, or close if it's old and we no longer need it.
What is the state of this? It is still not clear to me how to implement this in OTel. I suppose uptime is OK, but the health metric as 1|0 makes it not so useful. Should I then just do uptime for both, and only update health if the checks succeed? Is it not a common use case that most services would need this in some way? Or are people just relying directly on Kubernetes checks instead? I understand that metrics such as ops/sec are much better, but not all services are doing stuff all the time, so this is much needed for those. I had made an issue on this but closed it, expecting this might progress: open-telemetry/opentelemetry-specification#2923
I'm also curious about the state of this proposal, since I have the same use case as described in open-telemetry/opentelemetry-specification#2923.
@jsuereth is this stale, or is semconv currently working on this?
I would also be interested in this. A generic
OTEPs have been moved to the Specification repository. Please consider re-opening this PR against the new location. Closing.