
Semantic Conventions: default to seconds for duration units #2977

Closed
gouthamve opened this issue Nov 22, 2022 · 53 comments · Fixed by #3388
Labels: area:semantic-conventions, triaged-accepted, spec:metrics

Comments

@gouthamve
Member

Prometheus and OpenMetrics strongly recommend that the units for measuring durations should be seconds. However, the semantic conventions here are in milliseconds.

This creates a lot of confusion among users who expect durations to be in seconds, and requires an additional `/ 1e3` when doing math with metrics that come from traditional Prometheus sources. For example, today Kubernetes metrics are all in _seconds, mainly because of the Prometheus conventions. The same applies to other existing systems.

Given it's all floats, I think we should revisit the decision to use milliseconds and try to align with Prometheus.

@gouthamve gouthamve added the spec:metrics label Nov 22, 2022
@gouthamve
Member Author

cc @jmacd @jsuereth

@yurishkuro yurishkuro removed their assignment Nov 22, 2022
@jack-berg
Member

The default buckets for explicit bucket histogram are aligned to milliseconds for http.server.duration. We've previously discussed that changing them is a breaking change and not allowed. Therefore, changing from milliseconds to seconds for duration would render the default buckets useless.

@gouthamve gouthamve changed the title Semantic Conventions: Default to seconds for duration units Semantic Conventions: default to seconds for duration units Nov 22, 2022
@gouthamve
Member Author

gouthamve commented Nov 22, 2022

Hrm, so while converting from OTLP to Prometheus, we can always convert to _seconds in the case of explicit bucket histograms. It's simply downscaling the buckets by /1000. It is also in line with the spec, which says:

the unit MUST be added as a suffix to the metric name, and SHOULD be converted to base units recommended by OpenMetrics when possible.

But this is not possible with exponential histograms. Can we make it seconds for exponential histograms?
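
For illustration, here is a minimal sketch of that rescaling for an explicit bucket histogram (the `Histogram` type is a hypothetical, simplified stand-in, not from any OTel SDK): the bucket counts carry over unchanged, and only the boundaries and the sum are divided by 1000.

```java
import java.util.List;
import java.util.stream.Collectors;

final class MsToSecondsConversion {
    // Hypothetical, simplified view of an explicit bucket histogram data point.
    record Histogram(List<Double> boundaries, List<Long> bucketCounts, double sum, long count) {}

    // Rescale a milliseconds histogram to seconds: the counts are unit-free,
    // so only the bucket boundaries and the sum need to be divided by 1000.
    static Histogram msToSeconds(Histogram ms) {
        List<Double> secondsBoundaries =
            ms.boundaries().stream().map(b -> b / 1000.0).collect(Collectors.toList());
        return new Histogram(secondsBoundaries, ms.bucketCounts(), ms.sum() / 1000.0, ms.count());
    }
}
```

The same trick does not work losslessly for exponential histograms, since dividing by 1000 does not map bucket boundaries onto other bucket boundaries (1000 is generally not an exact power of the base).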

@gouthamve
Member Author

This also causes inconsistencies when ingesting metrics from both Prometheus sources and OTel sources into a single db. Half the exponential histograms would be in seconds and the other half would be in milliseconds, with no good way to reconcile/aggregate the two.

It is possible to convert explicit bucket histograms, hence initially I didn't mind the unit being milliseconds, but when I realised that it's not possible with exponential histograms, I wanted to propose this change.

@dashpole
Contributor

If we introduced a way for instrumentation to override the default set of buckets (which could still be overridden by views), would that allow individual instrumentation libraries to switch to seconds?

@jack-berg
Member

If we introduced a way for instrumentation to override the default set of buckets

So you're imagining using what has been referred to as the "hint API" to have all http.server.duration instrumentation report in seconds and specify alternative default buckets that are sensible for seconds?

@nerdondon

I like the idea of allowing instrumentation to override buckets in a way that is more friendly to measurement in seconds but I want to interject with perspective from service mesh instrumentation.

Currently Istio, Dapr, and linkerd report request duration in milliseconds. A change to seconds as the unit of measurement would make it harder for these projects' auto-instrumentation to adopt the HTTP semantic conventions.

@gouthamve
Member Author

gouthamve commented Nov 23, 2022

Currently Istio, Dapr, and linkerd report request duration in milliseconds

I don't see it being a big problem. It's easy to convert from milliseconds to seconds in the fixed bucket histograms that they export. And they specify the unit everywhere, which means we can convert it internally.

If we land on seconds as the unit for exponential histograms, then when these projects implement it, they can choose seconds. Switching from fixed bucket to exponential histograms is considered a breaking change in most projects, so they could make the change when they make the switch.

Also, Cilium supports seconds. I think we have a good opportunity to align the industry here by having Prometheus and OTel recommend the same thing.

@gouthamve gouthamve moved this to Blockers for HTTP semconv stability in Semantic Conventions + Instrumentation Stability WG Nov 23, 2022
@carlosalberto carlosalberto added the triaged-needmoreinfo label Nov 28, 2022
@arminru arminru added the area:semantic-conventions label Nov 28, 2022
@reyang
Member

reyang commented Nov 29, 2022

Two things to consider:

  1. If we change milliseconds to seconds, we should update the default histogram buckets, as pointed out by @jack-berg in #2977 (comment).
  2. Certain backends might prefer integer (which might be related to history, faster processing - e.g. int is in general faster than float, better storage efficiency - e.g. delta encoding).

@jsuereth
Contributor

@jack-berg The default buckets for explicit bucket histogram are aligned to milliseconds for http.server.duration. We've previously discussed that changing them is a breaking change and not allowed. Therefore, changing from milliseconds to seconds for duration would render the default buckets useless.

We had a lot of discussion around semantic convention stability, and the current proposal is actually that bucket boundary changes are not considered a breaking change, for a variety of reasons. I'm going to be submitting a PR shortly updating the metric semconv stability definitions in the stability specification, but this is part of it.

The TL;DR is that in practice histograms are interacted with using a "percentile(histogram, 0.9)" function in most backends, and this should remain stable across changes of buckets. You're just shifting where the error accrues.

@pirgeo
Member

pirgeo commented Nov 29, 2022

One other thing to consider here: durations are almost never recorded in seconds; all programming languages that I can think of use either millis or nanos, so by specifying seconds we would force everybody to perform this (albeit cheap) conversion. This conversion would then only benefit users whose backend prefers seconds.

@jack-berg
Member

Here's previous discussion about whether changing default bucket boundaries is breaking. Based on the PR's phrasing "SDKs SHOULD use the default value when boundaries are not explicitly provided, unless they have good reasons to use something different (e.g. for backward compatibility reasons in a stable SDK release)", the conclusion is that even the less disruptive option of adding buckets is probably a breaking change.

@jsuereth
Contributor

@jack-berg That logic is similar to claiming that changing label values is a breaking change, which we have not done yet. I think we should take that portion of the discussion to a future PR, but I disagree this should be considered a breaking change. I used to think this was a breaking change (e.g. during that PR), but in doing a lot of research into instrumentation stability requirements, I agree with @jmacd's comment.

E.g. OpenMetrics does NOT consider this breaking: https://github.com/OpenObservability/OpenMetrics/blob/main/specification/OpenMetrics.md#histogram

@jmacd
Contributor

jmacd commented Nov 29, 2022

@gouthamve wrote:

But this is not possible with exponential histograms. Can we make it seconds for exponential histograms?

Yes, we should advise against scaling exponential histograms because it is a lossy operation. It is better to choose an ideal unit based on the range of measurement.

@jsuereth wrote:

The TL;DR; is that in practice histograms are interacted with using a "percentile(histogram, 0.9)" function in most backends, and this should remain stable across changes of buckets. You're just shifting where the error accrues.

The choice of units gives users a slight way to improve exponential histogram performance, because the representation favors values near 1.0. If you are histogramming a request that takes around 1 second, the best choice for units is seconds. If you're histogramming a request that takes around 1 millisecond, the best choice is milliseconds. Example: measurements in (1.0, 2.0] seconds for a coarse histogram of 4 buckets, compared with measurements in (1000, 2000] milliseconds. In both cases, we expect scale=2 because there are 4 buckets per power of two. These structures have the same relative error.

In the first case (seconds), buckets will have offset 0 or 1 with boundaries at 1.0, 1.189, 1.414, 1.682, 2.0.

In the second case (milliseconds), buckets will have offset 39 with lower boundaries at 861, 1024, 1217, 1448, 1722, 2048.

This makes the seconds histogram slightly more compressible than the milliseconds histogram; we can also see how it is impossible to convert without loss between these histogram representations by scaling bucket boundaries.
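
For concreteness, here is a small sketch (not taken from any OTel SDK) that computes base-2 exponential bucket indexes and lower boundaries at scale 2, reproducing the offsets above:

```java
public class ExpoBucketOffsets {
    // Lower boundary of bucket `index` at the given scale: base^index, where base = 2^(2^-scale).
    static double lowerBoundary(int index, int scale) {
        return Math.pow(2.0, index * Math.pow(2.0, -scale));
    }

    // Index of the bucket containing `value`, using upper-inclusive buckets as in OTLP.
    static int bucketIndex(double value, int scale) {
        return (int) Math.ceil(Math.log(value) / Math.log(2.0) * Math.pow(2.0, scale)) - 1;
    }

    public static void main(String[] args) {
        int scale = 2; // base = 2^(1/4) ≈ 1.189, i.e. 4 buckets per power of two
        System.out.println(bucketIndex(1.1, scale));     // 0  -> seconds histogram starts at offset 0
        System.out.println(bucketIndex(1100.0, scale));  // 40 -> milliseconds histogram sits around offset 39
        System.out.println(lowerBoundary(39, scale));    // ≈ 861.08
        System.out.println(lowerBoundary(40, scale));    // 1024.0
    }
}
```

Rescaling values by 1000 shifts the index by a non-integer amount (log2(1000) × 4 ≈ 39.86 positions at scale 2), which is why the two representations cannot be converted into each other losslessly by shifting bucket boundaries.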

@gouthamve
Member Author

I was just thinking about @reyang's point:

Certain backends might prefer integer (which might be related to history, faster processing - e.g. int is in general faster than float, better storage efficiency - e.g. delta encoding)

I want to understand the storage reasoning, because the values for each bucket are integers anyway. In Prometheus, the _bucket and _count series are integer values, as they are all counts, and only _sum is a float. So only ~10% of the samples would be floats with seconds compared to milliseconds. I would think the storage benefits are small.

Another thought: this is slightly tangential to HTTP, but we won't be able to measure less than millisecond durations if we use an integer.

@jmacd
Contributor

jmacd commented Nov 30, 2022

we won't be able to measure less than millisecond durations if we use an integer.

This is an example of why we support mixed integer and floating point and do not consider a change of number representation breaking, right? You may tell me about a backend which, once it is storing integer measurements for a timeseries, lacks a way to change to floating point measurements, but that is a case I am not sympathetic about: that backend should reconsider its choices.

In the real world, the precision and accuracy of an instrument are fundamentally tied to the units and range of values being measured. If I have a measurement in milliseconds, the precision and accuracy of the measurement are defined in milliseconds. If I take a measurement using a milliseconds timer and scale the result into seconds or nanoseconds, the result has a misleading number of significant figures. This leads me to think that a change of units should be considered a breaking change, but users should be allowed to configure the unit that suits them best depending on what they are measuring. If you are measuring a process that typically lasts seconds, you should use seconds. If you are measuring a process that typically lasts milliseconds, you should use milliseconds.

The real world is also full of examples based on temperature measurement. We have one unit for very cold temperatures (K) and we have one unit for room temperatures (C). Just because we have a formula to convert between these does not mean we should, because real-world thermometers are calibrated for specific temperature ranges. There is not an instrument that measures room temperatures in K nor an instrument that measures very low temperatures in C. This tells me there should not be a "one true unit" for temperature or duration or really any physical measurement.

@jsuereth
Contributor

@jmacd I think the proposal here is specific to HTTP semantic conventions. The question there is if we expect HTTP services to be typically measured in milliseconds or seconds.

@gouthamve I do think the points about exponential histograms here are important to consider if we push for a convention. I also think we need to fully align on the notion that exponential histogram bucket boundaries (in OTEL) are designed so that they change to match the best resolution achievable in a limited amount of memory. The goal (from OTEL) is not to require users (or instrumentation authors) to understand the dynamic range of what's being measured. From that perspective, the choice of unit entering the instrument is important, particularly given the algorithm in use.

Personally, I'm still divided on this issue. I see a lot of Prometheus instrumentation and I'd be concerned if the friction between OTEL semantic conventions and Prometheus defaults would limit OTEL instrumentation adoption (i.e. if I'm already using PromClient for metrics, will this cause enough friction that we won't see OTEL adopted for traces+metrics where otherwise it'd be compatible?). I also understand the technical reasons why ms was chosen and see the friction it would cause in OTEL today to switch.

@jmacd
Contributor

jmacd commented Nov 30, 2022

@jsuereth based on your comments, I think we should leave the specification for milliseconds as the conventional unit for HTTP durations. Prometheus uses this unit, and the Statsd Timing measurement uses milliseconds as well. I take it we would recommend floating-point instruments when there is an expectation of sub-millisecond measurements and recommend integer instruments when there is no such expectation. Either way, the exported histogram data points do not reflect the original integer-vs-floating-point distinction.

@gouthamve
Member Author

based on your comments, I think we should leave the specification for milliseconds as the conventional unit for HTTP durations

I don't yet see it, but I might be mistaken. From what I understood, this is the algorithm to pick the scale: https://github.com/open-telemetry/opentelemetry-specification/blob/main/specification/metrics/sdk.md#exponential-bucket-histogram-aggregation (with the default of 160 buckets, for example).

If the typical requests are around 500ms-2000ms, then picking seconds as unit would mean more accuracy. If they are around 1-100ms, then picking milliseconds would be better. Or am I completely off-base here somehow?
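
For reference, a simplified sketch of that scale-selection idea (not the SDK's actual code): start at a high scale and downscale until the observed range fits into the bucket budget.

```java
public class ScaleSelection {
    // Index of the bucket containing `value` at the given scale (upper-inclusive buckets).
    static int bucketIndex(double value, int scale) {
        return (int) Math.ceil(Math.log(value) / Math.log(2.0) * Math.pow(2.0, scale)) - 1;
    }

    // Largest scale at which [min, max] fits into maxBuckets contiguous buckets.
    static int pickScale(double min, double max, int maxBuckets) {
        int scale = 20; // start high, then downscale until the range fits
        while (scale > -10 && bucketIndex(max, scale) - bucketIndex(min, scale) + 1 > maxBuckets) {
            scale--;
        }
        return scale;
    }

    public static void main(String[] args) {
        // Same max/min ratio, so both land on the same scale with the 160-bucket default:
        System.out.println(pickScale(0.5, 2.0, 160));      // seconds
        System.out.println(pickScale(500.0, 2000.0, 160)); // milliseconds
    }
}
```

Under these simplified assumptions, the same max/min ratio yields the same scale in either unit, so the unit mostly shifts the bucket offset rather than the achievable relative accuracy (modulo the boundary-alignment effects described earlier).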

Prometheus uses this unit, and the Statsd Timing measurement uses milliseconds as well.

Prometheus uses seconds unfortunately :(

@dpk83

dpk83 commented Dec 8, 2022

For a lot of services, request latencies for most requests are desired to be within a few milliseconds to a few hundred milliseconds, so having seconds as the default unit of measurement doesn't look like the right direction.

Also, as @reyang mentioned earlier, in many systems performance is critical and integers are preferred, as some backends can handle storage and processing better with integers.

@pirgeo
Member

pirgeo commented Mar 7, 2023

Yes, I agree. I think there is no right or wrong here; both milliseconds and seconds are in use, and either choice will break some users, regardless of what we do.

I think the point that I am trying to make is that the OpenTelemetry community built these semantic conventions with a rather specific use case in mind, for which milliseconds usually work well. Instrumentations that have already implemented these semantic conventions will no longer be compliant. That's okay; the semantic conventions are experimental right now. Users that get their data from other data sources that use milliseconds, and that align well with the OTel SemConv today, will have to find ways of transforming their data to stay compliant or move away from the SemConvs. I think the problem we are facing here is that milliseconds are already rather ingrained into the OTel world, and it will be hard to move all of that now. Maybe I am just wrong about that though, and it will be a breeze anyway. On the other hand, keeping them in milliseconds requires users with a Prometheus backend to convert their new metrics if they want them to align with the other metrics that they have in seconds.

However, what worries me most is that we make such a sweeping change relatively late. With the push for HTTP semantic conventions stability, we will want to mark them as stable as soon as possible. Completely changing the metric shortly before stabilization seems... potentially disruptive. We will end up with a hodgepodge of instrumentations that do different things. That is probably also okay. It should work itself out over time, although I assume many people have a lot of work ahead of them to align this in their instrumentations, applications, and the data views in their backends. (I wonder if that will harm adoption.)

Either way, I understand the needs of both sides. But: can we introduce such a change so abruptly? I know I have been playing the devil’s advocate on this issue, but I really want us to do things right. I think this issue is also to a certain degree a question of how we deal with incompatibilities between the Prometheus and OpenTelemetry projects in the future. We have been trying to maintain compatibility in most areas but there will be differences here and there, especially as we mark more and more parts of the spec as stable. Do we go for compatibility at all costs? The answer might very well be yes, and that is okay, too! I think this is something for the OTel leadership to decide, and either way will work for some, and not work for others.

@bertysentry
Contributor

@pirgeo What do you think of my suggestion? (Separate metrics for milliseconds (*.duration) and seconds (*.time).)

@pirgeo
Member

pirgeo commented Mar 7, 2023

@bertysentry In theory I think it's a good idea if we want to do a transition period, but it would duplicate a lot of the semantic conventions that we have today. It would probably lead to a similar state where every instrumentation is using the one that fits them, and we are just punting the problem. It's a possible solution, but I think we'd rather keep the semantic conventions we have today, and agree on one default.

@yurishkuro
Member

My 2c - this could be handled via stronger typing, making the question of units a non-issue. Many languages today have established conventions for default time units, e.g., in Python it's seconds, while in Go/Java the units are provided explicitly. Since capturing durations is one of the primary functions of OTEL, our APIs could have dedicated methods for that, where units are either explicit or follow the language convention. This makes the instrumentation immune to the changes being discussed in this issue.

Then we have transmission and exposition formats. In OTLP we can include units, or even specialize durations as a value type. In other exposition formats the exporter follows the existing format's conventions.
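
As a rough illustration of that idea, here is a hypothetical strongly-typed duration instrument (not part of the OpenTelemetry API): the caller passes a `Duration`, and the unit is chosen once at the recording/export boundary rather than by each instrumentation.

```java
import java.time.Duration;
import java.util.function.DoubleConsumer;

// Hypothetical wrapper: instrumentation records a typed Duration and never
// decides on a unit; the exporter (or SDK) picks the conventional unit once.
final class DurationHistogram {
    private final DoubleConsumer underlyingRecord; // e.g. an SDK histogram's record(double)

    DurationHistogram(DoubleConsumer underlyingRecord) {
        this.underlyingRecord = underlyingRecord;
    }

    void record(Duration duration) {
        // Convert at the boundary; here we assume the backing histogram is in seconds.
        underlyingRecord.accept(duration.toNanos() / 1e9);
    }
}
```

Usage would look like `histogram.record(Duration.ofMillis(42))`, leaving the seconds-vs-milliseconds decision out of the instrumentation code entirely.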

@jsuereth jsuereth moved this from Blocker for HTTP semconv stability to In Progress in Semantic Conventions + Instrumentation Stability WG Mar 20, 2023
@reyang
Member

reyang commented Mar 29, 2023

We've discussed this during the Feb. 14th specification SIG meeting:

  1. We will make the change to use seconds instead of ms, which aligns with Prometheus.

^ this was the consensus from the spec meeting, if anyone disagrees and would like to request the TC (technical committee) to make the final call, please reply here explicitly before end of Mar. 7th (Pacific Time).

@trask FYI

The TC (technical committee) has done the voting, and 5 out of 9 members voted for seconds (s in the UCUM case sensitive ("c/s") format) as the recommended unit for duration.

In addition, we understand that semantic convention changes should be done in a careful way to reduce the negative impact. Several things should be considered:

  1. Give the users a reasonable notice period before the actual implementation change.
  2. Explore ways to make it smoother (e.g. hint/advice API, translation).

Apr. 4th, 2023 update: editorial change to clarify "seconds" and UCUM case sensitive ("c/s") s.

@trask trask moved this from In Progress to Blocker for HTTP semconv stability in Semantic Conventions + Instrumentation Stability WG Apr 2, 2023
jack-berg added a commit that referenced this issue Apr 8, 2023
Fixes #2229.
Related to #3061 (lays groundwork but does not resolve).
Related to #2977, which may use this new API to have
`http.server.duration` report in seconds instead of ms without changing
/ breaking default bucket boundaries.

Summary of the change:
- Proposes a new parameter to optionally include when creating
instruments, called "advice".
- For the moment, advice only has one parameter for specifying the
bucket boundaries of explicit bucket histogram.
- Advice can be expanded with additional parameters in the future (e.g.
default attributes to retain). The parameters may be general (aka
applicable to all instruments) or specific to a particular instrument
kind, like bucket boundaries.
- Advice parameters can influence the [default
aggregation](https://github.com/open-telemetry/opentelemetry-specification/blob/main/specification/metrics/sdk.md#default-aggregation),
which is used if there is no matching view and if the reader does not
specify a preferred aggregation.
- Not clear that all advice will be oriented towards configuring
aggregation, so I've intentionally left the scope of what they can
influence open ended.

I've prototyped this in java
[here](open-telemetry/opentelemetry-java#5217).
Example usage:
```
DoubleHistogram doubleHistogram =
        meterProvider
            .get("meter")
            .histogramBuilder("histogram")
            .setUnit("foo")
            .setDescription("bar")
            .setAdvice(
                advice -> advice.setBoundaries(Arrays.asList(10.0, 20.0, 30.0)))
            .build();
```

Advice could easily be changed to "hint" with everything else being
equal. I thought "advice" clearly described what we're trying to
accomplish, which is to advise / make recommendations to the
implementation so that it provides useful output with minimal configuration.

---------

Co-authored-by: Reiley Yang <[email protected]>
@github-project-automation github-project-automation bot moved this from Blocker for HTTP semconv stability to Done in Semantic Conventions + Instrumentation Stability WG Apr 14, 2023
carlosalberto pushed a commit to carlosalberto/opentelemetry-specification that referenced this issue Oct 31, 2024