-
Notifications
You must be signed in to change notification settings - Fork 893
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Supplementary guidance for metrics additive property #2571
Changes from 14 commits
637dc8d
5ad535d
9b3e7f6
0188b44
b7a953e
2dd6361
cf8347d
4791d82
7cbd3ab
3e96143
4daba5a
3cbc095
930181b
ab91015
f421bf0
66bc491
6d58179
28716a4
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -12,6 +12,9 @@ requirements to the existing specifications. | |
- [Guidelines for instrumentation library authors](#guidelines-for-instrumentation-library-authors) | ||
* [Instrument selection](#instrument-selection) | ||
* [Additive property](#additive-property) | ||
+ [Numeric type selection](#numeric-type-selection) | ||
- [Integer](#integer) | ||
- [Float](#float) | ||
* [Monotonicity property](#monotonicity-property) | ||
* [Semantic convention](#semantic-convention) | ||
- [Guidelines for SDK authors](#guidelines-for-sdk-authors) | ||
|
@@ -79,6 +82,139 @@ Here is one way of choosing the correct instrument: | |
|
||
### Additive property | ||
|
||
In OpenTelemetry a [Measurement](./api.md#measurement) encapsulates a value and | ||
a set of [`Attributes`](../common/README.md#attribute). Depending on the nature | ||
of the measurements, they can be additive, non-additive or somewhere in the | ||
middle. Here are some examples: | ||
reyang marked this conversation as resolved.
Show resolved
Hide resolved
|
||
|
||
* The server temperature is non-additive. The temperatures in the table below | ||
add up to `226.2`, but this value has no practical meaning. | ||
|
||
| Host Name | Temperature (F) | | ||
| --------- | --------------- | | ||
| MachineA | 58.8 | | ||
| MachineB | 86.1 | | ||
| MachineC | 81.3 | | ||
|
||
* The mass of planets is additive, the value `1.18e25` (`3.30e23 + 6.42e23 + | ||
4.87e24 + 5.97e24`) means the combined mass of terrestrial planets in the | ||
solar system. | ||
|
||
| Planet Name | Mass (kg) | | ||
| ----------- | --------------- | | ||
| Mercury | 3.30e23 | | ||
| Mars | 6.42e23 | | ||
| Venus | 4.87e24 | | ||
| Earth | 5.97e24 | | ||
|
||
* The voltage of battery cells can be added up if the batteries are connected in | ||
series. However, if the batteries are connected in parallel, it makes no sense | ||
to add up the voltage values anymore. | ||
|
||
In OpenTelemetry, each [Instrument](./api.md#instrument) implies whether it is | ||
reyang marked this conversation as resolved.
Show resolved
Hide resolved
|
||
additive or not. For example, [Counter](./api.md#counter) is additive while | ||
reyang marked this conversation as resolved.
Show resolved
Hide resolved
|
||
[Asynchronous Gauge](./api.md#asynchronous-gauge) is non-additive. | ||
|
||
reyang marked this conversation as resolved.
Show resolved
Hide resolved
|
||
#### Numeric type selection | ||
reyang marked this conversation as resolved.
Show resolved
Hide resolved
|
||
|
||
For Instruments which take increments and/or decrements as the input (e.g. | ||
[Counter](./api.md#counter) and [UpDownCounter](./api.md#updowncounter)), the | ||
underlying numeric types (e.g., signed integer, unsigned integer, double) have | ||
direct impact on the dynamic range, precision, and how the data is interpreted. | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Can we codify how the implementers choice of datatype impacts what values can be stored for intermediate aggregates, on the wire, and long-term persistent aggregates? We are telling them there is impact and the examples give suggestions of what the impacts might be, but extrapolating these examples still left me with ambiguity. For example what parts of these examples are guarantees that all OpenTelemetry compliant implementations would/would not overflow and what parts were implementation specific choices how one particular SDK and backend might behave? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. There is no guarantee - as called out in the first line of this doc ("Note: this document is NOT a spec, it is provided to support the Metrics API and SDK specifications, it does NOT add any extra requirements to the existing specifications.") However, I do think it'll be helpful to give some suggestions (not requirements) to certain situations - e.g. JavaScript might not support int64, Python might treat integers as BigInt. |
||
Typically, integers are precise but have limited dynamic range, and might see | ||
overflow/underflow. [IEEE-754 double-precision floating-point | ||
format](https://wikipedia.org/wiki/Double-precision_floating-point_format) has a | ||
wide dynamic range of numeric values with the sacrifice on precision. | ||
|
||
##### Integer | ||
reyang marked this conversation as resolved.
Show resolved
Hide resolved
|
||
|
||
Let's take an example: a 16-bit signed integer is used to count the committed | ||
transactions in a database, reported as cumulative sum every 15 seconds: | ||
|
||
* During (T<sub>0</sub>, T<sub>1</sub>], we reported `70`. | ||
* During (T<sub>1</sub>, T<sub>2</sub>], we reported `115`. | ||
* During (T<sub>2</sub>, T<sub>3</sub>], we reported `116`. | ||
* During (T<sub>3</sub>, T<sub>4</sub>], we reported `128`. | ||
* During (T<sub>4</sub>, T<sub>5</sub>], we reported `128`. | ||
* During (T<sub>5</sub>, T<sub>6</sub>], we reported `173`. | ||
reyang marked this conversation as resolved.
Show resolved
Hide resolved
|
||
* ... | ||
* During (T<sub>n</sub>, T<sub>n+1</sub>], we reported `1,872`. | ||
* During (T<sub>n+1</sub>, T<sub>n+2</sub>], we reported `35`. | ||
reyang marked this conversation as resolved.
Show resolved
Hide resolved
|
||
* During (T<sub>n+2</sub>, T<sub>n+3</sub>], we reported `76`. | ||
|
||
In the above case, a backend system could tell that there was likely a system | ||
restart (and the counter was reset to zero) during (T<sub>n+1</sub>, | ||
reyang marked this conversation as resolved.
Show resolved
Hide resolved
|
||
T<sub>n+2</sub>], so it has chance to adjust the data to: | ||
|
||
* (T<sub>n+1</sub>, T<sub>n+2</sub>] : `1,907` (1,872 + 35). | ||
* (T<sub>n+2</sub>, T<sub>n+3</sub>] : `1,948` (1,872 + 76). | ||
|
||
Imagine we keep the database running: | ||
|
||
* During (T<sub>m</sub>, T<sub>m+1</sub>], we reported `32,758`. | ||
* During (T<sub>m+1</sub>, T<sub>m+2</sub>], we reported `32,762`. | ||
* During (T<sub>m+2</sub>, T<sub>m+3</sub>], we reported `-32,738`. | ||
* During (T<sub>m+3</sub>, T<sub>m+4</sub>], we reported `-32,712`. | ||
|
||
In the above case, the backend system could tell that there was an integer | ||
overflow during (T<sub>m+2</sub>, T<sub>m+3</sub>], so it has chance to adjust | ||
the data to: | ||
|
||
* (T<sub>m+2</sub>, T<sub>m+3</sub>] : `32,798` (32,762 + 36). | ||
* (T<sub>m+3</sub>, T<sub>m+4</sub>] : `32,824` (32,762 + 62). | ||
|
||
The backend system could tell that there was an integer overflow during | ||
(T<sub>n+1</sub>, T<sub>n+2</sub>], so it has chance to "correct" the data. | ||
reyang marked this conversation as resolved.
Show resolved
Hide resolved
|
||
|
||
As we can see in this example, even with the limitation of 16-bit integer, we | ||
can count the database transactions with high fidelity, without having to | ||
worry about information loss caused by integer overflows. | ||
|
||
It is important to understand that we are handling counter reset and integer | ||
overflow/underflow based on the assumption that we've picked the proper dynamic | ||
range and reporting frequency. Imagine if we use the same 16-bit signed integer | ||
to count the transactions in a data center (which could have thousands if not | ||
millions of transactions per second), we wouldn't be able to tell if `-32,738` | ||
was a result of `32,762 + 36` or `32,762 + 65,572` or even `32,762 + 131,108` if | ||
we report the data every 15 seconds. In this situation, either using a larger | ||
number (e.g. 32-bit integer) or increasing the reporting frequency (e.g. every | ||
microsecond, if we can afford the cost) would help. | ||
|
||
##### Float | ||
|
||
Let's take an example: an [IEEE-754 double precision floating | ||
point](https://en.wikipedia.org/wiki/Double-precision_floating-point_format) is | ||
used to count the number of positrons detected by an alpha magnetic | ||
spectrometer. Each time a positron is detected, the spectrometer will invoke | ||
`counter.Add(1)`, and the result is reported as cumulative sum every 1 second: | ||
|
||
* During (T<sub>0</sub>, T<sub>1</sub>], we reported `131,108`. | ||
* During (T<sub>1</sub>, T<sub>2</sub>], we reported `375,463`. | ||
* During (T<sub>2</sub>, T<sub>3</sub>], we reported `832,019`. | ||
* During (T<sub>3</sub>, T<sub>4</sub>], we reported `1,257,308`. | ||
* During (T<sub>4</sub>, T<sub>5</sub>], we reported `1,860,103`. | ||
* ... | ||
* During (T<sub>n</sub>, T<sub>n+1</sub>], we reported `9,007,199,254,325,789`. | ||
* During (T<sub>n+1</sub>, T<sub>n+2</sub>], we reported `9,007,199,254,740,992`. | ||
* During (T<sub>n+2</sub>, T<sub>n+3</sub>], we reported `9,007,199,254,740,992`. | ||
reyang marked this conversation as resolved.
Show resolved
Hide resolved
|
||
|
||
In the above case, the counter stopped increasing at some point between | ||
T<sub>n+1</sub> and T<sub>n+2</sub>, because the IEEE-754 double counter is | ||
"saturated", `9,007,199,254,740,992 + 1` will result in `9,007,199,254,740,992` | ||
so the number stopped growing. | ||
|
||
Note: in ECMAScript 6 the number `9,007,199,254,740,991` (`2 ^ 53 - 1`) is known | ||
as `Number.MAX_SAFE_INTEGER`, which is the maximum integer that can be exactly | ||
represented as an IEEE-754 double precision number, and whose IEEE-754 | ||
representation cannot be the result of rounding any other integer to fit the | ||
IEEE-754 representation. | ||
|
||
In addition to the "saturation" issue, we should also understand that IEEE-754 | ||
double supports [subnormal | ||
numbers](https://en.wikipedia.org/wiki/Subnormal_number). For example, | ||
`1.0E308 + 1.0E308` would result in `+Inf` (positive infinity). Certain metrics | ||
backend might have trouble handling subnormal numbers. | ||
|
||
### Monotonicity property | ||
|
||
In the OpenTelemetry Metrics [Data Model](./datamodel.md) and [API](./api.md) | ||
|
@@ -123,6 +259,8 @@ the total number of bytes received in a cumulative sum data stream: | |
|
||
The backend system could tell that there was integer overflow or system restart | ||
during (T<sub>n+1</sub>, T<sub>n+2</sub>], so it has chance to "fix" the data. | ||
Refer to [additive property](#additive-property) for more information about | ||
integer overflow. | ||
|
||
Let's take another example with a process using an [Asynchronous | ||
Counter](./api.md#asynchronous-counter) to report the total page faults of the | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When a user picks an Additive instrument, what agreement is being made between the instrumenter and the tools that will interpret and visualize the measurements?
I'm suspicious that the text below will lead instrumenters to interpret additivity with a low bar (2) but UI authors will mostly care about defaults so they will only care about Additivity if it gives them (1). Making a clear stand in the spec could help everyone get on the same page.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is no direct agreement between instrumenter and the backend, as there are several roles involved - instrumenter (folks who use the API), application owner (who configures the SDK and exporter), backend (time-series database + query engine) and UX (query client + UI).
I think the UI is using the type of the metrics stream (e.g. sum, gauge, histogram, summary, etc.) plus additional metadata (e.g. the name of the metrics stream if it's a well known thing as defined by the semantic convention, unit) to provide the experience.
The way I look at the supplementary guideline - it is trying to provide additional context which might help to give clarity, or help folks to think about things that they haven't thought about previously, or just calling out that something is hard/complex without providing a simple answer.