
Supplementary guidance for metrics additive property #2571

Merged · 18 commits · May 26, 2022
`specification/metrics/supplementary-guidelines.md` · 138 additions, 0 deletions
@@ -12,6 +12,9 @@ requirements to the existing specifications.
- [Guidelines for instrumentation library authors](#guidelines-for-instrumentation-library-authors)
* [Instrument selection](#instrument-selection)
* [Additive property](#additive-property)
    + [Numeric type selection](#numeric-type-selection)
      - [Integer](#integer)
      - [Float](#float)
* [Monotonicity property](#monotonicity-property)
* [Semantic convention](#semantic-convention)
- [Guidelines for SDK authors](#guidelines-for-sdk-authors)
@@ -79,6 +82,139 @@ Here is one way of choosing the correct instrument:

### Additive property

> **Review comment:** When a user picks an additive instrument, what agreement
> is being made between the instrumenter and the tools that will interpret and
> visualize the measurements?
>
> 1. Whenever these measurements are presented in summarized form, the sum over
>    all attributes is recommended as the most useful default summary value.
> 2. Whenever these measurements are presented in summarized form, the sum over
>    all attributes isn't necessarily the most useful or a good default, but it
>    would make conceptual sense to provide it as an option.
> 3. Something else?
>
> I suspect the text below will lead instrumenters to interpret additivity with
> the low bar of (2), while UI authors mostly care about defaults, so they will
> only care about additivity if it gives them (1). Taking a clear stand in the
> spec could help everyone get on the same page.
>
> **reyang (member, author):** There is no direct agreement between the
> instrumenter and the backend, as there are several roles involved: the
> instrumenter (folks who use the API), the application owner (who configures
> the SDK and exporter), the backend (time-series database plus query engine)
> and the UX (query client plus UI).
>
> I think the UI uses the type of the metrics stream (e.g. sum, gauge,
> histogram, summary) plus additional metadata (e.g. the unit, or the name of
> the metrics stream if it is a well-known thing defined by the semantic
> conventions) to provide the experience.
>
> The way I look at the supplementary guidelines: they try to provide
> additional context which might give clarity, help folks think about things
> they haven't thought about before, or simply call out that something is
> hard/complex without providing a simple answer.

In OpenTelemetry a [Measurement](./api.md#measurement) encapsulates a value and
a set of [`Attributes`](../common/README.md#attribute). Depending on the nature
of the measurements, they can be additive, non-additive or somewhere in the
middle. Here are some examples:

* The server temperature is non-additive. The temperatures in the table below
add up to `226.2`, but this value has no practical meaning.

| Host Name | Temperature (F) |
| --------- | --------------- |
| MachineA | 58.8 |
| MachineB | 86.1 |
| MachineC | 81.3 |

* The mass of planets is additive; the value `1.18e25` (`3.30e23 + 6.42e23 +
  4.87e24 + 5.97e24`) is the combined mass of the terrestrial planets in the
  solar system.

| Planet Name | Mass (kg) |
| ----------- | --------------- |
| Mercury | 3.30e23 |
| Mars | 6.42e23 |
| Venus | 4.87e24 |
| Earth | 5.97e24 |

* The voltage of battery cells can be added up if the batteries are connected in
series. However, if the batteries are connected in parallel, it makes no sense
to add up the voltage values anymore.
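As a minimal sketch (plain Python, no OpenTelemetry involved) of what additivity means in practice: summing across an attribute works mechanically for both tables above, but only for the additive measurement does the result mean anything.

```python
# The planet masses are additive: summing across the "planet" attribute
# yields a physically meaningful value (the combined terrestrial mass).
masses_kg = {
    "Mercury": 3.30e23,
    "Mars": 6.42e23,
    "Venus": 4.87e24,
    "Earth": 5.97e24,
}
combined_mass = sum(masses_kg.values())   # ~1.18e25 kg

# The temperatures are non-additive: the same operation runs just as
# happily, but the result (226.2) carries no practical meaning.
temperatures_f = {"MachineA": 58.8, "MachineB": 86.1, "MachineC": 81.3}
meaningless_sum = sum(temperatures_f.values())
```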

In OpenTelemetry, each [Instrument](./api.md#instrument) implies whether it is
additive or not. For example, [Counter](./api.md#counter) and
[UpDownCounter](./api.md#updowncounter) are additive while [Asynchronous
Gauge](./api.md#asynchronous-gauge) is non-additive.

#### Numeric type selection

For Instruments which take increments and/or decrements as the input (e.g.
[Counter](./api.md#counter) and [UpDownCounter](./api.md#updowncounter)), the
underlying numeric types (e.g., signed integer, unsigned integer, double) have
direct impact on the dynamic range, precision, and how the data is interpreted.

> **Review comment:** Can we codify how the implementer's choice of data type
> impacts what values can be stored for intermediate aggregates, on the wire,
> and in long-term persistent aggregates? We are told there is an impact, and
> the examples suggest what the impacts might be, but extrapolating from these
> examples still left me with ambiguity. For example, which parts of these
> examples are guarantees that all OpenTelemetry-compliant implementations
> would/would not overflow, and which parts are implementation-specific choices
> about how one particular SDK and backend might behave?
>
> **reyang (member, author):** There is no guarantee, as called out in the
> first line of this doc ("Note: this document is NOT a spec, it is provided to
> support the Metrics API and SDK specifications, it does NOT add any extra
> requirements to the existing specifications."). However, I do think it'll be
> helpful to give some suggestions (not requirements) for certain situations,
> e.g. JavaScript might not support int64, Python might treat integers as
> BigInt.

Typically, integers are precise but have a limited dynamic range, and might run
into overflow/underflow. The [IEEE-754 double-precision floating-point
format](https://wikipedia.org/wiki/Double-precision_floating-point_format) has a
wide dynamic range of numeric values at the cost of precision.

##### Integer

Let's take an example: a 16-bit signed integer is used to count the committed
transactions in a database, reported as a cumulative sum every 15 seconds:

* During (T<sub>0</sub>, T<sub>1</sub>], we reported `70`.
* During (T<sub>0</sub>, T<sub>2</sub>], we reported `115`.
* During (T<sub>0</sub>, T<sub>3</sub>], we reported `116`.
* During (T<sub>0</sub>, T<sub>4</sub>], we reported `128`.
* During (T<sub>0</sub>, T<sub>5</sub>], we reported `128`.
* During (T<sub>0</sub>, T<sub>6</sub>], we reported `173`.
* ...
* During (T<sub>0</sub>, T<sub>n+1</sub>], we reported `1,872`.
* During (T<sub>n+2</sub>, T<sub>n+3</sub>], we reported `35`.
* During (T<sub>n+2</sub>, T<sub>n+4</sub>], we reported `76`.

> **Review comment:** What does "reported" mean here? Originally I thought the
> context was the instrument API, so I expected it to mean application code
> reporting a measurement to the SDK. However, explicit notions of a time range
> only show up in the SDK, which makes me think this "report" is SDK-to-exporter
> or exporter-to-backend?
>
> **reyang (member, author):** Yep, "reported" here refers to the cumulative
> sum from the SDK output. In "count the committed transactions in a database,
> reported as cumulative sum every 15 seconds", "count" is the action from the
> API perspective (count transactions by calling `counter.Add(increment)`),
> while "reported" is the result from the SDK (after aggregation).


In the above case, a backend system could tell that there was likely a system
restart (because the start time changed from T<sub>0</sub> to
T<sub>n+2</sub>) during (T<sub>n+1</sub>, T<sub>n+2</sub>], so it has a chance
to adjust the data to:

* (T<sub>0</sub>, T<sub>n+3</sub>] : `1,907` (1,872 + 35).
* (T<sub>0</sub>, T<sub>n+4</sub>] : `1,948` (1,872 + 76).

Imagine we keep the database running:

* During (T<sub>0</sub>, T<sub>m+1</sub>], we reported `32,758`.
* During (T<sub>0</sub>, T<sub>m+2</sub>], we reported `32,762`.
* During (T<sub>0</sub>, T<sub>m+3</sub>], we reported `-32,738`.
* During (T<sub>0</sub>, T<sub>m+4</sub>], we reported `-32,712`.

In the above case, the backend system could tell that there was an integer
overflow during (T<sub>m+2</sub>, T<sub>m+3</sub>] (because the start time
remains the same as before, and the value becomes negative), so it has a chance
to adjust the data to:

* (T<sub>0</sub>, T<sub>m+3</sub>] : `32,798` (32,762 + 36).
* (T<sub>0</sub>, T<sub>m+4</sub>] : `32,824` (32,762 + 62).
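The two adjustments above can be sketched as a small, hypothetical piece of backend logic. The function name and the 16-bit assumption are illustrative only, not part of any OpenTelemetry SDK or backend:

```python
INT16_RANGE = 2 ** 16  # wrap-around range of a 16-bit signed integer

def adjusted_totals(points):
    """Reconstruct an adjusted cumulative series from (start_time, value) points.

    A changed start time is treated as a counter reset (restart); a value
    drop with an unchanged start time is treated as a single 16-bit
    overflow. Both assumptions only hold if the reporting interval is
    short enough relative to the counter's dynamic range.
    """
    total = 0
    prev_start = None
    prev_value = 0
    out = []
    for start, value in points:
        if prev_start is None or start != prev_start:
            delta = value                             # first point or restart
        elif value < prev_value:
            delta = value + INT16_RANGE - prev_value  # assume one overflow
        else:
            delta = value - prev_value
        total += delta
        out.append(total)
        prev_start, prev_value = start, value
    return out

# Restart case from the text: the start time changed, so the series begins anew.
assert adjusted_totals([("T0", 1872), ("Tn+2", 35), ("Tn+2", 76)]) == [1872, 1907, 1948]

# Overflow case from the text: same start time, the value went negative.
assert adjusted_totals([("T0", 32762), ("T0", -32738), ("T0", -32712)]) == [32762, 32798, 32824]
```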

As we can see in this example, even with the limitations of a 16-bit integer, we
can count the database transactions with high fidelity, without having to worry
about information loss caused by integer overflows.

It is important to understand that we are handling counter resets and integer
overflow/underflow based on the assumption that we've picked a proper dynamic
range and reporting frequency. Imagine if we used the same 16-bit signed integer
to count the transactions in a data center (which could have thousands if not
millions of transactions per second): reporting every 15 seconds, we wouldn't be
able to tell whether `-32,738` was the result of `32,762 + 36`, `32,762 +
65,572` or even `32,762 + 131,108`. In this situation, either using a larger
type (e.g. a 32-bit integer) or increasing the reporting frequency (e.g. every
microsecond, if we can afford the cost) would help.
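A quick way to see the ambiguity: with 16-bit wrap-around arithmetic, increments that differ by a multiple of `65,536` produce the same observed value. A plain-Python sketch (not SDK behavior):

```python
def wrap_int16(value):
    """Wrap an arbitrary integer into the 16-bit signed range [-32768, 32767]."""
    return (value + 32768) % 65536 - 32768

# Starting from 32,762, these very different increments are
# indistinguishable after the wrap: all of them observe -32,738.
for increment in (36, 65_572, 131_108):
    assert wrap_int16(32_762 + increment) == -32_738
```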

##### Float

Let's take an example: an [IEEE-754 double-precision floating
point](https://en.wikipedia.org/wiki/Double-precision_floating-point_format)
number is used to count the positrons detected by an alpha magnetic
spectrometer. Each time a positron is detected, the spectrometer invokes
`counter.Add(1)`, and the result is reported as a cumulative sum every second:

* During (T<sub>0</sub>, T<sub>1</sub>], we reported `131,108`.
* During (T<sub>0</sub>, T<sub>2</sub>], we reported `375,463`.
* During (T<sub>0</sub>, T<sub>3</sub>], we reported `832,019`.
* During (T<sub>0</sub>, T<sub>4</sub>], we reported `1,257,308`.
* During (T<sub>0</sub>, T<sub>5</sub>], we reported `1,860,103`.
* ...
* During (T<sub>0</sub>, T<sub>n+1</sub>], we reported `9,007,199,254,325,789`.
* During (T<sub>0</sub>, T<sub>n+2</sub>], we reported `9,007,199,254,740,992`.
* During (T<sub>0</sub>, T<sub>n+3</sub>], we reported `9,007,199,254,740,992`.

In the above case, the counter stopped increasing at some point between
T<sub>n+1</sub> and T<sub>n+2</sub>, because the IEEE-754 double counter is
"saturated": `9,007,199,254,740,992 + 1` results in `9,007,199,254,740,992`, so
the number stops growing.

Note: in ECMAScript 6 the number `9,007,199,254,740,991` (`2 ^ 53 - 1`) is known
as `Number.MAX_SAFE_INTEGER`, which is the maximum integer that can be exactly
represented as an IEEE-754 double precision number, and whose IEEE-754
representation cannot be the result of rounding any other integer to fit the
IEEE-754 representation.
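The saturation point can be demonstrated directly. Python floats are IEEE-754 doubles, so this short sketch exercises exactly the behavior described:

```python
# An IEEE-754 double has 53 significand bits, so every integer up to 2**53
# is exactly representable; beyond that, adding 1 is lost to rounding
# ("saturation" for a counter that only ever adds 1).
MAX_SAFE_INTEGER = 2 ** 53 - 1        # 9,007,199,254,740,991

counter = float(2 ** 53)              # 9,007,199,254,740,992.0
assert counter + 1 == counter         # the counter no longer grows
assert float(MAX_SAFE_INTEGER) + 1 == float(2 ** 53)  # the last exact +1 step
```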

In addition to the "saturation" issue, we should also understand that IEEE-754
doubles can overflow to infinity: for example, `1.0E308 + 1.0E308` results in
`+Inf` (positive infinity). IEEE-754 doubles also support [subnormal
numbers](https://en.wikipedia.org/wiki/Subnormal_number) very close to zero.
Certain metrics backends might have trouble handling special values such as
infinities and subnormal numbers.

### Monotonicity property

In the OpenTelemetry Metrics [Data Model](./datamodel.md) and [API](./api.md)
@@ -123,6 +259,8 @@ the total number of bytes received in a cumulative sum data stream:

The backend system could tell that there was an integer overflow or a system
restart during (T<sub>n+1</sub>, T<sub>n+2</sub>], so it has a chance to "fix"
the data.
Refer to [additive property](#additive-property) for more information about
integer overflow.

Let's take another example with a process using an [Asynchronous
Counter](./api.md#asynchronous-counter) to report the total page faults of the