Clarify metrics monotonicity #1995

Merged
10 changes: 5 additions & 5 deletions specification/metrics/api.md
@@ -700,8 +700,8 @@ operation is provided by the `callback`, which is registered during the
`UpDownCounter` is a [synchronous Instrument](#synchronous-instrument) which
supports increments and decrements.

-Note: if the value grows
-[monotonically](https://wikipedia.org/wiki/Monotonic_function), use
+Note: if the value is
+[monotonically](https://wikipedia.org/wiki/Monotonic_function) increasing, use
[Counter](#counter) instead.

Example uses for `UpDownCounter`:
@@ -844,8 +844,8 @@ process heap size - it makes sense to report the heap size from multiple
processes and sum them up, so we get the total heap usage_) when the instrument
is being observed.

-Note: if the value grows
-[monotonically](https://wikipedia.org/wiki/Monotonic_function), use
+Note: if the value is
+[monotonically](https://wikipedia.org/wiki/Monotonic_function) increasing, use
[Asynchronous Counter](#asynchronous-counter) instead; if the value is
non-additive, use [Asynchronous Gauge](#asynchronous-gauge) instead.

@@ -886,7 +886,7 @@ The `callback` function is responsible for reporting the
observed. [OpenTelemetry API](../overview.md#api) authors SHOULD define whether
this callback function needs to be reentrant safe / thread safe or not.

-Note: Unlike [UpDownCounter.Add()](#add) which takes the increment/delta value,
+Note: Unlike [UpDownCounter.Add()](#add-1) which takes the increment/delta value,
the callback function reports the absolute value of the Asynchronous
UpDownCounter. To determine the reported rate the Asynchronous UpDownCounter is
changing, the difference between successive measurements is used.
66 changes: 66 additions & 0 deletions specification/metrics/supplementary-guidelines.md
@@ -9,6 +9,8 @@ Table of Contents:
* [Guidelines for instrumentation library
authors](#guidelines-for-instrumentation-library-authors)
* [Instrument selection](#instrument-selection)
* [Additive property](#additive-property)
* [Monotonicity property](#monotonicity-property)
* [Semantic convention](#semantic-convention)
* [Guidelines for SDK authors](#guidelines-for-sdk-authors)
* [Aggregation temporality](#aggregation-temporality)
@@ -62,6 +64,70 @@ Here is one way of choosing the correct instrument:
* If the value is NOT monotonically increasing - use an [Asynchronous
UpDownCounter](./api.md#asynchronous-updowncounter).

### Additive property

### Monotonicity property

In the OpenTelemetry Metrics [Data Model](./datamodel.md) and [API](./api.md)
specifications, the word `monotonic` has been used frequently.

Monotonicity is important because it

It is important to understand that different
[Instruments](#instrument-selection) handle monotonicity differently.

Let's take an example with a network driver using a [Counter](./api.md#counter)
to record the total number of bytes received:

* During the time range (T<sub>0</sub>, T<sub>1</sub>]:
  * no network packet has been received
* During the time range (T<sub>1</sub>, T<sub>2</sub>]:
  * received a packet with `30` bytes - `Counter.Add(30)`
  * received a packet with `200` bytes - `Counter.Add(200)`
  * received a packet with `50` bytes - `Counter.Add(50)`
* During the time range (T<sub>2</sub>, T<sub>3</sub>]:
  * received a packet with `100` bytes - `Counter.Add(100)`

You can see that the total increment during (T<sub>0</sub>, T<sub>1</sub>] is
`0`, the total increment during (T<sub>1</sub>, T<sub>2</sub>] is `280` (`30 +
200 + 50`), the total increment during (T<sub>2</sub>, T<sub>3</sub>] is `100`,
and the total increment during (T<sub>0</sub>, T<sub>3</sub>] is `380` (`0 +
280 + 100`). All the increments are non-negative, in other words, the **sum is
monotonically increasing**.

Note that it is inaccurate to say "the total bytes received by T<sub>3</sub> is
`380`", because there might be network packets received by the driver before we
started to observe it (e.g. before the last operating system reboot). The
accurate way is to say "the total bytes received during (T<sub>0</sub>,
T<sub>3</sub>] is `380`". In a nutshell, the count represents a **rate** which
is associated with a time range.
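
The Counter example above can be sketched in plain Python. This is only an illustration and not the OpenTelemetry API: `IntervalCounter` and its `add` and `collect` methods are hypothetical names introduced here, and collecting once per time range is an assumption made to mirror the example.

```python
class IntervalCounter:
    """Hypothetical stand-in for a Counter; not the OpenTelemetry API."""

    def __init__(self):
        self._pending = 0  # increments recorded since the last collect()

    def add(self, value):
        # A Counter only accepts non-negative increments.
        if value < 0:
            raise ValueError("Counter increments must be non-negative")
        self._pending += value

    def collect(self):
        # Return the total increment for the interval that just ended, then reset.
        delta, self._pending = self._pending, 0
        return delta


counter = IntervalCounter()
deltas = []

# (T0, T1]: no network packet received
deltas.append(counter.collect())

# (T1, T2]: three packets received
for size in (30, 200, 50):
    counter.add(size)
deltas.append(counter.collect())

# (T2, T3]: one packet received
counter.add(100)
deltas.append(counter.collect())

# Running sum of the increments over (T0, T3]
running = []
total = 0
for delta in deltas:
    total += delta
    running.append(total)

print(deltas)   # [0, 280, 100]
print(running)  # [0, 280, 380]
```

Because every per-interval increment is non-negative, the running sum never decreases, which is what "the sum is monotonically increasing" means here.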

Let's take another example with a process using an [Asynchronous
Counter](./api.md#asynchronous-counter) to report the total page faults of the
process:

The page faults are managed by the operating system, and the process could
retrieve the number of page faults via some system APIs.

* At T<sub>0</sub>:
  * the process started
  * the process didn't ask the operating system to report the page faults
* At T<sub>1</sub>:
  * the operating system reported `1000` page faults for the process
* At T<sub>2</sub>:
  * the process didn't ask the operating system to report the page faults
* At T<sub>3</sub>:
  * the operating system reported `1050` page faults for the process
* At T<sub>4</sub>:
  * the operating system reported `1200` page faults for the process

You can see that the number being reported is the absolute value rather than
increments, and the value is monotonically increasing.

If we need to calculate "how many page faults have been introduced during
(T<sub>3</sub>, T<sub>4</sub>]", we need to apply subtraction `1200 - 1050 =
150`.
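
The subtraction above can be generalized to every pair of successive observations. The following is a minimal sketch in plain Python rather than the OpenTelemetry API; the `observations` list and the `successive_deltas` helper are hypothetical names used only for illustration.

```python
# (time label, absolute value reported); T0 and T2 are absent because
# nothing was reported at those times in the example.
observations = [("T1", 1000), ("T3", 1050), ("T4", 1200)]


def successive_deltas(points):
    """Return ((start, end), delta) for each pair of adjacent observations."""
    return [
        ((t_prev, t_cur), v_cur - v_prev)
        for (t_prev, v_prev), (t_cur, v_cur) in zip(points, points[1:])
    ]


deltas = successive_deltas(observations)

# The reported value is monotonically increasing if no delta is negative.
is_monotonic = all(delta >= 0 for _, delta in deltas)

print(deltas)        # [(('T1', 'T3'), 50), (('T3', 'T4'), 150)]
print(is_monotonic)  # True
```

Note that the first delta covers (T<sub>1</sub>, T<sub>3</sub>] because no value was reported at T<sub>2</sub>.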

### Semantic convention

Once you decided [which instrument(s) to be used](#instrument-selection), you