diff --git a/specification/metrics/supplementary-guidelines.md b/specification/metrics/supplementary-guidelines.md index 37f2750d584..4b7cf93b2db 100644 --- a/specification/metrics/supplementary-guidelines.md +++ b/specification/metrics/supplementary-guidelines.md @@ -12,6 +12,9 @@ requirements to the existing specifications. - [Guidelines for instrumentation library authors](#guidelines-for-instrumentation-library-authors) * [Instrument selection](#instrument-selection) * [Additive property](#additive-property) + + [Numeric type selection](#numeric-type-selection) + - [Integer](#integer) + - [Float](#float) * [Monotonicity property](#monotonicity-property) * [Semantic convention](#semantic-convention) - [Guidelines for SDK authors](#guidelines-for-sdk-authors) @@ -79,6 +82,139 @@ Here is one way of choosing the correct instrument: ### Additive property +In OpenTelemetry a [Measurement](./api.md#measurement) encapsulates a value and +a set of [`Attributes`](../common/README.md#attribute). Depending on the nature +of the measurements, they can be additive, non-additive or somewhere in the +middle. Here are some examples: + +* The server temperature is non-additive. The temperatures in the table below + add up to `226.2`, but this value has no practical meaning. + + | Host Name | Temperature (F) | + | --------- | --------------- | + | MachineA | 58.8 | + | MachineB | 86.1 | + | MachineC | 81.3 | + +* The mass of planets is additive, the value `1.18e25` (`3.30e23 + 6.42e23 + + 4.87e24 + 5.97e24`) means the combined mass of terrestrial planets in the + solar system. + + | Planet Name | Mass (kg) | + | ----------- | --------------- | + | Mercury | 3.30e23 | + | Mars | 6.42e23 | + | Venus | 4.87e24 | + | Earth | 5.97e24 | + +* The voltage of battery cells can be added up if the batteries are connected in + series. However, if the batteries are connected in parallel, it makes no sense + to add up the voltage values anymore. + +In OpenTelemetry, each [Instrument](./api.md#instrument) implies whether it is +additive or not. For example, [Counter](./api.md#counter) and +[UpDownCounter](./api.md#updowncounter) are additive while [Asynchronous +Gauge](./api.md#asynchronous-gauge) is non-additive. + +#### Numeric type selection + +For Instruments which take increments and/or decrements as the input (e.g. +[Counter](./api.md#counter) and [UpDownCounter](./api.md#updowncounter)), the +underlying numeric types (e.g., signed integer, unsigned integer, double) have +direct impact on the dynamic range, precision, and how the data is interpreted. +Typically, integers are precise but have limited dynamic range, and might see +overflow/underflow. [IEEE-754 double-precision floating-point +format](https://wikipedia.org/wiki/Double-precision_floating-point_format) has a +wide dynamic range of numeric values with the sacrifice on precision. + +##### Integer + +Let's take an example: a 16-bit signed integer is used to count the committed +transactions in a database, reported as cumulative sum every 15 seconds: + +* During (T0, T1], we reported `70`. +* During (T0, T2], we reported `115`. +* During (T0, T3], we reported `116`. +* During (T0, T4], we reported `128`. +* During (T0, T5], we reported `128`. +* During (T0, T6], we reported `173`. +* ... +* During (T0, Tn+1], we reported `1,872`. +* During (Tn+2, Tn+3], we reported `35`. +* During (Tn+2, Tn+4], we reported `76`. + +In the above case, a backend system could tell that there was likely a system +restart (because the start time has changed from T0 to +Tn+2) during (Tn+1, Tn+2], so it has chance to +adjust the data to: + +* (T0, Tn+3] : `1,907` (1,872 + 35). +* (T0, Tn+4] : `1,948` (1,872 + 76). + +Imagine we keep the database running: + +* During (T0, Tm+1], we reported `32,758`. +* During (T0, Tm+2], we reported `32,762`. +* During (T0, Tm+3], we reported `-32,738`. +* During (T0, Tm+4], we reported `-32,712`. + +In the above case, the backend system could tell that there was an integer +overflow during (Tm+2, Tm+3] (because the start time +remains the same as before, and the value becomes negative), so it has chance to +adjust the data to: + +* (T0, Tm+3] : `32,798` (32,762 + 36). +* (T0, Tm+4] : `32,824` (32,762 + 62). + +As we can see in this example, even with the limitation of 16-bit integer, we +can count the database transactions with high fidelity, without having to +worry about information loss caused by integer overflows. + +It is important to understand that we are handling counter reset and integer +overflow/underflow based on the assumption that we've picked the proper dynamic +range and reporting frequency. Imagine if we use the same 16-bit signed integer +to count the transactions in a data center (which could have thousands if not +millions of transactions per second), we wouldn't be able to tell if `-32,738` +was a result of `32,762 + 36` or `32,762 + 65,572` or even `32,762 + 131,108` if +we report the data every 15 seconds. In this situation, either using a larger +number (e.g. 32-bit integer) or increasing the reporting frequency (e.g. every +microsecond, if we can afford the cost) would help. + +##### Float + +Let's take an example: an [IEEE-754 double precision floating +point](https://en.wikipedia.org/wiki/Double-precision_floating-point_format) is +used to count the number of positrons detected by an alpha magnetic +spectrometer. Each time a positron is detected, the spectrometer will invoke +`counter.Add(1)`, and the result is reported as cumulative sum every 1 second: + +* During (T0, T1], we reported `131,108`. +* During (T0, T2], we reported `375,463`. +* During (T0, T3], we reported `832,019`. +* During (T0, T4], we reported `1,257,308`. +* During (T0, T5], we reported `1,860,103`. +* ... +* During (T0, Tn+1], we reported `9,007,199,254,325,789`. +* During (T0, Tn+2], we reported `9,007,199,254,740,992`. +* During (T0, Tn+3], we reported `9,007,199,254,740,992`. + +In the above case, the counter stopped increasing at some point between +Tn+1 and Tn+2, because the IEEE-754 double counter is +"saturated", `9,007,199,254,740,992 + 1` will result in `9,007,199,254,740,992` +so the number stopped growing. + +Note: in ECMAScript 6 the number `9,007,199,254,740,991` (`2 ^ 53 - 1`) is known +as `Number.MAX_SAFE_INTEGER`, which is the maximum integer that can be exactly +represented as an IEEE-754 double precision number, and whose IEEE-754 +representation cannot be the result of rounding any other integer to fit the +IEEE-754 representation. + +In addition to the "saturation" issue, we should also understand that IEEE-754 +double supports [subnormal +numbers](https://en.wikipedia.org/wiki/Subnormal_number). For example, +`1.0E308 + 1.0E308` would result in `+Inf` (positive infinity). Certain metrics +backend might have trouble handling subnormal numbers. + ### Monotonicity property In the OpenTelemetry Metrics [Data Model](./datamodel.md) and [API](./api.md) @@ -123,6 +259,8 @@ the total number of bytes received in a cumulative sum data stream: The backend system could tell that there was integer overflow or system restart during (Tn+1, Tn+2], so it has chance to "fix" the data. +Refer to [additive property](#additive-property) for more information about +integer overflow. Let's take another example with a process using an [Asynchronous Counter](./api.md#asynchronous-counter) to report the total page faults of the