
Supplementary guidance for metrics additive property #2571

Merged · 18 commits · May 26, 2022
`specification/metrics/supplementary-guidelines.md` · 138 additions, 0 deletions
@@ -12,6 +12,9 @@ requirements to the existing specifications.
- [Guidelines for instrumentation library authors](#guidelines-for-instrumentation-library-authors)
* [Instrument selection](#instrument-selection)
* [Additive property](#additive-property)
    + [Numeric type selection](#numeric-type-selection)
      - [Integer](#integer)
      - [Float](#float)
* [Monotonicity property](#monotonicity-property)
* [Semantic convention](#semantic-convention)
- [Guidelines for SDK authors](#guidelines-for-sdk-authors)
@@ -79,6 +82,139 @@ Here is one way of choosing the correct instrument:

### Additive property

> **Review comment:** When a user picks an additive instrument, what agreement
> is being made between the instrumenter and the tools that will interpret and
> visualize the measurements?
>
> 1. Whenever these measurements are presented in summarized form, the sum over
>    all attributes is recommended as the most useful default summary value.
> 2. Whenever these measurements are presented in summarized form, the sum over
>    all attributes isn't necessarily the most useful or a good default, but it
>    would make conceptual sense to provide it as an option.
> 3. Something else?
>
> I suspect the text below will lead instrumenters to interpret additivity with
> the low bar of (2), while UI authors mostly care about defaults, so they will
> only care about additivity if it gives them (1). Taking a clear stand in the
> spec could help everyone get on the same page.
>
> **reyang (member, author):** There is no direct agreement between the
> instrumenter and the backend, as there are several roles involved: the
> instrumenter (folks who use the API), the application owner (who configures
> the SDK and exporter), the backend (time-series database plus query engine)
> and the UX (query client plus UI).
>
> I think the UI uses the type of the metrics stream (e.g. sum, gauge,
> histogram, summary) plus additional metadata (e.g. the unit, or the name of
> the metrics stream if it is a well-known thing defined by the semantic
> conventions) to provide the experience.
>
> The way I look at the supplementary guidelines: they try to provide
> additional context which might give clarity, help folks think about things
> they haven't thought about before, or simply call out that something is
> hard/complex without providing a simple answer.

In OpenTelemetry a [Measurement](./api.md#measurement) encapsulates a value and
a set of [`Attributes`](../common/README.md#attribute). Depending on the nature
of the measurements, they can be additive, non-additive or somewhere in the
middle. Here are some examples:

* The server temperature is non-additive. The temperatures in the table below
add up to `226.2`, but this value has no practical meaning.

| Host Name | Temperature (F) |
| --------- | --------------- |
| MachineA | 58.8 |
| MachineB | 86.1 |
| MachineC | 81.3 |

* The mass of planets is additive; the value `1.18e25` (`3.30e23 + 6.42e23 +
  4.87e24 + 5.97e24`) is the combined mass of the terrestrial planets in the
  solar system.

| Planet Name | Mass (kg) |
| ----------- | --------------- |
| Mercury | 3.30e23 |
| Mars | 6.42e23 |
| Venus | 4.87e24 |
| Earth | 5.97e24 |

* The voltage of battery cells can be added up if the batteries are connected in
series. However, if the batteries are connected in parallel, it makes no sense
to add up the voltage values anymore.
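As a minimal sketch (plain Python, no OpenTelemetry involved) of what additivity means in practice: summing across an attribute works mechanically for both tables above, but only for the additive measurement does the result mean anything.

```python
# The planet masses are additive: summing across the "planet" attribute
# yields a physically meaningful value (the combined terrestrial mass).
masses_kg = {
    "Mercury": 3.30e23,
    "Mars": 6.42e23,
    "Venus": 4.87e24,
    "Earth": 5.97e24,
}
combined_mass = sum(masses_kg.values())   # ~1.18e25 kg

# The temperatures are non-additive: the same operation runs just as
# happily, but the result (226.2) carries no practical meaning.
temperatures_f = {"MachineA": 58.8, "MachineB": 86.1, "MachineC": 81.3}
meaningless_sum = sum(temperatures_f.values())
```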

In OpenTelemetry, each [Instrument](./api.md#instrument) implies whether it is
additive or not. For example, [Counter](./api.md#counter) and
[UpDownCounter](./api.md#updowncounter) are additive while [Asynchronous
Gauge](./api.md#asynchronous-gauge) is non-additive.

#### Numeric type selection

For Instruments which take increments and/or decrements as the input (e.g.
[Counter](./api.md#counter) and [UpDownCounter](./api.md#updowncounter)), the
underlying numeric types (e.g., signed integer, unsigned integer, double) have
direct impact on the dynamic range, precision, and how the data is interpreted.

> **Review comment:** Can we codify how the implementer's choice of data type
> impacts what values can be stored for intermediate aggregates, on the wire,
> and in long-term persistent aggregates? We are told there is an impact, and
> the examples suggest what the impacts might be, but extrapolating from these
> examples still left me with ambiguity. For example, which parts of these
> examples are guarantees that all OpenTelemetry-compliant implementations
> would/would not overflow, and which parts are implementation-specific choices
> about how one particular SDK and backend might behave?
>
> **reyang (member, author):** There is no guarantee, as called out in the
> first line of this doc ("Note: this document is NOT a spec, it is provided to
> support the Metrics API and SDK specifications, it does NOT add any extra
> requirements to the existing specifications."). However, I do think it'll be
> helpful to give some suggestions (not requirements) for certain situations,
> e.g. JavaScript might not support int64, Python might treat integers as
> BigInt.

Typically, integers are precise but have a limited dynamic range, and might run
into overflow/underflow. The [IEEE-754 double-precision floating-point
format](https://wikipedia.org/wiki/Double-precision_floating-point_format) has a
wide dynamic range of numeric values at the cost of precision.

##### Integer

Let's take an example: a 16-bit signed integer is used to count the committed
transactions in a database, reported as a cumulative sum every 15 seconds:

* During (T<sub>0</sub>, T<sub>1</sub>], we reported `70`.
* During (T<sub>0</sub>, T<sub>2</sub>], we reported `115`.
* During (T<sub>0</sub>, T<sub>3</sub>], we reported `116`.
* During (T<sub>0</sub>, T<sub>4</sub>], we reported `128`.
* During (T<sub>0</sub>, T<sub>5</sub>], we reported `128`.
* During (T<sub>0</sub>, T<sub>6</sub>], we reported `173`.
* ...
* During (T<sub>0</sub>, T<sub>n+1</sub>], we reported `1,872`.
* During (T<sub>n+2</sub>, T<sub>n+3</sub>], we reported `35`.
* During (T<sub>n+2</sub>, T<sub>n+4</sub>], we reported `76`.

> **Review comment:** What does "reported" mean here? Originally I thought the
> context was the instrument API, so I expected it to mean application code
> reporting a measurement to the SDK. However, explicit notions of a time range
> only show up in the SDK, which makes me think this "report" is SDK-to-exporter
> or exporter-to-backend?
>
> **reyang (member, author):** Yep, "reported" here refers to the cumulative
> sum from the SDK output. In "count the committed transactions in a database,
> reported as cumulative sum every 15 seconds", "count" is the action from the
> API perspective (count transactions by calling `counter.Add(increment)`),
> while "reported" is the result from the SDK (after aggregation).


In the above case, a backend system could tell that there was likely a system
restart (because the start time changed from T<sub>0</sub> to
T<sub>n+2</sub>) during (T<sub>n+1</sub>, T<sub>n+2</sub>], so it has a chance
to adjust the data to:

* (T<sub>0</sub>, T<sub>n+3</sub>] : `1,907` (1,872 + 35).
* (T<sub>0</sub>, T<sub>n+4</sub>] : `1,948` (1,872 + 76).

Imagine we keep the database running:

* During (T<sub>0</sub>, T<sub>m+1</sub>], we reported `32,758`.
* During (T<sub>0</sub>, T<sub>m+2</sub>], we reported `32,762`.
* During (T<sub>0</sub>, T<sub>m+3</sub>], we reported `-32,738`.
* During (T<sub>0</sub>, T<sub>m+4</sub>], we reported `-32,712`.

In the above case, the backend system could tell that there was an integer
overflow during (T<sub>m+2</sub>, T<sub>m+3</sub>] (because the start time
remains the same as before, and the value becomes negative), so it has a chance
to adjust the data to:

* (T<sub>0</sub>, T<sub>m+3</sub>] : `32,798` (32,762 + 36).
* (T<sub>0</sub>, T<sub>m+4</sub>] : `32,824` (32,762 + 62).
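The two adjustments above can be sketched as a small, hypothetical piece of backend logic. The function name and the 16-bit assumption are illustrative only, not part of any OpenTelemetry SDK or backend:

```python
INT16_RANGE = 2 ** 16  # wrap-around range of a 16-bit signed integer

def adjusted_totals(points):
    """Reconstruct an adjusted cumulative series from (start_time, value) points.

    A changed start time is treated as a counter reset (restart); a value
    drop with an unchanged start time is treated as a single 16-bit
    overflow. Both assumptions only hold if the reporting interval is
    short enough relative to the counter's dynamic range.
    """
    total = 0
    prev_start = None
    prev_value = 0
    out = []
    for start, value in points:
        if prev_start is None or start != prev_start:
            delta = value                             # first point or restart
        elif value < prev_value:
            delta = value + INT16_RANGE - prev_value  # assume one overflow
        else:
            delta = value - prev_value
        total += delta
        out.append(total)
        prev_start, prev_value = start, value
    return out

# Restart case from the text: the start time changed, so the series begins anew.
assert adjusted_totals([("T0", 1872), ("Tn+2", 35), ("Tn+2", 76)]) == [1872, 1907, 1948]

# Overflow case from the text: same start time, the value went negative.
assert adjusted_totals([("T0", 32762), ("T0", -32738), ("T0", -32712)]) == [32762, 32798, 32824]
```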

As we can see in this example, even with the limitations of a 16-bit integer, we
can count the database transactions with high fidelity, without having to worry
about information loss caused by integer overflows.

It is important to understand that we are handling counter resets and integer
overflow/underflow based on the assumption that we've picked a proper dynamic
range and reporting frequency. Imagine if we used the same 16-bit signed integer
to count the transactions in a data center (which could have thousands if not
millions of transactions per second): reporting every 15 seconds, we wouldn't be
able to tell whether `-32,738` was the result of `32,762 + 36`, `32,762 +
65,572` or even `32,762 + 131,108`. In this situation, either using a larger
type (e.g. a 32-bit integer) or increasing the reporting frequency (e.g. every
microsecond, if we can afford the cost) would help.
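A quick way to see the ambiguity: with 16-bit wrap-around arithmetic, increments that differ by a multiple of `65,536` produce the same observed value. A plain-Python sketch (not SDK behavior):

```python
def wrap_int16(value):
    """Wrap an arbitrary integer into the 16-bit signed range [-32768, 32767]."""
    return (value + 32768) % 65536 - 32768

# Starting from 32,762, these very different increments are
# indistinguishable after the wrap: all of them observe -32,738.
for increment in (36, 65_572, 131_108):
    assert wrap_int16(32_762 + increment) == -32_738
```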

##### Float

Let's take an example: an [IEEE-754 double-precision floating
point](https://en.wikipedia.org/wiki/Double-precision_floating-point_format)
number is used to count the positrons detected by an alpha magnetic
spectrometer. Each time a positron is detected, the spectrometer invokes
`counter.Add(1)`, and the result is reported as a cumulative sum every second:

* During (T<sub>0</sub>, T<sub>1</sub>], we reported `131,108`.
* During (T<sub>0</sub>, T<sub>2</sub>], we reported `375,463`.
* During (T<sub>0</sub>, T<sub>3</sub>], we reported `832,019`.
* During (T<sub>0</sub>, T<sub>4</sub>], we reported `1,257,308`.
* During (T<sub>0</sub>, T<sub>5</sub>], we reported `1,860,103`.
* ...
* During (T<sub>0</sub>, T<sub>n+1</sub>], we reported `9,007,199,254,325,789`.
* During (T<sub>0</sub>, T<sub>n+2</sub>], we reported `9,007,199,254,740,992`.
* During (T<sub>0</sub>, T<sub>n+3</sub>], we reported `9,007,199,254,740,992`.

In the above case, the counter stopped increasing at some point between
T<sub>n+1</sub> and T<sub>n+2</sub>, because the IEEE-754 double counter is
"saturated": `9,007,199,254,740,992 + 1` results in `9,007,199,254,740,992`, so
the number stops growing.

Note: in ECMAScript 6 the number `9,007,199,254,740,991` (`2 ^ 53 - 1`) is known
as `Number.MAX_SAFE_INTEGER`, which is the maximum integer that can be exactly
represented as an IEEE-754 double precision number, and whose IEEE-754
representation cannot be the result of rounding any other integer to fit the
IEEE-754 representation.
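The saturation point can be demonstrated directly. Python floats are IEEE-754 doubles, so this short sketch exercises exactly the behavior described:

```python
# An IEEE-754 double has 53 significand bits, so every integer up to 2**53
# is exactly representable; beyond that, adding 1 is lost to rounding
# ("saturation" for a counter that only ever adds 1).
MAX_SAFE_INTEGER = 2 ** 53 - 1        # 9,007,199,254,740,991

counter = float(2 ** 53)              # 9,007,199,254,740,992.0
assert counter + 1 == counter         # the counter no longer grows
assert float(MAX_SAFE_INTEGER) + 1 == float(2 ** 53)  # the last exact +1 step
```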

In addition to the "saturation" issue, we should also understand that IEEE-754
doubles can overflow to infinity: for example, `1.0E308 + 1.0E308` results in
`+Inf` (positive infinity). IEEE-754 doubles also support [subnormal
numbers](https://en.wikipedia.org/wiki/Subnormal_number) very close to zero.
Certain metrics backends might have trouble handling special values such as
infinities and subnormal numbers.

### Monotonicity property

In the OpenTelemetry Metrics [Data Model](./datamodel.md) and [API](./api.md)
@@ -123,6 +259,8 @@ the total number of bytes received in a cumulative sum data stream:

The backend system could tell that there was an integer overflow or a system
restart during (T<sub>n+1</sub>, T<sub>n+2</sub>], so it has a chance to "fix"
the data.
Refer to [additive property](#additive-property) for more information about
integer overflow.

Let's take another example with a process using an [Asynchronous
Counter](./api.md#asynchronous-counter) to report the total page faults of the