Attribute sets not observed during async callbacks are not exported #3242

Merged 4 commits on Aug 10, 2023. Changes from all commits:
CHANGELOG.md (2 additions & 0 deletions)

@@ -17,6 +17,8 @@ release.
  ([#3648](https://github.com/open-telemetry/opentelemetry-specification/pull/3648))
- MetricReader.Collect ignores Resource from MetricProducer.Produce.
  ([#3636](https://github.com/open-telemetry/opentelemetry-specification/pull/3636))
+- Attribute sets not observed during async callbacks are not exported.
+  ([#3242](https://github.com/open-telemetry/opentelemetry-specification/pull/3242))

### Logs

specification/metrics/sdk.md (4 additions & 0 deletions)

@@ -701,6 +701,10 @@ execution.
The implementation MUST complete the execution of all callbacks for a
given instrument before starting a subsequent round of collection.

+The implementation SHOULD NOT produce aggregated metric data for a
+previously-observed attribute set which is not observed during a successful
+callback.
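
For illustration only, here is a minimal, hypothetical sketch of this rule (not part of the normative diff): the SDK may still remember prior values, but a data point is produced only for attribute sets observed during the current successful callback round.

```python
from typing import Callable, Dict, FrozenSet, Tuple

AttrSet = FrozenSet[Tuple[str, int]]

# Hypothetical cumulative state kept across collection rounds, e.g.
# {frozenset({("pid", 1001)}): 60, frozenset({("pid", 1002)}): 47}.
last_observed: Dict[AttrSet, int] = {}

def collect(run_callbacks: Callable[[], Dict[AttrSet, int]]) -> Dict[AttrSet, int]:
    observed = run_callbacks()  # this round, e.g. {frozenset({("pid", 1002)}): 53}
    last_observed.update(observed)
    # Only attribute sets observed in this round become data points; a set
    # observed in an earlier round (here {pid: 1001}) yields no output even
    # though its last value is still remembered.
    return dict(observed)
```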

### Cardinality limits

**Status**: [Experimental](../document-status.md)
specification/metrics/supplementary-guidelines.md (76 additions & 58 deletions)

@@ -431,25 +431,30 @@ Instrument](./api.md#histogram). What if we collect measurements from an
[Asynchronous Counter](./api.md#asynchronous-counter)?

The following example shows the number of [page
-faults](https://en.wikipedia.org/wiki/Page_fault) of each thread since the
-thread ever started:
+faults](https://en.wikipedia.org/wiki/Page_fault) of each process since
+it started:

* During the time range (T<sub>0</sub>, T<sub>1</sub>]:
-  * pid = `1001`, tid = `1`, #PF = `50`
-  * pid = `1001`, tid = `2`, #PF = `30`
+  * pid = `1001`, #PF = `50`
+  * pid = `1002`, #PF = `30`
* During the time range (T<sub>1</sub>, T<sub>2</sub>]:
-  * pid = `1001`, tid = `1`, #PF = `53`
-  * pid = `1001`, tid = `2`, #PF = `38`
+  * pid = `1001`, #PF = `53`
+  * pid = `1002`, #PF = `38`
* During the time range (T<sub>2</sub>, T<sub>3</sub>]
-  * pid = `1001`, tid = `1`, #PF = `56`
-  * pid = `1001`, tid = `2`, #PF = `42`
+  * pid = `1001`, #PF = `56`
+  * pid = `1002`, #PF = `42`
* During the time range (T<sub>3</sub>, T<sub>4</sub>]:
-  * pid = `1001`, tid = `1`, #PF = `60`
-  * pid = `1001`, tid = `2`, #PF = `47`
+  * pid = `1001`, #PF = `60`
+  * pid = `1002`, #PF = `47`
* During the time range (T<sub>4</sub>, T<sub>5</sub>]:
-  * thread 1 died, thread 3 started
-  * pid = `1001`, tid = `2`, #PF = `53`
-  * pid = `1001`, tid = `3`, #PF = `5`
+  * process 1001 died, process 1003 started
+  * pid = `1002`, #PF = `53`
+  * pid = `1003`, #PF = `5`
+* During the time range (T<sub>5</sub>, T<sub>6</sub>]:
+  * A new process 1001 started
+  * pid = `1001`, #PF = `10`
+  * pid = `1002`, #PF = `57`
+  * pid = `1003`, #PF = `8`
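
In the OpenTelemetry Python API, an instrument reporting these values could look roughly like the sketch below. `read_page_faults()` is a hypothetical stand-in for reading from the operating system, and the instrument name is illustrative:

```python
from opentelemetry.metrics import CallbackOptions, Observation, get_meter

def read_page_faults() -> dict:
    # Hypothetical stand-in: return {pid: page_fault_count} for the
    # currently running processes, e.g. {1001: 60, 1002: 47}; a real
    # implementation would read this from the operating system.
    return {1001: 60, 1002: 47}

def observe_page_faults(options: CallbackOptions):
    for pid, count in read_page_faults().items():
        # A process that has exited stops appearing here, so its attribute
        # set is simply not observed during subsequent callbacks.
        yield Observation(count, {"pid": pid})

meter = get_meter("supplementary-guidelines-example")
meter.create_observable_counter(
    "process.paging.faults",  # illustrative name
    callbacks=[observe_page_faults],
)
```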

Note that in the following examples, Cumulative aggregation
temporality is discussed before Delta aggregation temporality because
@@ -461,47 +466,56 @@ API with specified Cumulative aggregation temporality.
If we export the metrics using **Cumulative Temporality**:

* (T<sub>0</sub>, T<sub>1</sub>]
-  * attributes: {pid = `1001`, tid = `1`}, sum: `50`
-  * attributes: {pid = `1001`, tid = `2`}, sum: `30`
+  * attributes: {pid = `1001`}, sum: `50`
+  * attributes: {pid = `1002`}, sum: `30`
* (T<sub>0</sub>, T<sub>2</sub>]
-  * attributes: {pid = `1001`, tid = `1`}, sum: `53`
-  * attributes: {pid = `1001`, tid = `2`}, sum: `38`
+  * attributes: {pid = `1001`}, sum: `53`
+  * attributes: {pid = `1002`}, sum: `38`
* (T<sub>0</sub>, T<sub>3</sub>]
-  * attributes: {pid = `1001`, tid = `1`}, sum: `56`
-  * attributes: {pid = `1001`, tid = `2`}, sum: `42`
+  * attributes: {pid = `1001`}, sum: `56`
+  * attributes: {pid = `1002`}, sum: `42`
* (T<sub>0</sub>, T<sub>4</sub>]
-  * attributes: {pid = `1001`, tid = `1`}, sum: `60`
-  * attributes: {pid = `1001`, tid = `2`}, sum: `47`
+  * attributes: {pid = `1001`}, sum: `60`
+  * attributes: {pid = `1002`}, sum: `47`
* (T<sub>0</sub>, T<sub>5</sub>]
-  * attributes: {pid = `1001`, tid = `2`}, sum: `53`
-  * attributes: {pid = `1001`, tid = `3`}, sum: `5`
+  * attributes: {pid = `1002`}, sum: `53`
+* (T<sub>4</sub>, T<sub>5</sub>]
+  * attributes: {pid = `1003`}, sum: `5`
+* (T<sub>5</sub>, T<sub>6</sub>]
+  * attributes: {pid = `1001`}, sum: `10`
+* (T<sub>0</sub>, T<sub>6</sub>]
+  * attributes: {pid = `1002`}, sum: `57`
+* (T<sub>4</sub>, T<sub>6</sub>]
+  * attributes: {pid = `1003`}, sum: `8`

The behavior in the first four periods is quite straightforward: we
just take the data being reported from the asynchronous instruments
and send it.

-The data model prescribes several valid behaviors at T<sub>5</sub> in
-this case, where one stream dies and another starts. The [Resets and
-Gaps](./data-model.md#resets-and-gaps) section describes how start
-timestamps and staleness markers can be used to increase the
+The data model prescribes several valid behaviors at T<sub>5</sub> and
+T<sub>6</sub> in this case, where one stream dies and another starts.
+The [Resets and Gaps](./data-model.md#resets-and-gaps) section describes
+how start timestamps and staleness markers can be used to increase the
receiver's understanding of these events.
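
For instance, a receiver performing reset detection on a cumulative sum could proceed roughly as in this non-normative sketch:

```python
def increase(prev_sum: int, curr_sum: int) -> int:
    # Non-normative sketch of receiver-side reset detection on one stream.
    if curr_sum < prev_sum:
        # A reset happened (e.g. the new process 1001 between T5 and T6);
        # the cumulative value is the increase since the unknown reset time.
        return curr_sum
    return curr_sum - prev_sum

# Example: pid 1001 reports sum 60 at T4 and sum 10 at T6 after the restart,
# so the detected increase is 10 rather than a negative difference.
```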

Consider whether the SDK maintains individual timestamps for the
individual stream, or just one per process. In this example, where a
-thread can die and start counting page faults from zero, the valid
-behaviors at T<sub>5</sub> are:
+process can die and restart, it starts counting page faults from zero.
+In this case, the valid behaviors at T<sub>5</sub> and T<sub>6</sub>
+are:

1. If all streams in the process share a start time, and the SDK is
   not required to remember all past streams: the thread restarts with
-   zero sum. Receivers with reset detection are able to calculate a
-   correct rate (except for frequent restarts relative to the
-   collection interval), however the precise time of a reset will be
-   unknown.
-2. If the SDK maintains per-stream start times, it signals to the
-   receiver precisely when a stream started, making the first
-   observation in a stream more useful for diagnostics. Receivers can
-   perform overlap detection or duplicate suppression and do not
-   require reset detection, in this case.
+   zero sum, and the start time of the process. Receivers with reset
+   detection are able to calculate a correct rate (except for frequent
+   restarts relative to the collection interval), however the precise
+   time of a reset will be unknown.
+2. If the SDK maintains per-stream start times, it provides the previous
+   callback time as the start time, as this time is before the occurrence
+   of any events which are measured during the subsequent callback. This
+   makes the first observation in a stream more useful for diagnostics,
+   as downstream consumers can perform overlap detection or duplicate
+   suppression and do not require reset detection in this case.
3. Independent of above treatments, the SDK can add a staleness marker
   to indicate the start of a gap in the stream when one thread dies
   by remembering which streams have previously reported but are not
@@ -519,20 +533,23 @@ data model.
If we export the metrics using **Delta Temporality**:

* (T<sub>0</sub>, T<sub>1</sub>]
-  * attributes: {pid = `1001`, tid = `1`}, delta: `50`
-  * attributes: {pid = `1001`, tid = `2`}, delta: `30`
+  * attributes: {pid = `1001`}, delta: `50`
+  * attributes: {pid = `1002`}, delta: `30`
* (T<sub>1</sub>, T<sub>2</sub>]
-  * attributes: {pid = `1001`, tid = `1`}, delta: `3`
-  * attributes: {pid = `1001`, tid = `2`}, delta: `8`
+  * attributes: {pid = `1001`}, delta: `3`
+  * attributes: {pid = `1002`}, delta: `8`
* (T<sub>2</sub>, T<sub>3</sub>]
-  * attributes: {pid = `1001`, tid = `1`}, delta: `3`
-  * attributes: {pid = `1001`, tid = `2`}, delta: `4`
+  * attributes: {pid = `1001`}, delta: `3`
+  * attributes: {pid = `1002`}, delta: `4`
* (T<sub>3</sub>, T<sub>4</sub>]
-  * attributes: {pid = `1001`, tid = `1`}, delta: `4`
-  * attributes: {pid = `1001`, tid = `2`}, delta: `5`
+  * attributes: {pid = `1001`}, delta: `4`
+  * attributes: {pid = `1002`}, delta: `5`
* (T<sub>4</sub>, T<sub>5</sub>]
-  * attributes: {pid = `1001`, tid = `2`}, delta: `6`
-  * attributes: {pid = `1001`, tid = `3`}, delta: `5`
+  * attributes: {pid = `1002`}, delta: `6`
+  * attributes: {pid = `1003`}, delta: `5`
+* (T<sub>5</sub>, T<sub>6</sub>]
+  * attributes: {pid = `1001`}, delta: `10`
+  * attributes: {pid = `1002`}, delta: `4`
+  * attributes: {pid = `1003`}, delta: `3`
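
The conversion behind these numbers can be sketched as follows (non-normative; a real SDK will differ in detail):

```python
last_value: dict = {}  # last cumulative value per attribute set encountered

def to_delta(attributes: dict, cumulative: int) -> int:
    key = frozenset(attributes.items())
    previous = last_value.get(key)
    last_value[key] = cumulative
    if previous is None or cumulative < previous:
        # First observation of this attribute set, or a reset such as the
        # new process 1001 at (T5, T6]: report the full cumulative value.
        return cumulative
    return cumulative - previous
```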

You can see that we are performing Cumulative->Delta conversion, and it requires
us to remember the last value of **every single permutation we've encountered so
@@ -560,27 +577,28 @@ So here are some suggestions that we encourage SDK implementers to consider:
##### Asynchronous example: attribute removal in a view

Suppose the metrics in the asynchronous example above are exported
-through a view configured to remove the `tid` attribute, leaving a
-single-dimensional count of page faults by `pid`. For each metric
-stream, two measurements are produced covering the same interval of
-time, which the SDK is expected to aggregate before producing the
-output.
+through a view configured to remove the `pid` attribute, leaving a
+count of page faults. For each metric stream, two measurements are produced
+covering the same interval of time, which the SDK is expected to aggregate
+before producing the output.

The data model specifies to use the "natural merge" function, in this
case meaning to add the current point values together because they
are `Sum` data points. The expected output is, still in **Cumulative
Temporality**:
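
A non-normative sketch of that merge, assuming `Sum` points keyed by attribute set and a view that drops `pid`:

```python
from collections import defaultdict

def merge_after_view(points: dict) -> dict:
    # points: {frozenset(attribute items): cumulative sum}, e.g. at (T0, T1]:
    # {frozenset({("pid", 1001)}): 50, frozenset({("pid", 1002)}): 30}.
    merged = defaultdict(int)
    for attrs, value in points.items():
        kept = frozenset((k, v) for k, v in attrs if k != "pid")
        merged[kept] += value  # Sum data points merge by addition
    return dict(merged)  # -> {frozenset(): 80}
```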

* (T<sub>0</sub>, T<sub>1</sub>]
-  * dimensions: {pid = `1001`}, sum: `80`
+  * dimensions: {}, sum: `80`
* (T<sub>0</sub>, T<sub>2</sub>]
-  * dimensions: {pid = `1001`}, sum: `91`
+  * dimensions: {}, sum: `91`
* (T<sub>0</sub>, T<sub>3</sub>]
-  * dimensions: {pid = `1001`}, sum: `98`
+  * dimensions: {}, sum: `98`
* (T<sub>0</sub>, T<sub>4</sub>]
-  * dimensions: {pid = `1001`}, sum: `107`
+  * dimensions: {}, sum: `107`
* (T<sub>0</sub>, T<sub>5</sub>]
-  * dimensions: {pid = `1001`}, sum: `58`
+  * dimensions: {}, sum: `58`
+* (T<sub>0</sub>, T<sub>6</sub>]
+  * dimensions: {}, sum: `75`

As discussed in the asynchronous cumulative temporality example above,
there are various treatments available for detecting resets. Even if