Clarify MetricReader collect result #2295

legendecas · 2022-01-27T07:05:35Z

What are you trying to achieve?

When implementing MetricReader.Collect, the spec is said:

Collect SHOULD provide a way to let the caller know whether it succeeded, failed or timed out.
https://github.com/open-telemetry/opentelemetry-specification/blob/main/specification/metrics/sdk.md#collect

In JavaScript implementation, the timeout is only meaningful on the async instrument callback context. Synchronous instruments collection can not be interrupted or fail (if the SDK is implemented correctly). That is said, async instruments may time out or fail, but synchronous instruments always can be collected without failure. So to JavaScript implementation, it can do its best to collect metric instruments and inform the MetricReader that some instruments collection failed, but here you can still get part of the result.

However, the spec has no words on if the implementations should achieve a greater availability on collection.

jsuereth · 2022-01-29T00:27:59Z

It seems like this may be a good use of something akin to "stale" metric points for these async instruments that fail. The issue with that is you really don't know what attributes any particular callback would have produced.

My $.02 here is that we should focus on uses of the return status of MetricReader to resolve this, and look at back-end usages.

Regarding failed async reads and backend concerns:

For Gauge, I think just not including the value should be ok for almost all backends.
Many cumulative metric backends have mehcanisms to deal with missing metrics (e.g. "stale" points in prometheus). Just not including timed out metrics is fine here.
For DELTA metric backends that might be creating DELTAs from cumulative counter callbacks, they may need to understand there's missing data. However, if start/stop timestamps are correct, we should be able to identify the broken metric stream (at least that's a primary goal of our data model).

Now, for in-process handling of this status code, what do we expect on partial collection failures?

Send nothing
Send what we have (with error count self-observability)
Send what we have with no explicit signals (implicitly you can detect missing points and broken timeseries query-time in your backend).
Do something special with dropped metric streams (e.g. report a stale point for all async callbacks that fail based on their previously known metric streams).

I think Option 3 is totally fine. Option 4 is too hard to get right given the API, but ideally we'd work our ways towards Option 2.

For JavaScript - What are various ways you've interacted with MetricReader.Collect return value?

legendecas · 2022-01-31T08:11:51Z

Send what we have (with error count self-observability)

I think Option 2 with mere count doesn't help in the condition too much. It didn't report which metric collection failed nor why it fails.

implicitly you can detect missing points and broken timeseries query-time in your backend.

+1, we can already find out that points are missing with current backends, I'd think diagnostic logging with the detailed errors can be more helpful.

For JavaScript - What are various ways you've interacted with MetricReader.Collect return value?

In JavaScript, as the operation collect is asynchronous, we can return with Promise<value> in the case. At the moment we are returning Promise<MetricData[]> where the MetricData is the collected metric points. We can also wrap the MetricData with a more informational structure, like:

interface MetricCollectionResult {
  failedCount: number;
  data: MetricData[];
}

It can reshape depending on what the spec is defined.

dyladan · 2022-03-02T21:30:58Z

Ping @open-telemetry/specs-metrics-approvers

legendecas added the spec:metrics Related to the specification/metrics directory label Jan 27, 2022

github-actions bot assigned yurishkuro Jan 27, 2022

legendecas mentioned this issue Jan 27, 2022

feat(sdk-metrics-base): async instruments callback timeout open-telemetry/opentelemetry-js#2742

Merged

6 tasks

legendecas mentioned this issue Mar 2, 2022

fix(sdk-metrics-base): combine concurrent observable callback invocations open-telemetry/opentelemetry-js#2785

Closed

6 tasks

legendecas mentioned this issue Apr 18, 2022

Clarify MetricReader.Collect result #2495

Merged

reyang closed this as completed in #2495 Apr 21, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Clarify MetricReader collect result #2295

Clarify MetricReader collect result #2295

legendecas commented Jan 27, 2022

jsuereth commented Jan 29, 2022

legendecas commented Jan 31, 2022 •

edited

Loading

dyladan commented Mar 2, 2022

Clarify MetricReader collect result #2295

Clarify MetricReader collect result #2295

Comments

legendecas commented Jan 27, 2022

jsuereth commented Jan 29, 2022

legendecas commented Jan 31, 2022 • edited Loading

dyladan commented Mar 2, 2022

legendecas commented Jan 31, 2022 •

edited

Loading