Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Clarify MetricReader collect result #2295

Closed
legendecas opened this issue Jan 27, 2022 · 3 comments · Fixed by #2495
Closed

Clarify MetricReader collect result #2295

legendecas opened this issue Jan 27, 2022 · 3 comments · Fixed by #2495
Assignees
Labels
spec:metrics Related to the specification/metrics directory

Comments

@legendecas
Copy link
Member

What are you trying to achieve?

When implementing MetricReader.Collect, the spec is said:

Collect SHOULD provide a way to let the caller know whether it succeeded, failed or timed out.
https://github.com/open-telemetry/opentelemetry-specification/blob/main/specification/metrics/sdk.md#collect

In JavaScript implementation, the timeout is only meaningful on the async instrument callback context. Synchronous instruments collection can not be interrupted or fail (if the SDK is implemented correctly). That is said, async instruments may time out or fail, but synchronous instruments always can be collected without failure. So to JavaScript implementation, it can do its best to collect metric instruments and inform the MetricReader that some instruments collection failed, but here you can still get part of the result.

However, the spec has no words on if the implementations should achieve a greater availability on collection.

@jsuereth
Copy link
Contributor

It seems like this may be a good use of something akin to "stale" metric points for these async instruments that fail. The issue with that is you really don't know what attributes any particular callback would have produced.

My $.02 here is that we should focus on uses of the return status of MetricReader to resolve this, and look at back-end usages.

Regarding failed async reads and backend concerns:

  1. For Gauge, I think just not including the value should be ok for almost all backends.
  2. Many cumulative metric backends have mehcanisms to deal with missing metrics (e.g. "stale" points in prometheus). Just not including timed out metrics is fine here.
  3. For DELTA metric backends that might be creating DELTAs from cumulative counter callbacks, they may need to understand there's missing data. However, if start/stop timestamps are correct, we should be able to identify the broken metric stream (at least that's a primary goal of our data model).

Now, for in-process handling of this status code, what do we expect on partial collection failures?

  1. Send nothing
  2. Send what we have (with error count self-observability)
  3. Send what we have with no explicit signals (implicitly you can detect missing points and broken timeseries query-time in your backend).
  4. Do something special with dropped metric streams (e.g. report a stale point for all async callbacks that fail based on their previously known metric streams).

I think Option 3 is totally fine. Option 4 is too hard to get right given the API, but ideally we'd work our ways towards Option 2.

For JavaScript - What are various ways you've interacted with MetricReader.Collect return value?

@legendecas
Copy link
Member Author

legendecas commented Jan 31, 2022

  1. Send what we have (with error count self-observability)

I think Option 2 with mere count doesn't help in the condition too much. It didn't report which metric collection failed nor why it fails.

implicitly you can detect missing points and broken timeseries query-time in your backend.

+1, we can already find out that points are missing with current backends, I'd think diagnostic logging with the detailed errors can be more helpful.

For JavaScript - What are various ways you've interacted with MetricReader.Collect return value?

In JavaScript, as the operation collect is asynchronous, we can return with Promise<value> in the case. At the moment we are returning Promise<MetricData[]> where the MetricData is the collected metric points. We can also wrap the MetricData with a more informational structure, like:

interface MetricCollectionResult {
  failedCount: number;
  data: MetricData[];
}

It can reshape depending on what the spec is defined.

@dyladan
Copy link
Member

dyladan commented Mar 2, 2022

Ping @open-telemetry/specs-metrics-approvers

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
spec:metrics Related to the specification/metrics directory
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants