docs: Add ADR dir and error handling ADR #2664

Open
wants to merge 8 commits into
base: main

scottgerring
Contributor

@scottgerring scottgerring commented Feb 14, 2025

Fixes #2571

Capturing decision record for error handling in repo with new docs.

Merge requirement checklist

  • CONTRIBUTING guidelines followed
  • Unit tests added/updated (if applicable)
  • Appropriate CHANGELOG.md files updated for non-trivial, user-facing changes
  • Changes in public API reviewed (if applicable)

@scottgerring scottgerring marked this pull request as ready for review February 14, 2025 09:59
@scottgerring scottgerring requested a review from a team as a code owner February 14, 2025 09:59

codecov bot commented Feb 14, 2025

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 79.4%. Comparing base (92303b6) to head (fc86870).

Additional details and impacted files
@@          Coverage Diff          @@
##            main   #2664   +/-   ##
=====================================
  Coverage   79.4%   79.4%           
=====================================
  Files        123     123           
  Lines      22770   22770           
=====================================
  Hits       18092   18092           
  Misses      4678    4678           


@scottgerring
Contributor Author

@cijothomas re-wrote a bunch to reflect discussion and current state!


### When to box custom errors

Note above that we do not box anything into `InternalFailure`. Our rule here is that if the caller cannot reasonably be expected to handle a particular error variant, we will use a simplified interface that returns only a descriptive string. In the concrete example we are using with the exporters, we have a [strong signal in the opentelemetry-specification](https://github.com/open-telemetry/opentelemetry-specification/blob/main/specification/logs/sdk.md#export) that indicates concretely that the error types are not actionable by the caller.

Member

Love this phrasing!

@scottgerring
Contributor Author

@cijothomas is this good to merge? It'd be great to have a concrete example in place so we can start to follow the pattern - for instance for the tracing interop Björn is working on

@@ -0,0 +1,5 @@
# Architectural Decision Records

This directory contains architectural decision records made for the opentelemetry-rust project. These allow us to consolidate discussion, options, and outcomes, around key architectural decisions. You can read more about ADRs [here](https://adr.github.io/).

Member

I'd avoid links to adr.github.io or similar sites. We can simply document the designs here without necessarily adhering to any particular ADR format.


Note above that we do not box anything into `InternalFailure`. Our rule here is that if the caller cannot reasonably be expected to handle a particular error variant, we will use a simplified interface that returns only a descriptive string. In the concrete example we are using with the exporters, we have a [strong signal in the opentelemetry-specification](https://github.com/open-telemetry/opentelemetry-specification/blob/main/specification/logs/sdk.md#export) that indicates concretely that the error types are not actionable by the caller.

If the caller may potentially recover from an error, we will follow [canonical's rust best practices](https://canonical.github.io/rust-best-practices/error-and-panic-discipline.html) and instead preserve the nested error.

Member

I'd suggest removing the link to the Canonical guide. It is not clear whether we'll always follow it.


## Considered Options

**Option 1: Continue as is**

Member

I still suggest listing only the decision here to avoid a long read. At the end of the doc, we can add a considered-alternatives section and move this there.


## Accepted Option

**Option 3**

Member

I suggest making it super easy for a future reader to find the design guidance here without having to scan through the rest of the doc, i.e. let's put the chosen design right here, with the reason it was chosen just below it.

Everything else can be moved to the bottom of the doc. https://github.com/open-telemetry/opentelemetry-go/blob/main/sdk/log/DESIGN.md#rejected-alternatives has something like this.

Our preference for error types is thus:

1. Consolidated error that covers all methods of a particular "trait type" (e.g., signal export) and method
1. Devolves into error type per method of a particular trait type (e.g., `SdkShutdownResult`, `SdkExportResult`) _if the error types need to diverge_

Member

The final outcome is - there is no separate Export vs Shutdown result.

pub trait LogExporter {
fn export(...) -> OTelSdkResult;
fn shutdown(...) -> OTelSdkResult;
fn force_flush(...) -> OTelSdkResult;

Member

nit - indentation

@scottgerring
Contributor Author

As discussed, @cijothomas, I've made this super prescriptive guidance now, and linked out to the ADR-format discussion for more detail.


### 4. Box custom errors where a savvy caller may be able to handle them, stringify them if not

Note above that we do not box anything into `InternalFailure`. Our rule here is that if the caller cannot reasonably be expected to handle a particular error variant, we will use a simplified interface that returns only a descriptive string. In the concrete example we are using with the exporters, we have a [strong signal in the opentelemetry-specification](https://github.com/open-telemetry/opentelemetry-specification/blob/main/specification/logs/sdk.md#export) that indicates that the error types _are not actionable_ by the caller.

Member

Suggested change
Note above that we do not box anything into `InternalFailure`. Our rule here is that if the caller cannot reasonably be expected to handle a particular error variant, we will use a simplified interface that returns only a descriptive string. In the concrete example we are using with the exporters, we have a [strong signal in the opentelemetry-specification](https://github.com/open-telemetry/opentelemetry-specification/blob/main/specification/logs/sdk.md#export) that indicates that the error types _are not actionable_ by the caller.
Note above that we do not box any `Error` into `InternalFailure`. Our rule here is that if the caller cannot reasonably be expected to handle a particular error variant, we will use a simplified interface that returns only a descriptive string. In the concrete example we are using with the exporters, we have a [strong signal in the opentelemetry-specification](https://github.com/open-telemetry/opentelemetry-specification/blob/main/specification/logs/sdk.md#export) that indicates that the error types _are not actionable_ by the caller.


Note above that we do not box anything into `InternalFailure`. Our rule here is that if the caller cannot reasonably be expected to handle a particular error variant, we will use a simplified interface that returns only a descriptive string. In the concrete example we are using with the exporters, we have a [strong signal in the opentelemetry-specification](https://github.com/open-telemetry/opentelemetry-specification/blob/main/specification/logs/sdk.md#export) that indicates that the error types _are not actionable_ by the caller.

If the caller may potentially recover from an error, we will follow the generally-accepted best practice (e.g., see [canonical's guide](https://canonical.github.io/rust-best-practices/error-and-panic-discipline.html)) and instead preserve the nested error:

Member

nit addition: Mention that no place in the repo today needs to preserve a nested error.
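
For readers following the thread, here is a minimal sketch of the "stringify vs. preserve" rule quoted above. It is not taken from the PR: `ExportError`, `ConfigError`, `send`, `export` and `parse_port` are hypothetical names, and the nested error is kept as a concrete `ParseIntError` rather than a boxed trait object purely to keep the example short.

```rust
use std::error::Error;
use std::fmt;
use std::io;
use std::num::ParseIntError;

/// Hypothetical exporter error: the caller cannot act on the underlying
/// cause, so only a descriptive string is kept.
#[derive(Debug)]
pub enum ExportError {
    InternalFailure(String),
}

impl fmt::Display for ExportError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            ExportError::InternalFailure(msg) => write!(f, "export failed: {msg}"),
        }
    }
}

impl Error for ExportError {}

/// Hypothetical configuration error: the caller may be able to recover,
/// so the nested error is preserved and exposed through `source()`.
#[derive(Debug)]
pub enum ConfigError {
    InvalidPort(ParseIntError),
}

impl fmt::Display for ConfigError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            ConfigError::InvalidPort(_) => write!(f, "invalid port"),
        }
    }
}

impl Error for ConfigError {
    fn source(&self) -> Option<&(dyn Error + 'static)> {
        match self {
            ConfigError::InvalidPort(inner) => Some(inner),
        }
    }
}

/// Stand-in for a transport call that can fail.
fn send(payload: &[u8]) -> Result<(), io::Error> {
    if payload.is_empty() {
        return Err(io::Error::new(io::ErrorKind::InvalidInput, "empty payload"));
    }
    Ok(())
}

/// Stringify: the transport error is flattened into a message.
pub fn export(payload: &[u8]) -> Result<(), ExportError> {
    send(payload).map_err(|e| ExportError::InternalFailure(e.to_string()))
}

/// Preserve: the parse error is kept as the nested source.
pub fn parse_port(raw: &str) -> Result<u16, ConfigError> {
    raw.parse::<u16>().map_err(ConfigError::InvalidPort)
}
```

A caller of `export` only ever sees a message, while a caller of `parse_port` can inspect `source()` and decide how to react.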

### 2. Consolidate error types within a trait where we can, let them diverge when we can't

We aim to consolidate error types where possible _without indicating a function may return more errors than it can actually return_.
Here's an example:

Member

Agree with the content, but the code example is not intuitive. May I suggest a pattern like:
"don't do this"
show bad code
"instead, do this"
show the correct one.

I think such a pattern makes it easier to follow.
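
One way the suggested "don't do this / instead, do this" layout could look is sketched below. This is an illustration only, not the example from the ADR: the `avoid` and `prefer` modules, `BoundedExporter`, and every variant name are assumptions. The divergent `ExportError` exists only to show what letting types diverge would mean; per the discussion in this PR, the exporter methods currently share a single result type.

```rust
/// Don't do this: one enum for every method, even though `shutdown`
/// and `force_flush` can never produce `PayloadTooLarge`.
#[allow(dead_code)]
mod avoid {
    #[derive(Debug, PartialEq)]
    pub enum ExporterError {
        AlreadyShutdown,
        Timeout,
        PayloadTooLarge, // only `export` could ever return this
    }

    pub trait Exporter {
        fn export(&mut self, batch: &[u8]) -> Result<(), ExporterError>;
        fn force_flush(&mut self) -> Result<(), ExporterError>;
        fn shutdown(&mut self) -> Result<(), ExporterError>;
    }
}

/// Instead, do this: share one type for the errors common to all methods,
/// and diverge only for the method that has an extra, actionable error.
#[allow(dead_code)]
mod prefer {
    #[derive(Debug, PartialEq)]
    pub enum SdkError {
        AlreadyShutdown,
        Timeout,
    }

    #[derive(Debug, PartialEq)]
    pub enum ExportError {
        Sdk(SdkError),
        PayloadTooLarge { limit: usize },
    }

    pub trait Exporter {
        fn export(&mut self, batch: &[u8]) -> Result<(), ExportError>;
        fn force_flush(&mut self) -> Result<(), SdkError>;
        fn shutdown(&mut self) -> Result<(), SdkError>;
    }
}

use prefer::{ExportError, Exporter, SdkError};

/// Hypothetical exporter that rejects batches above a size limit.
pub struct BoundedExporter {
    limit: usize,
    shut_down: bool,
}

impl BoundedExporter {
    pub fn new(limit: usize) -> Self {
        Self { limit, shut_down: false }
    }
}

impl Exporter for BoundedExporter {
    fn export(&mut self, batch: &[u8]) -> Result<(), ExportError> {
        if self.shut_down {
            return Err(ExportError::Sdk(SdkError::AlreadyShutdown));
        }
        if batch.len() > self.limit {
            return Err(ExportError::PayloadTooLarge { limit: self.limit });
        }
        Ok(())
    }

    fn force_flush(&mut self) -> Result<(), SdkError> {
        if self.shut_down {
            return Err(SdkError::AlreadyShutdown);
        }
        Ok(())
    }

    fn shutdown(&mut self) -> Result<(), SdkError> {
        if self.shut_down {
            return Err(SdkError::AlreadyShutdown);
        }
        self.shut_down = true;
        Ok(())
    }
}
```

With the `prefer` shape, the signature of `shutdown` tells the caller that a payload-size error cannot occur there, which is the property the quoted guidance asks for.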

If this were _not_ the case - if we needed to mark an extra error, for instance for `LogExporter`, that the caller could reasonably handle -
we would let the error types diverge at that point.

### 4. Box custom errors where a savvy caller may be able to handle them, stringify them if not

Member

Let's also mention that we use thiserror to manage error types. As indicated in its README, it won't appear in the public API, and we may remove it in the future (if there is a need) without breaking the public API.

Contributor

+1. Should we also consider declaring that the error message itself can change at any time, to leave us some room for improvement in the future?

If this were _not_ the case - if we needed to mark an extra error, for instance for `LogExporter`, that the caller could reasonably handle -
we would let the error types diverge at that point.

Member

Let's be clear that we cannot change the enum after 1.0 (unless we bump to 2.0): adding a new variant, removing one, or changing a type are all breaking changes.

Contributor

@TommyCpp TommyCpp Feb 28, 2025

> as it'll be a breaking change;

If the enum has `#[non_exhaustive]`, adding a variant won't be a breaking change. Users will still need to modify their code to handle the new variant, but at least it will still compile.

For error types, I think it's worth adding the `#[non_exhaustive]` attribute to force consumers to think about "future errors" when they handle them, and to make it possible for us to introduce new error variants.
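
A small sketch of what this could look like, assuming a hypothetical `SdkError` enum and an `is_retryable` helper (neither name comes from the PR). Note that `#[non_exhaustive]` is only enforced for downstream crates; inside the defining crate, as in this single-file example, the compiler does not require the wildcard arm.

```rust
use std::time::Duration;

/// Hypothetical error type; the variant names are assumptions made
/// for illustration.
#[non_exhaustive]
#[derive(Debug)]
pub enum SdkError {
    AlreadyShutdown,
    Timeout(Duration),
    InternalFailure(String),
}

/// Caller-side handling: react to the variants we know how to handle
/// and fall back conservatively for everything else.
pub fn is_retryable(err: &SdkError) -> bool {
    match err {
        SdkError::Timeout(_) => true,
        SdkError::AlreadyShutdown => false,
        // Downstream crates must write a wildcard arm when matching on a
        // `#[non_exhaustive]` enum, so variants added in a later release
        // land here instead of breaking the build.
        _ => false,
    }
}
```

The trade-off is the one described above: consumers still have to revisit their `match` to give a new variant dedicated handling, but their existing code keeps compiling.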

Member

@cijothomas cijothomas left a comment

LGTM. Great start to following this practice throughout.

Left some small comments. Would love to get one more pair of eyes on this!

### 1. No panics from SDK APIs
Failures during regular operation should not panic, instead returning errors to the caller where appropriate, _or_ logging an error if not appropriate.
Some of the OpenTelemetry SDK interfaces are dictated by the specification in such a way that they may not return errors.

Member

@lalitb lalitb Feb 28, 2025

It is possible that some of the dependencies can panic, which may be the right approach when it happens at SDK configuration time. Or do we expect the SDK APIs to catch all such panics and return them as errors or log them? Either is fine with me, but it would be good to document that.

Member

Handling startup failures is not covered here. It should be discussed and added separately.
I don't think we should panic. Returning a `Result` should be the behavior.

Do we have a situation where a dependency panics and we cannot prevent it?

Anyway, we can discuss this separately.

Contributor

+1 on not panicking in OpenTelemetry components. We cannot control panics in our dependencies, but let's make a note that the principle is no panics from OpenTelemetry unless there is a very good reason.
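
To make the "no panics" guidance concrete, here is a hedged sketch covering both cases from the quoted text: a method that can return an error, and one whose signature cannot. `Processor`, `SdkError`, `emit` and `force_flush` are hypothetical names chosen for this example, and `eprintln!` stands in for whatever internal logging the SDK would actually use.

```rust
use std::sync::Mutex;

/// Hypothetical error type for this sketch.
#[derive(Debug)]
pub enum SdkError {
    InternalFailure(String),
}

/// Hypothetical processor that buffers records behind a mutex.
pub struct Processor {
    queue: Mutex<Vec<String>>,
}

impl Processor {
    pub fn new() -> Self {
        Self { queue: Mutex::new(Vec::new()) }
    }

    /// This method can return an error, so a poisoned lock is reported to
    /// the caller instead of calling `unwrap()` and panicking.
    pub fn force_flush(&self) -> Result<usize, SdkError> {
        let mut queue = self
            .queue
            .lock()
            .map_err(|e| SdkError::InternalFailure(format!("queue lock poisoned: {e}")))?;
        let flushed = queue.len();
        queue.clear();
        Ok(flushed)
    }

    /// This signature cannot return an error (as with some interfaces
    /// dictated by the specification), so the failure is logged and the
    /// record is dropped rather than panicking.
    pub fn emit(&self, record: &str) {
        match self.queue.lock() {
            Ok(mut queue) => queue.push(record.to_owned()),
            Err(e) => eprintln!("dropping record, queue lock poisoned: {e}"),
        }
    }
}
```

Panics that originate inside dependencies are outside the scope of this sketch; it only shows the SDK's own code avoiding `unwrap()` on a fallible call.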

Member

@lalitb lalitb left a comment

LGTM with nit comments. Thanks for adding this ADR; it definitely brings clarity to our error handling design.


## Summary

This ADR describes the general pattern we will follow when modelling errors in public API interfaces - that is, APIs that are exposed to users of the project's published crates. . It summarises the discussion and final option from [#2571](https://github.com/open-telemetry/opentelemetry-rust/issues/2571); for more context check out that issue.

Contributor

Suggested change
This ADR describes the general pattern we will follow when modelling errors in public API interfaces - that is, APIs that are exposed to users of the project's published crates. . It summarises the discussion and final option from [#2571](https://github.com/open-telemetry/opentelemetry-rust/issues/2571); for more context check out that issue.
This ADR describes the general pattern we will follow when modelling errors in public API interfaces - that is, APIs that are exposed to users of the project's published crates. It summarises the discussion and final option from [#2571](https://github.com/open-telemetry/opentelemetry-rust/issues/2571); for more context check out that issue.


Successfully merging this pull request may close these issues.

Error handling ADR
4 participants