docs: Add ADR dir and error handling ADR #2664
base: main
Codecov Report: All modified and coverable lines are covered by tests ✅

Additional details and impacted files:

@@ Coverage Diff @@
## main #2664 +/- ##
=====================================
Coverage 79.4% 79.4%
=====================================
Files 123 123
Lines 22770 22770
=====================================
Hits 18092 18092
Misses 4678 4678
@cijothomas re-wrote a bunch to reflect discussion and current state!
docs/adr/001_error_handling.md
Outdated
### When to box custom errors

Note above that we do not box anything into `InternalFailure`. Our rule here is that if the caller cannot reasonably be expected to handle a particular error variant, we will use a simplified interface that returns only a descriptive string. In the concrete example we are using with the exporters, we have a [strong signal in the opentelemetry-specification](https://github.com/open-telemetry/opentelemetry-specification/blob/main/specification/logs/sdk.md#export) that indicates concretely that the error types are not actionable by the caller.
Love this phrasing!
@cijothomas is this good to merge? It'd be great to have a concrete example in place so we can start to follow the pattern - for instance for the tracing interop Björn is working on.
docs/adr/README.md
Outdated
@@ -0,0 +1,5 @@
# Architectural Decision Records

This directory contains architectural decision records made for the opentelemetry-rust project. These allow us to consolidate discussion, options, and outcomes, around key architectural decisions. You can read more about ADRs [here](https://adr.github.io/).
I'd avoid links to adr.github.io or similar. We can simply document the designs here, without necessarily adhering to any particular version of it.
docs/adr/001_error_handling.md
Outdated
Note above that we do not box anything into `InternalFailure`. Our rule here is that if the caller cannot reasonably be expected to handle a particular error variant, we will use a simplified interface that returns only a descriptive string. In the concrete example we are using with the exporters, we have a [strong signal in the opentelemetry-specification](https://github.com/open-telemetry/opentelemetry-specification/blob/main/specification/logs/sdk.md#export) that indicates concretely that the error types are not actionable by the caller.

If the caller may potentially recover from an error, we will follow [canonical's rust best practices](https://canonical.github.io/rust-best-practices/error-and-panic-discipline.html) and instead preserve the nested error.
I'd suggest removing the link to the canonical guides. It is not clear whether we'll always follow them or not.
docs/adr/001_error_handling.md
Outdated
## Considered Options

**Option 1: Continue as is**
I still suggest to only list the decision here to avoid long reads. At the end of the doc, we can mention considered-alternatives, and move this there.
docs/adr/001_error_handling.md
Outdated
## Accepted Option

**Option 3**
I suggest making it super easy for a future reader to know what the design guidance is, without having to scan through the rest of the doc, i.e., let's put the chosen design right here, and why it was chosen just below it.
Everything else can be moved to the bottom of the doc. https://github.com/open-telemetry/opentelemetry-go/blob/main/sdk/log/DESIGN.md#rejected-alternatives has something like this.
docs/adr/001_error_handling.md
Outdated
Our preference for error types is thus:

1. Consolidated error that covers all methods of a particular "trait type" (e.g., signal export) and method
1. Devolves into error type per method of a particular trait type (e.g., `SdkShutdownResult`, `SdkExportResult`) _if the error types need to diverge_
The final outcome is - there is no separate Export vs Shutdown result.
docs/adr/001_error_handling.md
Outdated
pub trait LogExporter {
    fn export(...) -> OTelSdkResult;
    fn shutdown(...) -> OTelSdkResult;
    fn force_flush(...) -> OTelSdkResult;
nit - indentation
As discussed @cijothomas, I've made this super prescriptive guidance now, and linked out to the ADR-format discussion for more detail.
### 4. Box custom errors where a savvy caller may be able to handle them, stringify them if not

Note above that we do not box anything into `InternalFailure`. Our rule here is that if the caller cannot reasonably be expected to handle a particular error variant, we will use a simplified interface that returns only a descriptive string. In the concrete example we are using with the exporters, we have a [strong signal in the opentelemetry-specification](https://github.com/open-telemetry/opentelemetry-specification/blob/main/specification/logs/sdk.md#export) that indicates that the error types _are not actionable_ by the caller.
Suggested change, before:
Note above that we do not box anything into `InternalFailure`. Our rule here is that if the caller cannot reasonably be expected to handle a particular error variant, we will use a simplified interface that returns only a descriptive string. In the concrete example we are using with the exporters, we have a [strong signal in the opentelemetry-specification](https://github.com/open-telemetry/opentelemetry-specification/blob/main/specification/logs/sdk.md#export) that indicates that the error types _are not actionable_ by the caller.
Suggested change, after:
Note above that we do not box any `Error` into `InternalFailure`. Our rule here is that if the caller cannot reasonably be expected to handle a particular error variant, we will use a simplified interface that returns only a descriptive string. In the concrete example we are using with the exporters, we have a [strong signal in the opentelemetry-specification](https://github.com/open-telemetry/opentelemetry-specification/blob/main/specification/logs/sdk.md#export) that indicates that the error types _are not actionable_ by the caller.
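To make the "stringify if not actionable" rule concrete, here is a minimal sketch. All names (`ExampleSdkError`, `TransportError`, `send`, `export`) are hypothetical and chosen for illustration; only the `InternalFailure` variant name comes from the ADR text above. It shows a private transport error being flattened to a description at the public boundary rather than boxed.

```rust
use std::fmt;
use std::time::Duration;

/// Hypothetical consolidated error for exporter operations (illustrative names).
#[derive(Debug)]
pub enum ExampleSdkError {
    /// The component was already shut down.
    AlreadyShutdown,
    /// The operation did not finish within the given duration.
    Timeout(Duration),
    /// Anything the caller cannot act on: only a description is kept.
    InternalFailure(String),
}

impl fmt::Display for ExampleSdkError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            ExampleSdkError::AlreadyShutdown => write!(f, "already shut down"),
            ExampleSdkError::Timeout(d) => write!(f, "timed out after {:?}", d),
            ExampleSdkError::InternalFailure(msg) => write!(f, "internal failure: {}", msg),
        }
    }
}

impl std::error::Error for ExampleSdkError {}

pub type ExampleSdkResult = Result<(), ExampleSdkError>;

/// Hypothetical transport-level error that stays private to the exporter.
#[derive(Debug)]
struct TransportError {
    status: u16,
}

impl fmt::Display for TransportError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "transport failed with status {}", self.status)
    }
}

/// Stand-in for a network call; fails when `fail` is true.
fn send(fail: bool) -> Result<(), TransportError> {
    if fail {
        Err(TransportError { status: 503 })
    } else {
        Ok(())
    }
}

/// The private error is stringified at the boundary instead of being boxed.
pub fn export(fail: bool) -> ExampleSdkResult {
    send(fail).map_err(|e| ExampleSdkError::InternalFailure(e.to_string()))
}
```

With this shape the caller can still match on `AlreadyShutdown` or `Timeout`, while everything else arrives as a message meant for logs rather than for control flow.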
Note above that we do not box anything into `InternalFailure`. Our rule here is that if the caller cannot reasonably be expected to handle a particular error variant, we will use a simplified interface that returns only a descriptive string. In the concrete example we are using with the exporters, we have a [strong signal in the opentelemetry-specification](https://github.com/open-telemetry/opentelemetry-specification/blob/main/specification/logs/sdk.md#export) that indicates that the error types _are not actionable_ by the caller.

If the caller may potentially recover from an error, we will follow the generally-accepted best practice (e.g., see [canonical's guide](https://canonical.github.io/rust-best-practices/error-and-panic-discipline.html)) and instead preserve the nested error:
nit addition: Mention that there is no place in the repo today that needs to preserve the nested error.
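Since nothing in the repo needs this today, a purely hypothetical sketch of what "preserve the nested error" could look like may still help future readers. `LoadError`, `load_config`, and `load_or_default` are invented names, and the example uses a typed nested `std::io::Error` rather than a boxed trait object; the point is only that the caller can inspect the inner error and recover.

```rust
use std::error::Error;
use std::fmt;
use std::io;

/// Hypothetical error where the caller can reasonably recover,
/// so the underlying error is preserved instead of stringified.
#[derive(Debug)]
pub enum LoadError {
    ReadFailed(io::Error),
}

impl fmt::Display for LoadError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            LoadError::ReadFailed(e) => write!(f, "failed to read config: {}", e),
        }
    }
}

impl Error for LoadError {
    /// Expose the nested error so callers can walk the error chain.
    fn source(&self) -> Option<&(dyn Error + 'static)> {
        match self {
            LoadError::ReadFailed(e) => Some(e),
        }
    }
}

/// Reads a file and keeps the `io::Error` on failure.
pub fn load_config(path: &str) -> Result<String, LoadError> {
    std::fs::read_to_string(path).map_err(LoadError::ReadFailed)
}

/// A caller that recovers from one specific nested error (missing file)
/// and propagates everything else unchanged.
pub fn load_or_default(path: &str) -> Result<String, LoadError> {
    match load_config(path) {
        Err(LoadError::ReadFailed(e)) if e.kind() == io::ErrorKind::NotFound => Ok(String::new()),
        other => other,
    }
}
```

The recovery in `load_or_default` is only possible because the `io::Error` was kept; with a stringified error the caller would have to parse the message.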
### 2. Consolidate error types within a trait where we can, let them diverge when we can't

We aim to consolidate error types where possible _without indicating a function may return more errors than it can actually return_.
Here's an example:
Agree with the content, but code example is not intuitive. May I suggest a pattern like
"don't do this"
show bad code
"instead, do this"
show correct one.
I think such a pattern makes it easier to follow.
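A rough sketch of that "don't do this / instead, do this" layout, assuming hypothetical names throughout (`CatchAllError`, `ExportError`, `BuildError`, `Exporter`, `InMemoryExporter`); it is not the code from the ADR, just one way the contrast could read.

```rust
/// Don't do this: one catch-all enum shared by every API, which suggests
/// that `export` could fail with `InvalidEndpoint` even though it never can.
#[derive(Debug)]
pub enum CatchAllError {
    AlreadyShutdown,
    /// Only ever produced while building an exporter.
    InvalidEndpoint(String),
}

/// Instead, do this: one error type for the trait's methods, which all
/// share the same failure modes...
#[derive(Debug, PartialEq)]
pub enum ExportError {
    AlreadyShutdown,
    InternalFailure(String),
}

pub type ExportResult = Result<(), ExportError>;

/// ...and a separate type where the failure modes genuinely diverge.
#[derive(Debug, PartialEq)]
pub enum BuildError {
    InvalidEndpoint(String),
}

pub trait Exporter {
    fn export(&mut self, batch: &[&str]) -> ExportResult;
    fn force_flush(&mut self) -> ExportResult;
    fn shutdown(&mut self) -> ExportResult;
}

/// Toy exporter used only to exercise the types above.
#[derive(Debug, Default)]
pub struct InMemoryExporter {
    exported: Vec<String>,
    is_shutdown: bool,
}

impl InMemoryExporter {
    pub fn build(endpoint: &str) -> Result<Self, BuildError> {
        if endpoint.is_empty() {
            return Err(BuildError::InvalidEndpoint("endpoint must not be empty".to_string()));
        }
        Ok(Self::default())
    }

    pub fn exported(&self) -> &[String] {
        &self.exported
    }
}

impl Exporter for InMemoryExporter {
    fn export(&mut self, batch: &[&str]) -> ExportResult {
        if self.is_shutdown {
            return Err(ExportError::AlreadyShutdown);
        }
        self.exported.extend(batch.iter().map(|s| s.to_string()));
        Ok(())
    }

    fn force_flush(&mut self) -> ExportResult {
        if self.is_shutdown {
            return Err(ExportError::AlreadyShutdown);
        }
        Ok(())
    }

    fn shutdown(&mut self) -> ExportResult {
        if self.is_shutdown {
            return Err(ExportError::AlreadyShutdown);
        }
        self.is_shutdown = true;
        Ok(())
    }
}
```

Every variant of `ExportError` is something each trait method could plausibly return, which is the property the section asks for.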
If this were _not_ the case - if we needed to mark an extra error for instance for `LogExporter` that the caller could reasonably handle -
we would let the error types diverge at that point.

### 4. Box custom errors where a savvy caller may be able to handle them, stringify them if not
Let's also mention that we use thiserror to manage Error types. As indicated in its README, it won't appear in the public API, and we may remove it in the future (if there is a need) without breaking the public API.
+1. Also, should we consider declaring that the error message itself can change at any time, to leave us some room for improvement in the future?
If this were _not_ the case - if we needed to mark an extra error for instance for `LogExporter` that the caller could reasonably handle -
we would let the error types diverge at that point.
Let's be clear that we cannot change the enum post 1.0 (unless we bump to 2.0), as it'll be a breaking change; adding a new variant, removing one, or changing a type are all breaking changes.
> as it'll be a breaking change;

If the enum has `#[non_exhaustive]`, adding a variant won't be a breaking change. Users will still need to modify their code to handle the new variant specifically, but at least it will still compile.
For error types I think it's worth adding the `#[non_exhaustive]` attribute to force consumers to think about "future errors" when they implement their handling, and to make it possible for us to introduce new error variants.
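For reference, a small sketch of what `#[non_exhaustive]` asks of consumers. `ExampleError` and `describe` are hypothetical names. Note that the attribute only constrains other crates: downstream code must include a wildcard arm, whereas inside the defining crate the wildcard below is redundant, hence the `allow`.

```rust
/// Hypothetical error enum that may gain variants in a minor release.
#[non_exhaustive]
#[derive(Debug)]
pub enum ExampleError {
    AlreadyShutdown,
    Timeout,
}

/// How a downstream crate would have to match: known variants first,
/// then a wildcard arm that absorbs variants added in later releases.
#[allow(unreachable_patterns)] // the wildcard is only required outside the defining crate
pub fn describe(err: &ExampleError) -> &'static str {
    match err {
        ExampleError::AlreadyShutdown => "already shut down",
        ExampleError::Timeout => "timed out",
        _ => "unknown error",
    }
}
```

Removing a variant or changing a variant's payload would still be breaking even with the attribute; it only covers additions.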
LGTM. Great start to following this practice throughout.
Left some small comments. Would love to get one more pair of eyes to review!
### 1. No panics from SDK APIs
Failures during regular operation should not panic, instead returning errors to the caller where appropriate, _or_ logging an error if not appropriate.
Some of the opentelemetry SDK interfaces are dictated by the specification in such a way that they may not return errors.
It is possible that some of the dependencies can panic - which may be the right approach when it happens during SDK configuration time. Or do we expect the SDK APIs to catch all such panic and return as error or log them? Either is fine to me, but good to document that.
Handling startup failures is not covered here. It should be discussed and added additionally.
I don't think we should panic. Returning Result should be the behavior.
Do we have a situation where a dependency panics and we cannot prevent it?
Anyway, we can discuss this separately.
+1 on no panics in opentelemetry components. We cannot control panics in our dependencies, but let's make a note that the principle is no panics from opentelemetry unless there is a very good reason.
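As a rough illustration of the "no panics" principle discussed in this thread, the sketch below returns a `Result` where the signature allows it and logs plus falls back where it does not. The names (`TimeoutParseError`, `parse_timeout_ms`, `timeout_or_default`) and the 30 second default are assumptions made up for the example, and `eprintln!` stands in for whatever internal logging the SDK uses.

```rust
use std::time::Duration;

/// Hypothetical error for a configuration value that cannot be parsed.
#[derive(Debug, PartialEq)]
pub enum TimeoutParseError {
    InvalidTimeout(String),
}

/// Returns an error to the caller instead of calling `unwrap()` on bad input.
pub fn parse_timeout_ms(raw: &str) -> Result<Duration, TimeoutParseError> {
    raw.trim()
        .parse::<u64>()
        .map(Duration::from_millis)
        .map_err(|e| TimeoutParseError::InvalidTimeout(format!("{:?}: {}", raw, e)))
}

/// For an interface that cannot return an error: log and fall back.
pub fn timeout_or_default(raw: &str) -> Duration {
    match parse_timeout_ms(raw) {
        Ok(timeout) => timeout,
        Err(err) => {
            // Placeholder for the SDK's internal logging.
            eprintln!("invalid timeout, using default: {:?}", err);
            Duration::from_secs(30)
        }
    }
}
```

Panics raised inside dependencies are a separate question and are not addressed by this sketch.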
LGTM with nit comments. Thanks for adding this ADR; it definitely brings clarity to our error handling design.
## Summary

This ADR describes the general pattern we will follow when modelling errors in public API interfaces - that is, APIs that are exposed to users of the project's published crates. . It summarises the discussion and final option from [#2571](https://github.com/open-telemetry/opentelemetry-rust/issues/2571); for more context check out that issue.
Suggested change, before:
This ADR describes the general pattern we will follow when modelling errors in public API interfaces - that is, APIs that are exposed to users of the project's published crates. . It summarises the discussion and final option from [#2571](https://github.com/open-telemetry/opentelemetry-rust/issues/2571); for more context check out that issue.
Suggested change, after:
This ADR describes the general pattern we will follow when modelling errors in public API interfaces - that is, APIs that are exposed to users of the project's published crates. It summarises the discussion and final option from [#2571](https://github.com/open-telemetry/opentelemetry-rust/issues/2571); for more context check out that issue.
Fixes #2571
Capturing decision record for error handling in repo with new docs.
Merge requirement checklist
- CHANGELOG.md files updated for non-trivial, user-facing changes