docs: Add ADR dir and error handling ADR #2664

scottgerring · 2025-02-14T09:59:12Z

Fixes #2571

Capturing decision record for error handling in repo with new docs.

Merge requirement checklist

CONTRIBUTING guidelines followed
Unit tests added/updated (if applicable)
Appropriate CHANGELOG.md files updated for non-trivial, user-facing changes
Changes in public API reviewed (if applicable)

codecov · 2025-02-14T10:02:09Z

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 79.4%. Comparing base (92303b6) to head (fc86870).

Additional details and impacted files

@@          Coverage Diff          @@
##            main   #2664   +/-   ##
=====================================
  Coverage   79.4%   79.4%           
=====================================
  Files        123     123           
  Lines      22770   22770           
=====================================
  Hits       18092   18092           
  Misses      4678    4678

scottgerring · 2025-02-17T12:54:54Z

@cijothomas re-wrote a bunch to reflect discussion and current state!

cijothomas · 2025-02-18T03:14:30Z

docs/adr/001_error_handling.md

+
+### When to box custom errors
+
+Note above that we do not box anything into `InternalFailure`. Our rule here is that if the caller cannot reasonably be expected to handle a particular error variant, we will use a simplified interface that returns only a descriptive string. In the concrete example we are using with the exporters, we have a [strong signal in the opentelemetry-specification](https://github.com/open-telemetry/opentelemetry-specification/blob/main/specification/logs/sdk.md#export) that indicates concretely that the error types are not actionable by the caller.


Love this phrasing!

scottgerring · 2025-02-24T07:08:49Z

@cijothomas is this good to merge? It'd be great to have a concrete example in place so we can start to follow the pattern - for instance for the tracing interop Björn is working on

cijothomas · 2025-02-24T14:48:49Z

docs/adr/README.md

@@ -0,0 +1,5 @@
+# Architectural Decision Records
+
+This directory contains architectural decision records made for the opentelemetry-rust project. These allow us to consolidate discussion, options, and outcomes, around key architectural decisions. You can read more about ADRs [here](https://adr.github.io/).


I'd avoid links to adr.github.io or similar. We can simply document the designs here, without necessarily adhering to any particular version of the format.

cijothomas · 2025-02-24T14:49:35Z

docs/adr/001_error_handling.md

+
+Note above that we do not box anything into `InternalFailure`. Our rule here is that if the caller cannot reasonably be expected to handle a particular error variant, we will use a simplified interface that returns only a descriptive string. In the concrete example we are using with the exporters, we have a [strong signal in the opentelemetry-specification](https://github.com/open-telemetry/opentelemetry-specification/blob/main/specification/logs/sdk.md#export) that indicates concretely that the error types are not actionable by the caller.
+
+If the caller may potentially recover from an error, we will follow [canonical's rust best practices](https://canonical.github.io/rust-best-practices/error-and-panic-discipline.html) and instead preserve the nested error.


I'd suggest removing the link to the Canonical guide. It is not clear whether we'll always follow it.

cijothomas · 2025-02-24T14:51:07Z

docs/adr/001_error_handling.md

+
+## Considered Options
+
+**Option 1: Continue as is**


I still suggest listing only the decision here to avoid long reads. At the end of the doc we can add a considered-alternatives section and move this there.

cijothomas · 2025-02-24T14:52:44Z

docs/adr/001_error_handling.md

+
+## Accepted Option
+
+**Option 3** 


I suggest making it super easy for a future reader to know what the design guidance is, without having to scan through the rest of the doc. That is, let's put the design we follow right here, with the reason it was chosen just below it.

Everything else can be moved to bottom of the doc. https://github.com/open-telemetry/opentelemetry-go/blob/main/sdk/log/DESIGN.md#rejected-alternatives has something like this.

cijothomas · 2025-02-24T14:55:38Z

docs/adr/001_error_handling.md

+Our preference for error types is thus:
+
+1. Consolidated error that covers all methods of a particular "trait type" (e.g., signal export) and method
+1. Devolves into error type per method of a particular trait type (e.g., `SdkShutdownResult`, `SdkExportResult`) _if the error types need to diverge_


The final outcome is that there is no separate Export vs Shutdown result.

lalitb · 2025-02-24T17:21:37Z

docs/adr/001_error_handling.md

+pub trait LogExporter {
+	fn export(...) -> OtelSdkResult;
+	fn shutdown(...) -> OtelSdkResult; 
+  fn force_flush(...) -> OTelSdkResult;


nit - indentation
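
For readers following the thread, here is a minimal sketch of what a single result type shared by every method of the exporter trait could look like. Only `LogExporter`, `OTelSdkResult`, and `InternalFailure` appear in the quoted diff; the enum name `OTelSdkError`, the `AlreadyShutdown` variant, the method signatures, and the `InMemoryExporter` type are assumptions made for illustration, not the SDK's actual definitions.

```rust
use std::fmt;

/// Hypothetical consolidated error shared by all `LogExporter` methods.
#[derive(Debug, PartialEq)]
pub enum OTelSdkError {
    /// Assumed variant: the exporter was already shut down.
    AlreadyShutdown,
    /// Not actionable by the caller, so only a descriptive string is kept.
    InternalFailure(String),
}

impl fmt::Display for OTelSdkError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            OTelSdkError::AlreadyShutdown => write!(f, "shutdown already invoked"),
            OTelSdkError::InternalFailure(msg) => write!(f, "operation failed: {}", msg),
        }
    }
}

impl std::error::Error for OTelSdkError {}

/// One result alias for export, flush, and shutdown alike.
pub type OTelSdkResult = Result<(), OTelSdkError>;

/// Simplified stand-in for the trait in the diff (signatures are assumed).
pub trait LogExporter {
    fn export(&mut self, batch: &[String]) -> OTelSdkResult;
    fn force_flush(&mut self) -> OTelSdkResult;
    fn shutdown(&mut self) -> OTelSdkResult;
}

/// Toy exporter that buffers records in memory.
#[derive(Default)]
pub struct InMemoryExporter {
    records: Vec<String>,
    is_shutdown: bool,
}

impl InMemoryExporter {
    /// Number of records accepted so far.
    pub fn exported(&self) -> usize {
        self.records.len()
    }
}

impl LogExporter for InMemoryExporter {
    fn export(&mut self, batch: &[String]) -> OTelSdkResult {
        if self.is_shutdown {
            return Err(OTelSdkError::AlreadyShutdown);
        }
        self.records.extend_from_slice(batch);
        Ok(())
    }

    fn force_flush(&mut self) -> OTelSdkResult {
        if self.is_shutdown {
            return Err(OTelSdkError::AlreadyShutdown);
        }
        Ok(())
    }

    fn shutdown(&mut self) -> OTelSdkResult {
        if self.is_shutdown {
            return Err(OTelSdkError::AlreadyShutdown);
        }
        self.is_shutdown = true;
        Ok(())
    }
}
```

Because all three methods can plausibly return every variant, sharing one type does not advertise errors that a method can never produce.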

scottgerring · 2025-02-27T15:03:26Z

As discussed, @cijothomas, I've made this super prescriptive guidance now, and linked out to the ADR-format discussion for more detail.

cijothomas · 2025-02-28T15:07:40Z

docs/adr/001_error_handling.md

+
+### 4. Box custom errors where a savvy caller may be able to handle them, stringify them if not
+
+Note above that we do not box anything into `InternalFailure`. Our rule here is that if the caller cannot reasonably be expected to handle a particular error variant, we will use a simplified interface that returns only a descriptive string. In the concrete example we are using with the exporters, we have a [strong signal in the opentelemetry-specification](https://github.com/open-telemetry/opentelemetry-specification/blob/main/specification/logs/sdk.md#export) that indicates that the error types _are not actionable_ by the caller. 


Suggested change

Note above that we do not box anything into `InternalFailure`. Our rule here is that if the caller cannot reasonably be expected to handle a particular error variant, we will use a simplified interface that returns only a descriptive string. In the concrete example we are using with the exporters, we have a [strong signal in the opentelemetry-specification](https://github.com/open-telemetry/opentelemetry-specification/blob/main/specification/logs/sdk.md#export) that indicates that the error types _are not actionable_ by the caller.

Note above that we do not box any `Error` into `InternalFailure`. Our rule here is that if the caller cannot reasonably be expected to handle a particular error variant, we will use a simplified interface that returns only a descriptive string. In the concrete example we are using with the exporters, we have a [strong signal in the opentelemetry-specification](https://github.com/open-telemetry/opentelemetry-specification/blob/main/specification/logs/sdk.md#export) that indicates that the error types _are not actionable_ by the caller.

cijothomas · 2025-02-28T15:09:10Z

docs/adr/001_error_handling.md

+
+Note above that we do not box anything into `InternalFailure`. Our rule here is that if the caller cannot reasonably be expected to handle a particular error variant, we will use a simplified interface that returns only a descriptive string. In the concrete example we are using with the exporters, we have a [strong signal in the opentelemetry-specification](https://github.com/open-telemetry/opentelemetry-specification/blob/main/specification/logs/sdk.md#export) that indicates that the error types _are not actionable_ by the caller. 
+
+If the caller may potentially recover from an error, we will follow the generally-accepted best practice (e.g., see [canonical's guide](https://canonical.github.io/rust-best-practices/error-and-panic-discipline.html) and instead preserve the nested error:


Nit addition: mention that no place in the repo today needs to preserve a nested error.
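
To make the distinction in the quoted section concrete, here is a minimal sketch contrasting the two treatments: preserving a nested error the caller might act on, versus flattening a non-actionable one into a string. Everything except the `InternalFailure` variant name is hypothetical (`ConfigError`, `InvalidPort`, `parse_port`, `stringify`), and, as noted above, nothing in the repo currently needs the preserving form.

```rust
use std::error::Error;
use std::fmt;
use std::num::ParseIntError;

/// Hypothetical error type used only for this illustration.
#[derive(Debug)]
pub enum ConfigError {
    /// The caller may recover (for example by falling back to a default
    /// port), so the nested error is preserved and exposed via `source()`.
    InvalidPort(ParseIntError),
    /// The caller cannot act on the cause, so only a description is kept.
    InternalFailure(String),
}

impl fmt::Display for ConfigError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            ConfigError::InvalidPort(_) => write!(f, "invalid port"),
            ConfigError::InternalFailure(msg) => write!(f, "operation failed: {}", msg),
        }
    }
}

impl Error for ConfigError {
    fn source(&self) -> Option<&(dyn Error + 'static)> {
        match self {
            // Preserved: callers can inspect or downcast the cause.
            ConfigError::InvalidPort(inner) => Some(inner),
            // Stringified: there is no structured cause to expose.
            ConfigError::InternalFailure(_) => None,
        }
    }
}

/// Keeps the nested parse error because the caller can handle it.
pub fn parse_port(raw: &str) -> Result<u16, ConfigError> {
    raw.trim().parse::<u16>().map_err(ConfigError::InvalidPort)
}

/// Flattens any error into a descriptive string.
pub fn stringify<E: Error>(err: E) -> ConfigError {
    ConfigError::InternalFailure(err.to_string())
}
```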

cijothomas · 2025-02-28T15:12:47Z

docs/adr/001_error_handling.md

+### 2. Consolidate error types within a trait where we can, let them diverge when we can't**
+
+We aim to consolidate error types where possible _without indicating a function may return more errors than it can actually return_. 
+Here's an example:


Agree with the content, but the code example is not intuitive. May I suggest a pattern like:
"don't do this"
show bad code
"instead, do this"
show correct one.

I think such a pattern makes it easier to follow.

cijothomas · 2025-02-28T15:14:28Z

docs/adr/001_error_handling.md

+If this were _not_ the case - if we needed to mark an extra error for instance for `LogExporter` that the caller could reasonably handle - 
+we would let that error traits diverge at that point. 
+
+### 4. Box custom errors where a savvy caller may be able to handle them, stringify them if not


Let's also mention that we use thiserror to manage error types. As indicated in its README, it won't appear in the public API, and we may remove it in the future (if there is a need) without breaking the public API.

+1. Also, should we consider declaring that the error message itself can change at any time, to leave us some room for improvement in the future?

cijothomas · 2025-02-28T15:16:57Z

docs/adr/001_error_handling.md

+```
+
+If this were _not_ the case - if we needed to mark an extra error for instance for `LogExporter` that the caller could reasonably handle - 
+we would let that error traits diverge at that point. 


Let's be clear that we cannot change the enum post-1.0 (unless we bump to 2.0), as it'll be a breaking change; adding a new variant, removing one, and changing a type are all breaking changes.

> as it'll be a breaking change;

If the enum has `#[non_exhaustive]`, adding a variant won't be a breaking change. Users will still need to modify their code to handle it, but at least it will still compile.

For error types I think it's worth adding the `#[non_exhaustive]` attribute, to force consumers to think about "future errors" when they implement their handling, and to make it possible for us to introduce new error variants.
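
A minimal sketch of the `#[non_exhaustive]` point, using a hypothetical `ExportError` enum and `should_stop` helper (neither is from the ADR). Code in a downstream crate that matches on a non-exhaustive enum must include a wildcard arm, which is what lets the defining crate add variants later without breaking the build.

```rust
/// Hypothetical error enum; the attribute tells downstream crates that
/// more variants may be added in a later minor release.
#[derive(Debug)]
#[non_exhaustive]
pub enum ExportError {
    AlreadyShutdown,
    InternalFailure(String),
}

/// Caller-side handling, written the way a downstream crate has to write it.
pub fn should_stop(err: &ExportError) -> bool {
    match err {
        ExportError::AlreadyShutdown => true,
        // Outside the defining crate this wildcard arm is mandatory; it also
        // catches any variant added in the future.
        _ => false,
    }
}
```

Inside the defining crate the compiler still allows exhaustive matches, so the attribute only constrains consumers.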

cijothomas

LGTM. Great start to following this practice throughout.

Left some small comments. Would love to get one more pair of eyes to review!

lalitb · 2025-02-28T19:07:45Z

docs/adr/001_error_handling.md

+### 1. No panics from SDK APIs
+Failures during regular operation should not panic, instead returning errors to the caller where appropriate, _or_ logging an error if not appropriate.
+Some of the opentelemetry SDK interfaces are dictated by the specification in way such that they may not return errors. 
+


It is possible that some of the dependencies can panic, which may be the right approach when it happens during SDK configuration time. Or do we expect the SDK APIs to catch all such panics and return them as errors or log them? Either is fine with me, but it would be good to document that.

Handling startup failures is not covered here. It should be discussed and added separately.
I don't think we should panic. Returning `Result` should be the behavior.

Do we have a situation where a dependency panics and we cannot prevent it?

Anyway, we can discuss this separately.

+1 on no panics in opentelemetry components. We cannot control panics in our dependencies, but let's make a note that the principle is no panics from opentelemetry unless there is a very good reason.
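
A minimal sketch of the "no panics" principle as discussed here, under the assumption of a hypothetical `BatchConfig` type: invalid input is reported through a `Result` rather than an `assert!` or `unwrap`, and a path that is not allowed to return an error logs and carries on instead. None of these names come from the SDK.

```rust
use std::fmt;

/// Hypothetical configuration error for this illustration.
#[derive(Debug, PartialEq)]
pub enum BuildError {
    InvalidBatchSize(usize),
}

impl fmt::Display for BuildError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            BuildError::InvalidBatchSize(n) => write!(f, "invalid batch size: {}", n),
        }
    }
}

impl std::error::Error for BuildError {}

pub struct BatchConfig {
    max_batch_size: usize,
}

impl BatchConfig {
    /// Returns an error instead of panicking on bad input.
    pub fn new(max_batch_size: usize) -> Result<Self, BuildError> {
        if max_batch_size == 0 {
            return Err(BuildError::InvalidBatchSize(max_batch_size));
        }
        Ok(Self { max_batch_size })
    }
}

/// Stand-in for an interface that may not return an error: the problem is
/// logged (here via stderr) and a safe value is used instead.
pub fn clamp_or_log(cfg: &BatchConfig, pending: usize) -> usize {
    if pending > cfg.max_batch_size {
        eprintln!(
            "pending batch of {} exceeds limit {}; truncating",
            pending, cfg.max_batch_size
        );
        cfg.max_batch_size
    } else {
        pending
    }
}
```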

lalitb

LGTM with nit comments. Thanks for adding this ADR; it definitely brings clarity to our error handling design.

TommyCpp · 2025-02-28T19:25:00Z

docs/adr/001_error_handling.md

+
+## Summary
+
+This ADR describes the general pattern we will follow when modelling errors in public API interfaces - that is, APIs that are exposed to users of the project's published crates. . It summarises the discussion and final option from [#2571](https://github.com/open-telemetry/opentelemetry-rust/issues/2571); for more context check out that issue. 


Suggested change

This ADR describes the general pattern we will follow when modelling errors in public API interfaces - that is, APIs that are exposed to users of the project's published crates. . It summarises the discussion and final option from [#2571](https://github.com/open-telemetry/opentelemetry-rust/issues/2571); for more context check out that issue.

This ADR describes the general pattern we will follow when modelling errors in public API interfaces - that is, APIs that are exposed to users of the project's published crates. It summarises the discussion and final option from [#2571](https://github.com/open-telemetry/opentelemetry-rust/issues/2571); for more context check out that issue.

TommyCpp · 2025-02-28T19:26:43Z

docs/adr/001_error_handling.md

+### 1. No panics from SDK APIs
+Failures during regular operation should not panic, instead returning errors to the caller where appropriate, _or_ logging an error if not appropriate.
+Some of the opentelemetry SDK interfaces are dictated by the specification in way such that they may not return errors. 
+


+1 on no panics in opentelemetry components. We cannot control panics in our dependencies, but let's make a note that the principle is no panics from opentelemetry unless there is a very good reason.

TommyCpp · 2025-02-28T19:31:56Z

docs/adr/001_error_handling.md

+```
+
+If this were _not_ the case - if we needed to mark an extra error for instance for `LogExporter` that the caller could reasonably handle - 
+we would let that error traits diverge at that point. 


> as it'll be a breaking change;

If the enum has `#[non_exhaustive]`, adding a variant won't be a breaking change. Users will still need to modify their code to handle it, but at least it will still compile.

For error types I think it's worth adding the `#[non_exhaustive]` attribute, to force consumers to think about "future errors" when they implement their handling, and to make it possible for us to introduce new error variants.

TommyCpp · 2025-02-28T19:33:35Z

docs/adr/001_error_handling.md

+If this were _not_ the case - if we needed to mark an extra error for instance for `LogExporter` that the caller could reasonably handle - 
+we would let that error traits diverge at that point. 
+
+### 4. Box custom errors where a savvy caller may be able to handle them, stringify them if not


+1. Also, should we consider declaring that the error message itself can change at any time, to leave us some room for improvement in the future?

scottgerring marked this pull request as ready for review February 14, 2025 09:59

scottgerring requested a review from a team as a code owner February 14, 2025 09:59

cijothomas reviewed Feb 14, 2025 (three reviews on docs/adr/001_error_handling.md)

scottgerring requested a review from cijothomas February 17, 2025 12:54

cijothomas reviewed Feb 18, 2025

cijothomas reviewed Feb 24, 2025

cijothomas reviewed Feb 24, 2025

lalitb reviewed Feb 24, 2025

scottgerring added 5 commits February 27, 2025 15:27

- docs: Add ADR dir and error handling ADR (a6d6fad)
- chore: Add more info about ADRs (7faa9a6)
- Some more detail (95090ea)
- fix links (f2a9bf8)
- Update ADR format to be more prescriptive (f152a0a)

scottgerring force-pushed the main branch from 651cc41 to f152a0a February 27, 2025 15:02

cijothomas reviewed Feb 27, 2025

- changed startup wording (9c46625)

scottgerring requested a review from cijothomas February 28, 2025 07:17

cijothomas reviewed Feb 28, 2025

cijothomas approved these changes Feb 28, 2025

- Merge branch 'main' into main (8135270)

lalitb reviewed Feb 28, 2025

lalitb approved these changes Feb 28, 2025

TommyCpp reviewed Feb 28, 2025

- Merge branch 'main' into main (fc86870)
