[consumererror] Add OTLP-centric error type#13042

Merged
bogdandrutu merged 25 commits into
open-telemetry:mainfrom
evan-bradley:consumererror-otlp-type
Jul 9, 2025

@evan-bradley
Contributor

Description

Continuation of #11085.

Link to tracking issue

Fixes #7047

@evan-bradley evan-bradley requested a review from a team as a code owner May 15, 2025 21:32
@codecov

codecov Bot commented May 15, 2025

Codecov Report

Attention: Patch coverage is 92.85714% with 8 lines in your changes missing coverage. Please review.

Project coverage is 91.57%. Comparing base (1046576) to head (20af93c).
Report is 2 commits behind head on main.

Files with missing lines | Patch % | Lines
...sumererror/internal/statusconversion/conversion.go | 80.00% | 8 Missing ⚠️

❌ Your patch status has failed because the patch coverage (92.85%) is below the target coverage (95.00%). You can increase the patch coverage or adjust the target coverage.

Additional details and impacted files
@@            Coverage Diff             @@
##             main   #13042      +/-   ##
==========================================
+ Coverage   91.55%   91.57%   +0.01%     
==========================================
  Files         526      528       +2     
  Lines       29365    29474     +109     
==========================================
+ Hits        26886    26991     +105     
- Misses       1953     1958       +5     
+ Partials      526      525       -1     

@evan-bradley
Contributor Author

I'll look at improving the code coverage tomorrow. In the meantime, this should be in a pretty good state.

@evan-bradley
Contributor Author

The remaining functions missing test coverage are the status code conversion functions, which are pretty direct mappings, so I don't think tests would be very helpful. The only thing I can think of that would meaningfully improve coverage is to store the mappings in a map object instead of a switch statement, but that feels like a slightly worse implementation.
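
For illustration, the switch-versus-map trade-off mentioned above might look like the following minimal sketch. The `Code` type, its constants, and the particular mapping are hypothetical stand-ins chosen for the example; they are not the contents of the PR's `statusconversion` package.

```go
package main

import (
	"fmt"
	"net/http"
)

// Code is a hypothetical stand-in for a gRPC status code. The real type
// lives in google.golang.org/grpc/codes.
type Code int

const (
	Unknown Code = iota
	InvalidArgument
	Unavailable
)

// httpFromCodeSwitch maps a code with a switch statement. Every case is its
// own branch, so a coverage tool reports each untested case as missed.
func httpFromCodeSwitch(c Code) int {
	switch c {
	case InvalidArgument:
		return http.StatusBadRequest
	case Unavailable:
		return http.StatusServiceUnavailable
	default:
		return http.StatusInternalServerError
	}
}

// codeToHTTP holds the same (illustrative) mapping as data.
var codeToHTTP = map[Code]int{
	InvalidArgument: http.StatusBadRequest,
	Unavailable:     http.StatusServiceUnavailable,
}

// httpFromCodeMap performs a lookup. The function has only two branches,
// no matter how many entries the map contains.
func httpFromCodeMap(c Code) int {
	if s, ok := codeToHTTP[c]; ok {
		return s
	}
	return http.StatusInternalServerError
}

func main() {
	for _, c := range []Code{Unknown, InvalidArgument, Unavailable} {
		fmt.Println(httpFromCodeSwitch(c), httpFromCodeMap(c))
	}
}
```

Both functions return the same values; the map version only changes how line coverage is counted, which is why it reads as a cosmetic fix rather than a better implementation.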

Member

@TylerHelmuth TylerHelmuth left a comment

so excited to see this revived

Comment thread consumer/consumererror/error.go Outdated
Member

@mx-psi mx-psi left a comment

There seemed to be consensus on this implementation in the last iteration. I think what we need now is to test it in real life, so I am approving this so we can move forward.

@mx-psi
Member

mx-psi commented May 21, 2025

Since this was especially controversial last time, I suggest we wait either until we have more approvals (I suggest 4) or until some time has passed (I would suggest Friday next week).

cc @open-telemetry/collector-approvers

Member

@songy23 songy23 left a comment

LGTM, liked the idea

Contributor

@jmacd jmacd left a comment

Really good to see this moving forward. This looks the way I would expect it to look after reviewing earlier feedback from @bogdandrutu.

Comment thread consumer/consumererror/error.go Outdated
Comment thread consumer/consumererror/error.go Outdated
Comment thread consumer/consumererror/error.go
Member

@bogdandrutu bogdandrutu left a comment

There is a big problem we identified in the past, which is that the default behavior of the errors in the collector pipelines is that they are retryable. It seems that this PR changes that, which I 1000% support, but we need to make sure we document this change and analyze the impact of that.

Comment thread consumer/consumererror/error.go Outdated
Comment thread consumer/consumererror/error.go
Comment thread consumer/consumererror/error.go
@evan-bradley
Contributor Author

evan-bradley commented Jun 23, 2025

@bogdandrutu @jmacd Thanks for your questions, they've helped me challenge some assumptions in the implementation.

Before I make code changes (though I would be happy to do so if you want me to illustrate any points), I want to propose how we handle the cases you've asked about. I want to try to answer your questions in a single comment instead of individually.

  1. In the case where origErr does not contain any nested information conflicting with the status code, use the caller-provided status code. I think this is straightforward and agreed upon.
  2. In the case where origErr does contain conflicting information (e.g. a nested status.Status) assume the caller has the most relevant context and use the provided HTTP/gRPC status code.
  3. When origErr is nil, we assume the caller does not have or doesn't want to include any underlying error. In this case, we return the most we can from the Error method and nil from Unwrap.
  4. When the user passes an HTTP or gRPC success code, we assume this is a qualified success and don't take any action. Our OTLP exporters are the only places where we explicitly handle the case that @jmacd pointed out, so I don't think we would save much complexity by avoiding this. gRPC also handles this slightly differently and assumes an "OK" status is actually the result of erroneous code, and remaps the status in the resulting *status.Status to be codes.Unknown. Additionally, the OTLP specification explicitly calls out that partial success responses have a success response code, so I think there's some validity to not explicitly overriding this.
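
Points 1-3 can be sketched roughly as follows. This is a minimal illustration using only the standard library; `statusError`, `newStatusError`, and `nestedStatus` are hypothetical names, not the types or constructors added by this PR.

```go
package main

import (
	"errors"
	"fmt"
)

// nestedStatus is a hypothetical stand-in for an error that already carries
// its own status code (for example, one built from a gRPC status).
type nestedStatus struct{ code int }

func (n *nestedStatus) Error() string { return fmt.Sprintf("nested status %d", n.code) }

// statusError is an illustrative wrapper, not the type added by the PR.
type statusError struct {
	orig error // may be nil (point 3)
	code int   // caller-provided code, treated as authoritative (points 1 and 2)
}

func newStatusError(orig error, code int) *statusError {
	return &statusError{orig: orig, code: code}
}

// Error returns as much as it can when no original error was supplied.
func (e *statusError) Error() string {
	if e.orig == nil {
		return fmt.Sprintf("status code %d", e.code)
	}
	return fmt.Sprintf("status code %d: %v", e.code, e.orig)
}

// Unwrap returns nil when origErr was nil.
func (e *statusError) Unwrap() error { return e.orig }

// Code reports the caller-provided code even if orig carries a different one.
func (e *statusError) Code() int { return e.code }

func main() {
	// Point 2: the nested code (14) conflicts with the caller's code (503).
	conflicting := newStatusError(&nestedStatus{code: 14}, 503)
	fmt.Println(conflicting.Code())

	// Point 3: no underlying error.
	bare := newStatusError(nil, 429)
	fmt.Println(bare.Error(), errors.Unwrap(bare) == nil)
}
```

The nested error stays in the chain in this sketch, so a caller can still reach it through `Unwrap` even though its code is not the one reported.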

Let me know if you have an issue with points 1-3, though I think they should be fairly uncontroversial. I think the situation with the most nuance is point 4, to which I can think of a few approaches to take:

  1. Do nothing and assume this is a qualified success. I don't want to discuss exactly what that will look like in this PR, and until we determine a way to concretely represent that, I think we should advise that these errors aren't created with success response codes.
  2. Map success codes to nil error values to override the fact that they aren't errors. If we make a separate error type for representing partial success responses or other qualified successes, I think this will likely be okay.
  3. Map success codes to codes.Unknown like gRPC does. I think I like this option the least of the three.

My proposal would be approach 1 until we determine how to proceed with partial success responses. We can decide whether to switch to approach 2 at that time.

@jmacd
Contributor

jmacd commented Jun 30, 2025

> Do nothing and assume this is a qualified success.

Sounds good and conservative. I'd say it gives room to improve in the future, though nothing's easy.

@bogdandrutu
Member

> In the case where origErr does contain conflicting information (e.g. a nested status.Status) assume the caller has the most relevant context and use the provided HTTP/gRPC status code.

If the origErr is a grpc.Status and we call with a new HTTP status, are you also planning to remove the grpc.Status from the error chain? Otherwise the newly generated error may be both at the same time, and usages of "error.Is" may break or get confused.

> When the user passes an HTTP or gRPC success code, we assume this is a qualified success and don't take any action.

To make sure I understand, does this mean you return nil from NewOTLPHTTPError(nil, 200)?

@evan-bradley
Contributor Author

> If the origErr is a grpc.Status and we call with a new HTTP status, are you also planning to remove the grpc.Status from the error chain? Otherwise the newly generated error may be both at the same time, and usages of "error.Is" may break or get confused.

Just to verify, did you mean errors.As? I think errors.Is will always return false since we aren't working with instantiated errors.

As for using errors.As, if I understand you correctly, even if the error is both, I expect callers will check for consumererror.Error first and status.Status second, in which case there shouldn't be any breakage or confusion.
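
The check order described here might look like the sketch below. `wrapperErr` and `grpcStatusErr` are hypothetical stand-ins for `consumererror.Error` and a gRPC status error, so the example runs with the standard library alone. It also shows why `errors.Is` is not the relevant call: comparing against a freshly constructed value does not match anything in the chain.

```go
package main

import (
	"errors"
	"fmt"
)

// wrapperErr is a hypothetical stand-in for the new consumererror type.
type wrapperErr struct {
	httpStatus int
	orig       error
}

func (e *wrapperErr) Error() string { return fmt.Sprintf("http %d: %v", e.httpStatus, e.orig) }
func (e *wrapperErr) Unwrap() error { return e.orig }

// grpcStatusErr is a hypothetical stand-in for an error carrying a gRPC status.
type grpcStatusErr struct{ code int }

func (e *grpcStatusErr) Error() string { return fmt.Sprintf("grpc code %d", e.code) }

// classify checks for the wrapper first and the nested status second, so an
// error that is "both" resolves to the wrapper's code.
func classify(err error) (string, int) {
	var w *wrapperErr
	if errors.As(err, &w) {
		return "wrapper", w.httpStatus
	}
	var s *grpcStatusErr
	if errors.As(err, &s) {
		return "grpc", s.code
	}
	return "none", 0
}

// matchesFreshValue compares the chain against a newly allocated value.
// errors.Is compares by equality, and two distinct pointers are never equal,
// so this returns false even when the chain holds an identical-looking error.
func matchesFreshValue(err error) bool {
	return errors.Is(err, &grpcStatusErr{code: 14})
}

func main() {
	both := &wrapperErr{httpStatus: 503, orig: &grpcStatusErr{code: 14}}
	fmt.Println(classify(both))
	fmt.Println(matchesFreshValue(both))
}
```

A caller that checks in the opposite order would see the nested gRPC code instead, which is the situation raised about existing code.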

> To make sure I understand, does this mean you return nil from NewOTLPHTTPError(nil, 200)?

This will still return an error. We need to develop this further, but the goal for now is that it is up to the caller to determine whether they want to return an error or not, and we don't make any assumptions about their intent. Maybe "qualified success" isn't the right term for now, since nothing else in the Collector understands how to handle them.

@bogdandrutu
Member

> As for using errors.As, if I understand you correctly, even if the error is both, I expect callers will check for consumererror.Error first and status.Status second, in which case there shouldn't be any breakage or confusion.

This is not what I see today in code, see OTLP receiver.

@bogdandrutu
Member

> This will still return an error. We need to develop this further, but the goal for now is that it is up to the caller to determine whether they want to return an error or not, and we don't make any assumptions about their intent. Maybe "qualified success" isn't the right term for now, since nothing else in the Collector understands how to handle them.

This may break existing code which will return an error and caller may retry and cause lots of duplicate data, possible infinite retries, etc.

@evan-bradley
Contributor Author

> This is not what I see today in code, see OTLP receiver.

The goal is that we will supplement or replace that code using this error type; a major motivating factor for this new error type is that translating the gRPC status code into an HTTP status code is currently a lossy operation.

> This may break existing code which will return an error and caller may retry and cause lots of duplicate data, possible infinite retries, etc.

I think that will only occur if exporters make significant changes to the way they handle errors. For example, in the OTLP/HTTP exporter, this line will go from:

return formattedErr

to

return consumererror.NewOTLPHTTPError(formattedErr, resp.StatusCode)

In this case, we've already validated that resp.StatusCode is not 200. I still don't expect anyone to go out of their way to use a 200 error code here, but I don't want to be prescriptive until we've decided on how we want to handle qualified successes.
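
A rough, self-contained sketch of that exporter-side pattern follows. `newHTTPStatusError` is a hypothetical stand-in for `consumererror.NewOTLPHTTPError`, and `checkStatus` and `statusOf` are illustrative helpers rather than code from the OTLP/HTTP exporter.

```go
package main

import (
	"errors"
	"fmt"
	"net/http"
)

// httpStatusError is an illustrative error that carries an HTTP status code.
type httpStatusError struct {
	orig   error
	status int
}

func (e *httpStatusError) Error() string { return e.orig.Error() }
func (e *httpStatusError) Unwrap() error { return e.orig }

// newHTTPStatusError is a hypothetical stand-in for
// consumererror.NewOTLPHTTPError. It is only called with a non-nil orig here.
func newHTTPStatusError(orig error, status int) error {
	return &httpStatusError{orig: orig, status: status}
}

// checkStatus validates the response code before building any error, so the
// constructor is never reached with a success code.
func checkStatus(statusCode int) error {
	if statusCode >= 200 && statusCode <= 299 {
		return nil
	}
	formattedErr := fmt.Errorf("request failed: %d %s", statusCode, http.StatusText(statusCode))
	return newHTTPStatusError(formattedErr, statusCode)
}

// statusOf recovers the status code further up the pipeline, or 0 if the
// error does not carry one.
func statusOf(err error) int {
	var e *httpStatusError
	if errors.As(err, &e) {
		return e.status
	}
	return 0
}

func main() {
	fmt.Println(checkStatus(200) == nil)
	fmt.Println(statusOf(checkStatus(503)))
}
```

Because the success check happens before the constructor call, the only change at the call site is wrapping the already formatted error with the status code.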

@bogdandrutu
Member

> I still don't expect anyone to go out of their way to use a 200 error code here, but I don't want to be prescriptive until we've decided on how we want to handle qualified successes.

Then let's forbid calling it with 200 (success) and make that impossible, so that in the future we can change that behavior however we want.

@evan-bradley
Contributor Author

evan-bradley commented Jul 2, 2025

> Then let's forbid calling it with 200 (success) and make that impossible, so that in the future we can change that behavior however we want.

I like the idea of forbidding it so we can make future changes not breaking, but in a "new error" function that feels challenging. I see three options here:

  1. Make the functions panic, like we did here. I don't like the idea of panicking, especially at runtime, but this guarantees us that callers will check for and avoid calls using success codes.
  2. Return nil. This has the upside of being a "no-op" in case callers blindly use a 200 code, but also silently masks what I would consider a bug, and will be a breaking change if we decide to return an error in the future.
  3. Have the New[...]Error functions return (error, error). This is how we usually handle these situations, but in this case that feels odd.

A side note: for the implementation, I will do this for codes.OK for gRPC (the behavior of the OTLP exporter) and [200, 299] for HTTP, since that's the behavior of the OTLP/HTTP exporter.

@sfc-gh-bdrutu

I am OK with panicking in this case, as long as we clearly document it.
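
A minimal sketch of the panic-on-success approach, under the assumption that the HTTP range is [200, 299] as stated above. `mustNewHTTPError` and `panics` are hypothetical names for illustration, not the functions merged in this PR.

```go
package main

import "fmt"

// httpError is an illustrative error carrying an HTTP status code.
type httpError struct {
	orig   error
	status int
}

func (e *httpError) Error() string {
	if e.orig == nil {
		return fmt.Sprintf("http status %d", e.status)
	}
	return fmt.Sprintf("http status %d: %v", e.status, e.orig)
}

func (e *httpError) Unwrap() error { return e.orig }

// mustNewHTTPError is a hypothetical constructor. It panics when status is in
// [200, 299], so callers have to rule out success codes before calling it.
func mustNewHTTPError(orig error, status int) error {
	if status >= 200 && status <= 299 {
		panic(fmt.Sprintf("cannot create an error from HTTP success status %d", status))
	}
	return &httpError{orig: orig, status: status}
}

// panics reports whether fn panics. It exists only to demonstrate the
// behavior without crashing the program.
func panics(fn func()) (p bool) {
	defer func() { p = recover() != nil }()
	fn()
	return false
}

func main() {
	fmt.Println(panics(func() { mustNewHTTPError(nil, 200) }))
	fmt.Println(panics(func() { mustNewHTTPError(nil, 503) }))
}
```

Panicking keeps the signature a plain `error` return while leaving room to define success-code behavior later without a silent change for existing callers.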

@evan-bradley evan-bradley requested a review from bogdandrutu July 7, 2025 14:01
@@ -20,7 +20,7 @@ type Traces struct {
func NewTraces(err error, data ptrace.Traces) error {
Member

Separate concern: Not sure these are used anymore.

@bogdandrutu bogdandrutu enabled auto-merge July 9, 2025 12:27
@bogdandrutu bogdandrutu added this pull request to the merge queue Jul 9, 2025
Merged via the queue into open-telemetry:main with commit e8ccfc3 Jul 9, 2025
53 of 56 checks passed

Development

Successfully merging this pull request may close these issues.

Investigate how to expose exporterhelper.NewThrottleRetry in the consumererror

7 participants