[consumererror] Add OTLP-centric error type #13042
Codecov Report
❌ Your patch status has failed because the patch coverage (92.85%) is below the target coverage (95.00%). You can increase the patch coverage or adjust the target coverage.
Additional details and impacted files
@@            Coverage Diff             @@
##             main   #13042      +/-   ##
==========================================
+ Coverage   91.55%   91.57%   +0.01%
==========================================
  Files         526      528       +2
  Lines       29365    29474     +109
==========================================
+ Hits        26886    26991     +105
- Misses       1953     1958       +5
+ Partials      526      525       -1
I'll look at improving the code coverage tomorrow. In the meantime, this should be in a pretty good state. |
The remaining functions missing test coverage are the status code conversion functions. They are direct mappings, so I don't think tests would add much. The only change I can think of that would meaningfully improve coverage is storing the mappings in a map instead of a switch statement, but that feels like a slightly worse implementation.
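For readers weighing the two forms mentioned above, here is a minimal sketch of a switch-based and a map-based status conversion side by side. The names (`grpcCode`, `httpFromGRPCSwitch`, `httpFromGRPCMap`) and the mapping itself are assumptions for illustration, not the functions or the mapping in this PR; the constants only mirror the standard gRPC numeric code values so the sketch needs nothing beyond the standard library.

```go
package main

import (
	"fmt"
	"net/http"
)

// grpcCode mirrors gRPC's numeric status codes (hypothetical local type,
// used so this sketch does not depend on the grpc module).
type grpcCode int

const (
	codeInvalidArgument   grpcCode = 3
	codeDeadlineExceeded  grpcCode = 4
	codeResourceExhausted grpcCode = 8
	codeInternal          grpcCode = 13
	codeUnavailable       grpcCode = 14
)

// httpFromGRPCSwitch is the switch form. The mapping is illustrative only.
func httpFromGRPCSwitch(c grpcCode) int {
	switch c {
	case codeInvalidArgument:
		return http.StatusBadRequest
	case codeDeadlineExceeded:
		return http.StatusGatewayTimeout
	case codeResourceExhausted:
		return http.StatusTooManyRequests
	case codeUnavailable:
		return http.StatusServiceUnavailable
	default:
		return http.StatusInternalServerError
	}
}

// httpFromGRPCTable holds the same illustrative mapping as data.
var httpFromGRPCTable = map[grpcCode]int{
	codeInvalidArgument:   http.StatusBadRequest,
	codeDeadlineExceeded:  http.StatusGatewayTimeout,
	codeResourceExhausted: http.StatusTooManyRequests,
	codeUnavailable:       http.StatusServiceUnavailable,
}

// httpFromGRPCMap is the map form; unknown codes fall back to 500.
func httpFromGRPCMap(c grpcCode) int {
	if status, ok := httpFromGRPCTable[c]; ok {
		return status
	}
	return http.StatusInternalServerError
}

func main() {
	for _, c := range []grpcCode{codeInvalidArgument, codeUnavailable, codeInternal} {
		fmt.Println(c, httpFromGRPCSwitch(c), httpFromGRPCMap(c))
	}
}
```

With the map form, a single lookup line covers every entry, which is why it raises line coverage without adding test cases; the switch form needs one test case per branch to reach the same number.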
TylerHelmuth
left a comment
so excited to see this revived
mx-psi
left a comment
There seemed to be consensus on this implementation in the last iteration. I think what we need now is to test this in real life, so I am approving to let us move forward.
Since this was especially controversial last time, I suggest we wait until either we have more approvals (I suggest 4) or some time has passed (I would suggest Friday next week). cc @open-telemetry/collector-approvers
jmacd
left a comment
Really good to see this moving forward. This looks the way I would expect it to look after reviewing earlier feedback from @bogdandrutu.
bogdandrutu
left a comment
There is a big problem we identified in the past: by default, errors in the collector pipelines are treated as retryable. It seems that this PR changes that, which I 1000% support, but we need to make sure we document this change and analyze its impact.
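To make the change in default concrete, here is a minimal sketch contrasting the two policies: one where an unmarked error is retried, and one where retrying requires an explicit opt-in. Every name in it (`pipelineError`, `newPermanent`, `newRetryable`, `legacyShouldRetry`, `shouldRetry`) is hypothetical and is not the API added by this PR.

```go
package main

import (
	"errors"
	"fmt"
)

// pipelineError is a hypothetical error wrapper carrying retry intent.
type pipelineError struct {
	err       error
	retryable bool
}

func (e *pipelineError) Error() string { return e.err.Error() }
func (e *pipelineError) Unwrap() error { return e.err }

// newPermanent and newRetryable are hypothetical constructors.
func newPermanent(err error) error { return &pipelineError{err: err} }
func newRetryable(err error) error { return &pipelineError{err: err, retryable: true} }

// errPlain is an ordinary error with no retry information attached.
var errPlain = errors.New("connection reset")

// legacyShouldRetry models "retryable unless marked otherwise".
func legacyShouldRetry(err error) bool {
	var pe *pipelineError
	if errors.As(err, &pe) {
		return pe.retryable
	}
	return true // unmarked errors are retried
}

// shouldRetry models "not retryable unless marked retryable".
func shouldRetry(err error) bool {
	var pe *pipelineError
	if errors.As(err, &pe) {
		return pe.retryable
	}
	return false // unmarked errors are dropped, not retried
}

func main() {
	fmt.Println("plain, legacy policy:", legacyShouldRetry(errPlain))
	fmt.Println("plain, new policy:   ", shouldRetry(errPlain))
	fmt.Println("marked retryable:    ", shouldRetry(newRetryable(errPlain)))
	fmt.Println("marked permanent:    ", shouldRetry(newPermanent(errPlain)))
}
```

The only behavioral difference is the fallback branch for unmarked errors, which is exactly the case that would need documenting and impact analysis.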
@bogdandrutu @jmacd Thanks for your questions, they've helped me challenge some assumptions in the implementation. Before I make code changes (though I would be happy to do so if you want me to illustrate any points), I want to propose how we handle the cases you've asked about. I want to try to answer your questions in a single comment instead of individually.
Let me know if you have an issue with points 1-3, though I think they should be fairly uncontroversial. I think the situation with the most nuance is point 4, to which I can think of a few approaches to take:
My proposal would be approach 1 until we determine how to proceed with partial success responses. We can decide whether to switch to approach 2 at that time. |
Sounds good and conservative. I'd say it gives room to improve in the future, though nothing's easy. |
If the origErr is a grpc.Status and we call with a new HTTP status, are you also planning to remove the grpc.Status from the error chain? Otherwise the newly generated error may carry both statuses at the same time, and usages of "errors.Is" may break or get confused.
To make sure I understand, does this mean you return |
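The concern about an error chain carrying both a gRPC status and an HTTP status can be reproduced with the standard `errors` package alone. The sketch below uses two hypothetical wrapper types (`grpcStatusError`, `httpStatusError`) standing in for the real status-carrying errors; it does not use the grpc module or this PR's types.

```go
package main

import (
	"errors"
	"fmt"
)

// grpcStatusError is a hypothetical stand-in for an error carrying a gRPC code.
type grpcStatusError struct {
	code int
	err  error
}

func (e *grpcStatusError) Error() string { return fmt.Sprintf("grpc code %d: %v", e.code, e.err) }
func (e *grpcStatusError) Unwrap() error { return e.err }

// httpStatusError is a hypothetical stand-in for an error carrying an HTTP status.
type httpStatusError struct {
	status int
	err    error
}

func (e *httpStatusError) Error() string { return fmt.Sprintf("http status %d: %v", e.status, e.err) }
func (e *httpStatusError) Unwrap() error { return e.err }

// grpcCodeOf reports the first gRPC code found anywhere in the chain.
func grpcCodeOf(err error) (int, bool) {
	var ge *grpcStatusError
	if errors.As(err, &ge) {
		return ge.code, true
	}
	return 0, false
}

// httpStatusOf reports the first HTTP status found anywhere in the chain.
func httpStatusOf(err error) (int, bool) {
	var he *httpStatusError
	if errors.As(err, &he) {
		return he.status, true
	}
	return 0, false
}

// exampleChain wraps a gRPC-status error in a new HTTP-status error
// without removing the original from the chain.
func exampleChain() error {
	orig := &grpcStatusError{code: 14, err: errors.New("backend unavailable")}
	return &httpStatusError{status: 429, err: orig}
}

func main() {
	err := exampleChain()
	code, hasGRPC := grpcCodeOf(err)
	status, hasHTTP := httpStatusOf(err)
	// Both lookups succeed, and the two statuses disagree (14 vs 429).
	fmt.Println(code, hasGRPC, status, hasHTTP)
}
```

Because `errors.As` walks the whole chain, a caller checking for a gRPC status still finds the original one even though the outermost error was created with a different HTTP status.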
Just to verify, did you mean As for using
This will still return an error. We need to develop this further, but the goal for now is that it is up to the caller to determine whether they want to return an error or not, and we don't make any assumptions about their intent. Maybe "qualified success" isn't the right term for now, since nothing else in the Collector understands how to handle them.
This is not what I see today in code, see OTLP receiver. |
This may break existing code: it will return an error, and the caller may retry, causing lots of duplicate data, possibly infinite retries, etc.
The goal is that we will supplement or replace that code using this error type; a major motivating factor for this new error type is that translating the gRPC status code into an HTTP status code is currently a lossy operation.
I think that will only occur if exporters make significant changes to the way they handle errors. For example, in the OTLP/HTTP exporter, this line will go from: to In this case, we've already validated that |
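The lossy translation mentioned above can be shown with a round trip: several gRPC codes collapse into one HTTP status, so converting back cannot recover the original. The mappings and names here (`toHTTP`, `toGRPC`, `roundTrip`) are assumptions chosen for illustration, not the Collector's actual conversion tables.

```go
package main

import (
	"fmt"
	"net/http"
)

// grpcCode mirrors gRPC's numeric status codes (hypothetical local type).
type grpcCode int

const (
	codeUnknown     grpcCode = 2
	codeInternal    grpcCode = 13
	codeUnavailable grpcCode = 14
	codeDataLoss    grpcCode = 15
)

// toHTTP maps a gRPC code to an HTTP status (illustrative mapping).
// Unknown, Internal, and DataLoss all become 500.
func toHTTP(c grpcCode) int {
	if c == codeUnavailable {
		return http.StatusServiceUnavailable
	}
	return http.StatusInternalServerError
}

// toGRPC maps an HTTP status back to a single gRPC code (illustrative mapping).
func toGRPC(status int) grpcCode {
	switch status {
	case http.StatusServiceUnavailable:
		return codeUnavailable
	case http.StatusInternalServerError:
		return codeInternal
	default:
		return codeUnknown
	}
}

// roundTrip converts a gRPC code to HTTP and back again.
func roundTrip(c grpcCode) grpcCode { return toGRPC(toHTTP(c)) }

func main() {
	for _, c := range []grpcCode{codeUnavailable, codeInternal, codeDataLoss, codeUnknown} {
		fmt.Printf("in=%d http=%d out=%d\n", c, toHTTP(c), roundTrip(c))
	}
}
```

An error type that stores the status it was created with avoids this, since conversion then happens only at the boundary that needs the other protocol.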
Then let's forbid calling it with 200 (success), so that we are free to define that behavior however we want in the future. |
I like the idea of forbidding it so we can make future changes not breaking, but in a "new error" function that feels challenging. I see three options here:
A side note, for the implementation, I will do this for |
I am OK with panicking in this case, as long as we clearly document it.
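A minimal sketch of the "panic and document it" option follows. The constructor name `newHTTPError` and the `httpError` type are hypothetical, and rejecting the whole 2xx range is an assumption; the thread itself only names 200.

```go
package main

import (
	"errors"
	"fmt"
	"net/http"
)

// httpError is a hypothetical error type carrying an HTTP status.
type httpError struct {
	err    error
	status int
}

func (e *httpError) Error() string { return fmt.Sprintf("http status %d: %v", e.status, e.err) }
func (e *httpError) Unwrap() error { return e.err }

// errBackend is a sample underlying error used in the demonstrations.
var errBackend = errors.New("backend rejected data")

// newHTTPError is a hypothetical constructor.
//
// It panics if status is a success code (2xx), because an error with a
// success status has no defined meaning yet. Forbidding it now keeps the
// option open to define that behavior later without a breaking change.
func newHTTPError(err error, status int) error {
	if status >= 200 && status <= 299 {
		panic(fmt.Sprintf("newHTTPError: success status %d is not a valid error status", status))
	}
	return &httpError{err: err, status: status}
}

// panics reports whether calling f panics.
func panics(f func()) (didPanic bool) {
	defer func() {
		if recover() != nil {
			didPanic = true
		}
	}()
	f()
	return false
}

func main() {
	fmt.Println(newHTTPError(errBackend, http.StatusServiceUnavailable))
	fmt.Println("200 panics:", panics(func() { _ = newHTTPError(errBackend, http.StatusOK) }))
}
```

The documented panic turns a misuse into an immediate failure at the call site rather than an ambiguous error value that downstream components might retry.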
@@ -20,7 +20,7 @@ type Traces struct {
func NewTraces(err error, data ptrace.Traces) error {
Separate concern: Not sure these are used anymore.
e8ccfc3
Description
Continuation of #11085.
Link to tracking issue
Fixes #7047