HTTP event listener: retry for status codes not in [200, 300) and exceptions, retry delay calculation fix, better logging #10566
Conversation
Commit message: "Add more logging the event listener" -> "Add more logging in the HTTP event listener"
Feels too verbose. I think it should be debug.
I see you changed that in a later commit, but it should already be debug in "Add more logging to the HTTP event listener".
My bad, I just renamed the first commit and added this to the second one. I see you already approved; I can go back and make this change if you want.
Please do. We want a clean commit history when possible. I will merge once CI passes.
Force-pushed from a8619f9 to bc62232
Thanks for the review! I implemented the requested changes. I also found a bug in the next-delay calculation and fixed that as well (and changed the PR title to include that).
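The thread doesn't show the delay bug itself, but the usual shape of a next-delay calculation for this kind of retry loop is capped exponential backoff. A minimal sketch under assumed names (this is illustrative, not the actual listener code):

```java
class RetryDelay
{
    // Hypothetical capped exponential backoff: baseDelayMillis on the first retry,
    // doubling on each subsequent attempt, never exceeding maxDelayMillis.
    // 'attempt' is 1-based; the shift is clamped to avoid long overflow.
    static long nextDelayMillis(int attempt, long baseDelayMillis, long maxDelayMillis)
    {
        long delay = baseDelayMillis * (1L << Math.min(attempt - 1, 30));
        return Math.min(delay, maxDelayMillis);
    }
}
```

A common bug in such calculations is an off-by-one on the attempt counter or an uncapped shift that overflows; clamping both, as above, keeps the delays monotone and bounded.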
Force-pushed from bc62232 to 6d3a951
```diff
  verify(result != null);

- if (result.getStatusCode() >= 500 && attempt < config.getRetryCount()) {
+ if (!(result.getStatusCode() >= 200 && result.getStatusCode() < 300) && attempt < config.getRetryCount()) {
```
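The effect of the new condition is that every non-2xx response triggers a retry until the configured attempts are exhausted. A simplified, self-contained sketch of that loop (names are illustrative, not the actual listener code):

```java
import java.util.function.IntSupplier;

class RetrySketch
{
    static boolean isSuccess(int statusCode)
    {
        return statusCode >= 200 && statusCode < 300;
    }

    // Returns true as soon as a 2xx response is received, false once the
    // initial attempt plus retryCount retries are exhausted (the event
    // would then be dropped).
    static boolean sendWithRetry(IntSupplier send, int retryCount)
    {
        for (int attempt = 0; attempt <= retryCount; attempt++) {
            if (isSuccess(send.getAsInt())) {
                return true;
            }
        }
        return false;
    }
}
```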
I'm not sure how retrying any 4xx error would ever succeed. Wouldn't we end up retrying until the retry attempts are exhausted?
Yes, we would retry until attempts are exhausted.
Seems wasteful, especially since the event listener is synchronous and retrying a 4xx error seems guaranteed to fail, except maybe for HTTP 429.
Thanks for clarifying. I believe the intent is to use this mechanism to identify the scenarios we don't handle well currently and then fix them.
I think it has its use cases (which admittedly are edge cases), and it's better to be safe than to drop data. This doesn't run on the query execution threads, so there's no direct slowdown.
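A middle ground raised in this thread would be to retry only statuses where a retry can plausibly succeed, namely server errors and HTTP 429. This is only a sketch of that alternative, not what the PR implements:

```java
class SelectiveRetry
{
    // Alternative policy discussed in review: retry 5xx responses and 429
    // (rate limiting), but treat other 4xx responses as permanent failures.
    static boolean isRetryable(int statusCode)
    {
        return statusCode >= 500 || statusCode == 429;
    }
}
```

The trade-off is exactly the one debated above: this avoids wasting attempts on errors that cannot succeed, at the cost of dropping events for any unanticipated retryable status.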
I have been running into problems while using the event listener. They are usually one of:
1. Broken pipe exceptions, which are irregular and usually occur less than twice per hour. These aren't critical but cause dropped events. I suspect the cause is bad timing between Trino and the receiving server with regard to timeouts; here is an exception:
2. Regular timeouts. There are cases where most of the plugin's attempts to send events end with a timeout (from the http-client, not from the receiving server). I haven't been able to track down the cause of these yet. What I do know is that it's probably not the fault of the ingest server, because while the plugin keeps timing out, other requests to that server go through correctly (even from the same machine as Trino). This usually happens when Trino is under high load.
The changes this PR implements are simple and self-explanatory from the title.
These should fix problem 1 and help with tracking down problem 2.
Any ideas regarding problem 2 are greatly appreciated!