Skip to content

Http Event Listener retry for codes not in [200, 300) and expections, rertry delay calculation fix, better logging#10566

Merged
losipiuk merged 3 commits intotrinodb:masterfrom
mosiac1:369_events_try
Jan 14, 2022
Merged

Http Event Listener retry for codes not in [200, 300) and expections, rertry delay calculation fix, better logging#10566
losipiuk merged 3 commits intotrinodb:masterfrom
mosiac1:369_events_try

Conversation

@mosiac1
Copy link
Copy Markdown
Contributor

@mosiac1 mosiac1 commented Jan 12, 2022

I have been running into problems while using the event listener. Problems are usually one of:

  1. Broken Pipe Exceptions, which are irregular and usually occur less than 2 times per hour. These aren't critical but cause dropped events. (Cause of these I suspect might be bad timing between trino and the receiving server in regards to timeouts) here is an exception;

  2. Regular timeouts. There are cases where most of the plugin's attempts to send events will end with a timeout (from the http-client, not from the receiving server). The cause of these issues I haven't been able to track down yet. What I know is that it's probably not the fault of the ingest server because when the plugin keeps timing-out other requests to that server went through correctly (event from the same machine as trino). This usually happens when there are high loads on trino.

The changes this PR implements are simple and self-explanatory from the title.

These should fix problem 1. and help with tracking down problem 2.

Any ideas regarding problem 2 are greatly appreciated!

@cla-bot cla-bot bot added the cla-signed label Jan 12, 2022
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Commit messge Add more logging the event listener -> Add more logging in the HTTP event listener

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

feels to verbose. I think it should be debug

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see you changed that in latter commit but it should be debug already in "Add more logging to the HTTP event listener"

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My bad, I just renamed the first commit and added this to the second one. I see you approved already, I can go back and make this change if you want.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please do. We want clean commit history when possible. I will merge when CI passes.

Copy link
Copy Markdown
Member

@losipiuk losipiuk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM - minor comments

@mosiac1 mosiac1 changed the title Http Event Listener retry for codes not in [200, 300) and expections, better logging Http Event Listener retry for codes not in [200, 300) and expections, rertry delay calculation fix, better logging Jan 13, 2022
@mosiac1
Copy link
Copy Markdown
Contributor Author

mosiac1 commented Jan 13, 2022

Thanks for the review! I implemented the requested changes.

I also found a bug in the next delay calculation and fixed that as well (+ changed PR title to include that).

@losipiuk losipiuk merged commit 107b42f into trinodb:master Jan 14, 2022
@losipiuk losipiuk mentioned this pull request Jan 14, 2022
@github-actions github-actions bot added this to the 369 milestone Jan 14, 2022
verify(result != null);

if (result.getStatusCode() >= 500 && attempt < config.getRetryCount()) {
if (!(result.getStatusCode() >= 200 && result.getStatusCode() < 300) && attempt < config.getRetryCount()) {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure how would retrying any 4xx error ever succeed? Wouldn't we end up retrying until retry attempts are exhausted?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, we would retry until attempts are exhausted.

Copy link
Copy Markdown
Member

@hashhar hashhar Jan 25, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems wasteful - specially since the event listener is synchronous and retrying a 4xx error seems guaranteed to fail unless for maybe HTTP 429.

Thanks for clarifying. I believe the intent is to use this mechanism to identify the scenarios we don't handle well currently and then fix them.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it has its use-cases (which admittedly are edge-cases) and it's better to be safe than drop data. This doesn't run on the query execution threads so no direct slowdown.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Development

Successfully merging this pull request may close these issues.

3 participants