Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(writer): add retry logic #2459

Merged
merged 7 commits into from
May 25, 2021
Merged

Conversation

P403n1x87
Copy link
Contributor

@P403n1x87 P403n1x87 commented May 20, 2021

Description

This change introduces a Fibonacci retry policy (with jitter) to the agent writer to mitigate networking issues (e.g. timeouts, broken pipes, ...), similar to what the profiler does already.

Memory usage might increase up to an extra payload buffer size, meaning that occasionally the tracer would require about 8MB more memory for processing. That is because the buffer will start filling up again while the writer retries flushing the previous payload.

Resolves #1413.

Checklist

  • Added to the correct milestone.
  • Tests provided or description of manual testing performed is included in the code or PR.
  • Library documentation is updated.
  • Corp site documentation is updated (link to the PR).

This change introduces a Fibonacci retry policy (with jitter) to the
agent writer to mitigate networking issues (e.g. timeouts, broken pipes,
...), similar to what the profiler does already.

Resolves DataDog#1413.
@P403n1x87 P403n1x87 requested a review from a team as a code owner May 20, 2021 10:45
@P403n1x87 P403n1x87 requested a review from jd May 20, 2021 10:46
@P403n1x87
Copy link
Contributor Author

P403n1x87 commented May 20, 2021

Impact

When the agent is available, the impact of the retry logic is negligible

Screenshot 2021-05-20 at 11 35 10

However, when the agent is not available, or unreachable over the majority of the connection attempts, there is a slight impact from attempting to connect multiple times in a row.

@P403n1x87 P403n1x87 added this to the 0.50.0 milestone May 20, 2021
self._metrics_dist("http.errors", tags=["type:err"])
self._metrics_dist("http.dropped.bytes", len(encoded))
self._metrics_dist("http.dropped.traces", len(enc_traces))
if raise_exc:
raise
e.reraise()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's also a reraise=True in tenacity if that fits you better.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah thanks, I missed that from the docs! I think I'll keep it as it is so to avoid having the same set of exception classes in two different places.

@P403n1x87 P403n1x87 requested a review from a team May 20, 2021 17:22
@mergify mergify bot merged commit e0a0781 into DataDog:master May 25, 2021
@inverse
Copy link

inverse commented Jul 6, 2021

@Kyle-Verhoog what release did this land in? I couldn't see it in the release notes?

@majorgreys
Copy link
Contributor

@inverse It will be available in v0.50.0 for which we have pre-release versions. You can try them out from pypi: https://pypi.org/project/ddtrace/#history.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Failed to send traces to Datadog Agent - Time out
5 participants