Add backoff-retry to uploads #56

Merged 1 commit on May 13, 2021

@tjkirch tjkirch commented May 12, 2021

Issue #, if available:

Helps work around #43.

Description of changes:

Add backoff-retry delays between block upload attempts

There's more potential for error when uploading a snapshot.  For example, we've
seen four "connection closed" errors in a row for a single block upload.  The
most pernicious is the "snapshot does not exist" error, caused by timing issues
that leave a new snapshot briefly unavailable for block upload; this can take a
couple of minutes to clear.

The main issue in #43 is block upload failures due to "snapshot does not exist." The EBS team confirmed that the right approach for now is to wait and retry, and they recommended a 2-minute cap. This change adds increasing delays between retry attempts and raises the total retry time to just over 2 minutes, which leaves enough buffer for the (occasional) new snapshot that is slow to become available for uploads. It should handle the vast majority of errors and cause a bit less stress.
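
For illustration only, the wait-and-retry approach could be sketched in Rust roughly as follows. This is not the code merged in this PR: the function names, the 1-second base delay, the doubling factor, and the 64-second cap are all assumptions, chosen so that the delays across 8 attempts sum to 127 seconds, i.e. just over 2 minutes. The sketch is synchronous and takes the sleep function as a parameter so it can be exercised without actually waiting.

```rust
use std::time::Duration;

/// Hypothetical schedule: one delay between each pair of attempts,
/// doubling from `base` and clamped to `max_delay`.
fn backoff_delays(base: Duration, max_delay: Duration, attempts: u32) -> Vec<Duration> {
    (0..attempts.saturating_sub(1))
        .map(|i| {
            let factor = 2u32.saturating_pow(i);
            base.saturating_mul(factor).min(max_delay)
        })
        .collect()
}

/// Runs `op` until it succeeds or the delay schedule is exhausted.
/// `sleep` is called with the next delay after each failed attempt;
/// the last error is returned once no delays remain.
fn retry_with_backoff<T, E>(
    delays: &[Duration],
    mut op: impl FnMut() -> Result<T, E>,
    mut sleep: impl FnMut(Duration),
) -> Result<T, E> {
    let mut remaining = delays.iter();
    loop {
        match op() {
            Ok(value) => return Ok(value),
            Err(err) => match remaining.next() {
                // Wait before the next attempt.
                Some(delay) => sleep(*delay),
                // Out of retries: surface the most recent error.
                None => return Err(err),
            },
        }
    }
}
```

Passing `std::thread::sleep` as the `sleep` argument gives real delays. An async uploader would await its runtime's timer instead of blocking the thread.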

Testing done:

A bunch of before/after testing is described in #43. In short: before this change, I usually saw failures (after 3 quick retries) within 50-100 uploads. With it, I've run thousands of uploads successfully. In particular, I saw a "does not exist" case succeed after 7 retries (56 seconds).


By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

@tjkirch tjkirch requested review from bcressey and webern May 12, 2021 20:46

tjkirch commented May 12, 2021

Made #57 to address the CI failure, which is just a new clippy warning.

@zmrow zmrow left a comment

🧷

Nice approach!

@webern webern left a comment

Nice job finding a solution.

tjkirch commented May 13, 2021

^ Rebased on develop to pick up the fix for the clippy warning that tripped up CI.

@tjkirch tjkirch merged commit 506f9e8 into awslabs:develop May 13, 2021
@tjkirch tjkirch deleted the backoff-retry branch May 13, 2021 20:19