-
Notifications
You must be signed in to change notification settings - Fork 246
DRIVERS-1934: withTransaction API retries too frequently #1851
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Draft
sleepyStick
wants to merge
8
commits into
master
Choose a base branch
from
DRIVERS-1934
base: master
Could not load branches
Branch not found: {{ refName }}
Loading
Could not load tags
Nothing to show
Loading
Are you sure you want to change the base?
Some commits from the old base branch may be removed from the timeline,
and old review comments may become outdated.
Draft
Changes from 4 commits
Commits
Show all changes
8 commits
Select commit
Hold shift + click to select a range
25759c4
add exponential backoff to withTransactions API
sleepyStick bdcd2ef
run pre-commit
sleepyStick 48890a2
add design rational for backoff
sleepyStick 71ba1ba
fix prose test
sleepyStick b602606
fix pseudocode
sleepyStick 057fbbf
fix test
sleepyStick 42e4d94
account for CSOT / timeoutMS in algorithm
sleepyStick c11aef8
add more details to tests
sleepyStick File filter
Filter by extension
Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
Some comments aren't visible on the classic Files Changed page.
There are no files selected for viewing
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Oops, something went wrong.
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
Uh oh!
There was an error while loading. Please reload this page.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
3 here was a bit of an arbitrary number. The most important part of this value is that its greater than 1
3 just felt like a small enough to be a quick test but big enough to conclude backoff is consistently happening.
If folks have more opinions on this number, I'm not attached to 3.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure the
_transaction_retry_backoffsconcept is viable for testing here since:Instead I'd suggest my test from earlier where we fail the transaction
Xtimes and assert the run time is greater than some thresholdT.Xshould be large enough to reduce false positives where the test fails due to jitter resulting in a small delay for every retry.We can calculate
Tby recording the command failed+succeeded events, summing their duration, and adding a fixed constant.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Makes sense, it took me a bit but i figured out what X and T are for python -- I don't know if they'll be the same for the other languages tho? Should the test description leave it as X and T? or should I put in the numbers that I have and see what others have to say about it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
An alternative I'm considering in Node is to configure the random number generator to be deterministic for testing purposes. Because this would make the tests both simpler to reason about and deterministic. ex: make random() always return 1, then we can make assertions on the timing of retries deterministically
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ohhh! I like that idea! Would you want that to replace the current proposed test idea, or is this in addition to the previously defined test?
I'm imaging this as a second test where we fail the transactions a handful of times and calculate the backoff times assuming random() always returns 1 and ensure that the transaction takes longer than the sum of the backoff times.
The main reason to keep the first test (where we don't configure random) is to just ensure that jitter is applied? If so, then should that first test be modified to assert that the transaction succeeds in < T where T is the sum of the maximum backoff times?
Then together we're effectively checking a minimum and maximum threshold on the backoff? Did that make sense?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It depends exactly what we think we need to test here, I think. If the goal is only to test that retry backoff is enforced, I think your current implementation works.
Although I'd be worried about flakiness, because anything that relies on random values is inherently non-determinstic and just due to the distribution of the delays it seems like flakes might be likely:
So, short delays in the last few retries could make make the test run a lot shorter than expected. A determinsitic jitter would solve this problem. That, and the test is a bit simpler to reason about imo (I spent a bit of time trying to calculate the probability that this test fails assuming random() returns an equal distribution of values in [0,1], which goes away if we random() is deterministic).
I like the idea of keeping both tests if we do want to make sure jitter is applied, but a determinsitic alternative would be to make random() return a non-1 value (like .5) and assert that the test takes between [total sleep time with jitter = .5, total sleep time with jitter = 1 (or some other value)].
I also know this is feasible in Node, I'm not sure how feasible configuring the random() function would be in other drivers.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think theoretically the test described by shane works but I agree that flakiness is likely an issue. I tried to account for that with the "optionally change initial_backoff to a higher value" but I don't know how feasible that is across various drivers (hence the optional)
As you've stated, the effect of jitter on the later attempt have a higher impact than the earlier ones, but I generally hope that over time the avg of random should be ~0.5 which sanity checked my initial 1.25 second timeout (discovered through guess and check for better or worse) with random jitter -- but clearly we've been observing that it's not consistent enough across languages
Noting my journey in trying to get values that were consistent in python and seeing that they don't carry over to Node seems to imply that if this test were to actually be implemented, each driver may have to find their own values of
XandT? which feels silly imo?All of that is to say, I think at this point i'm convinced a deterministic approach is the way to go for tests. Just to be clear, you are suggesting that the two test be:
aand assert that the test takes between [total sleep time with jitter =b, total sleep time with jitter =c(or some other value)].does
a=bor can it be such thatb<=a? obviouslyc>a, correct?how do we want to decide on
a,b, andc? There is a part of me that worries if the time difference between total sleep time withcvs total sleep time withaisn't big enough, the rest of the driver operations could total test time > total sleep time with jitter =c? idk i think i'm rambling now.