10 changes: 10 additions & 0 deletions source/transactions-convenient-api/tests/README.md
@@ -41,8 +41,18 @@ If possible, drivers should implement these tests without requiring the test run
the retry timeout. This might be done by internally modifying the timeout value used by `withTransaction` with some
private API or using a mock timer.

### Retry Backoff is Enforced

Drivers should test that retries within `withTransaction` do not occur immediately. Configure a fail point that forces 3
Author

@sleepyStick sleepyStick Oct 23, 2025

3 here was a bit of an arbitrary number. The most important part of this value is that it's greater than 1.
3 just felt small enough to keep the test quick but big enough to conclude that backoff is consistently happening.
If folks have more opinions on this number, I'm not attached to 3.

Member

I'm not sure the _transaction_retry_backoffs concept is viable for testing here since:

  • many languages don't have a way to implement it without making the attribute public
  • It's an unbounded list that can grow forever.
  • there's no prior art for something like this in the driver as far as I know.

Instead I'd suggest my test from earlier where we fail the transaction X times and assert the run time is greater than some threshold T. X should be large enough to reduce false positives where the test fails due to jitter resulting in a small delay for every retry.

We can calculate T by recording the command failed+succeeded events, summing their duration, and adding a fixed constant.

Author

Makes sense. It took me a bit, but I figured out what X and T are for Python. I don't know if they'll be the same for the other languages, though. Should the test description leave it as X and T, or should I put in the numbers that I have and see what others have to say about it?

Contributor

An alternative I'm considering in Node is to configure the random number generator to be deterministic for testing purposes, because this would make the tests both simpler to reason about and deterministic. For example, make random() always return 1; then we can make assertions on the timing of retries deterministically.

Author

Ohhh! I like that idea! Would you want that to replace the current proposed test idea, or is this in addition to the previously defined test?
I'm imagining this as a second test where we fail the transaction a handful of times, calculate the backoff times assuming random() always returns 1, and ensure that the transaction takes longer than the sum of the backoff times.

The main reason to keep the first test (where we don't configure random) is to just ensure that jitter is applied? If so, then should that first test be modified to assert that the transaction succeeds in < T where T is the sum of the maximum backoff times?

Then together we're effectively checking a minimum and maximum threshold on the backoff? Did that make sense?

Contributor

It depends exactly what we think we need to test here, I think. If the goal is only to test that retry backoff is enforced, I think your current implementation works.

Although I'd be worried about flakiness, because anything that relies on random values is inherently non-deterministic, and given the distribution of the delays it seems like flakes might be likely:

  • assuming jitter = 1, the max delay for the test is 3064ms
  • the last 4 retries account for almost 1750ms of delay in the test (330.8ms, 413.5ms, 500ms, 500ms)

So, short delays in the last few retries could make the test run a lot shorter than expected. A deterministic jitter would solve this problem. That, and the test is a bit simpler to reason about imo (I spent a bit of time trying to calculate the probability that this test fails assuming random() returns uniformly distributed values in [0,1], which goes away if random() is deterministic).

I like the idea of keeping both tests if we do want to make sure jitter is applied, but a deterministic alternative would be to make random() return a non-1 value (like .5) and assert that the test takes between [total sleep time with jitter = .5, total sleep time with jitter = 1 (or some other value)].

I also know this is feasible in Node, I'm not sure how feasible configuring the random() function would be in other drivers.

Author

I think theoretically the test described by Shane works, but I agree that flakiness is likely an issue. I tried to account for that with the "optionally change initial_backoff to a higher value", but I don't know how feasible that is across various drivers (hence the optional).

As you've stated, jitter on the later attempts has a higher impact than on the earlier ones, but I generally hope that over time the avg of random should be ~0.5, which sanity checked my initial 1.25 second timeout (discovered through guess and check, for better or worse) with random jitter. Clearly, though, we've been observing that it's not consistent enough across languages.

My journey in trying to get values that were consistent in Python, and seeing that they don't carry over to Node, seems to imply that if this test were actually implemented, each driver may have to find their own values of X and T, which feels silly imo.

All of that is to say, I think at this point I'm convinced a deterministic approach is the way to go for tests. Just to be clear, you are suggesting that the two tests be:

  • random always returns 1 and assert that the test takes more than total sleep time (with jitter = 1)
  • random always returns a non-1 value a, and assert that the test takes between [total sleep time with jitter = b, total sleep time with jitter = c (or some other value)].
    Does a = b, or can it be such that b <= a? Obviously c > a, correct?
    How do we want to decide on a, b, and c? Part of me worries that if the difference between total sleep time with c and total sleep time with a isn't big enough, the rest of the driver operations could push total test time above total sleep time with jitter = c. idk, I think I'm rambling now.

retries. Ensure that:

- 3 backoffs occurred
- each backoff was greater than or equal to 0
- the total operation time took more than the sum of the individual backoffs
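
As a rough illustration of the three assertions above, the following TypeScript sketch runs a simplified retry loop against a fake clock. Everything in it is hypothetical and not part of any driver API: `runWithBackoff`, `simulate`, the `TestHooks` interface, the injectable `random` hook (an idea raised in review, not something this test requires), and the 2ms cost per attempt. It follows the pseudo-code's convention that the first retry uses an exponent of 1, keeps `sleep` synchronous for brevity, and omits the 120-second retry deadline.

```typescript
// Hypothetical stand-in for a driver's retry loop; not a real driver API.
const BACKOFF_INITIAL_MS = 1;
const BACKOFF_MAX_MS = 500;
const TRANSIENT_LABEL = "TransientTransactionError";

interface TestHooks {
  random: () => number;        // jitter source, expected to return a value in [0, 1)
  sleep: (ms: number) => void; // synchronous only to keep the sketch short
}

// Treat any thrown object carrying the transient label as retryable.
function isTransient(err: unknown): boolean {
  return typeof err === "object" && err !== null &&
    (err as { label?: string }).label === TRANSIENT_LABEL;
}

function runWithBackoff<T>(attempt: () => T, hooks: TestHooks): T {
  for (let retry = 0; ; retry++) {
    if (retry > 0) {
      // Jittered delay that grows by 1.25x per retry and is capped at BACKOFF_MAX_MS.
      const cap = Math.min(BACKOFF_INITIAL_MS * Math.pow(1.25, retry), BACKOFF_MAX_MS);
      hooks.sleep(hooks.random() * cap);
    }
    try {
      return attempt();
    } catch (err) {
      if (!isTransient(err)) throw err;
      // The spec's 120-second retry deadline is intentionally left out of this sketch.
    }
  }
}

// Fail `failures` times, then succeed, while advancing a fake clock.
function simulate(failures: number, jitter: number) {
  let nowMs = 0;
  let remaining = failures;
  const backoffs: number[] = [];
  const result = runWithBackoff(
    () => {
      nowMs += 2; // assumed cost of one attempt, in fake milliseconds
      if (remaining-- > 0) throw { label: TRANSIENT_LABEL };
      return "committed";
    },
    {
      random: () => jitter,
      sleep: (ms) => { backoffs.push(ms); nowMs += ms; },
    }
  );
  return { result, backoffs, elapsedMs: nowMs };
}

const outcome = simulate(3, 0.5);
const totalBackoff = outcome.backoffs.reduce((sum, ms) => sum + ms, 0);
console.log(outcome.backoffs.length, totalBackoff, outcome.elapsedMs);
```

With three forced failures, `outcome.backoffs` holds three non-negative entries and `outcome.elapsedMs` exceeds their sum, which mirrors the checks listed above. A real driver test would measure wall-clock time instead of a fake clock.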

## Changelog

- 2025-10-17: Added Backoff test.
- 2024-09-06: Migrated from reStructuredText to Markdown.
- 2024-02-08: Converted legacy tests to unified format.
- 2021-04-29: Remove text about write concern timeouts from prose test.
@@ -99,7 +99,8 @@ has not been exceeded, the driver MUST retry a transaction that fails with an er
"TransientTransactionError" label. Since retrying the entire transaction will entail invoking the callback again,
drivers MUST document that the callback may be invoked multiple times (i.e. one additional time per retry attempt) and
MUST document the risk of side effects from using a non-idempotent callback. If the retry timeout has been exceeded,
drivers MUST NOT retry the transaction and allow `withTransaction` to propagate the error to its caller.
drivers MUST NOT retry the transaction and allow `withTransaction` to propagate the error to its caller. When retrying,
drivers MUST implement an exponential backoff with jitter following the algorithm described below.

If an error bearing neither the UnknownTransactionCommitResult nor the TransientTransactionError label is encountered at
any point, the driver MUST NOT retry and MUST allow `withTransaction` to propagate the error to its caller.
@@ -128,11 +129,21 @@ This method should perform the following sequence of actions:
6. If the callback reported an error:
1. If the ClientSession is in the "starting transaction" or "transaction in progress" state, invoke
[abortTransaction](../transactions/transactions.md#aborttransaction) on the session.

2. If the callback's error includes a "TransientTransactionError" label and the elapsed time of `withTransaction` is
less than 120 seconds, jump back to step two.
less than 120 seconds, sleep for `jitter * min(BACKOFF_INITIAL * (1.25**retry), BACKOFF_MAX)` where:

1. jitter is a random float in the range \[0, 1)
2. retry is one less than the number of times Step 2 has been executed since Step 1 was executed
3. BACKOFF_INITIAL is 1ms
4. BACKOFF_MAX is 500ms

Append this sleep duration to a list for testing purposes. Then, jump back to step two.

3. If the callback's error includes a "UnknownTransactionCommitResult" label, the callback must have manually
committed a transaction, propagate the callback's error to the caller of `withTransaction` and return
immediately.

4. Otherwise, propagate the callback's error to the caller of `withTransaction` and return immediately.
7. If the ClientSession is in the "no transaction", "transaction aborted", or "transaction committed" state, assume the
callback intentionally aborted or committed the transaction and return immediately.
@@ -154,11 +165,21 @@ This method should perform the following sequence of actions:
This method can be expressed by the following pseudo-code:

```typescript
var BACKOFF_INITIAL = 1; // 1ms initial backoff
var BACKOFF_MAX = 500; // 500ms max backoff
withTransaction(callback, options) {
    // Note: drivers SHOULD use a monotonic clock to determine elapsed time
    var startTime = Date.now(); // milliseconds since Unix epoch
    var retry = 0;
    this._transaction_retry_backoffs = []; // for testing purposes

    retryTransaction: while (true) {
        if (retry > 0) {
            // Sleep for a jittered, exponentially growing delay capped at BACKOFF_MAX
            var backoff = Math.random() * Math.min(BACKOFF_INITIAL * (1.25 ** retry), BACKOFF_MAX);
            this._transaction_retry_backoffs.push(backoff);
            sleep(backoff);
        }
        retry += 1;
        this.startTransaction(options); // may throw on error

        try {
```
@@ -324,8 +345,8 @@ exceed the user's original intention for `maxTimeMS`.
The callback may be executed any number of times. Drivers are free to encourage their users to design idempotent
callbacks.

A previous design had no limits for retrying commits or entire transactions. The callback is always able indicate that
`withTransaction` should return to its caller (without future retry attempts) by aborting the transaction directly;
A previous design had no limits for retrying commits or entire transactions. The callback is always able to indicate
that `withTransaction` should return to its caller (without future retry attempts) by aborting the transaction directly;
however, that puts the onus on avoiding very long (or infinite) retry loops on the application. We expect the most
common cause of retry loops will be due to TransientTransactionErrors caused by write conflicts, as those can occur
regularly in a healthy application, as opposed to UnknownTransactionCommitResult, which would typically be caused by an
@@ -338,6 +359,16 @@ non-configurable default and is intentionally twice the value of MongoDB 4.0's d
parameter (60 seconds). Applications that desire longer retry periods may call `withTransaction` additional times as
needed. Applications that desire shorter retry periods should not use this method.

### Backoff Benefits

Previously, the driver would retry transactions immediately, which is fine under low levels of contention. But as the
server load increases, immediate retries can result in retry storms, unnecessarily overloading the server further.

Exponential backoff is a well-researched and widely accepted backoff strategy that is simple to implement. A low initial
backoff (1 millisecond) and growth factor (1.25x) were chosen specifically to mitigate latency under low levels of
contention. Empirical evidence suggests that a 500-millisecond max backoff ensures that a transaction does not wait so
long as to exceed the 120-second timeout while still reducing load spikes.
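
To make these parameter choices concrete, the sketch below computes the per-retry delay and the cumulative delay for a given jitter value. The helper names `backoffMs` and `totalBackoffMs` are hypothetical and exist only for illustration. The exponent is passed in directly, and `totalBackoffMs` assumes exponents 1 through `retries`, matching the pseudo-code; the specification text remains the authority on which exponent the first retry uses.

```typescript
// Constants taken from the specification text; helper names are illustrative only.
const INITIAL_MS = 1;  // BACKOFF_INITIAL
const MAX_MS = 500;    // BACKOFF_MAX
const GROWTH = 1.25;   // growth factor per retry

// Delay for a single retry: jitter * min(INITIAL_MS * GROWTH^exponent, MAX_MS).
function backoffMs(exponent: number, jitter: number): number {
  return jitter * Math.min(INITIAL_MS * Math.pow(GROWTH, exponent), MAX_MS);
}

// Cumulative delay across exponents 1..retries for a fixed jitter value.
function totalBackoffMs(retries: number, jitter: number): number {
  let total = 0;
  for (let exponent = 1; exponent <= retries; exponent++) {
    total += backoffMs(exponent, jitter);
  }
  return total;
}

// Upper bounds (jitter = 1) for the first few retries and around the cap.
for (const exponent of [1, 2, 3, 27, 28]) {
  console.log(exponent, backoffMs(exponent, 1));
}
console.log("total upper bound for 3 retries:", totalBackoffMs(3, 1));
```

With these constants the uncapped delay stays below the maximum through exponent 27 (about 413.6ms) and is clamped to 500ms from exponent 28 onward, so early retries add only a few milliseconds of latency while sustained contention converges on the cap.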

## Backwards Compatibility

The specification introduces a new method on the ClientSession class and does not introduce any backward breaking
@@ -357,6 +388,8 @@ provides an implementation of a technique already described in the MongoDB 4.0 d

## Changelog

- 2025-10-17: withTransaction applies exponential backoff when retrying.

- 2024-09-06: Migrated from reStructuredText to Markdown.

- 2023-11-22: Document error handling inside the callback.