fix(NODE-6858): treat MongoServerSelectionError as a resumable error for Change Streams #4653
Description
Summary of Changes
The Change Streams specification classifies network-related failures (including those that occur during elections) as resumable. In practice, some network outages do not surface as raw socket errors; instead, they can be translated into a `MongoServerSelectionError` after the driver is unable to select a server within `serverSelectionTimeoutMS`.

Today, our ChangeStream treats `MongoServerSelectionError` as non-resumable and throws, which can prematurely terminate an otherwise resumable stream. This PR updates the behavior so that `MongoServerSelectionError` is handled as resumable: the stream uses its cached resume token to resume from the correct point once a suitable server becomes available again.

This aligns with the Node.js driver’s behavior of:
[...] `MongoServerSelectionError`), resuming from the most recent cached resume token so that no events are lost.

Notes for Reviewers
A reliable way to surface this bug is to make the current primary unreachable to the driver and then force an election:
1. Fail `ping` and `hello` via `failCommand`.
2. Force an election via `replSetStepDown`.

The driver can no longer reach the old primary and, after `serverSelectionTimeoutMS` (default: 30s), throws `MongoServerSelectionError`. Prior to this change, that error would be treated as non-resumable.

To avoid interfering with other tests and to scope failpoints precisely, the test run generates a unique `appName` and uses it both in the driver and the failpoint configuration. This allows us to leave the failpoint on the original primary without extra cleanup. Maintaining a direct connection to the primary isn’t feasible in a general replica set connection string scenario, so `appName` scoping is the cleaner approach here.

See NODE-6858 for more information.
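For reviewers who want to picture the failpoint setup described in these notes, here is a minimal sketch of the command documents involved. The helper names (`uniqueAppName`, `buildFailPoint`) are hypothetical, and the `mode`, `closeConnection`, and `replSetStepDown` values are assumptions chosen for illustration; the PR's actual test code may differ.

```typescript
// Shape of a `failCommand` failpoint document, limited to the fields used here.
interface FailCommandFailPoint {
  configureFailPoint: 'failCommand';
  mode: 'alwaysOn' | 'off' | { times: number };
  data: {
    failCommands: string[];
    closeConnection?: boolean;
    appName?: string;
  };
}

// Hypothetical helper: generate an appName that no other test run will share.
function uniqueAppName(prefix = 'change-stream-resume-test'): string {
  const suffix = Math.random().toString(36).slice(2, 10);
  return `${prefix}-${Date.now()}-${suffix}`;
}

// Hypothetical helper: build a failpoint that only affects connections
// whose handshake carried the given appName.
function buildFailPoint(appName: string): FailCommandFailPoint {
  return {
    configureFailPoint: 'failCommand',
    mode: 'alwaysOn', // assumption: left on, since appName scoping limits its reach
    data: {
      failCommands: ['ping', 'hello'],
      closeConnection: true, // assumption: simulate a network failure
      appName
    }
  };
}

const appName = uniqueAppName();
const failPoint = buildFailPoint(appName);
// Assumed step-down parameters; any values that force an election would do.
const stepDown = { replSetStepDown: 60, force: true };

console.log(JSON.stringify({ appName, failPoint, stepDown }, null, 2));
```

In a test, the same `appName` would presumably be passed as the `appName` client option, and the two command documents sent to the primary's `admin` database, in that order.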
Release Highlight
Change Streams now resume on `MongoServerSelectionError`
When the driver encounters a `MongoServerSelectionError` while processing a Change Stream (e.g., due to a transient network issue or during an election), it now treats the error as resumable and attempts to resume using the latest cached resume token.

This applies to both iterator and event-emitter usage.
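As a rough illustration of the two usage styles, the sketch below consumes a change stream once as an async iterator and once through event listeners. The interfaces `ChangeEvent` and `ChangeEmitterLike` and both function names are hypothetical structural stand-ins so the snippet compiles without the driver installed; they are not the driver's own types.

```typescript
// Minimal stand-in for a change event; real events carry more fields.
interface ChangeEvent {
  _id: unknown; // the resume token
  operationType: string;
}

// Minimal stand-in for the event-emitter side of a change stream.
interface ChangeEmitterLike {
  on(event: 'change' | 'error', listener: (payload: any) => void): unknown;
}

// Iterator style: resumable errors are handled inside the stream, so the
// loop only ends on close or throws on a non-resumable error.
async function consumeWithIterator(
  stream: AsyncIterable<ChangeEvent>,
  handle: (change: ChangeEvent) => void
): Promise<void> {
  for await (const change of stream) {
    handle(change);
  }
}

// Event-emitter style: 'change' fires per event, and 'error' is left for
// errors the stream could not resume from.
function consumeWithEvents(
  stream: ChangeEmitterLike,
  handle: (change: ChangeEvent) => void,
  onFatal: (error: Error) => void
): void {
  stream.on('change', handle);
  stream.on('error', onFatal);
}
```

With the driver, the object returned by `collection.watch()` is the kind of stream these functions are meant for; a given stream should be consumed in one style or the other, not both.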
There are no API changes. If you previously caught `MongoServerSelectionError` and implemented manual resume logic, you can now rely on the driver’s built-in resume mechanism, which uses the cached resume token from the change event’s `_id` to continue without losing events.

Huge thanks to @grossbart for bringing this bug to our attention, investigating it, and sharing code to reproduce it!
Double check the following
- Lint is passing (`npm run check:lint`)
- PR title follows the format `type(NODE-xxxx)[!]: description`
- Example: `feat(NODE-1234)!: rewriting everything in coffeescript`