@tadjik1 tadjik1 commented Sep 9, 2025

Description

Summary of Changes

The Change Streams specification classifies network-related failures (including those that occur during elections) as resumable. In practice, some network outages do not surface as raw socket errors; instead, they can be translated into a MongoServerSelectionError after the driver is unable to select a server within serverSelectionTimeoutMS.

Today, our ChangeStream treats MongoServerSelectionError as non-resumable and throws, which can prematurely terminate an otherwise resumable stream. This PR updates the behavior so that MongoServerSelectionError is handled as resumable: the stream uses its cached resume token to resume from the correct point once a suitable server becomes available again.

This aligns with the Node.js driver’s behavior of:

  • Caching the resume token for each event (the token is stored on the change event’s _id field and cached internally by the driver), and
  • Automatically attempting to reestablish connections in the face of transient network errors or elections (which can surface as MongoServerSelectionError), resuming from the most recent cached resume token so that no events are lost.
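
The classification change described above can be sketched roughly as follows. This is a simplified illustration, not the driver's actual implementation: the two error classes are local stand-ins for the driver's classes of the same name, `isResumableBefore` and `isResumableAfter` are hypothetical helper names, and the driver's real check also considers server error codes and labels, which are omitted here.

```javascript
// Local stand-ins for the driver's error classes (assumption: in real code
// these would come from the driver, not be declared here).
class MongoNetworkError extends Error {}
class MongoServerSelectionError extends Error {}

// Simplified view of the old behavior: only raw network errors are resumable,
// so a server selection timeout terminates the stream.
function isResumableBefore(error) {
  return error instanceof MongoNetworkError;
}

// Simplified view of the new behavior: a server selection timeout is also
// treated as resumable, so the stream retries from its cached resume token.
function isResumableAfter(error) {
  return (
    error instanceof MongoNetworkError ||
    error instanceof MongoServerSelectionError
  );
}
```

Under this sketch, an outage that surfaces as a selection timeout is no longer fatal to the stream, while unrelated errors remain non-resumable.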

Notes for Reviewers

A reliable way to surface this bug is to make the current primary unreachable to the driver and then force an election:

  • Block heartbeats to the primary by failing ping and hello via failCommand.
  • Force an election with replSetStepDown.

The driver can no longer reach the old primary and, after serverSelectionTimeoutMS (default: 30s), throws MongoServerSelectionError. Prior to this change, that error would be treated as non-resumable.

To avoid interfering with other tests and to scope failpoints precisely, the test run generates a unique appName and uses it both in the driver and in the failpoint configuration. This allows us to leave the failpoint on the original primary without extra cleanup. Maintaining a direct connection to the primary isn’t feasible in a general replica set connection string scenario, so appName scoping is the cleaner approach here.
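
A rough sketch of the command documents such a reproduction could use is shown below. The exact values in the PR's test are not shown here, so everything concrete is an assumption: the helper name `buildFailPoint`, the appName prefix, the choice of `closeConnection: true` as the failure mode, and the 60-second step-down period are all illustrative.

```javascript
const { randomUUID } = require('node:crypto');

// Hypothetical helper: builds a failCommand failpoint that only affects
// connections whose appName matches, so other tests are left untouched.
function buildFailPoint(appName) {
  return {
    configureFailPoint: 'failCommand',
    mode: 'alwaysOn',
    data: {
      failCommands: ['ping', 'hello'], // block heartbeats to the primary
      closeConnection: true, // assumption: fail by dropping the connection
      appName
    }
  };
}

// A unique appName per test run (the prefix is illustrative).
const appName = `change-stream-resume-${randomUUID()}`;
const failPoint = buildFailPoint(appName);

// Forces the current primary to step down (60s is an illustrative value).
const stepDown = { replSetStepDown: 60, force: true };
```

Both documents would be run as admin commands against the current primary, with the same `appName` also passed in the client options of the driver under test.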

See NODE-6858 for more information.

Release Highlight

Change Streams now resume on MongoServerSelectionError

When the driver encounters a MongoServerSelectionError while processing a Change Stream (e.g., due to a transient network issue or during an election), it now treats the error as resumable and attempts to resume using the latest cached resume token.

This applies to both iterator and event-emitter usage:

// Iterator form
const changeStream = collection.watch([]);
for await (const change of changeStream) {
  // process change
}

// Event-emitter form
const changeStream = collection.watch([]);
changeStream.on('change', (change) => {
  // process change
});

There are no API changes. If you previously caught MongoServerSelectionError and implemented manual resume logic, you can now rely on the driver’s built-in resume mechanism, which uses the cached resume token from the change event’s _id to continue without losing events.
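
For reference, manual resume logic of the kind that should no longer be needed might have looked like the sketch below. It is illustrative only: `watchWithManualResume` is a hypothetical helper, and the error class is a local stand-in so the snippet is self-contained (real code would use the driver's MongoServerSelectionError).

```javascript
// Local stand-in for the driver's error class (assumption for this sketch).
class MongoServerSelectionError extends Error {}

// Hypothetical helper: reopens the change stream from the last seen resume
// token whenever server selection times out.
async function watchWithManualResume(collection, onChange) {
  let resumeToken; // the most recent change event's _id
  for (;;) {
    const options = resumeToken ? { resumeAfter: resumeToken } : {};
    const changeStream = collection.watch([], options);
    try {
      for await (const change of changeStream) {
        resumeToken = change._id; // cache the token before processing
        await onChange(change);
      }
      return; // the stream ended normally
    } catch (error) {
      // Anything other than a selection timeout is rethrown to the caller.
      if (!(error instanceof MongoServerSelectionError)) throw error;
    } finally {
      await changeStream.close();
    }
  }
}
```

With this release, a plain `for await` loop over `collection.watch([])` is expected to cover the same case without the wrapper.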

Huge thanks to @grossbart for bringing this bug to our attention, investigating it, and sharing code to reproduce it!

Double check the following

  • Lint is passing (npm run check:lint)
  • Self-review completed using the steps outlined here
  • PR title follows the correct format: type(NODE-xxxx)[!]: description
    • Example: feat(NODE-1234)!: rewriting everything in coffeescript
  • Changes are covered by tests
  • New TODOs have a related JIRA ticket

@tadjik1 tadjik1 force-pushed the NODE-6858 branch 2 times, most recently from 7539b5d to 69c8141 Compare September 12, 2025 14:11
@tadjik1 tadjik1 changed the title fix(NODE-6858): handle ServerSelectionError in ChangeStream fix(NODE-6858): Treat ServerSelectionError as a resumable error for Change Streams Sep 15, 2025
@tadjik1 tadjik1 changed the title fix(NODE-6858): Treat ServerSelectionError as a resumable error for Change Streams fix(NODE-6858): Treat MongoServerSelectionError as a resumable error for Change Streams Sep 15, 2025
@tadjik1 tadjik1 marked this pull request as ready for review September 15, 2025 14:07
@tadjik1 tadjik1 requested a review from a team as a code owner September 15, 2025 14:07
@tadjik1 tadjik1 changed the title fix(NODE-6858): Treat MongoServerSelectionError as a resumable error for Change Streams fix(NODE-6858): treat MongoServerSelectionError as a resumable error for Change Streams Sep 15, 2025
@durran durran self-assigned this Sep 16, 2025
@durran durran added the Primary Review In Review with primary reviewer, not yet ready for team's eyes label Sep 16, 2025
@durran durran merged commit c6d64e7 into main Sep 16, 2025
23 of 25 checks passed
@durran durran deleted the NODE-6858 branch September 16, 2025 11:37