@kevinAlbs kevinAlbs commented Jan 7, 2026

Summary

Break the change stream resume loop after two socket timeouts.

Background & Motivation

DRIVERS-1404 describes the problem this PR intends to address:

If the getMore needs 5 minutes to do this and the socket timeout on the MongoClient is 1 min, I think we'll enter a repeated resume loop because the aggregate will succeed and the getMore will always time out.

This scenario has known impact (HELP-83560). SERVER-48526 may help address this in the server. This PR proposes a driver solution: break after two timeouts in the resume loop. This is similarly suggested in DRIVERS-1309:

short-circuiting the retry loop after two socket timeouts (not necessarily consecutive) seems like a good compromise.
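A minimal sketch of the short-circuit, assuming a scripted sequence of attempt outcomes in place of real `aggregate`/`getMore` calls. `attempt_result_t`, `MAX_SOCKET_TIMEOUTS`, and `simulate_resume_loop` are hypothetical names for illustration, not libmongoc API:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

// Hypothetical outcome of one getMore attempt; not a libmongoc type.
typedef enum {
   ATTEMPT_OK,
   ATTEMPT_SOCKET_TIMEOUT,
   ATTEMPT_OTHER_RESUMABLE_ERROR
} attempt_result_t;

// Assumed cap, taken from the PR summary: give up after two socket timeouts.
enum { MAX_SOCKET_TIMEOUTS = 2 };

// Walks a scripted sequence of attempt outcomes the way a resume loop would.
// Returns true if some attempt succeeded, false if the loop gave up.
// *attempts_made receives the number of attempts consumed.
static bool
simulate_resume_loop (const attempt_result_t *script, size_t script_len, size_t *attempts_made)
{
   assert (script || script_len == 0);
   assert (attempts_made);

   int timeout_count = 0;
   *attempts_made = 0;

   for (size_t i = 0; i < script_len; i++) {
      (*attempts_made)++;
      if (script[i] == ATTEMPT_OK) {
         return true;
      }
      if (script[i] == ATTEMPT_SOCKET_TIMEOUT) {
         timeout_count++; // Counted across the loop, not only when consecutive.
      }
      if (timeout_count >= MAX_SOCKET_TIMEOUTS) {
         return false; // Second timeout: stop resuming.
      }
      // Otherwise the error is treated as resumable: "resume" and try again.
   }
   return false; // Script exhausted without success.
}
```

With a script in which every attempt times out, the loop stops after the second attempt instead of resuming indefinitely. A timeout, another resumable error, and then a second timeout also stops the loop, since the two timeouts need not be consecutive.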

To propagate a timeout from mongoc_stream_t, a bool is set on mongoc_server_stream_t. The mongoc_stream_t is destroyed after a network error in _handle_network_error, so it cannot be checked later with mongoc_stream_timed_out. The bool is likewise propagated to mongoc_cursor_t, which only retrieves a mongoc_server_stream_t for the duration of a command.
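The propagation can be illustrated with simplified stand-in types. Everything prefixed `fake_` below is hypothetical and only mirrors the shape of the description above; it is not the driver's actual structs or functions:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

// Simplified, hypothetical stand-ins; not libmongoc's structs.
typedef struct {
   bool timed_out; // Set by the (simulated) socket layer.
} fake_stream_t;

typedef struct {
   fake_stream_t *stream;   // Destroyed on network error.
   bool had_stream_timeout; // Survives after `stream` is gone.
} fake_server_stream_t;

typedef struct {
   bool had_stream_timeout; // Copied from the server stream after each command.
} fake_cursor_t;

// Simulated network-error handler: record the timeout, then destroy the stream.
static void
fake_handle_network_error (fake_server_stream_t *ss)
{
   assert (ss && ss->stream);
   ss->had_stream_timeout = ss->stream->timed_out; // Must be read before free.
   free (ss->stream);
   ss->stream = NULL;
}

// Simulated command: holds a server stream only while the command runs.
// `simulate_timeout` is a test knob. Returns true if the command succeeded.
static bool
fake_cursor_run_command (fake_cursor_t *cursor, bool simulate_timeout)
{
   assert (cursor);
   cursor->had_stream_timeout = false; // Reset before running next command.

   fake_server_stream_t ss = {.stream = calloc (1, sizeof (fake_stream_t)), .had_stream_timeout = false};
   assert (ss.stream);

   bool ok = true;
   if (simulate_timeout) {
      ss.stream->timed_out = true;
      fake_handle_network_error (&ss);
      ok = false;
   }

   cursor->had_stream_timeout = ss.had_stream_timeout; // Propagate to cursor.
   free (ss.stream); // free (NULL) is a no-op after a network error.
   return ok;
}
```

The flag is copied at each hand-off because the object that originally knew about the timeout no longer exists by the time the resume logic runs.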

Rejected alternative: bound resume loop by socketTimeoutMS

Another idea considered was to bound the resume loop by the duration of socketTimeoutMS. Bounding by socketTimeoutMS may better align with timeoutMS:

If a resume is required for a next call on a change stream, the timeout MUST apply to the entirety of the initial getMore and all commands sent as part of the resume attempt.

However, I think it moves socketTimeoutMS further from its documented behavior "to send or receive on a socket". This idea was rejected.


ctx.is_command = is_command;
matches = match_bson_with_ctx(doc, pattern, &ctx);
bson_t empty = BSON_INITIALIZER;
Collaborator Author

Drive-by fix to support NULL for doc, matching the comment above this function:

A NULL doc or NULL json_pattern means "{}".

@kevinAlbs kevinAlbs force-pushed the 2timeouts.D1404.changestream branch from 9fd63a6 to df213b8 Compare January 8, 2026 13:32
@kevinAlbs kevinAlbs marked this pull request as ready for review January 8, 2026 13:32
@kevinAlbs kevinAlbs requested a review from a team as a code owner January 8, 2026 13:33
Contributor

@eramongodb eramongodb left a comment

Minor suggestions; otherwise, LGTM.


bson_error_t error;
bson_t error_doc; /* always initialized, and set with server errors. */
Contributor

Suggested change
- bson_t error_doc; /* always initialized, and set with server errors. */
+ bson_t error_doc; // always initialized, and set with server errors.

Local comment style consistency.

Comment on lines 525 to 538
if (err_doc) {
   resumable = _is_resumable_error(stream, err_doc);
   if (stream->cursor->had_stream_timeout) {
      iteration_timeout_count++;
   }
} else {
   resumable = false;
}

if (iteration_timeout_count >= 2) {
   // CDRIVER-6182: Do not resume if two iteration timeouts occur. Intended to avoid a possible resume loop
   // when `aggregate` succeeds but `getMore` consistently times out.
   resumable = false;
}
Contributor

Suggested change
- if (err_doc) {
-    resumable = _is_resumable_error(stream, err_doc);
-    if (stream->cursor->had_stream_timeout) {
-       iteration_timeout_count++;
-    }
- } else {
-    resumable = false;
- }
- if (iteration_timeout_count >= 2) {
-    // CDRIVER-6182: Do not resume if two iteration timeouts occur. Intended to avoid a possible resume loop
-    // when `aggregate` succeeds but `getMore` consistently times out.
-    resumable = false;
- }
+ if (err_doc) {
+    if (stream->cursor->had_stream_timeout) {
+       iteration_timeout_count++;
+    }
+    // CDRIVER-6182: Do not resume if two iteration timeouts occur. Intended to avoid a possible resume loop
+    // when `aggregate` succeeds but `getMore` consistently times out.
+    if (iteration_timeout_count >= 2) {
+       resumable = false;
+    } else {
+       resumable = _is_resumable_error(stream, err_doc);
+    }
+ } else {
+    resumable = false;
+ }

Suggest tweaking the layout of the blocks as shown so that resumable is consistently set by the time the if (err_doc) block has been evaluated, by moving each resumable = ... assignment to the end of its branch.

Contributor

@eramongodb eramongodb Jan 9, 2026

One more suggestion: perhaps the CDRIVER-6182 branch should include a MONGOC_WARNING to notify users when this scenario is triggered?

Collaborator Author

Suggest tweaking the layout

This led me to realize the if (err_doc) check was redundant. The block is only entered if mongoc_cursor_error_document previously returned true. mongoc_cursor_error_document guarantees:

If the function returns true and reply is not NULL, then reply is set to a pointer to a BSON document, which is either empty or the server’s error response.

Changed if (err_doc) to BSON_ASSERT (err_doc) and also applied the suggested tweaking to use if/else so resumable is only set once.

One more suggestion

I like that idea. Added a warning.
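For readers following along, a rough sketch of the resulting shape, assuming simplified inputs. `fake_error_info_t` and `decide_resumable` are hypothetical, plain `assert` stands in for BSON_ASSERT, and `fprintf` to stderr stands in for MONGOC_WARNING. This is not the merged code, and the warning text is made up:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

// Hypothetical, simplified inputs. The real code inspects a bson_t error
// document and a cursor; `error_is_resumable` stands in for the result of
// _is_resumable_error.
typedef struct {
   bool had_stream_timeout;
   bool error_is_resumable;
} fake_error_info_t;

// Decides whether to resume, assigning `resumable` exactly once.
// `timeout_count` persists across iterations of the caller's resume loop.
static bool
decide_resumable (const fake_error_info_t *err, int *timeout_count)
{
   assert (err); // Analogous to BSON_ASSERT (err_doc).
   assert (timeout_count);

   bool resumable;
   if (err->had_stream_timeout) {
      (*timeout_count)++;
   }
   if (*timeout_count >= 2) {
      // Stand-in for MONGOC_WARNING; the message wording is illustrative only.
      fprintf (stderr, "change stream: not resuming after %d socket timeouts\n", *timeout_count);
      resumable = false;
   } else {
      resumable = err->error_is_resumable;
   }
   return resumable;
}
```

Assigning `resumable` in an if/else, rather than overwriting it afterwards, makes it harder for a later edit to leave the variable unset or set twice on some path.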


ENTRY;

cursor->had_stream_timeout = false; // Reset before running next command.
Contributor

Suggest moving further below just before the if (parts.assembled.session) block to group together with cursor-state-modifying operations, e.g. cursor->client_session = ...;, cursor->explicit_session = ...;, etc.

Collaborator Author

@kevinAlbs kevinAlbs left a comment

The latest changes apply skip_if_high_server_runtime_variance from #2195.

@kevinAlbs kevinAlbs merged commit 230f8bd into mongodb:master Jan 13, 2026
46 of 49 checks passed