
Conversation

@mbroadst (Member) commented Oct 30, 2019

Description

The fundamental issue here is that when using the unified topology, if a node is "lost" (due to a network error, etc.) it is never regained. This was caused by some complicated error handling, primarily around the events propagated from the connection pool. I'll try to summarize what changed (or needed to change) in the next section.

What changed?

  • State machines for the Pool, Server, and Topology types were formalized and are now generated by a shared helper (see the first sketch after this list). The Pool gained a new state, DRAINING, which distinguishes whether new commands may still be written to the pool, in particular during the scenario where fire-and-forget messages are sent right before pool destruction.
  • The Topology no longer listens for error or close events from the Pool. All errors are now propagated through callbacks, which allows us to be very specific about how each type of error is handled. This was the root cause of the issue addressed by this PR: all errors used to be funneled through a common error handler, which reset the server in the SDAM flow even during monitoring checks.
  • The Topology no longer uses a pool in forced-reconnect mode. This was the second part of the root cause: failed attempts at things like monitoring would cause retries, which would in turn trigger the SDAM flow. We now depend explicitly on the retry loop of retryable operations (see the second sketch after this list).
  • The logic for "resetting a server" was factored out and made an explicit step of handling server description updates.
  • As part of this change, NODE-2214 was resolved, where Unknown servers were erroneously removed from the topology description.
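To make the first bullet concrete, here is a minimal sketch of generating state machines from a transition table. The `makeStateMachine` name and the exact state names are illustrative assumptions, not necessarily the driver's actual implementation:

```js
// a sketch of a shared state-machine factory: given a table of legal
// transitions, return a function that enforces them on a target object
function makeStateMachine(transitions) {
  return function stateTransition(target, newState) {
    const legalStates = transitions[target.state];
    if (!legalStates || !legalStates.includes(newState)) {
      throw new TypeError(
        `illegal state transition from [${target.state}] => [${newState}]`
      );
    }
    target.state = newState;
  };
}

// illustrative pool states, including the new DRAINING state that blocks
// new commands while queued fire-and-forget work is allowed to finish
const transitionPoolState = makeStateMachine({
  connecting: ['connected', 'draining', 'destroying'],
  connected: ['draining', 'destroying'],
  draining: ['destroying'],
  destroying: ['destroyed'],
  destroyed: []
});

const pool = { state: 'connecting' };
transitionPoolState(pool, 'connected'); // ok
transitionPoolState(pool, 'draining');  // ok: no new commands accepted now
// transitionPoolState(pool, 'connected'); // throws: illegal transition
```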
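And a hedged sketch of the retry loop mentioned in the third bullet; `selectServer`, `executeOnServer`, and `isRetryableError` are hypothetical stand-ins for the driver's internals, not its actual API:

```js
// rely on explicit operation-level retry instead of pool auto-reconnect
function executeWithRetry(topology, operation, callback) {
  topology.selectServer(operation, (err, server) => {
    if (err) return callback(err);

    executeOnServer(server, operation, (err, result) => {
      if (err == null) return callback(null, result);
      if (!isRetryableError(err)) return callback(err);

      // retry once, re-running server selection so a server marked Unknown
      // can be rediscovered via SDAM rather than a forced pool reconnect
      topology.selectServer(operation, (err, server) => {
        if (err) return callback(err);
        executeOnServer(server, operation, callback);
      });
    });
  });
}
```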

Are there any files to ignore?
A number of test changes were required that were out of scope for this change but caused test failures nonetheless. Specifically, ignore the changes to the transactions tests, as well as a new tool called run_each_test.sh, which is just a helper for identifying individual test files that leak handles in Node.

@mbroadst force-pushed the NODE-2274/unified-failover-server-loss branch from 4213db0 to 6d6fe93 on October 31, 2019 01:29
```js
class UnifiedTopologyFilter {
  filter(test) {
    if (!test.metadata) return true;
    // …
```
Contributor:

nit: replace all of this with:

```js
filter(test) {
  const unifiedTopology =
    test.metadata &&
    test.metadata.requires &&
    test.metadata.requires.unifiedTopology;

  // NOTE: env vars are strings, so coerce before comparing to the
  // boolean metadata flag
  return (
    typeof unifiedTopology !== 'boolean' ||
    unifiedTopology === (process.env.MONGODB_UNIFIED_TOPOLOGY === 'true')
  );
}
```

@mbroadst (Member Author):

done


```js
function createConnection(pool, callback) {
  if (pool.state === DESTROYED || pool.state === DESTROYING) {
    if (pool.state === DESTROYED) {
      // …
```
Contributor:

Do we want connections to be creatable when the pool is DESTROYING?

@mbroadst force-pushed the NODE-2274/unified-failover-server-loss branch 5 times, most recently from 366c166 to b3899cc on November 1, 2019 13:45
The `close` event was erroneously emitted every time a child server
closed, which was inconsistent with the legacy topology's behavior.
The event is now emitted only when the topology itself is closed.
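A minimal sketch of the corrected behavior, using a hypothetical `SketchTopology` class rather than the driver's actual Topology:

```js
'use strict';
const { EventEmitter } = require('events');

class SketchTopology extends EventEmitter {
  constructor(servers) {
    super();
    this.servers = servers;
  }

  close() {
    // destroy each child server without re-emitting their close events
    this.servers.forEach(server => server.destroy());
    // emit a single `close` for the topology itself
    this.emit('close');
  }
}

const topology = new SketchTopology([{ destroy() {} }, { destroy() {} }]);
topology.on('close', () => console.log('topology closed')); // fires once
topology.close();
```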

NODE-2251
For legacy reasons the unified topology forced the connection pool
into auto-reconnect mode by default. This caused failed server
checks to continue emitting errors on the server, so the server lost
track of its monitoring state and the node was never returned to the
pool of selectable servers. Client-side, this surfaced as a server
selection timeout error.

NODE-2274
We have some edge cases in our testing where `endSessions` is sent
during `destroy`, but the pool might not have enough open
connections in that case.
If no host or port is provided, then `newTopology` should use all
hosts provided in a connection string.
This script runs each test by itself through mocha, making it much
easier to spot when a test leaks connections.
Sometimes we request operations as fire-and-forget right before the
pool is destroyed (`endSessions` is a good example). In a graceful
destruction the pool still needs to account for these operations,
so a new state, `draining`, was introduced to prevent new operations
while allowing the pool to drain existing queued work.
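A minimal sketch of the draining idea, with a hypothetical `SketchPool`; the real pool's states and APIs differ:

```js
const READY = 'ready';
const DRAINING = 'draining';
const DESTROYED = 'destroyed';

class SketchPool {
  constructor() {
    this.state = READY;
    this.queue = [];
  }

  // reject new work unless the pool is ready
  write(operation) {
    if (this.state !== READY) {
      throw new Error(`cannot write to pool in state "${this.state}"`);
    }
    this.queue.push(operation);
  }

  // graceful destruction: flip to DRAINING so no new writes are accepted,
  // run the work that was queued before destruction, then mark DESTROYED
  destroy() {
    this.state = DRAINING;
    while (this.queue.length) {
      this.queue.shift()();
    }
    this.state = DESTROYED;
  }
}

const pool = new SketchPool();
pool.write(() => console.log('endSessions sent')); // queued fire-and-forget
pool.destroy(); // queued work still runs; subsequent writes throw
```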
@mbroadst force-pushed the NODE-2274/unified-failover-server-loss branch from e6b78d8 to 87bd12f on November 2, 2019 12:08
@mbroadst force-pushed the NODE-2274/unified-failover-server-loss branch from 78acb68 to 471b1d0 on November 3, 2019 16:03
@mbroadst force-pushed the NODE-2274/unified-failover-server-loss branch from 471b1d0 to 060b90f on November 3, 2019 16:05
@mbroadst force-pushed the NODE-2274/unified-failover-server-loss branch from 397e8ef to f57b9e2 on November 4, 2019 13:38
@mbroadst requested a review from daprahamian on November 4, 2019 13:43
@imlucas (Contributor) commented Nov 5, 2019

nit: it would be swell to have more explicit error messages for debuggability. For example:

callback(new MongoError('Cannot execute a command when the server is closed'));

instead of:

callback(new MongoError('server is closed'));

@mbroadst (Member Author) commented Nov 5, 2019

Thanks @imlucas. This particular error should only ever show up as a reason, so I think we're safe being less specific here.

As an aside, I think the error is also sufficient from the perspective of what the stack trace would look like, now that we have a very descriptive command dispatch path. It would read like executeOperation => selectServer => command => error: server is closed.

@mbroadst merged commit 71a0270 into master on Nov 5, 2019
@mbroadst deleted the NODE-2274/unified-failover-server-loss branch on November 5, 2019 17:12
@imlucas (Contributor) commented Nov 5, 2019

@mbroadst thanks!

🤔 It would be sweet to have a little util to inspect the stack → message that includes the dispatch path. It could solve a problem we have in Compass today, where we only look at/show err.message. Adding this to my list to think more about and moving it off this thread.
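For illustration only, a hypothetical helper along those lines (not an existing driver or Compass API), deriving a dispatch path from `err.stack`:

```js
// sketch: turn an Error's stack into `outermost => ... => innermost`
function dispatchPath(err) {
  return (err.stack || '')
    .split('\n')
    .slice(1) // drop the `Error: message` line
    .map(line => line.trim().replace(/^at\s+/, '').split(' ')[0])
    .reverse() // outermost call first
    .join(' => ');
}

try {
  throw new Error('server is closed');
} catch (err) {
  console.log(`${dispatchPath(err)} => error: ${err.message}`);
}
```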
