
Conversation

@mbroadst (Member) commented Oct 30, 2019

Description

The fundamental issue here is that when using the unified topology, if a node is "lost" (due to a network error, etc.) it is never regained. This was caused by some complicated error handling, primarily around the events propagated from the connection pool. I'll try to summarize what changed (or needed to change) in the next section.

What changed?

  • State machines for the Pool, Server, and Topology types were formalized and are now generated by a shared helper (see the first sketch after this list). The Pool gained a new state, DRAINING, which distinguishes whether new commands may still be written to the pool, in particular during the scenario where fire-and-forget messages are sent right before pool destruction.
  • The Topology no longer listens for error or close events from the Pool. All errors are now propagated through callbacks, which allows us to be very specific about how each type of error is handled. This was the root cause of the issue addressed by this PR: all errors used to be funneled through a common error handler, which reset the server in the SDAM flow even during monitoring checks.
  • The Topology no longer uses a pool in forced-reconnect mode. This was the second part of the root cause: failed attempts at things like monitoring would cause retries, which would in turn trigger the SDAM flow. We now depend explicitly on the retry loop of retryable operations (see the second sketch after this list).
  • The logic for "resetting a server" was factored out and made an explicit step of handling server description updates.
  • As part of this change, NODE-2214 was resolved, where Unknown servers were erroneously removed from the topology description.
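To make the first bullet concrete, here is a minimal sketch of generating state machines from a transition table. The `makeStateMachine` name and the exact state names are illustrative assumptions, not necessarily the driver's actual implementation:

```js
// a sketch of a shared state-machine factory: given a table of legal
// transitions, return a function that enforces them on a target object
function makeStateMachine(transitions) {
  return function stateTransition(target, newState) {
    const legalStates = transitions[target.state];
    if (!legalStates || !legalStates.includes(newState)) {
      throw new TypeError(
        `illegal state transition from [${target.state}] => [${newState}]`
      );
    }
    target.state = newState;
  };
}

// illustrative pool states, including the new DRAINING state that blocks
// new commands while queued fire-and-forget work is allowed to finish
const transitionPoolState = makeStateMachine({
  connecting: ['connected', 'draining', 'destroying'],
  connected: ['draining', 'destroying'],
  draining: ['destroying'],
  destroying: ['destroyed'],
  destroyed: []
});

const pool = { state: 'connecting' };
transitionPoolState(pool, 'connected'); // ok
transitionPoolState(pool, 'draining');  // ok: no new commands accepted now
// transitionPoolState(pool, 'connected'); // throws: illegal transition
```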
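And a hedged sketch of the retry loop mentioned in the third bullet; `selectServer`, `executeOnServer`, and `isRetryableError` are hypothetical stand-ins for the driver's internals, not its actual API:

```js
// rely on explicit operation-level retry instead of pool auto-reconnect
function executeWithRetry(topology, operation, callback) {
  topology.selectServer(operation, (err, server) => {
    if (err) return callback(err);

    executeOnServer(server, operation, (err, result) => {
      if (err == null) return callback(null, result);
      if (!isRetryableError(err)) return callback(err);

      // retry once, re-running server selection so a server marked Unknown
      // can be rediscovered via SDAM rather than a forced pool reconnect
      topology.selectServer(operation, (err, server) => {
        if (err) return callback(err);
        executeOnServer(server, operation, callback);
      });
    });
  });
}
```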

Are there any files to ignore?
A number of test changes were required that were out of scope for this change but caused test failures nonetheless. Specifically, ignore the changes to the transactions tests, as well as a new tool called run_each_test.sh, which is just a helper for identifying individual test files that leak handles in Node.

@mbroadst force-pushed the NODE-2274/unified-failover-server-loss branch from 4213db0 to 6d6fe93 on October 31, 2019 01:29
```js
class UnifiedTopologyFilter {
  filter(test) {
    if (!test.metadata) return true;
    // …
```
Contributor:

nit: replace all of this with:

```js
filter(test) {
  const unifiedTopology =
    test.metadata &&
    test.metadata.requires &&
    test.metadata.requires.unifiedTopology;

  // NOTE: env vars are strings, so coerce before comparing to the
  // boolean metadata flag
  return (
    typeof unifiedTopology !== 'boolean' ||
    unifiedTopology === (process.env.MONGODB_UNIFIED_TOPOLOGY === 'true')
  );
}
```

@mbroadst (Member Author):

done


```js
function createConnection(pool, callback) {
  if (pool.state === DESTROYED || pool.state === DESTROYING) {
    if (pool.state === DESTROYED) {
      // …
```
Contributor:

Do we want connections to be creatable when the pool is DESTROYING?

@mbroadst force-pushed the NODE-2274/unified-failover-server-loss branch 5 times, most recently from 366c166 to b3899cc on November 1, 2019 13:45
The `close` event was erroneously emitted every time a child server
closed, which was inconsistent with the legacy topology's behavior.
The event is now emitted only when the topology itself is closed.
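A minimal sketch of the corrected behavior, using a hypothetical `SketchTopology` class rather than the driver's actual Topology:

```js
'use strict';
const { EventEmitter } = require('events');

class SketchTopology extends EventEmitter {
  constructor(servers) {
    super();
    this.servers = servers;
  }

  close() {
    // destroy each child server without re-emitting their close events
    this.servers.forEach(server => server.destroy());
    // emit a single `close` for the topology itself
    this.emit('close');
  }
}

const topology = new SketchTopology([{ destroy() {} }, { destroy() {} }]);
topology.on('close', () => console.log('topology closed')); // fires once
topology.close();
```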

NODE-2251
For legacy reasons the unified topology forced the connection pool
into auto-reconnect mode by default. This caused failed server
checks to continue emitting errors on the server, so the server lost
track of its monitoring state and the node was never returned to the
pool of selectable servers. Client-side, this surfaced as a server
selection timeout error.

NODE-2274
We have some edge cases in our testing where `endSessions` is sent
during `destroy`, but the pool might not have enough open
connections in that case.
If no host or port is provided, then `newTopology` should use all
hosts provided in a connection string.
This script runs each test by itself through mocha, making it much
easier to spot when a test leaks connections.
Sometimes we request operations as fire-and-forget right before the
pool is destroyed (`endSessions` is a good example). In a graceful
destruction the pool still needs to account for these operations,
so a new state, `draining`, was introduced to prevent new operations
while allowing the pool to drain existing queued work.
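A minimal sketch of the draining idea, with a hypothetical `SketchPool`; the real pool's states and APIs differ:

```js
const READY = 'ready';
const DRAINING = 'draining';
const DESTROYED = 'destroyed';

class SketchPool {
  constructor() {
    this.state = READY;
    this.queue = [];
  }

  // reject new work unless the pool is ready
  write(operation) {
    if (this.state !== READY) {
      throw new Error(`cannot write to pool in state "${this.state}"`);
    }
    this.queue.push(operation);
  }

  // graceful destruction: flip to DRAINING so no new writes are accepted,
  // run the work that was queued before destruction, then mark DESTROYED
  destroy() {
    this.state = DRAINING;
    while (this.queue.length) {
      this.queue.shift()();
    }
    this.state = DESTROYED;
  }
}

const pool = new SketchPool();
pool.write(() => console.log('endSessions sent')); // queued fire-and-forget
pool.destroy(); // queued work still runs; subsequent writes throw
```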
@mbroadst force-pushed the NODE-2274/unified-failover-server-loss branch from e6b78d8 to 87bd12f on November 2, 2019 12:08
@mbroadst force-pushed the NODE-2274/unified-failover-server-loss branch from 78acb68 to 471b1d0 on November 3, 2019 16:03
@mbroadst force-pushed the NODE-2274/unified-failover-server-loss branch from 471b1d0 to 060b90f on November 3, 2019 16:05
@mbroadst force-pushed the NODE-2274/unified-failover-server-loss branch from 397e8ef to f57b9e2 on November 4, 2019 13:38
@mbroadst requested a review from daprahamian on November 4, 2019 13:43
@imlucas (Contributor) commented Nov 5, 2019

nit: it would be swell to have more explicit error messages for debuggability. For example:

callback(new MongoError('Cannot execute a command when the server is closed'));

instead of:

callback(new MongoError('server is closed'));

@mbroadst (Member Author) commented Nov 5, 2019

Thanks @imlucas. This particular error should only ever show up as a reason, so I think we're safe being less specific here.

As an aside, I think the error is also sufficient from the perspective of what the stack trace would look like, now that we have a very descriptive command dispatch path. It would read like executeOperation => selectServer => command => error: server is closed.

@mbroadst merged commit 71a0270 into master on Nov 5, 2019
@mbroadst deleted the NODE-2274/unified-failover-server-loss branch on November 5, 2019 17:12
@imlucas (Contributor) commented Nov 5, 2019

@mbroadst thanks!

🤔 It would be sweet to have a little util to inspect the stack → message that includes the dispatch path. It could solve a problem we have in Compass today, where we only look at/show err.message. Adding this to my list to think more about and moving it off this thread.
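For illustration only, a hypothetical helper along those lines (not an existing driver or Compass API), deriving a dispatch path from `err.stack`:

```js
// sketch: turn an Error's stack into `outermost => ... => innermost`
function dispatchPath(err) {
  return (err.stack || '')
    .split('\n')
    .slice(1) // drop the `Error: message` line
    .map(line => line.trim().replace(/^at\s+/, '').split(' ')[0])
    .reverse() // outermost call first
    .join(' => ');
}

try {
  throw new Error('server is closed');
} catch (err) {
  console.log(`${dispatchPath(err)} => error: ${err.message}`);
}
```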
