NODE-2274: unified failover server loss #2197
Conversation
 */
class UnifiedTopologyFilter {
  filter(test) {
    if (!test.metadata) return true;
nit: replace all of this with:
filter(test) {
  const unifiedTopology = test.metadata &&
    test.metadata.requires &&
    test.metadata.requires.unifiedTopology;
  return typeof unifiedTopology !== 'boolean' ||
    unifiedTopology === process.env.MONGODB_UNIFIED_TOPOLOGY;
}
done
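For context, a minimal sketch of the test metadata such a filter would consume. The `it(name, { metadata, test })` form shown here is an assumption for illustration, not necessarily the harness's exact API:

```js
// Hypothetical example of the metadata a UnifiedTopologyFilter would inspect.
describe('failover', function() {
  it('should regain a lost server', {
    metadata: { requires: { unifiedTopology: true } },
    test: function(done) {
      // ...exercise failover behavior against the unified topology...
      done();
    }
  });
});
```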
lib/core/connection/pool.js (Outdated)
  function createConnection(pool, callback) {
-   if (pool.state === DESTROYED || pool.state === DESTROYING) {
+   if (pool.state === DESTROYED) {
Do we want connections to be creatable when the pool is DESTROYING?
The `close` event was erroneously emitted every time a child server closed, which was not consistent with the legacy topology's behavior. Instead, the event is now emitted when the topology itself is closed. NODE-2251
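A minimal sketch of the intended semantics, assuming a unified `topology` instance is in scope; the listener is illustrative only:

```js
// With this change, 'close' fires once when the topology itself is closed,
// not every time one of its child servers closes.
topology.on('close', () => {
  console.log('topology closed');
});

// Closing an individual server no longer emits 'close' on the topology;
// only an explicit topology close does.
topology.close();
```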
For legacy reasons the unified topology forced the connection pool into auto-reconnect mode by default. This caused failed server checks to continue emitting errors on the server, which made the server lose track of its monitoring state and never return the node to the pool of selectable servers. Client-side, this surfaces as a server selection timeout error. NODE-2274
We have some edge cases in our testing where `endSessions` is sent during `destroy`, but the pool might not have enough open connections in that case.
If no host or port is provided, then `newTopology` should use all hosts provided in a connection string.
This script runs each test by itself through mocha, making it much easier to spot when a test leaks connections.
Sometimes we issue fire-and-forget operations right before the pool is destroyed (`endSessions` is a good example). In a graceful destruction the pool still needs to account for these operations, so a new state, `draining`, was introduced to prevent new operations while allowing the pool to drain existing queued work.
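A rough sketch of what that state buys us; the state names and helpers below (`flushQueuedWork`, `closeAllConnections`) are hypothetical and only illustrate the idea:

```js
// Hypothetical pool internals, for illustration only.
const DRAINING = 'draining';
const DESTROYED = 'destroyed';

function write(pool, command, callback) {
  // New work is rejected once the pool starts draining or is destroyed...
  if (pool.state === DRAINING || pool.state === DESTROYED) {
    return callback(new Error('pool is draining, new operations prohibited'));
  }

  pool.queue.push({ command, callback });
}

function destroy(pool, callback) {
  // ...but anything already queued (e.g. an `endSessions` issued just before
  // destroy) is still flushed before the connections are torn down.
  pool.state = DRAINING;

  flushQueuedWork(pool, () => {        // hypothetical helper
    closeAllConnections(pool, () => {  // hypothetical helper
      pool.state = DESTROYED;
      callback();
    });
  });
}
```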
nit: it would be swell to have more explicit error messages for debuggability. For example:
callback(new MongoError('Cannot execute a command when the server is closed'));
instead of:
callback(new MongoError('server is closed'));
Thanks @imlucas. This particular error should only ever show up as a reason, so I think we're safe being less specific here. As an aside, I think the error is also sufficient from the perspective of what the stack trace would look like now that we have a very descriptive command dispatch path. It would be like …

@mbroadst thanks! 🤔 It would be sweet to have a little util to inspect the stack → message that includes the dispatch path. Could solve a problem we have in Compass today where we're only looking/showing at …
Description
The fundamental issue here is that when using the unified topology, if a node is "lost" (due to some network error, etc.) it is never regained. This was due to some complicated error handling, primarily around the events propagated from the connection pool. I'll try to summarize in the next section what changed (or needed to be changed).
What changed?
- `Pool`, `Server` and `Topology` types were formalized, and now use the same method to generate their state machines. The `Pool` gained a new state, `DRAINING`, which is used to distinguish whether new commands can be written to the pool or not, in particular when fire-and-forget messages are sent right before pool destruction.
- `Topology` no longer listens for `error` or `close` events from the `Pool`. All errors are propagated through callbacks now, which allows us to be very specific about how we handle each type of error (see the sketch after this list). This was the root cause of the issue addressed by this PR: all errors used to be funneled through a common error handler, which would reset the server in the SDAM flow even during monitoring checks.
- `Topology` no longer uses a pool in forced `reconnect` mode. This was the second part of the root cause: failed attempts to do something like monitoring would cause retries, which would trigger the SDAM flow. We now depend explicitly on the retry loop of retryable operations for retries.
- `Unknown` servers were erroneously removed from the topology description.
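To make the second and third points concrete, here is a rough sketch of the callback-based error handling, with hypothetical helper names (`checkServer`, `runCommand`, `markServerUnknown`); it illustrates the approach, not the driver's actual code:

```js
// Monitoring errors are handled locally; they no longer funnel through a
// shared pool 'error' handler or force the pool into reconnect mode.
function monitorServer(server) {
  checkServer(server, err => {
    if (err) {
      markServerUnknown(server, err); // the next scheduled check drives recovery
    }
  });
}

// Operation errors are reported to the caller through the callback and fed
// into the SDAM flow deliberately; retries come from the retryable-operations
// loop rather than from pool-level auto reconnect.
function executeOperation(server, cmd, callback) {
  runCommand(server, cmd, (err, result) => {
    if (err) {
      markServerUnknown(server, err);
      return callback(err);
    }

    callback(undefined, result);
  });
}
```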
Are there any files to ignore?
A number of test changes were required that were out of scope for this change, but caused test failures nonetheless. Specifically, ignore changes to the transactions tests, and an added tool called `run_each_test.sh`, which is just a helper for identifying individual test files that leak handles in node.