cluster: fix race condition setting suicide prop #4349

santigimeno · 2015-12-18T21:11:57Z

There is no guarantee that the suicide property of a worker in the master
process is going to be set when the disconnect and exit events are emitted.
To fix it, wait for the ACK of the suicide message from the master before
disconnecting the worker.
Modify test-regress-GH-3238 so it checks both the kill and disconnect
cases. Also take into account that the disconnect event may be received after
the exit event.

I discovered this because I was sometimes getting this error:

assert.js:89
  throw new assert.AssertionError({
  ^
AssertionError: false === true
    at Worker.<anonymous> (/Users/sgimeno/node/node/test/parallel/test-regress-GH-3238.js:17:12)
    at Worker.<anonymous> (/Users/sgimeno/node/node/test/common.js:401:15)
    at emitTwo (events.js:88:13)
    at Worker.emit (events.js:173:7)
    at ChildProcess.<anonymous> (cluster.js:361:14)
    at ChildProcess.g (events.js:264:16)
    at emitTwo (events.js:88:13)
    at ChildProcess.emit (events.js:173:7)
    at Process.ChildProcess._handle.onexit (internal/child_process.js:200:12)

I don't know if it's the proper fix though

cjihrig · 2015-12-22T17:03:36Z

test/parallel/test-regress-GH-3238.js

-  const worker = cluster.fork();
-  let disconnected = false;
+  function forkWorker(action) {
+    const worker = cluster.fork({


Can you please make this: const worker = cluster.fork({ action });

cjihrig · 2015-12-22T17:06:56Z

LGTM.

CI seems like it might not be in great shape right now, but trying anyway: https://ci.nodejs.org/job/node-test-pull-request/1045/

santigimeno · 2015-12-22T18:23:37Z

@cjihrig updated. Thanks!

cjihrig · 2015-12-22T19:28:55Z

@bnoordhuis what is your take on the semver-ness of this change. I don't think it should break anyone.

Trott · 2015-12-22T22:48:00Z

Any idea if this fixes #4205? That would be awesome.

Trott · 2015-12-23T06:17:54Z

Following up on my own question about how this affects #4205:

On OS X, the code below exits normally with current master. With the change proposed in this PR, the code does not exit.

On Linux (ubuntu 14.04), the following code throws an AssertionError (which is bug #4205) with current master. With the change proposed here, the AssertionError goes away but the code does not exit.

Is the fact that this code does not exit a bug that needs to be addressed? Or is it expected behavior?

'use strict';
const net = require('net');
const cluster = require('cluster');
cluster.schedulingPolicy = cluster.SCHED_NONE;

if (cluster.isMaster) {
  var worker1, worker2;

  worker1 = cluster.fork();
  worker1.on('message', function() {
    worker2 = cluster.fork();
    worker1.disconnect();
    worker2.on('online', worker2.disconnect);
  });

  return;
}

var server = net.createServer();

server.listen(12346, function() {
  process.send('listening');
});

santigimeno · 2015-12-23T10:53:19Z

@Trott Changing the behavior of disconnect to synchronous when called from the parent makes your test code fail as described in #4205. I'll apply that change.

santigimeno · 2015-12-23T11:20:40Z

Updated

Trott · 2015-12-23T18:19:52Z

OK, so it doesn't fix #4205 but it also doesn't change the nature of that problem. And it gives me a pretty good idea of where to look to potentially figure out what the real issue is there. Thanks.

@cjihrig Does your previous LGTM stand with the additional modification that's been made?

cjihrig · 2015-12-23T18:30:25Z

lib/cluster.js

@@ -658,15 +663,21 @@ function workerInit() {
    if (accepted) server.onconnection(0, handle);
  }

-  Worker.prototype.disconnect = function() {
+  Worker.prototype.disconnect = function(fromParent) {


Could you name this something like masterInitiated?

cjihrig · 2015-12-23T18:31:41Z

Yes, still LGTM with comments.

Trott · 2015-12-23T19:09:54Z

lib/cluster.js

-        send({ act: 'suicide' });
-        process.disconnect();
+        // if called from worker, wait for ack to be sure suicide property is
+        // sent in the master


Nit: Should that be set rather than sent?

Trott · 2015-12-24T04:43:05Z

This would be semver-minor because it adds a parameter to worker.disconnect(), right? And, in that case, it would need a doc update as well.

Might it be better to do it in a way that allows the information to be sent when disconnect() is being called internally within cluster.js but not in a way that exposes a new API interface to the end user? This is a function parameter that no end user would ever need, right?

If we are going to add a parameter, I'd be more inclined to add an options object and check for options.fromParent (or whatever property name is appropriate) rather than a fromParent function parameter.

Trott · 2015-12-24T04:47:14Z

Maybe have a private function (say, _disconnect()) that takes the parameter so that it's available to cluster.js itself, and have Worker.prototype.disconnect() call _disconnect()?

Trott · 2015-12-24T22:37:19Z

When I run your modified version of the test file with the current master, it does not fail. Are you able to modify it so that it fails every time with the current master and succeeds every time with the patch here? (Or do I need to run it on a particular OS? I'm running it on OS X.) Otherwise, if the bug gets reintroduced, it's probably more likely that the test will be marked as flaky than it will be considered a bug in the code, as it probably won't fail on the commit that introduces the bug but rather on a subsequent unrelated commit.

Trott · 2015-12-24T22:40:48Z

I'm trying to figure out if #4418 would fix the issue you are experiencing or not. I was going to find out by simply seeing if that code passes the test, but since current master passes the test, ¯\_(ツ)_/¯. Any ideas? Maybe I'll run some stress tests...

Thanks, by the way, for all you do in trying to squash flaky tests. It is really valuable work and challenging in a pull-your-hair-out-and-bang-your-head-on-the-table kind of way. (EDIT: In case my hyphenated phrase is unclear, people don't always appreciate just how agonizing it can be to try to squash these bugs, so I want to say how much I appreciate you doing so much lately to do exactly that.)

Trott · 2015-12-25T04:41:58Z

Stress test with current master: https://ci.nodejs.org/job/node-stress-single-test/198/nodes=ubuntu1404-64/console

No failures on test-regress-GH-3238 after 9999 consecutive test runs on 64-bit ubuntu 14. What operating system are you seeing these failures on?

santigimeno · 2015-12-26T00:28:12Z

@Trott IIRC the test was failing to me in OS X. It only failed when the exit event was received before the disconnect event. I'm not sure there's an easy way to simulate that scenario in a consistent way.
BTW thank for your kind words :)

santigimeno · 2015-12-26T00:31:04Z

@Trott @cjihrig I have updated the PR. I think I have addressed all your comments. Let me know what you think. Thanks!

Trott · 2015-12-26T03:50:17Z

@santigimeno Judging from the stack trace you posted, it sure does seem like either exit fires before disconnect or else disconnect does not fire at all. But I am unable to reproduce this problem. I'm still trying, though. (I'm running a stress test on OS X right now.)

I'd be reluctant to make a change like this without having a test that fails without the change and passes with it.

Trott · 2015-12-26T03:58:58Z

lib/cluster.js

@@ -523,6 +528,35 @@ function workerInit() {
      process: process,
      state: 'online'
    });
+
+    worker._disconnect = function(masterInitiated) {


This will still be exposed to the end user. Can we make it private to this module rather than exposing it or is there a reason it cannot be done that way?

santigimeno · 2015-12-27T22:44:36Z

@Trott I'm able to reproduce the error consistently in my OS X computer by running multiple test-regress-GH-3238 tests in parallel like this:

while true; do /usr/bin/python tools/test.py --mode=release parallel/test-regress-GH-3238* -J; if [ $? -ne 0 ]; then break; fi; done

For example, having 16 copies of the test I'm getting the error before the tenth iteration most of the time:

while true; do /usr/bin/python tools/test.py --mode=release parallel/test-regress-GH-3238* -J; if [ $? -ne 0 ]; then break; fi; done
[00:00|% 100|+  16|-   0]: Done                            
[00:00|% 100|+  16|-   0]: Done                            
[00:00|% 100|+  16|-   0]: Done                            
[00:00|% 100|+  16|-   0]: Done                            
[00:00|% 100|+  16|-   0]: Done                            
[00:00|% 100|+  16|-   0]: Done                            
=== release test-regress-GH-3238-6 ===                     
Path: parallel/test-regress-GH-3238-6
assert.js:89
  throw new assert.AssertionError({
  ^
AssertionError: false === true
    at Worker.<anonymous> (/Users/sgimeno/node/node/test/parallel/test-regress-GH-3238-6.js:16:12)
    at Worker.<anonymous> (/Users/sgimeno/node/node/test/common.js:401:15)
    at emitTwo (events.js:88:13)
    at Worker.emit (events.js:173:7)
    at ChildProcess.<anonymous> (cluster.js:361:14)
    at ChildProcess.g (events.js:264:16)
    at emitTwo (events.js:88:13)
    at ChildProcess.emit (events.js:173:7)
    at Process.ChildProcess._handle.onexit (internal/child_process.js:200:12)
Command: out/Release/node /Users/sgimeno/node/node/test/parallel/test-regress-GH-3238-6.js
[00:00|% 100|+  15|-   1]: Done

I don't know if it'll be reproducible in other computers as I'm suspecting there's something in my computer that makes flaky tests fail more easily.

Trott · 2016-01-08T15:04:56Z

@santigimeno New CI: https://ci.nodejs.org/job/node-test-commit/1658/

santigimeno · 2016-01-08T21:18:45Z

Finally all is green :D

Trott · 2016-01-08T21:22:33Z

OK! Since it has gone through so many iterations since the last LGTM, it is probably good to ask reviewers to take one last look to make sure it is still OK by them.

Does this still look good to you, @cjihrig ?

Trott · 2016-01-08T21:23:49Z

Oh, and since there are changes to the previously existing regression test, we should double-check that it still fires on the regression it was written to catch.

cjihrig · 2016-01-12T20:32:33Z

test/sequential/test-cluster-disconnect-suicide-race.js

+    assert.strictEqual(code, 0, 'worker exited with error');
+  }, tries * 2));
+
+  for (let i = 0; i < tries; ++ i) {


Unnecessary space between ++ and i.

There is no guarantee that the `suicide` property of a worker in the master process is going to be set when the `disconnect` and `exit` events are emitted. To fix it, wait for the ACK of the suicide message from the master before disconnecting the worker. Also, there's no need to send the suicide message from the worker if the disconnection has been initiated in the master. Add `test-cluster-disconnect-suicide-race` that forks a lot of workers to consistently reproduce the issue this patch tries to solve. Modify `test-regress-nodejsGH-3238` so it checks both the `kill` and `disconnect` cases. Also take into account that the `disconnect` event may be received after the `exit` event.

santigimeno · 2016-01-13T16:51:36Z

PR updated. Thanks!

cjihrig · 2016-01-13T16:57:17Z

The changes to lib/cluster.js and the existing test LGTM.

Could you clarify the purpose of test-cluster-disconnect-suicide-race.js please. It looks a lot like the other test, but with the addition of tries, which looks somewhat arbitrary.

santigimeno · 2016-01-13T17:05:13Z

@cjihrig the new test forks lots of workers so the issue this PR tries to solve happens consistently. With the original test it barely did.

cjihrig · 2016-01-13T17:20:26Z

When you say consistently, is it reproduced 100% of the time?

santigimeno · 2016-01-13T17:24:32Z

At least the last time I checked (it's been a few days) it did in the platforms I tried: jessie 64, OS X and FreeBSD. I just tested in a jessie 64 box and the test fails in current master.

cjihrig · 2016-01-13T17:28:11Z

OK, well unless anyone objects to this, I'm cool with it.

rvagg · 2016-01-14T00:30:05Z

@Trott did you delete your testing rant from here? I was going to say that you need to turn it into a skeleton of a "how to write tests for core" doc for @nodejs/testing to iterate on and attach to the contributor guidelines. We need a rulebook we can follow when reviewing tests.

jasnell · 2016-01-14T01:02:28Z

+1 that would be quite valuable
On Jan 13, 2016 4:30 PM, "Rod Vagg" [email protected] wrote:

@Trott https://github.com/Trott did you delete your testing rant from
here? I was going to say that you need to turn it into a skeleton of a "how
to write tests for core" doc for @nodejs/testing
https://github.com/orgs/nodejs/teams/testing to iterate on and attach
to the contributor guidelines. We need a rulebook we can follow when
reviewing tests.

—
Reply to this email directly or view it on GitHub
#4349 (comment).

Trott · 2016-01-14T03:36:07Z

@rvagg @jasnell I deleted it from GitHub seconds after posting it, but not before saving it off somewhere to use later. Creating guidelines for tests is one item on a very long list of things I'd like to see the Testing WG do. First meeting is Friday. Check us out. nodejs/testing#1

Trott · 2016-01-14T03:43:17Z

Confirmed that the test still fires an AssertionError in Node.js 4.1.2 so we're all set, I think. One final CI run because I don't ever need more than a flimsy excuse to do a CI run.

https://ci.nodejs.org/job/node-test-pull-request/1228/

There are two tests on Windows that are failing on master and will continue to fail until #4679 lands (or some alternate fix). As long as nothing else is up with the CI, will land. Thanks for your persistence on this, @santigimeno!

There is no guarantee that the `suicide` property of a worker in the master process is going to be set when the `disconnect` and `exit` events are emitted. To fix it, wait for the ACK of the suicide message from the master before disconnecting the worker. Also, there's no need to send the suicide message from the worker if the disconnection has been initiated in the master. Add `test-cluster-disconnect-suicide-race` that forks a lot of workers to consistently reproduce the issue this patch tries to solve. Modify `test-regress-nodejsGH-3238` so it checks both the `kill` and `disconnect` cases. Also take into account that the `disconnect` event may be received after the `exit` event. PR-URL: nodejs#4349 Reviewed-By: Colin Ihrig <[email protected]> Reviewed-By: Rich Trott <[email protected]>

Trott · 2016-01-14T04:16:56Z

Landed in 9571be1. (Nice detailed git commit message, too, @santigimeno!)

There is no guarantee that the `suicide` property of a worker in the master process is going to be set when the `disconnect` and `exit` events are emitted. To fix it, wait for the ACK of the suicide message from the master before disconnecting the worker. Also, there's no need to send the suicide message from the worker if the disconnection has been initiated in the master. Add `test-cluster-disconnect-suicide-race` that forks a lot of workers to consistently reproduce the issue this patch tries to solve. Modify `test-regress-GH-3238` so it checks both the `kill` and `disconnect` cases. Also take into account that the `disconnect` event may be received after the `exit` event. PR-URL: #4349 Reviewed-By: Colin Ihrig <[email protected]> Reviewed-By: Rich Trott <[email protected]>

There is no guarantee that the `suicide` property of a worker in the master process is going to be set when the `disconnect` and `exit` events are emitted. To fix it, wait for the ACK of the suicide message from the master before disconnecting the worker. Also, there's no need to send the suicide message from the worker if the disconnection has been initiated in the master. Add `test-cluster-disconnect-suicide-race` that forks a lot of workers to consistently reproduce the issue this patch tries to solve. Modify `test-regress-nodejsGH-3238` so it checks both the `kill` and `disconnect` cases. Also take into account that the `disconnect` event may be received after the `exit` event. PR-URL: nodejs#4349 Reviewed-By: Colin Ihrig <[email protected]> Reviewed-By: Rich Trott <[email protected]>

mscdex added the cluster Issues and PRs related to the cluster subsystem. label Dec 18, 2015

cjihrig reviewed Dec 22, 2015
View reviewed changes

santigimeno force-pushed the cluster_suicide branch from acb7566 to 982acc4 Compare December 22, 2015 18:18

santigimeno force-pushed the cluster_suicide branch from 982acc4 to be062b8 Compare December 23, 2015 11:20

cjihrig reviewed Dec 23, 2015
View reviewed changes

Trott reviewed Dec 23, 2015
View reviewed changes

jasnell added the lts-watch-v4.x label Dec 24, 2015

santigimeno force-pushed the cluster_suicide branch from be062b8 to d83edc3 Compare December 26, 2015 00:28

Trott reviewed Dec 26, 2015
View reviewed changes

Trott force-pushed the master branch from 1e896a6 to 082cc8d Compare December 27, 2015 02:01

santigimeno force-pushed the cluster_suicide branch from d83edc3 to f3a3370 Compare December 27, 2015 22:36

cjihrig reviewed Jan 12, 2016
View reviewed changes

santigimeno force-pushed the cluster_suicide branch from f63b35c to 0fb3ec9 Compare January 13, 2016 16:51

Trott closed this Jan 14, 2016

evanlucas mentioned this pull request Jan 18, 2016

v5.5.0 Release Proposal (Stable) #4742

Merged

MylesBorins added land-on-v4.x and removed lts-watch-v4.x labels Jan 28, 2016

MylesBorins mentioned this pull request Feb 11, 2016

V4.3.1 proposal #5200

Merged

cluster: fix race condition setting suicide prop #4349

cluster: fix race condition setting suicide prop #4349

Conversation

santigimeno commented Dec 18, 2015

cjihrig Dec 22, 2015

Choose a reason for hiding this comment

cjihrig commented Dec 22, 2015

santigimeno commented Dec 22, 2015

cjihrig commented Dec 22, 2015

Trott commented Dec 22, 2015

Trott commented Dec 23, 2015

santigimeno commented Dec 23, 2015

santigimeno commented Dec 23, 2015

Trott commented Dec 23, 2015

cjihrig Dec 23, 2015

Choose a reason for hiding this comment

cjihrig commented Dec 23, 2015

Trott Dec 23, 2015

Choose a reason for hiding this comment

Trott commented Dec 24, 2015

Trott commented Dec 24, 2015

Trott commented Dec 24, 2015

Trott commented Dec 24, 2015

Trott commented Dec 25, 2015

santigimeno commented Dec 26, 2015

santigimeno commented Dec 26, 2015

Trott commented Dec 26, 2015

Trott Dec 26, 2015

Choose a reason for hiding this comment

santigimeno commented Dec 27, 2015

Trott commented Jan 8, 2016

santigimeno commented Jan 8, 2016

Trott commented Jan 8, 2016

Trott commented Jan 8, 2016

cjihrig Jan 12, 2016

Choose a reason for hiding this comment

santigimeno commented Jan 13, 2016

cjihrig commented Jan 13, 2016

santigimeno commented Jan 13, 2016

cjihrig commented Jan 13, 2016

santigimeno commented Jan 13, 2016

cjihrig commented Jan 13, 2016

rvagg commented Jan 14, 2016

jasnell commented Jan 14, 2016

Trott commented Jan 14, 2016

Trott commented Jan 14, 2016

Trott commented Jan 14, 2016