test: fix flaky test-cluster-shared-leak.js #4173

Trott · 2015-12-06T16:24:36Z

test-cluster-shared-leak.js was flaky because a worker can emit EPIPE.
This error event is expected.

Trott · 2015-12-06T16:25:45Z

One thing that needs to be checked is that the test with this change still fails on Windows with Node 4.2.1. (The test checks for a bug in Node 4.2.1.)

Trott · 2015-12-06T16:27:41Z

Stress test: https://ci.nodejs.org/job/node-stress-single-test/139/nodes=win2012r2/console

Fishrock123 · 2015-12-06T16:31:21Z

test/parallel/test-cluster-shared-leak.js

@@ -15,6 +15,11 @@ if (cluster.isMaster) {
  worker1 = cluster.fork();
  worker1.on('message', common.mustCall(function() {
    worker2 = cluster.fork();
+    worker2.on('error', function(e) {
+      // EPIPE is OK on Windows
+      if ((! common.isWindows) || (e.code !== 'EPIPE'))


Aside from some style issues, would the opposite check be clearer? i.e. if (common.isWindows && e.code === 'EPIPE') return;

Trott · 2015-12-06

I was going to write it that way, but I went with this form to more closely mirror the (simpler) condition involving ECONNRESET later in the file. You're right, though, and I'll switch it.
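For readers comparing the two forms of the check, here is a minimal sketch showing that they are exact negations of each other. The function names `isUnexpected` and `isTolerated` are hypothetical, and the sketch only models the condition itself, since the diff excerpt above does not show what follows the `if`.

```javascript
'use strict';

// Original form from the diff: true when the error should NOT be tolerated.
function isUnexpected(isWindows, err) {
  return !isWindows || err.code !== 'EPIPE';
}

// Suggested form: true when the error can be ignored with an early return.
function isTolerated(isWindows, err) {
  return isWindows && err.code === 'EPIPE';
}

// By De Morgan's laws, each form is the negation of the other for any input.
const cases = [];
for (const isWindows of [true, false]) {
  for (const code of ['EPIPE', 'ECONNRESET']) {
    cases.push({
      isWindows,
      code,
      unexpected: isUnexpected(isWindows, { code }),
      tolerated: isTolerated(isWindows, { code })
    });
  }
}
console.log(cases);
```

Only the combination of Windows and EPIPE is tolerated; every other combination is treated as unexpected by both forms.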

r-52 · 2015-12-06T18:24:48Z

@Trott sorry - I saw this PR too late and pushed your PR #4162 up to master. Does it make sense to remove the flaky mark for this test?

Trott · 2015-12-06T20:39:11Z

@romankl No problem. The test should be marked flaky until this (or some other fix) lands, so that's totally fine. I'll rebase this and add the removal from the .status file. Thanks!

Trott · 2015-12-06T22:31:35Z

Fixed up per nits from @romankl and @Fishrock123

bnoordhuis · 2015-12-07T10:59:45Z

test/parallel/test-cluster-shared-leak.js

@@ -15,14 +15,21 @@ if (cluster.isMaster) {
  worker1 = cluster.fork();
  worker1.on('message', common.mustCall(function() {
    worker2 = cluster.fork();
+    worker2.on('error', function(e) {
+      // EPIPE is OK on Windows
+      if (common.isWindows && e.code === 'EPIPE')


I'm not too surprised it only happens on Windows, but I don't understand why you only need the check for the second worker.

Trott · 2015-12-07

I wish I had a logical explanation for that too. I've been poking at it a bit, but I'm slowed by not having direct access to a Windows machine. I've been adding logging statements and running on CI, which is obviously a limiting and slow process...

Trott · 2015-12-08

I think I figured out why it's only afflicting worker2 and how to rewrite the code without the special handling for Windows.

worker2 gets EPIPE if we try to send to it before it's actually listening. This never happens with worker1 because by the time we get to worker1.send(), we're in the worker1 message handler, so it's obviously already listening. But there's no such protection for worker2. So the fix is to wait for the listening event to fire on cluster for worker2 before doing all the send() stuff. Running a stress test right now to confirm that's really the fix. Looks good so far. Will update this PR if that doesn't change. (Already confirmed that it still fails for Node 4.2.1, which is what we want.)
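The wait-until-listening idea described above can be sketched roughly as follows. This is not the code from the PR: `sendWhenListening` is a hypothetical helper, and a plain EventEmitter stands in for the cluster module so the sketch runs without forking processes. The only real API it relies on is that the cluster module emits a 'listening' event whose first argument is the worker.

```javascript
'use strict';
const EventEmitter = require('events');

// Hypothetical helper: call worker.send(msg) only after `clusterLike` has
// emitted 'listening' for that particular worker.
function sendWhenListening(clusterLike, worker, msg) {
  function onListening(listeningWorker) {
    if (listeningWorker !== worker) return; // some other worker; keep waiting
    clusterLike.removeListener('listening', onListening);
    worker.send(msg);
  }
  clusterLike.on('listening', onListening);
}

// Stand-ins (assumptions) for the cluster module and the two workers.
const fakeCluster = new EventEmitter();
const sent = [];
const worker1 = { id: 1, send(m) { sent.push([1, m]); } };
const worker2 = { id: 2, send(m) { sent.push([2, m]); } };

sendWhenListening(fakeCluster, worker2, 'hello');
fakeCluster.emit('listening', worker1); // ignored: not the worker we wait for
fakeCluster.emit('listening', worker2); // now the message is sent
fakeCluster.emit('listening', worker2); // listener already removed; no resend
console.log(sent);
```

As reported later in this thread, this approach passed CI but did not hold up under the Windows stress test, so treat it as an illustration of the idea rather than the eventual fix.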

test-cluster-shared-leak.js was flaky because a worker can emit EPIPE. Wait for workers to be listening so that EPIPE does not happen.

Fixes: nodejs#3956
PR-URL: nodejs#4173

Trott · 2015-12-08T06:50:52Z

OK, fixed up hopefully for really-realz this time.

Stress test: https://ci.nodejs.org/job/node-stress-single-test/150/nodes=win2012r2/console

CI: https://ci.nodejs.org/job/node-test-pull-request/952/

R=@bnoordhuis

Trott · 2015-12-08T14:26:43Z

*sigh* CI is good but stress test is not. Back to the drawing board...

Trott · 2015-12-08T14:34:10Z

Reverted to the previous workaround.

bnoordhuis · 2015-12-08T21:21:43Z

Too bad the wait-until-listening approach didn't work out. Doesn't the current approach of calling .disconnect() in both workers leave behind stray processes?

Trott · 2015-12-08T22:00:46Z

Yes, all routes so far are terrible, but I'll figure something out.

Trott · 2015-12-08T22:01:34Z

Right now, I'm focusing on an assertion that can be fired in internal/child_process.js. I'm hoping that fixing that bug will make this all workable. I'm unrealistically optimistic like that.

Trott · 2015-12-08T22:59:32Z

Here's the bug that I think/hope might be related to the issues here: #4205

Trott · 2015-12-27T21:18:43Z

Fix for #4205 has landed, so time to resume work on this issue...

Stress test with current master to confirm that this bug still exists:
https://ci.nodejs.org/job/node-stress-single-test/211/nodes=win2012r2/console

Refactor test-cluster-shared-leak.js to remove flakiness on Windows.

Fixes: nodejs#3956
PR-URL: nodejs#4173

Trott · 2015-12-28T16:22:58Z

OK, current minimal fix works.

CI: https://ci.nodejs.org/job/node-test-commit/1555/

Stress test: https://ci.nodejs.org/job/node-stress-single-test/233/nodes=win2012r2/consoleFull

Here's the explanation:

There is no guarantee that the pipe will be there when worker2.send() fires. It will probably be there but sometimes not. This is an error in the code, but I have not been able to remove that error without removing the invariant firing in internal/child_process when the code is run in Node 4.2.1, which is the point of the test. However, the error in the code is kind of irrelevant anyway. The invariant should never fire under any circumstances. So I've added an error event listener on worker2 which simply swallows the error. The test still fires the invariant in Node 4.2.1 (which is the bug that this test is supposed to detect, so that's good) and works fine in subsequent versions (which are fixed, so that's good too).
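A minimal sketch of the workaround described above, under stated assumptions: a plain EventEmitter stands in for worker2 (cluster workers are EventEmitters too), and the EPIPE error object is constructed by hand for the demonstration. It shows why attaching an 'error' listener is enough to keep the error from bringing the test down.

```javascript
'use strict';
const EventEmitter = require('events');

// Stand-in (assumption) for worker2.
const worker2 = new EventEmitter();

// Hand-built error resembling what a failed send might report.
function makeEpipe() {
  return Object.assign(new Error('write EPIPE'), { code: 'EPIPE' });
}

// Without an 'error' listener, emitting 'error' throws.
let threwWithoutListener = false;
try {
  worker2.emit('error', makeEpipe());
} catch (e) {
  threwWithoutListener = true;
}

// With a listener attached, the error is swallowed instead of thrown.
const swallowed = [];
worker2.on('error', function(e) {
  swallowed.push(e.code); // recorded only so the demo has something to show
});
worker2.emit('error', makeEpipe());

console.log(threwWithoutListener, swallowed);
```

In the real test the listener body could be empty; recording the code here only makes the behavior observable.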

PTAL @bnoordhuis and anyone else interested. I'd really like to get rid of this flaky test!

(And pre-emptive strike: Yes, I or someone else should definitely squash this down to one commit before landing. It's been a journey. ¯\_(ツ)_/¯ )

EDIT: The red failures on Windows are one known flaky test (I already have a PR open to mark it as flaky, which I'll land momentarily) and one newly flaky but unrelated test. Hooray. :-| I'll look into it, but it shouldn't stop this from landing...

Trott · 2015-12-28T22:55:02Z

One more (hopefully last) time with feeling:

CI: https://ci.nodejs.org/job/node-test-commit/1558/
Stress: https://ci.nodejs.org/job/node-stress-single-test/246/nodes=win2012r2/console

(And the test still fails in Node 4.2.1, which is the last release that had the bug, so that's good.)

Trott · 2015-12-29T03:10:24Z

CI etc. looks good. Still needs an LGTM, though. /cc @nodejs/testing

One last cut-and-paste of the explanation:

There is no guarantee that the pipe will be there when worker2.send() fires. It will probably be there but sometimes not. This is an error in the code, but I have not been able to remove that error without removing the invariant firing in internal/child_process when the code is run in Node 4.2.1, which is the point of the test. However, the error in the code is kind of irrelevant anyway. The invariant should never fire under any circumstances. So I've added an error event listener on worker2 which simply swallows the error. The test still fires the invariant in Node 4.2.1 (which is the bug that this test is supposed to detect, so that's good) and works fine in subsequent versions (which are fixed, so that's good too).

Swallow EPIPE as it is expected to come up from time to time. This does not invalidate the test.

Fixes: nodejs#3956
PR-URL: nodejs#4173

Trott · 2016-01-01T22:48:01Z

Closing in favor of #4510

Trott added the cluster, windows, and test labels Dec 6, 2015

Trott mentioned this pull request Dec 6, 2015

test-cluster-shared-leak failure on windows #3956

Closed

Fishrock123 reviewed Dec 6, 2015

Trott force-pushed the fix-3956 branch from f8be55b to 251c8b1 on December 6, 2015 22:31

bnoordhuis reviewed Dec 7, 2015

Trott force-pushed the fix-3956 branch from 251c8b1 to bbec62b on December 8, 2015 06:47

Trott force-pushed the fix-3956 branch from bbec62b to 4624147 on December 8, 2015 06:49

Trott added a commit to Trott/io.js that referenced this pull request Dec 8, 2015

test: fix EPIPE on Windows

4624147

test-cluster-shared-leak.js was flaky because a worker can emit EPIPE. Wait for workers to be listening so that EPIPE does not happen.

Fixes: nodejs#3956
PR-URL: nodejs#4173

Trott force-pushed the fix-3956 branch from 253fd62 to 3c7ae8d on December 8, 2015 18:19

Trott force-pushed the fix-3956 branch from a291b7c to 58f795e on December 8, 2015 20:15

jasnell added the lts-watch-v4.x label Dec 11, 2015

Trott force-pushed the master branch from 1e896a6 to 082cc8d on December 27, 2015 02:01

Trott force-pushed the fix-3956 branch from 1b23d09 to e32475a on December 27, 2015 22:28

Trott added a commit to Trott/io.js that referenced this pull request Dec 27, 2015

test: fix EPIPE on Windows

f434394

Refactor test-cluster-shared-leak.js to remove flakiness on Windows.

Fixes: nodejs#3956
PR-URL: nodejs#4173

Trott changed the title ~~test: accommodate EPIPE on Windows~~ test: fix flaky test-cluster-shared-leak.js Dec 28, 2015

Trott force-pushed the fix-3956 branch from 2f737a5 to 64af6ac on December 28, 2015 17:16

Trott force-pushed the fix-3956 branch from 64af6ac to c319781 on December 28, 2015 17:18

Trott force-pushed the fix-3956 branch from 755cd89 to f87f03f on December 28, 2015 22:52

test: fix flaky test-cluster-shared-leak

e962bad

Swallow EPIPE as it is expected to come up from time to time. This does not invalidate the test.

Fixes: nodejs#3956
PR-URL: nodejs#4173

Trott force-pushed the fix-3956 branch from f87f03f to e962bad on December 29, 2015 15:31

Trott mentioned this pull request Dec 29, 2015

cluster: ignore queryServer msgs on disconnection #4465

Closed

Trott closed this Jan 1, 2016

jasnell removed the lts-watch-v4.x label Jan 4, 2016

Trott deleted the fix-3956 branch January 13, 2022 22:30
