Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

test: fix flaky test-child-process-fork-net #21012

Closed

Conversation

mmarchini
Copy link
Contributor

Patch inspired on 397eceb to fix
flakyness on test-child-process-fork-net.

Ref: #20973

Checklist
  • make -j4 test (UNIX), or vcbuild test (Windows) passes
  • tests and/or benchmarks are included
  • documentation is changed or added
  • commit message follows commit guidelines

@nodejs-github-bot nodejs-github-bot added the test Issues and PRs related to the tests. label May 29, 2018
@mmarchini
Copy link
Contributor Author

mmarchini commented May 29, 2018

CI https://ci.nodejs.org/job/node-test-pull-request/15139/

OSX Failure (looks like a flaky test):

not ok 2202 sequential/test-fs-readfile-tostring-fail
  ---
  duration_ms: 0.601
  severity: fail
  exitcode: 7
  stack: |-
    /Users/iojs/build/workspace/node-test-commit-osx/nodes/osx1010/test/sequential/test-fs-readfile-tostring-fail.js:67
      throw err;
      ^
    
    AssertionError [ERR_ASSERTION]: The expression evaluated to a falsy value:
    
      assert.ok(err instanceof Error)
    
        at /Users/iojs/build/workspace/node-test-commit-osx/nodes/osx1010/test/sequential/test-fs-readfile-tostring-fail.js:34:12
        at /Users/iojs/build/workspace/node-test-commit-osx/nodes/osx1010/test/common/index.js:443:15
        at FSReqWrap.readFileAfterClose [as oncomplete] (internal/fs/read_file_context.js:53:3)
  ...

Windows:

not ok 51 parallel/test-child-process-fork-net
  ---
  duration_ms: 0.359
  severity: fail
  exitcode: 1
  stack: |-
    PARENT: server listening
    CHILD: server listening
    CLIENT: connected
    PARENT: got connection
    CLIENT: connected
    CHILD: got connection
    CLIENT: closed
    CHILD: got connection
    CHILD: got connection
    CLIENT: closed
    CLIENT: connected
    CLIENT: connected
    CLIENT: closed
    CLIENT: closed
    PARENT: server closed
    testSocket, listening
    CHILD: got socket
    CLIENT: got data
    CLIENT: closed
    events.js:167
          throw er; // Unhandled 'error' event
          ^
    
    Error: write EPIPE
        at ChildProcess.target._send (internal/child_process.js:741:20)
        at ChildProcess.target.send (internal/child_process.js:625:19)
        at SocketListSend._request (internal/socket_list.js:20:16)
        at SocketListSend.close (internal/socket_list.js:40:10)
        at Server.close (net.js:1636:24)
        at Socket.<anonymous> (c:\workspace\node-test-binary-windows\test\parallel\test-child-process-fork-net.js:183:16)
        at Socket.emit (events.js:182:13)
        at TCP._handle.close [as _onclose] (net.js:596:12)
    Emitted 'error' event at:
        at process.nextTick (internal/child_process.js:745:39)
        at process._tickCallback (internal/process/next_tick.js:61:11)
  ...

@mmarchini mmarchini added the flaky-test Issues and PRs related to the tests with unstable failures on the CI. label May 29, 2018
@mmarchini mmarchini requested a review from gireeshpunathil May 29, 2018 13:06
@MylesBorins MylesBorins mentioned this pull request May 29, 2018
@mmarchini
Copy link
Contributor Author

@mmarchini
Copy link
Contributor Author

Update: the test is still failing...

@mmarchini mmarchini force-pushed the fix-flaky-test-child-process-fork-net branch from 072ce71 to 46371ce Compare May 29, 2018 15:02
@gireeshpunathil
Copy link
Member

Looking at the test case and the failing stack (from original issue tracker):

    Error: write EPIPE
        at ChildProcess.target._send (internal/child_process.js:741:20)
        at ChildProcess.target.send (internal/child_process.js:625:19)
        at SocketListSend._request (internal/socket_list.js:20:16)
        at SocketListSend.close (internal/socket_list.js:40:10)
        at Server.close (net.js:1624:24)
        at Socket.<anonymous> (c:\workspace\node-test-binary-windows\test\parallel\test-child-process-fork-net.js:178:16)
        at Socket.emit (events.js:182:13)
        at TCP._handle.close [as _onclose] (net.js:596:12)
    Emitted 'error' event at:
        at process.nextTick (internal/child_process.js:745:39)
        at process._tickCallback (internal/process/next_tick.js:61:11)

I believe the server error handler may not be able to handle this, unless the child process percolates it to server, which I don't think would be the case.

So probably we need a completion handler to the server close that filters EPIPE:

--- a/test/parallel/test-child-process-fork-net.js
+++ b/test/parallel/test-child-process-fork-net.js
@@ -175,7 +175,10 @@ if (process.argv[2] === 'child') {
       connect.on('close', function() {
         console.log('CLIENT: closed');
         assert.strictEqual(store, 'echo');
-        server.close();
+        server.close((err) => {
+          if (err && err.code !== 'EPIPE')
+            throw err;
+        });
       });
     });
   }

Patch inspired on 397eceb to fix
flakyness on test-child-process-fork-net.

Ref: nodejs#20973
@mmarchini mmarchini force-pushed the fix-flaky-test-child-process-fork-net branch from 46371ce to bd3dc9d Compare May 29, 2018 17:05
@mmarchini
Copy link
Contributor Author

@gireeshpunathil thanks, I updated the PR with your suggestion.

Yet-another-CI: https://ci.nodejs.org/job/node-test-pull-request/15148/

@mmarchini mmarchini added the wip Issues and PRs that are still a work in progress. label May 29, 2018
mmarchini pushed a commit to mmarchini/node that referenced this pull request May 29, 2018
`flaky-test-child-process-fork-net` has been failing constantly for the
past few days, and all solutions suggestes so far were didn't work.
Marking it as faky while the issue is not fixed.

Ref: nodejs#21012
Ref: nodejs#20973
Ref: nodejs#20973
mmarchini pushed a commit that referenced this pull request May 30, 2018
`flaky-test-child-process-fork-net` has been failing constantly for the
past few days, and all solutions suggestes so far were didn't work.
Marking it as faky while the issue is not fixed.

Ref: #21012
Ref: #20973
Ref: #20973

PR-URL: #21018
Refs: #21012
Refs: #20973
Refs: #20973
Reviewed-By: Ruben Bridgewater <[email protected]>
Reviewed-By: Jon Moss <[email protected]>
Reviewed-By: Anatoli Papirovski <[email protected]>
Reviewed-By: Michael Dawson <[email protected]>
Reviewed-By: James M Snell <[email protected]>
Reviewed-By: Trivikram Kamat <[email protected]>
@mmarchini
Copy link
Contributor Author

I think I found the issue: the error is being thrown by the ChildProcess inside the server, and there's no way of catching errors thrown by it (which IMO is a bug). The patch below fixes the problem with this test (stress-test on test-child-process-fork-net with the patch applied: https://ci.nodejs.org/job/node-stress-single-test/1884/nodes=win2016-1p-vs2017/) but breaks test-child-process-fork-net2. There might be a better way to propagate errors thrown by the inner ChildProcess, but I'm stuck right now.

/cc @nodejs/streams do you have any suggestions here?

diff --git a/lib/internal/socket_list.js b/lib/internal/socket_list.js
index 55077af1305..32615c4b8cb 100644
--- a/lib/internal/socket_list.js
+++ b/lib/internal/socket_list.js
@@ -17,7 +17,7 @@ class SocketListSend extends EventEmitter {
     var self = this;
 
     if (!this.child.connected) return onclose();
-    this.child.send(msg);
+    this.child.send(msg, callback);
 
     function onclose() {
       self.child.removeListener('internalMessage', onreply);

@gireeshpunathil
Copy link
Member

Agree with @mmarchini . I too spent a lot of time on this, and have these observations.

three issues:

  • test: multiple capabilities being tested together in a multi-process scenario, inviting race conditions: minor
  • test: no provision in the test to catch race situations: medium
  • code: internal helper functions (that support async APIs) throwing exception: major!

This might have been discussed in the past (but I haven't seen one):

What is the specification for an asynchronous API in terms of the control flow in its return path? This is my current understanding:

  1. validate the input and if found defective, throw (alternatively one may choose to callback too if they want pure asynchronous behavior)
  2. after advancing from the input validation stage, callback with error for all other error cases
  3. ignore error and silently come out of the API if callback is not passed

Internal functions that support async APIs should follow only last two options.
If they throw, the caller (API) should absorb it and handle it through its well defined return paths.

In any case, a program should not be directly handling internal faults.

Given the premise of tests as to catch issues in the code, I recommend not to workaround the test instead leaave it as flaky while we address the code issue.

addaleax pushed a commit that referenced this pull request May 31, 2018
`flaky-test-child-process-fork-net` has been failing constantly for the
past few days, and all solutions suggestes so far were didn't work.
Marking it as faky while the issue is not fixed.

Ref: #21012
Ref: #20973
Ref: #20973

PR-URL: #21018
Refs: #21012
Refs: #20973
Refs: #20973
Reviewed-By: Ruben Bridgewater <[email protected]>
Reviewed-By: Jon Moss <[email protected]>
Reviewed-By: Anatoli Papirovski <[email protected]>
Reviewed-By: Michael Dawson <[email protected]>
Reviewed-By: James M Snell <[email protected]>
Reviewed-By: Trivikram Kamat <[email protected]>
@Trott
Copy link
Member

Trott commented Jun 2, 2018

Pinging @AndreasMadsen (original author of the test in dceebbfa31a). Probably OK to split the test into two files, one to test the server object and one to test the socket object? Thoughts on what's going on here?

Trott added a commit to Trott/io.js that referenced this pull request Jun 2, 2018
Split test-child-process-fork-net into
test-child-process-fork-net-server and
test-child-process-fork-net-socket. Rename
test-child-process-fork-net2.js to test-child-process-fork-net.js.

Refs: nodejs#21012
@Trott Trott mentioned this pull request Jun 2, 2018
2 tasks
@Trott
Copy link
Member

Trott commented Jun 2, 2018

@targos and @apapirovski did a bunch of debugging and experimentation on this today. Here's what they've found:

  • The child process is quitting before the parent notifies it of the socket being closed. Ergo: EPIPE
  • The test is easy to "fix" in that you can keep the child alive longer by keeping something in the event loop (e.g., using setTimeout()). But that probably just masks the actual issue which is quite possibly in Node.js core (although it could be a Windows bug or something like that?). This is probably not that much better than just swallowing EPIPE. So that's probably not the way to go.
  • The lines in Node.js core that is trying to notify the child is
    this._request({
    cmd: 'NODE_SOCKET_NOTIFY_CLOSE',
    key: this.key
    }, 'NODE_SOCKET_ALL_CLOSED', callback);
  • The line that the above call causes to fail is
    var err = channel.writeUtf8String(req, string, handle);
    . That calls into C++ and returns EPIPE (or something that Node.js displays as EPIPE).

This feels like a Node.js bug on Windows. It seems like a race condition. We swallow EPIPE in a lot of places on Windows in tests and we wonder if this bug is the source of all or many of those workarounds.

Calling in the child_process experts to see what they think: @nodejs/child_process

@Trott
Copy link
Member

Trott commented Jun 2, 2018

More thoughts on EPIPE: It seems like there will always be a potential race condition because you have two different processes. Therefore, we can't get rid of EPIPE (can we?), so the question becomes: What is the appropriate thing to do with EPIPE within Node.js core? Is it to ignore it for that particular message at least?

@Trott
Copy link
Member

Trott commented Jun 2, 2018

Ping @addaleax

@Trott
Copy link
Member

Trott commented Jun 3, 2018

this._request({
cmd: 'NODE_SOCKET_NOTIFY_CLOSE',
key: this.key
}, 'NODE_SOCKET_ALL_CLOSED', callback);

I see that the callback that we internally use above is onWorkerClose in net.js:

node/lib/net.js

Lines 1620 to 1625 in 65b17d4

const onWorkerClose = () => {
if (--left !== 0) return;
this._connections = 0;
this._emitCloseIfDrained();
};

It's keeping track of the number of worker processes. Somehow, one is going away without it noticing? But only on this one Windows variant and only sometimes?

@Trott
Copy link
Member

Trott commented Jun 3, 2018

Perhaps this is a clue too?

The test has this listener:

socket.on('close', function() {
console.log('CLIENT: socket closed');
});

It never fires on successful runs or failed runs. It just never fires.

targos pushed a commit that referenced this pull request Jun 5, 2018
Split test-child-process-fork-net into
test-child-process-fork-net-server and
test-child-process-fork-net-socket. Rename
test-child-process-fork-net2.js to test-child-process-fork-net.js.

Refs: #21012

PR-URL: #21095
Reviewed-By: Michaël Zasso <[email protected]>
Reviewed-By: Joyee Cheung <[email protected]>
Reviewed-By: Trivikram Kamat <[email protected]>
@Trott
Copy link
Member

Trott commented Jun 5, 2018

@mmarchini Can you rebase this?

@mmarchini
Copy link
Contributor Author

@Trott I strongly feel this is a Node.js core bug and should be fixed as so instead of writing yet another workaround for EPIPE on tests. I'll close this PR, but first I'll open an issue regarding the problems we listed here (if there isn't one already).

@apapirovski
Copy link
Member

This is fixed by #21108 so we can just close.

@apapirovski apapirovski closed this Jun 5, 2018
MylesBorins pushed a commit that referenced this pull request Jun 6, 2018
Split test-child-process-fork-net into
test-child-process-fork-net-server and
test-child-process-fork-net-socket. Rename
test-child-process-fork-net2.js to test-child-process-fork-net.js.

Refs: #21012

PR-URL: #21095
Reviewed-By: Michaël Zasso <[email protected]>
Reviewed-By: Joyee Cheung <[email protected]>
Reviewed-By: Trivikram Kamat <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
flaky-test Issues and PRs related to the tests with unstable failures on the CI. test Issues and PRs related to the tests. wip Issues and PRs that are still a work in progress.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants