
test-debug-signal-cluster.js fails sometimes (investigate flaky test) #3796

Closed

jelmd opened this issue Nov 12, 2015 · 21 comments
Labels
test Issues and PRs related to the tests.

Comments

@jelmd

jelmd commented Nov 12, 2015

It seems that many processes write to stderr at the "same" time, and thus this test fails from time to time. When it fails, I always get something like 'Debugger listening on port 12389Debugger listening on port 12390\n\n', i.e. the message from one process got inserted before the newline of another process's message.

To make it more reliable (in my case it now always succeeds), I changed the test to collect the output without any <LF|CR> and finally just subtract the expected lines (see http://iws.cs.uni-magdeburg.de/~elkner/tmp/node5/test-concur.patch). I know that as long as writes to stderr are not synchronized this test may still fail, but it now seems to work better.
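For illustration, a minimal sketch of that approach in JavaScript (hypothetical names; checkOutput and its arguments are illustrative, not the code from test-concur.patch):

  // Hypothetical sketch of "collect the output without LF/CR, then subtract
  // the expected lines"; not the actual patch.
  const assert = require('assert');

  function checkOutput(collectedStderr, expectedMessages) {
    // Drop all CR/LF so interleaved newlines cannot split or merge messages.
    let output = collectedStderr.replace(/\r|\n/g, '');
    // Subtract each expected message exactly once, in any order.
    for (const msg of expectedMessages) {
      const idx = output.indexOf(msg);
      assert.notStrictEqual(idx, -1, 'missing expected output: ' + msg);
      output = output.slice(0, idx) + output.slice(idx + msg.length);
    }
    // Anything left over was not expected.
    assert.strictEqual(output, '', 'unexpected output: ' + output);
  }

With such a check, 'Debugger listening on port 12389Debugger listening on port 12390' still satisfies both expected messages even though the newline between them was swallowed.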

@cjihrig
Contributor

cjihrig commented Nov 12, 2015

Would you mind opening this as a proper PR?

@jelmd
Author

jelmd commented Nov 12, 2015

Is there a howto? I usually use git on the CLI only (and try to avoid the chaotic GitHub UI).

@mscdex added the debugger and test labels Nov 12, 2015
@targos
Member

targos commented Nov 12, 2015

@jelmd you can start by forking this repository. Then create a branch, apply your patch, push the changes to GitHub and finally create the pull request.

@jelmd
Author

jelmd commented Nov 16, 2015

Isn't it easier if you just do it? Forking and doing all the curious steps just for something that already exists, and then letting it rot, seems like a bit of overkill and a waste of resources.

@jasnell
Member

jasnell commented Nov 16, 2015

@jelmd ... all changes in the source are handled through pull requests and all pull requests need someone to open them ;-)

@tyleranton
Contributor

@jelmd If you could provide your fix in a gist, I can make a PR.

@jelmd
Author

jelmd commented Dec 15, 2015

@Trott
Member

Trott commented Feb 23, 2016

@jelmd Can you confirm that this is still a problem? I wouldn't be surprised if #3701 fixed this as it looks an awful lot like #2476 (which that PR fixed).

@jelmd
Author

jelmd commented Feb 24, 2016

Haven't built a new version yet. Due to the lack of OpenSSL 1.0.1 support, it takes me a considerable amount of time to go over the changes in new versions and make them 1.0.1 compatible. That's why my current plan is to wait for the version that incorporates the FIPS mode changes, so that it is worth spending time on it...

@Trott
Member

Trott commented May 11, 2016

@jelmd Any chance you've had an opportunity to verify this bug with a recent version?

@Trott
Member

Trott commented Jul 7, 2016

Well, just saw this on OS X on CI, so I guess this is now our "investigate flaky test-debug-signal-cluster on OS X" issue. :-/

https://ci.nodejs.org/job/node-test-commit-osx/4041/nodes=osx1010/console

not ok 194 parallel/test-debug-signal-cluster
# 
# assert.js:89
#   throw new assert.AssertionError({
#   ^
# AssertionError: test timed out.
#     at Timeout.testTimedOut [as _onTimeout] (/Users/iojs/build/workspace/node-test-commit-osx/nodes/osx1010/test/parallel/test-debug-signal-cluster.js:53:3)
#     at Timer.unrefdHandle (timers.js:454:14)
  ---
  duration_ms: 5.897

@Trott
Member

Trott commented Jul 7, 2016

Although it does seem that the CI failure is different from what's described here?

@Trott changed the title from "node5: test-debug-signal-cluster.js fails sometimes" to "test-debug-signal-cluster.js fails sometimes (investigate flaky test)" Jul 7, 2016
@Trott
Member

Trott commented Jul 8, 2016

Stress test on all platforms: https://ci.nodejs.org/job/node-stress-single-test/793/

@Trott
Member

Trott commented Jul 9, 2016

Stress test was clean except for a build failure.

@jelmd
Author

jelmd commented Aug 3, 2016

tested 6.3.1: seems to work.

@mhdawson
Member

mhdawson commented Aug 11, 2016

This seems to have failed on AIX on master last night: https://ci.nodejs.org/job/node-test-commit-aix/nodes=aix61-ppc64/321/console

not ok 199 parallel/test-debug-signal-cluster
# got pids [8978502,5701842,8257694]
# 
# assert.js:89
#   throw new assert.AssertionError({
#   ^
# AssertionError: test timed out.
#     at Timeout.testTimedOut [as _onTimeout] (/home/iojs/build/workspace/node-test-commit-aix/nodes/aix61-ppc64/test/parallel/test-debug-signal-cluster.js:53:3)
#     at Timer.unrefdHandle (timers.js:462:14)
# > all workers are running
# > Starting debugger agent.
# > Debugger listening on [::]:12347
# > Starting debugger agent.
# > Starting debugger agent.
# > Debugger listening on [::]:12349Debugger listening on [::]:12348
  ---

mhdawson added a commit to mhdawson/io.js that referenced this issue Aug 11, 2016
We now have adequate AIX hardware to add AIX to
the regular regression runs.

However, there are a couple of failing tests even
though AIX was green at one point. This PR
marks those tests as flaky so that we can add AIX
and spot any new regressions without making
the builds RED.

The tests are being worked on under the following PRs:

- being worked on under nodejs#7564
test-async-wrap-post-did-throw
test-async-wrap-throw-from-callback
test-crypto-random

- being worked on under nodejs#7973
test-stdio-closed

- covered by nodejs#3796
test-debug-signal-cluster
mhdawson added a commit that referenced this issue Aug 11, 2016

PR-URL: #8065
Reviewed-By: joaocgreis - João Reis <[email protected]>
Reviewed-By: Rich Trott <[email protected]>
Reviewed-By: Anna Henningsen <[email protected]>

cjihrig pushed a commit that referenced this issue Aug 11, 2016
@Trott
Member

Trott commented Aug 26, 2016

@mhdawson Is there any chance one or both of the two setvbuf() calls in src/node_main.cc are failing? Maybe change them to this and run the test a bunch?

  if (setvbuf(stdout, nullptr, _IONBF, 0) != 0) {
    fprintf(stderr, "Could not unset buffering on stdout.\n");
    exit(1);
  }
  if (setvbuf(stderr, nullptr, _IONBF, 0) != 0) {
    fprintf(stderr, "Could not unset buffering on stderr.\n");
    exit(1);
  }

I'm trying to figure out how the output you pasted above could be generated and that's my best guess without any more information.

(Aside, although I put this in the #node-build IRC channel already a few minutes ago so now I'm probably just being annoying: Can we add AIX to the node-stress-single-test task so I can do stuff like this myself easily? /cc @joaocgreis)

@Trott
Member

Trott commented Aug 26, 2016

By the way, saw the same failure this morning on SmartOS:

https://ci.nodejs.org/job/node-test-commit-smartos/3917/nodes=smartos14-64/console

not ok 209 parallel/test-debug-signal-cluster # TODO : Fix flaky test
# got pids [89284,89293,89297]
# 
# assert.js:85
#   throw new assert.AssertionError({
#   ^
# AssertionError: test timed out.
#     at Timeout.testTimedOut [as _onTimeout] (/home/iojs/build/workspace/node-test-commit-smartos/nodes/smartos14-64/test/parallel/test-debug-signal-cluster.js:53:3)
#     at Timer.unrefdHandle (timers.js:462:14)
# > all workers are running
# > Starting debugger agent.
# > Debugger listening on 127.0.0.1:12447
# > Starting debugger agent.
# > Starting debugger agent.
# > Debugger listening on 127.0.0.1:12448Debugger listening on 127.0.0.1:12449
  ---
  duration_ms: 4.737

Stress test on SmartOS with the C++ change in my previous comment to see if setvbuf() failure might be what's causing the issue for SmartOS or not: https://ci.nodejs.org/job/node-stress-single-test/856/nodes=smartos14-64/console

@Trott
Member

Trott commented Aug 26, 2016

SmartOS failed the same way even though setvbuf() returned 0 (success). Guess I'll keep looking/poking.
¯\_(ツ)_/¯

@misterdjules

cfe76f2 reintroduced unbuffered mode for stderr on SmartOS, which made test-debug-signal-cluster.js flaky again.

Running test-debug-signal-cluster.js with DTrace, we can see that each debug output from the agent translates into two different write system calls:

[root@0c933410-7c9b-4bea-96b6-2db35946bc95 ~/node]# dtrace -n 'syscall::*write*:entry /progenyof($target) && arg0 == 2/ { printf("write [%s] (%d bytes)\n", copyinstr(arg1), arg2); }' -c './node test/parallel/test-debug-signal-cluster.js'
dtrace: description 'syscall::*write*:entry ' matched 5 probes
> all workers are running
got pids [2307,2332,2343]
> Starting debugger agent.
> Debugger listening on 127.0.0.1:12346
> Starting debugger agent.
> Starting debugger agent.
> Debugger listening on 127.0.0.1:12348
> Debugger listening on 127.0.0.1:12347
dtrace: pid 2194 has exited
CPU     ID                    FUNCTION:NAME
 37   6156                      write:entry write [all workers are running
??L] (24 bytes)

 37   6156                      write:entry write [Starting debugger agent.
] (25 bytes)

  0   6156                      write:entry write [Debugger listening on 127.0.0.1:12346] (37 bytes)

  0   6156                      write:entry write [
] (1 bytes)

 29   6156                      write:entry write [Starting debugger agent.
] (25 bytes)

  6   6156                      write:entry write [Starting debugger agent.
] (25 bytes)

  8   6156                      write:entry write [Debugger listening on 127.0.0.1:12348] (37 bytes)

  8   6156                      write:entry write [
] (1 bytes)

 31   6156                      write:entry write [Debugger listening on 127.0.0.1:12347] (37 bytes)

 31   6156                      write:entry write [
] (1 bytes)


[root@0c933410-7c9b-4bea-96b6-2db35946bc95 ~/node]#

Because this test creates several concurrent processes running the debug agent, these write calls can end up being interleaved, and the output does not correspond to what the test expects.

I'm not sure this is a bug in the way SmartOS handles unbuffered I/O, because passing _IONBF to setvbuf means setting a stream to the "unbuffered" mode, but I don't think this means that outputting a buffer to a stream should be atomic (as in, should result in one write system call).

So I would lean towards thinking that the test should be rewritten to accept output that is not properly synchronized between the processes it creates, or have these processes synchronize their output.
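For illustration, a sketch of what such an interleaving-tolerant check could look like (hypothetical names; the actual fix is in the PR referenced below):

  // Illustrative sketch, not the actual fix: count the expected messages in
  // the combined stderr of all child processes, ignoring order, line
  // boundaries, and write atomicity.
  const assert = require('assert');

  function countOccurrences(haystack, needle) {
    let count = 0;
    let idx = haystack.indexOf(needle);
    while (idx !== -1) {
      count++;
      idx = haystack.indexOf(needle, idx + needle.length);
    }
    return count;
  }

  function verifyAgentOutput(combinedStderr, expectedAgents) {
    // Strip line breaks so an interleaved write cannot split a message.
    const output = combinedStderr.replace(/\r|\n/g, '');
    assert.strictEqual(countOccurrences(output, 'Starting debugger agent.'),
                       expectedAgents);
    assert.strictEqual(countOccurrences(output, 'Debugger listening on '),
                       expectedAgents);
  }

Such a check would accept the 'Debugger listening on 127.0.0.1:12448Debugger listening on 127.0.0.1:12449' interleaving seen in the failures above.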

misterdjules pushed a commit to misterdjules/node-1 that referenced this issue Sep 16, 2016
Do not assume any order and buffering/atomicity of output from child
processes' debugger agents.

Fixes nodejs#3796.
@misterdjules

See #8568 for a PR that makes test-debug-signal-cluster.js not flaky.
