Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Flaky test-async-wrap-uncaughtexception #16210

Closed
joyeecheung opened this issue Oct 15, 2017 · 32 comments
Closed

Flaky test-async-wrap-uncaughtexception #16210

joyeecheung opened this issue Oct 15, 2017 · 32 comments
Labels
async_hooks Issues and PRs related to the async hooks subsystem. flaky-test Issues and PRs related to the tests with unstable failures on the CI. test Issues and PRs related to the tests.

Comments

@joyeecheung
Copy link
Member

  • Version: master
  • Platform: Windows
  • Subsystem: test

I have seen multiple errors from this one on Windows this week. sample

not ok 35 parallel/test-async-wrap-uncaughtexception
  ---
  duration_ms: 0.337
  severity: fail
  stack: |-
    Mismatched <anonymous> function calls. Expected exactly 1, actual 2.
        at Object.exports.mustCall (c:\workspace\node-test-binary-windows\COMPILED_BY\vs2015\RUNNER\win2008r2-vs2017\RUN_SUBSET\0\test\common\index.js:486:10)
        at Object.<anonymous> (c:\workspace\node-test-binary-windows\COMPILED_BY\vs2015\RUNNER\win2008r2-vs2017\RUN_SUBSET\0\test\parallel\test-async-wrap-uncaughtexception.js:14:33)
        at Module._compile (module.js:599:30)
        at Object.Module._extensions..js (module.js:610:10)
        at Module.load (module.js:518:32)
        at tryModuleLoad (module.js:481:12)
        at Function.Module._load (module.js:473:3)
        at Function.Module.runMain (module.js:640:10)
        at startup (bootstrap_node.js:187:16)
@joyeecheung joyeecheung added flaky-test Issues and PRs related to the tests with unstable failures on the CI. test Issues and PRs related to the tests. labels Oct 15, 2017
@joyeecheung
Copy link
Member Author

joyeecheung commented Oct 15, 2017

cc @nodejs/async_hooks @nodejs/platform-windows

@Trott
Copy link
Member

Trott commented Oct 15, 2017

This is happening pretty much constantly as of about 12 hours ago. Yikes! I don't see anything obvious that has landed that would cause this.

@nodejs/testing @nodejs/build

@AndreasMadsen
Copy link
Member

AndreasMadsen commented Oct 15, 2017

I've had this issue in #15538 for a long time (6 days). Where it also appears to be unrelated to my changes.

edit: fixed wrong link.

@AndreasMadsen AndreasMadsen added async_wrap async_hooks Issues and PRs related to the async hooks subsystem. labels Oct 15, 2017
@joaocgreis
Copy link
Member

@addaleax
Copy link
Member

Investigating one of the affected CI machines, this does never happen when running node directly (cmd or git bash) but always happens when using test.py

@joaocgreis Taking a guess based on that: This might be because test.py sets up stdio differently than a plain run of Node from the command line, so the stdio objects in Node use different implementations – do you think you could try to verify/disprove that?

@joaocgreis
Copy link
Member

joaocgreis commented Oct 16, 2017

@addaleax verified. Redirecting both stdout and stderr to a file makes this happen in cmd. Thanks!

How to check stderr type for future reference
Release\node.exe -e "console.log(util.inspect(process.stderr))"
Release\node.exe -e "console.log(util.inspect(process.stderr))" 2>&1 | cat
Release\node.exe -e "console.log(util.inspect(process.stderr))" > file 2>&1 & type file

Some more notes:

  • The directory where the test is run makes a difference. In the CI machine where I'm running tests, running in c:\workspace\node-test-binary-windows\COMPILED_BY\vs2015\RUNNER\win2012r2\RUN_SUBSET\0 makes this happen every time, but in any parent or subdir it does never happen. Renaming the directories does not seem to matter, apparently only the number of directories does

  • I can't make this reproduce in a local VM with the same OS and the same binaries generated by CI that fail in the CI worker. Even in the same directory

  • Current working directory does not seem to influence, running with full path fails in any directory

  • Moving the whole test (or just the last group of lines) to a setImmediate makes the test pass

So, the error is that beforeExit is running twice in

process.on('beforeExit', common.mustCall(() => {
process.removeAllListeners('uncaughtException');
hooks.disable();
assert.strictEqual(typeof call_id, 'number');
assert.deepStrictEqual(call_log, [1, 1, 1, 1]);
}));

Can anything there be scheduling more async work?

@joaocgreis
Copy link
Member

I've changed the workspace directory of node-test-binary-windows (simplified it to c:\workspace\node-test-binary-windows\) and haven't seen this happen since in CI, but this is still an issue.

@refack
Copy link
Contributor

refack commented Oct 29, 2017

@joyeecheung
Copy link
Member Author

@joyeecheung
Copy link
Member Author

Also I vaguely remember that I have seen this on my macbook, but it does not reproduce much.

@Trott
Copy link
Member

Trott commented Oct 30, 2017

Happened twice in a row on Windows 10 in CI:

https://ci.nodejs.org/job/node-test-binary-windows/12364/COMPILED_BY=vcbt2015,RUNNER=win10,RUN_SUBSET=3/console
https://ci.nodejs.org/job/node-test-binary-windows/12365/COMPILED_BY=vcbt2015,RUNNER=win10,RUN_SUBSET=3/console:

not ok 36 parallel/test-async-wrap-uncaughtexception
  ---
  duration_ms: 0.188
  severity: fail
  stack: |-
    Mismatched <anonymous> function calls. Expected exactly 1, actual 2.
        at Object.exports.mustCall (c:\workspace\node-test-binary-windows\test\common\index.js:501:10)
        at Object.<anonymous> (c:\workspace\node-test-binary-windows\test\parallel\test-async-wrap-uncaughtexception.js:14:33)
        at Module._compile (module.js:617:30)
        at Object.Module._extensions..js (module.js:628:10)
        at Module.load (module.js:536:32)
        at tryModuleLoad (module.js:479:12)
        at Function.Module._load (module.js:471:3)
        at Function.Module.runMain (module.js:658:10)
        at startup (bootstrap_node.js:191:16)

@Trott
Copy link
Member

Trott commented Oct 30, 2017

Can anything there be scheduling more async work?

process.removeAllListeners() emits removeListener events, so that may be scheduling more work, maybe?

If we're pretty sure the being-emitted-twice thing is not actually a bug but dependent on the state of the machine, easiest fix might be to use process.once() instead. I'll submit a PR now-ish...

Trott added a commit to Trott/io.js that referenced this issue Oct 30, 2017
It's possible for `beforeExit` to be emitted more than once if work is
scheduled by its listeners. Accommodate this fact in the test.

Fixes: nodejs#16210
@Trott
Copy link
Member

Trott commented Oct 30, 2017

Fix (maybe?) in #16598

@refack
Copy link
Contributor

refack commented Oct 30, 2017

Seems like we now have a repro - https://ci.nodejs.org/job/node-stress-single-test/1475/
(Ubuntu1604x64).
All the above clues makes me thing this is a memory corruption bug...

@joyeecheung
Copy link
Member Author

@Trott that commit landed 6 days ago while this issue was opened 18 days ago...I think there could be more to it, might need to audit some crypto error PRs

@Trott
Copy link
Member

Trott commented Nov 2, 2017

@joyeecheung Urm...hmmm... Maybe that commit caused the failures to happen on ubuntu 1604 which only started failing more recently? Maybe the bug has deeper roots for Windows?

I did the stress testing on Ubuntu so it would go faster, but I guess I may need to take a deep breath, set aside several hours, and do it on Windows....

@joyeecheung
Copy link
Member Author

@Trott I looked into e567402 again and AFAICT it does not seem to touch the code path tested in test-async-wrap-uncaughtexception. Maybe we could switch to another async hooks provider in the test to see if the root is in crypto?

@Trott
Copy link
Member

Trott commented Nov 2, 2017

Hmmm....after 12 hours of nothing but red (presumably due to this issue), Windows is coming back green/yellow/red now...

@Trott
Copy link
Member

Trott commented Nov 2, 2017

Argh, nope, still a thing, now showing up on fedora24:

https://ci.nodejs.org/job/node-test-commit-linux/13748/nodes=fedora24/console

  ...
not ok 136 parallel/test-async-wrap-uncaughtexception
  ---
  duration_ms: 0.138
  severity: fail
  stack: |-
    Mismatched <anonymous> function calls. Expected exactly 1, actual 2.
        at Object.exports.mustCall (/home/iojs/build/workspace/node-test-commit-linux/nodes/fedora24/test/common/index.js:490:10)
        at Object.<anonymous> (/home/iojs/build/workspace/node-test-commit-linux/nodes/fedora24/test/parallel/test-async-wrap-uncaughtexception.js:14:33)
        at Module._compile (module.js:641:30)
        at Object.Module._extensions..js (module.js:652:10)
        at Module.load (module.js:560:32)
        at tryModuleLoad (module.js:503:12)
        at Function.Module._load (module.js:495:3)
        at Function.Module.runMain (module.js:682:10)
        at startup (bootstrap_node.js:191:16)

@joyeecheung
Copy link
Member Author

joyeecheung commented Nov 2, 2017

BTW I have seen this a few times this week on my macbook, but have never been able to reproduce reliably. It only shows up when I run the tests in parallel.

@Trott
Copy link
Member

Trott commented Nov 2, 2017

I'm worried about this one at Code + Learn next week because I can totally imagine a third of the room having the test fail locally or something like that....

@jasnell
Copy link
Member

jasnell commented Nov 2, 2017

FYI: #16692

@refack
Copy link
Contributor

refack commented Nov 2, 2017

I'm worried about this one at Code + Learn next week because I can totally imagine a third of the room having the test fail locally or something like that....

Alternative: Marking as flaky #16694 PR

@jasnell RE: #16692 we see either a regression between 8.6.0 and 8.7.0 or some level of non-determinism, so IMHO we shouldn't force the test to pass and so hide the bug

@refack refack mentioned this issue Nov 3, 2017
3 tasks
refack added a commit to refack/node that referenced this issue Nov 3, 2017
`parallel/test-async-wrap-uncaughtexception.js` has become flaky.
At this time investigating the cause is still on going, but this issue
become has prevalent. In order to restore CI status to be relevant,
this marks the test as explicitly FLAKY.

PR-URL: nodejs#16694
Refs: nodejs#16210
Reviewed-By: Rich Trott <[email protected]>
Reviewed-By: Colin Ihrig <[email protected]>
Reviewed-By: Myles Borins <[email protected]>
@MylesBorins
Copy link
Contributor

I created a new project to track flakes on CI and put a variety of related issues to this test in a column. Please feel free to add more related issues / prs

@Trott
Copy link
Member

Trott commented Nov 4, 2017

FWIW, moving this test to sequential does not solve the problem (tried it in #16733) so having too much other stuff happening on the system is unlikely to be the problem.

@Trott Trott changed the title Flaky test-async-wrap-uncaughtexception on Windows Flaky test-async-wrap-uncaughtexception Nov 4, 2017
@Trott
Copy link
Member

Trott commented Nov 4, 2017

I can replicate this if I use the inspector and step into or out of stuff.

@Trott
Copy link
Member

Trott commented Nov 4, 2017

I'm starting to suspect that the extra beforeExit call might be caused either by the changes starting in the block starting on line 4708 of src/node.cc in f27b5e4bdaa or (more likely, perhaps?) the V8 changes that commit is trying to accommodate. Haven't been able to pin it down yet, but if a C++ person wants to do some debugging, please do. :-D (And, of course, I could be totally wrong in my suspicion, just like I've been a multiple times already with this issue.)

@addaleax
Copy link
Member

addaleax commented Nov 4, 2017

@Trott The only V8 functionality I know of that will require the change is asynchronous WASM compilation, which is currently behind a flag even in V8 master.

It might be related to f27b5e4, but I’d be surprised since all I can see V8 doing when the test is being executed is scheduling background tasks, no foreground or delayed foreground tasks…

addaleax added a commit to addaleax/node that referenced this issue Nov 4, 2017
cjihrig pushed a commit to cjihrig/node that referenced this issue Nov 6, 2017
`parallel/test-async-wrap-uncaughtexception.js` has become flaky.
At this time investigating the cause is still on going, but this issue
become has prevalent. In order to restore CI status to be relevant,
this marks the test as explicitly FLAKY.

PR-URL: nodejs#16694
Refs: nodejs#16210
Reviewed-By: Rich Trott <[email protected]>
Reviewed-By: Colin Ihrig <[email protected]>
Reviewed-By: Myles Borins <[email protected]>
addaleax added a commit to addaleax/node that referenced this issue Nov 11, 2017
Do not let an internal handle keep the event loop alive.

Fixes: nodejs#16210
@joyeecheung
Copy link
Member Author

This has been fixed and can be closed now.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
async_hooks Issues and PRs related to the async hooks subsystem. flaky-test Issues and PRs related to the tests with unstable failures on the CI. test Issues and PRs related to the tests.
Projects
None yet
8 participants