parallel/test-child-process-exec-timeout not cleaning up child process on win7 #11052
cc/ @nodejs/platform-windows
Looks like the test was added in #9208. Based on that also cc/ @santigimeno
@BethGriggs does the test pass, even though the child processes are left behind?
@cjihrig Yes it does pass, you can run
@gibfahn when I was running it, I was able to recreate the issue by running
The last parallel run I did ended up with three tests hanging - test-child-process-flush-stdio, test-child-process-exec-stdout-stderr-data-string, and test-child-process-exit-code. These tests all hung until I manually killed a few of the test-child-process-exec-timeout child processes. There seem to be two issues here:
To reproduce:
It's too bad exec doesn't allow an ipc channel to be attached to the child. It is possible that the test can be rewritten so that the child either replaces or augments its internal timeout with a tcp connection to its parent process, and a handler that self-exits when that connection cannot be made, or is lost - this will limit the lifetime of the child to just a bit more than the parent - right now it looks like it won't terminate if its parent terminates.
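For what it's worth, a minimal sketch of that suggestion might look like the following. This is not the actual test; the self-spawning layout, the port argument, and the timings are assumptions made purely for illustration. The parent listens on an ephemeral TCP port and passes it to the child; the child exits as soon as its connection to the parent cannot be made or goes away, so it can only outlive the parent briefly.

```js
'use strict';
const cp = require('child_process');
const net = require('net');

if (process.argv[2] === 'child') {
  // Child: tie our lifetime to the parent via a TCP connection.
  const watchdog = net.connect(Number(process.argv[3]), '127.0.0.1');
  watchdog.on('error', () => process.exit(1));  // connection could not be made
  watchdog.on('close', () => process.exit(1));  // parent went away
  // ... the child's real work (e.g. outliving the exec timeout) goes here ...
  setTimeout(() => {}, 10000);
} else {
  // Parent: listen on an ephemeral port and hand it to the child. When this
  // process exits for any reason, the OS drops the connection and the child
  // self-exits shortly afterwards.
  const server = net.createServer((sock) => sock.unref()).listen(0, '127.0.0.1', () => {
    const port = server.address().port;
    cp.exec(`"${process.execPath}" "${__filename}" child ${port}`,
            { timeout: 1000, killSignal: 'SIGTERM' },
            () => server.close());  // stop listening once the exec callback fires
  });
}
```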
Is it possible to identify which of the three child processes is the problematic one (the test could be modified to pass an additional unique argument to each child)? Shouldn't the child processes be exiting after their
This test is pretty subtle. For one thing, node is not the "child" process that is getting killed, the shell is. That means that node never gets signalled, it always times out, and attempts to write its output. I couldn't figure out how it ever passed, even on Linux, so I read child_process. It turns out that the timeout implementation has an undocumented behaviour: before it signals its child (the shell) it destroys the stdout/stderr pipes from the shell. It's because of this pipe destruction that the output is not seen, not because the child is killed before the timeout. As I understand it, bash on OS X has a unique behaviour where it signals its pgrp on exit, causing its child (the test's grandchild) to die, so that's a bit different, but it also works. The reason the test works is so non-obvious that even though it does (on Unix), I don't think it works as the author intended.

As for Windows, the description above is confusing. What it sounds like is happening, from talking to Jeremiah on my team (EDIT: @Akpotohwo), is that the test never completes and is eventually killed by the test runner, so the real problem is the test never completing - not that children leak (that's a follow-on effect of the test runner killing the top-level node process). However, he also says that only the first exec with timeout using SIGTERM is leaking. That is odd, because on Windows SIGTERM and SIGKILL are handled identically (see uv__kill in deps/uv/src/windows), though after the child terminates UV lies and claims it was killed with a signal (it wasn't, Windows has no signals, but the lie makes node more uniform across Windows and Linux). I don't know why this is, but I'm wondering if there is something about the destruction of the stdout/err pipes that is confusing the child (remember, the child is
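To make that concrete, here is a hedged, standalone sketch of the pattern (not the actual test; the command string and the timings are invented): the exec timeout destroys the shell's stdout/stderr pipes and then signals the shell, so the callback sees no output even if the grandchild survives and writes later.

```js
'use strict';
const assert = require('assert');
const cp = require('child_process');

// The grandchild tries to print after 2 seconds; the exec timeout fires at 200ms.
const cmd = `"${process.execPath}" -e "setTimeout(() => console.log('late output'), 2000)"`;

cp.exec(cmd, { timeout: 200, killSignal: 'SIGTERM' }, (err, stdout, stderr) => {
  // The timeout fired: child_process reports the child as killed with the signal
  // (on Windows this is the "lie" described above)...
  assert.strictEqual(err.killed, true);
  assert.strictEqual(err.signal, 'SIGTERM');
  // ...and no output is captured, because the pipes were destroyed before the
  // signal was sent - not necessarily because the grandchild is dead.
  assert.strictEqual(stdout, '');
  assert.strictEqual(stderr, '');
});
```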
This is what we observed (on Linux) when looking into test-child-process-exec-kill-throws flakiness. So IIUIC killing a process launched with
There is a big note: https://nodejs.org/api/child_process.html#child_process_child_process_execfile_file_args_options_callback (EDIT: the note is above that link), but it maybe doesn't go far enough in describing the implications of this. I've seen people complain that
The fact that the child pipes are destroyed should be documented as well.
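As a hedged, Unix-only illustration of the shell-wrapping point (the `sleep` command line and the process-group workaround are a common pattern, not something the test or the docs currently show):

```js
'use strict';
const cp = require('child_process');

// exec() runs the command string inside a shell, so child.pid is the shell's
// pid. Killing `child` signals the shell; the `sleep` it forked keeps running
// (and keeps the stdout pipe open, so this parent lingers too).
const child = cp.exec('sleep 100; echo done');
setTimeout(() => child.kill('SIGTERM'), 100);

// A common workaround: make the shell a process-group leader and signal the
// whole group via a negative pid.
const grouped = cp.spawn('sh', ['-c', 'sleep 100; echo done'], { detached: true });
setTimeout(() => process.kill(-grouped.pid, 'SIGTERM'), 100);
```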
Can someone confirm if this is still an issue on Windows? I'm guessing it is, but the test has had a few changes since the problem was reported, so ¯\_(ツ)_/¯
@Trott, I'll see if I can reproduce with the latest test changes
Processes are still being left over with the latest test changes. 10 runs of parallel/test-child-process-exec-timeout result in about 6-7 leftover processes. I tested with master, 8.x, and 6.x. However, I have not been able to reproduce the second part of the issue where the parallel test suite hangs until I kill some of the remaining processes.
I hate to ask every six months and just hope that the problem goes away, but.... Is this still an issue?
Yes, I saw this yesterday in a local run. And I believe this could potentially cascade into other failures as well, for example
FWIW, I'm able to emulate the leaving-a-process-lying-around on macOS if I run enough of these in parallel. |
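For anyone else trying this, a throwaway driver along those lines might look like the sketch below; the instance count and the test path are assumptions, not the exact script used here.

```js
'use strict';
const cp = require('child_process');

// Launch many copies of the test in parallel; afterwards, check `ps` for
// node processes that were left behind.
const runs = 64;
for (let i = 0; i < runs; i++) {
  cp.execFile(process.execPath,
              ['test/parallel/test-child-process-exec-timeout.js'],
              (err) => {
                if (err) console.error(`run ${i} failed:`, err.message);
              });
}
```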
Thanks @Trott, your simple test case was wonderful, and it brought out the hang consistently on macOS in a couple of iterations when I ran 64 instances altogether.

#ps
PID TTY TIME CMD
1707 ttys000 0:00.66 -bash
39634 ttys000 0:00.14 node 20414.js
39701 ttys000 0:01.23 /usr/local/bin/node /Users/gireeshpunathil/Desktop/support/issues/20414.js chil
1714 ttys001 0:00.01 -bash
8182 ttys002 0:00.03 -bash
38346 ttys003 0:00.06 -bash

An lldb attach showed something surprising: the child had indeed succumbed to SIGSEGV, but that was not terminating the process; instead it was causing high CPU - probably the crash handler and the tracing agent are running in a tight loop or something like that. I have no clues about that, don't know whether

#lldb
(lldb) attach 39701
Process 39701 stopped
* thread #1, queue = 'com.apple.main-thread', stop reason = EXC_BAD_ACCESS (code=1, address=0x440)
frame #0: 0x00000001000a7c2d node`node::tracing::Agent::Stop() + 9
node`node::tracing::Agent::Stop:
-> 0x1000a7c2d <+9>: cmpb $0x0, 0x440(%rbx)
0x1000a7c34 <+16>: je 0x1000a7c69 ; <+69>
0x1000a7c36 <+18>: movq 0x460(%rbx), %rdi
0x1000a7c3d <+25>: callq 0x100abf040 ; v8::platform::tracing::TracingController::StopTracing()
Target 0: (node) stopped.
Executable module set to "/usr/local/bin/node".
Architecture set to: x86_64h-apple-macosx.
(lldb) bt
* thread #1, queue = 'com.apple.main-thread', stop reason = EXC_BAD_ACCESS (code=1, address=0x440)
* frame #0: 0x00000001000a7c2d node`node::tracing::Agent::Stop() + 9
frame #1: 0x0000000100030beb node`node::SignalExit(int) + 25
frame #2: 0x00007fff5d7faf5a libsystem_platform.dylib`_sigtramp + 26
frame #3: 0x00007fff5d7ff3e9 libsystem_platform.dylib`_platform_memmove + 1
frame #4: 0x00007fff5b5ba272 libc++.1.dylib`std::__1::basic_streambuf<char, std::__1::char_traits<char> >::xsputn(char const*, long) + 76
frame #5: 0x00007fff5b5cc2ec libc++.1.dylib`std::__1::ostreambuf_iterator<char, std::__1::char_traits<char> > std::__1::__pad_and_output<char, std::__1::char_traits<char> >(std::__1::ostreambuf_iterator<char, std::__1::char_traits<char> >, char const*, char const*, char const*, std::__1::ios_base&, char) + 294
frame #6: 0x00007fff5b5cc5c9 libc++.1.dylib`std::__1::num_put<char, std::__1::ostreambuf_iterator<char, std::__1::char_traits<char> > >::do_put(std::__1::ostreambuf_iterator<char, std::__1::char_traits<char> >, std::__1::ios_base&, char, unsigned long) const + 305
frame #7: 0x00007fff5b5c03c6 libc++.1.dylib`std::__1::basic_ostream<char, std::__1::char_traits<char> >::operator<<(unsigned long) + 240
frame #8: 0x000000010050ce17 node`v8::internal::ComputeFlagListHash() + 295
frame #9: 0x0000000100032afb node`node::Init(int*, char const**, int*, char const***) + 848
frame #10: 0x0000000100033294 node`node::Start(int, char**) + 96
frame #11: 0x0000000100001034 node`start + 52
(lldb) disassemble -f
node`node::tracing::Agent::Stop:
0x1000a7c24 <+0>: pushq %rbp
0x1000a7c25 <+1>: movq %rsp, %rbp
0x1000a7c28 <+4>: pushq %rbx
0x1000a7c29 <+5>: pushq %rax
0x1000a7c2a <+6>: movq %rdi, %rbx
-> 0x1000a7c2d <+9>: cmpb $0x0, 0x440(%rbx)
(lldb) reg read rbx
rbx = 0x0000000000000000
(lldb)

This corresponds to this source line. The crash happened when trying to access the member field
If I manually kill the child, the parent gets into action with the usual failure message. I guess the python parent in the actual test case also does the same.
edit: source line is this: Line 40 in 6af26b1
The Mac spin dump showed that these frames were always there in the 1000 samples taken over 10 seconds! I am trying to make some sense out of it.

1000 start + 52 (node + 4148) [0x100001034]
1000 node::Start(int, char**) + 96 (node + 209556) [0x100033294]
1000 node::Init(int*, char const**, int*, char const***) + 37 (node + 206800) [0x1000327d0]
1000 uv_default_loop + 33 (node + 9613071) [0x10092af0f]
1000 uv_loop_init + 361 (node + 9633437) [0x10092fe9d]
1000 uv_once + 9 (node + 9659457) [0x100936441]
1000 pthread_once + 56 (libsystem_pthread.dylib + 14681) [0x7fff5d804959]
1000 _os_once + 33 (libsystem_platform.dylib + 3840) [0x7fff5d7f9f00]
1000 __pthread_once_handler + 66 (libsystem_pthread.dylib + 14782) [0x7fff5d8049be]
1000 uv__signal_global_init + 47 (node + 9641262) [0x100931d2e]
1000 ??? (node + 22593584) [0x10158c030]
1000 dyld_stub_binder + 282 (libdyld.dylib + 5058) [0x7fff5d57a3c2]
1000 dyld::fastBindLazySymbol(ImageLoader**, unsigned long) + 89 (dyld + 17752) [0x1113e2558]
1000 ImageLoaderMachOCompressed::doBindFastLazySymbol(unsigned int, ImageLoader::LinkContext const&, void (*)(), void (*)()) + 250 (dyld + 121610) [0x1113fbb0a]
1000 ImageLoaderMachOCompressed::resolve(ImageLoader::LinkContext const&, char const*, unsigned char, long, ImageLoader const**, ImageLoaderMachOCompressed::LastLookup*, bool) + 308 (dyld + 115412) [0x1113fa2d4]
1000 ImageLoaderMachOCompressed::resolveTwolevel(ImageLoader::LinkContext const&, char const*, ImageLoader const*, ImageLoader const*, unsigned int, bool, bool, ImageLoader const**) + 86 (dyld + 114774) [0x1113fa056]
1000 ImageLoader::findExportedSymbolAddress(ImageLoader::LinkContext const&, char const*, ImageLoader const*, int, bool, ImageLoader const**, unsigned long*) const + 50 (dyld + 67152) [0x1113ee650]
1000 ImageLoader::weakBind(ImageLoader::LinkContext const&) + 1204 (dyld + 70078) [0x1113ef1be]
1000 _sigtramp + 26 (libsystem_platform.dylib + 8026) [0x7fff5d7faf5a]
1000 node::SignalExit(int) + 25 (node + 199659) [0x100030beb]
1000 node::tracing::Agent::Stop() + 9 (node + 687149) [0x1000a7c2d]
@gireeshpunathil That's #14802.
This is awesome, thanks @bnoordhuis! With that info, it is now pretty easy to reproduce without a child_process in hand. Insert this patch and just run node.

--- a/deps/uv/src/unix/signal.c
+++ b/deps/uv/src/unix/signal.c
@@ -73,6 +73,7 @@ static void uv__signal_global_init(void) {
*/
if (pthread_atfork(NULL, NULL, &uv__signal_global_reinit))
abort();
+ raise(SIGTERM);

This loops between the handler and the Agent::Stop() method with high CPU.

#lldb ./node
(lldb) target create "./node"
Current executable set to './node' (x86_64).
(lldb) r
Process 46709 launched: './node' (x86_64)
Process 46709 stopped
* thread #1, queue = 'com.apple.main-thread', stop reason = signal SIGTERM
frame #0: 0x00007fff5d6c9fce libsystem_kernel.dylib`__pthread_kill + 10
libsystem_kernel.dylib`__pthread_kill:
-> 0x7fff5d6c9fce <+10>: jae 0x7fff5d6c9fd8 ; <+20>
0x7fff5d6c9fd0 <+12>: movq %rax, %rdi
0x7fff5d6c9fd3 <+15>: jmp 0x7fff5d6c176c ; cerror_nocancel
0x7fff5d6c9fd8 <+20>: retq
Target 0: (node) stopped.
(lldb) bt
* thread #1, queue = 'com.apple.main-thread', stop reason = signal SIGTERM
* frame #0: 0x00007fff5d6c9fce libsystem_kernel.dylib`__pthread_kill + 10
frame #1: 0x00007fff5d807150 libsystem_pthread.dylib`pthread_kill + 333
frame #2: 0x00007fff5d5d88fe libsystem_c.dylib`raise + 26
frame #3: 0x0000000100932568 node`uv__signal_global_init at signal.c:76 [opt]
frame #4: 0x00007fff5d8049be libsystem_pthread.dylib`__pthread_once_handler + 66
frame #5: 0x00007fff5d7f9f00 libsystem_platform.dylib`_os_once + 33
frame #6: 0x00007fff5d804959 libsystem_pthread.dylib`pthread_once + 56
frame #7: 0x0000000100936c2d node`uv_once(guard=<unavailable>, callback=<unavailable>) at thread.c:363 [opt]
frame #8: 0x00000001009306fe node`uv_loop_init(loop=0x000000010162de40) at loop.c:70 [opt]
frame #9: 0x000000010092ba13 node`uv_default_loop at uv-common.c:599 [opt]
frame #10: 0x0000000100030f65 node`node::Init(argc=0x00007ffeefbff99c, argv=0x0000000102500370, exec_argc=0x00007ffeefbff998, exec_argv=0x00007ffeefbff990) at node.cc:4245 [opt]
frame #11: 0x00000001000319e7 node`node::Start(argc=1, argv=0x0000000102500370) at node.cc:4625 [opt]
frame #12: 0x0000000100001034 node`start + 52
(lldb) c
Process 46709 resuming
node was compiled with optimization - stepping may behave oddly; variables may not be available.
Process 46709 stopped
* thread #1, queue = 'com.apple.main-thread', stop reason = EXC_BAD_ACCESS (code=1, address=0x440)
frame #0: 0x00000001000a1865 node`node::tracing::Agent::Stop(this=0x0000000000000000) at agent.cc:40 [opt]
37 }
38
39 void Agent::Stop() {
-> 40 if (!started_)
41 return;
42 // Perform final Flush on TraceBuffer. We don't want the tracing controller
43 // to flush the buffer again on destruction of the V8::Platform.
Target 0: (node) stopped.
(lldb) f 1
frame #1: 0x000000010002f345 node`node::SignalExit(int) [inlined] node::$_0::StopTracingAgent() at node.cc:326 [opt]
323 }
324
325 void StopTracingAgent() {
-> 326 tracing_agent_->Stop();
327 }
328
329 tracing::Agent* GetTracingAgent() const {
(lldb) p tracing_agent_
error: warning: Stopped in a C++ method, but 'this' isn't available; pretending we are in a generic context
error: use of undeclared identifier 'tracing_agent_'

Looks like the agent is simply not available at such an early point in the bootup?
This is fixed through #20637.
The test parallel/test-child-process-exec-timeout is not always cleaning up its child processes. When I run the test 10 times (tools/test.py parallel/test-child-process-exec-timeout), I get around 4-5 leftover child processes.
Occasionally, when the processes are left lying around, the parallel test suite 'hangs' on the 1157th test. The test does not time out with the test runner in this instance - it hangs permanently and the test suite never completes.
/cc @cjihrig @gibfahn