Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

test: exit sequence sanity tests [2] #25085

Merged
merged 1 commit into from
Feb 8, 2019

Conversation

gireeshpunathil
Copy link
Member

Execute JS code in worker through same vm context
while exiting from the main thread at arbitrary
execution points, and make sure that the workers
quiesce without crashing.

worker_threads are not necessarily the subject of
testing, those are used for easy simulation of
multi-thread scenarios.

Refs: #25007

Checklist
  • make -j4 test (UNIX), or vcbuild test (Windows) passes
  • tests and/or benchmarks are included
  • commit message follows commit guidelines

@nodejs-github-bot nodejs-github-bot added the test Issues and PRs related to the tests. label Dec 17, 2018
@gireeshpunathil
Copy link
Member Author

It captured a new error - good from test's perspective, but makes it a do-not-land-yet candidate

04:56:20 not ok 2057 parallel/test-worker-cleanexit-with-js
04:56:20   ---
04:56:20   duration_ms: 0.366
04:56:20   severity: fail
04:56:20   exitcode: 1
04:56:20   stack: |-
04:56:20     assert.js:86
04:56:20       throw new AssertionError(obj);
04:56:20       ^
04:56:20     
04:56:20     AssertionError [ERR_ASSERTION]: Expected values to be strictly equal:
04:56:20     + actual - expected
04:56:20     
04:56:20     + "undefined/data/iojs/build/workspace/node-test-commit-linuxone/nodes/rhel72-s390x/out/Release/node[45062]: ../src/node_native_module.cc:102:v8::Local<v8::String> node::native_module::NativeModuleLoader::GetSource(v8::Isolate*, const char*) const: Assertion `(it) != (source_.end())' failed.\n"
04:56:20     - undefined
04:56:20         at ChildProcess.child.on.common.mustCall (/data/iojs/build/workspace/node-test-commit-linuxone/nodes/rhel72-s390x/test/parallel/test-worker-cleanexit-with-js.js:44:12)
04:56:20         at ChildProcess.<anonymous> (/data/iojs/build/workspace/node-test-commit-linuxone/nodes/rhel72-s390x/test/common/index.js:335:15)
04:56:20         at ChildProcess.emit (events.js:189:13)
04:56:20         at Process.ChildProcess._handle.onexit (internal/child_process.js:254:12)
04:56:20   ...

source line:

CHECK_NE(it, source_.end());

native stack:

(dbx) where
pthread_kill(??, ??) at 0x90000000051beb4
_p_raise(??) at 0x90000000051b6e8
raise.raise(??) at 0x90000000002bd4c
abort() at 0x9000000000827c4
_ZN4node5AbortEv() at 0x1001f2314
_ZN4node6AssertEPA4_KPKc(??) at 0x1001f23d0
_ZNK4node13native_module18NativeModuleLoader9GetSourceEPN2v87IsolateEPKc(??, ??, ??) at 0x100875330
_ZN4node13native_module18NativeModuleLoader16LookupAndCompileEN2v85LocalINS2_7ContextEEEPKcPSt6vectorINS3_INS2_6StringEEESaISA_EENS1_21CompilationResultTypeEPNS_11EnvironmentE(??, ??, ??, ??, ??, ??) at 0x100875e98
_ZN4node13native_module18NativeModuleLoader15CompileAsModuleEPNS_11EnvironmentEPKcNS1_21CompilationResultTypeE(??, ??, ??) at 0x1008768f4
_ZN4node13native_module18NativeModuleLoader15CompileFunctionERKN2v820FunctionCallbackInfoINS2_5ValueEEE(??) at 0x100876c28
builtins-api._ZN2v88internal12_GLOBAL__N_119HandleApiCallHelperILb0EEENS0_11MaybeHandleINS0_6ObjectEEEPNS0_7IsolateENS0_6HandleINS0_10HeapObjectEEESA_NS8_INS0_20FunctionTemplateInfoEEENS8_IS4_EENS0_16BuiltinArgumentsE(??, ??, ??, ??, ??, ??, ??) at 0x100048bf0

@gireeshpunathil gireeshpunathil added the wip Issues and PRs that are still a work in progress. label Dec 17, 2018
@gireeshpunathil gireeshpunathil removed the wip Issues and PRs that are still a work in progress. label Jan 29, 2019
@gireeshpunathil
Copy link
Member Author

this can easily be accommodated in #25083, but needs to stay separate as:

  • While it shares a common objective and structure, the operating space for the workers in this test is more of JS than C++, and have proven to capture new errors in relation to thread interactions.
  • these tests are sensitive to time, keeping separately makes sure no interference
  • debugging is easy if kept separately.

CI: https://ci.nodejs.org/job/node-test-pull-request/20407/


// Allow workers to go live.
setTimeout(() => {
// process.exit(0);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe remove the commented out line?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Trott - it was a mistake to comment that out! The abrupt exit is required for the thread interactions to come in to play. I opened it up now.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so no wonder why CI did not report a single related failure. Realistically I was expecting one or two that this test can throw, based on its original run that threw up a lot!

@gireeshpunathil
Copy link
Member Author

@gireeshpunathil
Copy link
Member Author

Resume Build CI: https://ci.nodejs.org/job/node-test-commit/25439/

test/parallel/test-worker-cleanexit-with-js.js Outdated Show resolved Hide resolved
test/parallel/test-worker-cleanexit-with-js.js Outdated Show resolved Hide resolved
@gireeshpunathil
Copy link
Member Author

@gireeshpunathil gireeshpunathil added the blocked PRs that are blocked by other issues or PRs. label Feb 5, 2019
@gireeshpunathil
Copy link
Member Author

The test fails in AIX and Windows.
As it is evident, the test do not carry any issue - both are native crashes.
So I added a blocked label as opposed to WIP one.

AIX

18:08:15 not ok 2397 parallel/test-worker-cleanexit-with-js
18:08:15   ---
18:08:15   duration_ms: 120.217
18:08:15   severity: fail
18:08:15   exitcode: -15
18:08:15   stack: |-
18:08:15     timeout
18:08:15   ...

windows

18:52:34 not ok 534 parallel/test-worker-cleanexit-with-js
18:52:34   ---
18:52:34   duration_ms: 37.447
18:52:34   severity: fail
18:52:34   exitcode: -529697949
18:52:34   stack: |-

@gireeshpunathil
Copy link
Member Author

sorry, it was timeout on AIX!

@gireeshpunathil
Copy link
Member Author

Interesting story from AIX: In a system where the no. of CPUs are less than 10 (the number of workers in this test), the already spawned workers spin their full allotted time in the JS loop to cause OOM, before main thread gets into action - mostly it was de-scheduled while one or more workers ran.

So I want to revisit the while(true) loop - replace it with a setInterval . But before that will need to see what happens in Windows.

@addaleax
Copy link
Member

addaleax commented Feb 5, 2019

@gireeshpunathil Is that normal for AIX/by design, non-preemptive multithreading?

@gireeshpunathil
Copy link
Member Author

@addaleax - in terms of multi-tasking, there is nothing abnormal; AIX has a truly preemptive kernel.

The difference comes in way of deciding / allocating the current CPU post-fork: whether to continue to give it to the parent, or give child a chance and push parent to the next available cpu - only in the current slice of schedule cycle of the current thread of the current process. Since then all processes contend for CPU in a similar manner of Linux, driven by scheduler policies.

While no documented evidence exists, all my experiments has proven that:

  • Linux always runs the parent in the current CPU, consistently.
  • AIX 'mostly' gives it to the child. Reverse is also seen rarely, probably depends on factors internal to the scheduler.

cat tc.c

#include <unistd.h>
#include <stdio.h>
int main() {
  pid_t pid = fork();
  if (pid == 0) write(1, "child\n", 6);
  else write(1, "parent\n", 7);
}

linux:
parent
child

clone(child_stack=0, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0x7f824a355a10) = 100129
strace: Process 100129 attached
[pid 100128] write(1, "parent\n", 7)    = 7
[pid 100129] write(1, "child\n", 6)     = 6

aix:
child
parent

15270132: 100728997: kfork()                            = 18153586
18153586:       kfork()         (returning as child ...)        = 0
18153586:       23462101: kwrite(1, " c h i l d\n", 6)          = 6
18153586:       23462101: kfcntl(1, F_GETFL, 0x20000E0C)        = 67110914
18153586:       23462101: kfcntl(2, F_GETFL, 0x2FF22FFC)        = 67108865
18153586:       23462101: _exit(0)
15270132: 100728997: kwrite(1, " p a r e n t\n", 7)     = 7

the same applies to pthread_create as well.

In applications of most other languages this does not imply anything: slight difference in terms of CPU allocation, and who gets ahead slightly. But given that node threads can run millions of instructions without any blocking calls coming in between, this can cause visible differences in our workloads - such as many timing related issues we usually see in child process related tests.

So in this case one or more workers ate up their remaining slice leading upto OOM before main thread gets a CPU.

So I would like to revert the while loop with setInterval if you are good with that.

@addaleax
Copy link
Member

addaleax commented Feb 6, 2019

@gireeshpunathil Sure, if you think so. It just seemed pretty odd to me that AIX would act like this/that this could would OOM.

@gireeshpunathil
Copy link
Member Author

Execute JS code in worker through same vm context
while exiting from the main thread at arbitrary
execution points, and make sure that the workers
quiesce without crashing.

`worker_threads` are not necessarily the subject of
testing, those are used for easy simulation of
multi-thread scenarios.

Refs: nodejs#25007

PR-URL: nodejs#25085
Reviewed-By: Anna Henningsen <[email protected]>
@gireeshpunathil
Copy link
Member Author

landed as 2ea2000

@gireeshpunathil gireeshpunathil removed the blocked PRs that are blocked by other issues or PRs. label Feb 8, 2019
addaleax pushed a commit that referenced this pull request Feb 8, 2019
Execute JS code in worker through same vm context
while exiting from the main thread at arbitrary
execution points, and make sure that the workers
quiesce without crashing.

`worker_threads` are not necessarily the subject of
testing, those are used for easy simulation of
multi-thread scenarios.

Refs: #25007

PR-URL: #25085
Reviewed-By: Anna Henningsen <[email protected]>
@targos targos mentioned this pull request Feb 14, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
test Issues and PRs related to the tests.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants