Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Investigate flaky sequential/test-child-process-execsync on AIX #24921

Closed
Trott opened this issue Dec 9, 2018 · 11 comments
Closed

Investigate flaky sequential/test-child-process-execsync on AIX #24921

Trott opened this issue Dec 9, 2018 · 11 comments
Labels
aix Issues and PRs related to the AIX platform. flaky-test Issues and PRs related to the tests with unstable failures on the CI.

Comments

@Trott
Copy link
Member

Trott commented Dec 9, 2018

https://ci.nodejs.org/job/node-test-commit-aix/19533/nodes=aix61-ppc64/console

Host: test-osuosl-aix61-ppc64_be-1

00:47:03 not ok 2377 sequential/test-child-process-execsync
00:47:03   ---
00:47:03   duration_ms: 3.701
00:47:03   severity: fail
00:47:03   exitcode: 1
00:47:03   stack: |-
00:47:03     /bin/sh: iamabadcommand:  not found.
00:47:03     assert.js:86
00:47:03       throw new AssertionError(obj);
00:47:03       ^
00:47:03     
00:47:03     AssertionError [ERR_ASSERTION]: Expected values to be strictly deep-equal:
00:47:03     + actual - expected
00:47:03     
00:47:03     + null
00:47:03     - 'SIGILL'
00:47:03         at spawnSyncKeys.forEach (/home/iojs/build/workspace/node-test-commit-aix/nodes/aix61-ppc64/test/sequential/test-child-process-execsync.js:138:14)
00:47:03         at Array.forEach (<anonymous>)
00:47:03         at Object.assert.throws (/home/iojs/build/workspace/node-test-commit-aix/nodes/aix61-ppc64/test/sequential/test-child-process-execsync.js:136:19)
00:47:03         at expectedException (assert.js:568:19)
00:47:03         at expectsError (assert.js:663:16)
00:47:03         at Function.throws (assert.js:694:3)
00:47:03         at Object.<anonymous> (/home/iojs/build/workspace/node-test-commit-aix/nodes/aix61-ppc64/test/sequential/test-child-process-execsync.js:127:10)
00:47:03         at Module._compile (internal/modules/cjs/loader.js:721:30)
00:47:03         at Object.Module._extensions..js (internal/modules/cjs/loader.js:732:10)
00:47:03         at Module.load (internal/modules/cjs/loader.js:620:32)
00:47:03   ...
@Trott Trott added aix Issues and PRs related to the AIX platform. flaky-test Issues and PRs related to the tests with unstable failures on the CI. labels Dec 9, 2018
@gireeshpunathil
Copy link
Member

this was reported way back in 2015 through nodejs/node-v0.x-archive#9444 and was fixed through #1214 , don't know what has changed recently. I will have a look.

@gireeshpunathil
Copy link
Member

ok, spent some time on this. there are 2 puzzles to solve:

the stack message (`/bin/sh: iamabadcommand: not found.) indicates that the failure is in this section,

assert.throws(function() {
execSync('iamabadcommand');
}, /Command failed: iamabadcommand/);

where as the actual stack of the failure suggests it is at

assert.deepStrictEqual(err[key], spawnSyncResult[key]);

at which point the actual command that is run as a child is NOT iamabadcommand, instead node -e process.exit(1)

So this warrants a question about at what point the python parent captures the data pertinent to stack: |- as it do not seem to be coherent with the fail site.

  1. If we take AssertionError as shown in the second part to be bona-fide, then it indicates unexpected and improper exit from the child (SIGILL as opposed to null). Though I could not replicate this failure in a 1000 local runs, there is a contextual indication that this is in same as the exit-race with workers, as seen frequently through test-cli-syntax family of tests.

I will run more with ulimit -c unlimited to see if I get a core dump, which will mostly tell us a conclusive story.

@gireeshpunathil
Copy link
Member

I ran a 10K times locally, and no sign of failure. I don't think running more is worth.

So I conclude this has to do with the resource (CPU, memory, and other user limits) on CI.

@Trott - is it possible for you to get me ulimit -a output from the failing CI machine? I will attempt to mimic the same values in my dev box and see how it behaves. thanks!

@Trott
Copy link
Member Author

Trott commented Dec 12, 2018

@Trott - is it possible for you to get me ulimit -a output from the failing CI machine? I will attempt to mimic the same values in my dev box and see how it behaves. thanks!

ansible-playbook playbooks/write-ssh-config.yml fails for me so I no longer have access to any hosts. Hopefully someone on @nodejs/build can log in and run the command to get you results. (Alternatively, we can have a CI run it by, for example, creating a branch with a test that runs the command. But that's getting pretty convoluted....)

@Trott
Copy link
Member Author

Trott commented Dec 12, 2018

Woot! Fixed my ansible problem. So here you go:

$ ssh test-osuosl-aix61-ppc64_be-1 ulimit -a
time(seconds)        unlimited
file(blocks)         2097151
data(kbytes)         131072
stack(kbytes)        32768
memory(kbytes)       32768
coredump(blocks)     2097151
nofiles(descriptors) 2000
threads(per process) unlimited
processes(per user)  unlimited
$ 

@Trott
Copy link
Member Author

Trott commented Dec 14, 2018

https://ci.nodejs.org/job/node-test-commit-aix/19688/nodes=aix61-ppc64/consoleText

test-osuosl-aix61-ppc64_be-1

not ok 2383 sequential/test-child-process-execsync
  ---
  duration_ms: 2.473
  severity: fail
  exitcode: 1
  stack: |-
    /bin/sh: iamabadcommand:  not found.
    assert.js:86
      throw new AssertionError(obj);
      ^
    
    AssertionError [ERR_ASSERTION]: Expected values to be strictly equal:
    
    null !== 1
    
        at Object.assert.throws (/home/iojs/build/workspace/node-test-commit-aix/nodes/aix61-ppc64/test/sequential/test-child-process-execsync.js:134:12)
        at expectedException (assert.js:568:19)
        at expectsError (assert.js:663:16)
        at Function.throws (assert.js:694:3)
        at Object.<anonymous> (/home/iojs/build/workspace/node-test-commit-aix/nodes/aix61-ppc64/test/sequential/test-child-process-execsync.js:127:10)
        at Module._compile (internal/modules/cjs/loader.js:718:30)
        at Object.Module._extensions..js (internal/modules/cjs/loader.js:729:10)
        at Module.load (internal/modules/cjs/loader.js:617:32)
        at tryModuleLoad (internal/modules/cjs/loader.js:560:12)
        at Function.Module._load (internal/modules/cjs/loader.js:552:3)
  ...

@Trott
Copy link
Member Author

Trott commented Dec 14, 2018

And again. Probably time to mark this as flaky.

https://ci.nodejs.org/job/node-test-commit-aix/19692/nodes=aix61-ppc64/console

test-osuosl-aix61-ppc64_be-2

21:40:41 not ok 2383 sequential/test-child-process-execsync
21:40:41   ---
21:40:41   duration_ms: 5.83
21:40:41   severity: fail
21:40:41   exitcode: 1
21:40:41   stack: |-
21:40:41     /bin/sh: iamabadcommand:  not found.
21:40:41     assert.js:86
21:40:41       throw new AssertionError(obj);
21:40:41       ^
21:40:41     
21:40:41     AssertionError [ERR_ASSERTION]: Expected values to be strictly deep-equal:
21:40:41     + actual - expected
21:40:41     
21:40:41     + null
21:40:41     - 'SIGILL'
21:40:41         at spawnSyncKeys.forEach (/home/iojs/build/workspace/node-test-commit-aix/nodes/aix61-ppc64/test/sequential/test-child-process-execsync.js:138:14)
21:40:41         at Array.forEach (<anonymous>)
21:40:41         at Object.assert.throws (/home/iojs/build/workspace/node-test-commit-aix/nodes/aix61-ppc64/test/sequential/test-child-process-execsync.js:136:19)
21:40:41         at expectedException (assert.js:568:19)
21:40:41         at expectsError (assert.js:663:16)
21:40:41         at Function.throws (assert.js:694:3)
21:40:41         at Object.<anonymous> (/home/iojs/build/workspace/node-test-commit-aix/nodes/aix61-ppc64/test/sequential/test-child-process-execsync.js:127:10)
21:40:41         at Module._compile (internal/modules/cjs/loader.js:718:30)
21:40:41         at Object.Module._extensions..js (internal/modules/cjs/loader.js:729:10)
21:40:41         at Module.load (internal/modules/cjs/loader.js:617:32)
21:40:41   ...

Trott added a commit to Trott/io.js that referenced this issue Dec 14, 2018
Trott added a commit to Trott/io.js that referenced this issue Dec 14, 2018
Refs: nodejs#24921

PR-URL: nodejs#25031
Reviewed-By: Gireesh Punathil <[email protected]>
Reviewed-By: Colin Ihrig <[email protected]>
Reviewed-By: Bradley Farias <[email protected]>
Reviewed-By: Richard Lau <[email protected]>
@gireeshpunathil
Copy link
Member

so the first core dump arrived, but before that; want to clear the confusion in my earlier comment
#24921 (comment) . the failure is in fact in line 134 only, the string /bin/sh: iamabadcommand: not found. is always printed in the console that is suppressed by python harness under successful runs, while printed when failed.

So in summary /bin/sh: iamabadcommand: not found. is a valid output, and not a failure site.

Now, on the failure:

(dbx) where
.() at 0x0
array-buffer-collector._ZN2v88internal20CancelableLambdaTaskIZNS0_20ArrayBufferCollector15FreeAllocationsEvEUlvE_E11RunInternalEv(??) at 0x1011d770c
_ZN2v88internal14CancelableTask3RunEv(??) at 0x100039304
node_platform._ZN4node12_GLOBAL__N_1L20PlatformWorkerThreadEPv(??) at 0x1001e6990
(dbx)

this is here,

void ArrayBufferCollector::FreeAllocations() {

and we have a wild branch - IAR 0.

(dbx) x
$r0:0x00000001011d7678 $stkp:0x0000000110365480 $toc:0x0000000000000000
$r3:0x000000011024b150 $r4:0x0000000101f82e60 $r5:0x0000000000e8a71d
$r6:0x00362b2feca85a5f $r7:0x000000000000007d $r8:0x000000000000007d
$r9:0x0000000000000000 $r10:0x0000000000000000 $r11:0x0000000000000000
$r12:0x09001000a0082b60 $r13:0x000000011036d800 $r14:0x0000000000000000
$r15:0x0000000000000000 $r16:0x0000000000000000 $r17:0x0000000000000000
$r18:0x0000000000000000 $r19:0x0000000000000000 $r20:0x0000000000000000
$r21:0x0000000000000000 $r22:0x0000000000000000 $r23:0x0000000000000000
$r24:0x0000000000000000 $r25:0x000000011024b9e0 $r26:0x0000000000000000
$r27:0x000000011024c770 $r28:0x00000001101fb460 $r29:0x0000000110406b30
$r30:0x0000000000000000 $r31:0x0000000110365480
$iar:0x0000000000000000 $msr:0xa00000000000d032 $cr:0x42000228
$link:0x00000001011d7710 $ctr:0x0000000000000000 $xer:0x40000000

Value in Link Register 0x1011d7710 holds the caller's return address. So the wild branch was made at 0x1011d770c (PPC is RISC, so every instructions are 4 bytes long.)

(dbx) (0x1011d76fc)/10i
0x1011d76fc (_ZN2v88internal20CancelableLambdaTaskIZNS0_20ArrayBufferCollector15FreeAllocationsEvEUlvE_E11RunInternalEv+0x35c) e9490000          ld   r10,0x0(r9)
0x1011d7700 (_ZN2v88internal20CancelableLambdaTaskIZNS0_20ArrayBufferCollector15FreeAllocationsEvEUlvE_E11RunInternalEv+0x360) e9690010          ld   r11,0x10(r9)
0x1011d7704 (_ZN2v88internal20CancelableLambdaTaskIZNS0_20ArrayBufferCollector15FreeAllocationsEvEUlvE_E11RunInternalEv+0x364) 7d4903a6       mtctr   r10
0x1011d7708 (_ZN2v88internal20CancelableLambdaTaskIZNS0_20ArrayBufferCollector15FreeAllocationsEvEUlvE_E11RunInternalEv+0x368) e8490008          ld   r2,0x8(r9)
0x1011d770c (_ZN2v88internal20CancelableLambdaTaskIZNS0_20ArrayBufferCollector15FreeAllocationsEvEUlvE_E11RunInternalEv+0x36c) 4e800421       bctrl
(dbx) 

Without reading through ~500 instructions it is difficult to pin-point which construct in the source maps to this. However, wild branch typically indicates NULL function pointers, so in this case the lambda.

Is it another manifestation of the exit race with destructors, or a new one altogether? Will study more core files.

@gireeshpunathil
Copy link
Member

Unfortunately all subsequent core files show the same pattern at the same location. So the reason for this type of crash is a wild branch to a NULL lambda target.

To establish its relation with exit race, I changed exit call with underscored exit in Environment::Exit and this type of crash is avoided (in a 100 runs). The test still fails occasionally, but at a different point that also was occurring frequently prior to this change:

AssertionError [ERR_ASSERTION]: The expression evaluated to a falsy value:
assert(end < SLEEP)
at Object.<anonymous> (/home/iojs/gireesh/exitrace/node/test/sequential/test-child-process-execsync.js:59:3)

So with that, I want to establish that this is root caused by the same issue reported in #25007

BethGriggs pushed a commit that referenced this issue Dec 18, 2018
Refs: #24921

PR-URL: #25031
Reviewed-By: Gireesh Punathil <[email protected]>
Reviewed-By: Colin Ihrig <[email protected]>
Reviewed-By: Bradley Farias <[email protected]>
Reviewed-By: Richard Lau <[email protected]>
@MylesBorins
Copy link
Contributor

Seeing similar looking failure on bsd https://ci.nodejs.org/job/node-test-commit-freebsd/23046/

pull bot pushed a commit to SimenB/node that referenced this issue Dec 31, 2018
* move `start` time to the point of execution (avoids counting 'throws'
  tests towards 'timeout' test case)
* scope cmd/ret values where possible
* use `filter` instead of manual if/return

PR-URL: nodejs#25227
Refs: nodejs#24921
Reviewed-By: Anna Henningsen <[email protected]>
Reviewed-By: Luigi Pinca <[email protected]>
Reviewed-By: James M Snell <[email protected]>
targos pushed a commit that referenced this issue Jan 1, 2019
* move `start` time to the point of execution (avoids counting 'throws'
  tests towards 'timeout' test case)
* scope cmd/ret values where possible
* use `filter` instead of manual if/return

PR-URL: #25227
Refs: #24921
Reviewed-By: Anna Henningsen <[email protected]>
Reviewed-By: Luigi Pinca <[email protected]>
Reviewed-By: James M Snell <[email protected]>
refack pushed a commit to refack/node that referenced this issue Jan 14, 2019
Refs: nodejs#24921

PR-URL: nodejs#25031
Reviewed-By: Gireesh Punathil <[email protected]>
Reviewed-By: Colin Ihrig <[email protected]>
Reviewed-By: Bradley Farias <[email protected]>
Reviewed-By: Richard Lau <[email protected]>
refack pushed a commit to refack/node that referenced this issue Jan 14, 2019
* move `start` time to the point of execution (avoids counting 'throws'
  tests towards 'timeout' test case)
* scope cmd/ret values where possible
* use `filter` instead of manual if/return

PR-URL: nodejs#25227
Refs: nodejs#24921
Reviewed-By: Anna Henningsen <[email protected]>
Reviewed-By: Luigi Pinca <[email protected]>
Reviewed-By: James M Snell <[email protected]>
@gireeshpunathil
Copy link
Member

the test is active for the last one week with no failures, so as expected #25061 has fixed the underlying issue. Closing.

BethGriggs pushed a commit that referenced this issue Feb 12, 2019
Refs: #24921

PR-URL: #25031
Reviewed-By: Gireesh Punathil <[email protected]>
Reviewed-By: Colin Ihrig <[email protected]>
Reviewed-By: Bradley Farias <[email protected]>
Reviewed-By: Richard Lau <[email protected]>
BethGriggs pushed a commit that referenced this issue Feb 20, 2019
Refs: #24921

PR-URL: #25031
Reviewed-By: Gireesh Punathil <[email protected]>
Reviewed-By: Colin Ihrig <[email protected]>
Reviewed-By: Bradley Farias <[email protected]>
Reviewed-By: Richard Lau <[email protected]>
rvagg pushed a commit that referenced this issue Feb 28, 2019
Refs: #24921

PR-URL: #25031
Reviewed-By: Gireesh Punathil <[email protected]>
Reviewed-By: Colin Ihrig <[email protected]>
Reviewed-By: Bradley Farias <[email protected]>
Reviewed-By: Richard Lau <[email protected]>
BethGriggs pushed a commit that referenced this issue Apr 28, 2019
* move `start` time to the point of execution (avoids counting 'throws'
  tests towards 'timeout' test case)
* scope cmd/ret values where possible
* use `filter` instead of manual if/return

PR-URL: #25227
Refs: #24921
Reviewed-By: Anna Henningsen <[email protected]>
Reviewed-By: Luigi Pinca <[email protected]>
Reviewed-By: James M Snell <[email protected]>
BethGriggs pushed a commit that referenced this issue May 10, 2019
* move `start` time to the point of execution (avoids counting 'throws'
  tests towards 'timeout' test case)
* scope cmd/ret values where possible
* use `filter` instead of manual if/return

PR-URL: #25227
Refs: #24921
Reviewed-By: Anna Henningsen <[email protected]>
Reviewed-By: Luigi Pinca <[email protected]>
Reviewed-By: James M Snell <[email protected]>
MylesBorins pushed a commit that referenced this issue May 16, 2019
* move `start` time to the point of execution (avoids counting 'throws'
  tests towards 'timeout' test case)
* scope cmd/ret values where possible
* use `filter` instead of manual if/return

PR-URL: #25227
Refs: #24921
Reviewed-By: Anna Henningsen <[email protected]>
Reviewed-By: Luigi Pinca <[email protected]>
Reviewed-By: James M Snell <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
aix Issues and PRs related to the AIX platform. flaky-test Issues and PRs related to the tests with unstable failures on the CI.
Projects
None yet
Development

No branches or pull requests

3 participants