Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

test-stdio-closed failing on some AIX environments #10234

Closed
sxa opened this issue Dec 12, 2016 · 19 comments
Closed

test-stdio-closed failing on some AIX environments #10234

sxa opened this issue Dec 12, 2016 · 19 comments
Labels
aix Issues and PRs related to the AIX platform. process Issues and PRs related to the process subsystem. test Issues and PRs related to the tests.

Comments

@sxa
Copy link
Member

sxa commented Dec 12, 2016

Reference: #8375 (FYI @gireeshpunathil @Trott @mhdawson)
Happening on current versions of node 4 and 6 (4.7.0/6.9.2) and on AIX 6.1TL09

The fix that was applied in e65a2d7 does not appear to make the test pass universally on all AIX systems. I have a mix of some that it works for, and others which are still returning "42" which is consistent with the other platforms, so is failing this assertion:

assert.strictEqual(exitCode, common.isAix ? 126 : 42);

On the system that it works on, if I set the LANG variable to anything other the en_US I get the "42" result (not certain yet why that is making a difference) although on one of my "failing" machines which is giving 42 setting it to en_US doesn't resolve it, despite it being in the output of locale -a. We could, potentially, be seeing a timing issue where going down the path of a non-default locale is changing things, but I'm guessing so far at this point.

The AIX level on the "failing" box (the one defaulting to en_GB which is always giving me 42) is patched to a slightly later level AIX level - .6100-09-08-1642 vs 6100-09-07-1614. On my 6100-07 system (defaults to en_US) it's always returning 42 as well.

I get the feeling we may need to be able to trap both cases for this test to pass reliably (or change the exit(126) to exit(42), or understand why we're getting the differences that caused the initial patch to be required.

@italoacasas italoacasas added process Issues and PRs related to the process subsystem. test Issues and PRs related to the tests. labels Dec 12, 2016
@Trott
Copy link
Member

Trott commented Dec 12, 2016

It would be great to track down exactly why it sometimes works one way on AIX and sometimes works another way. But seeing as both behaviors are spec-compliant (I think), I'd be OK with replacing:

assert.strictEqual(exitCode, common.isAix ? 126 : 42);

...with something like:

if (common.isAix)
  assert([42, 126].includes(exitCode));
else
  assert.strictEqual(exitCode, 42);

I agree, though, that tracking down the source of the behavior difference would be a superior solution...

@mscdex mscdex added the aix Issues and PRs related to the AIX platform. label Dec 12, 2016
@sxa
Copy link
Member Author

sxa commented Dec 12, 2016

Sounds reasonable, and it means we'll spot if the same issue shows up on another OS since the acceptance of 126 is still in an AIX-specific clause. I've checked it and it seems ok so I'll get that submitted, especially since I don't think Gireesh will be around for a while who was involved in the previous discussion on this topic.

sxa pushed a commit to sxa/node that referenced this issue Dec 12, 2016
Fixes: nodejs#10234

Allow either of the two possible return codes on AIX to
be considered successful on test-stdio-closed.js
@jBarz
Copy link
Contributor

jBarz commented Dec 15, 2016

node [args] 1>&-

is supposed to start node with fd=1 (stdout) closed and not open. (upon which node will reopen it to point to /dev/null)
However it is open which is why fstat returns a pass.
More weirdly, fd=1 is open on the following catalog file
/usr/lib/nls/msg/en_US/execerr.cat

Renaming the catalog file to something else produces the expected outcome (i.e. fd=1 is closed).
I think this is a bug in the shell (ksh) on aix since the documentation for ksh states that 1>&- must close fd=1.

@Trott
Copy link
Member

Trott commented Dec 15, 2016

So, on AIX, /bin/sh is ksh? If it's ksh-specific, might something like this work around the issue until there's a fix for the bug?

Instead of this:

const proc = spawn('/bin/sh', ['-c', cmd], { stdio: 'inherit' });

...maybe something like this?:

// Insert comment here explaining the bug so that the next line is not a mystery.
const shell = common.isAIX ? '/bin/bash' : '/bin/sh';
const proc = spawn(shell, ['-c', cmd], { stdio: 'inherit' });

@Trott
Copy link
Member

Trott commented Dec 15, 2016

Another possibility: If we identify the version of ksh that has this bug and when it is fixed, check the version of ksh before running the test and skip if we are using a buggy version?

@jBarz
Copy link
Contributor

jBarz commented Dec 15, 2016

Unfortunately, bash fails the same as ksh. However "csh" does the job.
We could use change it to csh.

Another possible fix (more comprehensive i think) is to do the following

If stdout is open and it is read-only (which is the case on aix/ksh/bash), then we assume that the user originally intended to close stdout. So, we

  • close it
  • reopen it on /dev/null

jBarz@25d758d
This produces consistent behavior on both aix machines that originally produced different behavior.
This also means that the test case would be the same for aix and other platforms.

@jBarz
Copy link
Contributor

jBarz commented Dec 15, 2016

I can submit a PR for the above suggestion if it seems valid to you guys.

@richardlau
Copy link
Member

On the system that it works on, if I set the LANG variable to anything other the en_US I get the "42" result (not certain yet why that is making a difference) although on one of my "failing" machines which is giving 42 setting it to en_US doesn't resolve it, despite it being in the output of locale -a. We could, potentially, be seeing a timing issue where going down the path of a non-default locale is changing things, but I'm guessing so far at this point.

Might be a good guess, based on:

More weirdly, fd=1 is open on the following catalog file
/usr/lib/nls/msg/en_US/execerr.cat

Renaming the catalog file to something else produces the expected outcome (i.e. fd=1 is closed).
I think this is a bug in the shell (ksh) on aix since the documentation for ksh states that 1>&- must close fd=1.

I'm wondering if the shell did close fd=1 but something has then opened the catalog file before node checks if fd=1 is closed. The contents of /usr/lib/nls/msg/en_US/execerr.cat on the AIX systems I have access to look to be error messages relating to execing a process (probably should have guessed based on the filename 😁 ) -- Perhaps the shell is actually reporting an error (which if it is might be tricky to get at since we've closed stdout and stderr)? Maybe running with truss could help?

@richardlau
Copy link
Member

Perhaps the shell is actually reporting an error (which if it is might be tricky to get at since we've closed stdout and stderr)?

Actually thinking a bit more we only close stdout and stderr for the node process -- The shell's stdout and stderr should be being piped back to the parent (the test) so we could probably capture it in the test (maybe assert the output is empty?).

@Trott
Copy link
Member

Trott commented Dec 16, 2016

@jBarz wrote:

Another possible fix (more comprehensive i think) is to do the following

If stdout is open and it is read-only (which is the case on aix/ksh/bash), then we assume that the user originally intended to close stdout. So, we

  • close it
  • reopen it on /dev/null

jBarz@25d758d

/cc @bnoordhuis

@sxa
Copy link
Member Author

sxa commented Dec 16, 2016

Sounds worth a shot @jBarz - I'd be happy with my PR or your one if you can confirm it will work :-)

@sam-github
Copy link
Contributor

That's not a good idea, having fd 1 or 1 being read-only is valid. The code under test isn't intended to guaranteed that fd 1 or 2 is writeable or that fd 0 is readable, its just there to guarantee that the fd is uses, so no other resource is opened on that descriptor.

The ksh bug wherein one of its message catalog files ends up being opened on fd 1 is actually a good example of the bug node is avoiding that ksh is not, wherein a child process (node in this case) accidentally receives a readable descriptor to a file it was not intended to have access to.

The solutions here are, IMO:

  1. use /bin/csh on AIX, it was reported, to not have the bug
  2. make the test assert that the fd is valid, instead of that its writeable, since that is the intention of the code under test

fs.fstatSync() of 0,1,2 can do the latter

@gibfahn
Copy link
Member

gibfahn commented Dec 16, 2016

@sam-github The latter sounds like the better fix.

@jBarz
Copy link
Contributor

jBarz commented Dec 16, 2016

I was incorrect about csh. Using "1>&- 2>&-" on csh simply redirects output to a file called "-" and the test passes because stdout and stderr are not closed :-). There doesn't seem to be a way to close fds with csh.
I have tested option 2 and it passes on both aix machines.
Here is the test case from option 2.
jBarz@dbd9df8

@richardlau
Copy link
Member

@sxa555 Gave me access to an AIX system that exhibits the behavior where the child process was exiting with 126. We've run truss on the test on that system and another AIX system where the child process exits with 42 (in line with other Unix platforms).

This is part of the output on the system that returns 126:

12648680:       21889217: close(1)                              = 0
12648680:       21889217: close(2)                              = 0
12648680:       21889217: execve("/tmp/v6.x/out/Release/node", 0x20011518, 0x200115B8)  argc: 3
12648680:       21889217:  argv: /tmp/v6.x/out/Release/node
12648680:       21889217:  /tmp/v6.x/test/parallel/test-stdio-closed.js child
...
12648680:       21889217: kloadquery(238, 0x0FFFFFFFFFFFF790, 24) = 0
12648680:       21889217: sem_init(0x00000001100A8AD8, 0, 0)    = 0
12648680:       21889217: __libc_sbrk(0x0000000000010020)       = 0x00000001101C3680
12648680:       21889217: kload("/usr/lib/libviocmnotify.a(shr.o)", 17367936, "/opt/freeware/lib/pthread/ppc64:"
) = 0
12648680:       21889217: kloadquery(1, 0x00000001101C3AC0, 8176) = 0
12648680:       21889217: getuidx(4)                            = 0
12648680:       21889217: getuidx(2)                            = 0
12648680:       21889217: getuidx(1)                            = 0
12648680:       21889217: getgidx(4)                            = 0
12648680:       21889217: getgidx(2)                            = 0
12648680:       21889217: getgidx(1)                            = 0
12648680:       21889217: kload("/usr/lib/nls/loc/en_US__64", 17301632, "/opt/freeware/lib/pthread/ppc64:") = 64
8518346072921224
12648680:       21889217: kload("libi18n.a(shr_64.o)", 17301888, "/usr/ccs/lib") = 648518346072921384
12648680:       21889217: access("/usr/lib/nls/msg/en_US/execerr.cat", 0) = 0
12648680:       21889217: _getpid()                             = 12648680
12648680:       21889217: kopen("/usr/lib/nls/msg/en_US/execerr.cat", O_RDONLY) = 1
...
12648680:       21889217: fstatx(0, 0x0FFFFFFFFFFFF840, 176, 0) = 0
12648680:       21889217: fstatx(1, 0x0FFFFFFFFFFFF840, 176, 0) = 0
12648680:       21889217: fstatx(2, 0x0FFFFFFFFFFFF840, 176, 0) Err#9  EBADF
12648680:       21889217: kopen("/dev/null", O_RDWR)            = 2

We can see that:

  1. stdout (fd=1) and stderr (fd=2) are closed before the child node process is executed (i.e. the shell is behaving correctly)
  2. The message catalog /usr/lib/nls/msg/en_US/execerr.cat is being opened and is assigned fd=1.
  3. The catalog is not closed, so fd 1 remains open when node calls fstat on it.
  4. node only sees fd=2 as closed and reopens as /dev/null

In the other case (return 42) none of the kload calls appear in the output, and 2. does not appear either so fd=1 remains unused after it is closed and node reopens as `/dev/null'.

So we went back to the premise that changing the LANG environment variable affected behavior. Back on the system where the child process returned 126 LANG was set to en_US. Setting it to anything else changes the behavior and the child process returns 42 (as we expect it to do everywhere else). truss output shows that the message catalog is not being opened and node ends up reopening fd=1 .

When we look in /usr/lib/nls/msg/ there only seems to be catalogs for en_US on the system, so there is not message catalog available to load if LANG is set to anything other than en_US.

So we may have a workaround (changing the LANG environment variable).

What we don't know is why only some AIX systems are exhibiting this behavior. The systems where the child process was returning 42 (in line with the other platforms) by default at no point (according to truss) perform any kload calls nor attempt to load message catalogs no matter what LANG was set to and we confirmed that message catalogs were present in /usr/lib/nls/msg/.

cc @nodejs/platform-aix

@richardlau
Copy link
Member

By workaround, I meant we change the test back so that it expects return 42 everywhere (so no special casing AIX) and run on AIX with the LANG environment variable changed.

@sxa
Copy link
Member Author

sxa commented Dec 19, 2016

For the record, removing the bos.msg.en_US.rte fileset from the machine would also result in consistent behaviour across platforms, since if it cannot find the execerr.cat file then it can't open it, the behaviour and expected RC a;ways appears to end up the same as other platforms (i.e. 42) ... Probably not the best option, but it is another possibility (that package had no direct dependencies on my test systems so it removed cleanly)

Everything comes back to 42 ;-)

@sxa
Copy link
Member Author

sxa commented Dec 19, 2016

Another option would be to set NLSPATH in the environment to something invalid so it can't find the message catalogues

@jBarz
Copy link
Contributor

jBarz commented Dec 19, 2016

I think the test case needs to be fixed as @sam-github suggested in option 2 of #10234 (comment)
I will submit a PR.

jBarz added a commit to jBarz/node-core that referenced this issue Jan 23, 2017
Don't do a write on stdout/stderr because that checks for their
writability. But fd=1 could legitimately be opened with read-only
access by the user. What this test needs to only ensure is that
they are used at startup.
This fixes nodejs#10234
italoacasas pushed a commit to italoacasas/node that referenced this issue Jan 25, 2017
Don't do a write on stdout/stderr because that checks for their
writability. But fd=1 could legitimately be opened with read-only
access by the user. All this test needs to ensure is that
they are used at startup.

PR-URL: nodejs#10339
Fixes: nodejs#10234
Reviewed-By: Sam Roberts <[email protected]>
Reviewed-By: Gibson Fahnestock <[email protected]>
Reviewed-By: Ben Noordhuis <[email protected]>
Reviewed-By: James M Snell <[email protected]>
italoacasas pushed a commit to italoacasas/node that referenced this issue Jan 27, 2017
Don't do a write on stdout/stderr because that checks for their
writability. But fd=1 could legitimately be opened with read-only
access by the user. All this test needs to ensure is that
they are used at startup.

PR-URL: nodejs#10339
Fixes: nodejs#10234
Reviewed-By: Sam Roberts <[email protected]>
Reviewed-By: Gibson Fahnestock <[email protected]>
Reviewed-By: Ben Noordhuis <[email protected]>
Reviewed-By: James M Snell <[email protected]>
italoacasas pushed a commit to italoacasas/node that referenced this issue Jan 30, 2017
Don't do a write on stdout/stderr because that checks for their
writability. But fd=1 could legitimately be opened with read-only
access by the user. All this test needs to ensure is that
they are used at startup.

PR-URL: nodejs#10339
Fixes: nodejs#10234
Reviewed-By: Sam Roberts <[email protected]>
Reviewed-By: Gibson Fahnestock <[email protected]>
Reviewed-By: Ben Noordhuis <[email protected]>
Reviewed-By: James M Snell <[email protected]>
italoacasas pushed a commit to italoacasas/node that referenced this issue Jan 30, 2017
Don't do a write on stdout/stderr because that checks for their
writability. But fd=1 could legitimately be opened with read-only
access by the user. All this test needs to ensure is that
they are used at startup.

PR-URL: nodejs#10339
Fixes: nodejs#10234
Reviewed-By: Sam Roberts <[email protected]>
Reviewed-By: Gibson Fahnestock <[email protected]>
Reviewed-By: Ben Noordhuis <[email protected]>
Reviewed-By: James M Snell <[email protected]>
MylesBorins pushed a commit that referenced this issue Mar 8, 2017
Don't do a write on stdout/stderr because that checks for their
writability. But fd=1 could legitimately be opened with read-only
access by the user. All this test needs to ensure is that
they are used at startup.

PR-URL: #10339
Fixes: #10234
Reviewed-By: Sam Roberts <[email protected]>
Reviewed-By: Gibson Fahnestock <[email protected]>
Reviewed-By: Ben Noordhuis <[email protected]>
Reviewed-By: James M Snell <[email protected]>
MylesBorins pushed a commit that referenced this issue Mar 8, 2017
Don't do a write on stdout/stderr because that checks for their
writability. But fd=1 could legitimately be opened with read-only
access by the user. All this test needs to ensure is that
they are used at startup.

PR-URL: #10339
Fixes: #10234
Reviewed-By: Sam Roberts <[email protected]>
Reviewed-By: Gibson Fahnestock <[email protected]>
Reviewed-By: Ben Noordhuis <[email protected]>
Reviewed-By: James M Snell <[email protected]>
MylesBorins pushed a commit that referenced this issue Mar 9, 2017
Don't do a write on stdout/stderr because that checks for their
writability. But fd=1 could legitimately be opened with read-only
access by the user. All this test needs to ensure is that
they are used at startup.

PR-URL: #10339
Fixes: #10234
Reviewed-By: Sam Roberts <[email protected]>
Reviewed-By: Gibson Fahnestock <[email protected]>
Reviewed-By: Ben Noordhuis <[email protected]>
Reviewed-By: James M Snell <[email protected]>
MylesBorins pushed a commit that referenced this issue Mar 9, 2017
Don't do a write on stdout/stderr because that checks for their
writability. But fd=1 could legitimately be opened with read-only
access by the user. All this test needs to ensure is that
they are used at startup.

PR-URL: #10339
Fixes: #10234
Reviewed-By: Sam Roberts <[email protected]>
Reviewed-By: Gibson Fahnestock <[email protected]>
Reviewed-By: Ben Noordhuis <[email protected]>
Reviewed-By: James M Snell <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
aix Issues and PRs related to the AIX platform. process Issues and PRs related to the process subsystem. test Issues and PRs related to the tests.
Projects
None yet
8 participants