Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

test: delay child exit in AIX for pseudo-tty tests #11715

Closed

Conversation

gireeshpunathil
Copy link
Member

@gireeshpunathil gireeshpunathil commented Mar 6, 2017

The tests in pseudo-tty takes the form of child node writing some data
and exiting, while parent python consume them through pseudo tty
implementations, and validate the result.

While there is no synchronization between child and parent, this works
for most platforms, except AIX, where the child exits even before the
parent could setup the read loop, under race conditions

Fixing the race condition is ideally done through sending ACK messages
to and forth, but involves massive changes and have side effect. The
workaround is to address them in AIX alone, by adding a reasonable delay.

Checklist
  • make -j4 test (UNIX), or vcbuild test (Windows) passes
  • tests and/or benchmarks are included
  • documentation is changed or added
  • commit message follows commit guidelines
Affected core subsystem(s)

@nodejs-github-bot nodejs-github-bot added the test Issues and PRs related to the tests. label Mar 6, 2017
if (common.isAix) {
setTimeout(function() {
process.exit(0);
}, 200);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks like it would completely break the purpose of the test… if the problem really is the child exiting too quickly, maybe just set up a setTimeout around the whole code here for AIX.

Also, the comment sounds like this is a bug in Python on AIX, not in our tests/Node?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@addalex - the background for this change ishere
There exists a bug in python in AIX, but the current issue and its mitigation is outside of that bug.
I am not sure I understand why this would break the purpose of the test - can you please explain?
The sequence of child-parent interaction is thus:

  1. Parent turns child's std streams into pseudo terminals, and spawns the child.
  2. child writes data into its streams and exits. [ data of differing length, and sequence, in different tests]
  3. Parent sets up a read loop and continuously reads, until it gets EIO (on child termination)
  4. The read data is then compared with the expected data, with the help of a companion file.

The issue is: there is a race condition between child's exit and parent's read logic - if child exits too early, the parent will end up loosing all the data - in theory this is possible in all platforms, but AIX seem to be the only victim, and hence this workaround.

It is not a pre-req that the child starts only when the parent is ready. If the child stays alive, so does its fds, and hence parent can continue to read from it.

Please let me know if I am missing anything.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not sure I understand why this would break the purpose of the test - can you please explain?

What happens in the test is that the Node process writes a chunk of data that is slightly too large to fit into the OS buffer for the TTY (specifically for the OS X value here); for regular streams, this would mean that the data can only partly be written synchronously, and some of the data will need to be written asynchronously.

For TTYs we avoid this by turning them into blocking streams, so all data is written synchronously, and this is what the test here is checking.

That’s why the process.exit(0) has to happen in exactly the same tick; otherwise libuv can do asynchronous I/O before the process exits.

The issue is: there is a race condition between child's exit and parent's read logic

Yeah, it’s odd… I don’t know much about AIX but intermingling the I/O logic and the process control logic seems weird to me.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@addaleax - thanks, I understand the history of the test, and agree that the nextTick will defeat the purpose of the test. I have amended the code for no_dropped_stdio.js as per your suggestion. Looking at changes in other tests, I believe they are just fine.

@gibfahn gibfahn added the aix Issues and PRs related to the AIX platform. label Mar 6, 2017
@gibfahn
Copy link
Member

gibfahn commented Mar 6, 2017

cc/ @nodejs/platform-aix

@mscdex mscdex added the tty Issues and PRs related to the tty subsystem. label Mar 6, 2017
@addaleax addaleax dismissed their stale review March 7, 2017 12:49

outdated


// In AIX, the child exits even before the python parent
// can setup the readloop. Provide a reasonable delay.
if (common.isAix) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

for these, would it be better to delay the entire test when running on AIX?

e.g. something like...

function doTest() {
  /** do the test stuff here **/
}
if (common.isAix)
  setTimeout(doTest, 200);
else
  doTest();

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jasnell - thanks, I made some changes to this effect, essentially introducing the delay before I/O is performed, instead of after. Please note that from a process control logic perspective, the order of delay is immeterial, while from an I/O logic perspective, the child writes are kept in tact (performed in same uv tick), in order not to affect the intent of the test. I hope you are fine with this?

@gireeshpunathil gireeshpunathil force-pushed the pseudottyaixdelay branch 2 times, most recently from 12238ff to e48dcd8 Compare March 8, 2017 07:59
Copy link
Member

@mhdawson mhdawson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. As per discussion I'd prefer to avoid a timeout but given that does not seem feasible this looks ok.

@gibfahn
Copy link
Member

gibfahn commented Mar 12, 2017

CI: https://ci.nodejs.org/job/node-test-commit/8414/

@addaleax , does this LGTY?

cc/ @nodejs/testing

Copy link
Member

@gibfahn gibfahn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@gireeshpunathil I get warnings when I try to apply this patch, could you remove the extra blank lines at the end of the files?

Applying: test: delay child exit in AIX for pseudo-tty tests
.git/rebase-apply/patch:33: new blank line at EOF.
+
.git/rebase-apply/patch:86: new blank line at EOF.
+
warning: 2 lines add whitespace errors.

setTimeout(function() {
process.emit('SIGWINCH');
}, common.isAix ? 200 : 0);

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Blank line 2

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@gibfahn - addressed, thanks.

process.stdout.write(out);
process.exit(0);
}, common.isAix ? 200 : 0);

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Blank line 1

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@gibfahn - addressed, thanks

@addaleax
Copy link
Member

@addaleax , does this LGTY?

I guess? If working around a Python bug is what we have to do then this seems like a reasonable approach.

The tests in pseudo-tty takes the form of child node writing some data
and exiting, while parent python consume them through pseudo tty
implementations, and validate the result.

While there is no synchronization between child and parent, this works
for most platforms, except AIX, where the child exits even before the
parent could setup the read loop, under race conditions

Fixing the race condition is ideally done through sending ACK messages
to and forth, but involves massive changes and have side effect. The
workaround is to address them in AIX alone, by adding a reasonable delay.
@gibfahn gibfahn self-assigned this Mar 15, 2017
@gibfahn
Copy link
Member

gibfahn commented Mar 15, 2017

Will land later today if there are no objections

@gibfahn
Copy link
Member

gibfahn commented Mar 17, 2017

Another CI for good luck: https://ci.nodejs.org/job/node-test-commit/8513/

@gibfahn
Copy link
Member

gibfahn commented Mar 17, 2017

@gireeshpunathil I was adding the Fixes: lines to your PR as part of the landing procedure, and I noticed that this doesn't seem to cover the pseudo-tty/test-tty-wrap issue from #9728. Would it be worth/possible to expand this PR to cover that? Not sure if it's the same change.

Currently this PR fixes:

Fixes: https://github.com/nodejs/node/issues/7973
Fixes: https://github.com/nodejs/node/issues/9765
Fixes: https://github.com/nodejs/node/issues/11541

EDIT: Spoke to Gireesh, #9728 is a separate issue.

@gibfahn
Copy link
Member

gibfahn commented Mar 17, 2017

Landed in d631af5

@gibfahn gibfahn closed this Mar 17, 2017
gibfahn pushed a commit that referenced this pull request Mar 17, 2017
The tests in pseudo-tty takes the form of child node writing some data
and exiting, while parent python consume them through pseudo tty
implementations, and validate the result.

While there is no synchronization between child and parent, this works
for most platforms, except AIX, where the child exits even before the
parent could setup the read loop, under race conditions

Fixing the race condition is ideally done through sending ACK messages
to and forth, but involves massive changes and have side effect. The
workaround is to address them in AIX alone, by adding a reasonable
delay.

PR-URL: #11715
Fixes: #7973
Fixes: #9765
Fixes: #11541
Reviewed-By: Michael Dawson <[email protected]>
Reviewed-By: James M Snell <[email protected]>
Reviewed-By: Gibson Fahnestock <[email protected]>
@gireeshpunathil
Copy link
Member Author

@gibfahn - thanks. #9728 suffers from hang, and the reason is explained here. So the test stays excluded, until the python bug is addressed. Hope this clarifies.

italoacasas pushed a commit to italoacasas/node that referenced this pull request Mar 20, 2017
The tests in pseudo-tty takes the form of child node writing some data
and exiting, while parent python consume them through pseudo tty
implementations, and validate the result.

While there is no synchronization between child and parent, this works
for most platforms, except AIX, where the child exits even before the
parent could setup the read loop, under race conditions

Fixing the race condition is ideally done through sending ACK messages
to and forth, but involves massive changes and have side effect. The
workaround is to address them in AIX alone, by adding a reasonable
delay.

PR-URL: nodejs#11715
Fixes: nodejs#7973
Fixes: nodejs#9765
Fixes: nodejs#11541
Reviewed-By: Michael Dawson <[email protected]>
Reviewed-By: James M Snell <[email protected]>
Reviewed-By: Gibson Fahnestock <[email protected]>
jungx098 pushed a commit to jungx098/node that referenced this pull request Mar 21, 2017
The tests in pseudo-tty takes the form of child node writing some data
and exiting, while parent python consume them through pseudo tty
implementations, and validate the result.

While there is no synchronization between child and parent, this works
for most platforms, except AIX, where the child exits even before the
parent could setup the read loop, under race conditions

Fixing the race condition is ideally done through sending ACK messages
to and forth, but involves massive changes and have side effect. The
workaround is to address them in AIX alone, by adding a reasonable
delay.

PR-URL: nodejs#11715
Fixes: nodejs#7973
Fixes: nodejs#9765
Fixes: nodejs#11541
Reviewed-By: Michael Dawson <[email protected]>
Reviewed-By: James M Snell <[email protected]>
Reviewed-By: Gibson Fahnestock <[email protected]>
@MylesBorins
Copy link
Contributor

Should this be backported

@gibfahn
Copy link
Member

gibfahn commented Apr 18, 2017

Should this be backported

I think so, @gireeshpunathil can you raise a backport to v6.x-staging? Guide is here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
aix Issues and PRs related to the AIX platform. test Issues and PRs related to the tests. tty Issues and PRs related to the tty subsystem.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

9 participants