nixos/test-driver: warn when command exits but stdout stays open#471141
nixos/test-driver: warn when command exits but stdout stays open#471141roberth wants to merge 1 commit intoNixOS:staging-nixosfrom
Conversation
The test driver's succeed() method waits for stdout to be fully consumed before returning. When a command spawns background processes that inherit stdout, succeed() will wait silently for those processes to complete or close stdout. This wait can be arbitrarily long and users have no visibility into what's causing the delay. Unfortunately just changing the behavior of these widely used methods is not an option. This change detects when a command has exited but stdout remains open for more than 10 seconds, and emits a warning to help users diagnose the issue. This warning briefly explains the problem and suggests redirecting background process output to avoid the implicit wait. The implementation uses bash coproc to independently track: - The command process exit status (preserved for return) - The stdout closure (via base64 process termination) - A 10-second timeout using wait -n to race these events When stdout closes quickly: no warning When stdout stays open >10s: warning emitted, continues waiting
|
Is there any prior discussion to this problem / how wide-spread that is? Overall, I'd feel quite uneasy by adding a giant pile of bash(!) to handle process monitoring. I've seen bash being used for non-trivial tasks being quite a bunch of times and too often it bit back then. Isn't this something the backdoor itself should be able to do ideally rather than adding another layer on top in the "client"? At the very least the behavior is documented in the manual: https://nixos.org/manual/nixos/stable/#ssec-machine-objects |
I've added #144875 to the description. Hardly any discussion, but also I don't know how anyone would even find that issue in the first place. The hidden nature of the problem makes it so much worse, both for our ability to find info about the issue, and to even figure out that it happens in the first place. I suspect we'd have more
In a way, but the interface is "any bash statement" and it's not unreasonable to expect it to exit when the statement exits.
I understand, and tbh I felt uneasy writing it, but it has the benefit is portability. It works with any VM that can offer a >=4.something bash on serial. That includes other distributions and an initrd without any extra interpreters. I guess we could put this on ice until we enable those fancy use cases and implement this in a language like python and/or turn the backdoor into a more structured protocol and whatnot. I would love to see that happen, but it's not something I can just go ahead and make happen. What I have done is write a test for the warning functionality, which will be helpful whenever we want to change the implementation.
I never know when people read my writing, but I don't count on it. Thank you for reading. I believe that unfortunately many people already feel like they're going the extra mile by even writing a test, and any setbacks are more likely to kill their motivation than make the read the docs; subconsciously a bridge too far for them. So yeah, the reason why I make this trade-off (complexity -> visibility) is because I believe we're better off with a testing system that's easier to use, even at increased maintenance cost. |
I mean, there's a direct contradiction between "awaiting statement exit" and "getting stdout from
This is at least what I'd prefer, yes. Considering that we use AF_VSOCK already for SSH in debug scenarios and #453305 gets rid of the downsides of having AF_VSOCK on the host-side, I wanted to see if it somehow makes sense to just use this approach for all of the communication instead of That being said, this is nothing I actively plan, so if anybody wants to do some research on that end before me, feel free! |
True, but that's not apparent from, for instance, just the name of the method.
Definitely good, but less portable, so I don't know if we'll end up replacing it entirely, but the serial backdoor can be brought back if needed for non-NixOS, which is not really a thing yet anyway. |
What are you worried about specifically? Some thoughts from my end:
|
Motivation
I believe this issue has cost the community many hours of unnecessary troubleshooting, frustration, and I assume it has restricted the adoption of the test framework. That needs to end.
Commit message
The test driver's succeed() method waits for stdout to be fully consumed before returning. When a command spawns background processes that inherit stdout, succeed() will wait silently for those processes to complete or close stdout. This wait can be arbitrarily long and users have no visibility into what's causing the delay.
Unfortunately just changing the behavior of these widely used methods is not an option.
This change detects when a command has exited but stdout remains open for more than 10 seconds, and emits a warning to help users diagnose the issue. This warning briefly explains the problem and suggests redirecting background process output to avoid the implicit wait.
The implementation uses bash coproc to independently track:
When stdout closes quickly: no warning
When stdout stays open >10s: warning emitted, continues waiting
Things done
passthru.tests.nixpkgs-reviewon this PR. See nixpkgs-review usage../result/bin/.Add a 👍 reaction to pull requests you find important.