nixos/test-driver: warn when command exits but stdout stays open #471141

Open
roberth wants to merge 1 commit into NixOS:staging-nixos from roberth:nixos-test-driver-warn-when-blocked-after-exit

Conversation

@roberth
Member

@roberth roberth commented Dec 15, 2025

Motivation

I believe this issue has cost the community many hours of unnecessary troubleshooting and frustration, and I assume it has restricted the adoption of the test framework. That needs to end.

Commit message

The test driver's succeed() method waits for stdout to be fully consumed before returning. When a command spawns background processes that inherit stdout, succeed() will wait silently for those processes to complete or close stdout. This wait can be arbitrarily long and users have no visibility into what's causing the delay.
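The blocking described above can be reproduced outside the test driver. The following is a minimal sketch, not the driver's code: Python's `subprocess.run` with a captured stdout pipe stands in for succeed(), `sh` stands in for the guest's bash, and `run_and_time` is a hypothetical helper name.

```python
import subprocess
import time


def run_and_time(script: str) -> tuple[int, str, float]:
    """Run a shell snippet with stdout captured; return (exit code, output, seconds)."""
    start = time.monotonic()
    # With stdout=PIPE, run() reads the pipe until EOF before it returns.
    proc = subprocess.run(["sh", "-c", script], stdout=subprocess.PIPE, text=True)
    return proc.returncode, proc.stdout, time.monotonic() - start


# The foreground statement finishes immediately, but the backgrounded
# sleep inherits the stdout pipe and keeps it open until it exits.
code, out, elapsed = run_and_time("sleep 2 & echo started")
print(code, out.strip(), f"{elapsed:.1f}s")
```

The shell exits right after `echo`, yet the call should only return once `sleep` has finished, because EOF on the pipe arrives only when every process holding it has closed it.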

Unfortunately just changing the behavior of these widely used methods is not an option.

This change detects when a command has exited but stdout remains open for more than 10 seconds, and emits a warning to help users diagnose the issue. This warning briefly explains the problem and suggests redirecting background process output to avoid the implicit wait.
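The suggested remedy, redirecting the background process's output, can be illustrated with a small sketch. It assumes Python's `subprocess.run` as a stand-in for succeed() and `sh` as a stand-in for the guest's bash; the `sleep` job is a hypothetical placeholder for a real background service.

```python
import subprocess
import time

start = time.monotonic()
# The background job's stdout goes to /dev/null, so it never holds the
# captured pipe open; only the foreground echo writes to it.
result = subprocess.run(
    ["sh", "-c", "sleep 2 >/dev/null & echo started"],
    stdout=subprocess.PIPE,
    text=True,
)
redirected_elapsed = time.monotonic() - start
print(result.stdout.strip(), f"{redirected_elapsed:.1f}s")
```

Here the call should return as soon as the shell exits, while the background job keeps running. If stderr were captured through the same pipe, it would need the same redirection.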

The implementation uses bash coproc to independently track:

  • The command process exit status (preserved for return)
  • The stdout closure (via base64 process termination)
  • A 10-second timeout using wait -n to race these events

When stdout closes quickly: no warning
When stdout stays open >10s: warning emitted, continues waiting
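The detection logic can be sketched in Python to make the race explicit. This is not the PR's implementation, which uses bash coproc and `wait -n`; it is an analogous illustration using a reader thread, and the function name, the `grace` parameter, and the warning text are all assumptions made for this example.

```python
import subprocess
import sys
import threading


def run_with_stdout_warning(script: str, grace: float = 10.0) -> tuple[int, str, bool]:
    """Run a shell snippet; return (exit code, stdout, whether a warning was emitted)."""
    proc = subprocess.Popen(["sh", "-c", script], stdout=subprocess.PIPE, text=True)
    chunks: list[str] = []

    def drain() -> None:
        # read() returns only at EOF, i.e. once every holder of the pipe has closed it.
        chunks.append(proc.stdout.read())

    reader = threading.Thread(target=drain)
    reader.start()

    code = proc.wait()          # event 1: the command itself has exited
    reader.join(timeout=grace)  # event 2: stdout closes, raced against the timeout
    warned = reader.is_alive()
    if warned:
        # Hypothetical wording; the exit status is already known and preserved.
        print(
            "warning: command exited but stdout is still open; "
            "a background process may have inherited it",
            file=sys.stderr,
        )
        reader.join()           # keep waiting, as before
    proc.stdout.close()
    return code, "".join(chunks), warned
```

Calling `run_with_stdout_warning("sleep 2 & echo hi", grace=0.5)` should emit the warning and still return the full output once the pipe closes, while `run_with_stdout_warning("echo hi")` should return without one.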

Things done

  • Built on platform:
    • x86_64-linux
    • aarch64-linux
    • x86_64-darwin
    • aarch64-darwin
  • Tested, as applicable:
  • Ran nixpkgs-review on this PR. See nixpkgs-review usage.
  • Tested basic functionality of all binary files, usually in ./result/bin/.
  • Nixpkgs Release Notes
    • Package update: when the change is major or breaking.
  • NixOS Release Notes
    • Module addition: when adding a new NixOS module.
    • Module update: when the change is significant.
  • Fits CONTRIBUTING.md, pkgs/README.md, maintainers/README.md and other READMEs.

Add a 👍 reaction to pull requests you find important.

@roberth roberth requested a review from tfc December 15, 2025 20:28
@nixpkgs-ci nixpkgs-ci bot added the labels 10.rebuild-linux: 1-10, 10.rebuild-darwin: 1-10, 10.rebuild-darwin: 1, 10.rebuild-nixos-tests, 6.topic: nixos, and 6.topic: testing on Dec 15, 2025
@Ma27
Member

Ma27 commented Dec 22, 2025

Is there any prior discussion to this problem / how wide-spread that is?
I happen to work for a company nowadays that makes very heavy use of the test driver for a few projects and I think I've seen a bunch of the common pain points now (e.g. #399245 and similar problems under heavy CI load), but this is a new one for me. Am I getting it right that this is not really a timing problem, but wrong use of machine.execute?

Overall, I'd feel quite uneasy about adding a giant pile of bash(!) to handle process monitoring. I've seen bash used for non-trivial tasks quite a few times, and too often it bit back. Isn't this something the backdoor itself should ideally be able to do, rather than adding another layer on top in the "client"?

At the very least the behavior is documented in the manual: https://nixos.org/manual/nixos/stable/#ssec-machine-objects

@roberth
Member Author

roberth commented Dec 22, 2025

Is there any prior discussion to this problem / how wide-spread that is?

I've added #144875 to the description. Hardly any discussion, but also I don't know how anyone would even find that issue in the first place.

The hidden nature of the problem makes it so much worse, both for our ability to find info about the issue, and to even figure out that it happens in the first place.

I suspect we'd have more detached processes in tests, and more elaborate tests, if it weren't for this problem.

Am I getting it right that this is not really a timing problem, but wrong use of machine.execute?

In a way, but the interface is "any bash statement" and it's not unreasonable to expect it to exit when the statement exits.

uneasy by adding a giant pile of bash(!)

I understand, and tbh I felt uneasy writing it, but it has the benefit of portability. It works with any VM that can offer a >=4.something bash on serial. That includes other distributions and an initrd without any extra interpreters.
I have considered doing it in Python, but this would somewhat reduce the aforementioned portability, and in my experience doing process management in other languages isn't obviously simpler either.

I guess we could put this on ice until we enable those fancy use cases and implement this in a language like python and/or turn the backdoor into a more structured protocol and whatnot. I would love to see that happen, but it's not something I can just go ahead and make happen.

What I have done is write a test for the warning functionality, which will be helpful whenever we want to change the implementation.

At the very least the behavior is documented in the manual

I never know when people read my writing, but I don't count on it. Thank you for reading.

I believe that, unfortunately, many people already feel like they're going the extra mile by even writing a test, and any setbacks are more likely to kill their motivation than make them read the docs; subconsciously a bridge too far for them.

So yeah, the reason why I make this trade-off (complexity -> visibility) is because I believe we're better off with a testing system that's easier to use, even at increased maintenance cost.
It's a long-term view where I believe the additional tests and additional people who write tests offset the complexity.

@Ma27
Member

Ma27 commented Dec 22, 2025

In a way, but the interface is "any bash statement" and it's not unreasonable to expect it to exit when the statement exits.

I mean, there's a direct contradiction between "awaiting statement exit" and "getting stdout from Machine.execute" and that's where I'd expect people to test which is true (or check the manual). That's my reasoning for being surprised that this is much of an issue, but apparently it is.

this in a language like python and/or turn the backdoor into a more structured protocol and whatnot

This is at least what I'd prefer, yes.

Considering that we use AF_VSOCK already for SSH in debug scenarios and #453305 gets rid of the downsides of having AF_VSOCK on the host-side, I wanted to see if it somehow makes sense to just use this approach for all of the communication instead of backdoor.service (motivations being mostly less code for us to maintain and a single backdoor implementation). When I experiment with this, I'll also take a look at how to deal with that problem.

That being said, this is nothing I actively plan, so if anybody wants to do some research on that end before me, feel free!

@roberth
Member Author

roberth commented Dec 22, 2025

there's a direct contradiction

True, but that's not apparent from, for instance, just the name of the method. succeed is a deceptive name for the two things it does. Arguably that should be fixed too, and that could be another angle to solve the issue.
Takes more time and relearning though.

AF_VSOCK

Definitely good, but less portable, so I don't know if we'll end up replacing it entirely, but the serial backdoor can be brought back if needed for non-NixOS, which is not really a thing yet anyway.

@Ma27
Member

Ma27 commented Dec 29, 2025

Definitely good, but less portable, so I don't know if we'll end up replacing it entirely, but the serial backdoor can be brought back if needed for non-NixOS, which is not really a thing yet anyway.

What are you worried about specifically? Some thoughts from my end:

  • Inside the guest we just leverage systemd for that, so any modern Linux distro (I'm aware that there are ideas to run tests with other distros with this) should be usable. In fact, test.thing - a test-runner that already supports multi-distro use-cases - has even written a polyfill for that: https://codeberg.org/lis/test.thing/src/branch/main/workarounds
  • For the host OS, I haven't checked whether macOS works fine with vhost-device-vsock, but I think a goal of this project is to be no longer Linux-specific. Anyways, I'm adding support for that in nixos/test-driver: use vhost-device-vsock for SSH backdoor #453305 because we'd need it anyways for the default backdoor (otherwise we'd need extra-sandbox-paths = /dev/vhost-vsock and vsock numbers are host-global)
  • If it's about the hypervisor, then I can say that e.g. cloud-hypervisor natively supports vsock with sockets on the host side, IIRC. Admittedly I haven't checked how to implement this with nspawn, which is an open, yet required, thing to check, yes.

@mdaniels5757 mdaniels5757 changed the base branch from master to staging-nixos January 18, 2026 03:40
@nixpkgs-ci nixpkgs-ci bot closed this Jan 18, 2026
@nixpkgs-ci nixpkgs-ci bot reopened this Jan 18, 2026
Development

Successfully merging this pull request may close these issues.

nixosTest stdout blocking is hard to troubleshoot

2 participants