nixos/test-driver: warn when command exits but stdout stays open by roberth · Pull Request #471141 · NixOS/nixpkgs

roberth · 2025-12-15T20:28:53Z

Motivation

I believe this issue has cost the community many hours of unnecessary troubleshooting and frustration, and I assume it has restricted the adoption of the test framework. That needs to end.

[EDIT] Closes #144875 (nixosTest stdout blocking is hard to troubleshoot)

Commit message

The test driver's succeed() method waits for stdout to be fully consumed before returning. When a command spawns background processes that inherit stdout, succeed() will wait silently for those processes to complete or close stdout. This wait can be arbitrarily long and users have no visibility into what's causing the delay.
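The blocking described above is ordinary pipe behavior and can be reproduced outside the test driver. The following is a minimal sketch, not the driver's code: `run_and_read` is a hypothetical helper that assumes a POSIX `sh` and `sleep` are available, and it reads stdout until end-of-file, roughly as a caller that wants the complete output would.

```python
import subprocess
import time


def run_and_read(command: str) -> tuple[str, float]:
    """Run a shell command and read its stdout until EOF (hypothetical helper)."""
    start = time.monotonic()
    proc = subprocess.Popen(["sh", "-c", command], stdout=subprocess.PIPE, text=True)
    # read() returns only once every process holding the pipe's write end
    # has closed it, including background children of the shell.
    output = proc.stdout.read()
    proc.wait()
    return output, time.monotonic() - start


# The background sleep inherits stdout, so reading blocks until it exits.
inherited_out, inherited_secs = run_and_read("sleep 2 & echo started")

# With the background process redirected, EOF arrives as soon as the shell exits.
redirected_out, redirected_secs = run_and_read("sleep 2 >/dev/null 2>&1 & echo started")

print(f"inherited stdout:  {inherited_secs:.1f}s")
print(f"redirected stdout: {redirected_secs:.1f}s")
```

Both calls capture the same output. Only the first one waits for the background `sleep`, which is the silent delay this change warns about.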

Unfortunately just changing the behavior of these widely used methods is not an option.

This change detects when a command has exited but stdout remains open for more than 10 seconds, and emits a warning to help users diagnose the issue. This warning briefly explains the problem and suggests redirecting background process output to avoid the implicit wait.

The implementation uses bash coproc to independently track:

- The command process exit status (preserved for return)
- The stdout closure (via base64 process termination)
- A 10-second timeout using wait -n to race these events

When stdout closes quickly: no warning
When stdout stays open >10s: warning emitted, continues waiting
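The race between command exit, stdout closure, and a timeout can be sketched in Python. This is an analogue for illustration only: the PR itself implements the logic in bash with coproc and `wait -n`, and the function name, the `grace_seconds` parameter, and the warning text below are assumptions.

```python
import subprocess
import threading


def run_with_stdout_warning(command: str, grace_seconds: float = 10.0):
    """Run a command; note a warning if stdout stays open after it exits."""
    emitted = []
    chunks = []
    proc = subprocess.Popen(["sh", "-c", command], stdout=subprocess.PIPE, text=True)

    # Drain stdout on a separate thread so that process exit and stdout EOF
    # can be observed independently of each other.
    reader = threading.Thread(target=lambda: chunks.append(proc.stdout.read()))
    reader.start()

    status = proc.wait()                # the command itself has exited
    reader.join(timeout=grace_seconds)  # race stdout EOF against the timeout
    if reader.is_alive():
        emitted.append(
            f"command exited with status {status}, but stdout is still open; "
            "a background process may have inherited it"
        )
        reader.join()                   # keep waiting: return behavior is unchanged
    return status, "".join(chunks), emitted


# Short grace period so the demonstration finishes quickly.
slow_status, slow_output, slow_warnings = run_with_stdout_warning(
    "sleep 2 & echo hi; exit 3", grace_seconds=0.5
)
quick_status, quick_output, quick_warnings = run_with_stdout_warning(
    "echo hi", grace_seconds=0.5
)
```

As in the description above, the exit status and the full output are still returned in both cases. The only difference is whether a warning is produced while waiting.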



Ma27 · 2025-12-22T10:49:15Z

Is there any prior discussion to this problem / how wide-spread that is?
I happen to work for a company nowadays that makes very heavy use of the test driver for a few projects and I think I've seen a bunch of the common pain points now (e.g. #399245 and similar problems under heavy CI load), but this is a new one for me. Am I getting it right that this is not really a timing problem, but wrong use of machine.execute?

Overall, I'd feel quite uneasy about adding a giant pile of bash(!) to handle process monitoring. I've seen bash used for non-trivial tasks quite a few times, and too often it bit back. Isn't this something the backdoor itself should ideally be able to do, rather than adding another layer on top in the "client"?

At the very least the behavior is documented in the manual: https://nixos.org/manual/nixos/stable/#ssec-machine-objects

roberth · 2025-12-22T12:09:32Z

> Is there any prior discussion to this problem / how wide-spread that is?

I've added #144875 to the description. Hardly any discussion, but also I don't know how anyone would even find that issue in the first place.

The hidden nature of the problem makes it so much worse, both for our ability to find info about the issue, and to even figure out that it happens in the first place.

I suspect we'd have more & detached processes in tests, and more elaborate tests if it wasn't for this problem.

> Am I getting it right that this is not really a timing problem, but wrong use of machine.execute?

In a way, but the interface is "any bash statement" and it's not unreasonable to expect it to exit when the statement exits.

> uneasy about adding a giant pile of bash(!)

I understand, and tbh I felt uneasy writing it, but it has the benefit of portability. It works with any VM that can offer a >=4.something bash on serial. That includes other distributions and an initrd without any extra interpreters.
I have considered doing it in Python, but this would somewhat reduce the aforementioned portability, and in my experience doing process management in other languages isn't obviously simpler either.

I guess we could put this on ice until we enable those fancy use cases and implement this in a language like python and/or turn the backdoor into a more structured protocol and whatnot. I would love to see that happen, but it's not something I can just go ahead and make happen.

What I have done is write a test for the warning functionality, which will be helpful whenever we want to change the implementation.

> At the very least the behavior is documented in the manual

I never know when people read my writing, but I don't count on it. Thank you for reading.

I believe that unfortunately many people already feel like they're going the extra mile by even writing a test, and any setbacks are more likely to kill their motivation than make them read the docs; subconsciously a bridge too far for them.

So yeah, the reason why I make this trade-off (complexity -> visibility) is because I believe we're better off with a testing system that's easier to use, even at increased maintenance cost.
It's a long-term view where I believe the additional tests and additional people who write tests offset the complexity.

Ma27 · 2025-12-22T14:41:51Z

> In a way, but the interface is "any bash statement" and it's not unreasonable to expect it to exit when the statement exits.

I mean, there's a direct contradiction between "awaiting statement exit" and "getting stdout from Machine.execute" and that's where I'd expect people to test which is true (or check the manual). That's my reasoning for being surprised that this is much of an issue, but apparently it is.
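To make the tension between "return when the statement exits" and "return the complete stdout" concrete, here is a small sketch under stated assumptions. `wait_for_exit_only` is a hypothetical helper, not part of the test driver; it sends stdout to a temporary file instead of a pipe, so it can return on exit, at the cost of missing anything a background process writes later.

```python
import subprocess
import tempfile
import time


def wait_for_exit_only(command: str) -> tuple[int, str, float]:
    """Return as soon as the shell exits; stdout goes to a file, not a pipe."""
    start = time.monotonic()
    with tempfile.TemporaryFile(mode="w+") as out:
        # No pipe is involved, so there is no EOF to wait for.
        exit_status = subprocess.run(["sh", "-c", command], stdout=out).returncode
        out.seek(0)
        # Only what was written before the shell exited is visible here.
        partial_output = out.read()
    return exit_status, partial_output, time.monotonic() - start


exit_status, partial_output, elapsed = wait_for_exit_only(
    "(sleep 2; echo late) & echo early"
)
print(exit_status, repr(partial_output), f"{elapsed:.1f}s")
```

The call returns promptly with only the early output. A caller that needs the late output as well has to wait for stdout to close, which is the behavior being discussed.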

> this in a language like python and/or turn the backdoor into a more structured protocol and whatnot

This is at least what I'd prefer, yes.

Considering that we use AF_VSOCK already for SSH in debug scenarios and #453305 gets rid of the downsides of having AF_VSOCK on the host-side, I wanted to see if it somehow makes sense to just use this approach for all of the communication instead of backdoor.service (motivations being mostly less code for us to maintain and a single backdoor implementation). When I experiment with this, I'll also take a look at how to deal with that problem.

That being said, this is nothing I actively plan, so if anybody wants to do some research on that end before me, feel free!

roberth · 2025-12-22T19:18:30Z

> there's a direct contradiction

True, but that's not apparent from, for instance, just the name of the method. succeed is a deceptive name for the two things it does. Arguably that should be fixed too, and that could be another angle to solve the issue.
Takes more time and relearning though.

> AF_VSOCK

Definitely good, but less portable, so I don't know if we'll end up replacing it entirely, but the serial backdoor can be brought back if needed for non-NixOS, which is not really a thing yet anyway.

Ma27 · 2025-12-29T12:15:44Z

> Definitely good, but less portable, so I don't know if we'll end up replacing it entirely, but the serial backdoor can be brought back if needed for non-NixOS, which is not really a thing yet anyway.

What are you worried about specifically? Some thoughts from my end:

- Inside the guest we just leverage systemd for that, so any modern Linux distro (I'm aware that there are ideas to run tests with other distros with this) should be usable. In fact, `test.thing`, a test runner that already supports multi-distro use cases, has even written a polyfill for that: https://codeberg.org/lis/test.thing/src/branch/main/workarounds
- For the host OS, I haven't checked whether macOS works fine with vhost-device-vsock, but I think a goal of this project is to be no longer Linux-specific. Anyway, I'm adding support for that in #453305 (nixos/test-driver: use vhost-device-vsock for SSH backdoor) because we'd need it anyway for the default backdoor (otherwise we'd need extra-sandbox-paths = /dev/vhost-vsock, and vsock numbers are host-global).
- If it's about the hypervisor, then I can say that e.g. cloud-hypervisor supports vsock with sockets on the host side natively, IIRC. Admittedly, I haven't checked how to implement this with nspawn, which is an open, yet required, thing to check.

roberth requested a review from tfc December 15, 2025 20:28

mdaniels5757 changed the base branch from master to staging-nixos January 18, 2026 03:40

nixpkgs-ci bot closed this Jan 18, 2026

nixpkgs-ci bot reopened this Jan 18, 2026

roberth wants to merge 1 commit into NixOS:staging-nixos from roberth:nixos-test-driver-warn-when-blocked-after-exit
