nixos/test-driver: add support for nspawn containers#478109
nixos/test-driver: add support for nspawn containers#478109Ma27 merged 37 commits intoNixOS:staging-nixosfrom
Conversation
1a58317 to
81245e7
Compare
There was a problem hiding this comment.
Pull request overview
This PR introduces support for systemd-nspawn containers as a lightweight alternative to QEMU VMs in the NixOS test framework. The implementation refactors the Python test driver to use an abstract BaseMachine class with separate QemuMachine and NspawnMachine implementations, enabling tests to run both VMs and containers in parallel with shared networking infrastructure.
Key changes:
- Refactored test driver from monolithic
Machineclass to abstractBaseMachinewith specializedQemuMachineandNspawnMachinesubclasses - Created new
guest-networking-options.nixmodule to share VLAN configuration between QEMU VMs and nspawn containers - Added
run-nspawnPython package for managing container lifecycle, networking, and process execution viansenter
Reviewed changes
Copilot reviewed 25 out of 25 changed files in this pull request and generated 12 comments.
Show a summary per file
| File | Description |
|---|---|
| nixos/tests/test-containers.nix | New test demonstrating basic container startup and VLAN isolation |
| nixos/tests/test-containers-bittorrent.nix | Complex networking test with NAT/UPnP using multiple containers |
| nixos/tests/all-tests.nix | Registers new container tests |
| nixos/modules/virtualisation/qemu-vm.nix | Extracts networking options to shared module |
| nixos/modules/virtualisation/guest-networking-options.nix | New shared networking configuration for VMs and containers |
| nixos/modules/virtualisation/nspawn-container/default.nix | Container profile module with systemd-nspawn configuration |
| nixos/modules/virtualisation/nspawn-container/run-nspawn/ | Python package for container lifecycle management |
| nixos/modules/testing/test-instrumentation.nix | Disables backdoor for containers (not compatible) |
| nixos/lib/testing/nodes.nix | Adds container support alongside nodes with separate defaults |
| nixos/lib/testing/network.nix | Refactors networking to support both VMs and containers |
| nixos/lib/testing/driver.nix | Updates driver build to pass container scripts separately |
| nixos/lib/testing/run.nix | Adds uid-range requirement for container tests |
| nixos/lib/testing/testScript.nix | Minor variable rename for clarity |
| nixos/lib/testing/nixos-test-base.nix | Removes qemu-vm import (now conditional) |
| nixos/lib/test-driver/src/test_driver/machine/init.py | Major refactoring: BaseMachine, QemuMachine, NspawnMachine classes |
| nixos/lib/test-driver/src/test_driver/driver.py | Updates driver to handle both VM and container machines |
| nixos/lib/test-driver/src/test_driver/init.py | Adds CLI arguments for container support |
| nixos/lib/test-driver/default.nix | Adds systemd and util-linux dependencies |
| nixos/lib/test-script-prepend.py | Updates type hints for new machine classes |
| nixos/lib/testing-python.nix | Adds containers parameter |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
nixos/modules/virtualisation/nspawn-container/run-nspawn/src/run_nspawn/__init__.py
Show resolved
Hide resolved
|
@ofborg test nat.firewall networking.scripted.link installer.simpleProvided installer.separateBootFat networking.scripted.virtual printing keymap.azerty keymap.dvorak-programmer boot-stage1 installer.swraid keymap.neo nfs4.simple i3wm udisks2 networking.networkd.loopback containers-ip ecryptfs login installer.simpleLabels php.httpd zfs.installer predictable-interface-names.unpredictable predictable-interface-names.unpredictableNetworkd mutableUsers |
227fa66 to
ccf61f9
Compare
ccf61f9 to
0722a1f
Compare
Ma27
left a comment
There was a problem hiding this comment.
Mostly low-hanging fruits. Hopefully I'll have spoons to do the rest tomorrow or next week :)
| # (n-daemon)[417]: transmission.service: Failed to create destination mount point node '/run/transmission/run/host/.os-release-stage/', ignoring: Read-only file system | ||
| # (n-daemon)[417]: transmission.service: Failed to mount /run/systemd/propagate/.os-release-stage to /run/transmission/run/host/.os-release-stage/: No such file or directory | ||
| # (n-daemon)[417]: transmission.service: Failed to set up mount namespacing: /run/host/.os-release-stage/: No such file or directory | ||
| # (n-daemon)[417]: transmission.service: Failed at step NAMESPACE spawning /nix/store/zfksw9bllp95pl45d1nxmpd2lks42bkj-transmission-4.0.6/bin/transmission-daemon: No such file or directory |
There was a problem hiding this comment.
Have you traced down what settings lead to that problem?
At this point I'm unsure if I consider this a potential bug in the driver, something that needs to be fixed or something that we can just accept (if so, I think it's worth leaving a more detailed rationale here).
There was a problem hiding this comment.
This setting specifically:
Disabling it manually lets transmission start successfully.
There was a problem hiding this comment.
Maybe one of our systemd folks has an opinion on that? I'm not sure if we're missing something here or if just turning the option for that test-case off is OK cc @ElvishJerricco @nikstur
There was a problem hiding this comment.
Do I understand you right in that you would prefer a test that uses the upstream services.transmission module (with the option turned off if necessary) to a test that uses aria2?
There was a problem hiding this comment.
It's not about the service itself, I'm just wondering if we're holding something wrong here. And finally, I wouldn't want to lift arbitrary hardening once we get to the point that this feature of the test framework is being used inside nixpkgs.
maybe cc @NixOS/systemd reaches more people who could weigh in.
9f082bb to
abd52f4
Compare
80b4cb2 to
e02998f
Compare
| visible = "shallow"; | ||
| description = '' | ||
| An attribute set of NixOS configuration modules. | ||
| An attribute set of NixOS configuration modules representing QEMU vms that can be started during a test. |
There was a problem hiding this comment.
This isn't something that needs to happen in this PR but perhaps worth thinking about early:
Once the early kinks of the container backend are ironed out, we probably want to have containers become the default backend. For that we'd need to explicitly declare the tests which rely on hardware virtualisation as such.
I'd prefer if "nodes" was kept as an abstract option to declare multiple NixOS systems without needing to declare the technical detail of how they're implemented. I'd also prefer if the majority of nixosTests that have no requirement for this technical detail didn't have to explicitly declare it either and would keep using "whatever is the default"; with merely the meaning of the default being changed in the future.
For that we'd need the distinction between "nixos systems in virtual machines" and "nixos systems". What do you think of:
- Adding an option specifically for declaring VMs; mirroring the
containersone - (Later) Migrating all tests that require VM-level virtualisation to use said option to explicitly declare that fact
- (Later still) Switching the default implementation for
nodesto containers once they're production-ready and sufficient time has passed to allow downstream consumers to migrate to the VM option where needed
There was a problem hiding this comment.
Hey @Atemu, this was our exact thought! It seems like this method will gain broader consensus easily. :-)
There was a problem hiding this comment.
I think I agree here, yes. I especially cannot come up with an alternate generic name for nodes that's reasonably good and would save the migration effort.
But this PR is a beast already (not a surprise given what had to be done!), so I have a strong opinion on doing this in a follow-up to get this one out sooner than later.
There was a problem hiding this comment.
Thanks @Atemu for sketching out this actionable roadmap. This is indeed what we had in mind; the amount of CPU time we'd save with a more lightweight default almost argues for itself, so of course we're going to follow up on this first step.
However, I am with @Ma27 in that this should wait until this PR and #479968 (the docs) have been merged.
Ma27
left a comment
There was a problem hiding this comment.
So the two things missing that I'm seeing is @KiaraGrouwstra's note on containers missing as _module.arg and a comment needing an update. Rest is good from my view, hence I'm approving.
Thanks a lot for the work and bearing with me as reviewer!
@kmein while you're at it, you may want to consider retargeting this to staging-nixos since this PR would be qualified to go through that branch instead (this is way faster than having this go through all of staging).
cc @K900 @zowoq as a heads-up since it's usually you merging this into master.
| }; | ||
| baseQemuOS = baseOS.extendModules { | ||
| modules = [ | ||
| ../../modules/virtualisation/qemu-vm.nix |
There was a problem hiding this comment.
This import only applies to QEMU. It's correct that containers are ignoring virtualisation.memorySize, but the option doesn't exist there (unless you'd explicitly defined it, of course):
error: The option `containers.c1.virtualisation.memorySize' does not exist. Definition values:
- In `/home/ma27/Projects/nixpkgs-hack/nixos/tests/nixos-test-driver/containers.nix': 1024
| visible = "shallow"; | ||
| description = '' | ||
| An attribute set of NixOS configuration modules. | ||
| An attribute set of NixOS configuration modules representing QEMU vms that can be started during a test. |
There was a problem hiding this comment.
I think I agree here, yes. I especially cannot come up with an alternate generic name for nodes that's reasonably good and would save the migration effort.
But this PR is a beast already (not a surprise given what had to be done!), so I have a strong opinion on doing this in a follow-up to get this one out sooner than later.
2f28ffa to
d589103
Compare
Resolves an issue where nodes on shared secondary VLANs could not reach each other if their primary IPs were on isolated networks.
d589103 to
6631316
Compare
d19dde7 to
c22da90
Compare
|
@Ma27 I've checked the final items off the list and retargeted to Also, thank you @jfly for the head start that brought the finish line into sight—and for your continued input here as well. |
4487447 to
324dec8
Compare
| # note that this is a hacky solution to the fact that | ||
| # the container's "real" /run is clobbered by a tmpfs (see below) | ||
| Path("/host/run").mkdir(parents=True) | ||
| subprocess.run(["mount", "--bind", "/run", "/host/run"], check=True) |
There was a problem hiding this comment.
Considering that this got pushed after fixing up discussions and being at a point where this PR is good to go, is this intended to be on this branch?
Can't this be put into some directory in the container's /run, I don't see how a tmpfs moutned in there is keeping us from doing that.
Also, does /run/opengl-driver even exist in a sandbox? If this is only thrown in via sandbox-paths this strongly looks like a hack for a hack. I'd really prefer to do this in a follow-up and submit the chunk that's good to go already.
There was a problem hiding this comment.
That’s a fair point—I agree that it might feel like a hack. I've gone ahead and removed that commit from this PR so we can keep this focused on the verified changes. I'll look into a more proper implementation in a separate follow-up.
This should be good to go now!
Eveeifyeve
left a comment
There was a problem hiding this comment.
I approve this pr getting merged, with the follow-up pr with the stuff mentioned from @Mic92 on the horizon.
Co-authored-by: cinereal <cinereal@riseup.net>
|
CI reports job canceled? |
|
Bisection is telling me that 23f1e63 broke manual build, e.g. |
|
|
|
Ah, it is not a channel blocker. I forgot that we don't block on Firefox tests (anymore) but only on the builds. |
|
Bisect says 23f1e63 broke eval of at least |
Motivation
Current NixOS integration tests rely heavily on QEMU, which can be slow and resource-intensive. This PR introduces
systemd-nspawnas a lightweight container backend, significantly reducing test latency and enabling hardware passthrough scenarios that are difficult to achieve in VMs.Key advantages of container tests
~25% improvement in test execution speed on Intel(R) Core(TM) Ultra 9 285HX
benchmark of 24 machines running GNU helloTry it out (2 simple steps!)
systemd-nspawn:nix-build -A nixosTests.nixos-test-driver.containers: Basic startup and inter-VLAN isolation (VMs and containers in parallel).nix-build -A nixosTests.test-containers-bittorrent: Complex networking (NAT/UPnP) involving multiple containers.Implementation details
QemuMachineclass (inheriting from an abstractBaseMachineclass).NspawnMachineclass replicates as much functionality ofQemuMachineas possible usingsystemd-nspawncontainers.virtualisation.vlans) for containers.Debugging
To debug a failing container test, introduce
enableDebugHookandsshBackdoorlike so:The test will then print an ssh command on startup:
Upon failure, the test will print a command to attach to it, e.g.
Run this to get a shell inside the sandbox, where you can run the
sshcommand above to enter the container.Note: Due to the nature of
systemd-nspawn, interactive execution of the tests requires root privileges:sudo $(nix -L build --print-out-paths .#nixosTests.test-containers.driverInteractive)/bin/nixos-test-driver --interactiveLimitations of containers
setuidwrappers (likesudo—though you can userunuserinstead).systemdhardening options (ProtectSystem=etc.) used by NixOS modules such asservices.transmissionamong many others./dev, making it necessary to pass in needed paths, e. g.--option sandbox-paths /dev/netfor VPN tests that create/dev/net/tun.Credits and history
Based on the testing infrastructure from @clan-lol.
The heavy lifting for integrating this into nixpkgs was done by @jfly.
Things done
passthru.tests.nixpkgs-reviewon this PR. See nixpkgs-review usage../result/bin/.Add a 👍 reaction to pull requests you find important.