build06: init #1807

Closed

zowoq wants to merge 1 commit into master from build06

Conversation

zowoq (Contributor) commented May 1, 2025

Will wait until the config is finalised before ordering the machine.

cc @ConnorBaker @SomeoneSerge

This wires the machine up so it can be used by https://hydra.nix-community.org/ (and https://buildbot.nix-community.org/) but only for builds that set the cuda requiredSystemFeature.
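For context, a minimal sketch of what opting into that feature looks like on the build side. The package name and build body are invented here; requiredSystemFeatures itself is the real Nix mechanism:

```nix
# Hypothetical derivation: the name and build script are invented, but
# requiredSystemFeatures is the real mechanism. Only builders that
# advertise the "cuda" system feature will be offered this build.
{ pkgs ? import <nixpkgs> { } }:

pkgs.runCommand "cuda-smoke-test"
  {
    requiredSystemFeatures = [ "cuda" ];
  } ''
    # A real test would exercise the GPU here.
    touch $out
  ''
```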

open = true;
};

programs.nix-required-mounts = {
Contributor

We'll need to hack around NixOS/nix#9272, my tentative plan was to see if it's enough to ad hoc patch Nix just on the remote side without touching the requesting side

Contributor Author

We can try that, but as long as it can eventually be upstreamed, patching both sides is fine.

@Mic92 What do you think a fix for NixOS/nix#9272 would look like?

Contributor Author

As the only builds that'll run on this machine are the cuda tests, could we set the sandbox paths directly instead of using the hook?
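A minimal sketch of that alternative, assuming the usual NVIDIA device nodes and driver path. The exact list depends on the driver setup, so the paths below are assumptions:

```nix
# Hypothetical sketch: expose the GPU to every sandboxed build directly
# instead of relying on the nix-required-mounts pre-build hook. These
# paths are typical for NVIDIA on NixOS but are assumptions here.
nix.settings.extra-sandbox-paths = [
  "/dev/nvidia0"
  "/dev/nvidiactl"
  "/dev/nvidia-uvm"
  "/run/opengl-driver"
];
```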

Member

What is the issue here? Is this needed for some cachix pre-build hook?

Contributor Author

What is the issue here? Is this needed for some cachix pre-build hook?

Did you read the issue linked in the previous comments?

Contributor Author

OTOH do we actually care? These outputs get garbage collected anyway.

Everything built gets pushed to our cachix cache. To avoid that we'd need a separate hydra instance just for running the tests. That then has problems: either we'd need to build the non-test derivations twice, or run the test derivations on a much slower schedule to ensure they have already been built on the main hydra (and we'd still have the problem of derivations that failed on the main hydra being attempted a second time on the test hydra).

Adding features to non-test derivations instead of the test derivations is an interesting idea. However, a derivation's system features are part of the derivation struct and affect its hash.

Yes. As long as the feature is a nix default (which big-parallel is) and also correct (I think big-parallel does apply for most of them) I don't see an issue?
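A sketch of how that reads from the scheduling side (hostname and numbers hypothetical): the builder advertises both features, and since big-parallel is one of Nix's default system features, tagging non-test derivations with it doesn't restrict where else they can build.

```nix
# Hypothetical remote-builder entry: jobs tagged "cuda" can only land
# on builders like this one, while "big-parallel" jobs can still build
# anywhere that advertises that (default) feature.
nix.buildMachines = [{
  hostName = "build06";
  sshUser = "builder";
  system = "x86_64-linux";
  maxJobs = 2;
  supportedFeatures = [ "big-parallel" "cuda" ];
}];
```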

Contributor

Everything built gets pushed to our cachix cache.

What I meant to say is: cachix contents eventually get garbage-collected...

Contributor

a nix default (which big-parallel is) and also correct (I think big-parallel does apply for most of them) I don't see an issue?

Ah, I see. Clever. This should work?

zowoq (Contributor Author) commented Jul 17, 2025

https://github.com/helsinki-systems/hydra-queue-runner

Using the new queue runner seems like it would address this problem.

Working on this in #1912, not usable yet.


Adding the paths that would have been set by the pre-build-hook to nix.settings.sandbox-paths

This is a little tricky, as the hook performs operations like symlink resolution that cannot be done at evaluation time.

Behold, my solution:

programs.nix-required-mounts.extraWrapperArgs = [
  "--run shift"
  "--add-flag '${builtins.unsafeDiscardOutputDependency (derivation { name = "needs-cuda"; builder = "_"; system = "_"; requiredSystemFeatures = [ "cuda" ]; }).drvPath}'"
];

inputs.srvos.nixosModules.hardware-hetzner-online-intel
];

nix.settings.max-jobs = 14;


I'd recommend setting this to two, both in the case where this is used as an actual builder and because running a bunch of GPU tests simultaneously could cause an OOM event.

Contributor

👍🏻 This is where we suffer most from the lack of a "real" scheduler for Nix, with support for negative affinity and resource constraints like in SLURM. We'll definitely run into issues running things like the pytorch or pytorch-lightning test suites, where you can have matrices of tests running in parallel.

printf '%s' "$machines" > $out
substituteInPlace $out --replace-fail 'ssh-ng://' 'ssh://'
substituteInPlace $out --replace-fail ' 80 ' ' 3 '
substituteInPlace $out --replace-fail ' 14 ' ' 2 '
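For illustration, the same rewrite can be applied to a sample machines-file line with plain sed instead of substituteInPlace (the hostname and feature list here are invented; the column order is URI, system, SSH key, max-jobs, speed factor, features):

```shell
# Hypothetical sample line from a Nix machines file.
line='ssh-ng://builder@build06 x86_64-linux - 14 80 cuda,big-parallel'

# Apply the same three rewrites as the substituteInPlace calls above:
# force the ssh:// protocol, drop the speed factor from 80 to 3, and
# drop max-jobs from 14 to 2.
printf '%s\n' "$line" | sed -e 's|ssh-ng://|ssh://|' -e 's| 80 | 3 |' -e 's| 14 | 2 |'
# → ssh://builder@build06 x86_64-linux - 2 3 cuda,big-parallel
```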


zowoq (Contributor Author) commented Jul 31, 2025

How are these tests going to be set up in nixpkgs for CI to run? Will they be added to the current release-cuda jobset or a new jobset?

zowoq (Contributor Author) commented Aug 18, 2025

How are these tests going to be set up in nixpkgs for CI to run? Will they be added to the current release-cuda jobset or a new jobset?

@ConnorBaker @SomeoneSerge

zowoq (Contributor Author) commented Aug 27, 2025

How are these tests going to be set up in nixpkgs for CI to run? Will they be added to the current release-cuda jobset or a new jobset?

@ConnorBaker @SomeoneSerge

^ asking again

@ConnorBaker

Yes, I imagine they'd be added there and would include requiredSystemFeatures = [ "cuda" ];. Although without allowing the GPU in the sandbox for all builds (or something like #1912), I'm still not sure how we'd do the testing :l

zowoq (Contributor Author) commented Aug 29, 2025

(or something like #1912)

I'm working on this but it's hard to say when it'll be ready; it could be in a week or two, and I'd certainly hope to have it running within a month or two.

@ConnorBaker

Hopefully that gives me a chance to make sure things actually build 🫠

zowoq (Contributor Author) commented Sep 3, 2025

#1912 is merged but I want to see that it can handle a full rebuild from a staging-next merge without problems. There is a staging-next PR open at the moment which I expect will be merged in a week or two; after that we should be ready to move forward with this.

zowoq (Contributor Author) commented Sep 10, 2025

#1912 is merged but I want to see that it can handle a full rebuild from a staging-next merge without problems. There is a staging-next PR open at the moment which I expect will be merged in a week or two; after that we should be ready to move forward with this.

Staging-next was merged earlier; there weren't any problems with the new queue runner handling the full rebuild, so we're ready to move forward with this.

I'll wait for the tests to be included in a jobset that we can run before ordering the machine.

zowoq (Contributor Author) commented Sep 10, 2025

https://discourse.nixos.org/t/nix-flox-nvidia-opening-up-cuda-redistribution-on-nix/69189

https://matrix.to/#/!eWOErHSaiddIbsUNsJ:nixos.org/$jZY-rPxCzGokgIGyZGMHXdP-iM2-LLQTMoycDZ6_J5g?via=nixos.org&via=matrix.org&via=nixos.dev
The negotiations with NVIDIA have been run by Flox (although in parallel with many other companies' simultaneous inquiries). Ron kept us, the Foundation, and the SC in the loop, and offered both legal help and workforce. The current idea, roughly, is that the CUDA team gets access to the relevant repo and infra, and works closely together with Flox to secure the position and a comms channel to NVIDIA.

I've been seeing occasional issues from people about cache misses because we're hitting the limits of our cache capacity. We really need to significantly increase our cache capacity (#1926) for the cuda builds and tests to be sustainable.

Not going to waste community money on a cache we don't need if flox is already doing it.

I don't think we should proceed with the gpu tests here and we should also stop building the cuda packages.

Honestly can't believe that no one bothered to mention that flox was working on their own cache and infra before the announcement.

@ConnorBaker @SomeoneSerge

cc @zimbatm

@SomeoneSerge (Contributor)

Honestly can't believe that no one bothered to mention that flox was working on their own cache and infra before the announcement.

DISCLAIMER: I didn't check the latest revision of this message with Connor or anyone else, and so cannot represent anyone but myself.

Hello @zowoq! This is a failure, on the part of the cuda team at least. I can see how this breaks trust or reads as some sort of disrespect. I hope we can one day sit in one room nonetheless.

Regarding the jobset: if you'd rather separate and make distance, we can start running our own. We wouldn't want to abuse your work. We still aim to implement GPU Tests outside Flox, for one thing because we're not currently aware of Flox having any interest in GPU Tests, but mainly because a Nix-Community-like structure offers a clearer story for sponsorships, greater flexibility, and more diversification. Concerning Cachix, I believe that splitting it off and hosting it under a new name would involve too much false-signaling and communication overhead. It would just be more pragmatic if we could reuse the Nix-Community substitutor and include its scaling in the GPU Tests' "budget". Of course, a dedicated server or a Backblaze bucket used as a backend for snix-narbridge or attic must still be an order of magnitude more affordable than Cachix, as you outline in #1926.

With that out of the way, I'd like to attach a brief transcript of interactions between the CUDA Team and Flox to date, so you may better judge whether to cooperate with us any further.

  1. On May 5 Foundation board member Sebastian (ra33it) and the CUDA Team started a private matrix room for coordinating the communications that various parties (including commercial entities) have been engaging in regarding the CUDA EULA issue. We only included people we were already talking to privately who had consented to participating in a group chat. We also invited Ron and Tom as representatives of the Board and the SC respectively, and of Flox both. Ron shared that they at Flox had been very interested in the issue "for years", and were trying to find relevant contacts.
  2. On June 13 Ron, Connor, and I had a group call.
    • Connor and I provided a recap on the team's status with the infrastructure, labour, and communication with the potential sponsor companies.
    • Ron shared that they at Flox had managed to meet a contact at NVIDIA, which could potentially lead to progress for the EULA issue. A Flox-hosted cache was mentioned.
    • We all expressed agreement that it's in everyone's long-term interest to see a Nix CUDA cache outside Flox, preferably under the Foundation.
    • We also discussed the concept of a federated community infrastructure and of sponsoring third party non-profits (like hackspaces) to maintain their own and depend on the Foundation less. I believe Jonas was directly referenced.
    • We all acknowledged that, after years of NVIDIA not responding to any attempts at starting a conversation on part of various Nixpkgs contributors and their companies, securing any communication channel at all is a priority.
  3. On June 16 Ron shared an address he would present to the SC. The letter addressed the hypothetical outcome of NVIDIA only allowing Flox (a US-based for-profit entity) to redistribute "patchelfed" CUDA. None of this was to be publicly disclosed, as no deal had been secured by then, and neither company made any public announcements.
  4. By June 24th the CUDA team sent back its feedback for the letter, which included minor stylistic edits, but also general comments about conflicts of interest, sustainability, cost of labour, interactions between the Foundation, the SC, and "the industry", etc.
  5. On July 8 Ron shared that they are beginning to "draft contracts". I'm going to guess that by then the SLA-based arrangement was basically confirmed, and we must have known that Flox would have to do the redistribution via their own cache.
  6. Between August 14 and August 19, Ron and Sebastian suggested that an announcement slide be added to the State of the Union presentation at NixCon.
  7. On August 24 Connor, Ron, and I had another call about who and what it is acceptable to mention during the State of the Union. Ron also shared that they might not be able to make the announcement at NixCon because of NVIDIA postponing theirs.
  8. On September 5 Ron confirmed that they cannot make the announcement at NixCon.
  9. On September 9 Ron shared they were given the green light and had to act quickly to make the announcement.

Hope this gives you some answers. In particular, yes, we had known that the potential Flox-NVIDIA arrangement would involve a binary cache, and it was an oversight not to include you in the discussion back then. While I'm quite happy that Flox is open to talking to and cooperating with the Nixpkgs developer community, Flox's binary cache is, at the end of the day, Flox's binary cache. Flox's private deals with NVIDIA are Flox's private deals with NVIDIA. I personally do not think the Flox-NVIDIA partnership is any reason at all to abandon the CUDA CI efforts under Nix-Community. The communication and trust issues, of course, would be. If you do find yourself still willing to talk to us, both we at the CUDA Team and, I was just assured, Ron on the Flox side would appreciate having a meeting with you.

Cheers,
Serge

zowoq (Contributor Author) commented Sep 17, 2025

Rather than responding to all of that immediately, I'd like an answer to another question first. If you scroll up to #1807 (comment) you'll see that I asked the same question three times over the course of a month before I got a response. Why wasn't it answered promptly?

SomeoneSerge (Contributor) commented Sep 17, 2025 via email

zowoq (Contributor Author) commented Sep 17, 2025

Between August 14 and August 19, Ron and Sebastian suggested that an announcement slide be added to the State of the Union presentation at NixCon.
On August 24 Connor, Ron, and I had another call about who and what it is acceptable to mention during the State of the Union. Ron also shared that they might not be able to make the announcement at NixCon because of NVIDIA postponing theirs.

I didn't get a response for a question I asked on the 1st until the 29th.

It's fine that you were burnt out; your personal availability isn't the issue. The issue is that the cuda team needs to be reliable.

At the moment I'm trying to decide if it's even worth having a discussion about moving forward with this.

Mic92 (Member) commented Sep 17, 2025

How much cache do we need for nix-community? I might work on something that works with: https://www.hetzner.com/storage/object-storage/
Is Flox now also doing GPU tests? This seems a bit orthogonal to building cuda packages?

zowoq (Contributor Author) commented Sep 17, 2025

How much cache do we need for nix-community? I might work on something that works with: https://www.hetzner.com/storage/object-storage/

Please move this discussion to #1926.

@nixos-discourse

This pull request has been mentioned on NixOS Discourse. There might be relevant details there:

https://discourse.nixos.org/t/nix-flox-nvidia-opening-up-cuda-redistribution-on-nix/69189/23

@zowoq zowoq closed this Sep 27, 2025
@zowoq zowoq deleted the build06 branch September 27, 2025 22:23
@nix-community nix-community locked and limited conversation to collaborators Sep 27, 2025
@nix-community nix-community unlocked this conversation Oct 21, 2025
@nix-community nix-community locked and limited conversation to collaborators Dec 2, 2025