build06: init #1807

Closed

zowoq wants to merge 1 commit into master from build06

Conversation

zowoq (Contributor) commented May 1, 2025

Will wait until the config is finalised before ordering the machine.

cc @ConnorBaker @SomeoneSerge

This wires the machine up so it can be used by https://hydra.nix-community.org/ (and https://buildbot.nix-community.org/) but only for builds that set the cuda requiredSystemFeature.
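For context, a minimal sketch of what opting into that feature looks like on the build side. The package name and build body are invented here; requiredSystemFeatures itself is the real Nix mechanism:

```nix
# Hypothetical derivation: the name and build script are invented, but
# requiredSystemFeatures is the real mechanism. Only builders that
# advertise the "cuda" system feature will be offered this build.
{ pkgs ? import <nixpkgs> { } }:

pkgs.runCommand "cuda-smoke-test"
  {
    requiredSystemFeatures = [ "cuda" ];
  } ''
    # A real test would exercise the GPU here.
    touch $out
  ''
```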

open = true;
};

programs.nix-required-mounts = {
Contributor

We'll need to hack around NixOS/nix#9272, my tentative plan was to see if it's enough to ad hoc patch Nix just on the remote side without touching the requesting side

Contributor Author

We can try that, but as long as it can eventually be upstreamed, patching both sides is fine.

@Mic92 What do you think a fix for NixOS/nix#9272 would look like?

Contributor Author

As the only builds that'll run on this machine are the cuda tests, could we set the sandbox paths directly instead of using the hook?
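A minimal sketch of that alternative, assuming the usual NVIDIA device nodes and driver path. The exact list depends on the driver setup, so the paths below are assumptions:

```nix
# Hypothetical sketch: expose the GPU to every sandboxed build directly
# instead of relying on the nix-required-mounts pre-build hook. These
# paths are typical for NVIDIA on NixOS but are assumptions here.
nix.settings.extra-sandbox-paths = [
  "/dev/nvidia0"
  "/dev/nvidiactl"
  "/dev/nvidia-uvm"
  "/run/opengl-driver"
];
```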

Member

What is the issue here? Is this needed for some cachix pre-build hook?

Contributor Author

What is the issue here? Is this needed for some cachix pre-build hook?

Did you read the issue linked in the previous comments?

Contributor Author

OTOH do we actually care? These outputs get garbage collected anyway.

Everything built gets pushed to our cachix cache. To avoid that we'd need a separate hydra instance just for running the tests. That then has problems: either we'd need to build the non-test derivations twice, or run the test derivations on a much slower schedule to ensure they have already been built on the main hydra (and we'd still have the problem of derivations that failed on the main hydra being attempted a second time on the test hydra).

Adding features to non-test derivations instead of the test derivations is an interesting idea. However, a derivation's system features are part of the derivation struct and affect its hash.

Yes. As long as the feature is a nix default (which big-parallel is) and also correct (I think big-parallel does apply for most of them) I don't see an issue?
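A sketch of how that reads from the scheduling side (hostname and numbers hypothetical): the builder advertises both features, and since big-parallel is one of Nix's default system features, tagging non-test derivations with it doesn't restrict where else they can build.

```nix
# Hypothetical remote-builder entry: jobs tagged "cuda" can only land
# on builders like this one, while "big-parallel" jobs can still build
# anywhere that advertises that (default) feature.
nix.buildMachines = [{
  hostName = "build06";
  sshUser = "builder";
  system = "x86_64-linux";
  maxJobs = 2;
  supportedFeatures = [ "big-parallel" "cuda" ];
}];
```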

Contributor

Everything built gets pushed to our cachix cache.

What I meant to say is: cachix contents eventually get garbage-collected...

Contributor

a nix default (which big-parallel is) and also correct (I think big-parallel does apply for most of them) I don't see an issue?

Ah, I see. Clever. This should work?

zowoq (Contributor Author) commented Jul 17, 2025

https://github.com/helsinki-systems/hydra-queue-runner

Using the new queue runner seems like it would address this problem.

Working on this in #1912, not usable yet.


Adding the paths that would have been set by the pre-build-hook to nix.settings.sandbox-paths

This is a little tricky, as the hook performs operations like symlink resolution that cannot be done at evaluation time.

Behold, my solution:

programs.nix-required-mounts.extraWrapperArgs = [
  "--run shift"
  "--add-flag '${builtins.unsafeDiscardOutputDependency (derivation { name = "needs-cuda"; builder = "_"; system = "_"; requiredSystemFeatures = [ "cuda" ]; }).drvPath}'"
];

inputs.srvos.nixosModules.hardware-hetzner-online-intel
];

nix.settings.max-jobs = 14;


I'd recommend setting this to two, both in the case where this is used as an actual builder and because running a bunch of GPU tests simultaneously could cause an OOM event.

Contributor

👍🏻 This is where we suffer most from the lack of a "real" scheduler for Nix, with support for negative affinity and resource constraints like in SLURM. We'll definitely run into issues running things like the pytorch or pytorch-lightning test suites, where you can have matrices of tests running in parallel.

printf '%s' "$machines" > $out
substituteInPlace $out --replace-fail 'ssh-ng://' 'ssh://'
substituteInPlace $out --replace-fail ' 80 ' ' 3 '
substituteInPlace $out --replace-fail ' 14 ' ' 2 '
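For illustration, the same rewrite can be applied to a sample machines-file line with plain sed instead of substituteInPlace (the hostname and feature list here are invented; the column order is URI, system, SSH key, max-jobs, speed factor, features):

```shell
# Hypothetical sample line from a Nix machines file.
line='ssh-ng://builder@build06 x86_64-linux - 14 80 cuda,big-parallel'

# Apply the same three rewrites as the substituteInPlace calls above:
# force the ssh:// protocol, drop the speed factor from 80 to 3, and
# drop max-jobs from 14 to 2.
printf '%s\n' "$line" | sed -e 's|ssh-ng://|ssh://|' -e 's| 80 | 3 |' -e 's| 14 | 2 |'
# → ssh://builder@build06 x86_64-linux - 2 3 cuda,big-parallel
```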


zowoq (Contributor Author) commented Jul 31, 2025

How are these tests going to be set up in nixpkgs for CI to run? Will they be added to the current release-cuda jobset or a new jobset?

zowoq (Contributor Author) commented Aug 18, 2025

How are these tests going to be set up in nixpkgs for CI to run? Will they be added to the current release-cuda jobset or a new jobset?

@ConnorBaker @SomeoneSerge

zowoq (Contributor Author) commented Aug 27, 2025

How are these tests going to be set up in nixpkgs for CI to run? Will they be added to the current release-cuda jobset or a new jobset?

@ConnorBaker @SomeoneSerge

^ asking again

@ConnorBaker

Yes, I imagine they'd be added there and would include requiredSystemFeatures = [ "cuda" ];. Although without allowing the GPU in the sandbox for all builds (or something like #1912), I'm still not sure how we'd do the testing :l

zowoq (Contributor Author) commented Aug 29, 2025

(or something like #1912)

I'm working on this but it's hard to say when it'll be ready; it could be in a week or two, and I'd certainly hope to have it running within a month or two.

@ConnorBaker

Hopefully that gives me a chance to make sure things actually build 🫠

zowoq (Contributor Author) commented Sep 3, 2025

#1912 is merged but I want to see that it can handle a full rebuild from a staging-next merge without problems. There is a staging-next PR open at the moment which I expect will be merged in a week or two; after that we should be ready to move forward with this.

zowoq (Contributor Author) commented Sep 10, 2025

#1912 is merged but I want to see that it can handle a full rebuild from a staging-next merge without problems. There is a staging-next PR open at the moment which I expect will be merged in a week or two; after that we should be ready to move forward with this.

Staging-next was merged earlier; there weren't any problems with the new queue runner handling the full rebuild, so we're ready to move forward with this.

I'll wait for the tests to be included in a jobset that we can run before ordering the machine.

zowoq (Contributor Author) commented Sep 10, 2025

https://discourse.nixos.org/t/nix-flox-nvidia-opening-up-cuda-redistribution-on-nix/69189

https://matrix.to/#/!eWOErHSaiddIbsUNsJ:nixos.org/$jZY-rPxCzGokgIGyZGMHXdP-iM2-LLQTMoycDZ6_J5g?via=nixos.org&via=matrix.org&via=nixos.dev
The negotiations with NVIDIA have been run by Flox (although in parallel with many other companies' simultaneous inquiries). Ron kept us, the Foundation, and the SC in the loop, and offered both legal help and workforce. The current idea, roughly, is that the CUDA team gets access to the relevant repo and infra, and works closely together with Flox to secure the position and a comms channel to NVIDIA.

I've been seeing occasional issues from people about cache misses because we're hitting the limits of our cache capacity. We really need to significantly increase our cache capacity (#1926) for the cuda builds and tests to be sustainable.

Not going to waste community money on a cache we don't need if flox is already doing it.

I don't think we should proceed with the gpu tests here and we should also stop building the cuda packages.

Honestly can't believe that no one bothered to mention that flox was working on their own cache and infra before the announcement.

@ConnorBaker @SomeoneSerge

cc @zimbatm

@SomeoneSerge (Contributor)

Honestly can't believe that no one bothered to mention that flox was working on their own cache and infra before the announcement.

DISCLAIMER: I didn't check the latest revision of this message with Connor or anyone else, and so cannot represent anyone but myself.

Hello @zowoq! This is a failure, on the part of the cuda team at least. I can see how this breaks trust or reads as some sort of disrespect. I hope we can one day sit in one room nonetheless.

Regarding the jobset: if you'd rather separate and make distance, we can start running our own. We wouldn't want to abuse your work. We still aim to implement GPU Tests outside Flox, for one thing because we're not currently aware of Flox having any interest in GPU Tests, but mainly because a Nix-Community-like structure offers a clearer story for sponsorships, greater flexibility, and more diversification. Concerning Cachix, I believe that splitting it off and hosting it under a new name would involve too much false-signaling and communication overhead. It would just be more pragmatic if we could reuse the Nix-Community substitutor and include its scaling in the GPU Tests' "budget". Of course, a dedicated server or a Backblaze bucket used as a backend for snix-narbridge or attic must still be an order of magnitude more affordable than Cachix, as you outline in #1926.

With that out of the way, I'd like to attach a brief transcript of interactions between the CUDA Team and Flox to date, so you may better judge whether to cooperate with us any further.

  1. On May 5 Foundation board member Sebastian (ra33it) and the CUDA Team started a private matrix room for coordinating the communications that various parties (including commercial entities) have been engaging in regarding the CUDA EULA issue. We only included people we were already talking to privately who had consented to participating in a group chat. We also invited Ron and Tom as representatives of the Board and the SC respectively, and of Flox both. Ron shared that they at Flox had been very interested in the issue "for years", and were trying to find relevant contacts.
  2. On June 13 Ron, Connor, and I had a group call.
    • Connor and I provided a recap on the team's status with the infrastructure, labour, and communication with the potential sponsor companies.
    • Ron shared that they at Flox had managed to meet a contact at NVIDIA, which could potentially lead to progress for the EULA issue. A Flox-hosted cache was mentioned.
    • We all expressed agreement that it's in everyone's long-term interest to see a Nix CUDA cache outside Flox, preferably under the Foundation.
    • We also discussed the concept of a federated community infrastructure and of sponsoring third party non-profits (like hackspaces) to maintain their own and depend on the Foundation less. I believe Jonas was directly referenced.
    • We all acknowledged that, after years of NVIDIA not responding to any attempts at starting a conversation on part of various Nixpkgs contributors and their companies, securing any communication channel at all is a priority.
  3. On June 16 Ron shared an address he would present to the SC. The letter addressed the hypothetical outcome of NVIDIA only allowing Flox (a US-based for-profit entity) to redistribute "patchelfed" CUDA. None of this was to be publicly disclosed, as no deal had been secured by then, and neither company made any public announcements.
  4. By June 24th the CUDA team sent back its feedback for the letter, which included minor stylistic edits, but also general comments about conflicts of interest, sustainability, cost of labour, interactions between the Foundation, the SC, and "the industry", etc.
  5. On July 8 Ron shared that they are beginning to "draft contracts". I'm going to guess that by then the SLA-based arrangement was basically confirmed, and we must have known that Flox would have to do the redistribution via their own cache.
  6. Between August 14 and August 19, Ron and Sebastian suggested that an announcement slide be added to the State of the Union presentation at NixCon.
  7. On August 24 Connor, Ron, and I had another call about who and what it is acceptable to mention during the State of the Union. Ron also shared that they might not be able to make the announcement at NixCon because of NVIDIA postponing theirs.
  8. On September 5 Ron confirmed that they cannot make the announcement at NixCon.
  9. On September 9 Ron shared they were given the green light and had to act quickly to make the announcement.

Hope this gives you some answers. In particular, yes, we had known that the potential Flox-NVIDIA arrangement would involve a binary cache, and it was an oversight not to include you in the discussion back then. While I'm quite happy that Flox is open to talking to and cooperating with the Nixpkgs developer community, Flox's binary cache is, at the end of the day, Flox's binary cache. Flox's private deals with NVIDIA are Flox's private deals with NVIDIA. I personally do not think the Flox-NVIDIA partnership is any reason at all to abandon the CUDA CI efforts under Nix-Community. The communication and trust issues, of course, would be. If you do find yourself still willing to talk to us, both we at the CUDA Team and, I was just assured, Ron on the Flox side would appreciate having a meeting with you.

Cheers,
Serge

zowoq (Contributor Author) commented Sep 17, 2025

Rather than responding to all of that immediately, I'd like an answer to another question first. If you scroll up to #1807 (comment) you'll see that I asked the same question three times over the course of a month before I got a response. Why wasn't it answered promptly?

SomeoneSerge (Contributor) commented Sep 17, 2025 via email

zowoq (Contributor Author) commented Sep 17, 2025

Between August 14 and August 19, Ron and Sebastian suggested that an announcement slide be added to the State of the Union presentation at NixCon.
On August 24 Connor, Ron, and I had another call about who and what it is acceptable to mention during the State of the Union. Ron also shared that they might not be able to make the announcement at NixCon because of NVIDIA postponing theirs.

I didn't get a response for a question I asked on the 1st until the 29th.

It's fine that you were burnt out; your personal availability isn't the issue. The issue is that the cuda team needs to be reliable.

At the moment I'm trying to decide if it's even worth having a discussion about moving forward with this.

Mic92 (Member) commented Sep 17, 2025

How much cache do we need for nix-community? I might work on something that works with: https://www.hetzner.com/storage/object-storage/
Is Flox now also doing GPU tests? This seems a bit orthogonal to building cuda packages?

zowoq (Contributor Author) commented Sep 17, 2025

How much cache do we need for nix-community? I might work on something that works with: https://www.hetzner.com/storage/object-storage/

Please move this discussion to #1926.

@nixos-discourse

This pull request has been mentioned on NixOS Discourse. There might be relevant details there:

https://discourse.nixos.org/t/nix-flox-nvidia-opening-up-cuda-redistribution-on-nix/69189/23

@zowoq zowoq closed this Sep 27, 2025
@zowoq zowoq deleted the build06 branch September 27, 2025 22:23
@nix-community nix-community locked and limited conversation to collaborators Sep 27, 2025
@nix-community nix-community unlocked this conversation Oct 21, 2025
@nix-community nix-community locked and limited conversation to collaborators Dec 2, 2025