Conversation
```nix
  open = true;
};

programs.nix-required-mounts = {
```
We'll need to hack around NixOS/nix#9272, my tentative plan was to see if it's enough to ad hoc patch Nix just on the remote side without touching the requesting side
We can try that but as long as it can eventually be upstreamed patching both sides is fine.
@Mic92 What do you think a fix for NixOS/nix#9272 would look like?
As the only builds that'll run on this machine are the cuda tests could we set the sandbox paths directly instead of using the hook?
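A minimal sketch of that idea, assuming the device nodes and driver paths the hook would have exposed are known statically (the paths below are illustrative, not the hook's actual output):

```nix
# Hypothetical static alternative to the nix-required-mounts pre-build hook:
# grant sandboxed builds access to the NVIDIA devices and userspace driver.
# Paths are illustrative; the real list depends on the host configuration.
nix.settings.extra-sandbox-paths = [
  "/dev/nvidia0"
  "/dev/nvidiactl"
  "/dev/nvidia-uvm"
  "/run/opengl-driver"
];
```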
What is the issue here? Is this needed for some cachix pre-build hook?
> What is the issue here? Is this needed for some cachix pre-build hook?
Did you read the issue linked in the previous comments?
OTOH do we actually care? These outputs get garbage collected anyway.
Everything built gets pushed to our cachix cache. To avoid that we'd need a separate hydra instance just for running the tests. That has its own problems: either we'd build the non-test derivations twice, or we'd run the test derivations on a much slower schedule to ensure they have already been built on the main hydra (and we'd still have the problem of derivations that failed on the main hydra being attempted a second time on the test hydra).
Adding features to non-test derivations instead of the test derivations is an interesting idea. However, a derivation's system features are part of the derivation struct and affect its hash.
Yes. As long as the feature is a nix default (which big-parallel is) and also correct (I think big-parallel does apply for most of them) I don't see an issue?
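To illustrate the hash concern (a sketch, not code from this PR): two otherwise identical derivations that differ only in `requiredSystemFeatures` produce different `.drv` paths, which is why any feature added on the nixpkgs side must match what was already built and cached.

```nix
# Sketch: requiredSystemFeatures is part of the derivation, so it affects its hash.
let
  mkDrv = features: derivation {
    name = "example";
    system = "x86_64-linux";
    builder = "/bin/sh";
    requiredSystemFeatures = features;
  };
in {
  plain = (mkDrv [ ]).drvPath;
  parallel = (mkDrv [ "big-parallel" ]).drvPath; # different .drv path than `plain`
}
```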
> Everything built gets pushed to our cachix cache.

What I meant to say is: cachix contents eventually get garbage-collected...
> a nix default (which big-parallel is) and also correct (I think big-parallel does apply for most of them) I don't see an issue?
Ah, I see. Clever. This should work?
https://github.com/helsinki-systems/hydra-queue-runner
Using the new queue runner seems like it would address this problem.
Working on this in #1912, not usable yet.
> Adding the paths that would have been set by the pre-build-hook to `nix.settings.sandbox-paths`
This is a little tricky, as the hook performs operations like symlink resolution that cannot be done at evaluation time.
Behold, my solution:
```nix
programs.nix-required-mounts.extraWrapperArgs = [
  "--run shift"
  "--add-flag '${builtins.unsafeDiscardOutputDependency (derivation { name = "needs-cuda"; builder = "_"; system = "_"; requiredSystemFeatures = [ "cuda" ]; }).drvPath}'"
];
```

```nix
  inputs.srvos.nixosModules.hardware-hetzner-online-intel
];

nix.settings.max-jobs = 14;
```
I'd recommend setting this to two -- both in the case this is used as an actual builder and because running a bunch of GPU tests simultaneously could cause an OOM event.
👍🏻 This is where we suffer most from the lack of a "real" scheduler for Nix, with support for negative affinity and resource constraints like in SLURM. We'll definitely run into issues running test suites like pytorch or pytorch-lightning, where matrices of tests run in parallel.
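A sketch of the suggested change (the value of two is the reviewer's recommendation above, not something benchmarked here):

```nix
# Limit concurrent builds: the machine may serve as a regular builder, and
# running several GPU test suites at once risks exhausting GPU memory.
nix.settings.max-jobs = 2;
```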
modules/nixos/hydra.nix (outdated):

```nix
printf "$machines" > $out
substituteInPlace $out --replace-fail 'ssh-ng://' 'ssh://'
substituteInPlace $out --replace-fail ' 80 ' ' 3 '
substituteInPlace $out --replace-fail ' 14 ' ' 2 '
```
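For context, the machines file packs `maxJobs` and `speedFactor` into whitespace-separated columns, which appears to be what the ` 14 ` and ` 80 ` substitutions above are matching on. A hypothetical line before and after (the hostname and feature list are made up for illustration):

```
# before: ssh-ng://, maxJobs 14, speedFactor 80
ssh-ng://builder.example.org x86_64-linux - 14 80 big-parallel,cuda - -
# after:  plain ssh://, maxJobs 2, speedFactor 3
ssh://builder.example.org x86_64-linux - 2 3 big-parallel,cuda - -
```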
This should change depending on https://github.com/nix-community/infra/pull/1807/files#r2075684863, correct?
How are these tests going to be set up in nixpkgs for CI to run? Will they be added to the current release-cuda jobset or a new jobset?

^ asking again

Yes, I imagine they'd be added there and include

I'm working on this but it's hard to say when it'll be ready; could be in a week or two, and I'd certainly hope to have it running within a month or two.
Hopefully that gives me a chance to make sure things actually build 🫠

#1912 is merged but I want to see that it can handle a full rebuild from a staging-next merge without problems. There is a staging-next PR open at the moment which I expect will be merged in a week or two; after that we should be ready to move forward with this.

Staging-next was merged earlier; there weren't any problems with the new queue runner handling the full rebuild, so we're ready to move forward with this. I'll wait for the tests to be included in a jobset that we can run before ordering the machine.
https://discourse.nixos.org/t/nix-flox-nvidia-opening-up-cuda-redistribution-on-nix/69189
I've been seeing occasional reports of cache misses because we're hitting the limits of our cache capacity; we'd need to significantly increase it (#1926) for the cuda builds and tests to be sustainable. I'm not going to waste community money on a cache we don't need if flox is already doing it. I don't think we should proceed with the gpu tests here, and we should also stop building the cuda packages. Honestly, I can't believe that no one bothered to mention that flox was working on their own cache and infra before the announcement. cc @zimbatm
DISCLAIMER: I didn't check the latest revision of this message with Connor or anyone else, and so cannot represent anyone but myself.

Hello @zowoq! This is a failure, on the part of the cuda team at least. I can see how this breaks trust or reads as some sort of disrespect. I hope we can one day sit in one room nonetheless. With that out of the way, I'd like to attach a brief transcript of interactions between the CUDA Team and Flox to date so you may better judge whether to cooperate with us any further.
Hope this gives you some answers. In particular, yes, we had known that the potential Flox-NVIDIA arrangement would involve a binary cache, and it was an oversight not to include you in the discussion back then.

While I'm quite happy that Flox is open to talking to and cooperating with the Nixpkgs developer community, Flox's binary cache is, at the end of the day, Flox's binary cache. Flox's private deals with NVIDIA are Flox's private deals with NVIDIA. I personally do not think the Flox-NVIDIA partnership is any reason at all to abandon the CUDA CI efforts under Nix-Community. The communication and trust issues, of course, would be.

If you do find yourself still willing to talk to us, both we at the CUDA Team and, I was just assured, Ron on the Flox side would appreciate having a meeting with you.

Cheers,
Rather than responding to all that immediately I'd like an answer to another question first. If you scroll up to #1807 (comment) you'll see that I asked the same question three times over the course of a month before I got a response. Why wasn't it answered promptly?
> If you scroll up to #1807 (comment) you'll see that I asked the same question three times over the course of a month before I got a response, why wasn't it answered promptly

This one is easy: <https://upload.wikimedia.org/wikipedia/en/4/43/Stoned_Fox.jpg> (alt text: spread thin, swamped, stunned, running in a compatibility mode at the lowest frequencies; you might be familiar with the sensation).

For my own part, between mid-June and... NixCon, I was not checking Matrix, or replying to nearly any email, and I only occasionally reacted to Signal DMs. Replying was not affordable.

Note I'm not trying to excuse myself for this one: I am fully within my rights to be wasted and burnt out.
I didn't get a response to a question I asked on the 1st until the 29th. It's fine that you were burnt out; your personal availability isn't the issue. The issue is that the cuda team needs to be reliable. At the moment I'm trying to decide if it's even worth having a discussion about moving forward with this.
How much cache do we need for nix-community? I might work on something that works with: https://www.hetzner.com/storage/object-storage/

Please move this discussion to #1926.
This pull request has been mentioned on NixOS Discourse. There might be relevant details there: https://discourse.nixos.org/t/nix-flox-nvidia-opening-up-cuda-redistribution-on-nix/69189/23
Will wait until the config is finalised before ordering the machine.
cc @ConnorBaker @SomeoneSerge
This wires the machine up so it can be used by https://hydra.nix-community.org/ (and https://buildbot.nix-community.org/), but only for builds that set the `cuda` requiredSystemFeature.