nix copy hangs forever sometimes #3017
Comments
I marked this as stale due to inactivity. → More info |
Still relevant |
I have run into this issue multiple times, seemingly at random, in my own CI. Here is an example where it hung for over 25 minutes until I manually cancelled it. Strangely, when I hit the cancel button the job exited successfully, which was a surprise. I did another run right after to try and get more information by slapping a The line I link to here is usually where it hangs though, which seems to be after an ssh connection is already made. My suspicion, since the hung job mentioned above actually exited successfully and even shows successful uploads in the logs (that portion of the log only showed up after I cancelled it), is that there is some sort of deadlock in the parallel upload code. Just a guess though. |
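Until the root cause is found, one way to keep a CI job from blocking indefinitely is to put a hard time limit on the copy step. The following is only a sketch: `copy_with_timeout` is a hypothetical helper name, the 600-second limit and the store URL in the usage comment are placeholders, and it assumes a `timeout` command (GNU coreutils or a recent BusyBox) is available on the runner.

```shell
#!/bin/sh
# Run a command with a hard time limit so that a hung copy fails the job
# instead of blocking it forever.
# Usage: copy_with_timeout SECONDS COMMAND [ARGS...]
copy_with_timeout() {
    limit=$1
    shift
    # Send TERM after $limit seconds, then KILL 10 seconds later if the
    # process is still alive.
    timeout -k 10 "$limit" "$@"
    status=$?
    # timeout exits with 124 when the limit was reached.
    if [ "$status" -eq 124 ]; then
        echo "command timed out after ${limit}s: $*" >&2
    fi
    return "$status"
}

# Hypothetical CI usage (host and path are placeholders):
# copy_with_timeout 600 nix copy --to ssh://user@example-host ./result
```

This does not fix the hang; it only turns it into a visible failure that the job can retry.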
I just noticed that this seems to happen frequently when a large number of derivations are going to be copied (either because you passed them, or because the ones you passed have a lot of dependencies to copy). I also noticed that if you just wait long enough, the copying actually does start. So now there are two possible explanations at this point: Does Only other explanation I can think of is that Update |
NIX_SSHOPTS="-oControlMaster=no" fixes this issue as far as I can tell |
But even then, ControlPath should probably be set to a path that doesn't exist, since an existing control socket will still be used regardless |
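Putting the two suggestions above together might look like the sketch below. It assumes the hang really is related to SSH connection multiplexing, which has not been confirmed in this thread. It uses the special value `ControlPath=none`, which tells OpenSSH not to use a control socket at all, instead of pointing at a nonexistent path. The host and store path in the comment are placeholders.

```shell
#!/bin/sh
# Disable SSH connection sharing for the ssh processes that Nix spawns.
# ControlMaster=no stops ssh from creating a new master connection;
# ControlPath=none stops it from reusing an already existing control socket.
NIX_SSHOPTS="-oControlMaster=no -oControlPath=none"
export NIX_SSHOPTS

# Hypothetical usage (replace host and path with your own):
# nix copy --to ssh://user@example-host /nix/store/<hash>-<name>
```

Because the variable is exported, it applies to every later Nix command in the same shell session that goes over ssh.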
@kittywitch very interesting, can you say how you discovered this workaround? Is nix intentionally using ControlMaster for latency reasons, or accidentally picking it up from the ambient ssh config? Also, wouldn't this only apply to Just as an update, we still see this, but rarely. |
For me, closing my existing ssh session to the target system worked, but I'll try this next time. Almost certainly the same root cause. In my case, I was running I tried running it manually with
My stack trace looks a bit different though (possibly because this is
|
FWIW switching to |
@simonzkl Where did you set |
Where it accepts the Nix Store URL, e.g. in |
I am trying to copy the closure of one of my systems to another, and it freezes every time. My command:
|
As I also have this problem quite often, I've noticed that for me it was always #5304: file locks that weren't cleaned up by previous nix commands (I suspect because they were cancelled). To work around this(?) issue, I wrote a simple script to find the file locks (not sure if it's absolutely safe to do it like that):

    #!/bin/sh
    # Print the lock file of every store path that has one.
    for path in /nix/store/*; do
        test -f "${path}.lock" && echo "${path}.lock"
    done

which lists all the stale file locks, and then I remove those locks via

    sudo mount -o remount,rw,bind /nix/store
    sudo rm #<list of these locks>

Not optimal, obviously... Edit: As I'm just facing it again, it seems there are also locks for files/folders that don't exist (so unfortunately they aren't caught by that script). Deleting them still works around the issue, but it's a little bit cumbersome... Edit2: Maybe to improve that workaround script, check the file size to be == |
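The script above only finds locks whose store path still exists. A variant that also catches locks for missing paths could search for the `.lock` files directly. This is a sketch only: `list_store_locks` is a hypothetical name, it relies on the `-maxdepth` option that GNU and BSD `find` support, and it does not check whether a lock is currently held by a running Nix process, so its output should be reviewed before anything is deleted.

```shell
#!/bin/sh
# List every *.lock file directly inside a store directory, whether or not
# the store path it belongs to still exists.
# Usage: list_store_locks [DIR]   (DIR defaults to /nix/store)
list_store_locks() {
    dir=${1:-/nix/store}
    # -maxdepth 1 keeps find from descending into the store paths themselves.
    find "$dir" -maxdepth 1 -type f -name '*.lock' | sort
}

# Example: list_store_locks
```

Taking the directory as an optional argument makes it possible to try the function on a scratch directory first.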
Here is a process that has been hung for 6 days:
elaforge 2909 0.0 0.1 514780 17684 ? Sl Jul25 0:00 /nix/store/zg66y04g2bvmw41cgrywysr86s40g5cc-nix-2.2/bin/nix copy --from https://cache.nixos.org /nix/store/jxw2sxagx9smpjklb00qzgiqgqv1zvl6-Groq.Util.Exceptions --option allowed-impure-host-deps /groq /etc/hostname --option extra-sandbox-paths /groq/models --option substituters https://cache.nixos.org http://narpile.groq --option sandbox relaxed --option require-sigs false --option fallback true --option keep-outputs true --option max-jobs 0 --option builders-use-substitutes true --option cores 0
Note that the store path doesn't exist on cache.nixos.org, and when I run the command by hand, I get a "path not valid" error right away. But this one hung for some reason.
eu-stack -p 2909 -s
says:

It's not just for invalid paths; here's another one:
elaforge 29718 0.0 0.1 514780 17892 ? Sl Jul23 0:00 /nix/store/zg66y04g2bvmw41cgrywysr86s40g5cc-nix-2.2/bin/nix copy --from http://narpile.groq /nix/store/5naa5ppxqz233bhwms2hkh8ifvn3f1z2-silently-1.2.5-doc --option allowed-impure-host-deps /groq /etc/hostname --option extra-sandbox-paths /groq/models --option substituters http://narpile.groq --option sandbox relaxed --option require-sigs false --option fallback true --option keep-outputs true --option max-jobs 0 --option builders-use-substitutes true --option cores 0
This path exists, and when I run the command by hand it completes immediately (I have it locally). The stack looks similar:
This is nixStable from nixpkgs 19.03 at 5c52b25283a6cccca443ffb7a358de6fe14b4a81. The OS is the GCP image:
ubuntu-1604-xenial-v20180306
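For anyone who wants to collect the same kind of evidence, the manual `ps` plus `eu-stack` steps above could be wrapped in a small loop. This is a sketch under assumptions: `dump_stacks` and the `TRACER` override are hypothetical names introduced here, `pgrep` (procps) and `eu-stack` (elfutils) are assumed to be installed, and tracing another user's process generally needs matching privileges.

```shell
#!/bin/sh
# Dump a stack trace for every process whose command line matches a pattern.
# TRACER can be overridden so the loop can be tried without elfutils.
# Usage: dump_stacks PATTERN
dump_stacks() {
    pattern=$1
    tracer=${TRACER:-eu-stack}
    # pgrep -f matches against the full command line, not just the name.
    for pid in $(pgrep -f -- "$pattern"); do
        echo "== pid $pid =="
        "$tracer" -p "$pid" -s || echo "could not trace pid $pid" >&2
    done
}

# Example: dump_stacks 'nix copy'
```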