Allow missing binary caches by default #7188

Closed
arcuru wants to merge 1 commit into NixOS:master from arcuru:fallback-default

@arcuru
Contributor

arcuru commented Oct 18, 2022

This change allows individual substituters to be unavailable even when fallback = false (the default). Prior to this change, any unavailable substituter would cause a build failure even if others were reachable. This causes issues when people try to set up occasionally accessible binary caches.

The fallback setting states that it allows Nix to fall back to building from source if a binary substitute fails, and this change makes the behavior match the description. Prior to this change it actually did two things:

  1. Allows for missing/erroring binary substituters
  2. Falls back to building from source if no binary substitute is available

Allowing unreachable binary substituters should probably be the expected behavior with or without the fallback flag, and is what this change allows.

I've tested simple scenarios locally, but it's not clear to me how to write a test for this. I think it needs a network failure to trigger the new paths, and I'm not sure how to set that up in the tests.

Closes #6901, Closes #7127, Closes #4383

@arcuru
Contributor Author

arcuru commented Dec 7, 2022

Friendly ping. I am constantly seeing this come up, most recently in #7424 (comment).

@Ericson2314
Member

Oh yes, things like this should not languish.

@Ericson2314
Member

@edolstra based on our very brief convo this morning, I think there is a misunderstanding. The goal is not to fall back on building from source, but just to try all the substituters before giving up.

@arcuru
Contributor Author

arcuru commented Dec 9, 2022

Thanks @Ericson2314 :)

Just for completeness:

From #7424:

#7188 still tries them sequentially, so a 60 second timeout can be annoying. We would need to try them concurrently for it to not be annoying.

If we're ok with the changes in this PR, then rewriting this codepath to check substituters in parallel looks fairly straightforward. I'd be willing to rewrite this PR to include fetching the substituters in parallel if that's desired.

Also happy to wait and make that change separately (or at least log it for later), which is probably the better option.

@yajo
Contributor

yajo commented Dec 14, 2022

Allowing unreachable binary substituters should probably be the expected behavior with or without the fallback flag

FWIW the same is true IMHO for remote builders.

@Ericson2314
Member

Ericson2314 commented Dec 14, 2022

@patricksjackson thank you very much for that offer, but I would say let's leave that for a follow-up PR? This is good as a minimal thing to discuss what behavior we want (which I am not yet sure is uncontroversial, in case I am misunderstanding what @edolstra was saying), after which doing things in parallel is mostly an optimization.

@Ericson2314
Member

You're free to work on that PR concurrently if you want, though :)

@domenkozar
Member

Cachix was down 10 days ago for 30 minutes; half of that time was due to this not being fixed in Nix yet.

Ironically, users of Cachix hit the same issue that caused Cachix to be down.

Hope that helps motivate merging/fixing this finally.

@domenkozar
Member

@roberth @Ericson2314 could you cover this one for the next Nix team meeting?

@roberth
Member

roberth commented Dec 26, 2022

How can there be no test change and no test failure?

@domenkozar
Member

It's likely that @patricksjackson is a first-time contributor and needs a bit of guidance.

@nixos-discourse

This pull request has been mentioned on NixOS Discourse. There might be relevant details there:

https://discourse.nixos.org/t/2023-01-02-nix-team-meeting-minutes-20/24403/1

@Ericson2314
Member

Ericson2314 commented Jan 9, 2023

My understanding of the semantics:

Ignoring querySubstitutablePathInfos for now

Background

Possible results of querying a path from a substituter:

  • Success
  • Throws InvalidPath: the cache is accessible, and for sure does not have the store object
  • Throws SubstituterDisabled: not sure?
  • Throws Error: the substituter is inaccessible; it may or may not have the store object

SubstituterDisabled being subtle/unclear to me isn't so bad, because SubstituterDisabled and other Errors are always handled the same way, so we can conflate them.

Today

The semantics on Nix master are very order-dependent and hard to specify except as the exact algorithm:

for each substituter:
  if it has the path:
    return path
  else if InvalidPath:
    continue search
  else if SubstituterDisabled or other Error:
    if not fallback:
      give up
    else:
      keep searching
    end
  end
end

build from source

Note that this means that if one substituter is down (Error or SubstituterDisabled) but the next one has the path, it will fail unless fallback is set. This is not very user friendly!
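
To make the order dependence concrete, here is a minimal Python sketch of the loop above. It is only an illustration, not Nix's C++ code: the exception classes and the three fake substituters (`has_path`, `lacks_path`, `is_down`) are hypothetical stand-ins for the outcomes listed under Background.

```python
class InvalidPath(Exception):
    """Stand-in: the substituter is reachable and definitely lacks the path."""


class SubstituterError(Exception):
    """Stand-in for SubstituterDisabled or any other Error (conflated above)."""


# Hypothetical fake substituters, one per possible outcome.
def has_path(path):
    return f"info for {path}"


def lacks_path(path):
    raise InvalidPath(path)


def is_down(path):
    raise SubstituterError(path)


def query_in_order(substituters, path, fallback=False):
    """Mirror the loop above: return path info, or None to build from source."""
    for query in substituters:
        try:
            return query(path)
        except InvalidPath:
            continue  # reachable but missing the path: keep searching
        except SubstituterError:
            if not fallback:
                raise  # give up, even if a later substituter has the path
            continue
    return None  # nothing found: build from source
```

With these stand-ins, `query_in_order([has_path, is_down], "p")` returns a result, while `query_in_order([is_down, has_path], "p")` raises, even though the same two substituters are configured in both cases.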

New semantics

These are presented in a more "declarative" and order-independent way.

New definitions for querying on sets of substituters (note that sets are order-independent!)

  1. If any substituter returns success, the set of substituters has the path.
  2. If all substituters return InvalidPath, the set of substituters doesn't have the path.

Note the "any" vs "all" duality.

  3. Otherwise, we have a mix of Error/SubstituterDisabled and InvalidPath, in which case it is unknown whether the set of substituters has the path: the inaccessible substituters may or may not have it, and we cannot tell because we cannot reach them.

New policy based on the above defs:

case result on query on set of substituters:
  case set has path:
    substitute path
      (left unspecified which one to get from, which matters if the store objects they provide are not the same)
  case set does not have path:
    build path
  case set maybe has path:
    if fallback:
      build path (mention building because --fallback)
    else:
      error, describing the situation and possible workaround (using --fallback)
    end
  end
end
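
The definitions and policy above can be sketched as two small Python functions. This is a sketch under assumptions rather than a proposed implementation: the `Result` and `SetState` enums, the string return values, and the error text are all hypothetical names chosen for illustration.

```python
from enum import Enum


class Result(Enum):
    """Outcome of querying one substituter (names are illustrative)."""
    SUCCESS = "success"
    INVALID_PATH = "invalid-path"
    ERROR = "error"  # also covers SubstituterDisabled


class SetState(Enum):
    HAS_PATH = "has path"
    NO_PATH = "does not have path"
    UNKNOWN = "maybe has path"


def classify(results):
    """Classify a collection of per-substituter results, ignoring order."""
    results = list(results)
    if any(r is Result.SUCCESS for r in results):
        return SetState.HAS_PATH
    if all(r is Result.INVALID_PATH for r in results):
        return SetState.NO_PATH  # an empty set also lands here
    return SetState.UNKNOWN


def decide(results, fallback):
    """Apply the policy: substitute, build, or fail with a hint."""
    state = classify(results)
    if state is SetState.HAS_PATH:
        return "substitute"
    if state is SetState.NO_PATH or fallback:
        return "build"
    raise RuntimeError(
        "some substituters were unreachable; retry later or use --fallback"
    )
```

Because `classify` only uses `any` and `all`, reordering the results never changes the outcome, which is the order independence the definitions aim for.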

@fricklerhandwerk
Contributor

Discussed in Nix team meeting on 2023-01-09:

@patricksjackson thanks for your contribution! We agree that we want to fix this, but haven't decided yet what the overall process of picking substituters should look like. Therefore no decision to merge, since we don't know if that would incur changing behavior down the line, or other redundant work.
@Ericson2314 volunteered to write up a proposal and, when done, put it back on the agenda for the team to decide. We will then come back to this PR.

Complete discussion
  • @thufschmitt: the problem with failing only after trying all substituters is that most non-trivial Nix builds will have a last wrapper that is not cached anywhere (such as nix-shell)
    • @Ericson2314: yes, this solution only gets us half way to fulfilling the user story
  • @roberth: how about adding a parameter for each substituter to tell if it's required?
    • @fricklerhandwerk: that requires more syntax for the configuration file
    • @roberth: we could use ? parameters. it's not great but a simple way to do it
    • @fricklerhandwerk: more configuration is probably okay, but we should decide more deliberately on how to add more parameters. would be better if we could solve the issue in a way that just works and is unsurprising
    • @thufschmitt: there are two kinds of caches: the "authoritative" cache.nixos.org which one uses for 95% of the cases without which nothing works, and the "convenience" caches such as Cachix.
      • this speaks in favor of having a parameter
  • @thufschmitt: another question is about warning users on failures. verbose output may drown the messages.
  • @thufschmitt: there are two types of failures: not reachable or actual error. they should be handled differently
  • @Ericson2314: the worst thing about the current state is that order of substituters matters, because it fails on the first one. order should not matter
    • @thufschmitt: ideally we should parallelise substituter access, even if the code currently doesn't allow for it
    • @roberth: speculative fetching of .narinfo files and removing order from the configuration are different things. the latter is more complicated actually, because it adds more corner cases.
  • @fricklerhandwerk: how to move on with this PR? it seems we don't really know what the desired behavior even is
    • @roberth: let's write a specification for what the behavior should be overall
    • @Ericson2314: yes, should go backwards from desired behavior and guide by proposing test cases
    • @thufschmitt: agree, the team should make a proposal for the semantics
  • agreement: @Ericson2314 will write up a proposal and, when done, put it back on the agenda for the team to decide on. then continue discussing the implementation with @patricksjackson
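
As an illustration of the per-substituter parameter idea raised in the notes above, the following Python sketch reads a flag from the `?` query part of a substituter URL. The parameter name `required` and its default are purely hypothetical: nothing in this thread says such a setting exists or was adopted.

```python
from urllib.parse import parse_qs, urlsplit


def is_required(substituter_url, default=False):
    """Read a hypothetical `required` flag from a substituter URL.

    `required` is an invented parameter name for illustration only.
    """
    query = urlsplit(substituter_url).query
    values = parse_qs(query).get("required")
    if not values:
        return default  # flag absent: fall back to the assumed default
    return values[-1].lower() in ("1", "true", "yes")
```

Under this sketch, an "authoritative" cache could be written as `https://cache.nixos.org?required=true`, while a "convenience" cache would omit the flag and be allowed to fail.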

@nixos-discourse

This pull request has been mentioned on NixOS Discourse. There might be relevant details there:

https://discourse.nixos.org/t/2023-01-09-nix-team-meeting-minutes-22/24577/1

@Ericson2314
Member

The thing I wrote above is the proposal. We should be able to resume progress on this.

@Ericson2314
Member

Ericson2314 commented Feb 19, 2023

@fricklerhandwerk Can you move this back to "in discussion" since I wrote the above proposal? I no longer have the perms to do so.

@fricklerhandwerk added the UX label ("The way in which users interact with Nix. Higher level than UI.") Mar 3, 2023
@fricklerhandwerk
Contributor

Discussed in the Nix team meeting:

  • @roberth: would be slightly more clear if we wrote "unknown" instead of "maybe has path"
  • @tomberek: does the unknown state provide feedback to the user about what's happening?
    • @Ericson2314: yes, we could provide a structured error message as we did with the profiles lately
    • @fricklerhandwerk: and that's up to the lower-level implementation to determine, so somewhat below the level of abstraction we care about here
    • @tomberek: some information should be visible somewhere why something is happening, such as why we're building or picking a certain substituter
  • agreement
    • @Ericson2314 will help out getting the PR to implement the proposal

@nixos-discourse

This pull request has been mentioned on NixOS Discourse. There might be relevant details there:

https://discourse.nixos.org/t/2023-03-06-nix-team-meeting-minutes-38/26056/1

@arcuru
Contributor Author

arcuru commented Mar 6, 2023

It should be obvious, but just to be clear about my feelings as the original PR author, and since I wasn't there to discuss things with you:

I don't feel any ownership over this change, and I expect it's easier for any one of you to handle this PR yourself anyways. Feel free to steal/close/whatever you want with this PR, since it looks like @Ericson2314 is planning to write the agreed solution.

Thanks for thinking about a holistic solution!

@Ericson2314
Member

Ericson2314 commented Mar 6, 2023

@patricksjackson So I was requested to make this proposal by the rest of the Nix Team to help us clarify our own thoughts. It wasn't supposed to be in contrast to the PR -- quite on the contrary it was your PR description that helped me write that post!

I think this PR is close to doing that (by "today" I meant not your PR but Nix master, edited to make that explicit), and any divergence is not intentional on your part. I would much rather help you finish this off, if you are still interested, than implement it myself.

Sorry again this got stuck in the team backlog for so long before we returned to it.

@arcuru
Contributor Author

arcuru commented Mar 8, 2023

I'll take a look soon.

@arcuru arcuru marked this pull request as draft March 8, 2023 21:45
@arcuru arcuru force-pushed the fallback-default branch from bb998f7 to 38148d1 Compare March 8, 2023 21:46
@Ericson2314
Member

Glad to hear it!

@arcuru
Contributor Author

arcuru commented Mar 16, 2023

Alright, I've now had a chance to come back and reload the context here. I have a couple of questions before I can make progress again: @Ericson2314

First, it's unclear to me if your design proposal is describing how you want this code to work at a high level or if you are describing what you want the actual code flow to look like.

  • If it's just a high level description, then you're basically describing what the code already does after my changes? I can take a second look and clean up the error messages and the code a bit, which I think is what you're suggesting, but that's pretty much the intended existing logic after the changes already in this PR.

  • If you do want the code flow to work exactly as described (querying all the substituters first then applying the logic):

    • It might make it clearer, however there are several questions I would need clarified from the Nix team concerning implementation details, so let me know if that's the approach you want to take. (for example: is it ok to leak store paths to lower-priority substituters, since that doesn't happen today? are there concerns about the extra network traffic? should we add a flag to choose substituters by latency instead of/in addition to priority? etc)
    • I would likely still need to apply this logic one substituter response at a time for performance reasons, so the benefits to readability would mainly come from refactoring.
    • This is a change I'd planned to propose as a followup anyways, before the current PR stalled; factoring out this set of code and then async querying the substituters in an intelligent way. I'd be happy to ditch the current PR and just move to that one if you preferred, but it will be more work.

Lastly, as I mentioned in the original comment, I'm unable to find how to test this in the current codebase. There appear to be no unit tests covering these modules, so do you want me to write the unit test infrastructure to test this? There are a few integration tests, but even in those I'm not sure how to trigger a network error, especially one that exercises all the paths this change would use. I suspect I'm missing something here, but I'm still not very familiar with the codebase.

@arcuru
Contributor Author

arcuru commented Mar 18, 2023

This is a change I'd planned to propose as a followup anyways, before the current PR stalled; factoring out this set of code and then async querying the substituters in an intelligent way. I'd be happy to ditch the current PR and just move to that one if you preferred, but it will be more work.

I said screw it and wrote an MVP for querying them all async, since I had this problem on my mind. It's in a separate branch on my fork: https://github.com/patricksjackson/nix/tree/substitution-set

I won't be offended if it's not needed/wanted/feasible, I felt like writing it anyways.
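
For readers following along, the general idea of querying all substituters concurrently can be sketched in Python with a thread pool. This is not the code in the linked branch: the function names, the outcome tuples, and the fake substituters are hypothetical, and a real implementation would also have to address the concerns raised earlier (leaking store paths to lower-priority substituters, extra network traffic).

```python
from concurrent.futures import ThreadPoolExecutor


class InvalidPath(Exception):
    """Stand-in: the substituter is reachable and definitely lacks the path."""


class SubstituterError(Exception):
    """Stand-in for an unreachable or disabled substituter."""


# Hypothetical fake substituters for demonstration.
def has_path(path):
    return f"info for {path}"


def lacks_path(path):
    raise InvalidPath(path)


def is_down(path):
    raise SubstituterError(path)


def query_all(substituters, path, max_workers=8):
    """Query every substituter concurrently; outcomes keep priority order."""
    def attempt(query):
        try:
            return ("success", query(path))
        except InvalidPath:
            return ("invalid-path", None)
        except Exception as error:  # unreachable, disabled, or other failure
            return ("error", error)

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map yields results in input order, not completion order.
        return list(pool.map(attempt, substituters))


def pick(outcomes):
    """Return info from the highest-priority substituter that succeeded."""
    for kind, value in outcomes:
        if kind == "success":
            return value
    return None
```

With this shape, one slow or unreachable substituter costs roughly one timeout overall instead of one timeout per earlier entry in the list, while the configured priority still decides which successful answer is used.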

@Ericson2314
Member

@patricksjackson Sorry I missed this!! :( Feel free -- encouraged -- to ping me on matrix if I am not responding to a PR to which I am assigned.

First, it's unclear to me if your #7188 (comment) is describing how you want this code to work at a high level or if you are describing what you want the actual code flow to look like.

Just at a high level :). It is supposed to be a specification which admits many compliant implementations.

If it's just a high level description, then you're basically describing what the code already does after my changes?

I believe I have described your intent, but the code doesn't quite do what it says. I think the divergence is just an accident.

I said screw it and wrote an MVP for querying them all async

Oh that's very cool! I think we can wrap this up, but then try to do that after. How's that sound?

return;
}
throw;
tryNext();
Member

@Ericson2314 Ericson2314 Apr 3, 2023

So I think what we need to do is stash the error (in a new substitution goal variable) here. And then at the end of the function, we re-throw the error instead of returning if nothing is found.

Does it make sense how that brings us in alignment with the spec I wrote?
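
A minimal Python sketch of the stash-and-rethrow idea, assuming a simple sequential loop. The real change would live in the C++ substitution goal; the names here (`substitute`, `stashed`, and the exception classes) are hypothetical.

```python
class InvalidPath(Exception):
    """Stand-in: the substituter is reachable and definitely lacks the path."""


class SubstituterError(Exception):
    """Stand-in for an unreachable or disabled substituter."""


# Hypothetical fake substituters for demonstration.
def has_path(path):
    return f"info for {path}"


def lacks_path(path):
    raise InvalidPath(path)


def is_down(path):
    raise SubstituterError(path)


def substitute(substituters, path, fallback=False):
    """Try every substituter; only surface a stashed error at the very end."""
    stashed = None  # first hard error seen, kept instead of thrown immediately
    for query in substituters:
        try:
            return query(path)  # any success wins, regardless of earlier errors
        except InvalidPath:
            continue
        except SubstituterError as error:
            if stashed is None:
                stashed = error
            continue
    if stashed is not None and not fallback:
        raise stashed  # result is unknown: some substituter was unreachable
    return None  # build from source
```

The error is raised only when no substituter succeeded and at least one was unreachable, which corresponds to the "maybe has path" case of the spec.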

return;
}
throw;
logError(e.info());
Member

ditto, not sure if we should log here, or log at the end if we don't end up re-throwing.

logError(e.info());
else
throw;
logError(e.info());
Member

ditto

@nixos-discourse

This pull request has been mentioned on NixOS Discourse. There might be relevant details there:

https://discourse.nixos.org/t/a-common-public-nix-cache/26998/8

@nixos-discourse

This pull request has been mentioned on NixOS Discourse. There might be relevant details there:

https://discourse.nixos.org/t/the-nixos-foundations-call-to-action-s3-costs-require-community-support/28672/88

@SuperSandro2000
Member

bump

@nixos-discourse

This issue has been mentioned on NixOS Discourse. There might be relevant details there:

https://discourse.nixos.org/t/2023-07-24-nix-team-meeting-minutes-75/31112/1

@bouk bouk mentioned this pull request Sep 15, 2023
@bouk
Member

bouk commented Sep 15, 2023

I'm picking up this change at #8983

@nixos-discourse

This pull request has been mentioned on NixOS Discourse. There might be relevant details there:

https://discourse.nixos.org/t/announcing-ncps-a-nix-cache-proxy-server-for-faster-builds/58166/10

Ericson2314 pushed a commit that referenced this pull request Sep 12, 2025
… are still enabled (#13301)

## Motivation

Nix currently hard fails if a substituter is inaccessible, even when there are other substituters available, unless `fallback = true`.
This breaks nix build, run, shell et al entirely. 
This would modify the default behaviour so that nix would actually use the other available substituters and not hard error.

Here is an example, before and after the patch, when using dotenv where I have manually stopped my own cache to trigger this issue. The initial error is really frustrating because there are other caches available.
![image](https://github.com/user-attachments/assets/b4aec474-52d1-497d-b4e8-6f5737d6acc7)
![image](https://github.com/user-attachments/assets/ee91fcd4-4a1a-4c33-bf88-3aee67ad3cc9)

## Context

#3514 (comment) is the earliest issue I could find, but there are many duplicates.

There is an initial PR at #7188, but this appears to have been abandoned - over 2 years with no activity, then a no-comment review in January. There was a subsequent PR at #8983, but this was closed without merge - over a year without activity.
I have visualised the current and proposed flows. I believe my logic flows line up with what is suggested in #7188 (comment) but correct me if I am wrong.
Current behaviour:
![current](https://github.com/user-attachments/assets/d9501b34-274c-4eb3-88c3-9021a482e364)
Proposed behaviour:
![proposed](https://github.com/user-attachments/assets/8236e4f4-21ef-45d7-87e1-6c8d416e8c1c)

[Charts in lucid](https://lucid.app/lucidchart/1b51b08d-6c4f-40e0-bf54-480df322cccf/view)

Possible issues to think about:
- I could not figure out where the curl error is created, so I can't figure out how to swallow it and turn it into a warning or, better yet, a debug log.
- Unfortunately, in contrast with the previous point, I'm not sure how verbose we want the warnings/traces to be - personally I think that a warning that a substituter has been disabled (when it happens), and that the next one is being used, is sufficient, but this is personal preference.
philipwilk added a commit to philipwilk/nix that referenced this pull request Sep 13, 2025
… are still enabled (NixOS#13301)

@Ericson2314
Member

Closed because replaced by #13301

@github-project-automation github-project-automation bot moved this from 🏁 Review to Done in Nix team Sep 26, 2025
Labels

UX The way in which users interact with Nix. Higher level than UI.

Projects

Archived in project

9 participants