Skip to content

Don't download source tarball if not needed.#9819

Open
yshui wants to merge 3 commits intoNixOS:masterfrom
yshui:improve-cached-fetchTarball
Open

Don't download source tarball if not needed.#9819
yshui wants to merge 3 commits intoNixOS:masterfrom
yshui:improve-cached-fetchTarball

Conversation

@yshui
Copy link
Contributor

@yshui yshui commented Jan 20, 2024

Motivation

flake tarball inputs generate two store paths, it first downloads the source file into the store, then unpacks it into another store path.

assuming the file at the source URL hasn't changed, when I run nix flake update, if both paths exist, all is good. nix will fetch the source URL with an etag, and get a 304 response, and do nothing. but if the source file has been deleted from the store, then nix will redownload the source file, regardless if the unpacked store path is still valid.

this commit makes sure if the unpacked source exists in the store, the source file will only be downloaded if it has changed.

Context

  • I changed cache->lookupExpired(store, attrs) so it returns the cache entry along with if the store path is still valid, instead of just throwing it away when the path is invalid.
  • I added a parameter onlyIfChanged to downloadFile. when true, downloadFile will only download the source file if it's different from the cache, even when the source file doesn't exist in store.
  • Finally I changed downloadTarball to utilize this parameter to achieve what I described in the motivation section

Priorities and Process

Add 👍 to pull requests you find important.

The Nix maintainer team uses a GitHub project board to schedule and track reviews.

yshui added 3 commits January 20, 2024 15:53
Instead of just reporting the cache doesn't exist.
If 1) the store path for the unpacked source is valid, and 2) the file
at the source has not changed, downloadTarball shouldn't redownload the
source file.
@yshui yshui requested a review from edolstra as a code owner January 20, 2024 16:11
@github-actions github-actions bot added the fetching Networking with the outside (non-Nix) world, input locking label Jan 20, 2024
@yshui
Copy link
Contributor Author

yshui commented Jan 20, 2024

The other option is to do what prefetch is already doing - don't call downloadFile and don't store the source file in the store. This also saves some space if you don't aggressively run nix store gc.

I am fine with either option.

@yshui
Copy link
Contributor Author

yshui commented Jan 20, 2024

BTW other input types have similar problems as well. For example for github sources, the request to api.github.com is made without etag if the json file is not in store, even when the server would have returned 304. But the json file is pretty small so this is less of a problem.

@Ericson2314
Copy link
Member

Is this trying to fix #9814?

bool locked,
const Headers & headers = {});
const Headers & headers = {},
bool onlyIfChanged = false);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When do we want to download if not changed?

Copy link
Contributor Author

@yshui yshui Jan 22, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Errr, it's fully explained in my comments above

assuming the file at the source URL hasn't changed, when I run nix flake update, if both paths exist, all is good. nix will fetch the source URL with an etag, and get a 304 response, and do nothing. but if the source file has been deleted from the store, then nix will redownload the source file, regardless if the unpacked store path is still valid.

to summarize, if the unpacked tarball is available in store, we don't need to download the source tarball if not changed.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

err sorry I might have gotten the negation backwards. What you propose of downloading if and if needed sounds reasonable. I am asking when would one not want the semantics you proposed? e.g. When would one want to download again if we already have?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

personally i can't come up with a scenario where re-downloading would be preferable. do you have something in mind?

Copy link
Contributor Author

@yshui yshui Jan 22, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

errrr, did i misunderstand your question?

in code, onlyIfChanged would be false for a "normal" downloadFile call. so say the user calls builtins.fetchurl then the file will be downloaded if it's unchanged but not in store.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@yshui Sorry I am still a bit confused. It seems in both cases you are saying "if we have it in the store and it didn't change we don't need to redownload it". I am missing something --- maybe missing that nix flake update need not care what is in the store?

Copy link
Contributor Author

@yshui yshui Jan 23, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am missing something

Fine. let me just list all the possible cases here.

So, there's two blobs involved for fetchTarball.

tarball.tar.gz, and unpacked-tarball/. for our purposes, we assume both have a entry in fetcher cache. Now, let's consider different cases where either or both of them are in the store.

  1. both tarball.tar.gz and unpacked-folder/ are in store.
    fetchTarball fetches tarball.tar.gz with etag from cache, gets 304, stops. Which is good.
  2. only tarball.tar.gz is in store.
    fetchTarball fetches tarball.tar.gz with etag from cache, gets 304, unpacks the tarball already in store. This is good too.
  3. neither is in store.
    fetchTarball ignores cache entry for tarball.tar.gz, redownloads it, unpacks it. Good.
  4. only unpacked-tarball/ is in store.
    a. tarball.tar.gz is changed at the specified url
    fetchTarball ignores cache entry for tarball.tar.gz, redownloads it, unpacks it. Good.
    b. tarball.tar.gz is unchanged at the specified url
    fetchTarball ignores cache entry for tarball.tar.gz, redownloads it, then reuse the unpacked-tarball/ in store. NOT good.

This PR avoids redownloading tarball.tar.gz in case (4.b)

OTOH, fetchurl only puts one blob into the store. this is the case where downloadFile is called with onlyIfChanged = false. and the cases for fetchurl are:

  1. file is not in store
    doesn't matter if the file at the source url is changed or not, downloadFile needs to download it. this is why onlyIfChanged should be false.
  2. file is in store
    downloadFile downloads the file if it has changed at the source url. this is the current behavior and is unchanged.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks very much @yshui. The big thing I was forgetting what that there is both the unpack and packed versions.

@edolstra has a PR that changes how the unpacked stuff is downlodaed that we are merging soon. I suspect we should land that first.

My only question left is, why does builtins.fetchurl use the tarball fetcher at all? Shouldn't it just use the file fetcher?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why does builtins.fetchurl use the tarball fetcher at all?

maybe the way i phrased caused some confusion, but builtins.fetchurl doesn't call downloadTarball, it calls downloadFile.

Whereas builtins.fetchTarball will call downloadTarball, and that in turn calls downloadFile

@yshui
Copy link
Contributor Author

yshui commented Jan 22, 2024

Is this trying to fix #9814?

No, they are distantly related, but this PR won't fix that issue.

@nixos-discourse
Copy link

This pull request has been mentioned on NixOS Discourse. There might be relevant details there:

https://discourse.nixos.org/t/2024-01-22-nix-team-meeting-minutes-117/38838/1

@edolstra edolstra self-assigned this Jan 29, 2024
@yshui
Copy link
Contributor Author

yshui commented Feb 11, 2024

Hi, so what's the decision on this PR? what do I need to do to get this merged?

@Ericson2314
Copy link
Member

@yshui the other PR I was talking about #10038, is now merged. Now would be a good time to check what, if anything, is still needed.

@yshui
Copy link
Contributor Author

yshui commented Feb 22, 2024

@Ericson2314 what does #10038 do?

@Ericson2314
Copy link
Member

@yshui It caches downloaded tarballs (not individual files) in a separate git repo (for file-granularity dedup) instead of the store. I suppose it is still possible for data to be in the cache DB but not the git repo, so the principle you are trying to ensure works still applies, but the implementation details are quite different.

@yshui
Copy link
Contributor Author

yshui commented Feb 22, 2024

@Ericson2314 hmm, interesting. so if i do a nix store gc, would it also delete the tarball from the git repository? and it would be redownloaded next time i ask for unpacked tarball even if it's not changed?

@Ericson2314
Copy link
Member

@yshui nix store gc would not affect the tarball cache. I don't know how it is garbage collected.

@yshui
Copy link
Contributor Author

yshui commented Feb 27, 2024

@Ericson2314 I looked into it, and I think implementing this became a bit easier after the new git cache, but it looks like it also introduced some bugs (#10080), I'll update this PR after things are settled a bit more.

@yshui
Copy link
Contributor Author

yshui commented Feb 27, 2024

Hold on, why do we need to cache the tarball at all? What we want is the unpacked tarball, why are we keeping the tarball file in cache?

(or did I misunderstand? is the unpacked tarball being cached?)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

fetching Networking with the outside (non-Nix) world, input locking

Projects

Status: 🏁 Review

Development

Successfully merging this pull request may close these issues.

4 participants