git-lfs support #10153

Merged
roberth merged 62 commits into NixOS:master from lucia3e8:lfs
Feb 13, 2025

@lucia3e8
Contributor

lucia3e8 commented Mar 4, 2024

Motivation

nix fetches git repos using libgit2, which does not run filters by default. This means LFS-enabled repos can be fetched, but LFS pointer files are not smudged.

This change adds an lfs attribute to fetcher URLs. With lfs=1, when fetching LFS-enabled repos, nix will smudge all the files.

Context

See #10079.
Git Large File Storage lets you track large files directly in git, using git filters. A clean filter runs on your LFS-enrolled files before push, replacing large files with small "pointer files". Upon checkout, a "smudge" filter replaces pointer files with full file contents. When this works correctly, it is not visible to users, which is nice.
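
For readers who have not looked inside a pointer file: it is a short text file with a version line, an `oid sha256:<hex>` line and a `size <bytes>` line. The Python sketch below shows one way to recognise and parse such a file. The function name `parse_lfs_pointer` and the example oid are made up for illustration; this is not the code from this PR.

```python
import re

# Version line used by Git LFS pointer files.
POINTER_VERSION = "https://git-lfs.github.com/spec/v1"

# Made-up example pointer (the oid is not a real object).
EXAMPLE_POINTER = (
    b"version https://git-lfs.github.com/spec/v1\n"
    + b"oid sha256:" + b"ab" * 32 + b"\n"
    + b"size 12345\n"
)


def parse_lfs_pointer(data: bytes):
    """Return (oid_hex, size) if data looks like an LFS pointer, else None.

    Hypothetical helper for illustration only.
    """
    try:
        text = data.decode("utf-8")
    except UnicodeDecodeError:
        return None  # binary content is certainly not a pointer
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(" ")
        if not sep:
            return None  # every pointer line is "key value"
        fields[key] = value
    if fields.get("version") != POINTER_VERSION:
        return None
    oid = re.fullmatch(r"sha256:([0-9a-f]{64})", fields.get("oid", ""))
    size = re.fullmatch(r"[0-9]+", fields.get("size", ""))
    if oid is None or size is None:
        return None
    return oid.group(1), int(size.group(0))
```

Anything that fails to parse is treated as an ordinary file, which is what lets a fetcher leave non-LFS content untouched.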

Changes

  • builtins.fetchGit has new bool lfs attr
  • when lfs=true, GitSourceAccessor will smudge any pointer files with the lfs filter attribute
  • as verified by new test in tests/nixos/fetchgit (this is why lfs is now enabled on the test gitea instance)

github-actions bot added the "fetching" label (Networking with the outside (non-Nix) world, input locking) Mar 4, 2024
@lucia3e8
Contributor Author

lucia3e8 commented Mar 5, 2024

Small complication:
it seems that nix flake lock calls fetch which in turn calls Input::fetch -> InputScheme::fetch -> fetchToStore.

In other words, I don't currently see a way to both:

  • materialize LFS files when fetching -source store paths, and
  • avoid materializing LFS files during nix flake lock.

@L-as
Member

L-as commented Mar 9, 2024

What use case do you have in mind? Isn't LFS typically for large files, that wouldn't usually affect evaluation anyway?

@lucia3e8
Contributor Author

> What use case do you have in mind? Isn't LFS typically for large files, that wouldn't usually affect evaluation anyway?

builtins.fetchGit populates the nix store with a <hash>-source path. This path is used as the source when building a derivation. Currently, the builder will see the unsmudged LFS pointer files, but I'd like the builder to optionally see smudged files. I agree that smudged files are usually not needed at eval time, but I don't see a good alternative for making them available at build time, besides a fixed-output derivation (and fixed-output derivations have their own problems).

@L-as
Member

L-as commented Mar 14, 2024

A FOD seems optimal here, in general you shouldn't use builtins.fetchGit if you're only going to use it at build time.

@lucia3e8
Contributor Author

In general I agree, but (afaik) other fetchers can't use git credentials.

@nixos-discourse

This pull request has been mentioned on NixOS Discourse. There might be relevant details there:

https://discourse.nixos.org/t/2024-03-11-nix-team-meeting-132/42960/1

@khoitd1997

@roberth have you had a chance to take a look at this issue? We have been staying on older versions of Nix as a workaround, but newer versions now have fixes for critical issues, so sticking to old ones is no longer optimal.

@roberth
Member

roberth commented Jun 22, 2024

Hi @b-camacho, thanks for the ping and sorry for the delay. This PR was assigned to me, but I hadn't prioritized it because it was a draft. Wrong assumption on my end, because I do think this is valuable, and I have some things to say :)

> when fetching LFS-enabled repos, nix will smudge all the files.

That's a good start, but we need to make sure that the smudging happens in a controlled manner; otherwise we risk adding impurities.

Specifically, we should parse the attributes to check that a file is actually supposed to be smudged by lfs; if not, ignore the smudge rule. It seems you were already investigating how this could be implemented.
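
To illustrate the attribute check, here is a heavily simplified Python sketch that decides from `.gitattributes` text whether a path has `filter=lfs`. The function names are hypothetical, and the pattern matching uses `fnmatch`, which is only an approximation of real gitattributes semantics (no macros, no per-directory attribute files, and `*` is allowed to match `/`). It is not how this PR implements the lookup.

```python
from fnmatch import fnmatchcase


def parse_gitattributes(text: str):
    """Return a list of (pattern, filter_value) rules in file order.

    filter_value is the string after 'filter=', or None when the line
    unsets the attribute ('-filter' or '!filter').
    """
    rules = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        pattern, *attrs = line.split()
        for attr in attrs:
            if attr.startswith("filter="):
                rules.append((pattern, attr[len("filter="):]))
            elif attr in ("-filter", "!filter"):
                rules.append((pattern, None))
    return rules


def is_lfs_path(path: str, rules) -> bool:
    """Return True if the last matching rule sets filter=lfs for path."""
    value = None
    for pattern, filter_value in rules:
        # Simplification: patterns without '/' are matched against the
        # basename, all other patterns against the whole path.
        target = path if "/" in pattern else path.rsplit("/", 1)[-1]
        if fnmatchcase(target, pattern):
            value = filter_value  # later rules override earlier ones
    return value == "lfs"
```

Only paths for which this kind of check succeeds would be candidates for smudging; everything else is passed through unchanged.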

Furthermore, we should validate the sha256 so that we don't increase the potential for silent errors by a whole external program. The hash should be easy to parse from the pointer file, and while reading other programs' inputs is a little ad hoc, I don't expect any serious issues from this, as we won't cause users to accidentally rely on a bug this way.
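
As a rough illustration of that check, the Python sketch below compares downloaded content against the oid and size taken from a pointer file. `verify_lfs_content` and `LfsVerificationError` are hypothetical names, and the sketch assumes the whole object is already in memory.

```python
import hashlib


class LfsVerificationError(Exception):
    """Raised when downloaded content does not match its pointer file."""


def verify_lfs_content(content: bytes, expected_oid: str, expected_size: int) -> None:
    """Check content against the sha256 oid and size from a pointer file.

    Hypothetical helper; raises LfsVerificationError on any mismatch.
    """
    if len(content) != expected_size:
        raise LfsVerificationError(
            f"size mismatch: expected {expected_size}, got {len(content)}"
        )
    actual_oid = hashlib.sha256(content).hexdigest()
    if actual_oid != expected_oid:
        raise LfsVerificationError(
            f"sha256 mismatch: expected {expected_oid}, got {actual_oid}"
        )
```

Because the expected hash comes from the repository itself, a check like this makes the result verifiable regardless of how the bytes were obtained.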


> the LFS-tracked files are materialized during nix flake lock - this is bad

This won't happen unnecessarily once either of these is implemented

If we need to backtrack on the removal of narHashes (#6530), we can also avoid re-locking transitive inputs whose lock has already been computed by the dependency's lock.

So yes, this isn't efficient yet, but it will be.

> A FOD seems optimal here, in general you shouldn't use builtins.fetchGit if you're only going to use it at build time.

A fixed output derivation works best when all you're using it for is as an input to another derivation (and it's publicly available, as mentioned).
However, if the "fetched" source is a flake (e.g. you have a flake.nix in a repo with LFS files), then you also need to evaluate files from the fetched source, which would constitute import from derivation, which is not optimal. Furthermore you'd need to produce fixed-output hashes for your local repo files, which is such horrible UX we don't need to consider it as a solution.


To summarize, this is worth implementing, I see no blocking issues, design or otherwise, and the following needs to be done:

  • lfs attribute with default false, LGTM
  • figure out which files are LFS
  • invoke the Git LFS filter from $PATH; no need for a rigid dependency or makeWrapper
  • check the sha256
  • add a test, perhaps extending tests/nixos/fetch-git
  • documentation for the lfs attribute (currently under fetchGit's entry in doc/manual/src/language/builtins.md); mention the runtime dependency on the LFS package.

@kip93
Contributor

kip93 commented Jul 23, 2024

What's the state on this PR? Seems to unfortunately be a bit stale given the delayed review. This issue has been plaguing us for a while, so I'm willing to pick up the torch here and try to get this out the door (was actually starting to see how to fix this myself back in March when I saw this PR and decided to see what came out of this).

@roberth
Member

roberth commented Jul 23, 2024

@kip93 I think your question was directed towards @b-camacho, but I'd like to add that we would welcome and support anyone who'd like to work on this.

Feel free to ask questions here or in the meetings if you can make them. We generally have some agenda, but we also like to make time for contributors during or after, when we often hang out while we get some things done. Link to the video conference is in the scratchpad linked there. We also have a matrix room, although personally I'm guilty of neglecting that one sometimes.

@lucia3e8
Contributor Author

Thanks for the thorough writeup @roberth !
I owe you all an update. To avoid shelling out and to implement some features we need to merge, I need a subset of a git-lfs C/C++ client. I reimplemented one from Python into C++ here https://github.com/b-camacho/git-lfs-fetch-cpp.

Once I add some tests and integrate git-lfs-fetch-cpp here, we should be ready for another review!

I'm still on vacation with not-great internet, but back in 6 days and will update you all on 7/31 regardless.

Thanks for the feedback and sorry for the wait!

@roberth
Member

roberth commented Jul 24, 2024

Oh, I don't think shelling out was such a big deal because we can verify the correctness of the result, kind of like how fixed output derivations are allowed to do "grossly impure" things because we can verify the output.

I guess a library implementation of it is still nice for a consistent UX with a small closure size though.

@kip93
Contributor

kip93 commented Oct 10, 2024

Hey! It's me again! I just want to ask if there's anything I can help with here. Maybe I can try doing some testing, or do a smaller version of this that uses the git-lfs CLI tools while the full implementation gets done?

We have a lot of repos with LFS files that would greatly benefit from this, so I'm willing to do whatever work is needed, but also don't want to add extra work for others where it's not wanted.

@lucia3e8
Contributor Author

lucia3e8 commented Nov 1, 2024

Thanks for offering to help @kip93, I'll very happily take you up on that! Here's some background on fetchers (sorry if you already know all this). SourceAccessor abstracts over fs-like entities. The only method we care about is readFile, which comes in 2 versions (one reads the entire file into memory, the other streams it via callbacks). There are 2 SourceAccessor subclasses for git access: GitSourceAccessor and GitExportIgnoreSourceAccessor.

A brief perusal of the git log does not show why GitSourceAccessor is not rolled into GitExportIgnoreSourceAccessor, but currently they are both used, so we need to make our change to both.

For now, I only added it to GitSourceAccessor. Adding this patch on top of your nixpkgs, then running /where/you/built/nix/bin/nix-build -A pkgs.test-lfs, should give you an outpath with a correctly smudged file.

The major issues right now:

  • readFile in GitSourceAccessor pulls the entire file into memory. This will OOM on many, many files, so we need to override the streaming version of readFile instead
  • GitExportIgnoreSourceAccessor ignores LFS for now
  • we do not know why GitExportIgnoreSourceAccessor and GitSourceAccessor both exist, figuring out exactly why they're both needed would be very helpful
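
To make the first issue concrete, here is a small Python sketch of the streaming idea: content is read in fixed-size chunks, each chunk is handed to a sink callback, and a sha256 is computed incrementally so the whole file never has to sit in memory. The name `stream_and_hash` and the 64 KiB chunk size are arbitrary choices for illustration, not taken from the Nix code.

```python
import hashlib
import io
from typing import BinaryIO, Callable, Tuple

CHUNK_SIZE = 64 * 1024  # arbitrary chunk size for this sketch


def stream_and_hash(
    source: BinaryIO,
    sink: Callable[[bytes], None],
    chunk_size: int = CHUNK_SIZE,
) -> Tuple[str, int]:
    """Copy source to sink chunk by chunk; return (sha256_hex, total_bytes).

    Memory use is bounded by chunk_size rather than by the file size.
    """
    hasher = hashlib.sha256()
    total = 0
    while True:
        chunk = source.read(chunk_size)
        if not chunk:
            break  # end of stream
        hasher.update(chunk)
        total += len(chunk)
        sink(chunk)
    return hasher.hexdigest(), total


if __name__ == "__main__":
    received = []
    digest, size = stream_and_hash(io.BytesIO(b"x" * 200_000), received.append)
    print(size, len(received), digest)
```

One trade-off of this shape: chunks reach the sink before the final digest is known, so a hash mismatch can only be reported after the data has already been passed on.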

In addition, per Robert's comment above, we do still need a SHA check, some tests and docs. Tomorrow I'll investigate GitExportIgnoreSourceAccessor more, but I'm not very familiar with how nix handles git, so it's taking a while.

@lucia3e8
Contributor Author

lucia3e8 commented Nov 2, 2024

Ok I don't think we need to touch GitExportIgnoreSourceAccessor! In its constructor, it takes an instance of SourceAccessor, but in reality we always pass in GitSourceAccessor there, so the smudging will be applied either way. Aspects of GitEISA remain a mystery to me, like which constructor is being invoked here. At first it looks like we're invoking an inherited constructor because of this using declaration, but neither CFSA nor FSA define any constructors that take a lambda as second arg, and friends I show this to say it shouldn't compile. Yet it does.
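
The wrapping argument can be shown with a toy Python model. `InnerAccessor` and `FilteringAccessor` below are hypothetical stand-ins for GitSourceAccessor and GitExportIgnoreSourceAccessor; they only model the delegation, not the real classes.

```python
from typing import Callable, Dict


class InnerAccessor:
    """Stand-in for the accessor that reads blobs and smudges pointers."""

    def __init__(self, blobs: Dict[str, bytes], smudge: Callable[[bytes], bytes]):
        self._blobs = blobs
        self._smudge = smudge

    def read_file(self, path: str) -> bytes:
        # Smudging happens here, at the innermost layer.
        return self._smudge(self._blobs[path])


class FilteringAccessor:
    """Stand-in for the wrapper that hides some paths and delegates the rest."""

    def __init__(self, inner, is_allowed: Callable[[str], bool]):
        self._inner = inner
        self._is_allowed = is_allowed

    def read_file(self, path: str) -> bytes:
        if not self._is_allowed(path):
            raise FileNotFoundError(path)
        # Delegation: whatever the inner accessor returns (including
        # smudged content) is what callers of the wrapper see.
        return self._inner.read_file(path)
```

Since the wrapper never reads blobs itself, any transformation applied by the inner accessor is visible through it without changing the wrapper.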

Side quest mysteries aside, tomorrow I'll change GitSA::readFile to override the streaming version of SA::readFile.

Eventually this should probably become a struct of options.

@edolstra
Member

edolstra left a comment

Did some code review/cleanup, otherwise looks great! Thanks!

@kip93
Contributor

kip93 commented Feb 10, 2025

Nice, thanks for the support! Looks like with my subpar c++ skills I missed some details (:

Now, from what I see there's 2 final talking points.

  1. The TODO that you've added. We're using the StringSink just to validate the download was not tampered with. Probably not NECESSARY, just a nice to have, so we could drop this in favour of a generic sink.
  2. The talk about warn vs fail on invalid lfs files. The warn behaviour matches the default git-lfs implementation, but a hard fail is probably best for reproducibility. I think changing to a fail might be best even if it differs from the default behaviour: we don't need to support broken setups, and these are simple to fix on their side.

roberth merged commit 693a38a into NixOS:master Feb 13, 2025
12 checks passed
@roberth
Member

roberth commented Feb 13, 2025

Thank you!

It would be great to also have this in inputs.self, like

@kip93
Contributor

kip93 commented Feb 13, 2025

I can make a PR for that next week (:

@kip93
Contributor

kip93 commented Feb 13, 2025

Well, I lied: I found a little bit of time to do this today, and it turned out to be much simpler than I thought. See #12468

lucia3e8 deleted the lfs branch March 11, 2025 02:42
@nixos-discourse

This pull request has been mentioned on NixOS Discourse. There might be relevant details there:

https://discourse.nixos.org/t/nix-2-27-0-released/62003/1

Labels: documentation, fetching (Networking with the outside (non-Nix) world, input locking)

10 participants