deterministic source archives #2948

Open
Ericson2314 opened this issue Aug 2, 2016 · 8 comments
Labels
C-feature-request Category: proposal for a feature. Before PR, ping rust-lang/cargo if this is not `Feature accepted` Command-package S-triage Status: This issue is waiting on initial triage.

Comments

@Ericson2314
Contributor

Tarballs contain more information than we need (e.g. users, groups, fine-grained permissions, timestamps), and also allow representing the same information in multiple ways (e.g. order of directory contents, files defined twice). The basic problem this creates is that files cannot be deterministically assembled into an archive. In practice this means:

  • Directory registries cannot be verified against lockfiles as well
  • Packages may accidentally depend on permissions only supported on some platforms
  • Sources besides registries cannot be mirrored (distinct from what sorts of sources can serve as mirrors)
  • Users may unintentionally leak information about their current system when publishing packages.

None of these is terribly pressing on its own, but hopefully they are worthy of a solution in aggregate.

The solution is first carefully deciding which metadata we wish to support (that is, the information our archives will contain), and then picking a canonical form for every possible archive containing that information. A thornier question is whether existing uploads should be normalized according to the chosen schema.

For backwards compatibility, it is probably best to stick with some subset of tar. This is what Debian does. Where an extraneous field cannot be elided, it should be constrained to some fixed value. Either the most expressive POSIX tar variant could be used, or the most minimal format that supports the information in question.

Other options might be git's tree objects or Nix's NAR. The Merkle DAG used by the former can lead to better error messages and free dedup, but SHA1 is dubiously secure. The latter can be hashed however we like, but still runs into backwards-compatibility problems.

CC @eternaleye

@eternaleye

@Ericson2314 and I discussed this on IRC, and figured that the actual metadata needed is probably (at most) the executable bit. As a result, the v7 tar format (the oldest one, and supported absolutely everywhere) would be a viable option. For determinism, we'd want to constrain:

  • mode bits 0755 or 0644 (if executable bit not needed, 0644 can be hardcoded)
  • uid to 0
  • gid to 0
  • timestamp to 0
  • "link indicator" to 0 (normal file)

And sort the file names in a locale-independent manner.

(Note: UID zero would not be much of an obstacle to users extracting them, as by default GNU tar does not preserve UIDs at extraction time when executed by an unprivileged user.)
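As a rough illustration of the constraints listed above, here is a minimal sketch in plain Rust (standard library only, not tar-rs) that fills in a single v7-style 512-byte header with the fixed values and sorts paths byte-wise. The function names `v7_header`, `put_octal`, and `canonical_order` are hypothetical. The sketch assumes the "link indicator" of 0 means the ASCII character `'0'` (a NUL byte is the other common encoding of a normal file), and that the 100-character limit is counted in bytes.

```rust
/// Size of one tar header block in bytes.
const BLOCK: usize = 512;

/// Write `value` as zero-padded octal into `field`, ending with a NUL byte.
/// Callers must ensure the value fits in `field.len() - 1` octal digits.
fn put_octal(field: &mut [u8], value: u64) {
    let width = field.len() - 1;
    let text = format!("{:0width$o}", value, width = width);
    field[..width].copy_from_slice(text.as_bytes());
    field[width] = 0;
}

/// Build a v7-style header whose only inputs are the path, the size,
/// and (optionally) the executable bit. Everything else is fixed.
fn v7_header(path: &str, size: u64, executable: bool) -> Option<[u8; BLOCK]> {
    let name = path.as_bytes();
    // The name field holds 100 bytes; the size field holds 11 octal digits.
    if name.is_empty() || name.len() > 100 || size > 0o77777777777 {
        return None;
    }
    let mut h = [0u8; BLOCK];
    h[..name.len()].copy_from_slice(name);
    put_octal(&mut h[100..108], if executable { 0o755 } else { 0o644 }); // mode
    put_octal(&mut h[108..116], 0); // uid fixed to 0
    put_octal(&mut h[116..124], 0); // gid fixed to 0
    put_octal(&mut h[124..136], size);
    put_octal(&mut h[136..148], 0); // timestamp fixed to 0
    h[156] = b'0'; // link indicator: normal file (assumed ASCII '0')

    // The checksum is the byte sum of the header with the checksum
    // field itself counted as eight spaces.
    h[148..156].copy_from_slice(b"        ");
    let sum: u32 = h.iter().map(|&b| u32::from(b)).sum();
    let text = format!("{:06o}\0 ", sum);
    h[148..156].copy_from_slice(text.as_bytes());
    Some(h)
}

/// Order paths by their raw bytes, which does not depend on the locale,
/// and drop exact duplicates so no file is defined twice.
fn canonical_order(mut paths: Vec<String>) -> Vec<String> {
    paths.sort_unstable();
    paths.dedup();
    paths
}
```

With this shape, two machines that package the same file contents under the same paths would produce identical header bytes, since no user, group, or time information enters the header.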

However, it has a 100-character limit on path length. The limit can be raised to 255 by moving to the "ustar" format, but that just postpones the issue. For unbounded filenames, we'd need to move to the "pax" format, which is considerably more complex.

However, we could say that pax is only used for crates that actually possess such long paths, and thus postpone the issue until such crates exist, while still being 100% deterministic (as a crate will either have only short paths, and thus be v7, or at least one long path, and thus must be pax).
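The rule in the previous paragraph can be sketched as a pure function of the crate's path list, which is what keeps the format choice deterministic. This is only an illustration: `ArchiveFormat` and `choose_format` are hypothetical names, and the limit is assumed to be measured in bytes.

```rust
/// The two formats considered in this sketch.
#[derive(Debug, PartialEq, Eq, Clone, Copy)]
enum ArchiveFormat {
    V7,
    Pax,
}

/// Longest path (in bytes) that fits in the v7 name field.
const V7_MAX_PATH: usize = 100;

/// Pick v7 when every path is short enough, and pax as soon as
/// at least one path exceeds the limit. The result depends only
/// on the set of paths, never on the machine doing the packaging.
fn choose_format<'a>(paths: impl IntoIterator<Item = &'a str>) -> ArchiveFormat {
    if paths.into_iter().all(|p| p.len() <= V7_MAX_PATH) {
        ArchiveFormat::V7
    } else {
        ArchiveFormat::Pax
    }
}
```

For example, a crate containing only `Cargo.toml` and `src/lib.rs` would select `ArchiveFormat::V7`, while adding a single 120-byte path would switch the whole archive to `ArchiveFormat::Pax`.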

Another possibility is to use pax from the start, but that will require more thought on how to make it deterministic.

Also, it may be best to exclude the executable bit - nothing cargo does natively needs it, and build scripts can either set it before executing things, or specify the interpreter for scripts. Putting it in the archive would then be unnecessary.

@alexcrichton
Member

I'd be down for this! I'd also prefer to stick with tarballs if we can; I don't think there's any reason per se they have to be nondeterministic.

@eternaleye note that all those tar formats are currently supported by the tar-rs crate, and we currently use the ustar format for backwards compatibility, but it's perhaps been long enough now that we can switch to gnu (which is easier to write than pax right now). In any case, it should be easy enough to configure the header there to have whatever data we need, or to add a .deterministic() method to headers in tar-rs.

@eternaleye

Well, I'd actually prefer pax to gnu (actually a standard), ustar to either (older standard with more buy-in, simplicity is a benefit in validation, not just implementation), and v7 if we could get away with it (dead simplest).

@alexcrichton
Member

We can't do v7 because ustar supports longer paths (which is what Cargo supports today). Sticking with ustar, which is what Cargo currently does, is fine.

@eternaleye

eternaleye commented Aug 2, 2016

Makes sense - however, ustar does raise the additional concern of exactly how the path is split between the path and path-prefix fields.
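To make the concern concrete: ustar stores a path as a 155-byte prefix plus a 100-byte name, joined by an implied `/`, so a long path can often be split at more than one slash. A deterministic format has to fix one rule. The sketch below picks the leftmost slash that yields a valid split; that rule and the name `split_ustar` are assumptions made for illustration, not something tar-rs or Cargo is known to do.

```rust
/// Field sizes of the ustar header, in bytes.
const NAME_MAX: usize = 100;
const PREFIX_MAX: usize = 155;

/// Split `path` into `(prefix, name)` for a ustar header.
/// Returns `None` when the path is empty or no valid split exists.
///
/// Canonical rule assumed here:
/// 1. If the whole path fits in the name field, the prefix is empty.
/// 2. Otherwise, split at the leftmost '/' for which both halves fit.
fn split_ustar(path: &str) -> Option<(&str, &str)> {
    if path.is_empty() {
        return None;
    }
    if path.len() <= NAME_MAX {
        return Some(("", path));
    }
    // '/' is a single ASCII byte, so slicing around it is always
    // on a character boundary. The slash itself is not stored.
    path.match_indices('/')
        .map(|(i, _)| (&path[..i], &path[i + 1..]))
        .find(|(prefix, name)| {
            !prefix.is_empty()
                && prefix.len() <= PREFIX_MAX
                && !name.is_empty()
                && name.len() <= NAME_MAX
        })
}
```

Any fixed rule would do (rightmost valid slash is the obvious alternative); what matters is that every writer applies the same one, so that the same path always produces the same header bytes.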

@Ericson2314
Contributor Author

Well, the idea would be to use v7 where possible, otherwise ustar (and maybe in the future, for really long names, pax). But just using ustar should be fine.

On the cargo end, removing the set_metadata call might make things work in practice, but we should be careful to define our format so future versions of tar-rs don't inadvertently change it. More interesting to me is sanitization on the crates.io side. @alexcrichton, do you know where the code for that should go?

And should we normalize existing tarballs?

@eternaleye

eternaleye commented Aug 2, 2016

I'm personally in favor of doing so, as otherwise any integrity system we deploy will work differently for crates uploaded before/after the change.

EDIT: Sadly, #2857 already got merged - as a result, this would be disruptive for anyone who has used local mirrors, because it adds just such an integrity system. Is there any chance that could be backed out, or not included in "stable" cargo releases until this is addressed?

@carols10cents carols10cents added C-feature-request Category: proposal for a feature. Before PR, ping rust-lang/cargo if this is not `Feature accepted` Command-package labels Sep 26, 2017
@epage epage added the S-triage Status: This issue is waiting on initial triage. label Sep 28, 2023