Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Non-deterministic output due to non-deterministic os.walk and unix file permissions of original files #2155

Open
ento opened this issue May 29, 2023 · 3 comments
Assignees

Comments

@ento
Copy link
Contributor

ento commented May 29, 2023

Context

I'm using Pants to package up a GitHub Actions action I started writing in Python. The packaged pex file is checked in to the repo, so that when the action gets run, all dependencies get pulled down as part of the action itself, similar to how actions/typescript-action uses ncc.

Problem and current workaround

It appeared to work fine until I added a CI check that builds a pex file in the CI environment and verifies the checked in pex file is identical to the one that just got built: even though none of the source files nor dependencies changed, the check didn't pass.

I tracked down the causes and was able to use a forked pex repo with a few changes to make the check pass: main...ento:pex:deterministic-bootstrap

  1. os.walk was yielding directories in a non-deterministic order, apparently. Fixed by applying .sort() in a few places.
  2. My local environment and the CI environment had different umask values, and because pex preserves filesystem permissions, the resulting zip archive ended up being different too. Fixed by applying something like a umask when adding entries to the zip archive.
    • Although this worked in my case, I don't think the way I did it really guarantees reproducibility: umask at the OS level works fine when it comes to reproducibility because the default permission is something constant when a new filesystem entry is created (although I don't know if that's the same across all (POSIX-compliant?) OS'es), but in the case of pex, there's no default permission per se and it uses the permission of the entry on disk as the starting point.

Questions

I can work on opening a PR or two to address these sources of non-determinism if that's something good to incorporate into pex.

  1. For os.walk, I'm thinking of adding a wrapper around os.walk like deterministic_os_walk and using it at least in the two places I needed to patch. Would it be a good idea to use it everywhere, or should it be limited to just these two places? Or is this something out of scope of guarantees that pex is willing to provide?
  2. Is providing deterministic builds when it comes to unix file permissions something within scope of pex? If so, what would be a good approach?
    • I did think of specifying the umask in the CI environment and not handling it within pex when I was trying to come up with a fix, but enforcing the umask value in local dev environments is going to be hard, unless all contributors' machines are set up the same way, as the value would need to be set before you clone the GitHub actions repo.
    • I noticed there's a --use-system-time flag for allowing non-deterministic timestamps. With file permissions, we could similarly have a non-deterministic mode and deterministic mode, where deterministic mode creates new zipfile entries with u=rw,g=rw,o=rw for files and u=rwx,g=rwx,o=rwx for directories and with a configurable umask applied, but.. that will mean files that were originally executable will lose that bit. The determinisitc mode could, instead, limit what it controls to r and w permissions and let x pass through, which might work good enough in practice.
@jsirois
Copy link
Member

jsirois commented May 29, 2023

I think 1 is fine to solve as you suggest. Pex has a suite of determinism tests, but they assume you're trying to build the PEX again at some other time on the same filesystem / OS. I think its reasonable to expect the filesystem / OS should not matter. I'm pretty sure the issue here is not that os.walk is nondeterministic, but that its deterministic per OS / filesystem pair. You may get a different result on macOS / APFS (case insensitive) than Linux / EXT4.

As for permissions, those should be what you see is what you get, umasks aside. If I clone a git repo, say, and that contains the code to be PEXed, then the final permissions on the files of the clone that go into the PEX should be the permissions preserved in the PEX. Likewise, when the PEX extracts itself those exact permissions should be what the corresponding files have in their resting locations. In short, I'm not understanding the second issue and it would be great to have a repro if you can provide one - perhaps using ~docker.

@ento
Copy link
Contributor Author

ento commented May 29, 2023

Re 1: Sounds good.

Re 2: Here's a Dockerfile that reproduces the issue: https://gist.github.com/ento/ee0e7ca020d85a4f58c74be8d093502b#file-dockerfile.
It clones https://github.com/pantsbuild/example-python under different umask values and runs pants package ::. The resulting pex files have different checksums.

sha256sum: WARNING: 1 of 1 computed checksums did NOT match

So, given the same repo, pex may produce different outputs depending on the environment.

@jsirois
Copy link
Member

jsirois commented May 29, 2023

Hrm, ok. I'm not sure I like the umask munging angle. I need to think on that a bit. Catering to differing OS / filesystem makes sense - its prevalent. Folks like to develop on X but deploy to Y (I think that's crazy but it's a thing and sometimes actually unavoidable).

Umask though gets weird. You're handing Pex different inputs in that case; so different outputs makes complete sense and your build process should probably control your inputs better. Stepping back a bit, I think only these permissions matter for a PEX, a wheel, etc:

  1. Directories get 755 - anyone can traverse
  2. Executables get 755 - anyone can execute
  3. All other files get 644 - anyone can read

In other words, maybe Pex could have a mode where it normalizes perms, only preserving the execute bit for files.

I can find no spec guidance here but there is this related wheel spec thread: https://discuss.python.org/t/clarifications-to-the-wheel-specification/8141

jsirois pushed a commit that referenced this issue Aug 21, 2023
Addresses one part of #2155.

Adds a wrapper for `os.walk` that sorts directory entries as it
traverses directories. As discussed in the issue, `os.walk` may yield
directory entries in a different order depending on (possibly) the file
system.

I don't know if the two `os.walk` calls that this PR replaces are enough
for addressing this class of non-determinism in pex, but these were
sufficient for solving the issue in my case. Happy to expand the use of
the wrapper if there are other places that it should be used.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants