-
Notifications
You must be signed in to change notification settings - Fork 1.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
build: add tool to create deterministic tarballs #2362
Conversation
Summary: We plan to promote TensorBoard’s Pip packages to Bazel targets, but the filenames of these targets can’t actually be statically determined: they must include the TensorBoard version number, which is defined in a Python module. Instead, we’ll build a tarball containing the two wheels. Building tarballs deterministically with system tools is famously troublesome (you can do it with GNU `tar` and a lot of flags, but not portably), and it seems that the standard approach is to use Python’s built-in `tarfile` library. Implementation based off of <servo/servo#12244>, but with changes to make it Python 3-compatible, to use our code style, and to clean up some minor issues and fix some TODOs from the original. The original source is Apache-2 licensed. Test Plan: End-to-end tests included. Replacing the body of `_run_tool` with ```python return subprocess.check_output([ "/bin/sh", "-c", 'x="$(readlink -f "$1")" && cd "$2" && tar czf "$x" .', "unused", ] + args) ``` causes the `test_invariant_under_mtime` test case to fail for the right reason. wchargin-branch: deterministic-tgz
for entry in file_list: | ||
arcname = os.path.normpath(entry) | ||
tar_file.add(entry, filter=reset, recursive=False, arcname=arcname) | ||
shutil.move(temp_archive, archive) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Extra bonus points if you can guess what happens if you change this
shutil.move
to os.rename
. ;-)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can't guess - but this type of lurking subtlety is exactly why even code that's documented and tested still represents an ongoing cost, even if it's mostly conceptual.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(This one is on Bazel—the sandbox in which Bazel runs tests is
apparently a separate file system, so a full move is needed.)
wchargin-source: 991fdaa7a849ba49d062e9f640d6ddd12ffc703e wchargin-branch: deterministic-tgz
parser.add_argument( | ||
"archive", | ||
metavar="ARCHIVE", | ||
help="name for the output file; should probably end in '.tar.gz'", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's just require that it end with .tar.gz
. No need to make this more flexible than necessary.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure.
shutil.move(temp_archive, archive) | ||
|
||
|
||
@contextlib.contextmanager |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we avoid using context managers for this and the locale change? The script is only being invoked as a subprocess and there really isn't a need to restore the subprocess to its original settings immediately before it exits.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure, makes sense.
help="name for the output file; should probably end in '.tar.gz'", | ||
) | ||
parser.add_argument( | ||
"directory", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we just take a flat list of files instead? That is all we really need for the pip package bundling, and that way we avoid dealing with file order or directory traversals. If we want a directory inside the archive we can synthesize one from the stripped name of the archive file.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, okay. It’s not obvious to me that this decreases the complexity,
because you now need to consider filename collisions (and you also need
to think at least in passing about what happens if the newly required
shell glob in the genrule fails). If you prefer that complexity to the
(4-line) os.walk
, fine by me; otherwise, I can revert.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems less complex to me because it only supports a single level hierarchy, so it's really just about bundling several files into one, rather than an arbitrary directory structure. And it eliminates the need not just to walk the tree but also the filename sorting, which in turn eliminates the need for locale setting. The onus is on the caller to specify the right set of files, and in our case this is easy for the caller.
file_list.sort(key=functools.cmp_to_key(locale.strcoll)) | ||
|
||
# Use a temporary file and atomic rename to avoid partially-formed | ||
# packaging (in case of exceptional situations like running out of |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Given that this is being used from the context of a build helper, an error like this means the build will fail, and in that context I'm not sure why it's necessary to go to extra effort to ensure that none of the outputs are partially formed. As far as I know, Bazel doesn't generally make that guarantee; we have other genrules that would also leave partial data in the event of running out of disk space.
If we didn't try to guarantee that we could eliminate some complexity, including whatever cryptic thing is happening with shutil.move
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Interesting. Fair enough. I’ll change it to write directly.
for entry in file_list: | ||
arcname = os.path.normpath(entry) | ||
tar_file.add(entry, filter=reset, recursive=False, arcname=arcname) | ||
shutil.move(temp_archive, archive) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can't guess - but this type of lurking subtlety is exactly why even code that's documented and tested still represents an ongoing cost, even if it's mostly conceptual.
wchargin-source: 30e936b83964ab4b78b9c9755fc62849b182c3d2 wchargin-branch: deterministic-tgz
wchargin-source: 30e936b83964ab4b78b9c9755fc62849b182c3d2 wchargin-branch: deterministic-tgz
wchargin-source: 54daa09db50cca2bc6b41c933fc40e7b124495ac wchargin-branch: deterministic-tgz
wchargin-source: 065ea1bdcfbd0621f5998a8e310b9e2e7f1f6a5f wchargin-branch: deterministic-tgz
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks; IMO the condensed version is significantly more straightforward.
if len(frozenset(os.path.basename(f) for f in files)) != len(files): | ||
sys.stderr.write("Input basenames must be distinct; got: %r\n" % files) | ||
sys.exit(1) | ||
files.sort(key=os.path.basename) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is sorting necessary if we just say that the input files are order-sensitive? If we are going to sort, then we probably do still need to set the locale.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure; removed.
wchargin-source: a410c619117892bb8945f20c5c277d748801e342 wchargin-branch: deterministic-tgz
wchargin-source: a410c619117892bb8945f20c5c277d748801e342 wchargin-branch: deterministic-tgz
Summary:
We plan to promote TensorBoard’s Pip packages to Bazel targets, but the
filenames of these targets can’t actually be statically determined: they
must include the TensorBoard version number, which is defined in a
Python module. Instead, we’ll build a tarball containing the two wheels.
Building tarballs deterministically with system tools is famously
troublesome (you can do it with GNU
tar
and a lot of flags, but notportably), and it seems that the standard approach is to use Python’s
built-in
tarfile
library.Implementation based off of servo/servo#12244,
but with changes to make it Python 3-compatible, to use our code style,
and to remove some unneeded machinery. Original is Apache-2 licensed.
Test Plan:
End-to-end tests included. Replacing the body of
_run_tool
withcauses the
test_invariant_under_mtime
test case to fail for the rightreason.
wchargin-branch: deterministic-tgz