Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

build: add tool to create deterministic tarballs #2362

Merged
merged 8 commits into from
Jun 21, 2019

Conversation

wchargin
Copy link
Contributor

@wchargin wchargin commented Jun 19, 2019

Summary:
We plan to promote TensorBoard’s Pip packages to Bazel targets, but the
filenames of these targets can’t actually be statically determined: they
must include the TensorBoard version number, which is defined in a
Python module. Instead, we’ll build a tarball containing the two wheels.
Building tarballs deterministically with system tools is famously
troublesome (you can do it with GNU tar and a lot of flags, but not
portably), and it seems that the standard approach is to use Python’s
built-in tarfile library.

Implementation based off of servo/servo#12244,
but with changes to make it Python 3-compatible, to use our code style,
and to remove some unneeded machinery. Original is Apache-2 licensed.

Test Plan:
End-to-end tests included. Replacing the body of _run_tool with

    return subprocess.check_output(["tar", "czf"] + args)

causes the test_invariant_under_mtime test case to fail for the right
reason.

wchargin-branch: deterministic-tgz

Summary:
We plan to promote TensorBoard’s Pip packages to Bazel targets, but the
filenames of these targets can’t actually be statically determined: they
must include the TensorBoard version number, which is defined in a
Python module. Instead, we’ll build a tarball containing the two wheels.
Building tarballs deterministically with system tools is famously
troublesome (you can do it with GNU `tar` and a lot of flags, but not
portably), and it seems that the standard approach is to use Python’s
built-in `tarfile` library.

Implementation based off of <servo/servo#12244>,
but with changes to make it Python 3-compatible, to use our code style,
and to clean up some minor issues and fix some TODOs from the original.
The original source is Apache-2 licensed.

Test Plan:
End-to-end tests included. Replacing the body of `_run_tool` with

```python
    return subprocess.check_output([
        "/bin/sh",
        "-c",
        'x="$(readlink -f "$1")" && cd "$2" && tar czf "$x" .',
        "unused",
    ] + args)
```

causes the `test_invariant_under_mtime` test case to fail for the right
reason.

wchargin-branch: deterministic-tgz
for entry in file_list:
arcname = os.path.normpath(entry)
tar_file.add(entry, filter=reset, recursive=False, arcname=arcname)
shutil.move(temp_archive, archive)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Extra bonus points if you can guess what happens if you change this
shutil.move to os.rename. ;-)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can't guess - but this type of lurking subtlety is exactly why even code that's documented and tested still represents an ongoing cost, even if it's mostly conceptual.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(This one is on Bazel—the sandbox in which Bazel runs tests is
apparently a separate file system, so a full move is needed.)

wchargin-source: 991fdaa7a849ba49d062e9f640d6ddd12ffc703e
wchargin-branch: deterministic-tgz
parser.add_argument(
"archive",
metavar="ARCHIVE",
help="name for the output file; should probably end in '.tar.gz'",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's just require that it end with .tar.gz. No need to make this more flexible than necessary.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure.

shutil.move(temp_archive, archive)


@contextlib.contextmanager
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we avoid using context managers for this and the locale change? The script is only being invoked as a subprocess and there really isn't a need to restore the subprocess to its original settings immediately before it exits.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, makes sense.

help="name for the output file; should probably end in '.tar.gz'",
)
parser.add_argument(
"directory",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we just take a flat list of files instead? That is all we really need for the pip package bundling, and that way we avoid dealing with file order or directory traversals. If we want a directory inside the archive we can synthesize one from the stripped name of the archive file.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, okay. It’s not obvious to me that this decreases the complexity,
because you now need to consider filename collisions (and you also need
to think at least in passing about what happens if the newly required
shell glob in the genrule fails). If you prefer that complexity to the
(4-line) os.walk, fine by me; otherwise, I can revert.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems less complex to me because it only supports a single level hierarchy, so it's really just about bundling several files into one, rather than an arbitrary directory structure. And it eliminates the need not just to walk the tree but also the filename sorting, which in turn eliminates the need for locale setting. The onus is on the caller to specify the right set of files, and in our case this is easy for the caller.

file_list.sort(key=functools.cmp_to_key(locale.strcoll))

# Use a temporary file and atomic rename to avoid partially-formed
# packaging (in case of exceptional situations like running out of
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Given that this is being used from the context of a build helper, an error like this means the build will fail, and in that context I'm not sure why it's necessary to go to extra effort to ensure that none of the outputs are partially formed. As far as I know, Bazel doesn't generally make that guarantee; we have other genrules that would also leave partial data in the event of running out of disk space.

If we didn't try to guarantee that we could eliminate some complexity, including whatever cryptic thing is happening with shutil.move.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Interesting. Fair enough. I’ll change it to write directly.

for entry in file_list:
arcname = os.path.normpath(entry)
tar_file.add(entry, filter=reset, recursive=False, arcname=arcname)
shutil.move(temp_archive, archive)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can't guess - but this type of lurking subtlety is exactly why even code that's documented and tested still represents an ongoing cost, even if it's mostly conceptual.

wchargin-source: 30e936b83964ab4b78b9c9755fc62849b182c3d2
wchargin-branch: deterministic-tgz
wchargin-source: 30e936b83964ab4b78b9c9755fc62849b182c3d2
wchargin-branch: deterministic-tgz
wchargin-source: 54daa09db50cca2bc6b41c933fc40e7b124495ac
wchargin-branch: deterministic-tgz
wchargin-source: 065ea1bdcfbd0621f5998a8e310b9e2e7f1f6a5f
wchargin-branch: deterministic-tgz
Copy link
Contributor

@nfelt nfelt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks; IMO the condensed version is significantly more straightforward.

if len(frozenset(os.path.basename(f) for f in files)) != len(files):
sys.stderr.write("Input basenames must be distinct; got: %r\n" % files)
sys.exit(1)
files.sort(key=os.path.basename)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is sorting necessary if we just say that the input files are order-sensitive? If we are going to sort, then we probably do still need to set the locale.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure; removed.

wchargin-source: a410c619117892bb8945f20c5c277d748801e342
wchargin-branch: deterministic-tgz
wchargin-source: a410c619117892bb8945f20c5c277d748801e342
wchargin-branch: deterministic-tgz
@wchargin wchargin merged commit fce7a8a into master Jun 21, 2019
@wchargin wchargin deleted the wchargin-deterministic-tgz branch June 24, 2019 17:04
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants