
docker: tarfile: improve auto-decompression handling #427

Closed
wants to merge 4 commits into from

Conversation

@cyphar cyphar commented Mar 12, 2018

This matches how "docker load" deals with compressed images, as well as
being a general quality-of-life improvement over the previous error
messages we'd give. This also necessarily removes the previous
special-cased gzip handling, and adds support for auto-decompression for
streams as well.

For quite a while we were blocked on supporting xz decompression because it
effectively required shelling out to "unxz" (which is just bad in
general). However, there is now a library -- github.com/ulikunitz/xz --
which implements LZMA decompression in pure Go. It isn't as
featureful as liblzma (and only supports 1.0.4 of the specification) but
it is an improvement over not supporting xz at all. And since we aren't
using its writer implementation, we don't have to worry about sub-par
compression.

Tools like umoci will always compress layers, and in order to work
around some lovely issues with DiffIDs, tarfile.GetBlob would always
decompress them. However the BlobInfo returned from tarfile.GetBlob
would incorrectly give the size of the compressed layer because the
size caching code didn't actually check the layer size, resulting in
"skopeo copy" failing whenever sourcing umoci images.

Signed-off-by: Aleksa Sarai [email protected]
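
For illustration, auto-decompression of this kind boils down to sniffing the archive's magic bytes and wrapping the stream in the matching decompressor. A minimal sketch of that idea (not the code in this PR; gzip and bzip2 come from the Go standard library, xz from github.com/ulikunitz/xz; the helper name is made up):

package main

import (
	"bufio"
	"bytes"
	"compress/bzip2"
	"compress/gzip"
	"io"
	"log"
	"os"

	"github.com/ulikunitz/xz"
)

// autoDecompress sniffs the first few bytes of r and returns a reader that
// yields the decompressed stream; unrecognized input is passed through as-is.
func autoDecompress(r io.Reader) (io.Reader, error) {
	br := bufio.NewReader(r)
	// Peek does not consume the bytes, so the selected decompressor still
	// sees the full stream, including the magic number.
	magic, err := br.Peek(6)
	if err != nil && err != io.EOF {
		return nil, err
	}
	switch {
	case bytes.HasPrefix(magic, []byte{0x1f, 0x8b}): // gzip
		return gzip.NewReader(br)
	case bytes.HasPrefix(magic, []byte("BZh")): // bzip2
		return bzip2.NewReader(br), nil
	case bytes.HasPrefix(magic, []byte{0xfd, '7', 'z', 'X', 'Z', 0x00}): // xz
		return xz.NewReader(br)
	default: // assume the stream is already uncompressed
		return br, nil
	}
}

func main() {
	r, err := autoDecompress(os.Stdin)
	if err != nil {
		log.Fatal(err)
	}
	n, err := io.Copy(io.Discard, r)
	if err != nil {
		log.Fatal(err)
	}
	log.Printf("uncompressed size: %d bytes", n)
}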

cyphar commented Mar 12, 2018

This now includes fixes for docker copy docker-archive: being broken for images with compressed layers -- which sometimes caused images built with umoci to break.

@cyphar cyphar changed the title docker: tarfile: support auto-decompression of source docker: tarfile: improve auto-decompression handling Mar 12, 2018
	if err != nil {
		return &Source{
			tarPath: path,
		}, nil

Collaborator

(Just a quick question for now:) Is there a user who benefits from adding decompression support for streams? The only user of NewSourceFromStream in c/image processes the output of docker save, which is not compressed.

If there is no user, I’d rather leave the decompression exclusive to NewSourceFromFile, or rewrite the code in another way to preserve the property that NewSourceFromFile on an uncompressed file doesn’t make an unnecessary (and pretty costly) copy.

Contributor Author

If you want, I can rework this so that it just uses the original file if it is uncompressed (I actually missed that the code did this when I deleted this hunk -- oops!).

But I don't see the negative of adding it to the stream implementation. The uncompressed path for streams is basically identical to the compressed path (minus the overhead of reading 5 bytes ahead, rather than just using bufio for the entire read -- which isn't a significant overhead). So adding it to the stream implementation just means that containers/image users can stream compressed archives directly to c/image rather than making two copies.

Contributor Author

Fixed

Collaborator

So adding it to the stream implementation just means that containers/image users can stream compressed archives directly to c/image rather than making two copies.

This does not actually change that, because docker-archive: always calls NewSourceFromFile (docker/tarball is an internal helper, not a transport with its own ImageReference that could be used via copy.Image). Which is why I was asking about a user who benefits.

Sure, this does not hurt, and the consistency in the API is nice :)

Contributor Author

I was thinking more about users who might want to use docker/tarball but there are probably no such users. Anyway, I've fixed up the code to now no longer make a copy if the archive is already uncompressed.

cyphar commented Mar 14, 2018

Fixed up the real test failure. Now test-skopeo is failing because it doesn't vendor the new xz library. How should we proceed @mtrmac?

@umohnani8 (Member)

@cyphar https://github.com/projectatomic/skopeo#contributing describes how to fix the breaking skopeo tests and test them.

@mtrmac mtrmac left a comment

ACK overall.

The need to decompress layers only to count the size is awkward, but not obviously avoidable—building a manifest with the compressed digests, and having to compute them, is not better.

Closing the gzip stream is the only outstanding comment. (It would be unreasonable to block this PR on using DetectCompression when the design of that is incorrect, but I think it’s fair to block this on designing AutoDecompress correctly. Still, maybe we can get away with ignoring the issue anyway. @runcom ?)

@@ -15,13 +15,12 @@ import (

 func TestDetectCompression(t *testing.T) {
 	cases := []struct {
-		filename      string
-		unimplemented bool
+		filename string

Collaborator

(Non-blocking: This could become a []string.)

Contributor Author

It could, but I preferred the c.filename as it's more descriptive (and makes the diff smaller). But I could switch it if you prefer.

Collaborator

No, this is fine as well.

		return &Source{
			tarPath: path,
		}, nil
	}
	defer reader.Close()

Collaborator

So, it turns out that gzip.NewReader returns a ReadCloser, and the caller is expected to really Close the stream; the compression.DetectCompression implementation missed that.

Luckily(?) gzip.Reader.Close() is (currently?) trivial: it only returns an error value which, AFAICT, Read() would have returned anyway as long as the consumer is reading until EOF.

But that complicates the AutoDecompress design; should it wrap uncompressed inputs in NopCloser? Should it return a separate close callback instead of modifying the io.Reader stream (like dirImageMockWithRef does)?

Or do we just pretend that gzip.Reader.Close does not exist? :)

Contributor Author

We could change the API to return an io.ReadCloser. Since io.Reader ⊂ io.ReadCloser, there shouldn't be a problem with any users of the interface (aside from the downside that they probably won't notice they should now call Close).
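
A rough sketch of that ReadCloser-returning shape (names are illustrative, not the final c/image API): uncompressed input gets wrapped in a NopCloser so every caller can defer a single Close, while gzip input hands back the gzip.Reader whose Close actually matters.

package sketch

import (
	"bufio"
	"bytes"
	"compress/gzip"
	"io"
)

// autoDecompress returns a ReadCloser yielding the decompressed stream.
// Closing it closes the decompressor (e.g. the gzip.Reader), but not the
// underlying reader, which the caller still owns.
func autoDecompress(r io.Reader) (io.ReadCloser, error) {
	br := bufio.NewReader(r)
	magic, err := br.Peek(2)
	if err != nil && err != io.EOF {
		return nil, err
	}
	if bytes.HasPrefix(magic, []byte{0x1f, 0x8b}) { // gzip magic number
		return gzip.NewReader(br) // *gzip.Reader is an io.ReadCloser
	}
	// Not compressed (or a format whose reader has no Close): wrap it so
	// every caller can treat the result uniformly.
	return io.NopCloser(br), nil
}

// example shows the "two Close()s" pattern discussed in this thread.
func example(f io.ReadCloser) error {
	defer f.Close() // closes the underlying file/stream
	rc, err := autoDecompress(f)
	if err != nil {
		return err
	}
	defer rc.Close() // closes the decompressor, if any
	_, err = io.Copy(io.Discard, rc)
	return err
}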

Collaborator

… and that

reader = …
defer reader.Close()
reader, … = AutoDecompress(…)
defer reader.Close() // Again!

is unintuitive. But then none of the alternatives I can think of are any more elegant.

Contributor Author

Maybe, but if you compare it to gzip.NewReader you have to do a similar thing.

reader := ...
defer reader.Close()
reader2, ... := gzip.NewReader(reader)
defer reader2.Close()

We could work around it by doing some very dodgy .(io.Closer) handling in order to still allow users to pass io.Readers that don't have a hidden Close() method. But that's probably what you were thinking when you said that you can't think of any more elegant workarounds. 😉
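
Spelled out, that .(io.Closer) workaround is just a conditional type assertion; a hypothetical helper for illustration, not anything in c/image:

package sketch

import "io"

// closeIfCloser closes r only if it happens to implement io.Closer; plain
// io.Readers (e.g. a bytes.Reader) are left alone. This is the "dodgy
// .(io.Closer) handling" idea, shown purely for illustration.
func closeIfCloser(r io.Reader) error {
	if c, ok := r.(io.Closer); ok {
		return c.Close()
	}
	return nil
}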

Collaborator

The unintuitive part, to me, is that with raw gzip, there is reader.Close() and reader2.Close(); with AutoDecompress, there would be two reader.Close()s, which looks pretty similar to a copy&paste bug.

But I still can’t think of a much better API.

reader = …
defer reader.Close()
reader, close, … = AutoDecompress(…)
defer close()

does not look noticeably better.

Contributor Author

For the moment I'm just going to ignore the close method for gzip if that's okay with you. Otherwise we'd have to NopCloser most of the other functions, and DetectCompression's signature would look wrong.

Collaborator

I’m not really happy to ignore resource cleanup and error detection, no. It’s a time bomb, and doing the right thing seems plausible.

I can take on the work of updating the existing compression package and its users, if you don’t want to deal with the rest of c/image as part of this PR (apart from updating it to use the new API).

Contributor Author

Alright, in that case I'll work on doing that tomorrow.

Contributor Author

I'll work on fixing this up today.

	if err != nil {
		return nil, errors.Wrapf(err, "auto-decompress %s to find size", h.Name)
	}
	rawSize := h.Size

Collaborator

(uncompressedSize perhaps? raw suggests to me the original, i.e. “not decompressed”.)
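
For context on what this hunk is doing: finding the uncompressed size means draining the auto-decompressed stream and taking io.Copy's byte count. A rough sketch, reusing the hypothetical ReadCloser-returning autoDecompress from earlier in this thread:

package sketch

import "io"

// uncompressedSize drains the (possibly compressed) layer and reports how
// many bytes the decompressed stream contains.
func uncompressedSize(layer io.Reader) (int64, error) {
	rc, err := autoDecompress(layer) // hypothetical helper sketched above
	if err != nil {
		return -1, err
	}
	defer rc.Close()
	// io.Copy returns the number of bytes written, i.e. the uncompressed size.
	return io.Copy(io.Discard, rc)
}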

cyphar commented Mar 16, 2018

@umohnani8 Oh, so we still haven't sorted that problem out (it's been a problem for more than a year now -- and there were PRs posted that were supposed to resolve this issue). That's a shame.

mtrmac commented Mar 16, 2018

@umohnani8 Oh, so we still haven't sorted that problem out (it's been a problem for more than a year now -- and there were PRs posted that were supposed to resolve this issue). That's a shame.

I can’t remember PRs for that, am I misremembering?

The coupling with the skopeo tests was added intentionally, at a time when we were changing the c/image API so frequently, and without noticing that skopeo needed updating, that we didn't have a fresh skopeo build for about a month.

Now that the churn has settled a bit (but we do still break the API), and more importantly there are a few other prominent users of c/image, maybe we could reconsider that tradeoff… though I’m not exactly sure in which direction. Actually document and commit to a subset of the API being stable? That would probably still be a very small subset.

cyphar commented Mar 18, 2018

@mtrmac This all happened more than a year ago, so it took me a while to find the PRs. Here are the ones I could find:

The idea was to see if we could run the integration tests as part of skopeo so that we didn't have to run test-skopeo here -- and instead we would gate things on skopeo PRs. But I'm not sure how good of an idea that is in retrospect.

The main problem with test-skopeo is when you have to touch vendor.conf -- which is an outlier, but is an outlier that I've hit more often than you would expect. 😉

mtrmac commented Mar 22, 2018

(

The idea was to see if we could run the integration tests as part of skopeo so that we didn't have to run test-skopeo here -- and instead we would gate things on skopeo PRs. But I'm not sure how good of an idea that is in retrospect.

At that point in the past, skopeo was IIRC the only known user, so breaking skopeo made the two repos rather pointless, and with c/image changing its API frequently, we were broken most of the time. So test-skopeo was introduced to ensure that whoever changed the c/image API also immediately prepared a skopeo update, and the breakage would last for about an hour instead of a week. The cost of ~30 minutes for the extra "DO NOT MERGE" version and CI run was well worth it at the time.

c/image is now breaking the API a bit less, so maybe the tradeoff is no longer that compelling. Also, with umoci, Nix, libpod, buildah, and maybe other users, it is unclear whether treating skopeo specially is reasonable.

Ideally, of course, c/image would never break the API, but with things like #431 incoming, I don’t think we are ready to pay the price of that commitment yet, or at least I haven’t heard that the demand is strong enough so far.

)


Anyway, in this particular case, I’m fine with skipping the “do not merge” temp-PR; we can with reasonable confidence expect that this will just work, so that PR can be prepared after merging this one.

For me, this PR blocks only on closing the gzip.Reader.

mtrmac commented Mar 22, 2018

(

c/image is now breaking the API a bit less, so maybe the tradeoff is no longer that compelling.

One more aspect of this is that c/image does not have any real integration tests; those are all in skopeo. We could just move them, but then skopeo would be missing tests, and maintaining two sets, differing only in whether the functionality is invoked from Go or from a CLI, is a bit wasteful.
)

rhatdan commented Jun 1, 2018

@cyphar Needs a rebase, if you are still interested in adding this PR.

cyphar commented Jun 4, 2018

Yes, I will revive this PR as it has broken skopeo-umoci interactions several times in the past.

For quite a while we were blocked on supporting xz decompression because it
effectively required shelling out to "unxz" (which is just bad in
general). However, there is now a library -- github.com/ulikunitz/xz --
which implements LZMA decompression in pure Go. It isn't as
featureful as liblzma (and only supports 1.0.4 of the specification) but
it is an improvement over not supporting xz at all. And since we aren't
using its writer implementation, we don't have to worry about sub-par
compression.

Signed-off-by: Aleksa Sarai <[email protected]>
Most users of DetectCompression will use the decompressorFunc
immediately afterwards (the only exception is the copy package). Wrap
up this boilerplate in AutoDecompress, so it can be used elsewhere.

Signed-off-by: Aleksa Sarai <[email protected]>
This matches how "docker load" deals with compressed images, as well as
being a general quality-of-life improvement over the previous error
messages we'd give. This also necessarily removes the previous
special-cased gzip handling, and adds support for auto-decompression for
streams as well.

Signed-off-by: Aleksa Sarai <[email protected]>
Tools like umoci will always compress layers, and in order to work
around some *lovely* issues with DiffIDs, tarfile.GetBlob would always
decompress them. However the BlobInfo returned from tarfile.GetBlob
would incorrectly give the size of the *compressed* layer because the
size caching code didn't actually check the layer size, resulting in
"skopeo copy" failing whenever sourcing umoci images.

As an aside, for some reason the oci: transport doesn't report errors
when the size is wrong...

Signed-off-by: Aleksa Sarai <[email protected]>
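
To illustrate the AutoDecompress commit message above: assuming DetectCompression at the time had roughly the shape func DetectCompression(io.Reader) (DecompressorFunc, io.Reader, error), with DecompressorFunc being func(io.Reader) (io.Reader, error), the boilerplate being wrapped up looks something like this (a sketch under those assumptions, including the import path; not the exact c/image code):

package sketch

import (
	"io"

	"github.com/containers/image/pkg/compression"
)

// autoDecompressSketch detects the compression format and, if the input is
// compressed, replaces the stream with its decompressed view.
func autoDecompressSketch(input io.Reader) (io.Reader, error) {
	decompressor, reader, err := compression.DetectCompression(input)
	if err != nil {
		return nil, err
	}
	if decompressor == nil {
		// Not compressed: the returned reader still includes the peeked
		// bytes, so it can be used directly.
		return reader, nil
	}
	return decompressor(reader)
}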

cyphar commented Jun 6, 2018

Rebased and containers/skopeo#516 is the test PR.

rhatdan commented Jun 16, 2018

@cyphar Tests are failing?

cyphar commented Jun 16, 2018

@rhatdan It's because I've added a new vendored dependency, which means we need a skopeo PR to test it (which I've already done in containers/skopeo#516) -- so we'll have to merge with the tests failing. But at the moment I'm fixing up the auto-decompression API to use a ReadCloser.

rhatdan commented Jul 12, 2018

@cyphar Needs a rebase.
@mtrmac PTAL

rhatdan commented Jul 12, 2018

@cyphar this is blocking us from fixing a couple of issues in podman.

mtrmac commented Jul 14, 2018

@mtrmac PTAL

@rhatdan #481; the docker-archive: handling is still completely untested, so feel free to try it with your use cases if I don’t get to this first.

mtrmac commented Jul 17, 2018

This was finished in #481. Thanks again!

@mtrmac mtrmac closed this Jul 17, 2018