Conversation

@fgiudici commented Sep 24, 2020

Found an issue when pulling images that share the same layer (same diff-digest) compressed in different ways (so different compressed-diff-digest and compressed-size).
Reproducer:

 # crictl pull docker.io/fgiudici/test:1.0
 # crictl rmi docker.io/fgiudici/test:1.0
 # crictl pull docker.io/node:lts-slim
 # crictl pull docker.io/fgiudici/test:1.0
FATA[0003] pulling image failed: rpc error: code = Unknown desc = Error: blob sha256:6a73b40493d9a3fd12c93f0d03ec90c78e447dc9f182d119fb7477b75997ba72 is already present, but with size 22522274 instead of 23625492

Image docker.io/fgiudici/test:1.0 has some layers in common with docker.io/node:lts-slim, but the common layers have been pushed as different compressed archives (and so have different compressed-diff-digests).
In particular:
node:lts-slim has

"id": "767a7b7a8ec530ae9a7b1c4eeed530d54ca7ef9f224fe207be42dc74f565587a",
"created": "2020-09-23T10:27:27.399649267Z",
"compressed-diff-digest": "sha256:abb454610128b028301ee40af387d31111a1e699e4ea424fd53186ff77067402",
"compressed-size": 22522274,
"diff-digest": "sha256:767a7b7a8ec530ae9a7b1c4eeed530d54ca7ef9f224fe207be42dc74f565587a",
"diff-size": 58489344,
"compression": 2

test:1.0 has

"id": "767a7b7a8ec530ae9a7b1c4eeed530d54ca7ef9f224fe207be42dc74f565587a",
"created": "2020-09-23T09:50:52.502675155Z",
"compressed-diff-digest": "sha256:6a73b40493d9a3fd12c93f0d03ec90c78e447dc9f182d119fb7477b75997ba72",
"compressed-size": 23625492,
"diff-digest": "sha256:767a7b7a8ec530ae9a7b1c4eeed530d54ca7ef9f224fe207be42dc74f565587a",
"diff-size": 58489344,
"compression": 2

As the size check has been dropped in more recent branches with commit:
223c722#diff-5f78baafa6d34438aba66ed64f4b16d5L517
I am wondering whether it would be safe to do the same here (this is the code that was in place before the rework on caching). An illustrative sketch of the kind of check in question follows.
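For context, a minimal sketch of that size check, written with plain Go types rather than the actual code of this branch (checkExtantBlobSize and its parameters are hypothetical names; it only needs the standard fmt package):

```go
// Hypothetical helper illustrating the size-consistency check under
// discussion: if a blob with the same digest is already present in storage
// but its recorded size differs from the size reported by the source, reuse
// is refused and the pull fails with an error like the one in the
// reproducer above. A size of -1 means "unknown".
func checkExtantBlobSize(digest string, srcSize, extantSize int64) error {
	if srcSize != -1 && extantSize != -1 && srcSize != extantSize {
		return fmt.Errorf("blob %s is already present, but with size %d instead of %d",
			digest, extantSize, srcSize)
	}
	return nil
}
```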

@TomSweeneyRedHat (Member)

Tests aren't happy @fgiudici

@fgiudici (Author)

@TomSweeneyRedHat, thanks for checking so quickly!
CI fails during the environment setup. The issue is on the cri-o-release-1.11 branch: we need to update the Travis Dockerfile base image as done on more recent branches (or swap the Ubuntu repo in the old image once started).
Addressing that in a separate PR (#1052).

@mtrmac (Collaborator) left a comment

Thanks for the PR!

An initial very quick look: It does make sense that this check should be dropped. I’m not at all sure that setting srcInfo.Size = extantBlobSize unconditionally makes sense here, couldn’t that cause a call to ReapplyBlob with a mismatched (digest, size) pair? I’ll need to take a deeper look.

@fgiudici (Author)

> Thanks for the PR!
>
> An initial very quick look: It does make sense that this check should be dropped. I’m not at all sure that setting srcInfo.Size = extantBlobSize unconditionally makes sense here, couldn’t that cause a call to ReapplyBlob with a mismatched (digest, size) pair? I’ll need to take a deeper look.

Yeah, you are right, we would have a mismatched pair. It also seems to me that the srcInfo.Size = extantBlobSize assignment is useless (should we drop it?), as ReapplyBlob (in storage_image.go) calls HasBlob() again to get the size and reapplies that one. So it seems we also have to fix HasBlob to check the CompressedDigest when searching layers by compressed digest: if it is not the same as the digest we searched for, we should keep the original size from the layer source, I guess. That way, we will cache it properly.

@TomSweeneyRedHat (Member)

Just wanted to note that this will hopefully fix this BZ https://bugzilla.redhat.com/show_bug.cgi?id=1867463 which is getting a lot of attention.

@fgiudici (Author)

Following the idea in the comment above, I added a commit to check whether, when we find a cached layer by the uncompressed digest, the compressed digest matches too:
if it does, we return the cached compressed size as before.
If not, we have different compressed archives and we return the compressed size of the source image.

Note that at this point we could drop the first commit (i.e., leave the size check in copy.go in place), as we now return the correct size. But that check honestly doesn't seem that useful.
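A minimal sketch of that behavior, with hypothetical, simplified names rather than the real HasBlob/ReapplyBlob code in storage_image.go:

```go
// Hypothetical helper illustrating the commit described above: a cached
// layer has been found by its *uncompressed* digest, and we decide which
// compressed size to report for the blob being copied.
func blobSizeToReport(srcCompressedDigest string, srcSize int64,
	cachedCompressedDigest string, cachedCompressedSize int64) int64 {
	if srcCompressedDigest == cachedCompressedDigest {
		// Same compressed archive: the cached compressed size is correct.
		return cachedCompressedSize
	}
	// Same diff, different compression: report the size coming from the
	// source image instead (which may be -1, i.e. unknown, as noted below).
	return srcSize
}
```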

@mtrmac (Collaborator) commented Sep 30, 2020

> Following the idea in the comment above, I added a commit to check whether, when we find a cached layer by the uncompressed digest, the compressed digest matches too:
> if it does, we return the cached compressed size as before.
> If not, we have different compressed archives and we return the compressed size of the source image.

That doesn’t work in general because the input blobInfo.Size value may be -1 (unknown), but must not be unknown when returned from ReapplyBlob.

@mtrmac (Collaborator) commented Sep 30, 2020

Actually, how does this fail precisely? Looking at storage commit b8e0174ae6b2dc083d9ada365b9a207371aa62a6 , CompressedSize and CompressedDigest are always updated together, and immediately before CompressedDigest is updated, the map used to look up LayersByCompressedDigest is updated. So, we should never get into the blobinfo.Digest != layers[0].CompressedDigest case AFAICS.

(Overall, before roughly the introduction of TryReusingBlob, the code is not very ready for substituting different blobs, and I’d lean towards dropping the reuse (forcing a re-pull of the layer) if we did have to change something, rather than introducing the new / never-before-exercised situations into a very old very stable branch. But right now I must be missing something.)

@fgiudici (Author) commented Oct 2, 2020

> Actually, how does this fail precisely? Looking at storage commit b8e0174ae6b2dc083d9ada365b9a207371aa62a6 , CompressedSize and CompressedDigest are always updated together, and immediately before CompressedDigest is updated, the map used to look up LayersByCompressedDigest is updated. So, we should never get into the blobinfo.Digest != layers[0].CompressedDigest case AFAICS.
>
> (Overall, before roughly the introduction of TryReusingBlob, the code is not very ready for substituting different blobs, and I’d lean towards dropping the reuse (forcing a re-pull of the layer) if we did have to change something, rather than introducing the new / never-before-exercised situations into a very old very stable branch. But right now I must be missing something.)

Well, I dug down to the root cause: the storage cache is not completely cleaned up on image removal.
crictl pull docker.io/fgiudici/test:1.0 fills up the storage cache, i.e., the layerStore struct.
Let's track this layer in particular:

"id": "767a7b7a8ec530ae9a7b1c4eeed530d54ca7ef9f224fe207be42dc74f565587a",
"created": "2020-09-23T09:50:52.502675155Z",
"compressed-diff-digest": "sha256:6a73b40493d9a3fd12c93f0d03ec90c78e447dc9f182d119fb7477b75997ba72",
"compressed-size": 23625492,
"diff-digest": "sha256:767a7b7a8ec530ae9a7b1c4eeed530d54ca7ef9f224fe207be42dc74f565587a",
"diff-size": 5848934,

What we want to note is that we will fill in the byid map in the layerStore struct:
767a7b7a8ec530ae9a7b1c4eeed530d54ca7ef9f224fe207be42dc74f565587a:$POINTER_TO_TEST1.0_LAYER

and in the bycompressedsum map:
sha256:6a73b40493d9a3fd12c93f0d03ec90c78e447dc9f182d119fb7477b75997ba72:[767a7b7a8ec530ae9a7b1c4eeed530d54ca7ef9f224fe207be42dc74f565587a]

Let's move on:
crictl rmi docker.io/fgiudici/test:1.0
the byid map is cleaned up, but the bycompressedsum one is not. This is what will cause the issue in a moment!

Now, let's pull the other image:
crictl pull docker.io/node:lts-slim
Here is the layer belonging to this image that we are interested in:

"id": "767a7b7a8ec530ae9a7b1c4eeed530d54ca7ef9f224fe207be42dc74f565587a",
"created": "2020-09-23T10:27:27.399649267Z",
"compressed-diff-digest": "sha256:abb454610128b028301ee40af387d31111a1e699e4ea424fd53186ff77067402",
"compressed-size": 22522274,
"diff-digest": "sha256:767a7b7a8ec530ae9a7b1c4eeed530d54ca7ef9f224fe207be42dc74f565587a",
"diff-size": 5848934,

Now, our byid map has:
767a7b7a8ec530ae9a7b1c4eeed530d54ca7ef9f224fe207be42dc74f565587a:$POINTER_TO_LTS_SLIM_LAYER

And in the bycompressedsum map we have both the old stale entry and the new one, each referencing the same layer id:

  • sha256:6a73b40493d9a3fd12c93f0d03ec90c78e447dc9f182d119fb7477b75997ba72:[767a7b7a8ec530ae9a7b1c4eeed530d54ca7ef9f224fe207be42dc74f565587a]
  • sha256:abb454610128b028301ee40af387d31111a1e699e4ea424fd53186ff77067402:[767a7b7a8ec530ae9a7b1c4eeed530d54ca7ef9f224fe207be42dc74f565587a]

Next:
crictl pull docker.io/fgiudici/test:1.0

HasBlob() will succeed here: the stale bycompressedsum entry maps the 6a73b40493d9a compressed digest to layer id 767a7b7a8ec530ae9a. Retrieving that layer from the byid map hands us the LTS_SLIM_LAYER, which carries the (different) compressed digest and compressed size of the layer from node:lts-slim. We pass back that compressed size... and here we are. A simplified sketch of this lookup path is below.
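Here is a simplified model of that lookup, using the map names mentioned above rather than the real c/storage code (layerInfo and lookupByCompressedDigest are hypothetical names):

```go
// Simplified model of the stale lookup described above: bycompressedsum maps
// a compressed digest to the ids of layers created from that compressed
// archive; byid maps a layer id to its metadata.
type layerInfo struct {
	CompressedDigest string
	CompressedSize   int64
}

func lookupByCompressedDigest(bycompressedsum map[string][]string,
	byid map[string]layerInfo, compressedDigest string) (layerInfo, bool) {
	ids, ok := bycompressedsum[compressedDigest]
	if !ok || len(ids) == 0 {
		return layerInfo{}, false
	}
	// With the stale entry left behind by "crictl rmi", ids[0] is still
	// 767a7b7a..., but byid now maps that id to the lts-slim layer, whose
	// CompressedDigest and CompressedSize do NOT match compressedDigest.
	return byid[ids[0]], true
}
```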

IMHO we should fix the storage package: I just opened a PR for that (containers/storage#730).
We can anyway handle this here by checking the compressed digest before accepting the cached layer. This would not hurt and shouldn't be risky: I updated this PR to just check the compressed digest and not accept the returned layer if it doesn't match (see the sketch below).
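A minimal sketch of that check, building on the hypothetical lookup above (again, not the exact diff of this PR):

```go
// Before trusting a layer returned by the compressed-digest lookup, verify
// that the compressed digest recorded on the layer is really the one we
// searched for. If it is not (a stale cache entry), behave as if the blob
// were not present, which forces a normal pull of the layer.
func hasBlobCached(bycompressedsum map[string][]string,
	byid map[string]layerInfo, compressedDigest string) (bool, int64) {
	layer, found := lookupByCompressedDigest(bycompressedsum, byid, compressedDigest)
	if !found {
		return false, -1
	}
	if layer.CompressedDigest != compressedDigest {
		// Stale entry: do not reuse the cached layer.
		return false, -1
	}
	return true, layer.CompressedSize
}
```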

@fgiudici changed the title from "copy: drop the check on the size of cached blobs" to "storage: check the compressed digest of cached blobs" on Oct 2, 2020
@mtrmac (Collaborator) commented Oct 2, 2020

Ah, I have completely missed that case. Thanks for digging into it!

The updated fix makes sense to me. I’m a bit torn on whether to actually include the workaround in c/image or whether to just have this PR as a pointer to the c/storage PR.

What do others think?

When there are different images that have the same layer compressed
differently, pulling the images in cri-o may fail.
Reproducer:
 # crictl pull docker.io/fgiudici/test:1.0
 # crictl rmi docker.io/fgiudici/test:1.0
 # crictl pull docker.io/node:lts-slim
 # crictl pull docker.io/fgiudici/test:1.0
FATA[0003] pulling image failed: rpc error: code = Unknown desc = Error: blob sha256:6a73b40493d9a3fd12c93f0d03ec90c78e447dc9f182d119fb7477b75997ba72 is already present, but with size 22522274 instead of 23625492

Image docker.io/fgiudici/test:1.0 has some layers in common with docker.io/node:lts-slim, but the common layers have been pushed as different compressed archives (and so have different compressed-diff-digests).
In particular:
node:lts-slim has
    "id": "767a7b7a8ec530ae9a7b1c4eeed530d54ca7ef9f224fe207be42dc74f565587a",
    "created": "2020-09-23T10:27:27.399649267Z",
    "compressed-diff-digest": "sha256:abb454610128b028301ee40af387d31111a1e699e4ea424fd53186ff77067402",
    "compressed-size": 22522274,
    "diff-digest": "sha256:767a7b7a8ec530ae9a7b1c4eeed530d54ca7ef9f224fe207be42dc74f565587a",
    "diff-size": 5848934,
    "compression": 2

test:1.0 has
    "id": "767a7b7a8ec530ae9a7b1c4eeed530d54ca7ef9f224fe207be42dc74f565587a",
    "created": "2020-09-23T09:50:52.502675155Z",
    "compressed-diff-digest": "sha256:6a73b40493d9a3fd12c93f0d03ec90c78e447dc9f182d119fb7477b75997ba72",
    "compressed-size": 23625492,
    "diff-digest": "sha256:767a7b7a8ec530ae9a7b1c4eeed530d54ca7ef9f224fe207be42dc74f565587a",
    "diff-size": 5848934,
    "compression": 2

This is due to an incomplete cleanup of the storage cache when images
are deleted: the compressed-digest mappings of the image layers are not
cleaned up. So, if the same layer, but with a different compression, is
later pulled, the old undeleted reference will match, returning a layer
with a different compressed digest and compressed size.

Don't blindly trust the storage cache; check that we got what we need.

Signed-off-by: Francesco Giudici <[email protected]>
@fgiudici (Author)

Once the c/storage PR (containers/storage#730) was applied, the issue was solved and was not reported again.
Moreover, since this PR targets an old stable branch, I don't think it makes sense to consider this PR anymore; closing.

@fgiudici closed this Jan 31, 2022