
Proposal: Add the ability to deduplicate uploads #236

Closed

sargun opened this issue Feb 15, 2021 · 25 comments

@sargun
Contributor

sargun commented Feb 15, 2021

Problem statement

We have many repositories, on the order of tens of thousands, all managed by independent internal teams. Some teams may be building from a common base layer like Ubuntu. When Ubuntu is revved, it forces a re-upload of that layer to every repository. The security requirement of this solution is that the distribution server must be able to prove that the user has the file they're referring to. The registry is NOT required to keep secret whether it has a given blob when the client attempts to prove possession of the file, only access to that blob's contents.

The current solution is to use "cross repository blob mounts", but that's non-trivial to implement for many users. Specifically, the client must:

  1. Keep track of which repo each layer originally came from
  2. Check whether the blob from the relevant upstream repo exists in our private registry's copy of that repo; if so, do a cross-repo mount.
  3. Otherwise, create the repo and upload a blob + manifests that were responsible for the existence of the previous blob
  4. Cross-mount

Step #3 is difficult in a couple of respects. (1) If a lineage of multiple images is used -- say ubuntu -> titusoss/ubuntu -> titusoss/ubuntu-ping -- it requires tracking which "original" repository a blob came from; otherwise a dangling blob uploaded to another repo may be garbage collected, lacking an associated manifest, if the upload isn't completed quickly enough. This is easy if you know all the manifests that reference the blob, but that information isn't in the image. (2) It also requires that any user can create a repository and read / write to any repo.

Proposal

When the user begins an upload session by calling POST /v2/<name>/blobs/uploads/, they can optionally supply a digest argument containing the digest of the blob that will come out at the end. If the registry would like to allow the user to prove that they have this file, it begins a protocol that lets them do so. The client may choose not to opt into the deduplication method.

The registry will respond with data that allows the user to generate a random string of bytes of a particular length and value. This length should be of a reasonable size. The user's job is then to append that value to the blob, run the given digest algorithm over the result, and send the resulting digest to the server. This is feasible because many hash functions have a relatively small internal state that can be checkpointed and stored alongside the blob on the server, so the server does not have to rehash the whole blob either. Client-side, the user may value CPU over bandwidth, or choose to store similar metadata themselves.

Properties

  • generator string
    This must be an XOF (extendable-output function), which is used to generate the data to append to the original value. We can come up with the allowed values later.

  • length int
    An integer greater than 0 defining, in bytes, the output length to generate from the given XOF.

  • seed string
    The input into the XOF
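
As a rough server-side illustration (not part of the proposal itself), a registry could generate these properties along the following lines; the make_challenge helper and the default length are made up for this sketch:

import json
import secrets

def make_challenge(length: int = 50) -> dict:
    # The seed must not be predictable (see "Security implications" below).
    seed = secrets.token_hex(16)
    return {
        "deduplicateUpload": {
            "generator": "blake3",  # the XOF the client must use
            "length": length,       # bytes of XOF output to append to the blob
            "seed": seed,
        }
    }

print(json.dumps(make_challenge(), indent=2))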

Example:

POST /v2/titusoss/ubuntu/blobs/uploads/?digest=sha256:d9014c4624844aa5bac314773d6b689ad467fa4e1d1a50a1b8a99d5a95f72ff5 HTTP/1.1
Accept: */*
Accept-Encoding: gzip, deflate
Connection: keep-alive
Content-Length: 0
Host: registry.us-east-1.streamingtest.titus.netflix.net:7002
User-Agent: HTTPie/2.3.0



HTTP/1.1 202 Accepted
Connection: keep-alive
Content-Length: ...
Location: https://registry:7002/v2/titusoss/ubuntu/blobs/uploads/af8e1d5c-33cb-4e6b-a8e3-1c00418f0cfe?_state=mystate
Range: 0-0
Content-type: application/json

{
  "deduplicateUpload": {
    "length": 57,
    "seed": "foo",
    "generator": "blake3"
  }
}

The above request asks the user to prove that they have the contents described by sha256:d9014c4624844aa5bac314773d6b689ad467fa4e1d1a50a1b8a99d5a95f72ff5. In this case, that digest describes the string Hello, world!\n.

The user needs to do the following:

  1. Run blake3 over the seed "foo".
  2. Generate 50 bytes of output. In this case, the hex digest of that output is 04e0bb39f30b1a3feb89f536c93be15055482df748674b00d26e5a75777702e9791074b7511b59d31c71c62f5a745689fa6c.
  3. Append that output to Hello, world!\n.
  4. Take a hash of the new value with the original digest algorithm (sha256).

That's roughly described in the following python session:

In [1]: import blake3, hashlib

In [2]: data = b'Hello, world!\n'

In [3]: hashlib.sha256(data).hexdigest()
Out[3]: 'd9014c4624844aa5bac314773d6b689ad467fa4e1d1a50a1b8a99d5a95f72ff5'

In [4]: extra = blake3.blake3(b"foo").digest(length=50)

In [5]: extra
Out[5]: b'\x04\xe0\xbb9\xf3\x0b\x1a?\xeb\x89\xf56\xc9;\xe1PUH-\xf7HgK\x00\xd2nZuww\x02\xe9y\x10t\xb7Q\x1bY\xd3\x1cq\xc6/ZtV\x89\xfal'

In [6]: hashlib.sha256(data + extra).hexdigest()
Out[6]: '958864784c4661cd235c474a4105deedc00ba21ca372e39e03891a5c3d32696f'

This gives us the resultant hash 958864784c4661cd235c474a4105deedc00ba21ca372e39e03891a5c3d32696f. It must use the same digest algorithm as the original digest.

To complete the upload, PUT with the dedupe query parameter:

PUT https://registry:7002/v2/titusoss/ubuntu/blobs/uploads/af8e1d5c-33cb-4e6b-a8e3-1c00418f0cfe?_state=mystate&dedupe=sha256:958864784c4661cd235c474a4105deedc00ba21ca372e39e03891a5c3d32696f
Content-Length: 0

202 Accepted

At this point, the registry will validate the proof and either accept the deduplicated upload or reject it.
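
Putting the flow together, a client might look roughly like the sketch below. It assumes the requests and blake3 Python packages, plus the registry URL, repository name, and dedupe query parameter from the example above; the dedupe_upload helper is illustrative, not a finalized API.

import hashlib

import blake3
import requests

REGISTRY = "https://registry:7002"   # placeholder values from the example
REPO = "titusoss/ubuntu"

def dedupe_upload(blob: bytes) -> bool:
    digest = "sha256:" + hashlib.sha256(blob).hexdigest()
    resp = requests.post(f"{REGISTRY}/v2/{REPO}/blobs/uploads/",
                         params={"digest": digest})
    resp.raise_for_status()

    challenge = None
    if resp.headers.get("Content-Type", "").startswith("application/json"):
        challenge = resp.json().get("deduplicateUpload")
    if challenge is None:
        return False  # registry offered no challenge; fall back to a normal upload

    # Generate the XOF output and hash blob || extra with the original algorithm.
    extra = blake3.blake3(challenge["seed"].encode()).digest(length=challenge["length"])
    proof = "sha256:" + hashlib.sha256(blob + extra).hexdigest()

    # Complete the upload against the Location returned by the POST.
    done = requests.put(resp.headers["Location"], params={"dedupe": proof})
    return done.status_code in (201, 202)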

Security implications

  • This will leak whether the registry has the blob, even if the registry always sends a dedupe challenge regardless of whether it has it, because of the timing attacks that can be waged against it during verification.
  • The registry mustn't make the length too large, otherwise this can become a DoS vector for both the client and the server. We should think about setting an upper limit.
  • The "seed" must not be predictable.
  • The security attributes of the XOF aren't really all that important, since we're only using it as a random string generator.

FAQ

Should the PUT be a JSON document instead of the query parameter?

Maybe. I'm not sure.

Doesn't this require that more state be kept on the server, and what's the implication there?

This state can probably be passed back and forth in the state query parameter, or embedded in the location object in another way. Alternatively, this information is pretty tiny (above example ~4 bytes).

Won't rehashing be horribly expensive?

The SHA-256 hasher's state can be "checkpointed" and saved to disk in most implementations.
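
For illustration, hashlib lets you copy a partially fed hasher in process and finalize the copies independently; persisting that state to disk requires an implementation that exposes the internal state, which the Python standard library does not.

import hashlib

blob = b"Hello, world!\n"
h = hashlib.sha256()
h.update(blob)

checkpoint = h.copy()          # "checkpointed" unfinalized state
print(h.hexdigest())           # digest of the blob itself

checkpoint.update(b"extra")    # resume from the checkpoint with appended data
print(checkpoint.hexdigest())  # digest of blob || extra, without rehashing the blob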

How do we deal with time, and CPUs becoming faster or one of our hash functions being weakened? ("security")

You can keep cranking up the length of the output from the XOF to force more hashing. The scheme also isn't tied to a single hash function.

@sargun
Contributor Author

sargun commented Feb 15, 2021

The one other thing:

We should also allow the registry to respond with a "409 Conflict" if this repository already contains the blob and the user is authorized to read that blob in the given repo.

@justincormack

This only requires that the client possess an unfinalized hash of the content, which is not really much different from possessing a digest of the content under a different hash function. Most proof-of-data-possession algorithms will choose a random piece of the content (e.g. from a precomputed set), so the client has to actually have the whole content, or perform some other less deterministic computation.

@sargun
Contributor Author

sargun commented Feb 16, 2021

@justincormack If we limit the approach to a Merkle–Damgård construction, I think that's much more digestible, since you don't have to do random seeks. TTFB is on the order of ~1 second, and the cache can only store on the order of ~1 MB of data per blob. If it's a BLAKE3-style hash, you can keep the terminal leaves of the tree and rehydrate from that.

Although, how common is it to have an unfinalized hash of the contents lying around?

@vbatts
Member

vbatts commented Feb 25, 2021

Like Justin mentioned, in past thinking about this Stephen Day had begun an implementation of sha256 that exposed access to the unfinalized hash state, but that exposed state is then also open to being tampered with.

The more probable option was to work from offsets into the original blob (like a challenge-response where the client is given a few offsets, hashes the tar at those points, and returns the checksum(s)).
While it would be expensive to seek through the existing tarball streams of the container layers, if there were a bypass to get at the tar-split data, then accessing the offsets would be much more efficient. That data is just a series of entries, each recording its size, so the offsets can be computed and the corresponding portions of data read from the file on disk.
Checksum the segments; sure, make it a merkle tree or whatever.
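
A minimal sketch of the client side of such an offset-based challenge, assuming the registry hands out (offset, length) pairs and expects one checksum per segment; the segment_checksums helper and the example values are illustrative only.

import hashlib

def segment_checksums(path, segments):
    """segments: iterable of (offset, length) pairs chosen by the registry."""
    checksums = []
    with open(path, "rb") as f:
        for offset, length in segments:
            f.seek(offset)
            checksums.append(hashlib.sha256(f.read(length)).hexdigest())
    return checksums

# e.g. segment_checksums("layer.tar", [(0, 4096), (1048576, 4096)])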

@sargun
Contributor Author

sargun commented Feb 25, 2021

@vbatts If we were to choose an offset-based algorithm, would it have something like:

  • List of offsets
  • Random seed per offset
  • Random length per offset?

The problem that I kind of have is that I have two kinds of storage: Redis and S3. S3 is very slow and API operations are "costly", so we have to be conservative. Redis falls over at around 1 MB per object. We can have a merkle tree, which makes the above very doable, but each (1 MB) "chunk" turns into about ~1 KB of state in the leaf of the merkle tree. These leaves can be combined if they're in the middle of the object.

I guess my next question is what level of security do people want, given that this would be an opt-in option?

@sargun
Contributor Author

sargun commented Mar 1, 2021

So, what if the structure was something like:

{
  "deduplicateUpload": {
    "generator": "blake3",
    "points": [
      {
        "length": 57,
        "seed": "foo",
        "offset": 0
      }
    ]
  }
}

We guarantee the points are non-overlapping. We may extend the length of the blob overall, and the client should be prepared for that.

Although this would be difficult to make efficient in sha256, it is easy enough to do efficiently in merkle constructs.
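
For illustration only, here is one possible reading of the multi-point proof, in which the XOF output for each point is spliced in at its offset before rehashing with the blob's original algorithm; the thread does not settle on these exact semantics.

import hashlib

import blake3

def prove(blob: bytes, points) -> str:
    pieces, cursor = [], 0
    for point in sorted(points, key=lambda p: p["offset"]):
        pieces.append(blob[cursor:point["offset"]])
        pieces.append(blake3.blake3(point["seed"].encode()).digest(length=point["length"]))
        cursor = point["offset"]
    pieces.append(blob[cursor:])                      # remainder of the blob
    return hashlib.sha256(b"".join(pieces)).hexdigest()

# e.g. prove(b"Hello, world!\n", [{"offset": 0, "length": 57, "seed": "foo"}])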

@sargun
Contributor Author

sargun commented Mar 25, 2021

Based on the previous conversation, there are three security models:

  1. Many repositories, where all repository owners trust one another and have read AND write access to one another's registries
  2. Many repositories, where not all repositories can be read from and written to by every user (HR vs. IT department). In this model it is important to keep the contents of the blob secret, but it is not necessary to keep the existence of the blob secret. In addition, side-channel leaks are not considered a risk in this model.
  3. Many repositories, where not all repositories can be read from and written to by every user (with potentially adversarial actors on the same registry). In this model, it is important to keep secret the knowledge that the registry has a given blob.

I propose we solve for use case #1 as it does not require solving any cryptography problems. I suggest we also solve a subset of #2, where there is an access control system already in place in which the registry can validate a given user has access to the given blob.

I propose we state:

To obtain a session ID, perform a POST request to a URL in the following format:
/v2/<name>/blobs/uploads/

The client MAY specify a query parameter "digest" with a well-formed digest value. If the blob already exists and the user has access to it, the registry MAY respond with a 201 Created, whose response MUST include the following header:

Location: <blob-location>

Alternatively, if the registry does not blindly accept the dedupe / content, it may respond with a different response code in the future. If the client does not understand the response code, it should fall back and retry without a digest.
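
A hedged sketch of the client-side fallback this describes, using the Python requests library; the registry and repository values and the start_upload helper are placeholders, not spec text.

import requests

def start_upload(registry, repo, digest=None):
    url = f"{registry}/v2/{repo}/blobs/uploads/"
    params = {"digest": digest} if digest else {}
    resp = requests.post(url, params=params)
    if digest and resp.status_code == 201:
        # Blob already present and accessible: nothing to upload.
        return ("exists", resp.headers["Location"])
    if digest and resp.status_code != 202:
        # Unrecognized response: fall back and retry without a digest.
        resp = requests.post(url)
    resp.raise_for_status()
    return ("upload", resp.headers["Location"])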

@justincormack

Yes I think just solving for cases where leaking existence is not a problem is the best solution.

@hallyn
Contributor

hallyn commented Apr 21, 2021

If I'm understanding @justincormack right, I agree.

@stevvooe
Contributor

The original reason we didn't include this was that an attacker could block another upload. We added something like this in the containerd content store API by allowing the client to set upload identifiers.

I think this could be done safely on registries by having a client-provided identifier, namespaced by the repository, provided at upload creation time. While a cryptographic solution would be ideal, there would be some difficulties in choosing an algorithm that would be performant and meet all use cases. The endpoint should already be well protected by the uploader, preventing the blocking of uploads by an attacker. The blast radius there would be a single repository.

This could likely be implemented as an additional parameter on the upload POST call. From there, you'd need something to inform clients that the upload is ongoing elsewhere. Since we already provide progress, we could use that endpoint to provide status as well.

@sargun
Contributor Author

sargun commented Apr 21, 2021

The original reason we didn't include this was that an attacker could block another upload.

How so, isn't the Location header for uploads dynamically generated?

@stevvooe
Contributor

How so, isn't the Location header for uploads dynamically generated?

Yes, it is. The behavior we avoided was having user generated ids. User generated ids would solve this problem, as has been demonstrated in containerd's content store upload model.

I'm saying that using a user-set id is a decent solution, with appropriate ACLs in place. Requiring a complex hash algorithm isn't necessary on the server side. A dedupe key is sufficient, as long as the clients agree on the algorithm for calculating it.

@SteveLasker
Contributor

Security Models

Different registries handle security differently based on their use case, whether it's Docker Hub as a public registry, or MCR, NVIDIA, and Red Hat as software registries. Can we enable the handshake and let the specific registry decide how it handles what can be cross-mounted?

@sargun
Contributor Author

sargun commented Apr 21, 2021

@SteveLasker I'm not sure what you're getting at. Wouldn't registries with a security model that's not aligned with this just respond with 202 Accepted and continue as normal with the upload? This has the ability to "fail okay" on old registries and registries that do not support this security model.

@SteveLasker
Contributor

Here you referred to 3 models, scoping to 1 with a subset of 2.
What I'm suggesting is that cross-repo mounting and security are handled differently by different registries. So, if we can identify in the handshake what will be uploaded, the registry can respond according to how it handles cross-mounting.

@sargun
Contributor Author

sargun commented Apr 21, 2021

@SteveLasker Wouldn't the algorithm be:

  1. If I already know of another repo with this layer, try the cross-mount API
  2. Try to start upload with digest
  3. Do traditional upload

I'm not sure where / why a handshake needs to occur as opposed to this fallback behaviour?

@SteveLasker
Contributor

I build app1 FROM alpine, which originated on docker hub.

I push app1 to myregistry.azurecr.io/app1 (which was previously an empty registry)
At this point, myregistry.azurecr.io has no knowledge of the alpine layers, so the alpine layers wind up within /app1

I build app2 FROM alpine
I push app2 to myregistry.azurecr.io/app2

At this point, app2 should be able to cross-mount the alpine layers in /app1.

As long as the token used to push has access to both repos (app1 and app2), the cross mount should be allowed.

So, you should be able to get all 3 scenarios if we define the semantics for how a client can (securely) identify the blob that's being uploaded and allow the registry to define whether it will support cross-mounting.

@sargun
Contributor Author

sargun commented Apr 21, 2021

I think trying to build a secure protocol for cross-repo mounts (especially one that's replay resistant) is out of scope of this proposal. I think the "try and fall back" approach is the upper bound of complexity in this proposal.

@vbatts
Member

vbatts commented Apr 23, 2021

@sargun:

  • It would be interesting to consider this approach with the common backends you mentioned, like S3/Redis. I'm guessing the consideration is that the challenges would be generated and validated on the fly by the registry, so as to be random enough. 🤔
  • I think your models/use cases are tracking correctly. And @justincormack's statement is about exposing that a digest may exist on another registry, which could be private, etc.

@jonjohnsonjr
Contributor

To capture some of the discussion from the call...

There are two opportunities to do this without really changing the spec (much):

  1. Do this during HEAD blob requests.
  2. Do this as part of the cross-repo mount request.

Doing this as part of a HEAD request is problematic:

  1. HEAD should be side-effect free, so we would violate HTTP RFCs.
  2. Doing this would confuse or break HTTP caches in unpredictable ways (due to 1).
  3. There's no way to distinguish between a client just checking to see if a blob exists and a client that is checking to see if a blob exists because it wants to upload it to the registry.

If we want to do this during cross-repo mounting, we can just make the from parameter optional. Registries that expect from should return a 202 anyway, so the fallback path is fine. If a from is supplied, the registry can take that as a hint of where to look, but has the liberty to mount from anywhere that satisfies the registry's auth model. Clients can supply the from parameter if they know it, but they can also just omit it. This can be used to skip a blob existence check as well, if a client wants, eliminating an extra roundtrip.
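
A small sketch of that client flow, assuming the existing mount and from query parameters on the blob-upload POST and the Python requests library; the mount_or_upload helper is illustrative.

import requests

def mount_or_upload(registry, repo, digest, source_repo=None):
    params = {"mount": digest}
    if source_repo:
        params["from"] = source_repo   # optional hint; the registry may ignore it
    resp = requests.post(f"{registry}/v2/{repo}/blobs/uploads/", params=params)
    if resp.status_code == 201:
        return ("mounted", resp.headers["Location"])
    # 202 Accepted: the mount didn't happen; continue with a regular upload session.
    resp.raise_for_status()
    return ("upload", resp.headers["Location"])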

One drawback of this approach is that it is susceptible to timing attacks. Assuming that an existence check is faster than an existence check + an auth check, a client can (probabilistically) detect the presence of a blob in a registry even if they don't have access to it by comparing the latency of 202 responses to mount attempts. Certain registries will almost certainly not want to implement this, but fallback behavior is already well-specified.

Common use cases for this would be automatically mounting publicly available blobs (I took some liberties and implemented this with GCR a while ago, using our (partial) Docker Hub mirror if you want to test against a registry) or mounting across somewhat trusted boundaries (e.g. an org-wide registry where you have read access to everything).

@shizhMSFT
Contributor

I have 3 major concerns about your proposal:

  1. The digest parameter on the POST request conflicts with the single-POST upload functionality of the distribution-spec. Note that it is allowed to push an empty layer to the registry.
  2. The security model is important. As mentioned in the issue description and the comments, this proposal assumes that all users have registry-level authorization. This implies that the security model breaks for any registry attempting fine-grained access control (e.g. Docker Hub, ACR, etc.). It is worth noting that leaking even one bit about private content is critical, since we should secure information in an all-or-nothing fashion.
  3. Resumable hashing is insecure in general and thus might be disabled in some systems. Also, the blake3 algorithm, or any algorithm other than sha256, might not be available on the server side or the client side. Thus algorithm negotiation might be required, which makes the protocol more complex.

@sargun
Contributor Author

sargun commented Jun 15, 2021

@shizhMSFT Have you read the updated proposal in the PR?

  1. It addresses (1), since it uses the mount endpoint and not the start-upload endpoint.
  2. The security model is important, and thus the proposal only applies to registries that have per-repository authz, and do not care about timing / disclosure attacks
  3. The PR does not try to propose a secure upload algorithm.

@sargun
Contributor Author

sargun commented Jun 29, 2021

Success: #275

@sargun sargun closed this as completed Jun 29, 2021
@sudo-bmitch
Contributor

Bit of a late follow-up, but I started testing this in my client-side code yesterday, pointing at registries that didn't support the feature yet, and with the registry:2 image and Docker Hub the behavior is the ideal fallback. Both return a Location header when I call a "mount" without a "from".

@jonjohnsonjr
Contributor

Thanks @sudo-bmitch that's great news.
