Manifest file #111
Comments
I've tested the signing part and it works. Next we need verification, blocked on sigstore/sigstore-python#628 |
note to myself. https://github.com/in-toto/attestation/blob/main/spec/v1/statement.md So either we use a dummy subject, or we move the predicate content to subject. I'm leaning towards the former, because the subject format is limited, whereas a custom predicate gives us room to evolve / version its format |
Btw regarding your question in #83 :
We can get a blob id from git which is basically a SHA1 that lets you know if the file has changed in a given revision. We also have a git diff endpoint to check changes between two revisions. Not sure we'd need an extra tool to create tags as all our endpoints should be accessible and usable with https://github.com/huggingface/huggingface_hub |
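For reference, the blob id mentioned above can be reproduced locally without a checkout. A minimal sketch, assuming a repository that uses git's default SHA-1 object format (SHA-256 repositories hash differently); `git_blob_id` is a name made up for this example, not part of `huggingface_hub`:

```python
import hashlib


def git_blob_id(content: bytes) -> str:
    """Return the object id git assigns to a blob with this content.

    Git hashes the header "blob <size>\\0" followed by the raw bytes.
    Assumes the SHA-1 object format.
    """
    header = b"blob " + str(len(content)).encode("ascii") + b"\0"
    return hashlib.sha1(header + content).hexdigest()


if __name__ == "__main__":
    # Should agree with `git hash-object` run on a file with the same bytes.
    print(git_blob_id(b"hello\n"))
```

Two revisions of a file with the same blob id have identical content, so the file would not need to be re-hashed.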
Thanks for the info. Do you think we'll need to build a specific tool / integration like a git hook to improve UX? (I can create a separate issue for this) |
Do you mean a pre-commit hook? I'm not sure people will have their git repos checked out, afaik people upload files via I guess we could think of integrating directly in |
Integrating with |
@TomHennen mentioned we could put the manifest content in the subject's resource descriptor's content field. |
Right that's definitely something that could work (and I'm happy to discuss it more if anyone is interested). However! After reading the initial description of this issue I'm pretty sure this can be accomplished with the existing in-toto statement type.
I think this can be supported easily and without any changes using the 'name' field as described here. AI model verifiers would check that the filename that changed matches the Either way, I think this is definitely a use case that https://github.com/in-toto/attestation wants to support, so if it's missing anything let's chat about it? CC @marcelamelara, @pxp928, @mikhailswift. |
I think @SantiagoTorres was also suggesting to use in-toto |
nope, that does not work. If you do that each file will be represented as an individual artifact. We don't want arbitrary tools to miss verifying some of the files. The way it needs to be interpreted for verification (for this specific case) is that the caller needs to give us a list of files that may be present. Say, PyTorch has config files, a list of model_*.bin, etc. The caller must give us this list, otherwise an attacker can remove files. Imagine if the config file is removed and it says to run the pickle.Unload() in a sandbox: an attacker would bypass signature verification entirely and get shell on the machine.
Although this technically works, I don't think it's a good approach either. This approach effectively adds an intoto layer that provides no benefits. Instead of being able to access the manifest content as One approach that could work is waiving the requirement of having a subject in the intoto attestation based on the predicateType. Would this be possible or would it go against the intoto framework? |
This doesn't sound like something that's specific to ML models but could occur in many other SW ecosystems. I'm not sure I entirely understand the workflow though. Is there anyplace I can learn more? |
I'm not sure. In the sense that we want to support both "full repository signing" (to start with) and "AI framework signing" that requires verifying a subset of the files signed for the "full repository signing". An example of this is Huggingface repository, where users sign all files under the repo. The repo often contains the same model in different formats. You can use Keras or PyTorch or another framework to load a model (using a subset of files; once we have integrated verification in these framework APIs) or you verify signature for the entire repo before loading it. So TL;DR: we want the existing signature for the full repo to be usable for verification by ML framework Load() APIs. However, for full repository signing, this is not ML specific, ie anyone who wants to sign a folder will need to verify the exact set of files. What's specific to ML is the size of these files (several 100 GB), so it's kinda specific to ML still in this sense. |
Is there anyplace I can find out more about what you're looking for? Any existing design docs? It sounds like the idea is "We have a lot of files with different paths and digests. Sometimes we want to check all of them, and sometimes we want to check a subset of them." Is there more to it than that? |
I think that's pretty much all there is to it, with specific use cases I mentioned above. Let me know if you have further questions. |
So it sounds like it's totally possible to represent a model in the in-toto Statement 'subject' field. Is the issue that the in-toto Statement rules are perceived as too restrictive and don't allow for all/subset use case? If so, can you say more about what that perceived limitation is? Something in https://github.com/in-toto/attestation/blob/main/docs/validation.md ? |
Chatted with Tom right now, and it seems in-toto can be used to model all of our scenarios. I'll do a deeper dig and add a comment with how things could work, at which point we can switch to in-toto, if it turns out there are no other blockers. |
Constraining ourselves to subject fields is not ideal imo. It basically adds technical debt early in the project, because the subject field is not flexible. If we ever want to evolve the format, we need to use annotations for that. It's also awkward to pack into the subject what ought to live in the predicate - a subject is for the artifact the attestation describes, not the attestation itself (defeats the purpose of having a predicate). I want to stress that putting things in subject means that each subject is considered an artifact, and can be verified independently of the others as per the intoto specs. That means other tools following intoto specs (cosign) would verify the attestation in an unsafe way, ie they would pass verification if they are given the wrong subset of files to verify. This problem is due to semantic differences, and unless the intoto specs change it can't be fixed iiuc. It's "possible" to express something via intoto, but that does not mean it's the right design if it does not benefit users and security. Looking forward to the proposal. |
Jumping into the discussion after following along so far. @laurentsimon based on your latest comment, what you really want to get out of a manifest predicate is to be able to make a claim along the lines of "this is the expected set of files for model X". Is that a reasonable interpretation of your use case? If so, could your use case be addressed by something like an existing BOM format? I also know that C2PA has the notion of manifests, is that something you've looked into as a possible predicate? I'm mostly trying to avoid us reinventing the wheel when it comes to developing predicates. I completely agree that the in-toto Statement subject is not designed to represent claims about artifacts. That said, what would the subject for this manifest predicate be? The model as a logical artifact still needs to be identified. What representation of the model would be used, would a reference to the hosting repo of the model files be sufficient?
To confirm my understanding here, your concern is that an attacker could omit legitimate files that are needed for a model to operate correctly? Wouldn't a manifest predicate be susceptible to the same problem? |
Hi all, dropping by. I spent a little bit of time thinking about it before I jumped in. My understanding is that this is "manageable" using a combination of ITE-4 + a custom hasher (as per the spec). For example:
Where whatever can be any specified way to "hash" such a predicate. If we wanted to be cute about it, we could use a MHT of the file directory. Even cuter, it could follow the git tree object structure to avoid reinventing the wheel. I'm not fully sure going through a fully qualified SBOM is a reasonable way to go, given that they are in quite a flux, and I believe here we are trying to focus on flow integrity rather than down-the-line transparency.
I'm not entirely sure I understand this part, you could certainly have a predicate that says "this was reviewed by a legal team" or "the dataset used is considered un-biased". Am I missing something here? |
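To make the MHT idea from the comment above a bit more concrete, here is a rough sketch of a Merkle root computed over per-file entries. Everything in it is an assumption made for illustration: the leaf encoding (`path`, a NUL byte, then the hex digest), the 0x00/0x01 domain-separation prefixes, and the choice to promote an unpaired node unchanged. It does not follow the git tree object layout or any published tree format.

```python
import hashlib


def _h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()


def leaf_for(path: str, file_sha256_hex: str) -> bytes:
    # Hypothetical leaf encoding: "<path>\0<hex digest>".
    return path.encode("utf-8") + b"\0" + file_sha256_hex.encode("ascii")


def merkle_root(leaves: list[bytes]) -> bytes:
    """Merkle root over the leaves, in the order given (sort paths first)."""
    if not leaves:
        return _h(b"")
    # Prefix leaves with 0x00 and inner nodes with 0x01 so that a leaf
    # can never be confused with an inner node.
    level = [_h(b"\x00" + leaf) for leaf in leaves]
    while len(level) > 1:
        nxt = []
        for i in range(0, len(level), 2):
            if i + 1 < len(level):
                nxt.append(_h(b"\x01" + level[i] + level[i + 1]))
            else:
                nxt.append(level[i])  # unpaired node is promoted as-is
        level = nxt
    return level[0]
```

When one file changes, only that leaf and the nodes on its path to the root need recomputing, provided the per-file digests of the unchanged files are cached.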
Yes. Eg: When we sign all files in a huggingface repo. We want to treat all files "together".
We have not. Thanks for the idea. We can explore these for inspiration.
Correct
There would be no subject. In effect we don't even need to wrap the manifest into an intoto statement. We'd simply use DSSE with this manifest schema inside. The semantics would be different from the intoto model. The hack I was alluding to in the issue description was to either use an empty subject or a dummy hash value, neither of which is particularly appealing :/
The manifest effectively contains the list of files that comprise the model, so the manifest is the representation of the model.
It depends how we define the semantics for this manifest schema. If the schema lists all files and the verification semantics say that the exact same files listed in the manifest MUST match for verification to succeed, then we don't have this problem. (Here I'm assuming we don't use intoto at all. If we do use intoto, we still have this problem I think). |
"subject": {"name": "inline+aipredicate://", "sha256sum": "whatever"} What would you hash in this case? If we hash the predicate, we have to canonicalize the JSON predicate, which we should avoid.
If we hash the directory to identify the model, why use a manifest that lists the files? We'd end up hashing twice: for the manifest and for the model identification. We're trying to avoid serializing the directory because we want to support fast re-hashing, eg when a README file changes. Serializing the directory is what the current implementation of this repo does, and we've decided (with feedback from folks at huggingface) that it's not the right approach. |
So, I think the root of the semantic problem boils down to "Reject if matchedSubjects is empty" and what you need is more flexibility? To me it seems like one thing that might be tripping us up is that policy is getting mixed up with how we make statements about things? Is it correct to say that, if we ignore the opinionated validation model, it is possible to express all the files in a model in an in-toto statement and to express the properties of that model in the predicate? Could this be resolved by relaxing the validation model, and giving verifiers (and policy owners) more freedom to say "all files must be verified and present" (among other things)? |
I think this should use Statement and not invent some new format.
This is false. I think you are misreading the specification.
It is unspecified how to verify a collection of artifacts because that will be dependent on each use case. In this particular case, the verifier should simply check all files. It would be crazy to allow a collection of files {A, B, C} if any match. I can't imagine a scenario where that would be acceptable. |
If the use case works for Statement, for sure. Can you please address the various concerns we highlighted in the previous comments?
Here's what the specs say:
|
Can you enumerate those concerns? The only one I can find is the misconception that we are discussing now.
Right, that is what you do for each artifact. Again, I'll quote:
Perhaps you are confused about "subject": [
{"digest": {"sha256": "abcd1234"}, "name": "A"},
{"digest": {"sha512": "98765432"}, "name": "B"}
] or you could have a subject that contains file copies with identical hashes: "subject": [
{"digest": {"sha256": "abcd1234"}, "name": "A"},
{"digest": {"sha256": "abcd1234"}, "name": "copy-of-A"}
] We should update the spec to avoid this confusion, but to be clear, nothing in the spec says how to verify a collection of artifacts. It's unspecified. Let's use a concrete example: TensorFlow SavedModel. Suppose the attestation were this: "subject": [
{"digest": {"sha256": "aaaa"}, "name": "assets/foo"},
{"digest": {"sha256": "bbbb"}, "name": "variables/variables.data-00000-of-00002"},
{"digest": {"sha256": "cccc"}, "name": "variables/variables.data-00001-of-00002"},
{"digest": {"sha256": "dddd"}, "name": "variables/variables.index"},
{"digest": {"sha256": "eeee"}, "name": "saved_model.pb"}
]
If you want to use the logic that all of the files must match, with no renames, additions, or deletions, then you'd do something like this:
    def verify_all(artifactsToVerify: Struct[name, digest], attestation, *args):
        # Verify that all of the files on disk have a corresponding entry in the subject.
        for a in artifactsToVerify:
            matchedSubjects, ... = verify_single(a.digest, *args)
            if a.name not in [s.name for s in matchedSubjects]:
                error
        # Verify that there are no missing files.
        for s in attestation.statement.subject:
            if s.name not in [a.name for a in artifactsToVerify]:
                error
Now if you want to be less strict, you'd need to figure out what policy you want to apply. Do you allow renames? Added files? Removed files? That's all OK - it's up to the verifier logic. For example:
Note that |
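For anyone who wants to experiment with the strict policy described in the pseudocode above, here is a small runnable sketch. It assumes the envelope signature has already been verified elsewhere, that every subject carries a `sha256` digest, and that the caller has already hashed the files on disk. `FileEntry` is a type invented for this example; it is not part of in-toto or this repo.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FileEntry:
    name: str    # path relative to the model root
    sha256: str  # hex digest of the file as found on disk


def verify_all(on_disk: list[FileEntry], subjects: list[dict]) -> None:
    """Raise ValueError unless the files on disk match the subjects exactly.

    Strict mode: no renames, no additions, no deletions, no changed content.
    """
    expected = {s["name"]: s["digest"]["sha256"] for s in subjects}
    actual = {f.name: f.sha256 for f in on_disk}
    # A repeated name would silently collapse in the dicts above.
    if len(expected) != len(subjects) or len(actual) != len(on_disk):
        raise ValueError("duplicate names are not allowed in strict mode")

    unexpected = sorted(actual.keys() - expected.keys())  # added or renamed
    missing = sorted(expected.keys() - actual.keys())     # removed files
    changed = sorted(
        n for n in actual.keys() & expected.keys() if actual[n] != expected[n]
    )
    if unexpected or missing or changed:
        raise ValueError(
            f"unexpected={unexpected} missing={missing} changed={changed}"
        )
```

A less strict policy would relax one of the three checks, for example ignoring `unexpected` to tolerate added files.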
Thanks @MarkLodato . That is the crux of the problem.
That would be nice, yes. What you're saying is that it's up to the predicateType to dictate the semantic verification? I was looking for a claim like this but could not find it. Another problem is https://github.com/in-toto/attestation/blob/main/spec/v1/statement.md In our case, each file is not a software artifact. The model artifact is the set of files. A (published) package is made up of multiple artifacts (like in a release attestation); but the ML artifact is made up of a collection of files. |
That's just a terminology nitpick. Our definition of "artifact" is "an immutable blob of data". You could represent an ML model as either a collection of artifacts (one per file) or as a single artifact (one hash over all the files). Either way works.
What specifically do you want to put in there? Could you give a concrete example? |
If that works it's fine. I would encourage updating the terminology, because I did not see this defined. All the use cases I've seen in intoto involve the subject being an independent artifact. Other intoto maintainers on the thread, can you confirm that the predicateType dictates how to verify the subject list?
You can't know the unknown, and we have to plan for it (that's technical debt). But to give an example, we don't really need the intoto Statement to make things work. We don't need an additional intoto wrap / level of indirection. Instead, we can use: DSSE: {
"payload": "...",
"payloadType": "model/signing",
"signatures": [...]
} And manifest: {
// Some global fields
"version": x,
"some-other-property":
// The files in this model.
"files": [
{
"path":
// Any field we want, not tied to the ResourceDescriptor
"whatever_we_want":
}
]
} That's the simple alternative to using an intoto statement for our specific use case. That's the solution other ecosystems use (Java, npm). We don't have to separate the files from the rest of the manifest / predicate. What are the advantages of using intoto in this use case? Again, we're not against using intoto (we created this issue in the first place mentioning it!). We're only trying to gather pros and cons for each approach to help in the final decision. One argument in favor of intoto for this use case is that it may simplify upgrading tooling to support SLSA provenance (since provenance uses intoto format). |
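To illustrate the DSSE-only option sketched above, here is a minimal example of wrapping a manifest in an envelope. The pre-authentication encoding (PAE) follows the DSSE spec; the `payloadType` is the one from the comment. The manifest fields used in the demo (`version`, `files`, `path`, `sha256`) are placeholders, and `fake_sign` is a stand-in that is NOT a real signature; in practice the PAE bytes would be handed to an actual signer.

```python
import base64
import hashlib
import json
from typing import Callable

PAYLOAD_TYPE = "model/signing"  # payloadType taken from the comment above


def pae(payload_type: str, payload: bytes) -> bytes:
    """DSSE pre-authentication encoding: the bytes that actually get signed."""
    type_bytes = payload_type.encode("utf-8")
    return b" ".join([
        b"DSSEv1",
        str(len(type_bytes)).encode("ascii"),
        type_bytes,
        str(len(payload)).encode("ascii"),
        payload,
    ])


def make_envelope(manifest: dict, sign: Callable[[bytes], bytes]) -> dict:
    """Serialize the manifest and wrap it, with a signature, in an envelope."""
    payload = json.dumps(manifest).encode("utf-8")
    signature = sign(pae(PAYLOAD_TYPE, payload))
    return {
        "payload": base64.b64encode(payload).decode("ascii"),
        "payloadType": PAYLOAD_TYPE,
        "signatures": [{"sig": base64.b64encode(signature).decode("ascii")}],
    }


def read_manifest(envelope: dict) -> dict:
    """Decode the payload. Callers must verify the signature first."""
    return json.loads(base64.b64decode(envelope["payload"]))


if __name__ == "__main__":
    # Placeholder only: a hash is not a signature.
    fake_sign = lambda message: hashlib.sha256(message).digest()
    manifest = {"version": 1, "files": [{"path": "model.bin", "sha256": "aaaa"}]}
    print(json.dumps(make_envelope(manifest, fake_sign), indent=2))
```

Because the signature covers the exact payload bytes carried in the envelope, the JSON never needs to be canonicalized for verification.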
I'm not sure it's even up to the predicateType? We have existing use cases where it's the user (or the tooling they're using) that would know best what to do. E.g. it might be a matter of policy. |
So that you don't have to roll your own format and tooling. It's the advantage of any standard. Since there are no extra fields that you anticipate needing, and there is a way to add them should you ever need them, I strongly recommend using the in-toto Statement here. It does everything you need. |
The problem is that there is no tooling that supports what we're trying to achieve, ie we're rolling our own. The verification semantics are also our own. I can argue that the JSON format we use is an implementation detail. Existing tooling is a source of risk since it won't understand our intoto predicate and will probably screw up verification because they interpret the subject field as |
Seconded. The in-toto attestation spec does not dictate that the subject artifact(s) MUST be executable code. It's meant to be quite generic and apply to any immutable blob of data (e.g., files, binaries, packages, even other attestations).
We'd definitely like to edit the spec to address the concerns here. So we're clear on what to change: Are you looking for more explicit language on validation based on subject artifacts, more clarity around the |
I'm not sure it is. Take the example of SLSA provenance. If users were to set subject entry per model file, how would an end-user know they need to verify all of them? The description of the verification only says that statement’s subject matches the digest of the artifact in question. There seems to be an implicit assumption that an artifact is "self-contained". Do you envisage creating different predicates for SLSA provenance?
I suppose giving concrete examples for different use cases, in particular the one that requires all subjects to match. Is it dictated by the predicate type (my question) or the policy (the verification mentions a "policy engine" and @TomHennen also mentioned that)? Could you give an example for SLSA as an example? Maybe also provide some real code would help too. How about taking this discussion to a tracking issue on intoto repo? Note: I've not looked at whether existing tooling like cosign outputs |
To give a concrete example. Say we have recipients RSet and ROneOf. RSet needs to verify a set of files. ROneOf only has a use case with single-file artifacts. An attacker gives ROneOf a file with a provenance intended for RSet. ROneOf accepts it, because it's unaware that verification requires set verification. I think that's an attack we want to protect against, correct?
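The scenario above can be reproduced in a few lines. This is a toy sketch, assuming the signature on the attestation has already been checked and that digests are `sha256`; `r_one_of_accepts` and `r_set_accepts` are names invented here to mirror ROneOf and RSet, not functions from any existing tool.

```python
def matched_subjects(artifact_sha256: str, subjects: list[dict]) -> list[dict]:
    """Single-artifact matching: every subject whose digest equals the artifact's."""
    return [s for s in subjects if s["digest"].get("sha256") == artifact_sha256]


def r_one_of_accepts(artifact_sha256: str, subjects: list[dict]) -> bool:
    """ROneOf: accept as soon as the one artifact matches some subject."""
    return bool(matched_subjects(artifact_sha256, subjects))


def r_set_accepts(files: dict[str, str], subjects: list[dict]) -> bool:
    """RSet: accept only if names and digests match the subject list exactly."""
    expected = {s["name"]: s["digest"]["sha256"] for s in subjects}
    return len(expected) == len(subjects) and files == expected


if __name__ == "__main__":
    # Attestation intended for RSet: the model is config.json plus model.bin.
    subjects = [
        {"name": "config.json", "digest": {"sha256": "1111"}},
        {"name": "model.bin", "digest": {"sha256": "2222"}},
    ]
    # The attacker strips config.json and hands over model.bin on its own.
    print("ROneOf accepts:", r_one_of_accepts("2222", subjects))
    print("RSet accepts:", r_set_accepts({"model.bin": "2222"}, subjects))
```

Nothing in the subject list itself tells ROneOf that the attestation was meant to be verified as a set, which is the gap being discussed.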
To add context. The manifest we created in this issue can be signed and added to a Sigstore bundle using |
This is the middle layer of the API design work (sigstore#172). We add a manifest abstract class to represent various manifests (sigstore#111 sigstore#112) and also ways to serialize a model directory into manifests and ways to verify the manifests. For now, this only does what was formerly known as `serialize_v0`. The v1 and the manifest versions will come soon. Note: This has a lot of inspiration from sigstore#112, but makes the API work with all the usecases we need to consider right now. Signed-off-by: Mihai Maruseac <[email protected]>
) * Migrate `serialize_v0` to new API. This is the middle layer of the API design work (#172). We add a manifest abstract class to represent various manifests (#111 #112) and also ways to serialize a model directory into manifests and ways to verify the manifests. For now, this only does what was formerly known as `serialize_v0`. The v1 and the manifest versions will come soon. Note: This has a lot of inspiration from #112, but makes the API work with all the usecases we need to consider right now. Signed-off-by: Mihai Maruseac <[email protected]> * Clarify some comments Signed-off-by: Mihai Maruseac <[email protected]> * Encode name with base64 Signed-off-by: Mihai Maruseac <[email protected]> * Add another test case Signed-off-by: Mihai Maruseac <[email protected]> * Empty commit to retrigger DCO check. See dcoapp/app#211 (comment) Signed-off-by: Mihai Maruseac <[email protected]> --------- Signed-off-by: Mihai Maruseac <[email protected]>
Have we made a final decision on the format of this manifest file? It appears we're settling on a DSSE envelope consisting of intoto statements. However, I've also seen reference to leveraging the |
We agreed on using sigstore bundle as the "wire format". The "manifest" refers to the DSSE payload, which is inside the sigstore bundle. |
THIS IS DRAFT, WIP. Will split into separate PRs once it works. But posting publicly to show what the plans are (sigstore#224, sigstore#248, sigstore#240, sigstore#111). Signed-off-by: Mihai Maruseac <[email protected]>
Our current code signs / serializes folders using a custom hash built using sha256. It works well but has 3 disadvantages:
A workaround is to create and sign a dedicated "manifest" file that lists all files within a directory with their corresponding hash (similar to sha256sum output), rather than the output of the folder serialization. With our current code, however, this creates 2 files: the manifest and a signature (on the manifest), which is bad UX-wise, especially for single-file models.
What we want is a single file. I think a solution to this problem is to use a DSSE envelope (supported by Sigstore and widely used by tools like cosign). DSSE lets us define a payload, which is stored in the envelope alongside the signature. This payload is the content of our manifest. We can use intoto / json as a format since it's widely adopted.
sigstore-python added support for signing intoto statements in their main branch (yet to be released), so I think we can use that. All we need is to define our own predicate and its format, something like:
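As a rough illustration of such a manifest, the sketch below walks a directory and records a SHA-256 digest per file. The entry layout (`path`, `sha256`) is an assumption made for this example, not a format agreed on in this issue, and symlinks are simply followed.

```python
import hashlib
from pathlib import Path


def hash_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """SHA-256 of one file, read in chunks so large model files fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root: Path) -> list[dict]:
    """One entry per regular file under root, sorted by relative POSIX path."""
    relative = sorted(
        p.relative_to(root).as_posix() for p in root.rglob("*") if p.is_file()
    )
    return [{"path": rel, "sha256": hash_file(root / rel)} for rel in relative]


if __name__ == "__main__":
    for entry in build_manifest(Path(".")):
        print(entry["sha256"], entry["path"])
```

Since each file is hashed independently, a change to a README only requires re-hashing that one file, as long as the other digests are kept around.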