Preparatory splitstream format changes for ostree support #185
base: main
3803dc3 to
9aedd96
Compare
crates/composefs-oci/src/skopeo.rs
Outdated
use crate::{sha256_from_descriptor, sha256_from_digest, tar::split_async, ContentAndVerity};

pub const TAR_LAYER_CONTENT_TYPE: u64 = 0x2a037edfcae1ffea;
Can you add a comment for where these came from? I'm guessing random? If so just add a comment // Random unique ID ?
That said, I wonder if it wouldn't be nicer to store (variable-length) strings for this in the format? It could even go as far as suggesting that the value literally be the mediaType from OCI (if applicable).
Yes. These are random.
But, I'd rather avoid having variable length things in the header. That makes parsing it much more tricky.
We could make it a real uuid tho
I'm fine keeping as u64 too.
I added some comments here
crates/composefs/src/repository.rs
Outdated
pub fn has_named_stream(&self, name: &str) -> bool {
    let stream_path = format!("streams/refs/{}", name);

    readlinkat(&self.repository, &stream_path, []).is_ok()
I don't like "swallowing" errors like this, I'd say call stat instead and require it's S_IFLNK
I redid this with stat()
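A minimal sketch of the stat-based check suggested above, for illustration only. It uses `std::fs::symlink_metadata` (an `lstat`-style call) on a plain path as a stand-in for the repository-relative call in the PR; the function name `is_named_ref` and the choice to treat a non-symlink entry as an error are assumptions, not the code that was merged.

```rust
use std::fs;
use std::io;
use std::path::Path;

/// Reports whether `path` is a symbolic link without swallowing errors.
///
/// - `Ok(true)`: the entry exists and is a symlink
/// - `Ok(false)`: nothing exists at `path`
/// - `Err(_)`: any other failure, including a non-symlink entry
///   occupying the name (an assumption made for this sketch)
fn is_named_ref(path: &Path) -> io::Result<bool> {
    match fs::symlink_metadata(path) {
        Ok(meta) if meta.file_type().is_symlink() => Ok(true),
        Ok(_) => Err(io::Error::new(
            io::ErrorKind::InvalidData,
            format!("{} exists but is not a symlink", path.display()),
        )),
        // Only "not found" is mapped to `false`; permission errors and
        // the like are passed on to the caller.
        Err(e) if e.kind() == io::ErrorKind::NotFound => Ok(false),
        Err(e) => Err(e),
    }
}
```

The point of the shape is that a caller can tell "the ref does not exist" apart from "the lookup itself failed", which `.is_ok()` on a readlink result cannot do.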
crates/composefs/src/splitstream.rs
Outdated
#[derive(Clone, Debug, FromBytes, Immutable, IntoBytes, KnownLayout)]
#[repr(C)]
pub struct MappingEntry {
    pub body: Sha256Digest,
If we're changing the format here...I think it'd be nice to make this one extensible.
However...bigger picture there's another consideration: There's obviously a metric ton of binary serialization formats out there. A custom one isn't wrong necessarily but...how about say CBOR ? It has some usage and a proper RFC etc.
I guess a dividing line is "do we care about mmap()"? Probably not?
splitstreams are essentially thin wrappers of existing binary formats (tars, ostree objects, etc), adding just references to other composefs repo objects. I'm not sure it's overly helpful to use a complicated binary format for the wrapping, especially one which is completely different from the inner format.
That said, I agree that we should make it at least a bit extensible. This MR adds a magic header, but also adding a version field and a few bytes of unused/unparsed space does seem quite useful.
We discussed mmap, and the end result was, no, we don't want it.
splitstreams are essentially thin wrappers of existing binary formats (tars, ostree objects, etc), adding just references to other composefs repo objects. I'm not sure it's overly helpful to use a complicated binary format for the wrapping, especially one which is completely different from the inner format.
Yeah, though the nice thing about Rust here is that for stuff like this there's a lot of well-done crates.
It also makes it a lot more obvious and easy to parse from other languages too if we can say "it's just CBOR" (or whatever).
Anyways: I'm basically fine with this as is too.
Although what about the algorithm agility? There's been some thoughts that for post-quantum crypto we may need to get away from sha256 in theory as far as I understand things.
I think the right thing is to add a header size field, and skip parts we don't understand. Then we can easily extend this later.
I added a new extension_size field which we skip on read.
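To illustrate the "size field, skip what we don't understand" idea from this thread, here is a hedged sketch of a reader for a toy header. The layout (8-byte magic, then a little-endian `u32` extension size, then that many extension bytes) and the names `MAGIC` and `read_header` are assumptions invented for the example; the real header in this PR is not reproduced here.

```rust
use std::io::{self, Read};

/// Hypothetical magic value, for illustration only.
const MAGIC: [u8; 8] = *b"EXAMPLE\0";

/// Reads the toy header and skips the extension area, leaving the reader
/// positioned at the first byte after it. Returns the number of
/// extension bytes that were skipped.
fn read_header<R: Read>(reader: &mut R) -> io::Result<u32> {
    let mut magic = [0u8; 8];
    reader.read_exact(&mut magic)?;
    if magic != MAGIC {
        return Err(io::Error::new(io::ErrorKind::InvalidData, "bad magic"));
    }

    let mut size = [0u8; 4];
    reader.read_exact(&mut size)?;
    let extension_size = u32::from_le_bytes(size);

    // An older reader does not know what the extension bytes mean, so it
    // discards exactly `extension_size` of them and carries on.
    let wanted = u64::from(extension_size);
    let skipped = io::copy(&mut reader.by_ref().take(wanted), &mut io::sink())?;
    if skipped != wanted {
        return Err(io::Error::new(
            io::ErrorKind::UnexpectedEof,
            "truncated extension area",
        ));
    }
    Ok(extension_size)
}
```

A newer writer can then put extra fields into the extension area without breaking older readers, as long as the meaning of the fixed part stays the same.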
057121b to
bed66dc
Compare
allisonkarlitskaya
left a comment
Let's talk about this. I might have a bit of bandwidth to work on this if you like.
doc/splitstream.md
Outdated
struct SplitstreamHeader {
    magic: [u8; 7], // Contains SPLITSTREAM_MAGIC
    algorithm: u8, // The fs-verity algorithm used, 1 == sha256, 2 == sha512
We should also do fs-verity block size here. That's usually expressed as a bit-shift count, so 12 or 16...
We could also write it like "fsverity-sha256-12" or so as a string... some relevant discussion in #181.
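As a sketch of the string form floated above, the following parses and formats names such as `fsverity-sha256-12`, where the trailing number is the block size expressed as a bit-shift count. The type names `HashAlgorithm` and `VerityParams`, and the decision to reject shift values of 64 or more, are assumptions made for the example; nothing here is taken from the PR's code.

```rust
/// Hash algorithms named in the review comment.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum HashAlgorithm {
    Sha256,
    Sha512,
}

/// Hypothetical pairing of algorithm and log2 block size.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct VerityParams {
    algorithm: HashAlgorithm,
    log_block_size: u8,
}

impl VerityParams {
    /// Block size in bytes, e.g. a shift of 12 gives 4096.
    fn block_size(&self) -> u64 {
        1u64 << self.log_block_size
    }

    /// Formats as "fsverity-<algorithm>-<shift>".
    fn to_name(&self) -> String {
        let algorithm = match self.algorithm {
            HashAlgorithm::Sha256 => "sha256",
            HashAlgorithm::Sha512 => "sha512",
        };
        format!("fsverity-{}-{}", algorithm, self.log_block_size)
    }

    /// Parses the string form; returns None for anything unrecognized.
    fn parse(name: &str) -> Option<Self> {
        let rest = name.strip_prefix("fsverity-")?;
        let (algorithm, shift) = rest.rsplit_once('-')?;
        let algorithm = match algorithm {
            "sha256" => HashAlgorithm::Sha256,
            "sha512" => HashAlgorithm::Sha512,
            _ => return None,
        };
        let log_block_size: u8 = shift.parse().ok()?;
        // A shift of 64 or more would overflow `block_size()`.
        if log_block_size >= 64 {
            return None;
        }
        Some(VerityParams { algorithm, log_block_size })
    }
}
```

Under these assumptions, `VerityParams::parse("fsverity-sha256-12")` yields a SHA-256 configuration with a 4096-byte block size, and `to_name()` round-trips it to the same string.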
doc/splitstream.md
Outdated
    n_refs: u64,
    n_mappings: u64,
    refs: [ObjectID; n_refs] // sorted
    mappings: [MappingEntry; n_mappings] // sorted by body
It's so sketch that we hardcode sha256 here... I think that's probably OK, but maybe we'd add an extension mechanism so we could add new types of mappings tables...
Something like this from the start:
- magic
- n_sections
- n_sections * (
- section_start
- section_size_in_bytes
- )
We could name the sections but I think it's quite OK to just know what the numbers mean and require that they're all present, in order. An empty section would be denoted by a zero size.
We could also get into compat vs. uncompat extensions... not sure how far it's worth going here...
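The section-table proposal above can be sketched roughly as follows. This assumes an 8-byte magic, a little-endian `u64` section count, and one `(start, size)` pair of little-endian `u64` values per section, with offsets measured from the start of the file; those widths, the `MAGIC` value, and the function names are all hypothetical choices for the example.

```rust
use std::convert::TryFrom;
use std::ops::Range;

/// Hypothetical magic value, for illustration only.
const MAGIC: [u8; 8] = *b"EXAMPLE\0";

/// Reads a little-endian u64 at `offset`, or None if out of bounds.
fn read_u64(bytes: &[u8], offset: usize) -> Option<u64> {
    let end = offset.checked_add(8)?;
    let chunk = <[u8; 8]>::try_from(bytes.get(offset..end)?).ok()?;
    Some(u64::from_le_bytes(chunk))
}

/// Parses the section table and returns one byte range per section.
/// An empty section is simply a range of length zero.
fn parse_sections(file: &[u8]) -> Option<Vec<Range<usize>>> {
    if file.get(..MAGIC.len())? != &MAGIC[..] {
        return None;
    }
    let n_sections = usize::try_from(read_u64(file, MAGIC.len())?).ok()?;

    let mut sections = Vec::new();
    for i in 0..n_sections {
        // Each table entry is 16 bytes: section_start, section_size_in_bytes.
        let entry = MAGIC.len() + 8 + i.checked_mul(16)?;
        let start = usize::try_from(read_u64(file, entry)?).ok()?;
        let size = usize::try_from(read_u64(file, entry.checked_add(8)?)?).ok()?;
        let end = start.checked_add(size)?;
        if end > file.len() {
            return None;
        }
        sections.push(start..end);
    }
    Some(sections)
}
```

Because sections are identified only by position in the table, a reader that knows the first N section meanings can ignore any extra entries a newer writer appends.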
We should get this in probably very soon and then I think declare the format stable?
I have a branch..... lol
One of the things that I'm tormenting myself on a bit right now is the sha256 mapping. I'm considering changing it to a general-purpose "named object reference" mapping: we could then have a hashmap mapping names like "sha256:12345" to the object ID in question, and adding sha512 would be seamless. I think I'd choose to encode that as a nul-separated series of strings of the form
The alternative is to stay with what we have now more or less, but it's much less flexible and is gonna be cruft one day, I'm sure. That being said: we have an extensibility mechanism now, at least...
The other thing that really needs fixing in @alexlarsson's work vs. the current version of the branch is that we should really take advantage of the fact that we have the references array out front now and use indexes into it from the splitstream content instead of repeating the whole object ID. It would have the additional advantage of ensuring that it became physically impossible to refer to an object that wasn't listed in the "depends" header (which we'll use during GC) and yet another advantage in that we could use the 64bit "starting word" for each internal/external section in the stream for both cases:
(or with the high bit as a flag or whatever). It's just "work" to get this over the line....
After that, I think we need to figure out a way to kill off the content-sha256-based naming of splitstreams and perhaps even consider getting rid of the streams/ directory entirely... each backend would have its own way of 'caching' interesting splitstream objects for itself. I'm not entirely sure how I'd do that for the OCI backend. In a related conversation with @Johan-Liebert1 we discussed having the layers (and possibly the config) ((and possibly possibly the manifest some day)) referenced from the erofs image itself as a way to optionally prevent those things from being GC'd. We'd probably want some sort of a better "lookup table" still, though... but the key difference is that this table would be unique to OCI, not some global "streams" directory that we try to pretend is sharable by everyone on equal footing...
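The comment's own illustration of the "starting word" did not survive in this capture, so the following is only a sketch of the "high bit as a flag" variant it mentions in passing. It assumes that a clear high bit means "inline data of this many bytes" and a set high bit means "external object at this index in the references array"; the names `EXTERNAL_FLAG`, `Chunk`, `encode`, and `decode` are hypothetical, and this is not claimed to be the encoding the PR settled on.

```rust
/// Hypothetical flag: set when the word refers to an external object.
const EXTERNAL_FLAG: u64 = 1 << 63;

/// One section of the stream, as described by its 64-bit starting word.
#[derive(Debug, PartialEq, Eq)]
enum Chunk {
    /// `size` bytes of inline data follow the word.
    Inline { size: u64 },
    /// The content lives in the object at `ref_index` in the refs array.
    External { ref_index: u64 },
}

/// Packs a chunk description into one word; None if the value would
/// collide with the flag bit.
fn encode(chunk: &Chunk) -> Option<u64> {
    match *chunk {
        Chunk::Inline { size } if size & EXTERNAL_FLAG == 0 => Some(size),
        Chunk::External { ref_index } if ref_index & EXTERNAL_FLAG == 0 => {
            Some(ref_index | EXTERNAL_FLAG)
        }
        _ => None,
    }
}

/// Unpacks a word. An external index must fall inside the references
/// array, so a stream cannot point at an object it did not declare.
fn decode(word: u64, n_refs: u64) -> Option<Chunk> {
    if word & EXTERNAL_FLAG == 0 {
        return Some(Chunk::Inline { size: word });
    }
    let ref_index = word & !EXTERNAL_FLAG;
    if ref_index < n_refs {
        Some(Chunk::External { ref_index })
    } else {
        None
    }
}
```

The bounds check in `decode` is what gives the property described in the comment: every external reference necessarily resolves to an entry in the array that GC reads.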
bed66dc to
bcafffa
Compare
use hex::FromHexError;
use sha2::{digest::FixedOutputReset, digest::Output, Digest, Sha256, Sha512};
use std::cmp::Ord;
I would move this change to a separate commit
Actually, I think we can drop this. We're not sorting the objects arrays anymore.
I took a look at this, and the file format looks ok to me. I don't think the content type is needed anymore with the new structured names though.
let config = if verity.is_none() {
    // No verity means we need to verify the content hash
    let mut data = vec![];
    stream.read_to_end(&mut data)?;
Maybe we could use std::io::copy for the hashing?
Not a bad idea, although it means we can't use our existing hash() function...
I tried to do this but I don't think it's a good idea: we need the data for two things, the hashing and the creation of the ImageConfiguration. Reading it once into a buffer lets us do both.
c4b6ae7 to
65ade39
Compare
.github/workflows/bootc-revdep.yml
Outdated
  - name: Build sealed bootc image
    working-directory: bootc
-   run: just build-sealed
+   run: just build
Sorry it's actually just variant=composefs-sealeduki-sdboot build now.
Thanks!
I just ran into this too and then remembered it from here. I threw up bootc-dev/bootc#1787 to propose bringing back the old target as a wrapper.
| ```rust | ||
| struct FileRange { | ||
| start: U64, |
Why is this U uppercase?
That's from zerocopy: it's a little endian u64.
Are you asking for a clarifying note in the docs? That's a good suggestion. I actually thought I had added one...
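For readers unfamiliar with the idea behind a fixed-endian integer type, here is a std-only stand-in that shows the concept. `LeU64` is a hypothetical name invented for this sketch and is not the zerocopy type; it simply stores the value as eight little-endian bytes so the in-memory layout matches the on-disk layout on any host.

```rust
/// Std-only stand-in for a little-endian u64 field (hypothetical name).
/// Storing raw bytes means the struct has alignment 1 and the same
/// byte layout on little- and big-endian machines.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
#[repr(transparent)]
struct LeU64([u8; 8]);

impl LeU64 {
    /// Converts a native integer into its little-endian byte form.
    fn new(value: u64) -> Self {
        LeU64(value.to_le_bytes())
    }

    /// Converts the stored bytes back into a native integer.
    fn get(self) -> u64 {
        u64::from_le_bytes(self.0)
    }
}
```

A struct built from fields like this can be written to or read from a file byte for byte, which is why the documentation snippet spells the field type with an uppercase `U`.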
The thing I'll raise here again is I'm still uncertain about the "centrality" of splitstream for composefs-rs; this previously was discussed in #14 (comment) To continue and flesh that out, I still feel it'd be a lot easier to understand if what we always parsed was EROFS images, and splitstream was just a lookaside that could help map from the EROFS back bit-for-bit only for heavyweight serializations like tar - and would only be used when we actually needed to return back a tar. Storing small data objects like manifests and config JSON would just be saved bit for bit as is.
Basically everyone dealing with composefs+OCI needs to deal with two binary serialization formats: tar and erofs. tar-split is funnily enough not binary but perhaps arguably should be...man I hadn't looked at this in a while but the choice of JSON with base64 encoded payloads is a bit wasteful...but eh, I guess compression still does well with it anyways. Anyways splitstream is a third binary format competing for mental bandwidth with both tar and erofs.
Splitstream, at its core, is really just tar-split, fs-verity content-addressed, in binary for performance reasons, with enough information for GC. A small addition is named references for things like OCI configs.... tar-split exists in the container space for a good reason.... and we've discussed this before. The canonical representation of the container depends on the ability to reconstruct the tar stream: and this isn't just for uploading: you need to be able to reconstruct that stream to validate it for local use as well... You've mentioned a few times that you think canonical tar might be a better way here (so that we can rebuild the tar without storing the original tar stream in some fashion) but I'm not sure how we square that with the world as it is... So, splitstream...
We need some new features in systemd-repart and mkfs.ext4. We were pulling those from feature branches and commit IDs in the past, but these features are now available in stable releases. Build those release versions instead.
Signed-off-by: Allison Karlitskaya <[email protected]>
This patch is machine-generated.
Signed-off-by: Allison Karlitskaya <[email protected]>
This is only used from tests and not exported, so conditionalize it.
Signed-off-by: Allison Karlitskaya <[email protected]>
This wasn't yet stabilized when the code was first written but newer patches have already added the use of this function in other parts of the same file, so this ought to be safe by now.
Signed-off-by: Allison Karlitskaya <[email protected]>
This comment is an overview, so move it to a higher level. Change it a bit to make it more accurate. Also: move the declaration of the buffer outside of the loop body to avoid having to re-zero it each time.
Signed-off-by: Allison Karlitskaya <[email protected]>
We're going to start referring to OCI images by their names starting with `sha256:` (as podman names them in the `--iidfile`) soon. Skopeo doesn't like that, so add a workaround. This will soon let us get rid of some of the hacking about we do in our `examples/` build scripts.
Signed-off-by: Allison Karlitskaya <[email protected]>
Let's just have users write the padding as a separate inline section after they write the external data. This makes things a lot easier and reduces thrashing of the internal buffer.
Signed-off-by: Allison Karlitskaya <[email protected]>
This is a substantial change to the splitstream file format to add more
features (required for ostree support) and to add forwards- and
backwards- compatibility mechanisms for future changes. This change
aims to finalize the file format so we can start shipping this to the
systems of real users without future "breaks everything" changes.
This change itself breaks everything: you'll need to delete your
repository and start over. Hopefully this is the last time.
The file format is substantially more fleshed-out at this point. Here's
an overview of the changes:
- there is a header with a magic value, a version, a flags field, and the
fs-verity algorithm number and block size in use
- everything else in the file can be freely located which will help if
we ever want to create a version of the writer that streams data to
disk as it goes: in that case we may want to store the stream before
the associated metadata
- there is an expandable "info" section which contains most other
information about the stream and is intended to be used as the primary
mechanism for making compatible changes to the file format in the
future
- the info section stores the total decompressed/reassembled stream
size and a unique identifier value for the file type stored in the
stream
- the referenced external objects and splitstreams are now stored in a
flat array of binary fs-verity hash values to improve the performance
of garbage collection operations in large repositories (informed by
Alex's battlescars from dealing with GC on Flathub)
- it is possible to add arbitrary external object and stream references
- the "sha256 mapping" has been replaced with a more flexible "named
stream refs" mechanism that allows assigning arbitrary names to
associated streams. This will be useful if we ever want to support
formats that are based on anything other than SHA-256 (including
future OCI versions which may start using SHA-512 or something else).
- whereas the previous implementation concerned itself with ensuring
the correct SHA-256 content hash of the stream and creating a link to
the stream with that hash value from the `streams/` directory, the new
implementation requires that the user perform whatever hashing they
consider appropriate and name their streams with a "content
identifier".
This change, taken together with the above change, removes all SHA-256
specific logic from the implementation.
The main reason for this change is that a SHA-256 content hash over a
file isn't a sufficiently unique identifier to locate the relevant
splitstream for that file. Each different file type is split into a
splitstream in a different way. It just so happens that OCI JSON
documents, `.tar` files, and GVariant OSTree commit objects have no
possible overlaps (which means that SHA-256 content hashes have
uniquely identified the files up to this point), but this is mostly a
coincidence. Each file type is now responsible to name its streams
with a sufficiently unique "content identifier" based on the component
name, the file name, and a content hash, for example:
- `oci-commit-sha256:...`
- `oci-layer-sha256:...`
- `ostree-commit-...`
- &c.
Having the repository itself no longer care about the content hashes
means that the OCI code can now trust the SHA-256 verification
performed by skopeo, and we don't need to recompute it, which is a
nice win.
Update the file format documentation.
Update the repository code and the users of splitstream (ie: OCI) to
adjust to the post-sha256-hardcoded future.
Adjust the way we deal with verification of OCI objects when we lack
fs-verity digests: instead of having an "open" operation which verifies
everything and a "shallow open" which doesn't, just have the open
operation verify only the config and move the verification of the layers
to when we access them.
Co-authored-by: Alexander Larsson <[email protected]>
Signed-off-by: Alexander Larsson <[email protected]>
Signed-off-by: Allison Karlitskaya <[email protected]>
The `fdatasync()` per written object thing is an unmitigated performance disaster and we need to get rid of it. It forces the filesystem to create thousands upon thousands of tiny commits.
I tried another approach where we would reopen and `fsync()` each object file referred to from a splitstream *after* all of the files were written, but before we wrote the splitstream data, but it was also quite bad.
syncfs() is a really really dumb hammer, and it could get us into trouble if other users on the system have massive outstanding amounts of IO. On the other hand, it works: it's almost instantaneous, and is a massive performance improvement over what we were doing before. Let's just go with that for now. Maybe some day filesystems will have a mechanism for this which isn't horrible.
Signed-off-by: Allison Karlitskaya <[email protected]>
2a8007f to
fd91a26
Compare
Sorry, remind me why one would need to do that outside of a push operation?
Canonical tar would be great I think, but we definitely need to be able to store existing containers too. So let's set that aside. Again my point is that our EROFS is already a well-known binary format that we have userspace (and obviously kernel side) parsers for that can contain named references to objects via fsverity digests. EROFS doesn't help with the "reconstruct exact original bits" but we can design a format that operates on the EROFS representation (in some ways this may be harder). Or a different approach here is to keep splitstreams mostly as is, but just don't make them the Source Of Truth for garbage collection.
Because the layers are referred to by their diffid from the config, which is the sha256 over their content, including the tar headers. When we construct the image we want to ensure that the file we're constructing from hasn't been tampered, which means we need to measure it...
So splitstreams contain references to the objects, which means that we need some way to keep those objects around if the splitstream exists. But it's not the only story here: if you create an erofs image (eg. GC on the erofs is unfortunately substantially more expensive since we need to walk the tree to find and parse all the xattrs....
Note an interesting discussion in containers/skopeo#2750 right now about if referring to containers with
This changes the splitstream format a bit, with the goal of allowing splitstreams to support ostree files as well (see #144), but it is imho a generally nice change.
The primary differences are:
The ability to reference file objects in the repo even if they are not part of the splitstream "content" will be useful for the ostree support to reference file content objects.
This change also allows more efficient GC enumeration, because we don't have to parse the entire splitstream to find the referenced objects.
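The GC point can be illustrated with a small sketch. It assumes a toy layout in which each stream starts with a little-endian `u64` count followed by that many fixed-size 32-byte object IDs, with the stream content after that; the layout, the 32-byte ID length, and the names `referenced_objects` and `live_objects` are assumptions for the example and not the actual on-disk format.

```rust
use std::collections::BTreeSet;
use std::convert::TryFrom;

/// Hypothetical fixed ID length (32 bytes, as for a SHA-256 digest).
const ID_LEN: usize = 32;
type ObjectId = [u8; ID_LEN];

/// Returns the object IDs listed at the front of a toy stream.
/// Only the count and the ID table are read; everything after the
/// table is never inspected.
fn referenced_objects(stream: &[u8]) -> Option<Vec<ObjectId>> {
    let count_bytes = <[u8; 8]>::try_from(stream.get(..8)?).ok()?;
    let count = usize::try_from(u64::from_le_bytes(count_bytes)).ok()?;
    let table_end = 8usize.checked_add(count.checked_mul(ID_LEN)?)?;
    let table = stream.get(8..table_end)?;

    Some(
        table
            .chunks_exact(ID_LEN)
            .map(|chunk| {
                let mut id = [0u8; ID_LEN];
                id.copy_from_slice(chunk);
                id
            })
            .collect(),
    )
}

/// Builds the set of live objects across several streams, which a GC
/// pass could compare against the objects present in the repository.
fn live_objects(streams: &[&[u8]]) -> Option<BTreeSet<ObjectId>> {
    let mut live = BTreeSet::new();
    for stream in streams {
        live.extend(referenced_objects(stream)?);
    }
    Some(live)
}
```

With references gathered in a flat table, the cost of enumeration depends on the number of references rather than on the size of the stream content.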