Preparatory splitstream format changes for ostree support #185

alexlarsson · 2025-09-29T09:04:18Z

This changes the splitstream format a bit, with the goal of allowing splitstreams to support ostree files as well (see #144), but it is imho a generally nice change.

The primary differences are:

- The header is not compressed.
- All referenced fs-verity objects are stored in the header, including external chunks, mapped splitstreams and (a new feature) references that are not used in chunks.
- The mapping table is separate from the reference table (and generally smaller), and indexes into it.
- There is a magic value to detect the file format.
- There is a magic content type to detect the type wrapped in the stream.
- We store a tag for what ObjectID format is used.
- The total size of the stream is stored in the header.

The ability to reference file objects in the repo even if they are not part of the splitstream "content" will be useful for the ostree support to reference file content objects.

This change also allows more efficient GC enumeration, because we don't have to parse the entire splitstream to find the referenced objects.
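To illustrate the GC point, here is a minimal sketch of a mark-and-sweep pass that only consumes the per-stream reference lists. It assumes the references have already been read from each uncompressed header; `ObjectId` as a 32-byte array and the `unreferenced` helper are hypothetical names, not the repository API.

```rust
use std::collections::HashSet;

/// Hypothetical stand-in for a SHA-256 fs-verity object ID.
pub type ObjectId = [u8; 32];

/// Returns the objects that no stream header refers to.
/// `stream_refs` holds one reference list per splitstream, as read from its header.
pub fn unreferenced(all_objects: &[ObjectId], stream_refs: &[Vec<ObjectId>]) -> Vec<ObjectId> {
    // Mark: collect every ID mentioned in any header.
    let live: HashSet<&ObjectId> = stream_refs.iter().flatten().collect();
    // Sweep: anything not marked is a candidate for deletion.
    all_objects
        .iter()
        .filter(|id| !live.contains(id))
        .copied()
        .collect()
}
```

The stream body is never touched here, which is the property the header layout is meant to provide.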

cgwalters · 2025-09-29T12:29:31Z

crates/composefs-oci/src/skopeo.rs


 use crate::{sha256_from_descriptor, sha256_from_digest, tar::split_async, ContentAndVerity};

+pub const TAR_LAYER_CONTENT_TYPE: u64 = 0x2a037edfcae1ffea;


Can you add a comment for where these came from? I'm guessing random? If so just add a comment // Random unique ID ?

That said, I wonder if it wouldn't be nicer to store (variable-length) strings for this in the format? Maybe it could even go as far as suggesting the OCI mediaType (if applicable).

Yes. These are random.

But, I'd rather avoid having variable length things in the header. That makes parsing it much more tricky.

We could make it a real uuid tho

I'm fine keeping as u64 too.

I added some comments here

cgwalters · 2025-09-29T12:31:27Z

crates/composefs/src/repository.rs

+    pub fn has_named_stream(&self, name: &str) -> bool {
+        let stream_path = format!("streams/refs/{}", name);
+
+        readlinkat(&self.repository, &stream_path, []).is_ok()


I don't like "swallowing" errors like this, I'd say call stat instead and require it's S_IFLNK

I redid this with stat()
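A rough sketch of what a stat-based check could look like, for readers following the thread. It uses path-based `std::fs::symlink_metadata` (an `lstat()`) rather than the directory-fd calls the repository code uses, and the `repo_root` parameter and `io::Result<bool>` return type are assumptions for illustration, not the signature that was merged.

```rust
use std::fs;
use std::io;
use std::path::Path;

/// Ok(true) only if `streams/refs/<name>` exists and is a symlink.
/// A missing entry is Ok(false); anything else is reported as an error
/// instead of being silently treated as "not present".
pub fn has_named_stream(repo_root: &Path, name: &str) -> io::Result<bool> {
    let stream_path = repo_root.join("streams/refs").join(name);
    // symlink_metadata() does not follow the link, so a dangling link still counts.
    match fs::symlink_metadata(&stream_path) {
        Ok(meta) if meta.file_type().is_symlink() => Ok(true),
        Ok(_) => Err(io::Error::new(
            io::ErrorKind::InvalidData,
            format!("{} exists but is not a symlink", stream_path.display()),
        )),
        Err(e) if e.kind() == io::ErrorKind::NotFound => Ok(false),
        Err(e) => Err(e),
    }
}
```

Only `NotFound` maps to `false`; permission errors and unexpected file types surface to the caller.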

cgwalters · 2025-09-29T12:53:54Z

crates/composefs/src/splitstream.rs

+#[derive(Clone, Debug, FromBytes, Immutable, IntoBytes, KnownLayout)]
+#[repr(C)]
+pub struct MappingEntry {
+    pub body: Sha256Digest,


If we're changing the format here...I think it'd be nice to make this one extensible.

However...bigger picture there's another consideration: There's obviously a metric ton of binary serialization formats out there. A custom one isn't wrong necessarily but...how about say CBOR ? It has some usage and a proper RFC etc.

I guess a dividing line is "do we care about mmap()"? Probably not?

splitstreams are essentially thin wrappers of existing binary formats (tars, ostree objects, etc), adding just references to other composefs repo objects. I'm not sure it's overly helpful to use a complicated binary format for the wrapping, especially one which is completely different from the inner format.

That said, I agree that we should make it at least a bit extensible. This MR adds a magic header, but also adding a version field and a few bytes of unused/unparsed space does seem quite useful.

We discussed mmap, and the end result was, no, we don't want it.

splitstreams are essentially thin wrappers of existing binary formats (tars, ostree objects, etc), adding just references to other composefs repo objects. I'm not sure it's overly helpful to use a complicated binary format for the wrapping, especially one which is completely different from the inner format.

Yeah, though the nice thing about Rust here is that for stuff like this there's a lot of well-done crates.

It also makes it a lot more obvious and easy to parse from other languages too if we can say "it's just CBOR" (or whatever).

Anyways: I'm basically fine with this as is too.

Although what about the algorithm agility? There's been some thoughts that for post-quantum crypto we may need to get away from sha256 in theory as far as I understand things.

I think the right thing is to add a header size field, and skip parts we don't understand. Then we can easily extend this later.

I added a new extension_size field which we skip on read.
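The skip-on-read idea can be sketched as follows. The layout here (an 8-byte magic followed by a little-endian `u64` extension size) and the `MAGIC` value are made up for the example; only the "read the size, then discard that many bytes" step reflects what the comment describes.

```rust
use std::io::{self, Read};

// Hypothetical magic value, for illustration only.
const MAGIC: [u8; 8] = *b"EXAMPLE\0";

/// Reads the fixed part of a header, then skips `extension_size` bytes that
/// this reader does not understand. Returns the number of bytes skipped.
pub fn skip_extension<R: Read>(r: &mut R) -> io::Result<u64> {
    let mut magic = [0u8; 8];
    r.read_exact(&mut magic)?;
    if magic != MAGIC {
        return Err(io::Error::new(io::ErrorKind::InvalidData, "bad magic"));
    }

    let mut word = [0u8; 8];
    r.read_exact(&mut word)?;
    let extension_size = u64::from_le_bytes(word);

    // Discard the extension area; a newer writer may have put fields here.
    let skipped = io::copy(&mut r.by_ref().take(extension_size), &mut io::sink())?;
    if skipped != extension_size {
        return Err(io::Error::new(io::ErrorKind::UnexpectedEof, "truncated extension"));
    }
    Ok(skipped)
}
```

After the call the reader is positioned at the first byte following the extension area, so an old reader keeps working when a newer writer appends fields there.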

allisonkarlitskaya

Let's talk about this. I might have a bit of bandwidth to work on this if you like.

allisonkarlitskaya · 2025-10-07T06:30:26Z

doc/splitstream.md

+
+struct SplitstreamHeader {
+    magic: [u8; 7], // Contains SPLITSTREAM_MAGIC
+    algorithm: u8,  // The fs-verity algorithm used, 1 == sha256, 2 == sha512


We should also do fs-verity block size here. That's usually expressed as a bit-shift count, so 12 or 16...

We could also write it like "fsverity-sha256-12" or so as a string... some relevant discussion in #181.
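For the string variant, a small parser sketch. The `fsverity-<algorithm>-<shift>` shape comes from the example above; the type names, and the choice to accept only the shifts 12 and 16 mentioned in the comment, are assumptions made for illustration.

```rust
/// Algorithm numbers follow the doc comment: 1 == sha256, 2 == sha512.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Algorithm {
    Sha256 = 1,
    Sha512 = 2,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct VerityParams {
    pub algorithm: Algorithm,
    /// Block size expressed as a bit-shift count (log2 of the size in bytes).
    pub lg_blocksize: u8,
}

impl VerityParams {
    pub fn block_size(&self) -> u64 {
        1u64 << self.lg_blocksize
    }
}

/// Parses strings such as "fsverity-sha256-12".
pub fn parse_verity_params(s: &str) -> Option<VerityParams> {
    let rest = s.strip_prefix("fsverity-")?;
    let (alg, shift) = rest.split_once('-')?;
    let algorithm = match alg {
        "sha256" => Algorithm::Sha256,
        "sha512" => Algorithm::Sha512,
        _ => return None,
    };
    let lg_blocksize: u8 = shift.parse().ok()?;
    // Assumption: only the two shifts named in the discussion are accepted.
    if lg_blocksize != 12 && lg_blocksize != 16 {
        return None;
    }
    Some(VerityParams { algorithm, lg_blocksize })
}
```

A shift of 12 corresponds to 4096-byte blocks and 16 to 65536-byte blocks.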

allisonkarlitskaya · 2025-10-07T06:32:00Z

doc/splitstream.md

+    n_refs: u64,
+    n_mappings: u64,
+    refs: [ObjectID; n_refs]    // sorted
+    mappings: [MappingEntry; n_mappings] // sorted by body


It's so sketch that we hardcode sha256 here... I think that's probably OK, but maybe we'd add an extension mechanism so we could add new types of mappings tables...

Something like this from the start:

    magic
    n_sections
    n_sections * (
        section_start
        section_size_in_bytes
    )

We could name the sections but I think it's quite OK to just know what the numbers mean and require that they're all present, in order. An empty section would be denoted by a zero size.

We could also get into compat vs. uncompat extensions... not sure how far it's worth going here...
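A sketch of how a reader might walk such a section table. Field widths are not given in the comment, so this assumes an 8-byte magic and little-endian `u64` values throughout; `MAGIC` and `parse_sections` are hypothetical names.

```rust
use std::convert::TryFrom;
use std::ops::Range;

// Hypothetical magic value, for illustration only.
const MAGIC: [u8; 8] = *b"EXAMPLE\0";

/// Reads a little-endian u64 at byte offset `at`, or None if out of bounds.
fn read_u64(buf: &[u8], at: usize) -> Option<u64> {
    let bytes = buf.get(at..at.checked_add(8)?)?;
    let arr = <[u8; 8]>::try_from(bytes).ok()?;
    Some(u64::from_le_bytes(arr))
}

/// Parses `magic, n_sections, n_sections * (start, size)` and returns the byte
/// range of each section. An empty section is simply a zero-length range.
pub fn parse_sections(file: &[u8]) -> Option<Vec<Range<usize>>> {
    if file.get(..8)? != MAGIC.as_slice() {
        return None;
    }
    let n = usize::try_from(read_u64(file, 8)?).ok()?;
    let mut sections = Vec::new();
    for i in 0..n {
        let entry = 16usize.checked_add(i.checked_mul(16)?)?;
        let start = usize::try_from(read_u64(file, entry)?).ok()?;
        let size = usize::try_from(read_u64(file, entry + 8)?).ok()?;
        let end = start.checked_add(size)?;
        // Reject tables that point outside the file.
        if end > file.len() {
            return None;
        }
        sections.push(start..end);
    }
    Some(sections)
}
```

Because sections are identified by position, a reader that knows about fewer sections than the file contains can ignore the trailing entries.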

cgwalters · 2025-10-30T12:42:34Z

We should get this in probably very soon and then I think declare the format stable?

allisonkarlitskaya · 2025-10-30T12:59:48Z

We should get this in probably very soon and then I think declare the format stable?

I have a branch..... lol

allisonkarlitskaya · 2025-10-30T13:07:48Z

One of the things that I'm tormenting myself on a bit right now is the sha256 mapping. I'm considering changing it to a general-purpose "named object reference" mapping: we could then have a hashmap mapping names like "sha256:12345" to the object ID in question, and adding sha512 would be seamless. I think I'd choose to encode that as a nul-separated series of strings of the form 0:name0, 1:name1, etc. and compress the whole thing.
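As a sketch of that encoding: the function below parses an already-decompressed buffer of nul-separated `index:name` entries into a map from name to reference index. The exact separator handling (trailing nul tolerated, empty entries ignored) is an assumption, and `parse_named_refs` is a hypothetical name.

```rust
use std::collections::HashMap;

/// Parses entries like "0:sha256:12345\0" into name -> index.
/// Only the first ':' separates the index, so names may contain colons.
pub fn parse_named_refs(data: &[u8]) -> Option<HashMap<String, usize>> {
    let mut map = HashMap::new();
    for entry in data.split(|b| *b == 0).filter(|e| !e.is_empty()) {
        let entry = std::str::from_utf8(entry).ok()?;
        let (index, name) = entry.split_once(':')?;
        let index: usize = index.parse().ok()?;
        map.insert(name.to_string(), index);
    }
    Some(map)
}
```

Since the name is an opaque string, a `sha512:` entry needs no format change.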

The alternative is to stay with what we have now more or less, but it's much less flexible and is gonna be cruft one day, I'm sure. That being said: we have an extensibility mechanism now, at least...

The other thing that really needs fixing in @alexlarsson's work vs. the current version of the branch is that we should really take advantage of the fact that we have the references array out front now and use indexes into it from the splitstream content instead of repeating the whole object ID. It would have the additional advantage of ensuring that it became physically impossible to refer to an object that wasn't listed in the "depends" header (which we'll use during GC) and yet another advantage in that we could use the 64bit "starting word" for each internal/external section in the stream for both cases:

positive: this many internal bytes
negative or zero: an index into the references array

(or with the high bit as a flag or whatever).
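One way the signed "starting word" could be decoded, as a sketch. It picks the convention that a word of `0` means reference 0, `-1` means reference 1, and so on; the comment leaves the exact mapping open, so that choice and the `Chunk` type are assumptions.

```rust
use std::convert::TryFrom;

#[derive(Debug, PartialEq, Eq)]
pub enum Chunk {
    /// This many inline bytes follow in the stream.
    Inline(u64),
    /// External object, given as an index into the references array.
    External(usize),
}

/// Decodes a starting word; returns None if the index is not in the references array.
pub fn decode_chunk_word(word: i64, n_refs: usize) -> Option<Chunk> {
    if word > 0 {
        Some(Chunk::Inline(word as u64))
    } else {
        // Assumed convention: index = -word, so 0 => ref 0, -1 => ref 1, ...
        let index = usize::try_from(word.unsigned_abs()).ok()?;
        if index < n_refs {
            Some(Chunk::External(index))
        } else {
            None
        }
    }
}

/// Encodes a reference index under the same convention.
pub fn encode_external(index: usize) -> Option<i64> {
    i64::try_from(index).ok().map(|i| -i)
}
```

The bounds check is what makes it impossible for stream content to name an object that is missing from the references array.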

It's just "work" to get this over the line....

After that, I think we need to figure out a way to kill off the content-sha256-based naming of splitstreams and perhaps even consider getting rid of the streams/ directory entirely... each backend would have its own way of 'caching' interesting splitstream objects for itself. I'm not entirely sure how I'd do that for the OCI backend. In a related conversation with @Johan-Liebert1 we discussed having the layers (and possibly the config) ((and possibly possibly the manifest some day)) referenced from the erofs image itself as a way to optionally prevent those things from being GC'd. We'd probably want some sort of a better "lookup table" still, though... but the key difference is that this table would be unique to OCI, not some global "streams" directory that we try to pretend is sharable by everyone on equal footing...

alexlarsson · 2025-11-17T10:09:56Z

crates/composefs/src/fsverity/hashvalue.rs


 use hex::FromHexError;
 use sha2::{digest::FixedOutputReset, digest::Output, Digest, Sha256, Sha512};
+use std::cmp::Ord;


I would move this change to a separate commit

Actually, I think we can drop this. We're not sorting the objects arrays anymore.

alexlarsson · 2025-11-17T10:37:10Z

I took a look at this, and the file format looks ok to me. I don't think the content type is needed anymore with the new structured names though.

Johan-Liebert1 · 2025-11-17T11:15:57Z

crates/composefs-oci/src/lib.rs

+    let config = if verity.is_none() {
+        // No verity means we need to verify the content hash
+        let mut data = vec![];
+        stream.read_to_end(&mut data)?;


Maybe we could use std::io::copy for the hashing?

Not a bad idea, although it means we can't use our existing hash() function...

I tried to do this but I don't think it's a good idea: we need the data for two things, the hashing and the creation of the ImageConfiguration. Reading it once into a buffer lets us do both.

cgwalters · 2025-11-18T13:45:05Z

.github/workflows/bootc-revdep.yml

      - name: Build sealed bootc image
        working-directory: bootc
-        run: just build-sealed
+        run: just build


Sorry it's actually just variant=composefs-sealeduki-sdboot build now.

I just ran into this too and then remembered it from here. I threw up bootc-dev/bootc#1787 to propose bringing back the old target as a wrapper.

cgwalters · 2025-11-18T21:06:18Z

doc/splitstream.md


+```rust
+struct FileRange {
+    start: U64,


Why is this U uppercase?

That's from zerocopy: it's a little endian u64.

Are you asking for a clarifying note in the docs? That's a good suggestion. I actually thought I had added one...
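For readers unfamiliar with the convention, here is a std-only stand-in that shows the idea. It is not the zerocopy type itself: the `U64` below is a hypothetical wrapper written for this example, storing the value as eight little-endian bytes so the in-memory layout matches the on-disk one regardless of host endianness.

```rust
/// Stand-in for a little-endian u64 field: same bytes on disk and in memory.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
#[repr(transparent)]
pub struct U64([u8; 8]);

impl U64 {
    pub fn new(value: u64) -> Self {
        U64(value.to_le_bytes())
    }

    /// Converts back to a native integer.
    pub fn get(self) -> u64 {
        u64::from_le_bytes(self.0)
    }

    pub fn as_bytes(&self) -> &[u8; 8] {
        &self.0
    }
}
```

Because the wrapper is a plain byte array, it has alignment 1 and can sit at any offset inside a `#[repr(C)]` struct without padding.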

cgwalters · 2025-11-19T22:43:53Z

The thing I'll raise here again is I'm still uncertain about the "centrality" of splitstream for composefs-rs; this previously was discussed in #14 (comment)

To continue and flesh that out, I still feel it'd be a lot easier to understand if what we always parsed was EROFS images, and splitstream was just a lookaside that could help map from the EROFS back bit-for-bit only for heavyweight serializations like tar - and would only be used when we actually needed to return back a tar. Storing small data objects like manifests and config JSON would just be saved bit for bit as is.

cgwalters · 2025-11-19T22:51:23Z

Basically everyone dealing with composefs+OCI needs to deal with two binary serialization formats: tar and erofs. tar-split is funnily enough not binary but perhaps arguably should be...man I hadn't looked at this in a while but the choice of JSON with base64 encoded payloads is a bit wasteful...but eh, I guess compression still does well with it anyways. Anyways splitstream is a third binary format competing for mental bandwidth with both tar and erofs.

allisonkarlitskaya · 2025-11-20T12:38:39Z

Splitstream, at its core, is really just tar-split, fs-verity content-addressed, in binary for performance reasons, with enough information for GC. A small addition is named references for things like OCI configs....

tar-split exists in the container space for a good reason.... and we've discussed this before. The canonical representation of the container depends on the ability to reconstruct the tar stream: and this isn't just for uploading: you need to be able to reconstruct that stream to validate it for local use as well...

You've mentioned a few times that you think canonical tar might be a better way here (so that we can rebuild the tar without storing the original tar stream in some fashion) but I'm not sure how we square that with the world as it is...

So, splitstream...

We need some new features in systemd-repart and mkfs.ext4. We were pulling those from feature branches and commit IDs in the past, but these features are now available in stable releases. Build those release versions instead. Signed-off-by: Allison Karlitskaya <[email protected]>

This patch is machine-generated. Signed-off-by: Allison Karlitskaya <[email protected]>

This is only used from tests and not exported, so conditionalize it. Signed-off-by: Allison Karlitskaya <[email protected]>

This wasn't yet stabilized when the code was first written but newer patches have already added the use of this function in other parts of the same file, so this ought to be safe by now. Signed-off-by: Allison Karlitskaya <[email protected]>

This comment is an overview, so move it to a higher level. Change it a bit to make it more accurate. Also: move the declaration of the buffer outside of the loop body to avoid having to re-zero it each time. Signed-off-by: Allison Karlitskaya <[email protected]>

We're going to start referring to OCI images by their names starting with `sha256:` (as podman names them in the `--iidfile`) soon. Skopeo doesn't like that, so add a workaround. This will soon let us get rid of some of the hacking about we do in our `examples/` build scripts. Signed-off-by: Allison Karlitskaya <[email protected]>

Let's just have users write the padding as a separate inline section after they write the external data. This makes things a lot easier and reduces thrashing of the internal buffer. Signed-off-by: Allison Karlitskaya <[email protected]>

This is a substantial change to the splitstream file format to add more features (required for ostree support) and to add forwards- and backwards-compatibility mechanisms for future changes.

This change aims to finalize the file format so we can start shipping this to the systems of real users without future "breaks everything" changes. This change itself breaks everything: you'll need to delete your repository and start over. Hopefully this is the last time.

The file format is substantially more fleshed-out at this point. Here's an overview of the changes:

- there is a header with a magic value, a version, a flags field, and the fs-verity algorithm number and block size in use
- everything else in the file can be freely located, which will help if we ever want to create a version of the writer that streams data to disk as it goes: in that case we may want to store the stream before the associated metadata
- there is an expandable "info" section which contains most other information about the stream and is intended to be used as the primary mechanism for making compatible changes to the file format in the future
- the info section stores the total decompressed/reassembled stream size and a unique identifier value for the file type stored in the stream
- the referenced external objects and splitstreams are now stored in a flat array of binary fs-verity hash values to improve the performance of garbage collection operations in large repositories (informed by Alex's battlescars from dealing with GC on Flathub)
- it is possible to add arbitrary external object and stream references
- the "sha256 mapping" has been replaced with a more flexible "named stream refs" mechanism that allows assigning arbitrary names to associated streams. This will be useful if we ever want to support formats that are based on anything other than SHA-256 (including future OCI versions which may start using SHA-512 or something else).
- whereas the previous implementation concerned itself with ensuring the correct SHA-256 content hash of the stream and creating a link to the stream with that hash value from the `streams/` directory, the new implementation requires that the user perform whatever hashing they consider appropriate and name their streams with a "content identifier". This change, taken together with the above change, removes all SHA-256 specific logic from the implementation.

The main reason for this change is that a SHA-256 content hash over a file isn't a sufficiently unique identifier to locate the relevant splitstream for that file. Each different file type is split into a splitstream in a different way. It just so happens that OCI JSON documents, `.tar` files, and GVariant OSTree commit objects have no possible overlaps (which means that SHA-256 content hashes have uniquely identified the files up to this point), but this is mostly a coincidence. Each file type is now responsible to name its streams with a sufficiently unique "content identifier" based on the component name, the file name, and a content hash, for example:

- `oci-commit-sha256:...`
- `oci-layer-sha256:...`
- `ostree-commit-...`
- &c.

Having the repository itself no longer care about the content hashes means that the OCI code can now trust the SHA-256 verification performed by skopeo, and we don't need to recompute it, which is a nice win.

Update the file format documentation.

Update the repository code and the users of splitstream (ie: OCI) to adjust to the post-sha256-hardcoded future.

Adjust the way we deal with verification of OCI objects when we lack fs-verity digests: instead of having an "open" operation which verifies everything and a "shallow open" which doesn't, just have the open operation verify only the config and move the verification of the layers to when we access them.

Co-authored-by: Alexander Larsson <[email protected]> Signed-off-by: Alexander Larsson <[email protected]> Signed-off-by: Allison Karlitskaya <[email protected]>

The `fdatasync()` per written object thing is an unmitigated performance disaster and we need to get rid of it. It forces the filesystem to create thousands upon thousands of tiny commits. I tried another approach where we would reopen and `fsync()` each object file referred to from a splitstream *after* all of the files were written, but before we wrote the splitstream data, but it was also quite bad. syncfs() is a really really dumb hammer, and it could get us into trouble if other users on the system have massive outstanding amounts of IO. On the other hand, it works: it's almost instantaneous, and is a massive performance improvement over what we were doing before. Let's just go with that for now. Maybe some day filesystems will have a mechanism for this which isn't horrible. Signed-off-by: Allison Karlitskaya <[email protected]>

cgwalters · 2025-11-20T14:25:11Z

you need to be able to reconstruct that stream to validate it for local use as well...

Sorry, remind me why one would need to do that outside of a push operation?

You've mentioned a few times that you think canonical tar might be a better way here (so that we can rebuild the tar without storing the original tar stream in some fashion) but I'm not sure how we square that with the world as it is...

Canonical tar would be great I think, but we definitely need to be able to store existing containers too. So let's set that aside.

Again my point is that our EROFS is already a well-known binary format that we have userspace (and obviously kernel side) parsers for that can contain named references to objects via fsverity digests. EROFS doesn't help with the "reconstruct exact original bits" but we can design a format that operates on the EROFS representation (in some ways this may be harder).

Or a different approach here is to keep splitstreams mostly as is, but just don't make them the Source Of Truth for garbage collection.

allisonkarlitskaya · 2025-11-20T20:19:55Z

Sorry, remind me why one would need to do that outside of a push operation?

Because the layers are referred to by their diffid from the config, which is the sha256 over their content, including the tar headers. When we construct the image we want to ensure that the file we're constructing from hasn't been tampered with, which means we need to measure it...

Or a different approach here is to keep splitstreams mostly as is, but just don't make them the Source Of Truth for garbage collection.

So splitstreams contain references to the objects, which means that we need some way to keep those objects around if the splitstream exists. But it's not the only story here: if you create an erofs image (eg. prepare-boot), you could then throw the splitstreams away and still do GC based on the references found in the image.

GC on the erofs is unfortunately substantially more expensive since we need to walk the tree to find and parse all the xattrs....

allisonkarlitskaya · 2025-11-20T20:28:04Z

Note an interesting discussion in containers/skopeo#2750 right now about if referring to containers with sha256: in front is correct at all.... I had assumed that this is some semi-standard thing, but it might not be as standard as I thought...

alexlarsson force-pushed the splitstream-new-format branch from 3803dc3 to 9aedd96 Compare September 29, 2025 09:26

cgwalters reviewed Sep 29, 2025


alexlarsson force-pushed the splitstream-new-format branch 2 times, most recently from 057121b to bed66dc Compare October 6, 2025 15:19

allisonkarlitskaya requested changes Oct 7, 2025


allisonkarlitskaya force-pushed the splitstream-new-format branch from bed66dc to bcafffa Compare November 17, 2025 09:41

alexlarsson commented Nov 17, 2025


Johan-Liebert1 reviewed Nov 17, 2025


allisonkarlitskaya force-pushed the splitstream-new-format branch 4 times, most recently from c4b6ae7 to 65ade39 Compare November 18, 2025 10:55

cgwalters reviewed Nov 18, 2025


cgwalters reviewed Nov 19, 2025


allisonkarlitskaya and others added 9 commits November 20, 2025 13:50

various: clippy fixes

db98522

This patch is machine-generated. Signed-off-by: Allison Karlitskaya <[email protected]>

test.rs: #[cfg(test)] a function

2559c7c

This is only used from tests and not exported, so conditionalize it. Signed-off-by: Allison Karlitskaya <[email protected]>

oci: use a newer stdlib Vec API

1d487fc

This wasn't yet stabilized when the code was first written but newer patches have already added the use of this function in other parts of the same file, so this ought to be safe by now. Signed-off-by: Allison Karlitskaya <[email protected]>

splitstream: remove special handling of padding

ccc457e

Let's just have users write the padding as a separate inline section after they write the external data. This makes things a lot easier and reduces thrashing of the internal buffer. Signed-off-by: Allison Karlitskaya <[email protected]>

allisonkarlitskaya force-pushed the splitstream-new-format branch 2 times, most recently from 2a8007f to fd91a26 Compare November 20, 2025 13:02

