@alexlarsson
Contributor

This changes the splitstream format a bit, with the goal of allowing splitstreams to support ostree files as well (see #144), but it is IMHO a generally nice change.

The primary differences are:

  • The header is not compressed
  • All referenced fs-verity objects are stored in the header, including external chunks, mapped splitstreams and (a new feature) references that are not used in chunks.
  • The mapping table is separate from the reference table (and generally smaller), and indexes into it.
  • There is a magic value to detect the file format.
  • There is a magic content type to detect the type wrapped in the stream.
  • We store a tag for what ObjectID format is used
  • The total size of the stream is stored in the header.

The ability to reference file objects in the repo even if they are not part of the splitstream "content" will be useful for the ostree support to reference file content objects.

This change also allows more efficient GC enumeration, because we don't have to parse the entire splitstream to find the referenced objects.
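As a rough illustration of the GC point above, the following sketch collects referenced object IDs by reading only the start of each stream. The layout used here (an 8-byte magic, a little-endian `u64` count, then that many 32-byte IDs) and the names `MAGIC`, `read_refs` and `live_objects` are assumptions made for the example, not the format defined by this PR.

```rust
use std::collections::HashSet;
use std::io::{self, Read};

/// Hypothetical magic value; the real constant is not shown in this thread.
const MAGIC: &[u8; 8] = b"SplitStr";

/// An fs-verity digest, assumed to be 32 bytes (SHA-256) for this sketch.
type ObjectId = [u8; 32];

/// Reads only the reference table from the start of a stream.
/// Assumed layout: 8-byte magic, little-endian u64 count, then `count` IDs.
fn read_refs<R: Read>(mut r: R) -> io::Result<Vec<ObjectId>> {
    let mut magic = [0u8; 8];
    r.read_exact(&mut magic)?;
    if &magic != MAGIC {
        return Err(io::Error::new(io::ErrorKind::InvalidData, "bad magic"));
    }
    let mut count = [0u8; 8];
    r.read_exact(&mut count)?;
    let count = u64::from_le_bytes(count);
    // Grow the vector as entries arrive instead of trusting `count` up front.
    let mut refs = Vec::new();
    for _ in 0..count {
        let mut id = [0u8; 32];
        r.read_exact(&mut id)?;
        refs.push(id);
    }
    Ok(refs)
}

/// Marks every object referenced by any of the given streams as live.
/// The stream content after the reference table is never read.
fn live_objects<R: Read>(streams: impl IntoIterator<Item = R>) -> io::Result<HashSet<ObjectId>> {
    let mut live = HashSet::new();
    for stream in streams {
        live.extend(read_refs(stream)?);
    }
    Ok(live)
}
```

Because the table sits in an uncompressed header, a collector along these lines would not need to decompress or walk the stream body at all.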

@alexlarsson alexlarsson force-pushed the splitstream-new-format branch from 3803dc3 to 9aedd96 Compare September 29, 2025 09:26

use crate::{sha256_from_descriptor, sha256_from_digest, tar::split_async, ContentAndVerity};

pub const TAR_LAYER_CONTENT_TYPE: u64 = 0x2a037edfcae1ffea;

Collaborator

Can you add a comment for where these came from? I'm guessing random? If so just add a comment // Random unique ID ?

That said, I wonder if it wouldn't be nicer to store (variable-length) strings for this in the format? It could even go as far as suggesting that the value literally be the OCI mediaType (if applicable).

Contributor Author

Yes. These are random.

But, I'd rather avoid having variable length things in the header. That makes parsing it much more tricky.

Contributor Author

We could make it a real uuid tho

Collaborator

I'm fine keeping as u64 too.

Contributor Author

I added some comments here

pub fn has_named_stream(&self, name: &str) -> bool {
    let stream_path = format!("streams/refs/{}", name);

    readlinkat(&self.repository, &stream_path, []).is_ok()

Collaborator

I don't like "swallowing" errors like this. I'd say call stat instead and require that it's S_IFLNK.

Contributor Author

I redid this with stat()
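A minimal sketch of the stat-based approach, using only the standard library: `std::fs::symlink_metadata` (an `lstat`-style call) stands in for whatever the repository code actually uses, and the function name `is_symlink` is hypothetical. The point is that a missing entry and a real I/O failure are no longer folded into the same `false`.

```rust
use std::fs;
use std::io;
use std::path::Path;

/// Returns Ok(true) only if `path` exists and is itself a symlink.
/// A missing entry yields Ok(false); any other error reaches the caller
/// instead of being silently turned into `false`.
fn is_symlink(path: &Path) -> io::Result<bool> {
    match fs::symlink_metadata(path) {
        // symlink_metadata does not follow the link, so this inspects the entry itself.
        Ok(meta) => Ok(meta.file_type().is_symlink()),
        Err(e) if e.kind() == io::ErrorKind::NotFound => Ok(false),
        Err(e) => Err(e),
    }
}
```

Whether a non-symlink entry should count as `false` (as here) or as an error is a design choice; the review comment could be read either way.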

#[derive(Clone, Debug, FromBytes, Immutable, IntoBytes, KnownLayout)]
#[repr(C)]
pub struct MappingEntry {
    pub body: Sha256Digest,

Collaborator

If we're changing the format here...I think it'd be nice to make this one extensible.

However...bigger picture there's another consideration: There's obviously a metric ton of binary serialization formats out there. A custom one isn't wrong necessarily but...how about say CBOR ? It has some usage and a proper RFC etc.

I guess a dividing line is "do we care about mmap()"? Probably not?

Contributor Author

splitstreams are essentially thin wrappers of existing binary formats (tars, ostree objects, etc), adding just references to other composefs repo objects. I'm not sure it's overly helpful to use a complicated binary format for the wrapping, especially one which is completely different from the inner format.

That said, I agree that we should make it at least a bit extensible. This MR adds a magic header, but also adding a version field and a few bytes of unused/unparsed space does seem quite useful.

Contributor Author

We discussed mmap, and the end result was, no, we don't want it.

Collaborator

> splitstreams are essentially thin wrappers of existing binary formats (tars, ostree objects, etc), adding just references to other composefs repo objects. I'm not sure it's overly helpful to use a complicated binary format for the wrapping, especially one which is completely different from the inner format.

Yeah, though the nice thing about Rust here is that for stuff like this there's a lot of well-done crates.

It also makes it a lot more obvious and easy to parse from other languages too if we can say "it's just CBOR" (or whatever).

Anyways: I'm basically fine with this as is too.

Collaborator

Although, what about algorithm agility? As far as I understand things, there have been some thoughts that post-quantum crypto may, in theory, require us to get away from sha256.

Contributor Author

I think the right thing is to add a header size field, and skip parts we don't understand. Then we can easily extend this later.

Contributor Author

I added a new extension_size field which we skip on read.
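To make the "skip what we don't understand" idea concrete, here is a small sketch. Only the `extension_size` name comes from the discussion; the `version` field, the field widths, and the function names are assumptions for illustration.

```rust
use std::io::{self, Read};

/// Hypothetical fixed header fields; only `extension_size` is named in the thread.
struct Header {
    version: u32,
    extension_size: u32,
}

/// Reads one little-endian u32 from the reader.
fn read_u32<R: Read>(r: &mut R) -> io::Result<u32> {
    let mut bytes = [0u8; 4];
    r.read_exact(&mut bytes)?;
    Ok(u32::from_le_bytes(bytes))
}

/// Reads the fixed fields, then discards `extension_size` bytes that this
/// reader does not understand, leaving `r` positioned at whatever follows.
fn read_header<R: Read>(r: &mut R) -> io::Result<Header> {
    let version = read_u32(r)?;
    let extension_size = read_u32(r)?;
    let wanted = u64::from(extension_size);
    // Copy the extension bytes into a sink; works for non-seekable readers too.
    let skipped = io::copy(&mut (&mut *r).take(wanted), &mut io::sink())?;
    if skipped != wanted {
        return Err(io::Error::new(io::ErrorKind::UnexpectedEof, "truncated extension"));
    }
    Ok(Header { version, extension_size })
}
```

An older reader built this way keeps working when a newer writer appends fields to the extension area, as long as the fixed prefix stays the same.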

@alexlarsson alexlarsson force-pushed the splitstream-new-format branch 2 times, most recently from 057121b to bed66dc Compare October 6, 2025 15:19

Collaborator

@allisonkarlitskaya allisonkarlitskaya left a comment

Let's talk about this. I might have a bit of bandwidth to work on this if you like.

struct SplitstreamHeader {
    magic: [u8; 7], // Contains SPLITSTREAM_MAGIC
    algorithm: u8, // The fs-verity algorithm used, 1 == sha256, 2 == sha512

Collaborator

We should also do fs-verity block size here. That's usually expressed as a bit-shift count, so 12 or 16...

We could also write it like "fsverity-sha256-12" or so as a string... some relevant discussion in #181.

    n_refs: u64,
    n_mappings: u64,
    refs: [ObjectID; n_refs] // sorted
    mappings: [MappingEntry; n_mappings] // sorted by body

Collaborator

It's so sketch that we hardcode sha256 here... I think that's probably OK, but maybe we'd add an extension mechanism so we could add new types of mappings tables...

Collaborator

Something like this from the start:

  • magic
  • n_sections
  • n_sections * (
    • section_start
    • section_size_in_bytes
  • )

We could name the sections but I think it's quite OK to just know what the numbers mean and require that they're all present, in order. An empty section would be denoted by a zero size.

We could also get into compat vs. uncompat extensions... not sure how far it's worth going here...
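The section-table layout proposed above could be parsed roughly as follows. Field widths (little-endian `u64`), the 8-byte magic, and the rule that extra trailing sections are tolerated are assumptions of this sketch, not something the thread settles.

```rust
use std::io::{self, Read};

/// One entry of the hypothetical section table: a byte range within the file.
#[derive(Debug, PartialEq)]
struct Section {
    start: u64,
    size: u64, // zero denotes an empty section
}

/// Reads one little-endian u64 from the reader.
fn read_u64<R: Read>(r: &mut R) -> io::Result<u64> {
    let mut bytes = [0u8; 8];
    r.read_exact(&mut bytes)?;
    Ok(u64::from_le_bytes(bytes))
}

/// Parses `magic`, `n_sections`, then `n_sections` (start, size) pairs.
/// Sections are identified by position, so at least `expected` must be present.
fn read_section_table<R: Read>(
    r: &mut R,
    magic: &[u8; 8],
    expected: u64,
) -> io::Result<Vec<Section>> {
    let mut got = [0u8; 8];
    r.read_exact(&mut got)?;
    if &got != magic {
        return Err(io::Error::new(io::ErrorKind::InvalidData, "bad magic"));
    }
    let n_sections = read_u64(r)?;
    if n_sections < expected {
        return Err(io::Error::new(io::ErrorKind::InvalidData, "missing sections"));
    }
    let mut sections = Vec::new();
    for _ in 0..n_sections {
        let start = read_u64(r)?;
        let size = read_u64(r)?;
        sections.push(Section { start, size });
    }
    Ok(sections)
}
```

Accepting more sections than expected is one possible way to allow compatible extensions; incompatible ones would need a separate signal such as a version or flags field.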

@cgwalters
Collaborator

We should get this in probably very soon and then I think declare the format stable?

@allisonkarlitskaya
Collaborator

> We should get this in probably very soon and then I think declare the format stable?

I have a branch..... lol

@allisonkarlitskaya
Collaborator

One of the things that I'm tormenting myself on a bit right now is the sha256 mapping. I'm considering changing it to a general-purpose "named object reference" mapping: we could then have a hashmap mapping names like "sha256:12345" to the object ID in question, and adding sha512 would be seamless. I think I'd choose to encode that as a nul-separated series of strings of the form 0:name0, 1:name1, etc. and compress the whole thing.

The alternative is to stay with what we have now more or less, but it's much less flexible and is gonna be cruft one day, I'm sure. That being said: we have an extensibility mechanism now, at least...
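For reference, the "named object reference" encoding floated above (nul-separated `index:name` strings) could be decoded along these lines. This is only a sketch: the function name is hypothetical, the index is assumed to point into the references array, and the compression step is left out.

```rust
use std::collections::HashMap;

/// Parses a nul-separated list of `index:name` entries into a name-to-index map.
/// Returns None on malformed input. Decompression is not part of this sketch.
fn parse_named_refs(data: &[u8]) -> Option<HashMap<String, usize>> {
    let mut map = HashMap::new();
    for entry in data.split(|&b| b == 0).filter(|e| !e.is_empty()) {
        let entry = std::str::from_utf8(entry).ok()?;
        // Split at the first ':' only, since names like "sha256:12345" contain one too.
        let (index, name) = entry.split_once(':')?;
        map.insert(name.to_string(), index.parse().ok()?);
    }
    Some(map)
}
```

With names carrying their own algorithm prefix, adding a `sha512:` entry needs no change to the parser, which is the flexibility argued for above.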

The other thing that really needs fixing in @alexlarsson's work vs. the current version of the branch is that we should really take advantage of the fact that we have the references array out front now and use indexes into it from the splitstream content instead of repeating the whole object ID. It would have the additional advantage of ensuring that it became physically impossible to refer to an object that wasn't listed in the "depends" header (which we'll use during GC) and yet another advantage in that we could use the 64bit "starting word" for each internal/external section in the stream for both cases:

  • positive: this many internal bytes
  • negative or zero: an index into the references array

(or with the high bit as a flag or whatever).
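The signed "starting word" scheme described above might look like this in code. The mapping from a non-positive word to an index (plain negation) is an assumption of this sketch, as are the type and function names; the high-bit-flag variant would decode differently.

```rust
/// What one 64-bit "starting word" in the stream announces.
#[derive(Debug, PartialEq)]
enum Chunk {
    /// This many bytes of inline data follow.
    Inline(u64),
    /// The chunk is the external object at this index in the references array.
    External(usize),
}

/// Decodes a little-endian signed word: positive means an inline length,
/// zero or negative means a reference index (assumed here to be the negation).
fn decode_word(bytes: [u8; 8]) -> Chunk {
    let word = i64::from_le_bytes(bytes);
    if word > 0 {
        Chunk::Inline(word as u64)
    } else {
        // unsigned_abs avoids overflow for i64::MIN.
        Chunk::External(word.unsigned_abs() as usize)
    }
}

/// Looks up an external chunk in the header's references array.
/// An index outside the array cannot name any object, so it yields None.
fn resolve<'a>(chunk: &Chunk, refs: &'a [[u8; 32]]) -> Option<&'a [u8; 32]> {
    match chunk {
        Chunk::External(index) => refs.get(*index),
        Chunk::Inline(_) => None,
    }
}
```

Since the stream body can only carry indexes, every object it uses is necessarily listed in the references array that GC reads.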

It's just "work" to get this over the line....

After that, I think we need to figure out a way to kill off the content-sha256-based naming of splitstreams and perhaps even consider getting rid of the streams/ directory entirely... each backend would have its own way of 'caching' interesting splitstream objects for itself. I'm not entirely sure how I'd do that for the OCI backend. In a related conversation with @Johan-Liebert1 we discussed having the layers (and possibly the config) ((and possibly possibly the manifest some day)) referenced from the erofs image itself as a way to optionally prevent those things from being GC'd. We'd probably want some sort of a better "lookup table" still, though... but the key difference is that this table would be unique to OCI, not some global "streams" directory that we try to pretend is sharable by everyone on equal footing...


use hex::FromHexError;
use sha2::{digest::FixedOutputReset, digest::Output, Digest, Sha256, Sha512};
use std::cmp::Ord;

Contributor Author

I would move this change to a separate commit

Collaborator

Actually, I think we can drop this. We're not sorting the objects arrays anymore.

@alexlarsson
Contributor Author

I took a look at this, and the file format looks ok to me. I don't think the content type is needed anymore with the new structured names though.

let config = if verity.is_none() {
    // No verity means we need to verify the content hash
    let mut data = vec![];
    stream.read_to_end(&mut data)?;

Collaborator

Maybe we could use std::io::copy for the hashing?

Collaborator

Not a bad idea, although it means we can't use our existing hash() function...

Collaborator

I tried to do this but I don't think it's a good idea: we need the data for two things, the hashing and the creation of the ImageConfiguration. Reading it once into a buffer lets us do both.

@allisonkarlitskaya allisonkarlitskaya force-pushed the splitstream-new-format branch 4 times, most recently from c4b6ae7 to 65ade39 Compare November 18, 2025 10:55

  - name: Build sealed bootc image
    working-directory: bootc
-   run: just build-sealed
+   run: just build

Collaborator

Sorry it's actually just variant=composefs-sealeduki-sdboot build now.

Collaborator

Thanks!

Collaborator

I just ran into this too and then remembered it from here. I threw up bootc-dev/bootc#1787 to propose bringing back the old target as a wrapper.


struct FileRange {
    start: U64,

Collaborator

Why is this U uppercase?

Collaborator

That's from zerocopy: it's a little endian u64.
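For readers unfamiliar with the idea, a byte-order-aware integer can be pictured as a thin wrapper over raw bytes. The `LeU64` type below is a hypothetical stand-in written with the standard library only; it is not zerocopy's actual definition.

```rust
/// Hypothetical stand-in for a little-endian u64 field type.
/// It stores raw bytes, so it has alignment 1 and the same layout on every host.
#[derive(Clone, Copy, Debug, PartialEq)]
#[repr(transparent)]
struct LeU64([u8; 8]);

impl LeU64 {
    /// Stores `value` in little-endian byte order.
    fn new(value: u64) -> Self {
        LeU64(value.to_le_bytes())
    }

    /// Returns the value in the host's native representation.
    fn get(self) -> u64 {
        u64::from_le_bytes(self.0)
    }
}
```

A struct built from such fields can be read from or written to a file as plain bytes without per-field byte swapping at the call sites, which is why an on-disk header would use it instead of a native `u64`.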

Collaborator

Are you asking for a clarifying note in the docs? That's a good suggestion. I actually thought I had added one...

@cgwalters
Collaborator

The thing I'll raise here again is I'm still uncertain about the "centrality" of splitstream for composefs-rs; this previously was discussed in #14 (comment)

To continue and flesh that out, I still feel it'd be a lot easier to understand if what we always parsed was EROFS images, and splitstream was just a lookaside that could help map from the EROFS back bit-for-bit only for heavyweight serializations like tar - and would only be used when we actually needed to return back a tar. Storing small data objects like manifests and config JSON would just be saved bit for bit as is.

@cgwalters
Collaborator

Basically everyone dealing with composefs+OCI needs to deal with two binary serialization formats: tar and erofs. tar-split is funnily enough not binary but perhaps arguably should be...man I hadn't looked at this in a while but the choice of JSON with base64 encoded payloads is a bit wasteful...but eh, I guess compression still does well with it anyways. Anyways splitstream is a third binary format competing for mental bandwidth with both tar and erofs.

@allisonkarlitskaya
Collaborator

Splitstream, at its core, is really just tar-split, fs-verity content-addressed, in binary for performance reasons, with enough information for GC. A small addition is named references for things like OCI configs....

tar-split exists in the container space for a good reason.... and we've discussed this before. The canonical representation of the container depends on the ability to reconstruct the tar stream: and this isn't just for uploading: you need to be able to reconstruct that stream to validate it for local use as well...

You've mentioned a few times that you think canonical tar might be a better way here (so that we can rebuild the tar without storing the original tar stream in some fashion) but I'm not sure how we square that with the world as it is...

So, splitstream...

allisonkarlitskaya and others added 9 commits November 20, 2025 13:50

We need some new features in systemd-repart and mkfs.ext4.  We were
pulling those from feature branches and commit IDs in the past, but
these features are now available in stable releases.  Build those
release versions instead.

Signed-off-by: Allison Karlitskaya <[email protected]>

This patch is machine-generated.

Signed-off-by: Allison Karlitskaya <[email protected]>

This is only used from tests and not exported, so conditionalize it.

Signed-off-by: Allison Karlitskaya <[email protected]>

This wasn't yet stabilized when the code was first written but newer
patches have already added the use of this function in other parts of
the same file, so this ought to be safe by now.

Signed-off-by: Allison Karlitskaya <[email protected]>

This comment is an overview, so move it to a higher level.  Change it a
bit to make it more accurate.

Also: move the declaration of the buffer outside of the loop body to
avoid having to re-zero it each time.

Signed-off-by: Allison Karlitskaya <[email protected]>

We're going to start referring to OCI images by their names starting
with `sha256:` (as podman names them in the `--iidfile`) soon.  Skopeo
doesn't like that, so add a workaround.

This will soon let us get rid of some of the hacking about we do in our
`examples/` build scripts.

Signed-off-by: Allison Karlitskaya <[email protected]>

Let's just have users write the padding as a separate inline section
after they write the external data.  This makes things a lot easier and
reduces thrashing of the internal buffer.

Signed-off-by: Allison Karlitskaya <[email protected]>

This is a substantial change to the splitstream file format to add more
features (required for ostree support) and to add forwards- and
backwards- compatibility mechanisms for future changes.  This change
aims to finalize the file format so we can start shipping this to the
systems of real users without future "breaks everything" changes.

This change itself breaks everything: you'll need to delete your
repository and start over.  Hopefully this is the last time.

The file format is substantially more fleshed-out at this point.  Here's
an overview of the changes:

  - there is a header with a magic value, a version, a flags field, and the
    fs-verity algorithm number and block size in use

  - everything else in the file can be freely located which will help if
    we ever want to create a version of the writer that streams data to
    disk as it goes: in that case we may want to store the stream before
    the associated metadata

  - there is an expandable "info" section which contains most other
    information about the stream and is intended to be used as the primary
    mechanism for making compatible changes to the file format in the
    future

  - the info section stores the total decompressed/reassembled stream
    size and a unique identifier value for the file type stored in the
    stream

  - the referenced external objects and splitstreams are now stored in a
    flat array of binary fs-verity hash values to improve the performance
    of garbage collection operations in large repositories (informed by
    Alex's battlescars from dealing with GC on Flathub)

  - it is possible to add arbitrary external object and stream references

  - the "sha256 mapping" has been replaced with a more flexible "named
    stream refs" mechanism that allows assigning arbitrary names to
    associated streams.  This will be useful if we ever want to support
    formats that are based on anything other than SHA-256 (including
    future OCI versions which may start using SHA-512 or something else).

  - whereas the previous implementation concerned itself with ensuring
    the correct SHA-256 content hash of the stream and creating a link to
    the stream with that hash value from the `streams/` directory, the new
    implementation requires that the user perform whatever hashing they
    consider appropriate and name their streams with a "content
    identifier".

    This change, taken together with the above change, removes all SHA-256
    specific logic from the implementation.

    The main reason for this change is that a SHA-256 content hash over a
    file isn't a sufficiently unique identifier to locate the relevant
    splitstream for that file.  Each different file type is split into a
    splitstream in a different way.  It just so happens that OCI JSON
    documents, `.tar` files, and GVariant OSTree commit objects have no
    possible overlaps (which means that SHA-256 content hashes have
    uniquely identified the files up to this point), but this is mostly a
    coincidence.  Each file type is now responsible to name its streams
    with a sufficiently unique "content identifier" based on the component
    name, the file name, and a content hash, for example:

      - `oci-commit-sha256:...`
      - `oci-layer-sha256:...`
      - `ostree-commit-...`
      - &c.

    Having the repository itself no longer care about the content hashes
    means that the OCI code can now trust the SHA-256 verification
    performed by skopeo, and we don't need to recompute it, which is a
    nice win.

Update the file format documentation.

Update the repository code and the users of splitstream (ie: OCI) to
adjust to the post-sha256-hardcoded future.

Adjust the way we deal with verification of OCI objects when we lack
fs-verity digests: instead of having an "open" operation which verifies
everything and a "shallow open" which doesn't, just have the open
operation verify only the config and move the verification of the layers
to when we access them.

Co-authored-by: Alexander Larsson <[email protected]>
Signed-off-by: Alexander Larsson <[email protected]>
Signed-off-by: Allison Karlitskaya <[email protected]>

The `fdatasync()` per written object thing is an unmitigated performance
disaster and we need to get rid of it.  It forces the filesystem to
create thousands upon thousands of tiny commits.

I tried another approach where we would reopen and `fsync()` each object
file referred to from a splitstream *after* all of the files were
written, but before we wrote the splitstream data, but it was also quite
bad.

syncfs() is a really really dumb hammer, and it could get us into
trouble if other users on the system have massive outstanding amounts of
IO.  On the other hand, it works: it's almost instantaneous, and is a
massive performance improvement over what we were doing before.  Let's
just go with that for now.

Maybe some day filesystems will have a mechanism for this which isn't
horrible.

Signed-off-by: Allison Karlitskaya <[email protected]>

@allisonkarlitskaya allisonkarlitskaya force-pushed the splitstream-new-format branch 2 times, most recently from 2a8007f to fd91a26 Compare November 20, 2025 13:02

@cgwalters
Collaborator

> you need to be able to reconstruct that stream to validate it for local use as well...

Sorry, remind me why one would need to do that outside of a push operation?

> You've mentioned a few times that you think canonical tar might be a better way here (so that we can rebuild the tar without storing the original tar stream in some fashion) but I'm not sure how we square that with the world as it is...

Canonical tar would be great I think, but we definitely need to be able to store existing containers too. So let's set that aside.

Again my point is that our EROFS is already a well-known binary format that we have userspace (and obviously kernel side) parsers for that can contain named references to objects via fsverity digests. EROFS doesn't help with the "reconstruct exact original bits" but we can design a format that operates on the EROFS representation (in some ways this may be harder).

Or a different approach here is to keep splitstreams mostly as is, but just don't make them the Source Of Truth for garbage collection.

@allisonkarlitskaya
Collaborator

> Sorry, remind me why one would need to do that outside of a push operation?

Because the layers are referred to by their diffid from the config, which is the sha256 over their content, including the tar headers. When we construct the image we want to ensure that the file we're constructing from hasn't been tampered with, which means we need to measure it...

> Or a different approach here is to keep splitstreams mostly as is, but just don't make them the Source Of Truth for garbage collection.

So splitstreams contain references to the objects, which means that we need some way to keep those objects around if the splitstream exists. But it's not the only story here: if you create an erofs image (eg. prepare-boot), you could then throw the splitstreams away and still do GC based on the references found in the image.

GC on the erofs is unfortunately substantially more expensive since we need to walk the tree to find and parse all the xattrs....

@allisonkarlitskaya
Collaborator

Note an interesting discussion in containers/skopeo#2750 right now about if referring to containers with sha256: in front is correct at all.... I had assumed that this is some semi-standard thing, but it might not be as standard as I thought...
