
Use io_uring for creating files when unpacking snapshot#6671

Merged
kskalski merged 55 commits into anza-xyz:master from kskalski:ks/dev/tar_unpack
Jul 25, 2025

Conversation

@kskalski

@kskalski kskalski commented Jun 20, 2025

Problem

Unpacking a snapshot uses the tar crate's unpack for each entry, which performs sync IO and copies data into an intermediate buffer before writing. This blocks and spends a lot of CPU time on syscalls.

Summary of Changes

  • A memlock limit of around 800 MiB is now a hard requirement for starting the validator (when it does snapshot unpacking) - added a breaking-change entry to the CHANGELOG
  • Introduce IoUringFileCreator (plus a compatibility trait for non-Linux platforms) and use it for creating files while unpacking the snapshot.
  • Remove ArchiveChunker and perform the whole unpacking in a single thread - all IO is done in background kernel threads (with io_uring), so unless we run out of disk write bandwidth this thread spends its time on decompression.
  • Change entry_processor into file_path_processor and only execute it for files (so the is_file() call can be avoided in the only non-trivial call site, which filters for files)
  • Refactor auxiliary code shared by the io_uring sequential file reader and file creator

This change is more or less performance neutral - untar times depend strongly on the achieved disk read/write throughput (and data layout on the attached disks).
Observed timings:

  • baseline snapshot untar: 150s - 153s
  • this PR: 138s - 157s

@kskalski kskalski changed the title tar unpack Use io_uring for creating files when unpacking snapshot Jun 23, 2025
@kskalski kskalski force-pushed the ks/dev/tar_unpack branch 2 times, most recently from d820a2e to c99189e Compare June 23, 2025 16:20
@kskalski kskalski marked this pull request as ready for review June 23, 2025 16:22

@brooksprumo brooksprumo left a comment


Can you please fix CI and wait for the PR to make it through the 'coverage' step before requesting a review?

Also, can you add perf numbers for with and without this change?

Comment thread CHANGELOG.md Outdated
@codecov-commenter

codecov-commenter commented Jun 24, 2025

Codecov Report

Attention: Patch coverage is 86.24813% with 92 lines in your changes missing coverage. Please review.

Project coverage is 83.2%. Comparing base (0fac0d1) to head (dc14171).
Report is 1 commit behind head on master.

Additional details and impacted files
@@            Coverage Diff            @@
##           master    #6671     +/-   ##
=========================================
- Coverage    83.2%    83.2%   -0.1%     
=========================================
  Files         852      853      +1     
  Lines      373763   374060    +297     
=========================================
+ Hits       311290   311492    +202     
- Misses      62473    62568     +95     

@brooksprumo brooksprumo self-requested a review June 25, 2025 18:05
@brooksprumo

I'm trying to square these two things:

This one, from the problem statement:

Unpacking snapshot uses tar crate unpack for each entry, which calls sync IO and copy data into intermediate buffer before performing writes. This blocks and spends a lot of CPU time on syscalls.

And this one from the solution:

This change is more or less performance neutral

Why go through the trouble of io_uring-ifying if performance is not changed? Is the benefit that we only use a single thread instead of all the unpacker threads? If yes, I'd argue that this PR greatly improves performance then :)


@brooksprumo brooksprumo left a comment


This PR is very large. Can it be split up? Minimally I need some additional help of where to start. Also a high level overview of what the design is would be much appreciated (I have read the Summary of Changes, so looking for more detail please).

Comment thread accounts-db/src/buffered_reader.rs
Comment thread io-uring/src/ring.rs
Comment thread accounts-db/src/lib.rs Outdated
@kskalski
Author

kskalski commented Jun 26, 2025

If you mean performance as the amount and shape of resources used, then there is an improvement, though it's a bit hard to reason about. The change is:

  • Baseline:
    • one decompression thread
    • [kernel] one io_uring submission queue polling / read worker thread
    • 4 regular threads mostly waiting on syscalls and channel for receiving archive chunks
  • This PR:
    • one decompression and untarring thread
    • [kernel] one io_uring submission queue polling / read worker thread
    • [kernel] a dynamic set of io_uring write threads (those are limited to 4 in the code, but actually profiler shows a weird and changing picture of them)

I think the true statement is that we save some CPU on doing syscalls, since we use a more efficient API to the kernel, which batches syscalls into an occasional io_uring queue sync. There are a couple of CPU-saving optimizations too (less copying of buffers) and the kernel's work is a bit more efficient (use of fixed buffers and file descriptors). From this point of view we save something like a fraction of one core's CPU work.

From the point of view of start-up time, this is neutral though. Depending on hardware (disks), e.g. when we do hit the zstd decompression bottleneck, this should in theory be faster; the queuing model is also simpler.

@kskalski
Author

This PR is very large. Can it be split up? Minimally I need some additional help of where to start. Also a high level overview of what the design is would be much appreciated (I have read the Summary of Changes, so looking for more detail please).

I guess it should be easy to split the new file-creator trait and implementations from their actual use for untarring and the removal of the chunker.
Would this be your preference?

@alessandrod

Would this be your preference?

Please no. PRs (and commits) are supposed to be atomic: merging code that isn't used makes no sense - how can you possibly review that it's correct if you don't see how it's used?

What matters is making atomic things, not minimizing diffs [man-standing.jpeg]

@kskalski
Author

kskalski commented Jun 26, 2025

@brooksprumo

I agree that it's better to have impl and usage in one change, but I could be convinced otherwise if the API introduced is clean enough...

I agree. Wasn't looking for the PR to be broken up into horizontal slices. Sometimes PRs contain multiple vertical slices that can be broken up into separate atomic PRs. If that's not the case for this PR, that's OK too.

Anyway, for now let me suggest the way to review:

Thanks!

@kskalski kskalski force-pushed the ks/dev/tar_unpack branch 2 times, most recently from adc5203 to d1b38d7 Compare June 30, 2025 05:26
@brooksprumo brooksprumo self-requested a review July 1, 2025 13:54
@kskalski kskalski force-pushed the ks/dev/tar_unpack branch 2 times, most recently from c8dda78 to 4dd2af4 Compare July 8, 2025 06:24

@alessandrod alessandrod left a comment


thanks! generally looks good, left a few comments

Comment thread accounts-db/src/file_io.rs Outdated
Comment thread accounts-db/src/file_io.rs Outdated
Comment thread accounts-db/src/file_io.rs Outdated
Comment thread accounts-db/src/file_io.rs Outdated
Comment thread accounts-db/src/hardened_unpack.rs Outdated
Comment thread accounts-db/src/io_uring/memory.rs Outdated
Comment thread accounts-db/src/file_io.rs Outdated
cursor: Cursor::new(buf.sub_buf_to(total_read_len)),
io_buf_index: *io_buf_index,
};
reader_state.buffers[*reader_buf_index] =


same here, you can use the existing buf and advance the size field in place

Author

@kskalski kskalski Jul 8, 2025


advance the size field in place

Changed sub_buf_to to consume self, so it now basically does the size shortening in place.

In the future PR I'm getting rid of cursor and sub_buf_to completely just preserving the buffer (https://github.com/anza-xyz/agave/pull/6878/files#diff-c34f1749c5990606c2430a4289eee1207e1f227f9824b85dd9371d15381da2d8R451-R453), since for reading multiple files I need to re-use the full buffer instead of permanently shortening it.

Comment thread io-uring/src/ring.rs Outdated
Comment thread runtime/src/snapshot_utils.rs
@kskalski kskalski requested a review from alessandrod July 8, 2025 13:55

@alessandrod alessandrod left a comment


we've discussed some on slack, and here are some more comments.

Once you change the register/mlock stuff I'll do a final pass on the io-uring code

Comment thread accounts-db/src/file_io.rs Outdated
Comment thread accounts-db/src/file_io.rs Outdated
Comment thread accounts-db/src/file_io.rs Outdated
Comment thread accounts-db/src/file_io.rs Outdated
Comment thread accounts-db/src/hardened_unpack.rs Outdated
Comment thread accounts-db/src/io_uring/file_creator.rs Outdated
Comment thread accounts-db/src/io_uring/sequential_file_reader.rs Outdated
Comment thread accounts-db/src/file_io.rs
Comment thread accounts-db/src/io_uring/memory.rs Outdated
Comment thread runtime/src/snapshot_utils.rs
@kskalski kskalski force-pushed the ks/dev/tar_unpack branch from baa10c4 to b6ed59c Compare July 9, 2025 10:06
@kskalski
Author

kskalski commented Jul 9, 2025

Some changes that came up when making memlock a requirement:

  • updated the changelog to indicate the breaking change
  • I'm now using a heuristic for sizing the file_creator buffer - use the provided unpack size limit / count limit and clamp / align to the prod buffer size and default write size (this fixes tests on GitHub Actions, which won't allow updating ulimit, and it saves memory for genesis unpacking)
  • preparing the buffer for registration in io_uring now also checks the memlock limit, tries to raise it if it's definitely too small, and gives a more useful error message to the end user if that fails
  • register* calls now go through agave-io-uring, since I decoupled state creation from registration and can drop the state on error
  • removed setup_coop_taskrun since it's only supported from kernel 5.19, but we still support 5.15 on Ubuntu 22.04 (discovered through CI)

@kskalski kskalski requested a review from alessandrod July 9, 2025 15:56
@kskalski kskalski force-pushed the ks/dev/tar_unpack branch from 64188a2 to 9515fde Compare July 9, 2025 18:53

@alessandrod alessandrod left a comment


ok, another pass. I haven't done the io-uring file creator yet; I'll do it now

Comment thread accounts-db/src/io_uring/file_creator.rs Outdated
Comment thread accounts-db/src/file_io.rs
Comment thread accounts-db/src/hardened_unpack.rs Outdated
.read(true)
.custom_flags(libc::O_NOATIME)
.open(path)?;
let buffers = IoFixedBuffer::split_buffer_chunks(buffer, read_capacity)


split_buffer_chunks/register_buffer are pretty ugly imo.

I would do something like

let buffers = IoFixedBuffer::split_buffer_chunks(buffer, read_capacity);
IoFixedBuffer::register(buffers, ring);

Author


Do you mean we should register each buffer as a separate fixed buffer with its own index in the kernel, or just rename stuff a bit? I'm changing the names as you suggested, but I guess it's better to register the whole (original) buffer.



hm no that's not what I meant. Registering one buffer is faster.

My suggestion was wrong tho - I missed that we were chunking buffers by write_capacity, but then obviously the chunks registered as fixed buffers in io-uring are chunked by something different (FIXED_BUFFER_LEN).

I'll think about it, the API still looks pretty ugly to me.

Comment thread io-uring/src/ring.rs Outdated
Comment thread accounts-db/src/io_uring/memory.rs Outdated
}
/// Split buffer into `chunk_size` sized `IoFixedBuffer` buffers for use as registered
/// buffer in io_uring operations.
pub fn split_buffer_chunks<'a>(


I don't think that this code should be here, because it makes it impossible to
track the lifetime of the memory.

I would move this chunking/registering to whatever actually owns the buffer, so it's clear from there that even though we're downgrading to pointers, they won't be dangling

Author


This function is actually just a factory of IoFixedBuffer items, and the caller - which also owns the input buffer - manages the result; the code here doesn't leak the unsafe pointers anywhere else.

Also, the code was moved here because it's shared between file creator and file sequential reader.



This function is actually just a factory of IoFixedBuffer items and the caller, which is the owner of the input buffer too, manages the result

This is exactly the problem tho: the code returns values that embed pointers, but it's up to the caller to guarantee that the backing memory for those pointers remains valid.

Or in other words: this is a safe API, that can be misused to trigger use after free. Safe APIs should never allow use after free.

Author


I suppose for that reason it will be best to mark those functions as unsafe. I prefer to keep the constructor with the struct being constructed (also, copy-pasting unsafe code to each use site seems quite counter-productive), but indeed this operation is unsafe.

Comment thread accounts-db/src/io_uring/memory.rs Outdated
}

/// Registers the provided buffer as a fixed buffer in `io_uring`.
pub fn register_buffer<S, E: RingOp<S>>(


same here, move to what owns the buffer?

Comment thread accounts-db/src/io_uring/memory.rs Outdated
Comment thread accounts-db/src/io_uring/memory.rs Outdated
Comment thread io-uring/src/ring.rs Outdated

@alessandrod alessandrod left a comment


okay, went over the creator now

Comment thread accounts-db/src/io_uring/file_creator.rs Outdated
Comment thread accounts-db/src/io_uring/file_creator.rs
Comment thread accounts-db/src/io_uring/file_creator.rs Outdated
Comment thread accounts-db/src/io_uring/file_creator.rs Outdated
},
};

const DEFAULT_WRITE_SIZE: usize = 1024 * 1024;


why this number? add comment

Author


Added some comments; there is a dd-experimental truth and a theoretical truth, and they don't match...

Comment thread accounts-db/src/io_uring/file_creator.rs Outdated
Comment thread accounts-db/src/io_uring/file_creator.rs Outdated
Comment thread accounts-db/src/io_uring/file_creator.rs Outdated
Comment thread accounts-db/src/io_uring/file_creator.rs
struct PendingFile {
path: PathBuf,
completed_open: bool,
backlog: SmallVec<[PendingWrite; 8]>,


I remember writing this backlog thing, but I don't remember where 8 comes from?
Have you measured?

Author


I looked at the sizes of files in the accounts directory; most of them fall within 5-6 MB, so 8 × 1 MB of write capacity will cover most cases without an alloc

Author


99.9% of files are < 8,000,000 bytes

@kskalski kskalski force-pushed the ks/dev/tar_unpack branch from c2dac3c to 9445866 Compare July 24, 2025 06:57
alessandrod
alessandrod previously approved these changes Jul 25, 2025

@alessandrod alessandrod left a comment


great job!

@kskalski
Author

Got a small merge conflict in imports; the CI is green again. @alessandrod @brooksprumo whenever one of you gets to it, please re-approve

@kskalski kskalski merged commit 6c80057 into anza-xyz:master Jul 25, 2025
52 checks passed
@kskalski kskalski deleted the ks/dev/tar_unpack branch July 25, 2025 09:42
@willhickey

This seems to have broken v2.2 -> master upgrade compatibility. Is that expected? I'm getting these errors:

[2025-07-28T21:51:44.891053944Z INFO  solana_runtime::snapshot_bank_utils] Loading bank from full snapshot archive: /home/sol/ledger-snapshots/snapshot-348118791-ASwNBbeyz1LjNXNEU6WV6BYRUQQpgu2Cpc6xichjhdyx.tar.zst, and incremental snapshot archive: Some("/home/sol/ledger-snapshots/incremental-snapshot-348118791-348169965-AY6nLJxYC1FCwxoV6jMyvNxexysGv896fjRe3QTubdf.tar.zst")
[2025-07-28T21:51:44.891587812Z ERROR solana_accounts_db::io_uring::memory] Unable to increase the maximum memory lock limit to 2000000000 from 65536
[2025-07-28T21:51:45.160108254Z ERROR agave_validator] Failed to start validator: failed to load bank: I/O error: failed to open snapshot archive '/home/sol/ledger-snapshots/snapshot-348118791-ASwNBbeyz1LjNXNEU6WV6BYRUQQpgu2Cpc6xichjhdyx.tar.zst': unable to set memory lock limit, full snapshot archive: /home/sol/ledger-snapshots/snapshot-348118791-ASwNBbeyz1LjNXNEU6WV6BYRUQQpgu2Cpc6xichjhdyx.tar.zst, incremental snapshot archive: /home/sol/ledger-snapshots/incremental-snapshot-348118791-348169965-AY6nLJxYC1FCwxoV6jMyvNxexysGv896fjRe3QTubdf.tar.zst

Comment thread CHANGELOG.md
### Validator

#### Breaking
* Require increased `memlock` limits - recommended setting is `LimitMEMLOCK=2000000000` in systemd service configuration. Lack of sufficient limit (on Linux) will cause startup error.

@brooksprumo

brooksprumo commented Jul 28, 2025

[..] I'm getting these errors

Yes, you must set the memlock limit now. I tagged you on the change to the changelog: #6671 (comment)

Here's the message on discord for posterity: https://discord.com/channels/428295358100013066/439194979856809985/1398240774315053056

@willhickey

Thanks!
