Use io_uring for creating files when unpacking snapshot#6671
kskalski merged 55 commits into anza-xyz:master
Conversation
Force-pushed d820a2e to c99189e
brooksprumo
left a comment
Can you please fix CI and wait for the PR to make it through the 'coverage' step before requesting a review?
Also, can you add perf numbers for with and without this change?
Codecov Report
Attention: Patch coverage is
Additional details and impacted files:
```
@@ Coverage Diff @@
##           master   #6671   +/-  ##
=========================================
- Coverage    83.2%   83.2%   -0.1%
=========================================
  Files         852     853      +1
  Lines      373763  374060    +297
=========================================
+ Hits       311290  311492    +202
- Misses      62473   62568     +95
```
I'm trying to square these two things: This one, from the problem statement:
And this one from the solution:
Why go through the trouble of io_uring-ifying if performance is not changed? Is the benefit that we only use a single thread instead of all the unpacker threads? If yes, I'd argue that this PR greatly improves performance then :)
brooksprumo
left a comment
This PR is very large. Can it be split up? At a minimum, I need some additional guidance on where to start. Also, a high-level overview of the design would be much appreciated (I have read the Summary of Changes, so I'm looking for more detail, please).
If you mean performance as amount and shape of resources used, then there is an improvement, though it's a bit hard to reason about. The change is:
I think the true statement is that we save some CPU on doing syscalls, since we use a more efficient API to the kernel, which batches syscalls into an occasional io_uring queue sync. There are a couple of CPU-saving optimizations too (less copying of buffers), and the kernel's work is a bit more efficient (use of fixed buffers and file descriptors). From this point of view we save something like a fraction of one core's CPU work. From the point of view of start-up time, this is neutral though. Depending on hardware (disks), e.g. when we do hit the zstd decompression bottleneck, this should in theory be faster; the queuing model is also simpler.
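To illustrate the syscall-batching idea, here is a minimal stand-in using stdlib vectored IO rather than the PR's actual io_uring code (`write_batched` is a made-up helper for this sketch): one call into the kernel covers several buffers instead of one `write` per buffer.

```rust
use std::io::{IoSlice, Write};

// Submit a whole batch of buffers with a single write call, instead of
// one syscall per buffer. A real implementation must loop on partial
// writes; the in-memory writer used below always consumes everything.
fn write_batched<W: Write>(out: &mut W, chunks: &[&[u8]]) -> std::io::Result<usize> {
    let slices: Vec<IoSlice<'_>> = chunks.iter().map(|c| IoSlice::new(c)).collect();
    out.write_vectored(&slices)
}

fn main() {
    let mut sink: Vec<u8> = Vec::new();
    let n = write_batched(&mut sink, &[&b"aa"[..], &b"bbb"[..], &b"c"[..]]).unwrap();
    assert_eq!(n, 6);
    assert_eq!(sink, b"aabbbc");
}
```

io_uring goes further than this: many queued operations (opens, writes, closes) are synced with the kernel in one `io_uring_enter`, which is where the fraction-of-a-core saving comes from.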
I guess it should be easy to split the new file-creator trait and implementations from their actual use for untarring and the removal of the chunker.
Please no. PRs (and commits) are supposed to be atomic: merging code that isn't used makes no sense - how can you possibly review that it's correct if you don't see how it's used? What matters is making atomic things, not minimizing diffs [man-standing.jpeg]
I agree. Wasn't looking for the PR to be broken up into horizontal slices. Sometimes PRs contain multiple vertical slices that can be broken up into separate atomic PRs. If that's not the case for this PR, that's OK too.
Thanks!
Force-pushed adc5203 to d1b38d7
Force-pushed c8dda78 to 4dd2af4
alessandrod
left a comment
thanks! generally looks good, left a few comments
```rust
    cursor: Cursor::new(buf.sub_buf_to(total_read_len)),
    io_buf_index: *io_buf_index,
};
reader_state.buffers[*reader_buf_index] =
```
same here, you can use the existing buf and advance the size field in place
> advance the size field in place
Changed sub_buf_to to consume self, so it now basically does the size shortening in place.
In the future PR I'm getting rid of cursor and sub_buf_to completely just preserving the buffer (https://github.com/anza-xyz/agave/pull/6878/files#diff-c34f1749c5990606c2430a4289eee1207e1f227f9824b85dd9371d15381da2d8R451-R453), since for reading multiple files I need to re-use the full buffer instead of permanently shortening it.
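The consuming `sub_buf_to` described above can be sketched roughly like this (a minimal stand-in; the `IoBuf` struct and its fields are assumptions, not the PR's actual code). Because the method takes `self` by value, the shortened buffer *is* the old buffer: no copy is made and no second handle to the memory stays alive.

```rust
// Hypothetical buffer wrapper illustrating a consuming `sub_buf_to`.
struct IoBuf {
    data: Vec<u8>,
    len: usize, // length of the valid prefix
}

impl IoBuf {
    fn new(data: Vec<u8>) -> Self {
        let len = data.len();
        Self { data, len }
    }

    // Consumes self and shortens the valid length in place.
    fn sub_buf_to(mut self, new_len: usize) -> Self {
        assert!(new_len <= self.len);
        self.len = new_len;
        self
    }

    fn as_slice(&self) -> &[u8] {
        &self.data[..self.len]
    }
}

fn main() {
    let buf = IoBuf::new(vec![1, 2, 3, 4, 5]);
    let short = buf.sub_buf_to(3); // `buf` is moved and cannot be reused
    assert_eq!(short.as_slice(), &[1, 2, 3]);
}
```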
alessandrod
left a comment
we've discussed some on slack, and here's some more comments.
Once you change the register/mlock stuff I'll do a final pass on the io-uring code
Force-pushed baa10c4 to b6ed59c
Some changes that came up when making memlock a requirement:
Force-pushed 64188a2 to 9515fde
alessandrod
left a comment
ok, another pass. I haven't done the io-uring file creator yet; I'll do it now
```rust
    .read(true)
    .custom_flags(libc::O_NOATIME)
    .open(path)?;
let buffers = IoFixedBuffer::split_buffer_chunks(buffer, read_capacity)
```
split_buffer_chunks/register_buffer are pretty ugly imo.
I would do something like

```rust
let buffers = IoFixedBuffer::split_buffer_chunks(buffer, read_capacity);
IoFixedBuffer::register(buffers, ring);
```
Do you mean we should register each buffer as separate fixed buffer with its own index in the kernel or just to rename stuff a bit? I'm changing the names as you suggested, but I guess it's better to register a whole (original) buffer.
hm no that's not what I meant. Registering one buffer is faster.
My suggestion was wrong tho - I missed that we were chunking buffers by write_capacity, but then obviously the chunks registered as fixed buffers in io-uring are chunked by something different (FIXED_BUFFER_LEN).
I'll think about it, the API still looks pretty ugly to me.
```rust
}

/// Split buffer into `chunk_size` sized `IoFixedBuffer` buffers for use as registered
/// buffer in io_uring operations.
pub fn split_buffer_chunks<'a>(
```
I don't think that this code should be here, because it makes it impossible to
track the lifetime of the memory.
I would move this chunking/registering to what actually owns the buffer, so it's clear from there that even though we're downgrading to pointers, they won't be dangling
This function is actually just a factory of IoFixedBuffer items and the caller, which is the owner of the input buffer too, manages the result - the code here doesn't leak the unsafe pointers anywhere else.
Also, the code was moved here because it's shared between file creator and file sequential reader.
> This function is actually just a factory of IoFixedBuffer items and the caller, which is the owner of the input buffer too, manages the result
This is exactly the problem tho: the code returns values that embed pointers, but it's up to the caller to guarantee that the backing memory for those pointers remains valid.
Or in other words: this is a safe API, that can be misused to trigger use after free. Safe APIs should never allow use after free.
I suppose for that reason it will be best to mark those functions as unsafe. I prefer to keep the constructor with the struct being constructed (also, copy-pasting unsafe code to each use site seems quite counter-productive), but indeed this operation is unsafe.
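The resolution discussed here, marking the factory `unsafe` while keeping it next to the struct, might look roughly like the sketch below (field layout and the exact signature are assumptions; the real code returns registered-buffer handles for io_uring, which this stand-in omits):

```rust
// Hypothetical sketch: an unsafe factory that splits a caller-owned buffer
// into raw-pointer-backed chunks. The `unsafe` keyword shifts the lifetime
// obligation onto the caller, which is exactly the point of the thread.
struct IoFixedBuffer {
    ptr: *mut u8,
    len: usize,
}

impl IoFixedBuffer {
    /// # Safety
    /// `buffer` must outlive every returned `IoFixedBuffer` and must not be
    /// moved or reallocated while any chunk is in use.
    unsafe fn split_buffer_chunks(buffer: &mut [u8], chunk_size: usize) -> Vec<IoFixedBuffer> {
        buffer
            .chunks_exact_mut(chunk_size)
            .map(|c| IoFixedBuffer { ptr: c.as_mut_ptr(), len: c.len() })
            .collect()
    }
}

fn main() {
    let mut backing = vec![0u8; 4096];
    // SAFETY: `backing` lives until the end of main and is never reallocated.
    let chunks = unsafe { IoFixedBuffer::split_buffer_chunks(&mut backing, 1024) };
    assert_eq!(chunks.len(), 4);
    assert_eq!(chunks[0].len, 1024);
    let _ = chunks[0].ptr; // pointers remain valid while `backing` is alive
}
```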
```rust
}

/// Register provided buffer as fixed buffer in `io_uring`.
pub fn register_buffer<S, E: RingOp<S>>(
```
same here, move to what owns the buffer?
```rust
    },
};

const DEFAULT_WRITE_SIZE: usize = 1024 * 1024;
```
Added some comments; there is a dd experimental truth and a theoretical truth, and they don't match...
```rust
struct PendingFile {
    path: PathBuf,
    completed_open: bool,
    backlog: SmallVec<[PendingWrite; 8]>,
```
I remember writing this backlog thing, but I don't remember where 8 comes from?
Have you measured?
I looked at the sizes of files in the accounts directory; most of them fall within 5-6 MB, so 8 * 1 MB of write capacity will cover most cases without an alloc
99.9% of files are <8000000 bytes
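The sizing logic behind the inline capacity of 8 can be stated as plain arithmetic (constants mirror this thread; the SmallVec itself is replaced by a simple predicate in this sketch): with a 1 MiB write size, a file needs ceil(file_len / 1 MiB) backlog entries, so files up to ~8 MB never spill to the heap.

```rust
// Hypothetical helper names; only the constants come from the discussion.
const WRITE_SIZE: usize = 1024 * 1024; // 1 MiB per queued write
const INLINE_SLOTS: usize = 8;         // SmallVec inline capacity

// Number of backlog entries a file of this length occupies.
fn backlog_entries(file_len: usize) -> usize {
    file_len.div_ceil(WRITE_SIZE)
}

// Whether the backlog would need a heap allocation for this file.
fn spills_to_heap(file_len: usize) -> bool {
    backlog_entries(file_len) > INLINE_SLOTS
}

fn main() {
    assert!(!spills_to_heap(6 * 1024 * 1024)); // typical 5-6 MB account file
    assert!(!spills_to_heap(8_000_000));       // the 99.9th-percentile size
    assert!(spills_to_heap(9 * 1024 * 1024));  // rare large file allocates
}
```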
Force-pushed c2dac3c to 9445866
Got a small merge conflict in imports; the CI is green again. @alessandrod @brooksprumo, whenever one of you gets to it, please re-approve
This seems to have broken v2.2 -> master upgrade compatibility. Is that expected? I'm getting these errors
```markdown
### Validator

#### Breaking
* Require increased `memlock` limits - recommended setting is `LimitMEMLOCK=2000000000` in systemd service configuration. Lack of sufficient limit (on Linux) will cause startup error.
```
Yes, you must set the memlock limit now. I tagged you on the change to the changelog: #6671 (comment) Here's the message on discord for posterity: https://discord.com/channels/428295358100013066/439194979856809985/1398240774315053056
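For reference, the `LimitMEMLOCK=2000000000` value from the changelog can be applied with a systemd drop-in. The file path and unit name below are examples; match them to your validator's actual service unit:

```ini
# e.g. /etc/systemd/system/sol.service.d/memlock.conf (unit name is an example)
[Service]
LimitMEMLOCK=2000000000
```

After adding the drop-in, run `systemctl daemon-reload` and restart the service; `ulimit -l` inside the service environment should then report the raised limit.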
Thanks!
Problem
Unpacking the snapshot uses the `tar` crate's `unpack` for each entry, which performs sync IO and copies data into an intermediate buffer before writing. This blocks and spends a lot of CPU time on syscalls.
Summary of Changes
* Introduce `IoUringFileCreator` (plus a compatibility trait for non-Linux platforms) and use it for creating files while unpacking the snapshot.
* Remove `ArchiveChunker` and perform the whole unpacking in a single thread: all IO is done in background kernel threads (with io_uring), and unless we run out of disk write bandwidth this thread will spend its time on decompression.
* Rename `entry_processor` into `file_path_processor` and only execute it for files (such that the `is_file()` call can be avoided in the only non-trivial call site that filters for files).
This change is more or less performance neutral: untar times depend heavily on the achieved disk read/write throughput (and data layout on the attached disks).
Observed timings:
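As a rough illustration of the compatibility-trait split mentioned in the summary, a minimal shape might look like the sketch below. Every name except `IoUringFileCreator` is an assumption, and the portable fallback shown here simply blocks; the real Linux implementation queues operations on an io_uring instead.

```rust
use std::{fs, io, path::Path};

// Hypothetical trait: file creation that may complete asynchronously,
// with completion only guaranteed after `flush`.
trait FileCreator {
    fn create_file(&mut self, path: &Path, contents: &[u8]) -> io::Result<()>;
    fn flush(&mut self) -> io::Result<()>;
}

// Portable fallback for non-Linux platforms: plain blocking writes.
struct BlockingFileCreator;

impl FileCreator for BlockingFileCreator {
    fn create_file(&mut self, path: &Path, contents: &[u8]) -> io::Result<()> {
        fs::write(path, contents)
    }
    fn flush(&mut self) -> io::Result<()> {
        Ok(()) // nothing queued; every write already completed synchronously
    }
}

fn main() -> io::Result<()> {
    let dir = std::env::temp_dir().join("file_creator_demo");
    fs::create_dir_all(&dir)?;
    let mut creator = BlockingFileCreator;
    creator.create_file(&dir.join("account.bin"), b"data")?;
    creator.flush()?;
    assert_eq!(fs::read(dir.join("account.bin"))?, b"data");
    Ok(())
}
```

The unpacker can then be written against `FileCreator` alone, selecting the io_uring-backed implementation on Linux and the blocking one elsewhere at construction time.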