
Support reading sequence of multiple files with a single read-ahead io_uring file reader #6878

Closed
kskalski wants to merge 1 commit into anza-xyz:master from kskalski:ks/multi_files_read

Conversation

@kskalski

@kskalski kskalski commented Jul 8, 2025

Problem

accounts_db::io_uring::SequentialFileReader supports async read-ahead reads from a specified file, but there are use-cases (accounts storage scan) where we need to read a sequence of many (often small) files.

Creating a reader for each file separately would involve:

  • complexity of managing the buffers those readers use (including reusing a given buffer chunk for another file's read once it has been consumed for the head file)
  • extra overhead of creating io_uring queues
  • difficulty in prioritizing reads of the currently scanned file over those that will be needed in the future

This complexity should be hidden behind a separate wrapper or embedded in SequentialFileReader.

Summary of Changes

  • make file reader be created without specifying a file
  • add APIs for adding (any number of) path / owned file / file references
  • implement a queue of file states tracking progress of reading and consumption of each file
  • add move_to_next_file function that allows transitioning BufRead to next file
  • implement set_file to ensure head file == given file
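
The file-state queue described above can be sketched with plain std types. This is a simplified illustrative model, not the PR's actual implementation: the names MultiFileQueue and FileState are hypothetical, and the real reader additionally tracks buffers and in-flight io_uring operations per file.

```rust
use std::collections::VecDeque;

// Illustrative stand-in for the per-file read state.
#[derive(Debug, PartialEq)]
struct FileState {
    fd: i32,           // raw file descriptor being read ahead
    read_limit: usize, // stop after EOF or this many bytes
}

#[derive(Default)]
struct MultiFileQueue {
    files: VecDeque<FileState>,
}

impl MultiFileQueue {
    // Enqueue another file for read-ahead (FIFO order).
    fn add_file(&mut self, fd: i32, read_limit: usize) {
        self.files.push_back(FileState { fd, read_limit });
    }

    // Drop the head state so reads continue from the next queued file.
    fn move_to_next_file(&mut self) -> Option<FileState> {
        self.files.pop_front()
    }

    // Ensure the head of the queue is `fd`; per the discussion below,
    // set_file takes priority and discards any earlier queued entries.
    fn set_file(&mut self, fd: i32, read_limit: usize) {
        while self.files.front().is_some_and(|s| s.fd != fd) {
            self.files.pop_front();
        }
        if self.files.is_empty() {
            self.files.push_back(FileState { fd, read_limit });
        }
    }
}

fn main() {
    let mut q = MultiFileQueue::default();
    q.add_file(3, 100);
    q.add_file(4, 200);
    q.add_file(5, 300);
    // Scan jumps straight to fd 4: fd 3 is discarded, fd 4 becomes head.
    q.set_file(4, 200);
    assert_eq!(q.files.front().map(|s| s.fd), Some(4));
    q.move_to_next_file();
    assert_eq!(q.files.front().map(|s| s.fd), Some(5));
}
```

The key property this models is that consumption order must match read-ahead order; a set_file that skips ahead simply discards the skipped entries.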
Performance change

Measuring startup lt hash verification (calculate_accounts_lt_hash_at_startup_from_storages) - there is a 14-15% speedup in hashing rate per thread (i.e. wall time of the whole lt hash calculation / number of user threads enabled), e.g.

  • 6 threads: PR 96.11s vs master 113.3s
  • 10 threads: PR 58.58s vs master 69.s
  • 15 threads: PR 40.24s vs master 47.77s
  • 24 threads: PR 29.00s vs master 32.76s

@kskalski kskalski force-pushed the ks/multi_files_read branch 2 times, most recently from 8d349ee to 70a6d6d Compare July 14, 2025 16:01
@kskalski kskalski force-pushed the ks/multi_files_read branch 2 times, most recently from 3788f14 to cb9715b Compare July 24, 2025 07:09
@kskalski kskalski force-pushed the ks/multi_files_read branch from cb9715b to e922afd Compare July 30, 2025 11:00
@kskalski kskalski changed the title io_uring multi-files reader Support reading sequence of multiple files with a single read-ahead io_uring file reader Jul 31, 2025
@kskalski kskalski force-pushed the ks/multi_files_read branch 3 times, most recently from 9478b4d to e9b48fd Compare August 1, 2025 09:31
@codecov-commenter

codecov-commenter commented Aug 1, 2025

Codecov Report

❌ Patch coverage is 98.19121% with 14 lines in your changes missing coverage. Please review.
✅ Project coverage is 83.1%. Comparing base (43b06cb) to head (ecdab21).

Additional details and impacted files
@@           Coverage Diff            @@
##           master    #6878    +/-   ##
========================================
  Coverage    83.1%    83.1%            
========================================
  Files         810      810            
  Lines      357414   357960   +546     
========================================
+ Hits       297133   297651   +518     
- Misses      60281    60309    +28     

@kskalski kskalski marked this pull request as ready for review August 1, 2025 11:02
@kskalski kskalski requested a review from vadorovsky August 1, 2025 11:02
// the lifetime of the operation
self.inner.push(op)?;
/// It is required that the previous file is fully read before calling this method.
pub fn move_to_next_file(&mut self) -> io::Result<()> {
Member

@vadorovsky vadorovsky Aug 6, 2025


Perhaps we could implement Iterator and turn this method into its next implementation? Then we could iterate over SequentialFileReader, which I think would be a nice API. If such an iterator yielded some wrapper type which implements Read, we could do something like:

let mut reader = SequentialFileReader::with_buffer(vec![0; 1024], 512).unwrap();
reader.add_file(temp1.as_file(), 2).unwrap();
reader.add_file(temp2.as_file(), 3).unwrap();
reader.add_file(temp1.as_file(), 4).unwrap();
reader.add_file(temp2.as_file(), 5).unwrap();

for reader in reader {
    let mut reader = reader.unwrap();
    let mut buf = Vec::new();
    reader.read_to_end(&mut buf).unwrap();
    [...]
}

The current API requires calling move_to_next_file manually and kinda forces developers to be explicit about how many files there are, which could be annoying.

If implementing Iterator is too hard or impossible for some reason I'm overlooking, perhaps we could add some method like len() or remaining(), so we can still write a loop?

let remaining_files = reader.remaining();
for _ in 0..remaining_files {
    let mut buf = Vec::new();
    reader.read_to_end(&mut buf).unwrap();
    [...]
    reader.move_to_next_file();
}

Author

@kskalski kskalski Aug 7, 2025


Hm, I think I would find an iterator here to be a bit confusing, since at first sight it's not clear whether it should iterate over bytes or files.

As of now I don't foresee using move_to_next_file directly (maybe it should be made private), as accounts_db has the files stored in a layered data structure and the scan methods iterate over "storages", not files. Because of that, the planned use is mostly through set_file, which "ensures that the specified file is active == at the front of the queue".
So the plan is to get storages in chunks, add them to the read-ahead queue (through add_file) and then let the scan control when it wants to move to a specific file (as long as it moves in the same order as the read-ahead order, the simple approach implemented here will work).

Btw, it is in fact possible to read_to_end, which is provided by Read trait, because we still stop and return 0 from read (or &[] from fill_buf) when we reach the current file's end. Reading from new files starts only after explicit move_to_next_file (or set_file). Finally, in practice the reader might not actually read until the end of file and we need to support moving to next file for this scenario too.
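
The per-file EOF behavior described in the comment above can be modeled with plain std types. This is a minimal sketch, not the PR's io_uring implementation: read returns 0 at the current file's end, and only an explicit move_to_next_file starts yielding the next file's bytes, so read_to_end naturally stops at each file boundary.

```rust
use std::io::{Cursor, Read};

// Hypothetical simplified model of the multi-file reader's Read behavior.
struct MultiFileRead {
    files: Vec<Cursor<Vec<u8>>>,
    current: usize,
}

impl MultiFileRead {
    // Advance to the next queued file; until then, reads stay at EOF.
    fn move_to_next_file(&mut self) {
        self.current += 1;
    }
}

impl Read for MultiFileRead {
    fn read(&mut self, buf: &mut [u8]) -> std::io::Result<usize> {
        match self.files.get_mut(self.current) {
            // Returns 0 once the current file is drained, like EOF.
            Some(cur) => cur.read(buf),
            None => Ok(0),
        }
    }
}

fn main() -> std::io::Result<()> {
    let mut r = MultiFileRead {
        files: vec![
            Cursor::new(b"first".to_vec()),
            Cursor::new(b"second".to_vec()),
        ],
        current: 0,
    };
    let mut buf = Vec::new();
    r.read_to_end(&mut buf)?; // stops at the first file's end
    assert_eq!(buf, b"first");
    r.move_to_next_file();
    buf.clear();
    r.read_to_end(&mut buf)?;
    assert_eq!(buf, b"second");
    Ok(())
}
```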

Author


I rebased on top of the actual definition for FileBufRead that is now used in

reader.set_file(file, self.len())?;

inner: Ring<SequentialFileReaderState, ReadOp>,
owned_files: VecDeque<File>,
Member


I'm struggling to understand the purpose of this queue.

SequentialFileReaderState uses the raw file descriptors, added through add_file_by_fd.

Then the only usage of owned_files I see is in move_to_next_file, where we first get a file descriptor from the state:

        let Some(mut file_state) = state.files.pop_front() else {
            return Ok(());
        };

to then compare that file descriptor with the owned file from the queue:

        if self
            .owned_files
            .front()
            .is_some_and(|f| file_state.is_same_file(f))
        {
            self.owned_files.pop_front();
        }

What's the point of this comparison? I was wondering whether it's some kind of integrity check, but if this statement isn't true, nothing happens.

Author


The implementation supports adding owned files (add_path, add_file_owned - this is the only way used for now) and/or file references (add_file / set_file). Owning the file ensures it is not closed while it is being read in the background, while still supporting an encapsulating API (e.g. you can have fn new_reader(path) -> SequentialFileReader).

What's the point of this comparison? I was wondering whether it's some kind of integrity check, but if this statement isn't true, nothing happens.

That code basically allows mixing the two ways of adding files; the assumption is that when an owned file is being read (at the front of state.files), it is also at the front of self.owned_files, so when moving on to the next file, we should advance both queues.
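
The two-queue invariant described here can be sketched with plain std types. This is an illustrative model, not the PR's code: fds stand in for the per-file state and owned File handles, and the point is only that the two heads are popped together when they refer to the same file.

```rust
use std::collections::VecDeque;

// Hypothetical simplified model: `files` holds per-file read state for
// both owned files and borrowed references, in read-ahead order;
// `owned_files` keeps only the owned ones alive.
struct Reader {
    files: VecDeque<i32>,       // fds in read-ahead order
    owned_files: VecDeque<i32>, // subset of fds the reader owns
}

impl Reader {
    fn move_to_next_file(&mut self) {
        let Some(fd) = self.files.pop_front() else {
            return;
        };
        // If the finished head was an owned file, release it too;
        // for a borrowed file reference nothing needs to happen.
        if self.owned_files.front() == Some(&fd) {
            self.owned_files.pop_front();
        }
    }
}

fn main() {
    let mut r = Reader {
        files: VecDeque::from([3, 4, 5]), // fd 4 is the only owned file
        owned_files: VecDeque::from([4]),
    };
    r.move_to_next_file(); // fd 3 (borrowed): owned queue untouched
    assert_eq!(r.owned_files.len(), 1);
    r.move_to_next_file(); // fd 4 (owned): owned queue popped too
    assert!(r.owned_files.is_empty());
}
```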

@kskalski kskalski force-pushed the ks/multi_files_read branch from e9b48fd to 65953e8 Compare August 7, 2025 20:35
@kskalski kskalski requested a review from brooksprumo August 8, 2025 19:49
@kskalski
Author

kskalski commented Aug 8, 2025

I think this code is now close to being usable in accounts-db scans - I already have a branch that plugs it in. The final way to do that might involve a small generalization of the API defined here (add_file will probably get into the FileBufRead trait, maybe as add_file_readahead or something like that), but I don't foresee major changes to this code.

@brooksprumo let me know if you can do a high level look at this or include others to review - unfortunately this PR rewrites a big part of existing impl, but I still don't feel like it deserves a separate mod / struct, since it would duplicate a lot of code.

@brooksprumo brooksprumo requested a review from vadorovsky August 11, 2025 15:34
@brooksprumo

@brooksprumo let me know if you can do a high level look at this or include others to review - unfortunately this PR rewrites a big part of existing impl, but I still don't feel like it deserves a separate mod / struct, since it would duplicate a lot of code.

I need to do another deeper pass. My initial thought was that the interaction of adding files and moving to the next file felt strange. Like you'd add a file, but it had to already be the next one in the list to work on. Or you could add the same file again too, as long as it was next. I think I need to spend more time looking at the actual uses since I believe I have some details wrong.

Comment on lines +144 to +153
/// Add `file` reference to read. Starts reading the file as soon as a buffer is available.
///
/// The read finishes when EOF is reached or `read_limit` bytes are read.
/// Multiple files can be added to the reader and they will be read-ahead in FIFO order.
///
/// Lifetime of reference is tied to the reader's lifetime.
#[allow(unused)]
pub fn add_file(&mut self, file: &'a File, read_limit: usize) -> io::Result<()> {
self.add_file_by_fd(file.as_raw_fd(), read_limit)
}

For example it seems unsafe to add a file here without the actual File getting added to self.owned_files. I see it is used, so the #[allow(unused)] is strange. Maybe this fn is fine, just not as a public method.

Author

@kskalski kskalski Aug 11, 2025


In this PR this function is only used in tests.

The intended usage is:

  • do a chunk of add_file(s)
  • then repeat the same sequence (or shorter, it works as FIFO) doing set_file, read(s)
  • repeat

Clearly there are many possible weird uses of those functions, though I think no sequence will result in any error (i.e. set_file takes absolute priority; it will discard any added files if the sequence is not right) - only the above-mentioned one is really useful.

Ok, I think the names of those functions could be changed to:

  • add_readahead_file
  • activate_file

The safety of the add_file operation relies on the 'a lifetime, i.e. there is some guarantee that we can use the passed FD while our own object is alive; in theory someone could still close the file / FD, but that is forbidden in the doc comment.

Author


adding file references or owned files is conceptually very similar, but in our codebase it is / will be used in a disjoint way:

  • we create a single reader with owned snapshot archive file, then we just read that single file until the end
  • we create a reader for accounts files, add readahead file references, then activate and read one by one

@brooksprumo brooksprumo self-requested a review August 11, 2025 18:59
@kskalski
Author

Changed names to add_(owned_)file_to_prefetch and activate_file.
I think add_file_to_prefetch will be moved to FileBufRead trait once we change uses to do read-ahead.

Comment thread Cargo.toml
@brooksprumo brooksprumo self-requested a review August 12, 2025 16:59
@kskalski kskalski force-pushed the ks/multi_files_read branch from 10aea1f to 7c419f0 Compare August 12, 2025 19:37
@alessandrod alessandrod self-requested a review August 13, 2025 15:23

@brooksprumo brooksprumo left a comment


I've done a few passes over the code, but I think I need another person who can do a proper review of sequential_file_reader.rs.

// the lifetime of the operation
self.inner.push(op)?;
/// Lifetime of reference is tied to the reader's lifetime.
#[allow(unused)]

nit: Since this method is currently only used by tests, let's update the annotation.

Suggested change
#[allow(unused)]
#[cfg(test)]

Or if this is only intended to ever be called by tests, let's move the method into the tests submodule.

Author


I take back my previous comment - it is actually used in the trait function activate_file as a fallback for when the file specified for activation was never added to prefetch, which is a valid situation (this is how the regular BufferedReader is used)

@kskalski
Author

Quick update on this project / PR - I'm iterating on top of this code to tune performance and beat current master's approach.

The changes I have still don't affect implementation here too much:

  • the biggest gain in read throughput / CPU is by dropping sqpoll, setting ASYNC flag on read ops and using many kernel worker threads (previously I was testing with shared sqpoll that used only 1-2 worker threads, which all-in-all was not even as good as master)
  • various buffer and sq queue sizing approaches
  • some low-level optimizations such as downsizing usize (or Option<usize>) to u32 / u16, unrolling option transforms, inlining - those provide a small gain, though that is a lot of mechanical changes

So I plan to land the above independently, either directly into master or after this PR is merged, but if anyone has a preference to review a state closer to the final version, I can do a series of commits in this PR.

@alessandrod

but if anyone has preference to review a state closer to final version

I'd prefer to review the final state here

Comment thread accounts-db/src/accounts_db.rs Outdated
let new_indices = storages.take_up_to_capacity(&mut chunk);
let new_files =
chunk.range(new_indices).filter_map(|s| s.accounts.file());
reader.add_files_to_prefetch(new_files).unwrap();

Please no bare unwraps without a SAFETY comment indicating why this can never fail.

Author


changed to expect, similarly to how the code below this function panics on scan error instead of returning the error up the stack

Comment thread accounts-db/src/accounts_file.rs Outdated
/// Return the `File` and size of the underlying `AppendVec` account file.
pub fn file(&self) -> Option<(&File, usize)> {
match self {
Self::AppendVec(av) => Some((av.file(), av.len())),

Note that .len() is not the size of the file. If you want the file size, use .capacity().

Suggested change
Self::AppendVec(av) => Some((av.file(), av.len())),
Self::AppendVec(av) => Some((av.file(), av.capacity())),

Comment thread accounts-db/src/append_vec.rs Outdated
match self.backing {
AppendVecFileBacking::File(ref file) => file,
AppendVecFileBacking::Mmap(_) => {
panic!("Memory-backed AppendVec does not have a file")

The append vec here is not memory backed; there is an underlying file, and we could get it if needed. Do we want to do that? I dunno. Note that RPC providers are still using Mmap file backing in v2.3, so we need to ensure they don't panic.

Suggested change
panic!("Memory-backed AppendVec does not have a file")
panic!("Memory-mapped AppendVec does not have a File")

Author


I was considering alternative APIs to avoid such pitfalls, e.g. adding a function
add_file_prefetch_to_reader(&self, reader: impl FileBufRead)
but that would require a different approach for submitting IO ops to the kernel (probably a separate fn trigger_prefetch() in FileBufRead)

I think getting a file is cleaner, but possibly I could always make it return an Option.

Comment thread accounts-db/src/append_vec.rs Outdated
const READ_SIZE: usize = 512 * 1024;
// scan accounts implementations will submit operations to kernel using
// FileBufRead::add_files_to_prefetch - just make sure queue size can hold all buffers.
const RING_QSIZE: u32 = (SCAN_ACCOUNTS_BUFFER_SIZE / READ_SIZE) as u32;

Looks like this truncates. I think we should either (1) assert remainder is zero, or (2) round up.

Suggested change
const RING_QSIZE: u32 = (SCAN_ACCOUNTS_BUFFER_SIZE / READ_SIZE) as u32;
const RING_QSIZE: u32 = SCAN_ACCOUNTS_BUFFER_SIZE.div_ceil(READ_SIZE) as u32;
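
The truncation concern can be illustrated with hypothetical sizes: if the buffer is not an exact multiple of READ_SIZE, integer division under-counts the queue entries needed, while div_ceil rounds up so every buffer chunk gets a slot. (The sizes below are made up for the example, not the PR's actual constants.)

```rust
fn main() {
    let read_size: usize = 512 * 1024;
    // Deliberately one byte past a multiple of read_size.
    let buffer_size: usize = 5 * read_size + 1;

    // Plain division truncates: one chunk would have no queue slot.
    let truncated = (buffer_size / read_size) as u32;
    // div_ceil rounds up, covering the partial final chunk.
    let rounded_up = buffer_size.div_ceil(read_size) as u32;

    assert_eq!(truncated, 5);
    assert_eq!(rounded_up, 6);
}
```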

Comment thread accounts-db/src/io_uring/memory.rs
Comment thread accounts-db/src/accounts_file.rs Outdated
pub fn file(&self) -> Option<(&File, usize)> {
match self {
Self::AppendVec(av) => Some((av.file(), av.len())),
Self::TieredStorage(_) => None,

TieredStorage does have an underlying file, so I don't love the None here. Maybe we say unimplemented!() and remove the Option from the return type? (If going that route, we need doc comments indicating as much.)

Author


yeah, hard to say - as mentioned in another comment, we could go with all Option / all unimplemented!() / a completely different API to prefetch. Seems like the easiest will be unimplemented!() and no Options, but I want to be sure this would fail early when somebody uses the API in a wrong way, as opposed to failing late at runtime.

Author


I reverted back to using Option and renamed this accessor to indicate we want to fetch file-io information. The returned information is used for prefetch, which is only applicable to file-io, so we need to filter out storages that are not based on file-io. We may encounter those here when storage_access is set to mmap (with the current logic of AppendVec's "reopen as readonly" we may actually be running with a mix of mmap and file-io AppendVecs when storage access is set to mmap, since reopening always returns file-io).

@kskalski
Author

I'm done with optimizations and updating APIs to let the scan do prefetching. Typical percentage of CPU used for hashing (as opposed to buffer ops and syscalls) is now 97%; I will post some numbers tomorrow comparing wall-time gains relative to master.

Seems like the majority of the review will fall on @alessandrod, who prefers to look at the whole solution in one PR, so I have now included all the code that uses the new prefetch APIs and how I changed the lt hash verification scan.

@brooksprumo this actually brings a bunch of changes that are more specific to accounts-db code; this way at least there are no more "unused" blocks. Thanks for the quick pass.

@kskalski kskalski added the CI Pull Request is ready to enter CI label Aug 19, 2025
@anza-team anza-team removed the CI Pull Request is ready to enter CI label Aug 19, 2025
@brooksprumo brooksprumo self-requested a review August 19, 2025 20:02
@kskalski
Author

I discovered a small regression in snapshot unpacking after the removal of sqpoll from the reader, so I tuned ring options for the tar archive reader and file creator - they come with explanations in code comments. This seems to be fixed now, though I'm a bit puzzled, since all my tests indicate that mixing reads and writes on the same ring's kernel workers slows down the writes... (a separate surprise being that without sqpoll both rings created in the same user thread (solTarUnpack) will share the same kernel worker pool).

@kskalski
Author

io_uring_scan_lthash current performance comparison for `calculate_accounts_lt_hash_at_startup_from_storages`

@kskalski
Copy link
Copy Markdown
Author

kskalski commented Sep 2, 2025

@brooksprumo - rebased to resolve conflict and tweaked back accounts-db scan code to better support all storage access configs (since we still need to allow running with mmap mode)
@alessandrod - we missed 3.1 cut, the change is large and I suppose won't be accepted in BP, so I hope to get the main code in before 4.0

@alessandrod

@brooksprumo - rebased to resolve conflict and tweaked back accounts-db scan code to better support all storage access configs (since we still need to allow running with mmap mode) @alessandrod - we missed 3.1 cut, the change is large and I suppose won't be accepted in BP, so I hope to get the main code in before 4.0

uh surely we can still aim for 3.1?

@kskalski kskalski force-pushed the ks/multi_files_read branch from 1070a3b to 50edc58 Compare September 2, 2025 19:37
@kskalski
Author

kskalski commented Sep 3, 2025

uh surely we can still aim for 3.1?

Could be! I got confused - master is marked as 3.1, but there isn't a 3.1 tag yet, so we are now composing the solution that will go out then. In the meantime Brooks consolidated all start-up scans into a single pass; I will rebase and update the numbers once his change is in.
The io_uring impl shouldn't be affected by this, though.

@kskalski kskalski force-pushed the ks/multi_files_read branch 2 times, most recently from 5c0ab57 to 62d9650 Compare September 5, 2025 14:38
@kskalski
Author

kskalski commented Sep 5, 2025

Well, Brooks ate most of the benefits by running a single scan on all cores as part of gen index:
io_uring_scan_gen_index

Around 30 threads we saturate the disk read bandwidth, which makes the io_uring version poll / submit syscalls for completions. The end result is that idle CPU time masks any IO-related overhead, since we use a fast work-stealing method and other threads will take on work that is delayed by busy threads wasting CPU on syscalls.

There is still a benefit comparable to the one reported earlier if we generate the index with fewer cores. It appears we don't do any whole accounts-db scans now apart from start-up, so we need to reconsider whether we need this utility / approach.

@kskalski kskalski force-pushed the ks/multi_files_read branch from ecdab21 to e8ce318 Compare October 3, 2025 05:13
@kskalski kskalski marked this pull request as draft November 14, 2025 10:59
@brooksprumo brooksprumo removed their request for review December 12, 2025 03:59
@kskalski kskalski closed this Jan 14, 2026
@kskalski kskalski deleted the ks/multi_files_read branch January 27, 2026 08:42