
Support reading sequence of multiple files with a single read-ahead io_uring file reader #6878

Closed
kskalski wants to merge 1 commit into anza-xyz:master from kskalski:ks/multi_files_read

Conversation

@kskalski

@kskalski kskalski commented Jul 8, 2025

Problem

accounts_db::io_uring::SequentialFileReader supports async read-ahead reads from a specified file, but there are use-cases (accounts storage scan) where we need to read a sequence of many (often small) files.

Creating a reader for each file separately would involve:

  • complexity of managing the buffers those readers use (including reusing a given buffer chunk for another file's read once it has been consumed for the head file)
  • extra overhead of creating io_uring queues
  • difficulty in prioritizing reads of the currently scanned file over those that will be needed in the future

This complexity should be hidden behind a separate wrapper or embedded in SequentialFileReader.

Summary of Changes

  • make file reader be created without specifying a file
  • add APIs for adding (any number of) path / owned file / file references
  • implement a queue of file states tracking progress of reading and consumption of each file
  • add move_to_next_file function that allows transitioning BufRead to next file
  • implement set_file to ensure head file == given file
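
The file-state queue described above can be sketched with plain std types. This is a simplified illustrative model, not the PR's actual implementation: the names MultiFileQueue and FileState are hypothetical, and the real reader additionally tracks buffers and in-flight io_uring operations per file.

```rust
use std::collections::VecDeque;

// Illustrative stand-in for the per-file read state.
#[derive(Debug, PartialEq)]
struct FileState {
    fd: i32,           // raw file descriptor being read ahead
    read_limit: usize, // stop after EOF or this many bytes
}

#[derive(Default)]
struct MultiFileQueue {
    files: VecDeque<FileState>,
}

impl MultiFileQueue {
    // Enqueue another file for read-ahead (FIFO order).
    fn add_file(&mut self, fd: i32, read_limit: usize) {
        self.files.push_back(FileState { fd, read_limit });
    }

    // Drop the head state so reads continue from the next queued file.
    fn move_to_next_file(&mut self) -> Option<FileState> {
        self.files.pop_front()
    }

    // Ensure the head of the queue is `fd`; per the discussion below,
    // set_file takes priority and discards any earlier queued entries.
    fn set_file(&mut self, fd: i32, read_limit: usize) {
        while self.files.front().is_some_and(|s| s.fd != fd) {
            self.files.pop_front();
        }
        if self.files.is_empty() {
            self.files.push_back(FileState { fd, read_limit });
        }
    }
}

fn main() {
    let mut q = MultiFileQueue::default();
    q.add_file(3, 100);
    q.add_file(4, 200);
    q.add_file(5, 300);
    // Scan jumps straight to fd 4: fd 3 is discarded, fd 4 becomes head.
    q.set_file(4, 200);
    assert_eq!(q.files.front().map(|s| s.fd), Some(4));
    q.move_to_next_file();
    assert_eq!(q.files.front().map(|s| s.fd), Some(5));
}
```

The key property this models is that consumption order must match read-ahead order; a set_file that skips ahead simply discards the skipped entries.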
Performance change

Measuring startup lt hash verification (calculate_accounts_lt_hash_at_startup_from_storages) - there is a 14-15% speedup in hashing rate per thread (i.e. wall time of the whole lt hash calculation / number of user threads enabled), e.g.

  • 6 threads: PR 96.11s vs master 113.3s
  • 10 threads: PR 58.58s vs master 69.s
  • 15 threads: PR 40.24s vs master 47.77s
  • 24 threads: PR 29.00s vs master 32.76s

@kskalski kskalski force-pushed the ks/multi_files_read branch 2 times, most recently from 8d349ee to 70a6d6d Compare July 14, 2025 16:01
@kskalski kskalski force-pushed the ks/multi_files_read branch 2 times, most recently from 3788f14 to cb9715b Compare July 24, 2025 07:09
@kskalski kskalski force-pushed the ks/multi_files_read branch from cb9715b to e922afd Compare July 30, 2025 11:00
@kskalski kskalski changed the title io_uring multi-files reader Support reading sequence of multiple files with a single read-ahead io_uring file reader Jul 31, 2025
@kskalski kskalski force-pushed the ks/multi_files_read branch 3 times, most recently from 9478b4d to e9b48fd Compare August 1, 2025 09:31
@codecov-commenter

codecov-commenter commented Aug 1, 2025

Codecov Report

❌ Patch coverage is 98.19121% with 14 lines in your changes missing coverage. Please review.
✅ Project coverage is 83.1%. Comparing base (43b06cb) to head (ecdab21).

Additional details and impacted files
@@           Coverage Diff            @@
##           master    #6878    +/-   ##
========================================
  Coverage    83.1%    83.1%            
========================================
  Files         810      810            
  Lines      357414   357960   +546     
========================================
+ Hits       297133   297651   +518     
- Misses      60281    60309    +28     

@kskalski kskalski marked this pull request as ready for review August 1, 2025 11:02
@kskalski kskalski requested a review from vadorovsky August 1, 2025 11:02
// the lifetime of the operation
self.inner.push(op)?;
/// It is required that the previous file is fully read before calling this method.
pub fn move_to_next_file(&mut self) -> io::Result<()> {
Member

@vadorovsky vadorovsky Aug 6, 2025


Perhaps we could implement Iterator and turn this method into its next implementation? Then we could iterate over SequentialFileReader, which I think would be a nice API. If such an iterator yielded some wrapper type which implements Read, we could do something like:

let mut reader = SequentialFileReader::with_buffer(vec![0; 1024], 512).unwrap();
reader.add_file(temp1.as_file(), 2).unwrap();
reader.add_file(temp2.as_file(), 3).unwrap();
reader.add_file(temp1.as_file(), 4).unwrap();
reader.add_file(temp2.as_file(), 5).unwrap();

for reader in reader {
    let mut reader = reader.unwrap();
    let mut buf = Vec::new();
    reader.read_to_end(&mut buf).unwrap();
    [...]
}

The current API requires calling move_to_next_file manually and kinda forces developers to be explicit about how many files there are, which could be annoying.

If implementing Iterator is too hard or impossible for some reason I'm overlooking, perhaps we could add some method like len() or remaining(), so we can still write a loop?

let remaining_files = reader.remaining();
for _ in 0..remaining_files {
    let mut buf = Vec::new();
    reader.read_to_end(&mut buf).unwrap();
    [...]
    reader.move_to_next_file();
}

Author

@kskalski kskalski Aug 7, 2025


Hm, I think I would find an iterator here to be a bit confusing, since at first sight it's not clear whether it should iterate over bytes or files.

As of now I don't foresee using move_to_next_file directly (maybe it should be made private), as accounts_db has the files stored in a layered data structure and the scan methods iterate over "storages", not files. Because of that, the planned use is mostly through set_file, which "ensures that the specified file is active == at the front of the queue".
So the plan is to get storages in chunks, add them to the read-ahead queue (through add_file) and then let the scan control when it wants to move to a specific file (as long as it moves in the same order as the read-ahead order, the simple approach implemented here will work).

Btw, it is in fact possible to read_to_end, which is provided by Read trait, because we still stop and return 0 from read (or &[] from fill_buf) when we reach the current file's end. Reading from new files starts only after explicit move_to_next_file (or set_file). Finally, in practice the reader might not actually read until the end of file and we need to support moving to next file for this scenario too.
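
The per-file EOF behavior described in the comment above can be modeled with plain std types. This is a minimal sketch, not the PR's io_uring implementation: read returns 0 at the current file's end, and only an explicit move_to_next_file starts yielding the next file's bytes, so read_to_end naturally stops at each file boundary.

```rust
use std::io::{Cursor, Read};

// Hypothetical simplified model of the multi-file reader's Read behavior.
struct MultiFileRead {
    files: Vec<Cursor<Vec<u8>>>,
    current: usize,
}

impl MultiFileRead {
    // Advance to the next queued file; until then, reads stay at EOF.
    fn move_to_next_file(&mut self) {
        self.current += 1;
    }
}

impl Read for MultiFileRead {
    fn read(&mut self, buf: &mut [u8]) -> std::io::Result<usize> {
        match self.files.get_mut(self.current) {
            // Returns 0 once the current file is drained, like EOF.
            Some(cur) => cur.read(buf),
            None => Ok(0),
        }
    }
}

fn main() -> std::io::Result<()> {
    let mut r = MultiFileRead {
        files: vec![
            Cursor::new(b"first".to_vec()),
            Cursor::new(b"second".to_vec()),
        ],
        current: 0,
    };
    let mut buf = Vec::new();
    r.read_to_end(&mut buf)?; // stops at the first file's end
    assert_eq!(buf, b"first");
    r.move_to_next_file();
    buf.clear();
    r.read_to_end(&mut buf)?;
    assert_eq!(buf, b"second");
    Ok(())
}
```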

Author


I rebased on top of the actual definition for FileBufRead that is now used in

reader.set_file(file, self.len())?;

inner: Ring<SequentialFileReaderState, ReadOp>,
owned_files: VecDeque<File>,
Member


I'm struggling to understand the purpose of this queue.

SequentialFileReaderState uses the raw file descriptors, added through add_file_by_fd.

Then the only usage of owned_files I see is in move_to_next_file, where we first get a file descriptor from the state:

        let Some(mut file_state) = state.files.pop_front() else {
            return Ok(());
        };

to then compare that file descriptor with the owned file from the queue:

        if self
            .owned_files
            .front()
            .is_some_and(|f| file_state.is_same_file(f))
        {
            self.owned_files.pop_front();
        }

What's the point of this comparison? I was wondering whether it's some kind of integrity check, but if this statement isn't true, nothing happens.

Author


The implementation supports adding owned files (add_path, add_file_owned - this is the only way used for now) and/or file references (add_file / set_file). Owning the file ensures it is not closed while it is being read in the background, while still supporting an encapsulating API (e.g. you can have fn new_reader(path) -> SequentialFileReader).

What's the point of this comparison? I was wondering whether it's some kind of integrity check, but if this statement isn't true, nothing happens.

That code basically allows mixing the two ways of adding files; the assumption is that when an owned file is being read (at the front of state.files), it is also at the front of self.owned_files, so when moving on to the next file, we should advance both queues.
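
The two-queue invariant described here can be sketched with plain std types. This is an illustrative model, not the PR's code: fds stand in for the per-file state and owned File handles, and the point is only that the two heads are popped together when they refer to the same file.

```rust
use std::collections::VecDeque;

// Hypothetical simplified model: `files` holds per-file read state for
// both owned files and borrowed references, in read-ahead order;
// `owned_files` keeps only the owned ones alive.
struct Reader {
    files: VecDeque<i32>,       // fds in read-ahead order
    owned_files: VecDeque<i32>, // subset of fds the reader owns
}

impl Reader {
    fn move_to_next_file(&mut self) {
        let Some(fd) = self.files.pop_front() else {
            return;
        };
        // If the finished head was an owned file, release it too;
        // for a borrowed file reference nothing needs to happen.
        if self.owned_files.front() == Some(&fd) {
            self.owned_files.pop_front();
        }
    }
}

fn main() {
    let mut r = Reader {
        files: VecDeque::from([3, 4, 5]), // fd 4 is the only owned file
        owned_files: VecDeque::from([4]),
    };
    r.move_to_next_file(); // fd 3 (borrowed): owned queue untouched
    assert_eq!(r.owned_files.len(), 1);
    r.move_to_next_file(); // fd 4 (owned): owned queue popped too
    assert!(r.owned_files.is_empty());
}
```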

@kskalski kskalski force-pushed the ks/multi_files_read branch from e9b48fd to 65953e8 Compare August 7, 2025 20:35
@kskalski kskalski requested a review from brooksprumo August 8, 2025 19:49
@kskalski
Author

kskalski commented Aug 8, 2025

I think this code is now close to being usable in accounts-db scans - I already have a branch that plugs it in. The final way to do that might involve a small generalization of the API defined here (add_file will probably get into the FileBufRead trait, maybe as add_file_readahead or something like that), but I don't foresee major changes to this code.

@brooksprumo let me know if you can do a high level look at this or include others to review - unfortunately this PR rewrites a big part of existing impl, but I still don't feel like it deserves a separate mod / struct, since it would duplicate a lot of code.

@brooksprumo brooksprumo requested a review from vadorovsky August 11, 2025 15:34
@brooksprumo

@brooksprumo let me know if you can do a high level look at this or include others to review - unfortunately this PR rewrites a big part of existing impl, but I still don't feel like it deserves a separate mod / struct, since it would duplicate a lot of code.

I need to do another deeper pass. My initial thought was that the interaction of adding files and moving to the next file felt strange. Like you'd add a file, but it had to already be the next one in the list to work on. Or you could add the same file again too, as long as it was next. I think I need to spend more time looking at the actual uses since I believe I have some details wrong.

Comment on lines +144 to +153
/// Add `file` reference to read. Starts reading the file as soon as a buffer is available.
///
/// The read finishes when EOF is reached or `read_limit` bytes are read.
/// Multiple files can be added to the reader and they will be read-ahead in FIFO order.
///
/// Lifetime of reference is tied to the reader's lifetime.
#[allow(unused)]
pub fn add_file(&mut self, file: &'a File, read_limit: usize) -> io::Result<()> {
self.add_file_by_fd(file.as_raw_fd(), read_limit)
}

For example it seems unsafe to add a file here without the actual File getting added to self.owned_files. I see it is used, so the #[allow(unused)] is strange. Maybe this fn is fine, just not as a public method.

Author

@kskalski kskalski Aug 11, 2025


In this PR this function is only used in tests.

The intended usage is:

  • do a chunk of add_file(s)
  • then repeat the same sequence (or shorter, it works as FIFO) doing set_file, read(s)
  • repeat

Clearly there are many possible weird uses of those functions, though I think no sequence will result in any error (i.e. set_file takes absolute priority; it will discard any added files if the sequence is not right) - only the above-mentioned one is really useful.

Ok, I think the names of those functions could be changed to:

  • add_readahead_file
  • activate_file

The safety of the add_file operation relies on the 'a lifetime, i.e. there is some guarantee that we can use the passed FD while our own object is alive; in theory someone could still close the file / FD, but that is forbidden in the doc comment.

Author


adding file references or owned files is conceptually very similar, but in our codebase it is / will be used in a disjoint way:

  • we create a single reader with owned snapshot archive file, then we just read that single file until the end
  • we create a reader for accounts files, add readahead file references, then activate and read one by one

@brooksprumo brooksprumo self-requested a review August 11, 2025 18:59
@kskalski
Author

Changed names to add_(owned_)file_to_prefetch and activate_file.
I think add_file_to_prefetch will be moved to FileBufRead trait once we change uses to do read-ahead.

Comment thread Cargo.toml
@brooksprumo brooksprumo self-requested a review August 12, 2025 16:59
@kskalski kskalski force-pushed the ks/multi_files_read branch from 10aea1f to 7c419f0 Compare August 12, 2025 19:37
@alessandrod alessandrod self-requested a review August 13, 2025 15:23

@brooksprumo brooksprumo left a comment


I've done a few passes over the code, but I think I need another person who can do a proper review of sequential_file_reader.rs.

// the lifetime of the operation
self.inner.push(op)?;
/// Lifetime of reference is tied to the reader's lifetime.
#[allow(unused)]

nit: Since this method is currently only used by tests, let's update the annotation.

Suggested change
#[allow(unused)]
#[cfg(test)]

Or if this is only intended to ever be called by tests, let's move the method into the tests submodule.

Author


I take back my previous comment - it is actually used in the trait function activate_file as a fallback for when the file specified for activation was never added to prefetch, which is a valid situation (this is how the regular BufferedReader is used)

@kskalski
Author

Quick update on this project / PR - I'm iterating on top of this code to tune performance and beat current master's approach.

The changes I have still don't affect implementation here too much:

  • the biggest gain in read throughput / CPU is by dropping sqpoll, setting ASYNC flag on read ops and using many kernel worker threads (previously I was testing with shared sqpoll that used only 1-2 worker threads, which all-in-all was not even as good as master)
  • various buffer and sq queue sizing approaches
  • some low-level optimizations such as downsizing usize (or Option<usize>) to u32 / u16, unrolling option transforms, inlining - those provide a small gain, though that is a lot of mechanical changes

So I plan to land the above independently, either directly into master or after this PR is merged, but if anyone has a preference to review a state closer to the final version, I can do a series of commits in this PR.

@alessandrod

but if anyone has preference to review a state closer to final version

I'd prefer to review the final state here

Comment thread accounts-db/src/accounts_db.rs Outdated
let new_indices = storages.take_up_to_capacity(&mut chunk);
let new_files =
chunk.range(new_indices).filter_map(|s| s.accounts.file());
reader.add_files_to_prefetch(new_files).unwrap();

Please no bare unwraps without a SAFETY comment indicating why this can never fail.

Author


changed to expect, similarly to how the code below this function panics on scan error instead of returning the error up the stack

Comment thread accounts-db/src/accounts_file.rs Outdated
/// Return the `File` and size of the underlying `AppendVec` account file.
pub fn file(&self) -> Option<(&File, usize)> {
match self {
Self::AppendVec(av) => Some((av.file(), av.len())),

Note that .len() is not the size of the file. If you want the file size, use .capacity().

Suggested change
Self::AppendVec(av) => Some((av.file(), av.len())),
Self::AppendVec(av) => Some((av.file(), av.capacity())),

Comment thread accounts-db/src/append_vec.rs Outdated
match self.backing {
AppendVecFileBacking::File(ref file) => file,
AppendVecFileBacking::Mmap(_) => {
panic!("Memory-backed AppendVec does not have a file")

The append vec here is not memory backed; there is an underlying file, and we could get it if needed. Do we want to do that? I dunno. Note that RPC providers are still using Mmap file backing in v2.3, so we need to ensure they don't panic.

Suggested change
panic!("Memory-backed AppendVec does not have a file")
panic!("Memory-mapped AppendVec does not have a File")

Author


I was considering alternative APIs to avoid such pitfalls, e.g. adding a function
add_file_prefetch_to_reader(&self, reader: impl FileBufRead)
but that would require a different approach for submitting IO ops to the kernel (probably a separate fn trigger_prefetch() in FileBufRead)

I think getting a file is cleaner, but possibly I could always make it return an Option.

Comment thread accounts-db/src/append_vec.rs Outdated
const READ_SIZE: usize = 512 * 1024;
// scan accounts implementations will submit operations to kernel using
// FileBufRead::add_files_to_prefetch - just make sure queue size can hold all buffers.
const RING_QSIZE: u32 = (SCAN_ACCOUNTS_BUFFER_SIZE / READ_SIZE) as u32;

Looks like this truncates. I think we should either (1) assert remainder is zero, or (2) round up.

Suggested change
const RING_QSIZE: u32 = (SCAN_ACCOUNTS_BUFFER_SIZE / READ_SIZE) as u32;
const RING_QSIZE: u32 = SCAN_ACCOUNTS_BUFFER_SIZE.div_ceil(READ_SIZE) as u32;
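
The truncation concern can be illustrated with hypothetical sizes: if the buffer is not an exact multiple of READ_SIZE, integer division under-counts the queue entries needed, while div_ceil rounds up so every buffer chunk gets a slot. (The sizes below are made up for the example, not the PR's actual constants.)

```rust
fn main() {
    let read_size: usize = 512 * 1024;
    // Deliberately one byte past a multiple of read_size.
    let buffer_size: usize = 5 * read_size + 1;

    // Plain division truncates: one chunk would have no queue slot.
    let truncated = (buffer_size / read_size) as u32;
    // div_ceil rounds up, covering the partial final chunk.
    let rounded_up = buffer_size.div_ceil(read_size) as u32;

    assert_eq!(truncated, 5);
    assert_eq!(rounded_up, 6);
}
```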

Comment thread accounts-db/src/io_uring/memory.rs
Comment thread accounts-db/src/accounts_file.rs Outdated
pub fn file(&self) -> Option<(&File, usize)> {
match self {
Self::AppendVec(av) => Some((av.file(), av.len())),
Self::TieredStorage(_) => None,

TieredStorage does have an underlying file, so I don't love the None here. Maybe we say unimplemented!() and remove the Option from the return type? (If going that route, we need doc comments indicating as much.)

Author


yeah, hard to say - as mentioned in another comment, we could go with all Option / all unimplemented!() / a completely different API to prefetch. Seems like the easiest will be unimplemented!() and no Options, but I want to be sure this would fail early when somebody uses the API in a wrong way, as opposed to failing late at runtime.

Author


I reverted back to using Option and renamed this accessor to indicate we want to fetch file-io information. The returned information is used for prefetch, which is only applicable to file-io, so we need to filter out storages that are not based on file-io. We may encounter those here when storage_access is set to mmap (with the current logic of AppendVec's "reopen as readonly" we may actually be running with a mix of mmap and file-io AppendVecs when storage access is set to mmap, since reopening always returns file-io).

@kskalski
Author

I'm done with optimizations and updating APIs to let the scan do prefetching. Typical percentage of CPU used for hashing (as opposed to buffer ops and syscalls) is now 97%; I will post some numbers tomorrow comparing wall-time gains relative to master.

Seems like the majority of the review will fall on @alessandrod, who prefers to look at the whole solution in one PR, so I have now included all the code that uses the new prefetch APIs and how I changed the lt hash verification scan.

@brooksprumo this actually brings a bunch of changes that are more specific to accounts-db code; this way at least there are no more "unused" blocks. Thanks for the quick pass.

@kskalski kskalski added the CI Pull Request is ready to enter CI label Aug 19, 2025
@anza-team anza-team removed the CI Pull Request is ready to enter CI label Aug 19, 2025
@brooksprumo brooksprumo self-requested a review August 19, 2025 20:02
@kskalski
Author

I discovered a small regression in snapshot unpacking after the removal of sqpoll from the reader, so I tuned ring options for the tar archive reader and file creator - they come with explanations in code comments. This seems to be fixed now, though I'm a bit puzzled, since all my tests indicate that mixing reads and writes on the same ring's kernel workers slows down the writes... (a separate surprise being that without sqpoll both rings created in the same user thread (solTarUnpack) will share the same kernel worker pool).

@kskalski
Author

io_uring_scan_lthash current performance comparison for `calculate_accounts_lt_hash_at_startup_from_storages`

@kskalski
Copy link
Copy Markdown
Author

kskalski commented Sep 2, 2025

@brooksprumo - rebased to resolve conflict and tweaked back accounts-db scan code to better support all storage access configs (since we still need to allow running with mmap mode)
@alessandrod - we missed 3.1 cut, the change is large and I suppose won't be accepted in BP, so I hope to get the main code in before 4.0

@alessandrod

@brooksprumo - rebased to resolve conflict and tweaked back accounts-db scan code to better support all storage access configs (since we still need to allow running with mmap mode) @alessandrod - we missed 3.1 cut, the change is large and I suppose won't be accepted in BP, so I hope to get the main code in before 4.0

uh surely we can still aim for 3.1?

@kskalski kskalski force-pushed the ks/multi_files_read branch from 1070a3b to 50edc58 Compare September 2, 2025 19:37
@kskalski
Author

kskalski commented Sep 3, 2025

uh surely we can still aim for 3.1?

Could be! I got confused - master is marked as 3.1, but there isn't a 3.1 tag yet, so we are now composing the solution that will go out then. In the meantime Brooks consolidated all start-up scans into a single pass; I will rebase and update the numbers once his change is in.
The io_uring impl shouldn't be affected by this, though.

@kskalski kskalski force-pushed the ks/multi_files_read branch 2 times, most recently from 5c0ab57 to 62d9650 Compare September 5, 2025 14:38
@kskalski
Author

kskalski commented Sep 5, 2025

Well, Brooks ate most of the benefits by running a single scan on all cores as part of gen index:
io_uring_scan_gen_index

Around 30 threads we saturate the disk read bandwidth, which makes the io_uring version poll / submit syscalls for completions. The end result is that idle CPU time masks any IO-related overhead, since we use a fast work-stealing method and other threads will take on work that is delayed by busy threads wasting CPU on syscalls.

There is still a benefit comparable to the one reported earlier if we generate the index with fewer cores. It appears we don't do any whole accounts-db scans now apart from start-up, so we need to reconsider whether we need this utility / approach.

@kskalski kskalski force-pushed the ks/multi_files_read branch from ecdab21 to e8ce318 Compare October 3, 2025 05:13
@kskalski kskalski marked this pull request as draft November 14, 2025 10:59
@brooksprumo brooksprumo removed their request for review December 12, 2025 03:59
@kskalski kskalski closed this Jan 14, 2026
@kskalski kskalski deleted the ks/multi_files_read branch January 27, 2026 08:42