Add support for direct io in SequentialFileReader#9395

Merged
kskalski merged 1 commit into anza-xyz:master from dachen0:o-direct-real on Feb 10, 2026
Conversation

@dachen0 dachen0 commented Dec 3, 2025

Summary of Changes

By using direct IO when reading snapshots, we gain a roughly 7% speedup on full snapshot untar. Timings on the same snapshot and incremental snapshot when running ledger-tool verify:

| | This PR | Master |
|---|---|---|
| snapshot untar | 273.7s | 293.8s |
| incremental snapshot untar | 3.1s | 2.8s |
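
For reference, the percentage follows directly from the two timings above. The helper below is only an illustrative sketch (the function name is made up for this note): it computes the fraction of wall-clock time saved relative to the baseline. For the full snapshot it gives about 6.8%, while the incremental snapshot comes out slightly slower with this PR (a 0.3 s difference, which may well be noise).

```rust
/// Fraction of wall-clock time saved relative to the baseline run.
/// Positive means the candidate is faster, negative means it is slower.
pub fn relative_speedup(baseline_secs: f64, candidate_secs: f64) -> f64 {
    (baseline_secs - candidate_secs) / baseline_secs
}
```

With the numbers from the table, `relative_speedup(293.8, 273.7)` is roughly 0.068 and `relative_speedup(2.8, 3.1)` is negative.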

cc @kskalski

Copilot AI left a comment

Pull request overview

This PR adds support for the O_DIRECT flag when opening files in SequentialFileReader, providing a 7% performance improvement for snapshot operations. To support O_DIRECT's alignment requirements, the implementation switches from Vec-based buffers to page-aligned memory allocated via mmap. A fallback mechanism gracefully handles systems where O_DIRECT is not supported.

Key changes:

  • Replaces LargeBuffer enum with direct use of PageAlignedMemory for all io_uring buffers
  • Adds O_DIRECT flag with fallback to standard file operations when O_DIRECT fails
  • Introduces new_large_buffer() function that allocates regular page-aligned memory as fallback when huge pages aren't available
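
The "O_DIRECT with fallback" idea from the list above can be sketched roughly as follows. This is not the code from the PR: `open_for_read` is a hypothetical helper, and the `O_DIRECT` constant is hard-coded to its Linux x86_64 value as an assumption (real code would take it from the `libc` crate). If the direct open fails for any reason, the sketch retries with a plain buffered open, which then reports any genuine error such as a missing file.

```rust
use std::fs::{File, OpenOptions};
use std::io;
use std::os::unix::fs::OpenOptionsExt;
use std::path::Path;

/// ASSUMPTION: value of O_DIRECT on Linux x86_64. It differs on other
/// architectures, where the `libc` crate's constant should be used instead.
const O_DIRECT: i32 = 0o40000;

/// Opens `path` read-only, preferring O_DIRECT when `try_direct` is set.
/// Returns the file and whether direct IO is actually in effect.
pub fn open_for_read(path: &Path, try_direct: bool) -> io::Result<(File, bool)> {
    if try_direct {
        let direct = OpenOptions::new()
            .read(true)
            .custom_flags(O_DIRECT)
            .open(path);
        if let Ok(file) = direct {
            return Ok((file, true));
        }
        // The direct open failed (for example, the filesystem rejects
        // O_DIRECT); fall through to a buffered open.
    }
    Ok((File::open(path)?, false))
}
```

The returned flag matters to the caller: reads on a file opened with O_DIRECT generally need aligned buffers, offsets, and lengths, so the reader has to know which mode it ended up in.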

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 6 comments.

| File | Description |
|---|---|
| fs/src/io_uring/sequential_file_reader.rs | Adds O_DIRECT flag with fallback logic and updates type signatures to use PageAlignedMemory; adds Clone bound to path parameters |
| fs/src/io_uring/memory.rs | Removes LargeBuffer enum wrapper, adds alloc_regular() method for standard page-aligned allocation, implements AsMut<[u8]> for PageAlignedMemory, and converts allocation logic to standalone function |
| fs/src/io_uring/file_creator.rs | Updates type references from LargeBuffer to PageAlignedMemory and adjusts imports |
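
To illustrate what a page-aligned buffer exposing a byte slice can look like, here is a minimal sketch. It deliberately swaps in `std::alloc` with an explicit alignment instead of the mmap-based allocation the review describes, and `AlignedBuf` is a hypothetical type, not `PageAlignedMemory` from the PR.

```rust
use std::alloc::{alloc_zeroed, dealloc, Layout};
use std::ptr::NonNull;
use std::slice;

/// Hypothetical owned, zero-initialized buffer with a caller-chosen alignment.
pub struct AlignedBuf {
    ptr: NonNull<u8>,
    layout: Layout,
}

impl AlignedBuf {
    /// Returns `None` for a zero size, an invalid alignment (not a power of
    /// two), or an allocation failure.
    pub fn new(size: usize, align: usize) -> Option<Self> {
        if size == 0 {
            return None;
        }
        let layout = Layout::from_size_align(size, align).ok()?;
        // SAFETY: `layout` has a non-zero size, as required by `alloc_zeroed`.
        let ptr = NonNull::new(unsafe { alloc_zeroed(layout) })?;
        Some(Self { ptr, layout })
    }

    pub fn as_mut_slice(&mut self) -> &mut [u8] {
        // SAFETY: `ptr` points to `layout.size()` initialized bytes that are
        // exclusively owned by `self` for the duration of the borrow.
        unsafe { slice::from_raw_parts_mut(self.ptr.as_ptr(), self.layout.size()) }
    }
}

impl AsMut<[u8]> for AlignedBuf {
    fn as_mut(&mut self) -> &mut [u8] {
        self.as_mut_slice()
    }
}

impl Drop for AlignedBuf {
    fn drop(&mut self) {
        // SAFETY: `ptr` was allocated with exactly this `layout`.
        unsafe { dealloc(self.ptr.as_ptr(), self.layout) }
    }
}
```

An mmap-based allocation gets page alignment for free and can request huge pages, which is presumably why the PR goes that way; the sketch only shows the alignment guarantee and the `AsMut<[u8]>` surface.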

@kskalski

kskalski commented Dec 4, 2025

In general I'm a bit skeptical that O_DIRECT helps the reader: it bypasses the kernel page cache, which for sequential reading might actually make things worse (we would not benefit from the read-ahead the kernel does to optimize sequential reads, which is exactly the reader's access pattern).
We do read-ahead on our own though, so I suppose it's possible that O_DIRECT prevents some unnecessary reads or avoids wasting time populating caches that are going to be used only once...

In my tests the bottleneck for snapshot untar is actually writing (we write 3-4x more data than we read, and writes are generally slower than reads), so to evaluate this change it may be worth testing the performance of reading + decompression on its own, e.g. by running this code in isolation:

let read_buf_size = MAX_SNAPSHOT_READER_BUF_SIZE.min(read_write_budget_size as u64);
let decompressor = decompressed_tar_reader(archive_format, archive_path, read_buf_size)?;

If the speedup is reproducible both in isolation and for the whole unpacking process, I will try it out and compare profiles.
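
As a rough idea of what "measuring the read side in isolation" could look like, the sketch below times a plain sequential read loop. It is an assumption-laden stand-in: it uses `std::fs` rather than the io_uring reader or `decompressed_tar_reader`, and `time_sequential_read` is a name invented here.

```rust
use std::fs::File;
use std::io::{self, Read};
use std::path::Path;
use std::time::{Duration, Instant};

/// Reads `path` sequentially with a `buf_size`-byte buffer and returns the
/// number of bytes read together with the elapsed wall-clock time.
pub fn time_sequential_read(path: &Path, buf_size: usize) -> io::Result<(u64, Duration)> {
    if buf_size == 0 {
        return Err(io::Error::new(
            io::ErrorKind::InvalidInput,
            "buffer size must be non-zero",
        ));
    }
    let mut file = File::open(path)?;
    let mut buf = vec![0u8; buf_size];
    let mut total = 0u64;
    let start = Instant::now();
    loop {
        let n = file.read(&mut buf)?;
        if n == 0 {
            break; // end of file
        }
        total += n as u64;
    }
    Ok((total, start.elapsed()))
}
```

For a fair comparison between buffered and direct reads, the page cache state should be the same before each run; otherwise a warm cache will dominate the buffered result.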

The memory buffer changes are interesting: they limit the use of an external buffer, but we don't need that currently, and some of the code seems cleaner with the generic B removed... In any case, I think I would prefer that to be done as a separate PR.

@dachen0
Author

dachen0 commented Dec 4, 2025 via email

@dachen0
Author

dachen0 commented Dec 4, 2025

Benchmarking with the attached diff shows a ~3-4% speedup for the decompressor alone, on a debug build.

master-log.txt
pr-log.txt
diff.patch

@alessandrod alessandrod self-requested a review December 5, 2025 01:29

@alessandrod alessandrod left a comment

thanks!

marking as changes requested just so I get to review (after breakpoint)

@kskalski kskalski left a comment

Thanks for testing this and it in fact looks promising. We will also want this kind of change for file creator as you mentioned.

I'm leaving a few comments on the code - we should make this PR simpler and also make use of O_DIRECT optional as this hits quite different code paths in the kernel, which historically we used to have trouble with.

@dachen0
Author

dachen0 commented Dec 9, 2025

Just as an update, I am still working on this. However, it's currently finals week for me, so I will only really be able to work on this in a week or a week and a half. Posting this for clarity.

@dachen0
Author

dachen0 commented Dec 26, 2025

Rebased to master and used the builder.

@kskalski

kskalski commented Jan 6, 2026

also need to rebase, conflicted with some of my recent changes...

@dachen0 dachen0 force-pushed the o-direct-real branch 4 times, most recently from d2bd179 to 6ddc07f Compare January 6, 2026 18:12
@dachen0
Author

dachen0 commented Jan 6, 2026

Rebase done, test-checks.sh passes.

@kskalski kskalski added the CI Pull Request is ready to enter CI label Jan 7, 2026
@anza-team anza-team removed the CI Pull Request is ready to enter CI label Jan 7, 2026
@kskalski

kskalski commented Jan 7, 2026

You could update the PR description - aligned memory was done separately and this PR doesn't enable the mode just yet.

@kskalski kskalski added the CI Pull Request is ready to enter CI label Feb 1, 2026
@anza-team anza-team removed the CI Pull Request is ready to enter CI label Feb 1, 2026

@kskalski kskalski left a comment

Great, it looks clean now. Just small suggestions about wording of the comments.

@dachen0
Author

dachen0 commented Feb 2, 2026

Done.

kskalski previously approved these changes Feb 2, 2026
@kskalski kskalski added the CI Pull Request is ready to enter CI label Feb 2, 2026
@anza-team anza-team removed the CI Pull Request is ready to enter CI label Feb 2, 2026
@kskalski

kskalski commented Feb 2, 2026

@dachen0
Author

dachen0 commented Feb 2, 2026

Fixed the whitespace issue. I copied the text from the suggestion and it must have had trailing whitespace for some reason.

@kskalski kskalski added the CI Pull Request is ready to enter CI label Feb 2, 2026
@anza-team anza-team removed the CI Pull Request is ready to enter CI label Feb 2, 2026
@kskalski kskalski requested a review from vadorovsky February 9, 2026 06:14
Member

@vadorovsky vadorovsky left a comment

No remarks on the implementation itself. Do I understand correctly that this PR enables direct IO only in tests, and that we need to enable it in the actual snapshot unarchive code separately?

Given the perf results you added in description, I guess the best place to let the caller enable or disable direct IO would be unarchive_snapshot:

https://github.com/dachen0/agave/blob/63207cf7a6bc5847cee961191b7fa5145e56411b/runtime/src/snapshot_utils.rs#L1064-L1074

Then passed along to streaming_unarchive_snapshot -> decompressed_tar_reader -> large_file_buf_reader, which then uses the SequentialFileReaderBuilder.

So then we can enable it here for non-incremental snapshots:

https://github.com/dachen0/agave/blob/63207cf7a6bc5847cee961191b7fa5145e56411b/runtime/src/snapshot_utils.rs#L1037-L1047

And then disable it for incremental ones:

https://github.com/dachen0/agave/blob/63207cf7a6bc5847cee961191b7fa5145e56411b/runtime/src/snapshot_utils.rs#L1064-L1074

Can be done separately, of course.

@kskalski

kskalski commented Feb 9, 2026

Given the perf results you added in description, I guess the best place to let the caller enable or disable direct IO would be unarchive_snapshot:

https://github.com/dachen0/agave/blob/63207cf7a6bc5847cee961191b7fa5145e56411b/runtime/src/snapshot_utils.rs#L1064-L1074

Then passed along to streaming_unarchive_snapshot -> decompressed_tar_reader -> large_file_buf_reader, which then uses the SequentialFileReaderBuilder.

Yeah, I have that kind of change already in progress on a separate branch - the idea is to have a validator / ledger-tool flag that controls use of direct-io for snapshots (default will be to enable).

It's definitely something for a separate PR, and for now we should also get the write-side change in (#9856).

Copilot AI left a comment

Pull request overview

Copilot reviewed 1 out of 1 changed files in this pull request and generated 5 comments.

Comments suppressed due to low confidence (1)

fs/src/io_uring/sequential_file_reader.rs:824

  • In direct-IO mode, the retry path for short reads resubmits with file_offset + last_read_len and buf_offset = total_read_len. If last_read_len is not aligned to the O_DIRECT requirements, the next submission will use an unaligned file offset/buffer offset and is likely to fail with EINVAL. The direct-IO implementation should ensure all resubmissions keep required alignment (or avoid the "retry remaining bytes" approach under O_DIRECT and handle partial reads differently).
        if last_read_len > 0 && last_read_len < *read_len {
            // Partial read, retry the op with updated offsets
            let op: ReadOp = ReadOp {
                fd: *fd,
                buf,
                is_direct_io: *is_direct_io,
                buf_offset: total_read_len,
                file_offset: *file_offset + last_read_len as FileSize,
                read_len: *read_len - last_read_len,
                reader_buf_index: *reader_buf_index,
                is_last_read: *is_last_read,
            };
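
One way to keep resubmissions aligned, as the suppressed comment suggests, is to keep only the aligned prefix of a short read and re-request everything after it. The sketch below is a hypothetical illustration of that arithmetic, not the PR's retry logic; `RetryPlan` and `plan_aligned_retry` are invented names, and it assumes the original file offset, buffer offset, and read length were already aligned.

```rust
/// Plan for resubmitting a short read without breaking alignment.
#[derive(Debug, PartialEq, Eq)]
pub struct RetryPlan {
    /// Bytes of the short read that are kept (an aligned prefix).
    pub kept: usize,
    /// File offset for the resubmitted read.
    pub file_offset: u64,
    /// Length to request in the resubmitted read.
    pub read_len: usize,
}

/// Returns `None` when no retry is needed (nothing was read, or the read was
/// complete) or when `align` is zero.
pub fn plan_aligned_retry(
    file_offset: u64,
    read_len: usize,
    last_read_len: usize,
    align: usize,
) -> Option<RetryPlan> {
    if align == 0 || last_read_len == 0 || last_read_len >= read_len {
        return None;
    }
    // Round the progress down to the alignment; the unaligned tail is
    // discarded and read again by the resubmitted operation.
    let kept = last_read_len - last_read_len % align;
    Some(RetryPlan {
        kept,
        file_offset: file_offset + kept as u64,
        read_len: read_len - kept,
    })
}
```

The cost is re-reading up to `align - 1` bytes per short read, in exchange for never submitting an unaligned offset or buffer position.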

Comment on lines +83 to +87
#[cfg(test)]
pub fn use_direct_io(mut self, use_direct_io: bool) -> Self {
    self.use_direct_io = use_direct_io;
    self
}

Copilot AI Feb 10, 2026

use_direct_io is gated behind #[cfg(test)], so production code (e.g., large_file_buf_reader) cannot enable O_DIRECT at all and open_file_flags will always be O_NOATIME in non-test builds. If this PR is intended to speed up snapshot reads in production, this method (or an equivalent config path) needs to be available outside tests and wired through the call sites that construct SequentialFileReaderBuilder.
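
For illustration, a builder option that is available outside tests and feeds into the open flags might look like the sketch below. `ReaderOptions` is a hypothetical type rather than `SequentialFileReaderBuilder`, and both flag constants are hard-coded Linux x86_64 values (an assumption; the `libc` crate provides the portable definitions).

```rust
/// ASSUMPTION: Linux x86_64 values; take these from the `libc` crate in real code.
pub const O_NOATIME: i32 = 0o1000000;
pub const O_DIRECT: i32 = 0o40000;

/// Hypothetical reader configuration with an opt-in direct IO switch.
#[derive(Debug, Default, Clone, Copy)]
pub struct ReaderOptions {
    use_direct_io: bool,
}

impl ReaderOptions {
    /// Not gated behind `#[cfg(test)]`, so production callers can opt in.
    pub fn use_direct_io(mut self, use_direct_io: bool) -> Self {
        self.use_direct_io = use_direct_io;
        self
    }

    /// Flags passed when opening files: always O_NOATIME, plus O_DIRECT on request.
    pub fn open_file_flags(&self) -> i32 {
        if self.use_direct_io {
            O_NOATIME | O_DIRECT
        } else {
            O_NOATIME
        }
    }
}
```

Keeping the default at `false` matches the state described in this thread, where the mode exists but is not yet enabled for the validator or ledger-tool.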

Comment on lines +149 to +155
assert!(
    self.read_capacity
        .is_multiple_of(DIRECT_IO_READ_LEN_ALIGNMENT),
    "read size is not aligned for direct IO({} is not a multiple of \
     {DIRECT_IO_READ_LEN_ALIGNMENT})",
    self.read_capacity
);

Copilot AI Feb 10, 2026

Using assert! for direct-IO alignment will panic at runtime if misconfigured. Since this is user-controlled configuration (and may depend on the underlying filesystem/device), prefer returning an io::Error (e.g., InvalidInput) from build/build_with_buffer so callers can handle it gracefully instead of crashing.

Suggested change
- assert!(
-     self.read_capacity
-         .is_multiple_of(DIRECT_IO_READ_LEN_ALIGNMENT),
-     "read size is not aligned for direct IO({} is not a multiple of \
-      {DIRECT_IO_READ_LEN_ALIGNMENT})",
-     self.read_capacity
- );
+ if !self
+     .read_capacity
+     .is_multiple_of(DIRECT_IO_READ_LEN_ALIGNMENT)
+ {
+     return Err(io::Error::new(
+         io::ErrorKind::InvalidInput,
+         format!(
+             "read size is not aligned for direct IO({} is not a multiple of \
+              {DIRECT_IO_READ_LEN_ALIGNMENT})",
+             self.read_capacity
+         ),
+     ));
+ }

Comment on lines +762 to +771
// Align the read length if necessary
let internal_read_len = if *is_direct_io && *read_len != buf.len() {
    // Try to align the read len if possible and fall back to reading
    // the full remaining bytes if we can't align the read len.
    read_len
        .next_multiple_of(DIRECT_IO_READ_LEN_ALIGNMENT)
        .min(buf.len() - *buf_offset)
} else {
    *read_len
};

Copilot AI Feb 10, 2026

internal_read_len can be larger than the logical read_len to satisfy O_DIRECT alignment, but complete() and EOF handling still use read_len/is_last_read. This can expose more bytes than read_limit (and/or mark eof_pos beyond the intended limit) when internal_read_len > read_len, breaking the documented "read finishes after read_limit bytes" behavior. Consider tracking both the submitted length vs requested length and clamping the buffer’s readable length/EOF position to the requested read_len (discarding any over-read bytes) so fill_buf() never returns more than requested.
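
The "track submitted vs requested length" suggestion can be sketched as two small helpers. These are hypothetical names with a made-up alignment constant standing in for `DIRECT_IO_READ_LEN_ALIGNMENT`; the sketch shows only the arithmetic, not how the reader's state machine would use it.

```rust
/// ASSUMPTION: stand-in for the PR's DIRECT_IO_READ_LEN_ALIGNMENT.
pub const DIRECT_IO_ALIGN: usize = 4096;

/// Length actually submitted to the kernel: rounded up to the alignment in
/// direct IO mode, but never beyond the space left in the buffer.
pub fn submitted_len(requested: usize, buf_remaining: usize, is_direct_io: bool) -> usize {
    if is_direct_io {
        requested
            .next_multiple_of(DIRECT_IO_ALIGN)
            .min(buf_remaining)
    } else {
        requested.min(buf_remaining)
    }
}

/// Bytes exposed to the consumer: whatever the kernel returned, clamped to
/// the logical request so that over-read bytes are discarded.
pub fn visible_len(bytes_returned: usize, requested: usize) -> usize {
    bytes_returned.min(requested)
}
```

For a 2,500-byte request into a 16 KiB buffer, direct IO mode would submit 4,096 bytes but expose at most 2,500 of them.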

impl<'a> FileBufRead<'a> for SequentialFileReader<'a> {
    /// The `SequentialFileReader` must be in direct io mode if passing in direct io files.
    /// `read_limit` must be less than the file size if using direct io.

Copilot AI Feb 10, 2026

Doc inconsistency: here it says read_limit must be "less than the file size" for direct IO, but the earlier docs for add_owned_file_to_prefetch say "less than or equal". Please make these consistent (and ideally enforce the constraint in code if it’s required).

Suggested change
- /// `read_limit` must be less than the file size if using direct io.
+ /// `read_limit` must be less than or equal to the file size if using direct io.
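
If the constraint were enforced in code rather than only documented, a check along these lines would do it. This is a sketch under the "less than or equal" reading proposed in the suggestion; `check_read_limit` is a hypothetical helper, and whether the real reader should reject or clamp such input is a design decision this thread does not settle.

```rust
use std::io;

/// Rejects a `read_limit` larger than the file when direct IO is in use.
/// Buffered mode is left unconstrained in this sketch.
pub fn check_read_limit(file_size: u64, read_limit: u64, is_direct_io: bool) -> io::Result<()> {
    if is_direct_io && read_limit > file_size {
        return Err(io::Error::new(
            io::ErrorKind::InvalidInput,
            format!("read_limit {read_limit} exceeds file size {file_size} in direct IO mode"),
        ));
    }
    Ok(())
}
```

Returning `io::Error` with `InvalidInput` follows the same pattern as the alignment suggestion earlier in this review, so callers can handle both kinds of misconfiguration uniformly.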

Comment on lines +953 to +965
#[test]
fn test_direct_io_read() {
    check_reading_file(2_500, 4096, 4096, true);
    check_reading_file(2_500, 16384, 4096, true);
    check_reading_file(25_000, 4096, 4096, true);
    check_reading_file(25_000, 16384, 4096, true);
    check_reading_file(250_000, 4096, 4096, true);
    check_reading_file(250_000, 16384, 4096, true);
    check_reading_file(4096, 4096, 4096, true);
    check_reading_file(4096, 16384, 4096, true);
    check_reading_file(16384, 4096, 4096, true);
    check_reading_file(16384, 16384, 4096, true);
}

Copilot AI Feb 10, 2026

test_direct_io_read assumes the temp directory’s filesystem supports O_DIRECT. On common Linux setups where tempfile uses /tmp mounted as tmpfs, O_DIRECT open can fail with EINVAL, making this unit test flaky across CI environments. Consider skipping the test when opening with O_DIRECT fails, or creating the temp file in a directory/filesystem known to support direct IO.
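
A test could guard against unsupported filesystems with a small probe such as the one sketched here. `supports_direct_io` is a hypothetical helper, and the `O_DIRECT` constant is again the Linux x86_64 value hard-coded as an assumption.

```rust
use std::fs::{self, OpenOptions};
use std::os::unix::fs::OpenOptionsExt;
use std::path::Path;

/// ASSUMPTION: value of O_DIRECT on Linux x86_64; use the `libc` crate in real code.
const O_DIRECT: i32 = 0o40000;

/// Best-effort check whether files in `dir` can be opened with O_DIRECT.
/// Creates and removes a small probe file; any failure counts as "unsupported".
pub fn supports_direct_io(dir: &Path) -> bool {
    let probe = dir.join(format!(".direct_io_probe_{}", std::process::id()));
    if fs::write(&probe, b"probe").is_err() {
        return false;
    }
    let supported = OpenOptions::new()
        .read(true)
        .custom_flags(O_DIRECT)
        .open(&probe)
        .is_ok();
    // Clean up regardless of the outcome; a failed removal is not fatal here.
    let _ = fs::remove_file(&probe);
    supported
}
```

A direct IO test would then start with something like `if !supports_direct_io(temp_dir.path()) { return; }`, where `temp_dir` is whatever temporary directory the test already uses.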

@kskalski kskalski added this pull request to the merge queue Feb 10, 2026
Merged via the queue into anza-xyz:master with commit ff3c981 Feb 10, 2026
55 checks passed
@kskalski

Thanks @dachen0 for the idea and work on this PR.

I created a PR (#10507) to plug enabling direct-io into the validator and ledger-tool, but I want to wait for the file creator PR (#9856) to get merged.

So then we can enable it here for non-incremental snapshots:

https://github.com/dachen0/agave/blob/63207cf7a6bc5847cee961191b7fa5145e56411b/runtime/src/snapshot_utils.rs#L1037-L1047

And then disable it for incremental ones:

@vadorovsky I wonder why you think it should use different settings for non-incremental and incremental archives? I think when we start up and read the archives, both are read once (at least if all goes well), so there isn't any benefit from involving the buffer cache when reading them. Or maybe the idea is that an incremental archive may still be in the cache from the time it was written?
I think it's not worth introducing complexity in the mode used for reading the archives:

  • we will also enable direct-io for writing archives as soon as all the required support code is ready
  • incrementals are small enough that either mode should really be fine for them
  • in general for read side I think buffer cache isn't very useful for io-uring reader, since it does lots of read-ahead on its own and read disk bandwidth is plentiful

@dachen0
Author

dachen0 commented Feb 11, 2026

Thank you @kskalski for being patient and for the many, many code reviews and suggestions! It was a pleasure to work with you on this. Glad to see this finally merged :).
