Add support for direct io in SequentialFileReader by dachen0 · Pull Request #9395 · anza-xyz/agave

dachen0 · 2025-12-03T15:46:19Z

Summary of Changes

Using direct IO when reading snapshots gives about a 7% speedup on the full snapshot untar. Timings on the same snapshot and incremental snapshot when running ledger-tool verify:

Step	This PR	Master
snapshot untar	273.7s	293.8s
incremental snapshot untar	3.1s	2.8s
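
As a quick check of the arithmetic behind the table, the relative speedup can be computed as below. This is only a sketch; `speedup_percent` is a hypothetical helper, not something from the PR. With the timings above it gives roughly 6.8% for the full snapshot, while the incremental snapshot is slightly slower in this single run (3.1s vs 2.8s).

```rust
/// Relative speedup of `new_secs` over `old_secs`, in percent.
/// Positive means the new timing is faster, negative means it is slower.
/// Hypothetical helper for illustration only.
fn speedup_percent(old_secs: f64, new_secs: f64) -> f64 {
    (old_secs - new_secs) / old_secs * 100.0
}
```

Calling `speedup_percent(293.8, 273.7)` covers the full snapshot row and `speedup_percent(2.8, 3.1)` the incremental row.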

cc @kskalski

Copilot

Pull request overview

This PR adds support for the O_DIRECT flag when opening files in SequentialFileReader, providing a 7% performance improvement for snapshot operations. To support O_DIRECT's alignment requirements, the implementation switches from Vec-based buffers to page-aligned memory allocated via mmap. A fallback mechanism gracefully handles systems where O_DIRECT is not supported.
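
To illustrate why O_DIRECT pushes the buffers away from `Vec`, here is a minimal sketch of an aligned buffer. It is an assumption-laden stand-in: `AlignedBuf` and `ALIGNMENT` are hypothetical names, the 4096-byte alignment is assumed rather than queried from the device, and it uses the global allocator instead of the mmap-based `PageAlignedMemory` described above.

```rust
use std::alloc::{alloc_zeroed, dealloc, handle_alloc_error, Layout};
use std::ptr::NonNull;

/// Assumed alignment; real code would derive this from the page or block size.
const ALIGNMENT: usize = 4096;

/// Hypothetical stand-in for an aligned IO buffer (global allocator, not mmap).
struct AlignedBuf {
    ptr: NonNull<u8>,
    layout: Layout,
}

impl AlignedBuf {
    fn new(len: usize) -> Self {
        assert!(len > 0, "zero-sized allocations are not supported");
        let layout = Layout::from_size_align(len, ALIGNMENT).expect("invalid layout");
        // SAFETY: `layout` has a non-zero size.
        let raw = unsafe { alloc_zeroed(layout) };
        let ptr = match NonNull::new(raw) {
            Some(ptr) => ptr,
            None => handle_alloc_error(layout),
        };
        Self { ptr, layout }
    }

    fn as_mut_slice(&mut self) -> &mut [u8] {
        // SAFETY: `ptr` points to `layout.size()` zero-initialized bytes that
        // are uniquely borrowed through `&mut self`.
        unsafe { std::slice::from_raw_parts_mut(self.ptr.as_ptr(), self.layout.size()) }
    }
}

impl Drop for AlignedBuf {
    fn drop(&mut self) {
        // SAFETY: `ptr` was allocated with exactly this layout.
        unsafe { dealloc(self.ptr.as_ptr(), self.layout) }
    }
}
```

A plain `Vec<u8>` only guarantees byte alignment, which is why a dedicated allocation path is needed once buffers are handed to direct-IO reads.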

Key changes:

Replaces LargeBuffer enum with direct use of PageAlignedMemory for all io_uring buffers
Adds O_DIRECT flag with fallback to standard file operations when O_DIRECT fails
Introduces new_large_buffer() function that allocates regular page-aligned memory as fallback when huge pages aren't available

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 6 comments.

File	Description
fs/src/io_uring/sequential_file_reader.rs	Adds O_DIRECT flag with fallback logic and updates type signatures to use `PageAlignedMemory`; adds `Clone` bound to path parameters
fs/src/io_uring/memory.rs	Removes `LargeBuffer` enum wrapper, adds `alloc_regular()` method for standard page-aligned allocation, implements `AsMut<[u8]>` for `PageAlignedMemory`, and converts allocation logic to standalone function
fs/src/io_uring/file_creator.rs	Updates type references from `LargeBuffer` to `PageAlignedMemory` and adjusts imports

kskalski · 2025-12-04T00:44:47Z

In general I'm a bit skeptical that O_DIRECT helps the reader: it bypasses the kernel page cache, which for sequential reading might actually make things worse (we would not benefit from the read-ahead the kernel does for sequential access, which matches the reader's access pattern).
We do read-ahead on our own though, so I suppose it's possible that O_DIRECT prevents some unnecessary reads or avoids wasting time populating caches that are going to be used only once...

In my tests the bottleneck for snapshot untar is actually writing (we write 3-4x more data than we read, and writes are in general slower than reads), so to evaluate this change it may be worth testing the performance of reading + decompression (e.g. running this code in isolation

agave/snapshots/src/unarchive.rs

Lines 42 to 43 in a2a68b8

    
    let read_buf_size = MAX_SNAPSHOT_READER_BUF_SIZE.min(read_write_budget_size as u64);
    let decompressor = decompressed_tar_reader(archive_format, archive_path, read_buf_size)?;

). If the speedup is reproducible in isolation and for the whole unpacking process, I will try it out and compare profiles.
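
A minimal sketch of what timing the read side in isolation could look like, under loud assumptions: `time_sequential_read` is a hypothetical helper, it uses plain buffered `std::fs::File` reads rather than the io_uring reader, and it leaves out decompression. Results also depend heavily on whether the file is already in the page cache, so a real comparison would need to control for that.

```rust
use std::fs::File;
use std::io::{self, Read};
use std::path::Path;
use std::time::{Duration, Instant};

/// Reads `path` sequentially with a `buf_size`-byte buffer and returns the
/// number of bytes read together with the elapsed wall-clock time.
/// Hypothetical helper for illustration only.
fn time_sequential_read(path: &Path, buf_size: usize) -> io::Result<(u64, Duration)> {
    let mut file = File::open(path)?;
    let mut buf = vec![0u8; buf_size];
    let mut total = 0u64;
    let start = Instant::now();
    loop {
        let n = file.read(&mut buf)?;
        if n == 0 {
            break; // end of file (or an empty buffer)
        }
        total += n as u64;
    }
    Ok((total, start.elapsed()))
}
```

Running it against the same archive with and without direct IO (and with the write side removed) would show whether the read path alone gets faster.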

The memory buffer changes are interesting: they limit use of an external buffer, but we don't need that currently, and some code seems cleaner after removing the generic B... In any case I would prefer that to be done as a separate PR.

dachen0 · 2025-12-04T01:08:03Z

Yes, I also saw that we were heavily write constrained. I thought the speedup might come from reads no longer going through the kernel cache, which reduces the load on it and makes the writes more efficient. I was also looking into using O_DIRECT for writes, since that's the main bottleneck. We shouldn't be touching the kernel cache at all IMO, since there are so many account storages/ledger entries that are basically single use and shouldn't be cached. If caching is needed, it should be implemented in user space. I will write some benches for just snapshot unpacking and see if there's an isolated speedup. I'm not sure we should put the memory changes in a separate PR: Vec allocs are faster than mmap, so if aligned memory for the buffer isn't necessary, the alloc is faster with the current impl.

dachen0 · 2025-12-04T15:25:51Z

Benchmarking with the attached diff shows a ~3-4% speedup for the decompressor alone on a debug build.

master-log.txt
pr-log.txt
diff.patch

alessandrod

thanks!

marking as changes requested just so I get to review (after breakpoint)

kskalski

Thanks for testing this, it does in fact look promising. We will also want this kind of change for the file creator, as you mentioned.

I'm leaving a few comments on the code: we should make this PR simpler and also make use of O_DIRECT optional, as this hits quite different code paths in the kernel, which we have historically had trouble with.

dachen0 · 2025-12-09T22:12:35Z

Just as an update, I am still working on this. However, it's currently finals week for me, so I will only really be able to work on this in a week or a week and a half. Posting this for clarity.

dachen0 · 2025-12-26T17:46:03Z

Rebased to master and used the builder.

kskalski · 2026-01-06T16:55:32Z

also need to rebase, conflicted with some of my recent changes...

dachen0 · 2026-01-06T18:13:47Z

Rebase done, test-checks.sh passes.

kskalski · 2026-01-07T01:08:35Z

You could update the PR description - aligned memory was done separately and this PR doesn't enable the mode just yet.

kskalski

Great, it looks clean now. Just small suggestions about wording of the comments.

dachen0 · 2026-02-02T02:25:56Z

Done.

kskalski · 2026-02-02T06:20:21Z

Sanity checks fail on some white-space issue:
https://buildkite.com/anza/agave/builds/39602/steps/canvas?jid=019c1c61-7082-4067-aed1-8a9937b9fe43#019c1c61-7082-4067-aed1-8a9937b9fe43

dachen0 · 2026-02-02T13:10:18Z

Fixed the whitespace issue. I copied the text from the suggestion and it must have had trailing whitespace for some reason.

vadorovsky

No remarks on the implementation itself. Do I understand correctly that this PR enables direct IO only in tests, and we need to enable it in the actual snapshot unarchive code separately?

Given the perf results you added in description, I guess the best place to let the caller enable or disable direct IO would be unarchive_snapshot:

https://github.com/dachen0/agave/blob/63207cf7a6bc5847cee961191b7fa5145e56411b/runtime/src/snapshot_utils.rs#L1064-L1074

Then passed along to streaming_unarchive_snapshot -> decompressed_tar_reader -> large_file_buf_reader, which then uses the SequentialFileReaderBuilder.

So then we can enable it here for non-incremental snapshots:

https://github.com/dachen0/agave/blob/63207cf7a6bc5847cee961191b7fa5145e56411b/runtime/src/snapshot_utils.rs#L1037-L1047

And then disable it for incremental ones:

https://github.com/dachen0/agave/blob/63207cf7a6bc5847cee961191b7fa5145e56411b/runtime/src/snapshot_utils.rs#L1064-L1074

Can be done separately, of course.

kskalski · 2026-02-09T12:15:15Z

Given the perf results you added in description, I guess the best place to let the caller enable or disable direct IO would be unarchive_snapshot:

https://github.com/dachen0/agave/blob/63207cf7a6bc5847cee961191b7fa5145e56411b/runtime/src/snapshot_utils.rs#L1064-L1074

Then passed along to streaming_unarchive_snapshot -> decompressed_tar_reader -> large_file_buf_reader, which then uses the SequentialFileReaderBuilder.

Yeah, I have that kind of change already in progress on a separate branch - the idea is to have a validator / ledger-tool flag that controls use of direct-io for snapshots (default will be to enable).

It's definitely something for a separate PR, and for now we should also get the write change in (#9856)

Copilot

Pull request overview

Copilot reviewed 1 out of 1 changed files in this pull request and generated 5 comments.

Comments suppressed due to low confidence (1)

fs/src/io_uring/sequential_file_reader.rs:824

In direct-IO mode, the retry path for short reads resubmits with file_offset + last_read_len and buf_offset = total_read_len. If last_read_len is not aligned to the O_DIRECT requirements, the next submission will use an unaligned file offset/buffer offset and is likely to fail with EINVAL. The direct-IO implementation should ensure all resubmissions keep required alignment (or avoid the "retry remaining bytes" approach under O_DIRECT and handle partial reads differently).

        if last_read_len > 0 && last_read_len < *read_len {
            // Partial read, retry the op with updated offsets
            let op: ReadOp = ReadOp {
                fd: *fd,
                buf,
                is_direct_io: *is_direct_io,
                buf_offset: total_read_len,
                file_offset: *file_offset + last_read_len as FileSize,
                read_len: *read_len - last_read_len,
                reader_buf_index: *reader_buf_index,
                is_last_read: *is_last_read,
            };
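
One way to keep resubmissions aligned, sketched under assumptions: resume from the last aligned boundary that was fully read and re-read the unaligned tail. `Retry`, `aligned_retry`, and `DIRECT_IO_ALIGNMENT` are hypothetical names, the 4096 value is assumed, and the original `file_offset` and `buf_offset` are assumed to be aligned already. This is a simplification of the op above, not the PR's implementation.

```rust
/// Assumed alignment for offsets and lengths under O_DIRECT.
const DIRECT_IO_ALIGNMENT: usize = 4096;

/// Parameters of a resubmitted read (hypothetical, simplified).
#[derive(Debug, PartialEq)]
struct Retry {
    file_offset: u64,
    buf_offset: usize,
    read_len: usize,
}

/// Computes an aligned retry after a short read of `last_read_len` bytes.
/// Returns `None` when nothing was read or the read was complete.
fn aligned_retry(
    file_offset: u64,
    buf_offset: usize,
    read_len: usize,
    last_read_len: usize,
) -> Option<Retry> {
    if last_read_len == 0 || last_read_len >= read_len {
        return None;
    }
    // Keep only the whole aligned blocks; the partial block is read again.
    let kept = last_read_len / DIRECT_IO_ALIGNMENT * DIRECT_IO_ALIGNMENT;
    Some(Retry {
        file_offset: file_offset + kept as u64,
        buf_offset: buf_offset + kept,
        read_len: read_len - kept,
    })
}
```

The trade-off is that up to one block minus one byte is read twice, and a short read smaller than one block makes no progress, so a real implementation would need a retry limit or a different strategy for that case.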

Copilot · 2026-02-10T07:32:32Z

+    #[cfg(test)]
+    pub fn use_direct_io(mut self, use_direct_io: bool) -> Self {
+        self.use_direct_io = use_direct_io;
+        self
+    }


use_direct_io is gated behind #[cfg(test)], so production code (e.g., large_file_buf_reader) cannot enable O_DIRECT at all and open_file_flags will always be O_NOATIME in non-test builds. If this PR is intended to speed up snapshot reads in production, this method (or an equivalent config path) needs to be available outside tests and wired through the call sites that construct SequentialFileReaderBuilder.

Copilot · 2026-02-10T07:32:33Z

+            assert!(
+                self.read_capacity
+                    .is_multiple_of(DIRECT_IO_READ_LEN_ALIGNMENT),
+                "read size is not aligned for direct IO({} is not a multiple of \
+                 {DIRECT_IO_READ_LEN_ALIGNMENT})",
+                self.read_capacity
+            );


Using assert! for direct-IO alignment will panic at runtime if misconfigured. Since this is user-controlled configuration (and may depend on the underlying filesystem/device), prefer returning an io::Error (e.g., InvalidInput) from build/build_with_buffer so callers can handle it gracefully instead of crashing.

Suggested change

-            assert!(
-                self.read_capacity
-                    .is_multiple_of(DIRECT_IO_READ_LEN_ALIGNMENT),
-                "read size is not aligned for direct IO({} is not a multiple of \
-                 {DIRECT_IO_READ_LEN_ALIGNMENT})",
-                self.read_capacity
-            );
+            if !self
+                .read_capacity
+                .is_multiple_of(DIRECT_IO_READ_LEN_ALIGNMENT)
+            {
+                return Err(io::Error::new(
+                    io::ErrorKind::InvalidInput,
+                    format!(
+                        "read size is not aligned for direct IO({} is not a multiple of \
+                         {DIRECT_IO_READ_LEN_ALIGNMENT})",
+                        self.read_capacity
+                    ),
+                ));
+            }

Copilot · 2026-02-10T07:32:33Z

+        // Align the read length if necessary
+        let internal_read_len = if *is_direct_io && *read_len != buf.len() {
+            // Try to align the read len if possible and fall back to reading
+            // the full remaining bytes if we can't align the read len.
+            read_len
+                .next_multiple_of(DIRECT_IO_READ_LEN_ALIGNMENT)
+                .min(buf.len() - *buf_offset)
+        } else {
+            *read_len
+        };


internal_read_len can be larger than the logical read_len to satisfy O_DIRECT alignment, but complete() and EOF handling still use read_len/is_last_read. This can expose more bytes than read_limit (and/or mark eof_pos beyond the intended limit) when internal_read_len > read_len, breaking the documented "read finishes after read_limit bytes" behavior. Consider tracking both the submitted length vs requested length and clamping the buffer’s readable length/EOF position to the requested read_len (discarding any over-read bytes) so fill_buf() never returns more than requested.
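
The suggestion of tracking submitted versus requested length can be sketched as two small functions. The names `submitted_len` and `exposed_len` and the 4096 value of `ALIGN` are assumptions for illustration, not the PR's code.

```rust
/// Assumed read-length alignment under O_DIRECT.
const ALIGN: usize = 4096;

/// Length handed to the kernel: `read_len` rounded up to `ALIGN`, capped by
/// the space left in the buffer after `buf_offset`.
fn submitted_len(read_len: usize, buf_len: usize, buf_offset: usize) -> usize {
    read_len
        .next_multiple_of(ALIGN)
        .min(buf_len.saturating_sub(buf_offset))
}

/// Bytes made visible to the caller: never more than the logical `read_len`,
/// so any over-read caused by alignment is discarded.
fn exposed_len(bytes_read: usize, read_len: usize) -> usize {
    bytes_read.min(read_len)
}
```

For a logical read of 2,500 bytes into a 16 KiB buffer, the submission is 4,096 bytes but only 2,500 are exposed, which keeps the "read finishes after read_limit bytes" behavior intact.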

Copilot · 2026-02-10T07:32:33Z


 impl<'a> FileBufRead<'a> for SequentialFileReader<'a> {
+    /// The `SequentialFileReader` must be in direct io mode if passing in direct io files.
+    /// `read_limit` must be less than the file size if using direct io.


Doc inconsistency: here it says read_limit must be "less than the file size" for direct IO, but the earlier docs for add_owned_file_to_prefetch say "less than or equal". Please make these consistent (and ideally enforce the constraint in code if it’s required).

Suggested change

-    /// `read_limit` must be less than the file size if using direct io.
+    /// `read_limit` must be less than or equal to the file size if using direct io.

Copilot · 2026-02-10T07:32:33Z

+    #[test]
+    fn test_direct_io_read() {
+        check_reading_file(2_500, 4096, 4096, true);
+        check_reading_file(2_500, 16384, 4096, true);
+        check_reading_file(25_000, 4096, 4096, true);
+        check_reading_file(25_000, 16384, 4096, true);
+        check_reading_file(250_000, 4096, 4096, true);
+        check_reading_file(250_000, 16384, 4096, true);
+        check_reading_file(4096, 4096, 4096, true);
+        check_reading_file(4096, 16384, 4096, true);
+        check_reading_file(16384, 4096, 4096, true);
+        check_reading_file(16384, 16384, 4096, true);
+    }


test_direct_io_read assumes the temp directory’s filesystem supports O_DIRECT. On common Linux setups where tempfile uses /tmp mounted as tmpfs, O_DIRECT open can fail with EINVAL, making this unit test flaky across CI environments. Consider skipping the test when opening with O_DIRECT fails, or creating the temp file in a directory/filesystem known to support direct IO.
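
A possible shape for such a probe, as a sketch only: try the open with the direct-IO flag and fall back to a plain open when the filesystem rejects it with EINVAL. `open_with_direct_fallback` is a hypothetical helper. The flag is passed in by the caller (for example `libc::O_DIRECT` from the `libc` crate) because its numeric value differs between architectures.

```rust
use std::fs::{File, OpenOptions};
use std::io;
use std::os::unix::fs::OpenOptionsExt;
use std::path::Path;

/// Opens `path` read-only with `direct_flag` set. If the open fails with
/// EINVAL (surfaced by std as `InvalidInput`), retries without the flag.
/// Returns the file and whether the flag was applied.
/// Hypothetical helper for illustration only.
fn open_with_direct_fallback(path: &Path, direct_flag: i32) -> io::Result<(File, bool)> {
    match OpenOptions::new()
        .read(true)
        .custom_flags(direct_flag)
        .open(path)
    {
        Ok(file) => Ok((file, true)),
        Err(err) if err.kind() == io::ErrorKind::InvalidInput => {
            // Filesystem (e.g. tmpfs) does not accept the flag: plain open.
            let file = OpenOptions::new().read(true).open(path)?;
            Ok((file, false))
        }
        Err(err) => Err(err),
    }
}
```

A test could call this first and return early when the second element is `false`, instead of failing on filesystems without direct-IO support.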

kskalski · 2026-02-10T08:59:47Z

Thanks @dachen0 for the idea and work on this PR.

I created a PR (#10507) to plug enabling direct-io into the validator and ledger-tool, but I want to wait for the file creator PR (#9856) to get merged.

So then we can enable it here for non-incremental snapshots:

https://github.com/dachen0/agave/blob/63207cf7a6bc5847cee961191b7fa5145e56411b/runtime/src/snapshot_utils.rs#L1037-L1047

And then disable it for incremental ones:

@vadorovsky I wonder why you think it should use different settings for non-incremental and incremental archives? I think when we start up and read the archives, both are read once (at least if all goes well), so there isn't any benefit from involving the buffer cache when reading them. Or maybe the idea is that the incremental archive may still be in the cache from when it was written?
I think it's not worth introducing complexity in the mode used for reading the archives:

we will also enable direct-io for writing archives as soon as all the required support code is ready
incrementals are small-enough that either mode should really be fine for them
in general for read side I think buffer cache isn't very useful for io-uring reader, since it does lots of read-ahead on its own and read disk bandwidth is plentiful

dachen0 · 2026-02-11T19:57:41Z

Thank you @kskalski for being patient and for the many, many code reviews and suggestions! It was a pleasure to work with you on this. Glad to see this finally merged :).

Copilot AI review requested due to automatic review settings December 3, 2025 15:46

Copilot started reviewing on behalf of dachen0 December 3, 2025 15:46 View session

mergify Bot added community need:merge-assist labels Dec 3, 2025

Copilot finished reviewing on behalf of dachen0 December 3, 2025 15:50

Copilot AI reviewed Dec 3, 2025

alessandrod self-requested a review December 5, 2025 01:29

alessandrod requested changes Dec 5, 2025

kskalski reviewed Dec 5, 2025

dachen0 mentioned this pull request Dec 12, 2025

Remove LargeBuffer and use only PageAlignedMemory #9531

Merged

dachen0 force-pushed the o-direct-real branch from 04db62e to ddaf1ee Compare December 26, 2025 17:43

kskalski reviewed Dec 27, 2025

Comment thread fs/src/io_uring/sequential_file_reader.rs Outdated

Comment thread fs/src/io_uring/sequential_file_reader.rs Outdated

Comment thread fs/src/io_uring/sequential_file_reader.rs Outdated

kskalski reviewed Jan 2, 2026

Comment thread fs/src/lib.rs Outdated

kskalski reviewed Jan 2, 2026

Comment thread fs/src/io_uring/sequential_file_reader.rs Outdated

kskalski reviewed Jan 2, 2026

Comment thread fs/src/fs_info.rs Outdated

Comment thread fs/src/fs_info.rs Outdated

Comment thread fs/src/fs_info.rs Outdated

kskalski reviewed Jan 6, 2026

Comment thread fs/src/io_uring/sequential_file_reader.rs Outdated

Comment thread fs/src/io_uring/sequential_file_reader.rs Outdated

Comment thread fs/src/io_uring/sequential_file_reader.rs Outdated

Comment thread fs/src/io_uring/sequential_file_reader.rs Outdated

dachen0 force-pushed the o-direct-real branch 4 times, most recently from d2bd179 to 6ddc07f Compare January 6, 2026 18:12

kskalski added the CI Pull Request is ready to enter CI label Jan 7, 2026

anza-team removed the CI Pull Request is ready to enter CI label Jan 7, 2026

dachen0 force-pushed the o-direct-real branch from 764e55b to d96d029 Compare January 30, 2026 17:55

kskalski added the CI Pull Request is ready to enter CI label Feb 1, 2026

anza-team removed the CI Pull Request is ready to enter CI label Feb 1, 2026

kskalski reviewed Feb 2, 2026

Comment thread fs/src/io_uring/sequential_file_reader.rs Outdated

Comment thread fs/src/io_uring/sequential_file_reader.rs Outdated

dachen0 force-pushed the o-direct-real branch from d96d029 to 40061c2 Compare February 2, 2026 02:25

kskalski previously approved these changes Feb 2, 2026

kskalski added the CI Pull Request is ready to enter CI label Feb 2, 2026

anza-team removed the CI Pull Request is ready to enter CI label Feb 2, 2026

use direct io in sequential file reader

63207cf

dachen0 dismissed kskalski’s stale review via 63207cf February 2, 2026 13:09

dachen0 force-pushed the o-direct-real branch from 40061c2 to 63207cf Compare February 2, 2026 13:09

kskalski added the CI Pull Request is ready to enter CI label Feb 2, 2026

anza-team removed the CI Pull Request is ready to enter CI label Feb 2, 2026

kskalski approved these changes Feb 3, 2026

kskalski requested a review from vadorovsky February 9, 2026 06:14

vadorovsky approved these changes Feb 9, 2026

alessandrod requested a review from Copilot February 10, 2026 07:25

Copilot started reviewing on behalf of alessandrod February 10, 2026 07:26 View session

Copilot AI reviewed Feb 10, 2026

alessandrod approved these changes Feb 10, 2026

kskalski added this pull request to the merge queue Feb 10, 2026

Merged via the queue into anza-xyz:master with commit ff3c981 Feb 10, 2026
55 checks passed

kskalski mentioned this pull request Feb 23, 2026

feat(fs): support direct IO in file creator #9856

Merged
