Add support for direct io in SequentialFileReader#9395
Conversation
There was a problem hiding this comment.
Pull request overview
This PR adds support for the O_DIRECT flag when opening files in SequentialFileReader, providing a 7% performance improvement for snapshot operations. To support O_DIRECT's alignment requirements, the implementation switches from Vec-based buffers to page-aligned memory allocated via mmap. A fallback mechanism gracefully handles systems where O_DIRECT is not supported.
Key changes:
- Replaces
LargeBufferenum with direct use ofPageAlignedMemoryfor all io_uring buffers - Adds O_DIRECT flag with fallback to standard file operations when O_DIRECT fails
- Introduces
new_large_buffer()function that allocates regular page-aligned memory as fallback when huge tables aren't available
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 6 comments.
| File | Description |
|---|---|
| fs/src/io_uring/sequential_file_reader.rs | Adds O_DIRECT flag with fallback logic and updates type signatures to use PageAlignedMemory; adds Clone bound to path parameters |
| fs/src/io_uring/memory.rs | Removes LargeBuffer enum wrapper, adds alloc_regular() method for standard page-aligned allocation, implements AsMut<[u8]> for PageAlignedMemory, and converts allocation logic to standalone function |
| fs/src/io_uring/file_creator.rs | Updates type references from LargeBuffer to PageAlignedMemory and adjusts imports |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
|
In general I'm a bit suspicious if In my tests the bottleneck for snapshot untar is actually writing (we write 3-4x more data than we read + writes are in general slower than reads), so to evaluate this change maybe it would be worth to test the performance of reading + decompression (e.g. running this code in isolation agave/snapshots/src/unarchive.rs Lines 42 to 43 in a2a68b8 The memory buffer changes are interesting - it limits use of external buffer, but we don't need that currently and some seems to be cleaner by removing generic |
|
Yes, I also saw that we were heavily write constrained. I thought the speedup might be since we don't go through the kernel cache for reads, the writes are more efficient since the load on the system kernel cache is reduced.
I was also looking into using O_DIRECT for writes because that's the main bottleneck. We shouldnt be touching kernel cache at all IMO since there's so many account storages/ledger entries that are basically single use and shouldn't be cached. If caching is needed it should be implemented in user space. I will write some benches for just snapshot unpacking and see if there's an isolated speedup.
I'm not sure if we should put the memory changes in a separate PR -- Vec allocs are faster than mmap, so if aligned memory for the buffer isn't necessary then the alloc is faster with the current impl.
…-------- Original Message --------
On Wednesday, 12/03/25 at 19:45 Kamil Skalski ***@***.***> wrote:
kskalski left a comment [(anza-xyz/agave#9395)](#9395 (comment))
In general I'm a bit suspicious if O_DIRECT helps for the reader - it disables use of kernel caches, which for sequential reading might actually make things worse (we might not benefit from any read-ahead that kernel would do to optimize sequential reading that actually matches the read pattern of the reader).
We do read-ahead on our own though, so I suppose it's possible that O_DIRECT prevents some unnecessary reads or avoids wasting time on populating caches that are going to be one-use only...
In my tests the bottleneck for snapshot untar is actually writing (we write 3-4x more data than we read, writes are in general slower than reads), so to evaluate this change maybe it would be worth to test the performance of reading + decompression (e.g. running this code in isolation https://github.com/anza-xyz/agave/blob/a2a68b8a3a81e6a69b07015ac27eda27c94b3244/snapshots/src/unarchive.rs#L42-L43). If the speedup is reproducible in isolation and for the whole unpacking process, I will try it out and compare profiles.
The memory buffer changes are interesting - it limits use of external buffer, but we don't need that currently and some seems to be cleaner by removing generic B... In any case I think I would prefer that to be done as separate PR.
—
Reply to this email directly, [view it on GitHub](#9395 (comment)), or [unsubscribe](https://github.com/notifications/unsubscribe-auth/AJVAUIM6AYRS4DZVICINWB33757ZNAVCNFSM6AAAAACN5Z432CVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZTMMBZGQZDINBRGI).
You are receiving this because you authored the thread.Message ID: ***@***.***>
|
|
Benching using the attached diff test sees ~3-4% speedup on decompressor only on debug build. |
alessandrod
left a comment
There was a problem hiding this comment.
thanks!
marking as changes requested just so I get to review (after breakpoint)
kskalski
left a comment
There was a problem hiding this comment.
Thanks for testing this and it in fact looks promising. We will also want this kind of change for file creator as you mentioned.
I'm leaving a few comments on the code - we should make this PR simpler and also make use of O_DIRECT optional as this hits quite different code paths in the kernel, which historically we used to have trouble with.
|
Just as an update, I am still working on this. However, it's currently finals week for me so I will only really be able to work on this in a week or a week and half. Posting this for clarity. |
04db62e to
ddaf1ee
Compare
|
Rebased to master and used the builder. |
|
also need to rebase, conflicted with some of my recent changes... |
d2bd179 to
6ddc07f
Compare
|
Rebase done, test-checks.sh passes. |
|
You could update the PR description - aligned memory was done separately and this PR doesn't enable the mode just yet. |
764e55b to
d96d029
Compare
kskalski
left a comment
There was a problem hiding this comment.
Great, it looks clean now. Just small suggestions about wording of the comments.
d96d029 to
40061c2
Compare
|
Done. |
|
Sanity checks fail on some white-space issue: |
40061c2 to
63207cf
Compare
|
Fixed the whitespace issue. I copied the text from the suggestion and it must have had trailing whitespace for some reason. |
vadorovsky
left a comment
There was a problem hiding this comment.
No remarks to the implementation itself. Do I understand correctly, that this PR enables direct IO only in tests, and we need to enable it in the actual snapshot unarchive code separately?
Given the perf results you added in description, I guess the best place to let the caller enable or disable direct IO would be unarchive_snapshot:
Then passed along to streaming_unarchive_snapshot -> decopmressed_tar_reader -> large_file_buf_leader, which then uses the SequentialFileReaderBuilfer.
So then we can enable it here for non-incremental snapshots:
And then disable it for incremental ones:
Can be done separately, of course.
Yeah, I have that kind of change already in progress on a separate branch - the idea is to have a validator / ledger-tool flag that controls use of direct-io for snapshots (default will be to enable). It's definitely something for a separate PR, and for now we also should get write change in (#9856) |
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 1 out of 1 changed files in this pull request and generated 5 comments.
Comments suppressed due to low confidence (1)
fs/src/io_uring/sequential_file_reader.rs:824
- In direct-IO mode, the retry path for short reads resubmits with
file_offset + last_read_lenandbuf_offset = total_read_len. Iflast_read_lenis not aligned to the O_DIRECT requirements, the next submission will use an unaligned file offset/buffer offset and is likely to fail withEINVAL. The direct-IO implementation should ensure all resubmissions keep required alignment (or avoid the "retry remaining bytes" approach under O_DIRECT and handle partial reads differently).
if last_read_len > 0 && last_read_len < *read_len {
// Partial read, retry the op with updated offsets
let op: ReadOp = ReadOp {
fd: *fd,
buf,
is_direct_io: *is_direct_io,
buf_offset: total_read_len,
file_offset: *file_offset + last_read_len as FileSize,
read_len: *read_len - last_read_len,
reader_buf_index: *reader_buf_index,
is_last_read: *is_last_read,
};
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| #[cfg(test)] | ||
| pub fn use_direct_io(mut self, use_direct_io: bool) -> Self { | ||
| self.use_direct_io = use_direct_io; | ||
| self | ||
| } |
There was a problem hiding this comment.
use_direct_io is gated behind #[cfg(test)], so production code (e.g., large_file_buf_reader) cannot enable O_DIRECT at all and open_file_flags will always be O_NOATIME in non-test builds. If this PR is intended to speed up snapshot reads in production, this method (or an equivalent config path) needs to be available outside tests and wired through the call sites that construct SequentialFileReaderBuilder.
| assert!( | ||
| self.read_capacity | ||
| .is_multiple_of(DIRECT_IO_READ_LEN_ALIGNMENT), | ||
| "read size is not aligned for direct IO({} is not a multiple of \ | ||
| {DIRECT_IO_READ_LEN_ALIGNMENT})", | ||
| self.read_capacity | ||
| ); |
There was a problem hiding this comment.
Using assert! for direct-IO alignment will panic at runtime if misconfigured. Since this is user-controlled configuration (and may depend on the underlying filesystem/device), prefer returning an io::Error (e.g., InvalidInput) from build/build_with_buffer so callers can handle it gracefully instead of crashing.
| assert!( | |
| self.read_capacity | |
| .is_multiple_of(DIRECT_IO_READ_LEN_ALIGNMENT), | |
| "read size is not aligned for direct IO({} is not a multiple of \ | |
| {DIRECT_IO_READ_LEN_ALIGNMENT})", | |
| self.read_capacity | |
| ); | |
| if !self | |
| .read_capacity | |
| .is_multiple_of(DIRECT_IO_READ_LEN_ALIGNMENT) | |
| { | |
| return Err(io::Error::new( | |
| io::ErrorKind::InvalidInput, | |
| format!( | |
| "read size is not aligned for direct IO({} is not a multiple of \ | |
| {DIRECT_IO_READ_LEN_ALIGNMENT})", | |
| self.read_capacity | |
| ), | |
| )); | |
| } |
| // Align the read length if necessary | ||
| let internal_read_len = if *is_direct_io && *read_len != buf.len() { | ||
| // Try to align the read len if possible and fall back to reading | ||
| // the full remaining bytes if we can't align the read len. | ||
| read_len | ||
| .next_multiple_of(DIRECT_IO_READ_LEN_ALIGNMENT) | ||
| .min(buf.len() - *buf_offset) | ||
| } else { | ||
| *read_len | ||
| }; |
There was a problem hiding this comment.
internal_read_len can be larger than the logical read_len to satisfy O_DIRECT alignment, but complete() and EOF handling still use read_len/is_last_read. This can expose more bytes than read_limit (and/or mark eof_pos beyond the intended limit) when internal_read_len > read_len, breaking the documented "read finishes after read_limit bytes" behavior. Consider tracking both the submitted length vs requested length and clamping the buffer’s readable length/EOF position to the requested read_len (discarding any over-read bytes) so fill_buf() never returns more than requested.
|
|
||
| impl<'a> FileBufRead<'a> for SequentialFileReader<'a> { | ||
| /// The `SequentialFileReader` must be in direct io mode if passing in direct io files. | ||
| /// `read_limit` must be less than the file size if using direct io. |
There was a problem hiding this comment.
Doc inconsistency: here it says read_limit must be "less than the file size" for direct IO, but the earlier docs for add_owned_file_to_prefetch say "less than or equal". Please make these consistent (and ideally enforce the constraint in code if it’s required).
| /// `read_limit` must be less than the file size if using direct io. | |
| /// `read_limit` must be less than or equal to the file size if using direct io. |
| #[test] | ||
| fn test_direct_io_read() { | ||
| check_reading_file(2_500, 4096, 4096, true); | ||
| check_reading_file(2_500, 16384, 4096, true); | ||
| check_reading_file(25_000, 4096, 4096, true); | ||
| check_reading_file(25_000, 16384, 4096, true); | ||
| check_reading_file(250_000, 4096, 4096, true); | ||
| check_reading_file(250_000, 16384, 4096, true); | ||
| check_reading_file(4096, 4096, 4096, true); | ||
| check_reading_file(4096, 16384, 4096, true); | ||
| check_reading_file(16384, 4096, 4096, true); | ||
| check_reading_file(16384, 16384, 4096, true); | ||
| } |
There was a problem hiding this comment.
test_direct_io_read assumes the temp directory’s filesystem supports O_DIRECT. On common Linux setups where tempfile uses /tmp mounted as tmpfs, O_DIRECT open can fail with EINVAL, making this unit test flaky across CI environments. Consider skipping the test when opening with O_DIRECT fails, or creating the temp file in a directory/filesystem known to support direct IO.
|
Thanks @dachen0 for the idea and work on this PR. I created a PR (#10507) to plug enabling direct-io into the validator and lerger tool, but I want to wait for file creator PR (#9856) to get merged.
@vadorovsky I wonder why do you think it should use different setting for non-incremental and incremental archive? I think when we start-up and read the archives they both are read-once (at least if all goes well), so there isn't any benefit from involving buffer cache when reading them. Or maybe the idea is that it's possible for incremental archive to still be in the cache since the time it was written?
|
|
Thank you @kskalski for being patient and many, many code reviews and suggestions! It was a pleasure to work with you on this. Glad to see this finally merged :). |
Summary of Changes
By using direct io when reading snapshots, we gain a 7% speedup. On the same snapshot and incremental snapshot when running
ledger-tool verify:cc @kskalski