feat(fs): support direct IO in file creator #9856

Merged
kskalski merged 6 commits into anza-xyz:master from kskalski:ks/dio
Feb 23, 2026

Conversation

@kskalski kskalski commented Jan 8, 2026

Problem

Writing data to disk is faster with direct-IO, since it avoids the kernel allocating and populating buffer caches. The lack of caching can be a downside if the written data would fit into free memory and is going to be read back shortly afterwards. That depends on the use case, however, so ideally direct-IO mode should be available but configurable.

Summary of Changes

  • add write_with_direct_io(bool) function to IoUringFileCreatorBuilder
  • support opening files and performing aligned writes using direct IO
  • switch back to non-direct IO mode upon file completion
  • perform EOF non-aligned write after file is switched to non-direct IO mode (if direct IO was used on file open)
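
The sequence described in the bullets above can be sketched as a small planning function. This is only an illustration under stated assumptions: the `Op` enum and `plan_writes` are hypothetical names that do not come from the PR, the real file creator submits io_uring operations rather than building a list, and the sketch assumes the switch to non-direct IO always happens once the aligned writes are done.

```rust
/// Hypothetical operation kinds, used only for this illustration.
#[derive(Debug, PartialEq)]
enum Op {
    /// Aligned write issued while the file is open in direct IO mode.
    DirectWrite { offset: u64, len: u64 },
    /// Switch the file back to non-direct IO mode.
    DisableDirectIo,
    /// Final write of the non-aligned remainder at EOF.
    BufferedWrite { offset: u64, len: u64 },
}

/// Plans the writes for a file of `file_size` bytes fed in `buf_size` chunks.
/// `align` must be a power of two and `buf_size` a non-zero multiple of it.
fn plan_writes(file_size: u64, buf_size: u64, align: u64) -> Vec<Op> {
    assert!(align.is_power_of_two() && buf_size > 0 && buf_size % align == 0);
    // Largest prefix of the file whose length is a multiple of `align`.
    let aligned_len = file_size & !(align - 1);
    let mut ops = Vec::new();
    let mut offset = 0;
    while offset < aligned_len {
        let len = buf_size.min(aligned_len - offset);
        ops.push(Op::DirectWrite { offset, len });
        offset += len;
    }
    // Assumption: direct IO is turned off once all aligned writes are planned.
    ops.push(Op::DisableDirectIo);
    if file_size > aligned_len {
        ops.push(Op::BufferedWrite {
            offset: aligned_len,
            len: file_size - aligned_len,
        });
    }
    ops
}
```

For example, `plan_writes(10_000, 8192, 4096)` yields one direct write of 8192 bytes at offset 0, the mode switch, and a buffered write of the remaining 1808 bytes at offset 8192.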

Performance numbers

Compared unpacking accounts storages with agave-ledger-tool and validator stopped:

echo 3 | sudo tee /proc/sys/vm/drop_caches
tool-master verify --snapshots ./ledger-snapshots/
...
solana_runtime::snapshot_utils] snapshot untar took 103.2s
...
solana_runtime::serde_snapshot] Building accounts index... Done in 42.842940661s

echo 3 | sudo tee /proc/sys/vm/drop_caches
tool-dio verify --snapshots ./ledger-snapshots/
...
solana_runtime::snapshot_utils] snapshot untar took 77.8s
...
solana_runtime::serde_snapshot] Building accounts index... Done in 51.040653131s

Not having the accounts data buffered has a visible cost (~8s slowdown on index generation, 42.8s to 51.0s), but the speedup for unpacking is significantly larger (~25s, 103.2s to 77.8s).
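
As a quick check of the arithmetic, the timings from the two runs above can be combined as follows. The function name `net_effect` is made up for this sketch, and the result says nothing beyond this single pair of measurements.

```rust
/// Returns (untar seconds saved, index seconds lost, net seconds saved).
fn net_effect(
    untar_before: f64,
    untar_after: f64,
    index_before: f64,
    index_after: f64,
) -> (f64, f64, f64) {
    let saved = untar_before - untar_after;
    let lost = index_after - index_before;
    (saved, lost, saved - lost)
}
```

Calling `net_effect(103.2, 77.8, 42.842940661, 51.040653131)` gives roughly 25.4s saved on untar, 8.2s lost on index generation, and a net gain of about 17.2s.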

codecov-commenter commented Jan 20, 2026

Codecov Report

❌ Patch coverage is 97.12230% with 4 lines in your changes missing coverage. Please review.
✅ Project coverage is 81.8%. Comparing base (67d3dd1) to head (7776cdb).
⚠️ Report is 4 commits behind head on master.

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #9856      +/-   ##
==========================================
- Coverage    83.0%    81.8%    -1.3%     
==========================================
  Files         849      847       -2     
  Lines      318240   307543   -10697     
==========================================
- Hits       264335   251726   -12609     
- Misses      53905    55817    +1912     

@kskalski kskalski force-pushed the ks/dio branch 7 times, most recently from f4efe4a to 1f14346 January 21, 2026 07:21
@kskalski kskalski changed the title from "Support direct IO in file creator" to "feat(fs): support direct IO in file creator" Jan 21, 2026
@kskalski kskalski marked this pull request as ready for review January 21, 2026 09:00
dachen0 commented Jan 21, 2026

So I thought about the "what do we do when we have non-aligned writes" problem when working on #10105.
I think the best answer is to open the file with both direct io and regular io flags at the same time:
just store two file descriptors instead of reopening the file every time you need to do a non-aligned write. This should result in less syscalls and general pain.

@kskalski
Author

I think the best answer is to open the file with both direct io and regular io flags at the same time

That's an interesting idea. It might simplify some bits of this PR, since aligned (DIO) and unaligned writes could follow independent paths, and the regular open could be done before the direct IO path finishes. I'm not sure, though, that the benefit would justify the re-implementation:

  • we only need at most one unaligned write, and it is always < 4096 bytes
  • the fcntl syscall that changes the FD mode is pretty fast: 9 out of 20k samples in a profile
  • one implementation difficulty is that the last buffer we read from src contains both aligned and unaligned data, so the writes need to be coordinated, or we stretch safety by using the same buffer in two independent ops (arguably this could be simplified by giving up direct IO for the whole last buffer...)
  • we still need to coordinate state updates so that writes in a given mode start only after the corresponding open op finishes, and file completion then needs to wait for both paths in full

This should result in less syscalls and general pain.

Both opens would be done by io-uring, as opposed to one open + fcntl from user space, but the latter is fast and in general we don't save the kernel any work (the second open might actually cost the kernel more work).

If any reviewer sees the current approach, which I think works pretty well, as problematic, I could give the above idea a shot.

For now, IMHO, these ideas are better to pursue:

  • optimize to not use DIO for very small files, e.g. if a file is < 4096 bytes, it doesn't make sense to open it in DIO mode at all; similarly, if it's up to some size (e.g. 64KiB), the need to do aligned + unaligned writes likely outweighs any benefit
  • maybe the whole last buffer could be written as non-DIO instead of being split into aligned and non-aligned parts
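
The first idea could look roughly like the check below. It is a sketch only: `should_use_direct_io` and `MIN_DIRECT_IO_FILE_SIZE` are hypothetical names, and the 64 KiB cutoff is just the example value mentioned above, not a measured threshold.

```rust
/// Hypothetical cutoff; files up to this size skip direct IO entirely.
/// 64 KiB is only the example value from the discussion, not a tuned number.
const MIN_DIRECT_IO_FILE_SIZE: u64 = 64 * 1024;

/// Decides whether a file of `file_size` bytes should be opened in direct IO
/// mode, given that direct IO was requested through the builder.
fn should_use_direct_io(direct_io_requested: bool, file_size: u64) -> bool {
    direct_io_requested && file_size > MIN_DIRECT_IO_FILE_SIZE
}
```

Because the cutoff is larger than 4096, the "file smaller than one aligned block" case is covered by the same comparison.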

BTW, the typical way to approach the problem of an unaligned last write is to write past the end of the file (simply the whole buffer, up to the alignment offset) and then truncate. Some issues with that:

  • io-uring supports truncate only from kernel 6.9 on, so it won't be available in the kernels we support for a long time
  • user-space truncate is really slow: when I tried it inside the file-creator blocking path, it ate all the benefit
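
For reference, the write-past-EOF-then-truncate pattern mentioned above can be sketched with the standard library alone. Note the assumptions: the file is opened in regular buffered mode here (a real direct IO variant would also need an aligned buffer address and the direct IO open flag), `ALIGN` and `write_padded_then_truncate` are illustrative names, and the truncate is the user-space call that the comment reports as slow.

```rust
use std::fs::OpenOptions;
use std::io::{self, Write};
use std::path::Path;

/// Alignment assumed for illustration; must be a power of two.
const ALIGN: usize = 4096;

/// Writes `data` zero-padded up to a multiple of `ALIGN`, then truncates the
/// file back to `data.len()` bytes.
fn write_padded_then_truncate(path: &Path, data: &[u8]) -> io::Result<()> {
    // Round the length up to the next multiple of ALIGN.
    let padded_len = (data.len() + ALIGN - 1) & !(ALIGN - 1);
    let mut buf = vec![0u8; padded_len];
    buf[..data.len()].copy_from_slice(data);

    let mut file = OpenOptions::new()
        .write(true)
        .create(true)
        .truncate(true)
        .open(path)?;
    // This write extends past the logical end of the file.
    file.write_all(&buf)?;
    // User-space truncate back to the real size.
    file.set_len(data.len() as u64)?;
    Ok(())
}
```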

Member

@vadorovsky vadorovsky left a comment

The direct IO part looks good to me.

I was about to write a nitpicky comment that I would prefer a trait over an enum, but then I saw #10071, which I think could go in first.

@kskalski
Author

Ok, sounds good, I rebased on top of #10071

vadorovsky previously approved these changes Feb 21, 2026
Member

@vadorovsky vadorovsky left a comment

Thanks!

Copilot AI left a comment

Pull request overview

This PR adds support for direct IO (O_DIRECT) in the file creator to improve write performance by bypassing kernel buffer caches. The implementation handles the complexity of O_DIRECT requirements (alignment constraints) by performing aligned writes with O_DIRECT enabled, then switching to normal IO mode for any non-aligned data at the end of files.

Changes:

  • Added write_with_direct_io(bool) configuration method to IoUringFileCreatorBuilder (test-only)
  • Implemented aligned write logic that truncates non-aligned EOF data and schedules it for later
  • Added mechanism to disable O_DIRECT via fcntl before writing non-aligned EOF data
  • Added comprehensive test coverage for various file and buffer size combinations

Comment thread fs/src/io_uring/file_creator.rs Outdated
Comment on lines +105 to +106
#[cfg(test)]
pub fn write_with_direct_io(mut self, enable_direct_io: bool) -> Self {
Copilot AI Feb 23, 2026

The PR description states that "ideally direct-IO mode should be made available, but configurable", but the write_with_direct_io method is marked with #[cfg(test)], limiting it to test code only. If the intention is to make this feature available for production use (as the PR description suggests), the #[cfg(test)] attribute should be removed. Otherwise, there's a discrepancy between the PR description and the implementation.

Author

it will be used in prod code with #10507

why cfg(test)?

Author

This PR only adds support for the DIO mode; it's disabled by default and not used in production. The change to enable it in actual clients is in #10507.
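
A minimal sketch of an opt-in builder flag that defaults to disabled is shown below. `FileCreatorBuilder`, `FileCreator`, `build`, and `uses_direct_io` are hypothetical stand-ins, not the real `IoUringFileCreatorBuilder` API; only the `write_with_direct_io(mut self, enable_direct_io: bool) -> Self` signature is taken from the snippet quoted in this thread, and the `#[cfg(test)]` gate is left out so the sketch compiles on its own.

```rust
/// Hypothetical stand-in for the builder; only the direct IO flag is modeled.
#[derive(Debug, Default)]
struct FileCreatorBuilder {
    /// `false` by default, so direct IO is strictly opt-in.
    direct_io: bool,
}

impl FileCreatorBuilder {
    fn new() -> Self {
        Self::default()
    }

    /// Same signature as the method quoted in the review thread.
    fn write_with_direct_io(mut self, enable_direct_io: bool) -> Self {
        self.direct_io = enable_direct_io;
        self
    }

    /// Hypothetical finalizer; the real builder takes more configuration.
    fn build(self) -> FileCreator {
        FileCreator {
            direct_io: self.direct_io,
        }
    }
}

/// Hypothetical product type that only remembers the chosen mode.
struct FileCreator {
    direct_io: bool,
}

impl FileCreator {
    fn uses_direct_io(&self) -> bool {
        self.direct_io
    }
}
```

With this shape, `FileCreatorBuilder::new().build()` keeps direct IO off, and callers have to chain `.write_with_direct_io(true)` explicitly to turn it on.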

But it's still a public item, right? You shouldn't get a warning. What am I missing?

Author

Well, it's public in the io_uring module, but we currently do not widely expose that interface: it's hidden behind the selective top-level exports in file_io.

I see

Comment thread fs/src/io_uring/file_creator.rs Outdated

// This is conservative write size alignment for use with direct IO, some block devices may have
// relaxed requirements, but detecting it is not trivial.
const DIRECT_IO_WRITE_LEN_ALIGNMENT: IoSize = 4096;
I'm a bit confused by this. Isn't the default 512 pretty much everywhere?

Also can't it be queried with statx?

Author

We had a bit of discussion about that in #9395 (comment). The current thinking is:

  • yes, STATX_DIOALIGN would probably solve that, but it's supported from kernel 6.1, which is above the kernel version we want to support right now; we should revisit it in the future
  • currently, when you use statx and check the block size, you get the filesystem block size, which is 4096, not the block size of the underlying block device
  • 4096 is a conservative value to use per https://man7.org/linux/man-pages/man2/open.2.html, given how vaguely it is worded ("typically", "most", etc.):
    "most filesystems based on block devices require that the file offset and the length and memory address of all I/O segments be multiples of the filesystem block size (typically 4096 bytes). In Linux 2.6.0, this was relaxed to the logical block size of the block device (typically 512 bytes)."
  • clearly querying for it at runtime brings a bit of complexity in the code and (I guess tiny) perf impact, let's revisit it in the future, especially when we can use STATX_DIOALIGN
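
The requirement quoted from the man page (file offset, length, and memory address all being multiples of the block size) can be expressed as a small predicate. This is a sketch with a hypothetical function name; it takes plain integers so it stays independent of any particular buffer type.

```rust
/// Returns true when the file offset, the write length, and the buffer
/// address are all multiples of `align` (which must be a power of two).
fn is_direct_io_aligned(offset: u64, len: u64, addr: u64, align: u64) -> bool {
    assert!(align.is_power_of_two());
    let mask = align - 1;
    (offset & mask) == 0 && (len & mask) == 0 && (addr & mask) == 0
}
```

Since 4096 is a multiple of 512, anything that passes the check with `align = 4096` also passes it with `align = 512`, which is why 4096 is the conservative choice. The reverse does not hold: a 512-byte write at offset 512 is fine for a 512-byte device but fails the 4096 check.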


@alessandrod alessandrod Feb 23, 2026

In Linux 2.6.0, this was relaxed to the logical block size of the
block device (typically 512 bytes).

you missed this part? that was nearly 20 years ago :D I've checked the source and virtually all fs use 512

clearly querying for it at runtime brings a bit of complexity in the code and (I guess tiny) perf impact, let's revisit it in the future, especially when we can use STATX_DIOALIGN

You really only need to query once. All fs drivers proxy to the block device, so this is really about whether the block device uses 512-byte or 4k blocks.

Author

OK, I updated the constant (regarding encrypted filesystems: the tests still work fine with this value on my laptop's encrypted fs).

Comment thread fs/src/io_uring/file_creator.rs Outdated
/// Note: this returns `true` if current stage writes are done, there might still be
/// last write to be scheduled using `non_dio_eof_write`
fn required_writes_done(&self) -> bool {
self.writes_started == self.writes_completed && self.size_on_eof.is_some()
Why the size_on_eof condition here? It seems confusing; nothing in the method name suggests it.

Also, maybe pending_writes_done? It's unclear what a "required" write is.

Author

@kskalski kskalski Feb 23, 2026

size_on_eof.is_some() is the condition verifying that we have finished reading data from the source in write_and_close (basically, we can't say that all writes are done until we reach EOF on the input, and reading may be happening concurrently with write ops).
"required" was meant to convey exactly that: we wait not only for already scheduled / pending writes, but for all writes that might still need to be created.

Author

@kskalski kskalski Feb 23, 2026

I factored size_on_eof.is_some() out into a helper function for added docs / readability. I'm still not sure whether there is a better name than "required":

  • this function works for both "stages" of writing:
    • all the aligned writes before switching to non-DIO
    • after we turn off DIO and possibly do the final write

In each stage there are required writes to be made before the stage ends, and they are not always already scheduled (e.g. while we are still reading the source).
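
The refactoring described here might look roughly like the sketch below. The field names and `required_writes_done` follow the snippet quoted in this thread; the `PendingFile` struct, the field types, and the helper name `source_fully_read` are assumptions made for illustration, since the merged code is not shown on this page.

```rust
/// Hypothetical, simplified per-file state.
struct PendingFile {
    /// Number of write ops submitted so far.
    writes_started: usize,
    /// Number of write ops that have completed.
    writes_completed: usize,
    /// Set once the source reader reaches EOF; holds the final file size.
    size_on_eof: Option<u64>,
}

impl PendingFile {
    /// Hypothetical helper name: true once the source has been read to EOF,
    /// meaning no further writes will be created for the current stage.
    fn source_fully_read(&self) -> bool {
        self.size_on_eof.is_some()
    }

    /// True when every write of the current stage has completed and no more
    /// writes can be created. A final non-aligned EOF write may still follow.
    fn required_writes_done(&self) -> bool {
        self.writes_started == self.writes_completed && self.source_fully_read()
    }
}
```

With this split, a file whose started and completed counts match but whose source is still being read reports `false`, which is the case the `size_on_eof` condition guards against.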

@kskalski kskalski added this pull request to the merge queue Feb 23, 2026
Merged via the queue into anza-xyz:master with commit 1ff9663 Feb 23, 2026
51 checks passed
@kskalski kskalski deleted the ks/dio branch February 23, 2026 04:41
6 participants