feat(fs): support direct IO in file creator by kskalski · Pull Request #9856 · anza-xyz/agave

kskalski · 2026-01-08T05:19:32Z

Problem

Writing data to disk is faster with direct IO, since it avoids the kernel allocating and populating buffer caches. The lack of caching can be a downside if the written data fits into free memory and will be read back shortly afterwards. That depends on the use case, so ideally direct-IO mode should be available but configurable.

Summary of Changes

add write_with_direct_io(bool) function to IoUringFileCreatorBuilder
support opening files and performing aligned writes using direct IO
switch back to non-direct IO mode upon file completion
perform EOF non-aligned write after file is switched to non-direct IO mode (if direct IO was used on file open)
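The aligned/non-aligned split described in the list above can be sketched as a small length calculation. This is only an illustration: `split_aligned` is a hypothetical helper, not a function from this PR, and the alignment is passed in as a parameter rather than taken from the PR's constant.

```rust
/// Hypothetical helper (not from the PR): split a pending write of `len` bytes
/// into the prefix that can be written with direct IO (a multiple of
/// `alignment`) and the unaligned tail that has to wait for non-direct mode.
fn split_aligned(len: u64, alignment: u64) -> (u64, u64) {
    // Direct IO alignments are powers of two, which makes the mask trick valid.
    assert!(alignment.is_power_of_two());
    let tail = len & (alignment - 1);
    (len - tail, tail)
}
```

With a 4096-byte alignment, a 10,000-byte write splits into 8192 bytes for the direct IO path and a 1808-byte tail.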

Performance numbers

Compared unpacking of accounts storages using agave-ledger-tool with the validator stopped:

echo 3 | sudo tee /proc/sys/vm/drop_caches
tool-master verify --snapshots ./ledger-snapshots/
...
solana_runtime::snapshot_utils] snapshot untar took 103.2s
...
solana_runtime::serde_snapshot] Building accounts index... Done in 42.842940661s

echo 3 | sudo tee /proc/sys/vm/drop_caches
tool-dio verify --snapshots ./ledger-snapshots/
...
solana_runtime::snapshot_utils] snapshot untar took 77.8s
...
solana_runtime::serde_snapshot] Building accounts index... Done in 51.040653131s

There is a visible impact from not having accounts data buffered (~8s slowdown on index generation), but the speedup for unpacking is significantly larger (~25s), for a net gain of roughly 17s.

codecov-commenter · 2026-01-20T09:42:16Z

Codecov Report

❌ Patch coverage is 97.12230% with 4 lines in your changes missing coverage. Please review.
✅ Project coverage is 81.8%. Comparing base (67d3dd1) to head (7776cdb).
⚠️ Report is 4 commits behind head on master.

Additional details and impacted files

@@            Coverage Diff             @@
##           master    #9856      +/-   ##
==========================================
- Coverage    83.0%    81.8%    -1.3%     
==========================================
  Files         849      847       -2     
  Lines      318240   307543   -10697     
==========================================
- Hits       264335   251726   -12609     
- Misses      53905    55817    +1912

dachen0 · 2026-01-21T18:10:22Z

so I thought about the "what do we do when we have non aligned writes" problem when working on #10105
I think the best answer is to open the file with both direct io and regular io flags at the same time
just store two file descriptors instead of reopening the file every time you need to do a nonaligned write. This should result in less syscalls and general pain.

kskalski · 2026-01-22T00:04:35Z

I think the best answer is to open the file with both direct io and regular io flags at the same time just

That's an interesting idea, this might simplify some bits of this PR, since aligned (DIO) and unaligned writes could follow independent paths and the regular open could be done before waiting for direct io path to finish... I'm not sure though if the benefit would justify the re-implementation:

we only really need (at most) one unaligned write, which is always < 4096 bytes
the fcntl syscall to change the FD mode is pretty fast - 9 / 20k samples on a profile
one difficulty in the implementation is that the last buffer we read from src contains both aligned and unaligned data, so the two writes either need to be coordinated or have to share the same buffer across two independent ops, which stretches safety (arguably this could be simplified by giving up direct IO on the whole last buffer...)
we still need to coordinate state updates so that writes in a given mode start only after the corresponding open op finishes, and file completion then needs to wait for both paths

This should result in less syscalls and general pain.

Both opens would be done by io-uring, as opposed to one open + fcntl from user-space, but the latter is fast and in general we don't save the kernel any work (the second open might actually cost the kernel more work)

If any reviewer sees current approach, which I think works pretty well, as problematic, I could give the above idea a shot.

For now, IMHO these ideas are better to pursue:

optimize to not use DIO for very small files, e.g. if a file is <4096 bytes, it doesn't make sense to open it in DIO at all; similarly, if it's up to some size (e.g. 64KiB), the need to do aligned + unaligned writes likely outweighs any benefit
maybe the whole last buffer could be written as non-DIO instead of splitting it into aligned vs non-aligned parts
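The first idea above could look roughly like the following. This is a sketch under assumptions: it presumes the file size is known before the file is opened, and both the constant and the function name are hypothetical. The 64 KiB threshold is just the example value mentioned in the comment, not a measured cutoff.

```rust
/// Hypothetical threshold taken from the "e.g. 64KiB" in the comment above;
/// a real value would have to come from benchmarking.
const MIN_DIRECT_IO_FILE_SIZE: u64 = 64 * 1024;

/// Hypothetical helper: decide whether opening a file in direct IO mode is
/// worth it. Files up to the threshold skip direct IO entirely, so they never
/// need the aligned + unaligned write split.
fn should_use_direct_io(file_size: u64) -> bool {
    file_size > MIN_DIRECT_IO_FILE_SIZE
}
```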

BTW, the typical way to approach the problem of unaligned last write is to write past the file end (simply the whole buffer up to alignment offset) and then truncate. Some issues with that:

io-uring supports truncate only from kernel 6.9, so it won't be available in the kernels we support for a long time
user-space truncate is really slow - when I tried that inside the file-creator blocking path, it ate all the benefit
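For reference, the write-past-the-end-then-truncate pattern mentioned above can be sketched with plain `std` file APIs. This is not what the PR does, and it deliberately uses ordinary buffered IO rather than `O_DIRECT`, so it only shows the padding and truncation steps; `write_padded_then_truncate` is a hypothetical name.

```rust
use std::fs::File;
use std::io::{self, Write};
use std::path::Path;

/// Hypothetical sketch: pad `data` with zeros up to the next multiple of
/// `alignment`, write the whole padded buffer, then cut the file back to the
/// real length with a user-space truncate (`File::set_len`).
fn write_padded_then_truncate(path: &Path, data: &[u8], alignment: usize) -> io::Result<()> {
    assert!(alignment.is_power_of_two());
    // Round the length up to the alignment boundary.
    let padded_len = (data.len() + alignment - 1) & !(alignment - 1);
    let mut padded = vec![0u8; padded_len];
    padded[..data.len()].copy_from_slice(data);

    let mut file = File::create(path)?;
    file.write_all(&padded)?;
    // This is the extra syscall that the comment above reports as slow.
    file.set_len(data.len() as u64)?;
    Ok(())
}
```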

vadorovsky

The direct IO part looks good to me.

I was about to write a nitpicky comment, that I would prefer a trait over an enum, but then I've seen #10071, which I think could go in first.

kskalski · 2026-02-20T10:12:18Z

Ok, sounds good, I rebased on top of #10071

vadorovsky

Thanks!

Copilot

Pull request overview

This PR adds support for direct IO (O_DIRECT) in the file creator to improve write performance by bypassing kernel buffer caches. The implementation handles the complexity of O_DIRECT requirements (alignment constraints) by performing aligned writes with O_DIRECT enabled, then switching to normal IO mode for any non-aligned data at the end of files.

Changes:

Added write_with_direct_io(bool) configuration method to IoUringFileCreatorBuilder (test-only)
Implemented aligned write logic that truncates non-aligned EOF data and schedules it for later
Added mechanism to disable O_DIRECT via fcntl before writing non-aligned EOF data
Added comprehensive test coverage for various file and buffer size combinations

Copilot · 2026-02-23T00:40:19Z

+    #[cfg(test)]
+    pub fn write_with_direct_io(mut self, enable_direct_io: bool) -> Self {


The PR description states that "ideally direct-IO mode should be made available, but configurable", but the write_with_direct_io method is marked with #[cfg(test)], limiting it to test code only. If the intention is to make this feature available for production use (as the PR description suggests), the #[cfg(test)] attribute should be removed. Otherwise, there's a discrepancy between the PR description and the implementation.

it will be used in prod code with #10507

why cfg(test)?

This PR only adds support for the DIO mode, it's disabled by default and not used in production. The change to enable it in actual clients is in #10507

but it's still a public item, right? you shouldn't get a warning, what am I missing?

well, it's public in io_uring module, but we currently do not widely expose that interface - it's hidden in top level selective exports in file_io

alessandrod · 2026-02-23T01:33:02Z


+// This is conservative write size alignment for use with direct IO, some block devices may have
+// relaxed requirements, but detecting it is not trivial.
+const DIRECT_IO_WRITE_LEN_ALIGNMENT: IoSize = 4096;


I'm a bit confused by this. Isn't the default 512 pretty much everywhere?

Also can't it be queried with statx?

We had a bit of discussion for that in #9395 (comment), the current thinking about that is:

yes, probably STATX_DIOALIGN would solve that, but it's supported from 6.1, which is above the kernel version we want to support right now - should revisit it in the future

currently when you use statx and check the block size, you get the filesystem block size, which is 4096, not the block size of the underlying block device

4096 is a conservative value to use per https://man7.org/linux/man-pages/man2/open.2.html, given how vaguely it words this with "typically", "most", etc.

most filesystems based on block devices require that the file offset and the length and memory address of all I/O segments be multiples of the filesystem block size (typically 4096 bytes). In Linux 2.6.0, this was relaxed to the logical block size of the block device (typically 512 bytes).

clearly querying for it at runtime brings a bit of complexity in the code and (I guess tiny) perf impact, let's revisit it in the future, especially when we can use STATX_DIOALIGN

In Linux 2.6.0, this was relaxed to the logical block size of the
block device (typically 512 bytes).

you missed this part? that was nearly 20 years ago :D I've checked the source and virtually all fs use 512

clearly querying for it at runtime brings a bit of complexity in the code and (I guess tiny) perf impact, let's revisit it in the future, especially when we can use STATX_DIOALIGN

you really only need to query once, all fs drivers proxy to the block device, so this is really about does the block device do 512 or 4k blocks

ok, I updated the constant (for encrypted fs: the tests still work fine with this value on my laptop's encrypted fs)
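The constraint debated in this thread (file offset, write length and buffer address all being multiples of the alignment) can be written down as a simple check. This is a sketch only: `is_direct_io_aligned` and `AlignedBuf` are hypothetical names, and the alignment is a parameter so the same check covers both 512 and 4096.

```rust
/// Hypothetical buffer type whose storage starts on a 4096-byte boundary,
/// which also satisfies the weaker 512-byte requirement.
#[repr(C, align(4096))]
struct AlignedBuf([u8; 8192]);

/// Hypothetical helper: returns true if a write of `buf` at `offset` meets the
/// direct IO requirement that offset, length and memory address are all
/// multiples of `alignment`.
fn is_direct_io_aligned(offset: u64, buf: &[u8], alignment: usize) -> bool {
    assert!(alignment.is_power_of_two());
    let mask = alignment - 1;
    offset & (mask as u64) == 0
        && buf.len() & mask == 0
        && (buf.as_ptr() as usize) & mask == 0
}
```

A write that passes with an alignment of 512 may still fail the check with 4096, which is the gap between the two values discussed above.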

alessandrod · 2026-02-23T01:33:35Z

+    #[cfg(test)]
+    pub fn write_with_direct_io(mut self, enable_direct_io: bool) -> Self {


why cfg(test)?

alessandrod · 2026-02-23T02:18:34Z

+    /// Note: this returns `true` if current stage writes are done, there might still be
+    /// last write to be scheduled using `non_dio_eof_write`
+    fn required_writes_done(&self) -> bool {
+        self.writes_started == self.writes_completed && self.size_on_eof.is_some()


why the size_on_eof condition here? seems confusing, nothing in the method name suggests it

also maybe pending_writes_done? unclear what a "required" write is

size_on_eof.is_some() is the condition verifying that we finished reading data from the source in write_and_close (basically we can't say that all writes were done until we reach eof reading input, which may be happening concurrently to write ops).
"required" was actually meant to convey that meaning, since we not only wait for any already scheduled / pending writes, but all writes that might still need to be created

factored out size_on_eof.is_some() into a helper function for added doc / readability - still not sure if there is a better name than "required":

this function works for both "stages" of writing:

all the aligned writes before switching to non-dio

after we turn off dio and possibly do the final write

In each stage there are required writes to be made before the stage ends and they are not always already scheduled (e.g. while we are still reading the source).
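The condition discussed in this thread can be sketched in isolation. The field names and `required_writes_done` come from the snippet under review, and `source_fully_read` is the helper name from the follow-up commit; the `WriteProgress` struct itself is a hypothetical stand-in for the real per-file state, which holds more than this.

```rust
/// Hypothetical stand-in for the per-file state tracked by the file creator.
#[derive(Default)]
struct WriteProgress {
    writes_started: usize,
    writes_completed: usize,
    /// Set once the source has been read to EOF; holds the final file size.
    size_on_eof: Option<u64>,
}

impl WriteProgress {
    /// True once reading the source has reached EOF, so no further writes
    /// will be created for the current stage.
    fn source_fully_read(&self) -> bool {
        self.size_on_eof.is_some()
    }

    /// True when every write the current stage needs has been both scheduled
    /// and completed. Having no writes in flight is not enough while the
    /// source is still being read.
    fn required_writes_done(&self) -> bool {
        self.writes_started == self.writes_completed && self.source_fully_read()
    }
}
```

A freshly created state has no writes in flight but is not "done", because the source has not yet reached EOF.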

kskalski force-pushed the ks/dio branch 8 times, most recently from 555216b to 195a001 Compare January 15, 2026 02:14

kskalski force-pushed the ks/dio branch from 195a001 to 9fb3756 Compare January 20, 2026 09:05

kskalski force-pushed the ks/dio branch 7 times, most recently from f4efe4a to 1f14346 Compare January 21, 2026 07:21

kskalski changed the title from "Support direct IO in file creator" to "feat(fs): support direct IO in file creator" Jan 21, 2026

kskalski marked this pull request as ready for review January 21, 2026 09:00

kskalski requested review from alessandrod, brooksprumo and cpubot January 21, 2026 09:00

kskalski mentioned this pull request Jan 21, 2026

Add support for direct io in SequentialFileReader #9395

Merged

kskalski mentioned this pull request Jan 22, 2026

feat(io_uring): generic access to context and push for Ring and Completion #10071

Merged

kskalski requested a review from vadorovsky February 9, 2026 12:15

vadorovsky reviewed Feb 20, 2026

kskalski force-pushed the ks/dio branch from 73c07f4 to 8854731 Compare February 20, 2026 10:11

kskalski force-pushed the ks/dio branch from 8854731 to dc0e687 Compare February 21, 2026 07:26

vadorovsky previously approved these changes Feb 21, 2026

kskalski added 2 commits February 21, 2026 17:29

Support dio in file creator

ab2d220

Add assert for write_capacity alignment

7b47501

kskalski dismissed vadorovsky’s stale review via 7b47501 February 21, 2026 09:29

kskalski force-pushed the ks/dio branch from dc0e687 to 7b47501 Compare February 21, 2026 09:29

alessandrod requested a review from Copilot February 23, 2026 00:36

Copilot started reviewing on behalf of alessandrod February 23, 2026 00:36 View session

Copilot AI reviewed Feb 23, 2026

Fix fcntl err check. Don't unwrap res. Fix typos.

b5620ac

alessandrod reviewed Feb 23, 2026

kskalski added 3 commits February 23, 2026 10:50

Update names

0f8df29

Extract source_fully_read helper to clarify writes done condition.

425644c

Lower DIRECT_IO_WRITE_LEN_ALIGNMENT to 512

7776cdb

alessandrod approved these changes Feb 23, 2026

kskalski added this pull request to the merge queue Feb 23, 2026

Merged via the queue into anza-xyz:master with commit 1ff9663 Feb 23, 2026
51 checks passed

kskalski deleted the ks/dio branch February 23, 2026 04:41

kskalski mentioned this pull request Mar 16, 2026

fix(fs): align write size to 4096 to support all NVMEs #11335

Merged

mergify Bot mentioned this pull request Mar 20, 2026

v4.0: fix(fs): align write size to 4096 to support all NVMEs (backport of #11335) #11424

Merged
