feat(fs): support direct IO in file creator#9856
Conversation
555216b to
195a001
Compare
Codecov Report❌ Patch coverage is Additional details and impacted files@@ Coverage Diff @@
## master #9856 +/- ##
==========================================
- Coverage 83.0% 81.8% -1.3%
==========================================
Files 849 847 -2
Lines 318240 307543 -10697
==========================================
- Hits 264335 251726 -12609
- Misses 53905 55817 +1912 🚀 New features to boost your workflow:
|
f4efe4a to
1f14346
Compare
|
so I thought about the "what do we do when we have non aligned writes" problem when working on #10105 |
That's an interesting idea, this might simplify some bits of this PR, since aligned (DIO) and unaligned writes could follow independent paths and the regular open could be done before waiting for direct io path to finish... I'm not sure though if the benefit would justify the re-implementation:
Both opens would be done by io-uring as opposed to one open + fcntl from user-space, but the latter is fast and in general we don't save kernel any work (actually might be that open will take kernel more work) If any reviewer sees current approach, which I think works pretty well, as problematic, I could give the above idea a shot. For now IMHO the ideas better to pursue:
BTW, the typical way to approach the problem of unaligned last write is to write past the file end (simply the whole buffer up to alignment offset) and then truncate. Some issues with that:
|
vadorovsky
left a comment
There was a problem hiding this comment.
The direct IO part looks good to me.
I was about to write a nitpicky comment, that I would prefer a trait over an enum, but then I've seen #10071, which I think could go in first.
|
Ok, sounds good, I rebased on top of #10071 |
There was a problem hiding this comment.
Pull request overview
This PR adds support for direct IO (O_DIRECT) in the file creator to improve write performance by bypassing kernel buffer caches. The implementation handles the complexity of O_DIRECT requirements (alignment constraints) by performing aligned writes with O_DIRECT enabled, then switching to normal IO mode for any non-aligned data at the end of files.
Changes:
- Added
write_with_direct_io(bool)configuration method toIoUringFileCreatorBuilder(test-only) - Implemented aligned write logic that truncates non-aligned EOF data and schedules it for later
- Added mechanism to disable O_DIRECT via fcntl before writing non-aligned EOF data
- Added comprehensive test coverage for various file and buffer size combinations
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| #[cfg(test)] | ||
| pub fn write_with_direct_io(mut self, enable_direct_io: bool) -> Self { |
There was a problem hiding this comment.
The PR description states that "ideally direct-IO mode should be made available, but configurable", but the write_with_direct_io method is marked with #[cfg(test)], limiting it to test code only. If the intention is to make this feature available for production use (as the PR description suggests), the #[cfg(test)] attribute should be removed. Otherwise, there's a discrepancy between the PR description and the implementation.
There was a problem hiding this comment.
This PR only adds support for the DIO mode, it's disabled by default and not used in production. The change to enable it in actual clients is in #10507
There was a problem hiding this comment.
but it's still a public item right? you shouldn't get a warning what am I missing
There was a problem hiding this comment.
well, it's public in io_uring module, but we currently do not widely expose that interface - it's hidden in top level selective exports in file_io
|
|
||
| // This is conservative write size alignment for use with direct IO, some block devices may have | ||
| // relaxed requirements, but detecting it is not trivial. | ||
| const DIRECT_IO_WRITE_LEN_ALIGNMENT: IoSize = 4096; |
There was a problem hiding this comment.
I'm a bit confused by this. Isn't the default 512 pretty much everywhere?
Also can't it be queried with statx?
There was a problem hiding this comment.
We had a bit of discussion for that in #9395 (comment), the current thinking about that is:
- yes, probably
STATX_DIOALIGNwould solve that, but it's supported from 6.1, which is above kernel version we want to support right now - should revisit it in the future - currently when you use statx and check block size, you get a filesystem block size, which is 4096, not the underlying block device block size
- 4096 is a conservative value use per https://man7.org/linux/man-pages/man2/open.2.html given how it vaguely mentioned "typically", "most", etc.
most filesystems based on block devices require that the file
offset and the length and memory address of all I/O segments be
multiples of the filesystem block size (typically 4096 bytes). In
Linux 2.6.0, this was relaxed to the logical block size of the
block device (typically 512 bytes).
- clearly querying for it at runtime brings a bit of complexity in the code and (I guess tiny) perf impact, let's revisit it in the future, especially when we can use
STATX_DIOALIGN
There was a problem hiding this comment.
In Linux 2.6.0, this was relaxed to the logical block size of the
block device (typically 512 bytes).
you missed this part? that was nearly 20 years ago :D I've checked the source and virtually all fs use 512
clearly querying for it at runtime brings a bit of complexity in the code and (I guess tiny) perf impact, let's revisit it in the future, especially when we can use STATX_DIOALIGN
you really only need to query once, all fs drivers proxy to the block device, so this is really about does the block device do 512 or 4k blocks
There was a problem hiding this comment.
ok, I updated the constant (for encrypted fs: the tests still work fine with this value on my laptop's encrypted fs)
| #[cfg(test)] | ||
| pub fn write_with_direct_io(mut self, enable_direct_io: bool) -> Self { |
| /// Note: this returns `true` if current stage writes are done, there might still be | ||
| /// last write to be scheduled using `non_dio_eof_write` | ||
| fn required_writes_done(&self) -> bool { | ||
| self.writes_started == self.writes_completed && self.size_on_eof.is_some() |
There was a problem hiding this comment.
why the size_on_eof condition here? seems confusing nothing in the method name
suggests it
also maybe pending_writes_done? unclear what a "required" write is
There was a problem hiding this comment.
size_on_eof.is_some() is the condition verifying that we finished reading data from the source in write_and_close (basically we can't say that all writes were done until we reach eof reading input, which may be happening concurrently to write ops).
"required" was actually meant to convey that meaning, since we not only wait for any already scheduled / pending writes, but all writes that might still need to be created
There was a problem hiding this comment.
factored the size_on_eof.is_some() as a helper function for added doc / readability - still not sure if there is better name than "required":
- this function works for both "stages" of writing:
- all the aligned writes before switching to non-dio
- after we turn off dio and possibly do the final write
In each stage there are required writes to be made before the stage ends and they are not always already scheduled (e.g. while we are still reading the source).
Problem
Writing data to disk is faster with direct-IO, since it avoids kernel allocating and populating buffer caches. Lack of caching could be a downside if the written data could be fit into free memory and is destined to be read back shortly after. That is however dependent on use case, so ideally direct-IO mode should be made available, but configurable.
Summary of Changes
write_with_direct_io(bool)function toIoUringFileCreatorBuilderPerformance numbers
Compared unpacking accounts storages with agave-ledger-tool and validator stopped:
There is visible impact of not having accounts data buffered (~9s slowdown on index generation), however the speedup for unpacking is significantly larger (~25s speedup)