Use io_uring for creating files when unpacking snapshot#6671
kskalski merged 55 commits into anza-xyz:master
Conversation
Force-pushed d820a2e to c99189e
brooksprumo
left a comment
Can you please fix CI and wait for the PR to make it through the 'coverage' step before requesting a review?
Also, can you add perf numbers for with and without this change?
Codecov Report
Attention: Patch coverage is
Additional details and impacted files:
```
@@ Coverage Diff @@
##           master   #6671   +/-  ##
=========================================
- Coverage    83.2%   83.2%   -0.1%
=========================================
  Files         852     853      +1
  Lines      373763  374060    +297
=========================================
+ Hits       311290  311492    +202
- Misses      62473   62568     +95
```
I'm trying to square these two things: This one, from the problem statement:
And this one from the solution:
Why go through the trouble of io_uring-ifying if performance is not changed? Is the benefit that we only use a single thread instead of all the unpacker threads? If yes, I'd argue that this PR greatly improves performance then :)
brooksprumo
left a comment
This PR is very large. Can it be split up? At a minimum, I need some additional guidance on where to start. Also, a high-level overview of the design would be much appreciated (I have read the Summary of Changes, so I'm looking for more detail, please).
If you mean performance as amount and shape of resources used, then there is an improvement, though it's a bit hard to reason about. The change is:
I think the true statement is that we save some CPU on doing syscalls, since we use a more efficient API to the kernel, which batches syscalls into an occasional io_uring queue sync. There are a couple of CPU-saving optimizations too (less copying of buffers), and the kernel's work is a bit more efficient (use of fixed buffers and file descriptors). From this point of view we save something like a fraction of one core's CPU work. From the point of view of start-up time, this is neutral though. Depending on hardware (disks), e.g. when we do hit the zstd decompression bottleneck, this should in theory be faster; the queuing model is also simpler.
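To illustrate the syscall-batching idea, here is a minimal stand-in using stdlib vectored IO rather than the PR's actual io_uring code (`write_batched` is a made-up helper for this sketch): one call into the kernel covers several buffers instead of one `write` per buffer.

```rust
use std::io::{IoSlice, Write};

// Submit a whole batch of buffers with a single write call, instead of
// one syscall per buffer. A real implementation must loop on partial
// writes; the in-memory writer used below always consumes everything.
fn write_batched<W: Write>(out: &mut W, chunks: &[&[u8]]) -> std::io::Result<usize> {
    let slices: Vec<IoSlice<'_>> = chunks.iter().map(|c| IoSlice::new(c)).collect();
    out.write_vectored(&slices)
}

fn main() {
    let mut sink: Vec<u8> = Vec::new();
    let n = write_batched(&mut sink, &[&b"aa"[..], &b"bbb"[..], &b"c"[..]]).unwrap();
    assert_eq!(n, 6);
    assert_eq!(sink, b"aabbbc");
}
```

io_uring goes further than this: many queued operations (opens, writes, closes) are synced with the kernel in one `io_uring_enter`, which is where the fraction-of-a-core saving comes from.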
I guess it should be easy to split the new file-creator trait and implementations from their actual use for untarring and the removal of the chunker.
Please no. PRs (and commits) are supposed to be atomic: merging code that isn't used makes no sense - how can you possibly review that it's correct if you don't see how it's used? What matters is making atomic things, not minimizing diffs [man-standing.jpeg]
I agree. Wasn't looking for the PR to be broken up into horizontal slices. Sometimes PRs contain multiple vertical slices that can be broken up into separate atomic PRs. If that's not the case for this PR, that's OK too.
Thanks!
Force-pushed adc5203 to d1b38d7
Force-pushed c8dda78 to 4dd2af4
alessandrod
left a comment
thanks! generally looks good, left a few comments
```rust
    cursor: Cursor::new(buf.sub_buf_to(total_read_len)),
    io_buf_index: *io_buf_index,
};
reader_state.buffers[*reader_buf_index] =
```
same here, you can use the existing buf and advance the size field in place
> advance the size field in place
Changed sub_buf_to to consume self, so it now basically does the size shortening in place.
In the future PR I'm getting rid of cursor and sub_buf_to completely just preserving the buffer (https://github.com/anza-xyz/agave/pull/6878/files#diff-c34f1749c5990606c2430a4289eee1207e1f227f9824b85dd9371d15381da2d8R451-R453), since for reading multiple files I need to re-use the full buffer instead of permanently shortening it.
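The consuming `sub_buf_to` described above can be sketched roughly like this (a minimal stand-in; the `IoBuf` struct and its fields are assumptions, not the PR's actual code). Because the method takes `self` by value, the shortened buffer *is* the old buffer: no copy is made and no second handle to the memory stays alive.

```rust
// Hypothetical buffer wrapper illustrating a consuming `sub_buf_to`.
struct IoBuf {
    data: Vec<u8>,
    len: usize, // length of the valid prefix
}

impl IoBuf {
    fn new(data: Vec<u8>) -> Self {
        let len = data.len();
        Self { data, len }
    }

    // Consumes self and shortens the valid length in place.
    fn sub_buf_to(mut self, new_len: usize) -> Self {
        assert!(new_len <= self.len);
        self.len = new_len;
        self
    }

    fn as_slice(&self) -> &[u8] {
        &self.data[..self.len]
    }
}

fn main() {
    let buf = IoBuf::new(vec![1, 2, 3, 4, 5]);
    let short = buf.sub_buf_to(3); // `buf` is moved and cannot be reused
    assert_eq!(short.as_slice(), &[1, 2, 3]);
}
```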
alessandrod
left a comment
we've discussed some on slack, and here's some more comments.
Once you change the register/mlock stuff I'll do a final pass on the io-uring code
Force-pushed baa10c4 to b6ed59c
Some changes that came up when making memlock a requirement:
Force-pushed 64188a2 to 9515fde
alessandrod
left a comment
ok, another pass. I haven't done the io-uring file creator yet; I'll do it now
```rust
    .read(true)
    .custom_flags(libc::O_NOATIME)
    .open(path)?;
let buffers = IoFixedBuffer::split_buffer_chunks(buffer, read_capacity)
```
split_buffer_chunks/register_buffer are pretty ugly imo.
I would do something like

```rust
let buffers = IoFixedBuffer::split_buffer_chunks(buffer, read_capacity);
IoFixedBuffer::register(buffers, ring);
```
Do you mean we should register each buffer as separate fixed buffer with its own index in the kernel or just to rename stuff a bit? I'm changing the names as you suggested, but I guess it's better to register a whole (original) buffer.
hm no that's not what I meant. Registering one buffer is faster.
My suggestion was wrong tho - I missed that we were chunking buffers by write_capacity, but then obviously the chunks registered as fixed buffers in io-uring are chunked by something different (FIXED_BUFFER_LEN).
I'll think about it, the API still looks pretty ugly to me.
```rust
}

/// Split buffer into `chunk_size` sized `IoFixedBuffer` buffers for use as registered
/// buffer in io_uring operations.
pub fn split_buffer_chunks<'a>(
```
I don't think that this code should be here, because it makes it impossible to
track the lifetime of the memory.
I would move this chunking/registering to what actually owns the buffer, so it's clear from there that even though we're downgrading to pointers, they won't be dangling
This function is actually just a factory of IoFixedBuffer items and the caller, which is the owner of the input buffer too, manages the result - the code here doesn't leak the unsafe pointers anywhere else.
Also, the code was moved here because it's shared between file creator and file sequential reader.
> This function is actually just a factory of IoFixedBuffer items and the caller, which is the owner of the input buffer too, manages the result
This is exactly the problem tho: the code returns values that embed pointers, but it's up to the caller to guarantee that the backing memory for those pointers remains valid.
Or in other words: this is a safe API, that can be misused to trigger use after free. Safe APIs should never allow use after free.
I suppose for that reason it will be best to mark those functions as unsafe. I prefer to keep the constructor with the struct being constructed (also, copy-pasting unsafe code to each use site seems quite counter-productive), but indeed this operation is unsafe.
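The resolution discussed here, marking the factory `unsafe` while keeping it next to the struct, might look roughly like the sketch below (field layout and the exact signature are assumptions; the real code returns registered-buffer handles for io_uring, which this stand-in omits):

```rust
// Hypothetical sketch: an unsafe factory that splits a caller-owned buffer
// into raw-pointer-backed chunks. The `unsafe` keyword shifts the lifetime
// obligation onto the caller, which is exactly the point of the thread.
struct IoFixedBuffer {
    ptr: *mut u8,
    len: usize,
}

impl IoFixedBuffer {
    /// # Safety
    /// `buffer` must outlive every returned `IoFixedBuffer` and must not be
    /// moved or reallocated while any chunk is in use.
    unsafe fn split_buffer_chunks(buffer: &mut [u8], chunk_size: usize) -> Vec<IoFixedBuffer> {
        buffer
            .chunks_exact_mut(chunk_size)
            .map(|c| IoFixedBuffer { ptr: c.as_mut_ptr(), len: c.len() })
            .collect()
    }
}

fn main() {
    let mut backing = vec![0u8; 4096];
    // SAFETY: `backing` lives until the end of main and is never reallocated.
    let chunks = unsafe { IoFixedBuffer::split_buffer_chunks(&mut backing, 1024) };
    assert_eq!(chunks.len(), 4);
    assert_eq!(chunks[0].len, 1024);
    let _ = chunks[0].ptr; // pointers remain valid while `backing` is alive
}
```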
```rust
}

/// Register provided buffer as fixed buffer in `io_uring`.
pub fn register_buffer<S, E: RingOp<S>>(
```
same here, move to what owns the buffer?
```rust
    },
};

const DEFAULT_WRITE_SIZE: usize = 1024 * 1024;
```
Added some comments; there is a dd experimental truth and a theoretical truth, and they don't match...
```rust
struct PendingFile {
    path: PathBuf,
    completed_open: bool,
    backlog: SmallVec<[PendingWrite; 8]>,
```
I remember writing this backlog thing, but I don't remember where 8 comes from?
Have you measured?
I looked at the sizes of files in the accounts directory; most of them fall within 5-6 MB, so 8 * 1 MB of write capacity will cover most cases without an alloc
99.9% of files are <8000000 bytes
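The sizing logic behind the inline capacity of 8 can be stated as plain arithmetic (constants mirror this thread; the SmallVec itself is replaced by a simple predicate in this sketch): with a 1 MiB write size, a file needs ceil(file_len / 1 MiB) backlog entries, so files up to ~8 MB never spill to the heap.

```rust
// Hypothetical helper names; only the constants come from the discussion.
const WRITE_SIZE: usize = 1024 * 1024; // 1 MiB per queued write
const INLINE_SLOTS: usize = 8;         // SmallVec inline capacity

// Number of backlog entries a file of this length occupies.
fn backlog_entries(file_len: usize) -> usize {
    file_len.div_ceil(WRITE_SIZE)
}

// Whether the backlog would need a heap allocation for this file.
fn spills_to_heap(file_len: usize) -> bool {
    backlog_entries(file_len) > INLINE_SLOTS
}

fn main() {
    assert!(!spills_to_heap(6 * 1024 * 1024)); // typical 5-6 MB account file
    assert!(!spills_to_heap(8_000_000));       // the 99.9th-percentile size
    assert!(spills_to_heap(9 * 1024 * 1024));  // rare large file allocates
}
```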
Force-pushed c2dac3c to 9445866
Got a small merge conflict in imports; the CI is green again. @alessandrod @brooksprumo, whenever one of you gets to it, please re-approve
This seems to have broken v2.2 -> master upgrade compatibility. Is that expected? I'm getting these errors
```markdown
### Validator

#### Breaking
* Require increased `memlock` limits - recommended setting is `LimitMEMLOCK=2000000000` in systemd service configuration. Lack of sufficient limit (on Linux) will cause startup error.
```
Yes, you must set the memlock limit now. I tagged you on the change to the changelog: #6671 (comment) Here's the message on discord for posterity: https://discord.com/channels/428295358100013066/439194979856809985/1398240774315053056
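For reference, the `LimitMEMLOCK=2000000000` value from the changelog can be applied with a systemd drop-in. The file path and unit name below are examples; match them to your validator's actual service unit:

```ini
# e.g. /etc/systemd/system/sol.service.d/memlock.conf (unit name is an example)
[Service]
LimitMEMLOCK=2000000000
```

After adding the drop-in, run `systemctl daemon-reload` and restart the service; `ulimit -l` inside the service environment should then report the raised limit.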
Thanks!
Problem
Unpacking the snapshot uses the `tar` crate's `unpack` for each entry, which performs sync IO and copies data into an intermediate buffer before writing. This blocks and spends a lot of CPU time on syscalls.
Summary of Changes
* Introduce `IoUringFileCreator` (plus a compatibility trait for non-Linux platforms) and use it for creating files while unpacking the snapshot.
* Remove `ArchiveChunker` and perform the whole unpacking in a single thread: all IO is done in background kernel threads (with io_uring), and unless we run out of disk write bandwidth this thread will spend its time on decompression.
* Rename `entry_processor` into `file_path_processor` and only execute it for files (such that the `is_file()` call can be avoided in the only non-trivial call site that filters for files).
This change is more or less performance neutral: untar times depend heavily on the achieved disk read/write throughput (and data layout on the attached disks).
Observed timings:
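As a rough illustration of the compatibility-trait split mentioned in the summary, a minimal shape might look like the sketch below. Every name except `IoUringFileCreator` is an assumption, and the portable fallback shown here simply blocks; the real Linux implementation queues operations on an io_uring instead.

```rust
use std::{fs, io, path::Path};

// Hypothetical trait: file creation that may complete asynchronously,
// with completion only guaranteed after `flush`.
trait FileCreator {
    fn create_file(&mut self, path: &Path, contents: &[u8]) -> io::Result<()>;
    fn flush(&mut self) -> io::Result<()>;
}

// Portable fallback for non-Linux platforms: plain blocking writes.
struct BlockingFileCreator;

impl FileCreator for BlockingFileCreator {
    fn create_file(&mut self, path: &Path, contents: &[u8]) -> io::Result<()> {
        fs::write(path, contents)
    }
    fn flush(&mut self) -> io::Result<()> {
        Ok(()) // nothing queued; every write already completed synchronously
    }
}

fn main() -> io::Result<()> {
    let dir = std::env::temp_dir().join("file_creator_demo");
    fs::create_dir_all(&dir)?;
    let mut creator = BlockingFileCreator;
    creator.create_file(&dir.join("account.bin"), b"data")?;
    creator.flush()?;
    assert_eq!(fs::read(dir.join("account.bin"))?, b"data");
    Ok(())
}
```

The unpacker can then be written against `FileCreator` alone, selecting the io_uring-backed implementation on Linux and the blocking one elsewhere at construction time.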