Retry IO on short write in io_uring file creator #8053
Conversation
I also found some panics from a few days ago in JitoLabs:
Codecov Report

```
@@            Coverage Diff             @@
##           master    #8053      +/-   ##
==========================================
- Coverage    82.9%    82.9%     -0.1%
==========================================
  Files         823      823
  Lines      360428   360447      +19
==========================================
- Hits       299071   298998      -73
- Misses      61357    61449      +92
```
Pls don't merge this yet, give me a chance to look at the original issue.
So far tested only on a regular devbox, which never experienced such errors, so for now I treat this PR as a reference for anyone to test if they see them.
Solana seems to be stuck for >10 mins during startup on this line when using this patch on top of 3.0.2. Without the patch (3.0.2 release) the error comes up immediately. Going back to 2.3.6, we can see the next log message comes up within 14 seconds.
If it's stuck, it probably hits this condition repeatedly:
which is unfortunate, as it looks like the kernel requires the IO to be submitted with the ASYNC flag. It might be somehow related to the fixed (registered file-descriptor) writes we are using; it is actually because of the weird scheduling of fixed writes into non-blocking workers that we are not marking all IOs as ASYNC. Maybe we can selectively mark as ASYNC when encountering EAGAIN.
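For illustration, marking a submission as ASYNC with the `io-uring` crate looks roughly like this (a sketch, not the actual file-creator code; `fd`, `buf`, and `offset` stand in for whatever the failing write used):

```rust
use io_uring::{opcode, squeue::Flags, types, IoUring};
use std::io::{Error, ErrorKind};

// Sketch: resubmit a write with IOSQE_ASYNC set, so the kernel punts it
// straight to a blocking worker instead of making a non-blocking attempt
// that can complete with EAGAIN.
fn resubmit_async(ring: &mut IoUring, fd: i32, buf: &[u8], offset: u64) -> std::io::Result<()> {
    let entry = opcode::Write::new(types::Fd(fd), buf.as_ptr(), buf.len() as u32)
        .offset(offset)
        .build()
        .flags(Flags::ASYNC); // IOSQE_ASYNC
    // Safety: `buf` must stay alive until the corresponding CQE is reaped.
    unsafe { ring.submission().push(&entry) }
        .map_err(|_| Error::new(ErrorKind::Other, "submission queue full"))?;
    ring.submit()?;
    Ok(())
}
```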
Ok, confirmed on triton's node (5.15 kernel) that adding the async flag makes unpacking proceed at a more or less normal pace and succeed (I didn't check, though, whether the issue happens for all or only some write ops).
Also confirmed that
Can you please add a test for this? Something that does whatever setup is needed to trigger the error.
We need to do the smallest possible change that fixes whatever we've broken for people (I'm assuming 3.0?).
There was a bit of discussion about that on Slack and the suggestion was to do a one-off fix without support for it in the test automation - the problem is that this only happens with a zram disk, which is not something normally available in the test environment. Unless we get a repro with vanilla disks, I'm not sure how to test it better.
Agree, I believe this PR is the least-resistance change, i.e. it fixes the issue without affecting runtime / performance of the baseline, while fixing it in a different way could bring some unknowns.
How sure are we about this? Is it "only people with zram have complained" or "we're 100% positive it's only zram"?
This is why I'm not comfortable with this patch. I suspect that we're doing all submissions twice when we hit this bug: non-async => EAGAIN => async. Also I don't think this is a case of short writes. Short write = written < requested_write, but here we're writing 0 and getting EAGAIN. The patch is kinda confusing since we're restoring the short-write offset code (which we should implement, to be clear), but in this case we always hit WouldBlock, so written=0 and we never update offsets. I think we need to add a test for zram, RCA, and make sure that either everyone who's hitting this bug is doing so because of zram, or RCA further.
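To make the distinction concrete, here's a sketch of how the two cases differ at the completion queue (assuming the `io-uring` and `libc` crates; the enum and names are illustrative, not the PR's actual code):

```rust
/// Illustrative classification of a write completion; `res` is
/// `cqe.result()`: bytes written on success, negative errno on error.
enum WriteOutcome {
    /// Everything requested was written.
    Done,
    /// EAGAIN / WouldBlock: zero bytes written, retry the same write.
    Resubmit,
    /// Short write: this many bytes landed; advance the offset and
    /// resubmit the remaining payload.
    Partial(usize),
    /// Any other negative result, e.g. -ENOSPC.
    Fatal(std::io::Error),
}

fn classify(res: i32, requested: usize) -> WriteOutcome {
    if res == -libc::EAGAIN {
        WriteOutcome::Resubmit
    } else if res < 0 {
        WriteOutcome::Fatal(std::io::Error::from_raw_os_error(-res))
    } else if (res as usize) < requested {
        WriteOutcome::Partial(res as usize)
    } else {
        WriteOutcome::Done
    }
}
```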
Short writes / EAGAIN do happen in the wild, though not often - e.g. I've seen a panic from JitoLabs from a few weeks ago, and now I also see one for 335NaZ18GDW4rEmcoTb3Fae5CwKyqY8iVp6BtPAWc8A7 (using the 3.1 release) from Sep-17.
Yes, this is exactly what happens.
True, this patch fixes both short writes and WouldBlock in one go (both cases were actually reported from a user with zram); maybe it would be less confusing if
Well, the test itself is quite simple, though technically it would only test that the fix is a proper one for zram.
Ok I've gone pretty deep on this, here are my findings so far:
So I think we must always handle EAGAIN in our code. And we must make sure we don't loop indefinitely, always retrying the same write - for example if we run out of memory (tmpfs) or disk (although maybe we get ENOSPC in that case?). I haven't looked at what's happening with zram specifically yet, I'll do that next. EDIT: zram doesn't seem to behave differently from ext4.
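A sketch of the kind of guard that would avoid an indefinite loop - bail when the same write repeatedly makes zero progress (the cap, the struct, and all names here are made up for illustration):

```rust
use std::io::{Error, ErrorKind};

/// Illustrative cap on consecutive zero-progress retries.
const MAX_NO_PROGRESS_RETRIES: u32 = 100;

/// Hypothetical per-write retry state.
struct PendingWrite {
    offset: u64,
    remaining: usize,
    no_progress_retries: u32,
}

impl PendingWrite {
    /// `written` is 0 for EAGAIN, or the byte count of a short write
    /// (the kernel never reports more than was requested).
    fn record_retry(&mut self, written: usize) -> std::io::Result<()> {
        if written == 0 {
            self.no_progress_retries += 1;
            if self.no_progress_retries > MAX_NO_PROGRESS_RETRIES {
                return Err(Error::new(
                    ErrorKind::WriteZero,
                    "write made no progress after repeated retries",
                ));
            }
        } else {
            // Progress was made: advance the window and reset the counter.
            self.offset += written as u64;
            self.remaining -= written;
            self.no_progress_retries = 0;
        }
        Ok(())
    }
}
```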
I am now skeptical that ASYNC is fixing any issues here?
Thanks for the in-depth research.
I think this is the culprit of the problem - my guess is 5.15 had a bug in this path. Do you know a specific point in the code where this happens? If we really want to get to the root cause, we should compare how it was in 5.15. On the other hand, it's just a couple more months that we need to support that kernel - not knowing what exactly is happening will probably keep this code here indefinitely. :/
FWIW I tested enabling sqpoll on the machine with the reproduction, and it didn't fix the issue; it wasn't IOPOLL / O_DIRECT though, so I guess it wouldn't apply.
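(For reference, with the `io-uring` crate that toggle is roughly the following - a sketch; the idle timeout and queue depth are arbitrary illustrative values:)

```rust
use io_uring::IoUring;

fn main() -> std::io::Result<()> {
    // Sketch: build a ring with kernel-side submission polling (SQPOLL).
    let _ring = IoUring::builder()
        .setup_sqpoll(2000) // SQ-poll kernel thread, idles after 2000 ms
        .build(256)?; // queue depth is arbitrary here
    Ok(())
}
```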
Ok, I added the check that we only retry
Hm, not sure, my bet is 5.15 had some serious bug. Also, from the report it seems the ZFS version might matter too; there is a comment. Irrespective of that - this PR is kind of implementing the standard IO API, i.e. when you see a short write, retry with the remaining payload; when you see EAGAIN, resubmit.
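That convention, sketched in plain blocking IO (this mirrors the policy described here, not the actual PR code; note that std's write_all handles short writes the same way but only retries ErrorKind::Interrupted, not WouldBlock):

```rust
use std::io::{self, ErrorKind, Write};

/// Sketch of the conventional loop: short write => retry with the
/// remaining payload; WouldBlock => retry the same write; Ok(0) =>
/// hard error.
fn write_all_retrying<W: Write>(w: &mut W, mut buf: &[u8]) -> io::Result<()> {
    while !buf.is_empty() {
        match w.write(buf) {
            Ok(0) => return Err(io::Error::new(ErrorKind::WriteZero, "wrote 0 bytes")),
            Ok(n) => buf = &buf[n..],
            Err(e) if e.kind() == ErrorKind::WouldBlock => continue,
            Err(e) => return Err(e),
        }
    }
    Ok(())
}
```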
This is not how it works though. EAGAIN has nothing to do with ASYNC. I don't think we should be submitting with ASYNC in the case of a short write. I think we should watch the offset / written amount, resubmit, and if we detect no progress, bail.
Created #8161 for removing. I will try to add a test case for disk-full in a separate PR, since I think it will contain a lot of set-up code / changes that should be separate (for the purpose of more confident back-porting).
Backports to the beta branch are to be avoided unless absolutely necessary for fixing bugs, security issues, and perf regressions. Changes intended for backport should be structured such that a minimum effective diff can be committed separately from any refactoring, plumbing, cleanup, etc that are not strictly necessary to achieve the goal. Any of the latter should go only into master and ride the normal stabilization schedule. Exceptions include CI/metrics changes, CLI improvements and documentation updates on a case by case basis.
* Support short writes in io_uring file creator
* Handle resource busy err as short write. Update comment. Fix calculating total written for stats
* Remove handling of busy error
* Add warn for short write

(cherry picked from commit a10a2c8)

Problem
As reported in #8036 (comment), Ubuntu 22 with a 5.15 kernel and a ZFS filesystem experiences errors due to short writes.
It's not completely clear which kernel / FS combinations give a 100% guarantee of not returning `EAGAIN` or a short write; this comment axboe/liburing#766 (comment) suggests 5.15 fixed some issues, but given reports from the wild it's worth fixing.

Summary of Changes

Re-submit the write when:

* `EAGAIN` (aka `WouldBlock`) is returned
* a short write happens (unless `Ok` with a 0-size write happens, which is treated as a hard error)

Fixes #8036 (it's not clear where the resource busy errors come from, but write completion is the most likely place)