Retry IO on short write in io_uring file creator #8053

Merged
kskalski merged 4 commits into anza-xyz:master from kskalski:ks/short_write on Sep 24, 2025

Conversation

@kskalski

@kskalski kskalski commented Sep 15, 2025

Problem

As reported in #8036 (comment), Ubuntu 22 with a 5.15 kernel and a ZFS filesystem experiences errors due to short writes.
It's not completely clear which kernel / FS combinations give a 100% guarantee of not returning EAGAIN or a short write; this comment axboe/liburing#766 (comment) suggests 5.15 fixed some issues, but given the reports from the wild it's worth fixing.

Summary of Changes

Re-submit write when:

  • getting a "resource temporarily unavailable" error (EAGAIN aka WouldBlock)
  • getting a short write (unless an Ok with a 0-size write happens, which is treated as a hard error)

Fixes #8036 (it's not clear where the resource busy errors come from, but write completion is the most likely place)
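
For illustration only, the retry decision described above could look roughly like the sketch below (hypothetical names, not the PR's exact code), classifying each io_uring write completion:

    use std::io;

    // Outcome of inspecting one write completion.
    enum WriteOutcome {
        Done,                     // the whole requested range was written
        Retry { written: usize }, // resubmit the remaining range
    }

    fn classify(result: io::Result<usize>, requested: usize) -> io::Result<WriteOutcome> {
        match result {
            // An Ok with a 0-size write made no progress; treat it as a hard error.
            Ok(0) => Err(io::Error::new(io::ErrorKind::WriteZero, "0-size write")),
            // Short write: some bytes landed, resubmit the rest.
            Ok(n) if n < requested => Ok(WriteOutcome::Retry { written: n }),
            Ok(_) => Ok(WriteOutcome::Done),
            // EAGAIN aka WouldBlock: nothing was written, resubmit from the start.
            Err(e) if e.kind() == io::ErrorKind::WouldBlock => Ok(WriteOutcome::Retry { written: 0 }),
            Err(e) => Err(e),
        }
    }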

@kskalski

kskalski commented Sep 15, 2025

I also found some panics from a few days ago in JitoLabs:

ELwVbNrN4q5UhwvVtz94jBxGZULsYQ4bm7XbK8mSDJSV 3.0.1 (src:8a176bcf; feat:128318206, client:JitoLabs)
panicked at accounts-db/src/io_uring/file_creator.rs:475:9:
assertion `left == right` failed: short write
  left: 109056
 right: 524288

@codecov-commenter

codecov-commenter commented Sep 15, 2025

Codecov Report

❌ Patch coverage is 52.00000% with 12 lines in your changes missing coverage. Please review.
✅ Project coverage is 82.9%. Comparing base (8b52ec8) to head (709e257).
⚠️ Report is 62 commits behind head on master.

Additional details and impacted files
@@            Coverage Diff            @@
##           master    #8053     +/-   ##
=========================================
- Coverage    82.9%    82.9%   -0.1%     
=========================================
  Files         823      823             
  Lines      360428   360447     +19     
=========================================
- Hits       299071   298998     -73     
- Misses      61357    61449     +92     

@alessandrod alessandrod self-requested a review September 15, 2025 12:40
@kskalski kskalski marked this pull request as ready for review September 15, 2025 12:46

@brooksprumo brooksprumo left a comment


How was this change tested?

@alessandrod

Pls don't merge this yet, give me a chance to look at the original issue.

@kskalski

How was this change tested?

So far only on a regular devbox, which never experienced such errors, so for now I treat this PR as a reference for anyone who sees them to test.

@Lusitaniae

Lusitaniae commented Sep 16, 2025

Solana seems to be stuck for >10 minutes during startup on this line when using this patch on top of 3.0.2:

incremental-snapshot-367177550-367226598-5ECRCsPkt8C1DfTP3Qsm66wKof1wtwzZ34vSiQUDenj4.tar.zst

Without the patch (3.0.2 release) the error comes up immediately

Going back to 2.3.6, we can see the next log message comes up within 14 seconds:

Sep 16 15:31:22 tyo162 solana-rpc.sh[3772]: [2025-09-16T15:31:22.412669606Z INFO  solana_runtime::snapshot_bank_utils] Loading bank from full snapshot archive: /solana/snapshots/snapshot-367177550-3GAd3xdqzzvvqzrMNaJf8eQwGSDpLm3a8EK49GNhp5jr.tar.zst, and incremental snapshot archive: Some("/solana/snapshots/incremental-snapshot-367177550-367226598-5ECRCsPkt8C1DfTP3Qsm66wKof1wtwzZ34vSiQUDenj4.tar.zst")
Sep 16 15:31:36 tyo162 solana-rpc.sh[3772]: [2025-09-16T15:31:36.198758883Z INFO  solana_runtime::snapshot_utils::snapshot_storage_rebuilder] rebuilt storages for 41553/428935 slots with 0 collisions

@kskalski

If it's stuck, it probably hits this condition repeatedly:

Err(err) if err.kind() == io::ErrorKind::WouldBlock => 0, // treat as a kind of short write

which is unfortunate, as it looks like the kernel requires the IO to be submitted with the ASYNC flag. It might be somehow related to the fixed (file-descriptor) writes we are using; it is actually because of the weird scheduling of fixed writes into non-blocking workers that we are not marking all IOs as ASYNC.

Maybe we can selectively mark the write as ASYNC when encountering a WouldBlock error - this won't hurt performance or the way IOs are scheduled to workers in the common case / on newer kernels, but it won't block / crash on old kernels.
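
For illustration, a hedged sketch of what "selectively mark as ASYNC" could look like with the io-uring crate's SQE builder (hypothetical helper, assuming a recent crate version; the real code uses fixed writes, this only shows where the flag would go):

    use io_uring::{opcode, squeue, types};

    // Build a write SQE; set IOSQE_ASYNC only when retrying after a WouldBlock/EAGAIN
    // completion, so the kernel punts the op straight to an io-wq worker instead of
    // attempting a non-blocking (NOWAIT) write first.
    fn build_write_sqe(fd: types::Fd, buf: &[u8], offset: u64, retry_async: bool) -> squeue::Entry {
        let mut entry = opcode::Write::new(fd, buf.as_ptr(), buf.len() as u32)
            .offset(offset)
            .build();
        if retry_async {
            entry = entry.flags(squeue::Flags::ASYNC);
        }
        entry
    }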

@kskalski

Ok, confirmed on Triton's node (5.15 kernel) that adding the async flag makes unpacking proceed at a more or less normal pace and succeed (I didn't check, though, whether the issue happens for all or only some write ops).
I think at some point we should re-check setting async as the default for writes, but I would prefer to do it after all pending PRs on io_uring are merged (specifically #6878), since they change the mode of io_uring writes (e.g. putting them in sqpoll).

@kskalski

Also confirmed that EAGAIN error occurs for all writes when they are in directory backed by /dev/zram0, which is reproducible using our file_io::tests::test_create_writes_contents test when tempdir is pointed to that mount.

@alessandrod

Also confirmed that EAGAIN error occurs for all writes when they are in directory backed by /dev/zram0, which is reproducible using our file_io::tests::test_create_writes_contents test when tempdir is pointed to that mount.

Can you please add a test for this? Something that does whatever setup is needed to trigger the error.

@alessandrod

alessandrod commented Sep 20, 2025

but I would prefer to do it after all pending PRs on io_uring as merged (specifically #6878), since they changed the mode of io_uring writes (e.g. putting them in sqpoll).

we need to do the smallest possible change that fixes whatever we've broken for people (I'm assuming 3.0?)

@kskalski

Can you please add a test for this? Something that does whatever setup is needed to trigger the error.

There was a bit of discussion about that on Slack and the suggestion was to do a one-off fix without support for it in the test automation - the problem is that this only happens with a zram disk, which is not something normally available in the test environment. Unless we get a repro with vanilla disks, I'm not sure how to test it better.

we need to do the smallest possible change that fixes whatever we've broken for people (I'm assuming 3.0?)

Agreed, I believe this PR is the path-of-least-resistance change, i.e. it fixes the issue without affecting the runtime / performance of the baseline, while fixing it in a different way could bring some unknowns.

@alessandrod

the problem is that this only happens with a zram disk

how sure are we about this? Is it "only people with zram have complained" or "we're 100% positive it's only zram"?

@alessandrod

Also confirmed that EAGAIN error occurs for all writes when they are in directory backed by /dev/zram0, which is reproducible using our file_io::tests::test_create_writes_contents test when tempdir is pointed to that mount.

This is why I'm not comfortable with this patch. I suspect that we're doing all submissions twice when we hit this bug: non async => EAGAIN => async.

Also I don't think this is a case of short writes. Short write = written < requested_write, but here we're writing 0 and getting EAGAIN. The patch is kinda confusing since we're restoring the short-write offset code (which, to be clear, we should implement), but in this case we always hit WouldBlock, so written=0, so we never update offsets.

I think we need to add a test for zram, do an RCA, and make sure that either everyone who's hitting this bug is doing so because of zram, or RCA further.

@kskalski kskalski changed the title from "Support short writes in io_uring file creator" to "Retry IO on short write or EAGAIN in io_uring file creator" Sep 21, 2025
@kskalski

kskalski commented Sep 22, 2025

how sure are we about this? Is it "only people with zram have complained" or "we're 100% positive it's only zram"?

Short writes / EAGAIN do happen in the wild, though not often, e.g. I've seen a panic from JitoLabs from a few weeks ago, and now I also see one for 335NaZ18GDW4rEmcoTb3Fae5CwKyqY8iVp6BtPAWc8A7 (using the 3.1 release) from Sep 17.
Reported problems are with:

  • zram - happens on 100% of writes, reproducible with a unittest
  • tmpfs - so far I wasn't able to reproduce with a test on our CI
  • none of the Anza servers, including those running the 5.15 kernel, experienced this
  • upgrading the kernel always fixes the issue, which suggests it is an actual kernel bug that was fixed somewhere after 5.15

This is why I'm not comfortable with this patch. I suspect that we're doing all submissions twice when we hit this bug: non async => EAGAIN => async.

Yes, this is exactly what happens.

Also I don't think this is a case of short writes. Short write = written < requested_write, but here we're writing 0 and getting EAGAIN. The patch is kinda confusing since we're restoring the short-write offset code (which, to be clear, we should implement), but in this case we always hit WouldBlock, so written=0, so we never update offsets.

True, this patch fixes both short writes and WouldBlock in one go (both cases were actually reported by the user with zram). Maybe it would be less confusing if the use_async val were derived directly from error kind == WouldBlock... I find the current code more succinct though, as written is calculated in one place and all cases of retry are gated by checking it, but I can change it either way.

I think we need to add a test for zram, RCA, make sure that either everyone who's hitting this bug is doing so because of zram or RCA further.

Well, the test itself is quite simple:

    #[test]
    fn test_create_writes_contents() -> io::Result<()> {
-        let temp_dir = tempfile::tempdir()?;
+        let temp_dir = tempfile::tempdir_in("/zram/")?;

though technically it would only test that the fix is a proper one for zram.
Since the fix is a generic one for all cases of short write / EAGAIN irrespective of their cause, in order to verify that the issue doesn't happen in other contexts we would need to add some metric to detect the situation and dig into each elevated occurrence.

@alessandrod

alessandrod commented Sep 23, 2025

Ok I've gone pretty deep on this, here's my findings so far:

  • virtually no filesystems implement NOWAIT, so submitting in non-blocking (non-async) mode doesn't make much sense; we just execute extra code and then io_uring will fall back to ASYNC anyway
  • btrfs (and xfs) can do NOWAIT if you write to something that is already in the page cache - this is probably why we get btrfs-specific bugs
  • tmpfs can also do NOWAIT, but only when it doesn't require allocations
  • when EAGAIN is retried
    • the first time when an entry is submitted with NOWAIT (when we don't set the ASYNC flag), EAGAIN causes re-submission in a worker
    • indefinitely in a worker ONLY IF IORING_SETUP_IOPOLL is set (only works for O_DIRECT)
    • only once in a worker if the ASYNC flag is set

See torvalds/linux@e0deb6a

So I think we must always handle EAGAIN in our code. And we must make sure we don't loop indefinitely, always retrying the same write - for example if we run out of memory (tmpfs) or disk (although maybe we get ENOSPC in that case?).

I haven't looked at what's happening with zram specifically yet, I'll do that next.

EDIT: zram doesn't seem to behave differently from ext4.

Ok, confirmed on triton's node (5.15 kernel) that adding async flag makes unpacking proceed at more or less normal pace and succeed

I am now skeptical that ASYNC is fixing any issues here?

@kskalski

Ok I've gone pretty deep on this, here's my findings so far:

Thanks for the in-depth research.

  * the first time when an entry is submitted with NOWAIT (when we don't set the ASYNC flag), EAGAIN causes re-submission in a worker

I think this is the culprit of the problem - my guess is 5.15 had a bug in this path. Do you know the specific point in the code where this happens? If we really want to get to the root cause, we should compare how it was in 5.15. On the other hand, it's just a couple more months that we need to support that kernel - not knowing what exactly is happening will probably keep this code here indefinitely. :/

  * indefinitely in a worker ONLY IF IORING_SETUP_IOPOLL is set (only works for O_DIRECT)

FWIW I tested enabling sqpoll on the machine with the reproduction, and it didn't fix the issue; it wasn't IOPOLL / O_DIRECT though, so I guess it wouldn't apply.

So I think we must always handle EAGAIN in our code. And we must make sure we don't loop indefinitely retrying always the same write - for example if we run out of memory (tmpfs) or disk (although maybe we get ENOSPC in that case?).

Ok, I added the check that we only retry WouldBlock when the original write wasn't async.
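
A tiny sketch of that guard (hypothetical names, not the actual patch): a WouldBlock completion triggers an ASYNC resubmission only if the failed write wasn't already submitted with ASYNC, so a persistent EAGAIN can't retry forever:

    // Per-op flag remembered at submission time (hypothetical struct).
    struct PendingWrite {
        submitted_async: bool,
    }

    // Retry with the ASYNC flag only on the first WouldBlock for a given write.
    fn should_retry_as_async(op: &PendingWrite, err: &std::io::Error) -> bool {
        err.kind() == std::io::ErrorKind::WouldBlock && !op.submitted_async
    }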

I am now skeptical that ASYNC is fixing any issues here?

Hm, not sure, my bet is 5.15 had some serious bug. Also, from the report it seems the ZFS version might matter too; there is a comment:

1st server (accounts in ZFS): Upgrading kernel (5.15.0 to 6.8.0) + zfs (2.2.6 to latest 2.3.4) seems to have solved it
2nd server (accounts ZRAM): Upgrading kernel (5.15.0 to 6.8.0) alone worked

Irrespective of that - this PR is kind of implementing the standard IO API, i.e. when you see a short write, retry with the remaining payload; when you see WouldBlock, retry in async mode (usually EAGAIN and EWOULDBLOCK are the same error, so let's say it relates only to async).

@alessandrod

when you see WouldBlock, retry in async mode

This is not how it works though. EAGAIN has nothing to do with ASYNC. I don't think we should be submitting with ASYNC in the case of a short write.

I think we should watch the offset/written amount, resubmit, and bail if we detect no progress.
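
Roughly, that policy amounts to something like the sketch below (hypothetical names, not the actual FileCreator code): advance the offset by whatever was written, resubmit the remainder, and bail if a completion made no progress:

    use std::io;

    struct WriteOp {
        offset: u64,      // file offset of the next unwritten byte
        remaining: usize, // bytes left to write from the buffer
    }

    // Returns Ok(true) when the op is complete, Ok(false) when it should be
    // resubmitted for the remaining range, and Err(..) when no progress was made.
    fn on_write_completion(op: &mut WriteOp, written: usize) -> io::Result<bool> {
        if written == 0 {
            // No progress: bail instead of resubmitting the same write forever.
            return Err(io::Error::new(io::ErrorKind::WriteZero, "write made no progress"));
        }
        let written = written.min(op.remaining);
        op.offset += written as u64;
        op.remaining -= written;
        Ok(op.remaining == 0)
    }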

@alessandrod

too lazy to type but

[screenshot: 2025-09-23 9:45 pm]

We should remove O_NONBLOCK from FileCreator (separate PR), handle short writes while making sure we don't loop if we don't make progress, and remove the ASYNC resubmission from this PR.

@kskalski kskalski changed the title from "Retry IO on short write or EAGAIN in io_uring file creator" to "Retry IO on short write in io_uring file creator" Sep 24, 2025
@kskalski

Created #8161 for removing the O_NONBLOCK flag and changed this PR to only handle short writes, which I think is also necessary to fix #8036 completely.

I will try to add a test case for disk-full in a separate PR, since I think it will contain a lot of set-up code / changes that should be kept separate (for the purpose of more confident back-porting).

Comment thread accounts-db/src/io_uring/file_creator.rs

@brooksprumo brooksprumo left a comment


:shipit:

@kskalski kskalski merged commit a10a2c8 into anza-xyz:master Sep 24, 2025
43 checks passed
@kskalski kskalski deleted the ks/short_write branch September 24, 2025 13:58
@kskalski kskalski added the v3.0 label Sep 24, 2025
@mergify

mergify Bot commented Sep 24, 2025

Backports to the beta branch are to be avoided unless absolutely necessary for fixing bugs, security issues, and perf regressions. Changes intended for backport should be structured such that a minimum effective diff can be committed separately from any refactoring, plumbing, cleanup, etc that are not strictly necessary to achieve the goal. Any of the latter should go only into master and ride the normal stabilization schedule. Exceptions include CI/metrics changes, CLI improvements and documentation updates on a case by case basis.

mergify Bot pushed a commit that referenced this pull request Sep 24, 2025
* Support short writes in io_uring file creator

* Handle resource busy err as short write. Update comment. Fix calculating total written for stats

* Remove handling of busy error

* Add warn for short write

(cherry picked from commit a10a2c8)
kskalski added a commit that referenced this pull request Sep 26, 2025
…8053) (#8173)

Retry IO on short write in io_uring file creator (#8053)

* Support short writes in io_uring file creator
* Add warn for short write

(cherry picked from commit a10a2c8)

Co-authored-by: Kamil Skalski <kamil.skalski@gmail.com>
Development

Successfully merging this pull request may close these issues.

validator wont start on kernel 5.15 with agave v3.0.0
