Retry IO on short write in io_uring file creator #8053

Merged
kskalski merged 4 commits into anza-xyz:master from kskalski:ks/short_write on Sep 24, 2025

Conversation

@kskalski

@kskalski kskalski commented Sep 15, 2025

Problem

As reported in #8036 (comment), Ubuntu 22 with a 5.15 kernel and a ZFS filesystem experiences errors due to short writes.
It's not completely clear which kernel / FS combinations give a 100% guarantee of not returning EAGAIN or a short write; this comment axboe/liburing#766 (comment) suggests 5.15 fixed some issues, but given the reports from the wild it's worth fixing.

Summary of Changes

Re-submit write when:

  • getting a "resource temporarily unavailable" error (EAGAIN aka WouldBlock)
  • getting a short write (unless an Ok with a 0-size write happens, which is treated as a hard error)

Fixes #8036 (it's not clear where the resource busy errors come from, but write completion is the most likely place)
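
For illustration only, the retry decision described above could look roughly like the sketch below (hypothetical names, not the PR's exact code), classifying each io_uring write completion:

    use std::io;

    // Outcome of inspecting one write completion.
    enum WriteOutcome {
        Done,                     // the whole requested range was written
        Retry { written: usize }, // resubmit the remaining range
    }

    fn classify(result: io::Result<usize>, requested: usize) -> io::Result<WriteOutcome> {
        match result {
            // An Ok with a 0-size write made no progress; treat it as a hard error.
            Ok(0) => Err(io::Error::new(io::ErrorKind::WriteZero, "0-size write")),
            // Short write: some bytes landed, resubmit the rest.
            Ok(n) if n < requested => Ok(WriteOutcome::Retry { written: n }),
            Ok(_) => Ok(WriteOutcome::Done),
            // EAGAIN aka WouldBlock: nothing was written, resubmit from the start.
            Err(e) if e.kind() == io::ErrorKind::WouldBlock => Ok(WriteOutcome::Retry { written: 0 }),
            Err(e) => Err(e),
        }
    }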

@kskalski

kskalski commented Sep 15, 2025

I also found some panics from a few days ago in JitoLabs:

ELwVbNrN4q5UhwvVtz94jBxGZULsYQ4bm7XbK8mSDJSV 3.0.1 (src:8a176bcf; feat:128318206, client:JitoLabs)
panicked at accounts-db/src/io_uring/file_creator.rs:475:9:
assertion `left == right` failed: short write
  left: 109056
 right: 524288

@codecov-commenter

codecov-commenter commented Sep 15, 2025

Codecov Report

❌ Patch coverage is 52.00000% with 12 lines in your changes missing coverage. Please review.
✅ Project coverage is 82.9%. Comparing base (8b52ec8) to head (709e257).
⚠️ Report is 62 commits behind head on master.

Additional details and impacted files
@@            Coverage Diff            @@
##           master    #8053     +/-   ##
=========================================
- Coverage    82.9%    82.9%   -0.1%     
=========================================
  Files         823      823             
  Lines      360428   360447     +19     
=========================================
- Hits       299071   298998     -73     
- Misses      61357    61449     +92     

@alessandrod alessandrod self-requested a review September 15, 2025 12:40
@kskalski kskalski marked this pull request as ready for review September 15, 2025 12:46

@brooksprumo brooksprumo left a comment


How was this change tested?

@alessandrod

Pls don't merge this yet, give me a chance to look at the original issue.

@kskalski

How was this change tested?

So far only on a regular devbox, which never experienced such errors, so for now I treat this PR as a reference for anyone who sees them to test.

@Lusitaniae

Lusitaniae commented Sep 16, 2025

Solana seems to be stuck for >10 minutes during startup on this line when using this patch on top of 3.0.2:

incremental-snapshot-367177550-367226598-5ECRCsPkt8C1DfTP3Qsm66wKof1wtwzZ34vSiQUDenj4.tar.zst

Without the patch (3.0.2 release) the error comes up immediately

Going back to 2.3.6, we can see the next log message comes up within 14 seconds:

Sep 16 15:31:22 tyo162 solana-rpc.sh[3772]: [2025-09-16T15:31:22.412669606Z INFO  solana_runtime::snapshot_bank_utils] Loading bank from full snapshot archive: /solana/snapshots/snapshot-367177550-3GAd3xdqzzvvqzrMNaJf8eQwGSDpLm3a8EK49GNhp5jr.tar.zst, and incremental snapshot archive: Some("/solana/snapshots/incremental-snapshot-367177550-367226598-5ECRCsPkt8C1DfTP3Qsm66wKof1wtwzZ34vSiQUDenj4.tar.zst")
Sep 16 15:31:36 tyo162 solana-rpc.sh[3772]: [2025-09-16T15:31:36.198758883Z INFO  solana_runtime::snapshot_utils::snapshot_storage_rebuilder] rebuilt storages for 41553/428935 slots with 0 collisions

@kskalski

If it's stuck, it probably hits this condition repeatedly:

Err(err) if err.kind() == io::ErrorKind::WouldBlock => 0, // treat as a kind of short write

which is unfortunate, as it looks like the kernel requires the IO to be submitted with the ASYNC flag. It might be somehow related to the fixed (file-descriptor) writes we are using; it is actually because of the weird scheduling of fixed writes into non-blocking workers that we are not marking all IOs as ASYNC.

Maybe we can selectively mark the write as ASYNC when encountering a WouldBlock error - this won't hurt performance or the way IOs are scheduled to workers in the common case / on newer kernels, but it won't block / crash on old kernels.
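
For illustration, a hedged sketch of what "selectively mark as ASYNC" could look like with the io-uring crate's SQE builder (hypothetical helper, assuming a recent crate version; the real code uses fixed writes, this only shows where the flag would go):

    use io_uring::{opcode, squeue, types};

    // Build a write SQE; set IOSQE_ASYNC only when retrying after a WouldBlock/EAGAIN
    // completion, so the kernel punts the op straight to an io-wq worker instead of
    // attempting a non-blocking (NOWAIT) write first.
    fn build_write_sqe(fd: types::Fd, buf: &[u8], offset: u64, retry_async: bool) -> squeue::Entry {
        let mut entry = opcode::Write::new(fd, buf.as_ptr(), buf.len() as u32)
            .offset(offset)
            .build();
        if retry_async {
            entry = entry.flags(squeue::Flags::ASYNC);
        }
        entry
    }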

@kskalski

Ok, confirmed on Triton's node (5.15 kernel) that adding the async flag makes unpacking proceed at a more or less normal pace and succeed (I didn't check, though, whether the issue happens for all or only some write ops).
I think at some point we should re-check setting async as the default for writes, but I would prefer to do it after all pending PRs on io_uring are merged (specifically #6878), since they change the mode of io_uring writes (e.g. putting them in sqpoll).

@kskalski

Also confirmed that EAGAIN error occurs for all writes when they are in directory backed by /dev/zram0, which is reproducible using our file_io::tests::test_create_writes_contents test when tempdir is pointed to that mount.

@alessandrod

Also confirmed that EAGAIN error occurs for all writes when they are in directory backed by /dev/zram0, which is reproducible using our file_io::tests::test_create_writes_contents test when tempdir is pointed to that mount.

Can you please add a test for this? Something that does whatever setup is needed to trigger the error.

@alessandrod

alessandrod commented Sep 20, 2025

but I would prefer to do it after all pending PRs on io_uring as merged (specifically #6878), since they changed the mode of io_uring writes (e.g. putting them in sqpoll).

we need to do the smallest possible change that fixes whatever we've broken for people (I'm assuming 3.0?)

@kskalski

Can you please add a test for this? Something that does whatever setup is needed to trigger the error.

There was a bit of discussion about that on Slack and the suggestion was to do a one-off fix without support for it in the test automation - the problem is that this only happens with a zram disk, which is not something normally available in the test environment. Unless we get a repro with vanilla disks, I'm not sure how to test it better.

we need to do the smallest possible change that fixes whatever we've broken for people (I'm assuming 3.0?)

Agreed, I believe this PR is the path-of-least-resistance change, i.e. it fixes the issue without affecting the runtime / performance of the baseline, while fixing it in a different way could bring some unknowns.

@alessandrod

the problem is that this only happens with a zram disk

how sure are we about this? Is it "only people with zram have complained" or "we're 100% positive it's only zram"?

@alessandrod

Also confirmed that EAGAIN error occurs for all writes when they are in directory backed by /dev/zram0, which is reproducible using our file_io::tests::test_create_writes_contents test when tempdir is pointed to that mount.

This is why I'm not comfortable with this patch. I suspect that we're doing all submissions twice when we hit this bug: non async => EAGAIN => async.

Also I don't think this is a case of short writes. Short write = written < requested_write, but here we're writing 0 and getting EAGAIN. The patch is kinda confusing since we're restoring the short-write offset code (which, to be clear, we should implement), but in this case we always hit WouldBlock, so written=0, so we never update offsets.

I think we need to add a test for zram, do an RCA, and make sure that either everyone who's hitting this bug is doing so because of zram, or RCA further.

@kskalski kskalski changed the title from "Support short writes in io_uring file creator" to "Retry IO on short write or EAGAIN in io_uring file creator" Sep 21, 2025
@kskalski

kskalski commented Sep 22, 2025

how sure are we about this? Is it "only people with zram have complained" or "we're 100% positive it's only zram"?

Short writes / EAGAIN do happen in the wild, though not often, e.g. I've seen a panic from JitoLabs from a few weeks ago, and now I also see one for 335NaZ18GDW4rEmcoTb3Fae5CwKyqY8iVp6BtPAWc8A7 (using the 3.1 release) from Sep 17.
Reported problems are with:

  • zram - happens on 100% of writes, reproducible with a unittest
  • tmpfs - so far I wasn't able to reproduce with a test on our CI
  • none of the Anza servers, including those running the 5.15 kernel, experienced this
  • upgrading the kernel always fixes the issue, which suggests it is an actual kernel bug that was fixed somewhere after 5.15

This is why I'm not comfortable with this patch. I suspect that we're doing all submissions twice when we hit this bug: non async => EAGAIN => async.

Yes, this is exactly what happens.

Also I don't think this is a case of short writes. Short write = written < requested_write, but here we're writing 0 and getting EAGAIN. The patch is kinda confusing since we're restoring the short-write offset code (which, to be clear, we should implement), but in this case we always hit WouldBlock, so written=0, so we never update offsets.

True, this patch fixes both short writes and WouldBlock in one go (both cases were actually reported by the user with zram). Maybe it would be less confusing if the use_async val were derived directly from error kind == WouldBlock... I find the current code more succinct though, as written is calculated in one place and all cases of retry are gated by checking it, but I can change it either way.

I think we need to add a test for zram, RCA, make sure that either everyone who's hitting this bug is doing so because of zram or RCA further.

Well, the test itself is quite simple:

    #[test]
    fn test_create_writes_contents() -> io::Result<()> {
-        let temp_dir = tempfile::tempdir()?;
+        let temp_dir = tempfile::tempdir_in("/zram/")?;

though technically it would only test that the fix is a proper one for zram.
Since the fix is a generic one for all cases of short write / EAGAIN irrespective of their cause, in order to verify that the issue doesn't happen in other contexts we would need to add some metric to detect the situation and dig into each elevated occurrence.

@alessandrod

alessandrod commented Sep 23, 2025

Ok I've gone pretty deep on this, here's my findings so far:

  • virtually no filesystems implement NOWAIT, so submitting in non-blocking (non-async) mode doesn't make much sense; we just execute extra code and then io_uring will fall back to ASYNC anyway
  • btrfs (and xfs) can do NOWAIT if you write to something that is already in the page cache - this is probably why we get btrfs-specific bugs
  • tmpfs can also do NOWAIT, but only when it doesn't require allocations
  • when EAGAIN is retried
    • the first time when an entry is submitted with NOWAIT (when we don't set the ASYNC flag), EAGAIN causes re-submission in a worker
    • indefinitely in a worker ONLY IF IORING_SETUP_IOPOLL is set (only works for O_DIRECT)
    • only once in a worker if the ASYNC flag is set

See torvalds/linux@e0deb6a

So I think we must always handle EAGAIN in our code. And we must make sure we don't loop indefinitely, always retrying the same write - for example if we run out of memory (tmpfs) or disk (although maybe we get ENOSPC in that case?).

I haven't looked at what's happening with zram specifically yet, I'll do that next.

EDIT: zram doesn't seem to behave differently from ext4.

Ok, confirmed on triton's node (5.15 kernel) that adding async flag makes unpacking proceed at more or less normal pace and succeed

I am now skeptical that ASYNC is fixing any issues here?

@kskalski

Ok I've gone pretty deep on this, here's my findings so far:

Thanks for the in-depth research.

  * the first time when an entry is submitted with NOWAIT (when we don't set the ASYNC flag), EAGAIN causes re-submission in a worker

I think this is the culprit of the problem - my guess is 5.15 had a bug in this path. Do you know the specific point in the code where this happens? If we really want to get to the root cause, we should compare how it was in 5.15. On the other hand, it's just a couple more months that we need to support that kernel - not knowing what exactly is happening will probably keep this code here indefinitely. :/

  * indefinitely in a worker ONLY IF IORING_SETUP_IOPOLL is set (only works for O_DIRECT)

FWIW I tested enabling sqpoll on the machine with the reproduction, and it didn't fix the issue; it wasn't IOPOLL / O_DIRECT though, so I guess it wouldn't apply.

So I think we must always handle EAGAIN in our code. And we must make sure we don't loop indefinitely retrying always the same write - for example if we run out of memory (tmpfs) or disk (although maybe we get ENOSPC in that case?).

Ok, I added the check that we only retry WouldBlock when the original write wasn't async.
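
A tiny sketch of that guard (hypothetical names, not the actual patch): a WouldBlock completion triggers an ASYNC resubmission only if the failed write wasn't already submitted with ASYNC, so a persistent EAGAIN can't retry forever:

    // Per-op flag remembered at submission time (hypothetical struct).
    struct PendingWrite {
        submitted_async: bool,
    }

    // Retry with the ASYNC flag only on the first WouldBlock for a given write.
    fn should_retry_as_async(op: &PendingWrite, err: &std::io::Error) -> bool {
        err.kind() == std::io::ErrorKind::WouldBlock && !op.submitted_async
    }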

I am now skeptical that ASYNC is fixing any issues here?

Hm, not sure, my bet is 5.15 had some serious bug. Also, from the report it seems the ZFS version might matter too; there is a comment:

1st server (accounts in ZFS): Upgrading kernel (5.15.0 to 6.8.0) + zfs (2.2.6 to latest 2.3.4) seems to have solved it
2nd server (accounts ZRAM): Upgrading kernel (5.15.0 to 6.8.0) alone worked

Irrespective of that - this PR is kind of implementing the standard IO API, i.e. when you see a short write, retry with the remaining payload; when you see WouldBlock, retry in async mode (usually EAGAIN and EWOULDBLOCK are the same error, so let's say it relates only to async).

@alessandrod

when you see WouldBlock, retry in async mode

This is not how it works though. EAGAIN has nothing to do with ASYNC. I don't think we should be submitting with ASYNC in the case of a short write.

I think we should watch the offset/written amount, resubmit, and bail if we detect no progress.
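
Roughly, that policy amounts to something like the sketch below (hypothetical names, not the actual FileCreator code): advance the offset by whatever was written, resubmit the remainder, and bail if a completion made no progress:

    use std::io;

    struct WriteOp {
        offset: u64,      // file offset of the next unwritten byte
        remaining: usize, // bytes left to write from the buffer
    }

    // Returns Ok(true) when the op is complete, Ok(false) when it should be
    // resubmitted for the remaining range, and Err(..) when no progress was made.
    fn on_write_completion(op: &mut WriteOp, written: usize) -> io::Result<bool> {
        if written == 0 {
            // No progress: bail instead of resubmitting the same write forever.
            return Err(io::Error::new(io::ErrorKind::WriteZero, "write made no progress"));
        }
        let written = written.min(op.remaining);
        op.offset += written as u64;
        op.remaining -= written;
        Ok(op.remaining == 0)
    }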

@alessandrod

too lazy to type but

[screenshot: 2025-09-23 9:45 pm]

We should remove O_NONBLOCK from FileCreator (separate PR), handle short writes while making sure we don't loop if we don't make progress, and remove the ASYNC resubmission from this PR.

@kskalski kskalski changed the title from "Retry IO on short write or EAGAIN in io_uring file creator" to "Retry IO on short write in io_uring file creator" Sep 24, 2025
@kskalski

Created #8161 for removing the O_NONBLOCK flag and changed this PR to only handle short writes, which I think is also necessary to fix #8036 completely.

I will try to add a test case for disk-full in a separate PR, since I think it will contain a lot of set-up code / changes that should be kept separate (for the purpose of more confident back-porting).

Comment thread accounts-db/src/io_uring/file_creator.rs

@brooksprumo brooksprumo left a comment


:shipit:

@kskalski kskalski merged commit a10a2c8 into anza-xyz:master Sep 24, 2025
43 checks passed
@kskalski kskalski deleted the ks/short_write branch September 24, 2025 13:58
@kskalski kskalski added the v3.0 label Sep 24, 2025
@mergify

mergify Bot commented Sep 24, 2025

Backports to the beta branch are to be avoided unless absolutely necessary for fixing bugs, security issues, and perf regressions. Changes intended for backport should be structured such that a minimum effective diff can be committed separately from any refactoring, plumbing, cleanup, etc that are not strictly necessary to achieve the goal. Any of the latter should go only into master and ride the normal stabilization schedule. Exceptions include CI/metrics changes, CLI improvements and documentation updates on a case by case basis.

mergify Bot pushed a commit that referenced this pull request Sep 24, 2025
* Support short writes in io_uring file creator

* Handle resource busy err as short write. Update comment. Fix calculating total written for stats

* Remove handling of busy error

* Add warn for short write

(cherry picked from commit a10a2c8)
kskalski added a commit that referenced this pull request Sep 26, 2025
…8053) (#8173)

Retry IO on short write in io_uring file creator (#8053)

* Support short writes in io_uring file creator
* Add warn for short write

(cherry picked from commit a10a2c8)

Co-authored-by: Kamil Skalski <kamil.skalski@gmail.com>
Development

Successfully merging this pull request may close these issues.

validator wont start on kernel 5.15 with agave v3.0.0
