Scrub stability fix #1054

DocMAX · 2025-11-02T11:42:03Z

This PR enhances the scrub command to handle temporary device disconnections
more gracefully, improving stability and preventing data loss:

Preserve maximum last_physical position during resume after interruptions
Handle device disconnection errors (ENODEV, ENOTCONN, EIO) as interrupted
rather than canceled, allowing scrub to resume from the last position
Add retry logic for temporarily unavailable devices with automatic reconnection
Force more frequent progress saving when device issues are detected
Preserve progress data during temporary disconnections (e.g., USB hub resets)

These changes ensure that scrub operations can survive transient hardware issues
and resume properly without losing progress, making the filesystem maintenance
more robust in environments with less reliable storage devices.

This commit enhances the scrub command to handle temporary device disconnections more gracefully, improving stability and preventing data loss: - Preserve maximum last_physical position during resume after interruptions - Handle device disconnection errors (ENODEV, ENOTCONN, EIO) as interrupted rather than canceled, allowing scrub to resume from the last position - Add retry logic for temporarily unavailable devices with automatic reconnection - Force more frequent progress saving when device issues are detected - Preserve progress data during temporary disconnections (e.g., USB hub resets) These changes ensure that scrub operations can survive transient hardware issues and resume properly without losing progress, making the filesystem maintenance more robust in environments with less reliable storage devices.

DocMAX · 2025-11-02T12:29:30Z

I think the check does not fail because of this commit

adam900710 · 2025-11-02T21:43:07Z

Even if a scrub is interrupted, it should still properly updates its last_physical correctly.
If last_physical is reset to 0, it normally means scrub failed to scrub any byte.

This can be a problem for resume cases, but still in that case it's a bug in kernel, as btrfs modules always reset the whole sctx->process structure to zero, thus resulting the 0 last_physical problem.

It's better to fix it in the kernel, but your last_physical workaround is more or less acceptable, especially for older progs.
Although it's acceptable, it's not good enough, you should not use max() to update last_physical, but just a zero check is enough.

But unfortunately, your positive contribute ends there.

Incorrect return value check
EIO is the most common errors, it can be of all kinds of reasons, from really critical to minor ones, it should not be checked for interruption.
ENOTCONN is only returned when cancelling a scrub/balance but there is no running one.
ENODEV is more close, but still not correct. It's returned when scrubbing a missing device.
But until v6.19 kernel, a lost device at runtime won't be marked as missing, so your ENODEV check doesn't make any sense either.

This is not your first time contributing, and you still didn't dig into the problem deep enough. This behavior will not help you to push more patches into btrfs-progs.
Unnecessary new lines
Mixed definition and code
This can be detected by kernel check_patch.pl, but you never really seem to bother.
Retry behavior
It's hard coded to 3, with hardcoded sleep.
If you really want to retry (which is not a good idea), do it in your own damn script.
Don't push your own (and incorrect) policy to every btrfs-progs users.
Duplicated and unnecessary comments
There are comments added without any meanings, e.g. For device removal (ENODEV), preserve the last_physical part that has no effective code around.
And most of them are duplicated, even incorrect (check the ENODEV and ENOTCONN ones).

If you don't know what you're doing, meaningless comments won't help.

If you can not read kernel codes to explain the behavior correctly, please dig deeper before sending out patches like this.

DocMAX · 2025-11-04T13:37:12Z

OK, thank you for your review. The last_pyhsical reset is really a serious issue as the scrub simply resumes at the last percentage but then exceeds the 100% mark later.

DocMAX force-pushed the scrub-stability-fix branch from df30667 to aa7819b Compare November 2, 2025 11:48

adam900710 closed this Nov 2, 2025

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Scrub stability fix #1054

Scrub stability fix #1054

Uh oh!

DocMAX commented Nov 2, 2025

Uh oh!

DocMAX commented Nov 2, 2025

Uh oh!

adam900710 commented Nov 2, 2025

Uh oh!

DocMAX commented Nov 4, 2025 •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Scrub stability fix #1054

Scrub stability fix #1054

Uh oh!

Conversation

DocMAX commented Nov 2, 2025

Uh oh!

DocMAX commented Nov 2, 2025

Uh oh!

adam900710 commented Nov 2, 2025

Uh oh!

DocMAX commented Nov 4, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

DocMAX commented Nov 4, 2025 •

edited

Loading