Skip to content

Conversation

@DocMAX
Copy link

@DocMAX DocMAX commented Nov 2, 2025

This PR enhances the scrub command to handle temporary device disconnections
more gracefully, improving stability and preventing data loss:

  • Preserve maximum last_physical position during resume after interruptions
  • Handle device disconnection errors (ENODEV, ENOTCONN, EIO) as interrupted
    rather than canceled, allowing scrub to resume from the last position
  • Add retry logic for temporarily unavailable devices with automatic reconnection
  • Force more frequent progress saving when device issues are detected
  • Preserve progress data during temporary disconnections (e.g., USB hub resets)

These changes ensure that scrub operations can survive transient hardware issues
and resume properly without losing progress, making the filesystem maintenance
more robust in environments with less reliable storage devices.

This commit enhances the scrub command to handle temporary device disconnections
more gracefully, improving stability and preventing data loss:

- Preserve maximum last_physical position during resume after interruptions
- Handle device disconnection errors (ENODEV, ENOTCONN, EIO) as interrupted
  rather than canceled, allowing scrub to resume from the last position
- Add retry logic for temporarily unavailable devices with automatic reconnection
- Force more frequent progress saving when device issues are detected
- Preserve progress data during temporary disconnections (e.g., USB hub resets)

These changes ensure that scrub operations can survive transient hardware issues
and resume properly without losing progress, making the filesystem maintenance
more robust in environments with less reliable storage devices.
@DocMAX DocMAX force-pushed the scrub-stability-fix branch from df30667 to aa7819b Compare November 2, 2025 11:48
@DocMAX
Copy link
Author

DocMAX commented Nov 2, 2025

I think the check does not fail because of this commit

@adam900710
Copy link
Collaborator

Even if a scrub is interrupted, it should still properly updates its last_physical correctly.
If last_physical is reset to 0, it normally means scrub failed to scrub any byte.

This can be a problem for resume cases, but still in that case it's a bug in kernel, as btrfs modules always reset the whole sctx->process structure to zero, thus resulting the 0 last_physical problem.

It's better to fix it in the kernel, but your last_physical workaround is more or less acceptable, especially for older progs.
Although it's acceptable, it's not good enough, you should not use max() to update last_physical, but just a zero check is enough.

But unfortunately, your positive contribute ends there.

  • Incorrect return value check
    EIO is the most common errors, it can be of all kinds of reasons, from really critical to minor ones, it should not be checked for interruption.
    ENOTCONN is only returned when cancelling a scrub/balance but there is no running one.
    ENODEV is more close, but still not correct. It's returned when scrubbing a missing device.
    But until v6.19 kernel, a lost device at runtime won't be marked as missing, so your ENODEV check doesn't make any sense either.

    This is not your first time contributing, and you still didn't dig into the problem deep enough. This behavior will not help you to push more patches into btrfs-progs.

  • Unnecessary new lines

  • Mixed definition and code
    This can be detected by kernel check_patch.pl, but you never really seem to bother.

  • Retry behavior
    It's hard coded to 3, with hardcoded sleep.
    If you really want to retry (which is not a good idea), do it in your own damn script.
    Don't push your own (and incorrect) policy to every btrfs-progs users.

  • Duplicated and unnecessary comments
    There are comments added without any meanings, e.g. For device removal (ENODEV), preserve the last_physical part that has no effective code around.
    And most of them are duplicated, even incorrect (check the ENODEV and ENOTCONN ones).

    If you don't know what you're doing, meaningless comments won't help.

If you can not read kernel codes to explain the behavior correctly, please dig deeper before sending out patches like this.

@adam900710 adam900710 closed this Nov 2, 2025
@DocMAX
Copy link
Author

DocMAX commented Nov 4, 2025

OK, thank you for your review. The last_pyhsical reset is really a serious issue as the scrub simply resumes at the last percentage but then exceeds the 100% mark later.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants