Two dRAID spares for one vdev #16547

Closed
tonyhutter opened this issue Sep 19, 2024 · 2 comments · Fixed by #17231
Assignees: tonyhutter
Labels: Type: Defect (Incorrect behavior, e.g. crash, hang)

Comments

@tonyhutter
Contributor

System information

Type                  Version/Name
Distribution Name     RHEL
Distribution Version  8.10
Kernel Version        4.18
Architecture          x86-64
OpenZFS Version       2.2.4

Describe the problem you're observing

We've seen cases where two spares were assigned to the same failed vdev:

     NAME                  STATE     READ WRITE CKSUM
     tank20                DEGRADED     0     0     0
       draid2:8d:90c:2s-0  DEGRADED     0     0     0
         L0                ONLINE       0     0     0
         L1                ONLINE       0     0     0
         L2                ONLINE       0     0     0
         L3                ONLINE       0     0     0
         L4                ONLINE       0     0     0
         L5                ONLINE       0     0     0
         spare-6           DEGRADED     0     0 13.2K
           replacing-0     DEGRADED     0     0     0
             spare-0       DEGRADED     0     0     0
               L6/old      FAULTED      0     0     0  external device fault
               draid2-0-1  ONLINE       0     0     0
             L6            ONLINE       0     0     0
           draid2-0-0      ONLINE       0     0     0
         L7                ONLINE       0     0     0 

Detaching the spares got the pool back to a healthy state. Here is the procedure our admins used to get it back to normal (a command sketch follows the list):

1. zpool detach <GUID of L6/old>
   - It detached, but we were still left with 2 ONLINE spares.
2. zpool detach draid2-0-1
   - The spare detached and the good L6 moved up one indentation level in the tree, but draid2-0-0 didn't auto-detach.
3. zpool detach draid2-0-0
   - The spare detached, leaving everything looking normal.
4. Started a scrub.
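
For reference, here is that procedure as a rough command sketch. This is my reconstruction, not the admins' exact shell history: the pool name tank20 comes from the status output above, and <GUID-of-L6/old> is a placeholder for the real GUID, which zpool status -g will print.

    # Reconstruction of the recovery steps above; <GUID-of-L6/old> is a placeholder.
    zpool status -g tank20                 # list per-vdev GUIDs
    zpool detach tank20 <GUID-of-L6/old>   # step 1: drop the old faulted leaf
    zpool detach tank20 draid2-0-1         # step 2: release the inner spare
    zpool detach tank20 draid2-0-0         # step 3: release the remaining spare
    zpool scrub tank20                     # step 4: verify the pool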

Describe how to reproduce the problem

We will need to develop a test case to reproduce this. I think it would go roughly like this (a command sketch follows the list):

  1. Create a dRAID pool with 2 spares.
  2. Fault one of the disks; call it disk1.
  3. Let the dRAID spare kick in.
  4. Replace disk1 with a new disk; call it disk1-new.
  5. While it's resilvering to disk1-new, fault disk1-new.
  6. See if the 2nd spare kicks in.
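
As a sketch of what such a test case might look like (device paths, sizes, and the pool name are hypothetical; zpool offline -f is used to simulate the faults, and whether ZED engages the distributed spares automatically in this setup is an assumption):

    # Hypothetical reproduction sketch using sparse file-backed devices.
    truncate -s 1G /var/tmp/d{0..9}.img /var/tmp/disk1-new.img
    zpool create testpool draid2:4d:10c:2s /var/tmp/d{0..9}.img

    # Steps 2-3: fault "disk1" and wait for ZED to kick in the first
    # distributed spare (or attach it by hand:
    #   zpool replace testpool /var/tmp/d0.img draid2-0-0).
    zpool offline -f testpool /var/tmp/d0.img

    # Step 4: replace disk1 with a new disk.
    zpool replace testpool /var/tmp/d0.img /var/tmp/disk1-new.img

    # Step 5: fault the replacement while it is still resilvering.
    zpool offline -f testpool /var/tmp/disk1-new.img

    # Step 6: check whether draid2-0-1 also gets attached.
    zpool status -v testpool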

Include any warning/errors/backtraces from the system logs

tonyhutter added the Type: Defect (Incorrect behavior, e.g. crash, hang) label on Sep 19, 2024
tonyhutter self-assigned this on Sep 19, 2024
@tonyhutter
Contributor Author

Somewhat related: #17226

@tonyhutter
Contributor Author

Fix: #17231

behlendorf pushed a commit that referenced this issue May 2, 2025
It's possible for two spares to get attached to a single failed vdev.
This happens when you have a failed disk that is spared, and then you
replace the failed disk with a new disk, but during the resilver
the new disk fails, and ZED kicks in a spare for the failed new
disk.  This commit checks for that condition and disallows it.

Reviewed-by: Akash B <[email protected]>
Reviewed-by: Ameer Hamza <[email protected]>
Reviewed-by: Alexander Motin <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: Tony Hutter <[email protected]>
Closes: #16547
Closes: #17231
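
As a rough illustration of the condition the commit describes, reusing the hypothetical names from the reproduction sketch above (and assuming the manual equivalent of the ZED-driven attach is rejected the same way):

    # Hypothetical: draid2-0-0 is already covering the original failure, so
    # asking for the second distributed spare on the faulted replacement is
    # the doubled-spare case the fix is meant to disallow.
    zpool replace testpool /var/tmp/disk1-new.img draid2-0-1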