Two dRAID spares for one vdev #16547

Closed
tonyhutter opened this issue Sep 19, 2024 · 2 comments · Fixed by #17231
Assignees: tonyhutter
Labels: Type: Defect (Incorrect behavior, e.g. crash, hang)

Comments

@tonyhutter
Contributor

System information

Type                  Version/Name
Distribution Name     RHEL
Distribution Version  8.10
Kernel Version        4.18
Architecture          x86-64
OpenZFS Version       2.2.4

Describe the problem you're observing

We've seen cases where two spares were assigned to the same failed vdev:

     NAME                  STATE     READ WRITE CKSUM
     tank20                DEGRADED     0     0     0
       draid2:8d:90c:2s-0  DEGRADED     0     0     0
         L0                ONLINE       0     0     0
         L1                ONLINE       0     0     0
         L2                ONLINE       0     0     0
         L3                ONLINE       0     0     0
         L4                ONLINE       0     0     0
         L5                ONLINE       0     0     0
         spare-6           DEGRADED     0     0 13.2K
           replacing-0     DEGRADED     0     0     0
             spare-0       DEGRADED     0     0     0
               L6/old      FAULTED      0     0     0  external device fault
               draid2-0-1  ONLINE       0     0     0
             L6            ONLINE       0     0     0
           draid2-0-0      ONLINE       0     0     0
         L7                ONLINE       0     0     0 

Detaching the spares got the pool back to a healthy state. Here is the procedure our admins used to get it back to normal (a command sketch follows the list):

1. zpool detach <GUID of L6/old>
   - It detached, but we were still left with 2 ONLINE spares.
2. zpool detach draid2-0-1
   - The spare detached and the good L6 moved up one indentation level in the tree, but draid2-0-0 didn't auto-detach.
3. zpool detach draid2-0-0
   - The spare detached, leaving everything looking normal.
4. Started a scrub.
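
For reference, here is that procedure as a rough command sketch. This is my reconstruction, not the admins' exact shell history: the pool name tank20 comes from the status output above, and <GUID-of-L6/old> is a placeholder for the real GUID, which zpool status -g will print.

    # Reconstruction of the recovery steps above; <GUID-of-L6/old> is a placeholder.
    zpool status -g tank20                 # list per-vdev GUIDs
    zpool detach tank20 <GUID-of-L6/old>   # step 1: drop the old faulted leaf
    zpool detach tank20 draid2-0-1         # step 2: release the inner spare
    zpool detach tank20 draid2-0-0         # step 3: release the remaining spare
    zpool scrub tank20                     # step 4: verify the pool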

Describe how to reproduce the problem

We will need to develop a test case to reproduce this. I think it would go roughly like this (a command sketch follows the list):

  1. Create a dRAID pool with 2 spares.
  2. Fault one of the disks; call it disk1.
  3. Let the dRAID spare kick in.
  4. Replace disk1 with a new disk; call it disk1-new.
  5. While it's resilvering to disk1-new, fault disk1-new.
  6. See if the 2nd spare kicks in.
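
As a sketch of what such a test case might look like (device paths, sizes, and the pool name are hypothetical; zpool offline -f is used to simulate the faults, and whether ZED engages the distributed spares automatically in this setup is an assumption):

    # Hypothetical reproduction sketch using sparse file-backed devices.
    truncate -s 1G /var/tmp/d{0..9}.img /var/tmp/disk1-new.img
    zpool create testpool draid2:4d:10c:2s /var/tmp/d{0..9}.img

    # Steps 2-3: fault "disk1" and wait for ZED to kick in the first
    # distributed spare (or attach it by hand:
    #   zpool replace testpool /var/tmp/d0.img draid2-0-0).
    zpool offline -f testpool /var/tmp/d0.img

    # Step 4: replace disk1 with a new disk.
    zpool replace testpool /var/tmp/d0.img /var/tmp/disk1-new.img

    # Step 5: fault the replacement while it is still resilvering.
    zpool offline -f testpool /var/tmp/disk1-new.img

    # Step 6: check whether draid2-0-1 also gets attached.
    zpool status -v testpool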

Include any warning/errors/backtraces from the system logs

tonyhutter added the Type: Defect (Incorrect behavior, e.g. crash, hang) label on Sep 19, 2024
tonyhutter self-assigned this on Sep 19, 2024
@tonyhutter
Contributor Author

Somewhat related: #17226

@tonyhutter
Contributor Author

Fix: #17231

behlendorf pushed a commit that referenced this issue May 2, 2025
It's possible for two spares to get attached to a single failed vdev.
This happens when you have a failed disk that is spared, and then you
replace the failed disk with a new disk, but during the resilver
the new disk fails, and ZED kicks in a spare for the failed new
disk.  This commit checks for that condition and disallows it.

Reviewed-by: Akash B <[email protected]>
Reviewed-by: Ameer Hamza <[email protected]>
Reviewed-by: Alexander Motin <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: Tony Hutter <[email protected]>
Closes: #16547
Closes: #17231
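
As a rough illustration of the condition the commit describes, reusing the hypothetical names from the reproduction sketch above (and assuming the manual equivalent of the ZED-driven attach is rejected the same way):

    # Hypothetical: draid2-0-0 is already covering the original failure, so
    # asking for the second distributed spare on the faulted replacement is
    # the doubled-spare case the fix is meant to disallow.
    zpool replace testpool /var/tmp/disk1-new.img draid2-0-1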