btrfs-progs: balance: add extra delay if converting with a missing de… #946

Open
wants to merge 1 commit into
base: devel

Conversation

adam900710
Collaborator

…vice

[BUG]
There is a reproducer that can make btrfs flip read-only (RO):

 # mkfs.btrfs -f -mraid1 -draid1 /dev/sdd /dev/sde
 # mount /dev/sdd /mnt/btrfs
 # echo 1 > /sys/block/sde/device/delete
 # btrfs balance start -mconvert=dup -dconvert=single /mnt/btrfs
 ERROR: error during balancing '.': Input/output error
 There may be more info in syslog - try dmesg | tail

Then btrfs will flip read-only with the following errors:

btrfs: attempt to access beyond end of device
sde: rw=6145, sector=21696, nr_sectors = 32 limit=0
btrfs: attempt to access beyond end of device
sde: rw=6145, sector=21728, nr_sectors = 32 limit=0
btrfs: attempt to access beyond end of device
sde: rw=6145, sector=21760, nr_sectors = 32 limit=0
BTRFS error (device sdd): bdev /dev/sde errs: wr 1, rd 0, flush 0, corrupt 0, gen 0
BTRFS error (device sdd): bdev /dev/sde errs: wr 2, rd 0, flush 0, corrupt 0, gen 0
BTRFS error (device sdd): bdev /dev/sde errs: wr 3, rd 0, flush 0, corrupt 0, gen 0
BTRFS error (device sdd): bdev /dev/sde errs: wr 3, rd 0, flush 1, corrupt 0, gen 0
btrfs: attempt to access beyond end of device
sde: rw=145409, sector=128, nr_sectors = 8 limit=0
BTRFS warning (device sdd): lost super block write due to IO error on /dev/sde (-5)
BTRFS error (device sdd): bdev /dev/sde errs: wr 4, rd 0, flush 1, corrupt 0, gen 0
btrfs: attempt to access beyond end of device
sde: rw=14337, sector=131072, nr_sectors = 8 limit=0
BTRFS warning (device sdd): lost super block write due to IO error on /dev/sde (-5)
BTRFS error (device sdd): bdev /dev/sde errs: wr 5, rd 0, flush 1, corrupt 0, gen 0
BTRFS error (device sdd): error writing primary super block to device 2
BTRFS info (device sdd): balance: start -dconvert=single -mconvert=dup -sconvert=dup
BTRFS info (device sdd): relocating block group 1372585984 flags data|raid1
BTRFS error (device sdd): bdev /dev/sde errs: wr 5, rd 0, flush 2, corrupt 0, gen 0
BTRFS warning (device sdd): chunk 2446327808 missing 1 devices, max tolerance is 0 for writable mount
BTRFS: error (device sdd) in write_all_supers:4044: errno=-5 IO failure (errors while submitting device barriers.)
BTRFS info (device sdd state E): forced readonly
BTRFS warning (device sdd state E): Skipping commit of aborted transaction.
BTRFS error (device sdd state EA): Transaction aborted (error -5)
BTRFS: error (device sdd state EA) in cleanup_transaction:2017: errno=-5 IO failure
BTRFS info (device sdd state EA): balance: ended with status: -5

[CAUSE]
The root cause is that deleting a device through the sysfs interface normally triggers the shutdown callback for the filesystem.

But btrfs doesn't handle that callback at all, so it cannot know that the device is no longer available and will still try to do the usual reads/writes on that device.

This is fine if the user does nothing, as RAID1 can handle it properly.

But if we try to convert to SINGLE/DUP, btrfs will still use that device to allocate new data/metadata chunks.
And if a new metadata chunk is allocated on the removed device, all writes to it will be lost, triggering the super block write/barrier errors above.

[USER SPACE ENHANCEMENT]
For now, add an extra missing-device check to the btrfs-balance command. If a device is missing when converting, btrfs balance will add a 10-second delay and warn about the possible danger.

The root fix is to introduce failing/removed device detection in btrfs, but that is a fairly large feature and will take quite some time to land upstream.
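
For readers who want a picture of the delay-and-warn behavior described above, here is a minimal sketch in C. It is not the code from this patch: the helper name `delay_if_missing`, its parameters, and the warning wording are all hypothetical, and how the number of missing devices is obtained is left out entirely.

```c
#include <stdbool.h>
#include <stdio.h>
#include <unistd.h>

/*
 * Hypothetical helper (not taken from the patch): when a profile
 * conversion is requested while at least one device is missing, print a
 * warning and count down before the balance is allowed to start.
 *
 * Returns the number of seconds actually waited (0 if no delay applied).
 */
unsigned int delay_if_missing(bool converting, unsigned int missing_devs,
			      unsigned int delay_secs)
{
	unsigned int waited = 0;

	/* Nothing to warn about: no conversion, or all devices present. */
	if (!converting || missing_devs == 0)
		return 0;

	/* Illustrative wording only; the real message may differ. */
	fprintf(stderr,
		"WARNING: converting profiles with %u missing device(s) may make the filesystem go read-only.\n"
		"The operation will start in %u seconds. Use Ctrl-C to stop it.\n",
		missing_devs, delay_secs);

	/* Simple visible countdown, one tick per second. */
	while (delay_secs > 0) {
		fprintf(stderr, "%u ", delay_secs);
		fflush(stderr);
		sleep(1);
		delay_secs--;
		waited++;
	}
	fprintf(stderr, "\n");

	return waited;
}
```

A caller would invoke something like `delay_if_missing(true, 1, 10)` just before issuing the balance, giving the user a window to abort with Ctrl-C. Keeping the delay as a parameter is only a convenience for this sketch; the description above states a fixed 10-second delay.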

Reported-by: Jeff Siddall [email protected]
Link: https://lore.kernel.org/linux-btrfs/[email protected]/

Signed-off-by: Qu Wenruo <[email protected]>