ZFS write regression on kernel 6.12 in Fedora 41 #17034

Open
eyedvabny opened this issue Feb 7, 2025 · 5 comments
Labels
Type: Defect Incorrect behavior (e.g. crash, hang)

Comments

@eyedvabny

eyedvabny commented Feb 7, 2025

System information

Type Version/Name
Distribution Name Fedora
Distribution Version 41
Kernel Version 6.8.5-301.fc40.x86_64 vs 6.12.11-200.fc41.x86_64
Architecture x86_64
OpenZFS Version zfs-2.3.0-1 zfs-kmod-2.3.0-1

Describe the problem you're observing

I just upgraded my installation of Fedora 40 running kernel 6.8.5 to Fedora 41 running kernel 6.12.11 and noticed a major decrease in write throughput.

I am copying a single 17 GB file from an ext4 SSD onto a 4-disk raidz1 HDD array. The command I'm running is rsync --info=progress2 test_file <destination> (I can rerun with fio if that's preferred).
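
If fio is preferred, a roughly equivalent sequential-write job would look something like the following (a sketch only; the size and directory just mirror my test file and pool mountpoint, and the job name is arbitrary):

# fio --name=seqwrite --directory=/export/media/misc \
      --rw=write --bs=1M --size=16g --numjobs=1 \
      --ioengine=psync --fsync_on_close=1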

Baseline: SSD -> SSD

# rsync --info=progress2 test_file test/
 18,199,210,856 100%  409.03MB/s    0:00:42 (xfr#1, to-chk=0/1)

Putting this here just to show that the file source is not the bottleneck.

SSD -> ZFS raidz1 on kernel 6.12.11

# rsync --info=progress2 test_file /export/media/misc/
 18,199,210,856 100%   12.65MB/s    0:22:52 (xfr#1, to-chk=0/1)

Results of zpool iostat -vyl 30 1 taken in the middle of the transfer:

                                                capacity     operations     bandwidth    total_wait     disk_wait    syncq_wait    asyncq_wait  scrub   trim  rebuild
pool                                          alloc   free   read  write   read  write   read  write   read  write   read  write   read  write   wait   wait   wait
--------------------------------------------  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----
datastore                                     9.36T  12.4T      0    128      0  12.6M      -  136ms      -  116ms      -      -      -   20ms      -      -      -
  raidz1-0                                    9.36T  12.4T      0    128      0  12.6M      -  136ms      -  116ms      -      -      -   20ms      -      -      -
    ata-WDC_WD120EFBX-68B0EN0_5QKTE22B            -      -      0     28      0  3.14M      -  261ms      -  219ms      -      -      -   43ms      -      -      -
    ata-WDC_WD60EFRX-68L0BN1_WD-WX11D388CR46      -      -      0     28      0  3.14M      -  257ms      -  215ms      -      -      -   43ms      -      -      -
    ata-WDC_WD120EFBX-68B0EN0_5QKTW1GB            -      -      0     44      0  3.14M      -   15ms      -   14ms      -      -      -    1ms      -      -      -
    ata-WDC_WD60EFAX-68SHWN0_WD-WX31D2992DH8      -      -      0     28      0  3.14M      -   82ms      -   76ms      -      -      -    6ms      -      -      -
--------------------------------------------  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----

It should not take 23 minutes to copy 17 GB. The disk waits are very long.
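
As a cross-check against the block layer, the same window can be sampled with plain iostat from sysstat while the copy runs, to see whether device-level service times match what zpool iostat reports (sketch; sda-sdd stand in for whatever device names the four pool members currently have):

# iostat -x sda sdb sdc sdd 30 2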

SSD -> ZFS raidz1 on kernel 6.8.5

I rebooted the machine back onto the 6.8.5 kernel that was retained from the upgrade. Everything else is the same.

# rsync --info=progress2 test_file /export/media/misc/
 18,199,210,856 100%   75.07MB/s    0:03:51 (xfr#1, to-chk=0/1)
                                                capacity     operations     bandwidth    total_wait     disk_wait    syncq_wait    asyncq_wait  scrub   trim  rebuild
pool                                          alloc   free   read  write   read  write   read  write   read  write   read  write   read  write   wait   wait   wait
--------------------------------------------  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----
datastore                                     9.38T  12.4T      3    509  74.4K  91.1M   24ms   42ms   24ms   35ms    1us   11ms      -    6ms      -      -      -
  raidz1-0                                    9.38T  12.4T      3    509  74.4K  91.1M   24ms   42ms   24ms   35ms    1us   11ms      -    6ms      -      -      -
    ata-WDC_WD120EFBX-68B0EN0_5QKTE22B            -      -      0    143  20.5K  22.8M   13ms   14ms   13ms   13ms    1us    1ms      -    1ms      -      -      -
    ata-WDC_WD60EFRX-68L0BN1_WD-WX11D388CR46      -      -      0    139  17.5K  22.8M   19ms   23ms   19ms   21ms    2us    2ms      -    2ms      -      -      -
    ata-WDC_WD120EFBX-68B0EN0_5QKTW1GB            -      -      0    144  19.9K  22.8M   10ms   16ms   10ms   14ms    2us    1ms      -    1ms      -      -      -
    ata-WDC_WD60EFAX-68SHWN0_WD-WX31D2992DH8      -      -      0     83  16.5K  22.8M   63ms  166ms   63ms  134ms    1us   38ms      -   30ms      -      -      -
--------------------------------------------  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----

I can see that my one WD60EFAX disk is the write bottleneck, but the overall disk bandwidth is much better than on 6.12.11. I don't have empirical pre-upgrade numbers, but these are on par with how the system behaved in the past.

Is there any known change between kernels 6.8 and 6.12 that could explain such a drop in write bandwidth?
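
One thing that could be compared under each kernel is the block-layer queue settings for the pool members and the OpenZFS write-throttle tunables; running something like this under both 6.8 and 6.12 and diffing the output might narrow it down (sketch; sda-sdd are placeholders for the actual pool disks):

# cat /sys/block/sd{a,b,c,d}/queue/scheduler
# cat /sys/block/sd{a,b,c,d}/queue/nr_requests
# grep . /sys/module/zfs/parameters/zfs_dirty_data_max /sys/module/zfs/parameters/zfs_txg_timeout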

ZFS config

NAME       PROPERTY              VALUE                  SOURCE
datastore  type                  filesystem             -
datastore  creation              Wed Aug 15 22:19 2018  -
datastore  used                  6.80T                  -
datastore  available             8.85T                  -
datastore  referenced            140K                   -
datastore  compressratio         1.02x                  -
datastore  mounted               yes                    -
datastore  quota                 none                   default
datastore  reservation           none                   default
datastore  recordsize            128K                   default
datastore  mountpoint            /datastore             default
datastore  sharenfs              off                    default
datastore  checksum              on                     default
datastore  compression           on                     local
datastore  atime                 off                    local
datastore  devices               on                     default
datastore  exec                  on                     default
datastore  setuid                on                     default
datastore  readonly              off                    default
datastore  zoned                 off                    default
datastore  snapdir               hidden                 default
datastore  aclmode               discard                default
datastore  aclinherit            restricted             default
datastore  createtxg             1                      -
datastore  canmount              on                     default
datastore  xattr                 on                     local
datastore  copies                1                      default
datastore  version               5                      -
datastore  utf8only              off                    -
datastore  normalization         none                   -
datastore  casesensitivity       sensitive              -
datastore  vscan                 off                    default
datastore  nbmand                off                    default
datastore  sharesmb              off                    default
datastore  refquota              none                   default
datastore  refreservation        none                   default
datastore  guid                  6385073794792376818    -
datastore  primarycache          all                    default
datastore  secondarycache        all                    default
datastore  usedbysnapshots       0B                     -
datastore  usedbydataset         140K                   -
datastore  usedbychildren        6.80T                  -
datastore  usedbyrefreservation  0B                     -
datastore  logbias               latency                default
datastore  objsetid              51                     -
datastore  dedup                 off                    default
datastore  mlslabel              none                   default
datastore  sync                  standard               default
datastore  dnodesize             legacy                 default
datastore  refcompressratio      1.00x                  -
datastore  written               140K                   -
datastore  logicalused           6.96T                  -
datastore  logicalreferenced     38.5K                  -
datastore  volmode               default                default
datastore  filesystem_limit      none                   default
datastore  snapshot_limit        none                   default
datastore  filesystem_count      none                   default
datastore  snapshot_count        none                   default
datastore  snapdev               hidden                 default
datastore  acltype               posix                  local
datastore  context               none                   default
datastore  fscontext             none                   default
datastore  defcontext            none                   default
datastore  rootcontext           none                   default
datastore  relatime              on                     default
datastore  redundant_metadata    all                    default
datastore  overlay               on                     default
datastore  encryption            off                    default
datastore  keylocation           none                   default
datastore  keyformat             none                   default
datastore  pbkdf2iters           0                      default
datastore  special_small_blocks  0                      default
datastore  prefetch              all                    default
datastore  direct                standard               default
datastore  longname              off                    default
NAME       PROPERTY                       VALUE                          SOURCE
datastore  size                           21.8T                          -
datastore  capacity                       43%                            -
datastore  altroot                        -                              default
datastore  health                         ONLINE                         -
datastore  guid                           16342129409258430014           -
datastore  version                        -                              default
datastore  bootfs                         -                              default
datastore  delegation                     on                             default
datastore  autoreplace                    off                            default
datastore  cachefile                      -                              default
datastore  failmode                       wait                           default
datastore  listsnapshots                  off                            default
datastore  autoexpand                     on                             local
datastore  dedupratio                     1.00x                          -
datastore  free                           12.4T                          -
datastore  allocated                      9.36T                          -
datastore  readonly                       off                            -
datastore  ashift                         0                              default
datastore  comment                        -                              default
datastore  expandsize                     -                              -
datastore  freeing                        0                              -
datastore  fragmentation                  4%                             -
datastore  leaked                         0                              -
datastore  multihost                      off                            default
datastore  checkpoint                     -                              -
datastore  load_guid                      3769170939608627919            -
datastore  autotrim                       off                            default
datastore  compatibility                  off                            default
datastore  bcloneused                     0                              -
datastore  bclonesaved                    0                              -
datastore  bcloneratio                    1.00x                          -
datastore  dedup_table_size               0                              -
datastore  dedup_table_quota              auto                           default
datastore  last_scrubbed_txg              0                              -
datastore  feature@async_destroy          enabled                        local
datastore  feature@empty_bpobj            active                         local
datastore  feature@lz4_compress           active                         local
datastore  feature@multi_vdev_crash_dump  enabled                        local
datastore  feature@spacemap_histogram     active                         local
datastore  feature@enabled_txg            active                         local
datastore  feature@hole_birth             active                         local
datastore  feature@extensible_dataset     active                         local
datastore  feature@embedded_data          active                         local
datastore  feature@bookmarks              enabled                        local
datastore  feature@filesystem_limits      enabled                        local
datastore  feature@large_blocks           enabled                        local
datastore  feature@large_dnode            enabled                        local
datastore  feature@sha512                 enabled                        local
datastore  feature@skein                  enabled                        local
datastore  feature@edonr                  enabled                        local
datastore  feature@userobj_accounting     active                         local
datastore  feature@encryption             enabled                        local
datastore  feature@project_quota          active                         local
datastore  feature@device_removal         enabled                        local
datastore  feature@obsolete_counts        enabled                        local
datastore  feature@zpool_checkpoint       enabled                        local
datastore  feature@spacemap_v2            active                         local
datastore  feature@allocation_classes     enabled                        local
datastore  feature@resilver_defer         enabled                        local
datastore  feature@bookmark_v2            enabled                        local
datastore  feature@redaction_bookmarks    enabled                        local
datastore  feature@redacted_datasets      enabled                        local
datastore  feature@bookmark_written       enabled                        local
datastore  feature@log_spacemap           active                         local
datastore  feature@livelist               enabled                        local
datastore  feature@device_rebuild         enabled                        local
datastore  feature@zstd_compress          enabled                        local
datastore  feature@draid                  enabled                        local
datastore  feature@zilsaxattr             active                         local
datastore  feature@head_errlog            active                         local
datastore  feature@blake3                 enabled                        local
datastore  feature@block_cloning          enabled                        local
datastore  feature@vdev_zaps_v2           active                         local
datastore  feature@redaction_list_spill   enabled                        local
datastore  feature@raidz_expansion        enabled                        local
datastore  feature@fast_dedup             enabled                        local
datastore  feature@longname               enabled                        local
datastore  feature@large_microzap         enabled                        local
datastore:
    version: 5000
    name: 'datastore'
    state: 0
    txg: 40251566
    pool_guid: 16342129409258430014
    errata: 0
    hostname: '<redacted>'
    com.delphix:has_per_vdev_zaps
    vdev_children: 1
    vdev_tree:
        type: 'root'
        id: 0
        guid: 16342129409258430014
        com.klarasystems:vdev_zap_root: 262
        children[0]:
            type: 'raidz'
            id: 0
            guid: 10961123018087333276
            nparity: 1
            metaslab_array: 134
            metaslab_shift: 37
            ashift: 12
            asize: 24004641423360
            is_log: 0
            create_txg: 4
            com.delphix:vdev_zap_top: 129
            children[0]:
                type: 'disk'
                id: 0
                guid: 11253126901744516510
                path: '/dev/disk/by-id/ata-WDC_WD120EFBX-68B0EN0_5QKTE22B-part1'
                devid: 'ata-WDC_WD120EFBX-68B0EN0_5QKTE22B-part1'
                phys_path: 'pci-0000:00:1f.2-ata-1.0'
                whole_disk: 1
                DTL: 394
                create_txg: 4
                com.delphix:vdev_zap_leaf: 390
            children[1]:
                type: 'disk'
                id: 1
                guid: 7030484083305717953
                path: '/dev/disk/by-id/ata-WDC_WD60EFRX-68L0BN1_WD-WX11D388CR46-part1'
                devid: 'ata-WDC_WD60EFRX-68L0BN1_WD-WX11D388CR46-part1'
                phys_path: 'pci-0000:00:1f.2-ata-2.0'
                whole_disk: 1
                DTL: 254
                create_txg: 4
                com.delphix:vdev_zap_leaf: 131
            children[2]:
                type: 'disk'
                id: 2
                guid: 14179193750736776206
                path: '/dev/disk/by-id/ata-WDC_WD120EFBX-68B0EN0_5QKTW1GB-part1'
                devid: 'ata-WDC_WD120EFBX-68B0EN0_5QKTW1GB-part1'
                phys_path: 'pci-0000:00:1f.2-ata-3.0'
                whole_disk: 1
                DTL: 9669
                create_txg: 4
                com.delphix:vdev_zap_leaf: 10910
            children[3]:
                type: 'disk'
                id: 3
                guid: 16638866350254836057
                path: '/dev/disk/by-id/ata-WDC_WD60EFAX-68SHWN0_WD-WX31D2992DH8-part1'
                devid: 'ata-WDC_WD60EFAX-68SHWN0_WD-WX31D2992DH8-part1'
                phys_path: 'pci-0000:00:1f.2-ata-4.0'
                whole_disk: 1
                DTL: 408
                create_txg: 4
                com.delphix:vdev_zap_leaf: 400
    features_for_read:
        com.delphix:hole_birth
        com.delphix:embedded_data
        com.klarasystems:vdev_zaps_v2
@eyedvabny eyedvabny added the Type: Defect Incorrect behavior (e.g. crash, hang) label Feb 7, 2025
@xeghia

xeghia commented Feb 7, 2025

Fedora 41/6.12 got my attention as I've just upgraded myself.

WD60EFAX drives are SMR, which will tank performance, but the kernel change is interesting. It could well be that booting back to 6.8 gives the drive more time to soak up writes before performance inevitably drops off.

Try both kernels but monitor performance closely. If performance is good for a while and then suddenly drops off, that is just the joy of using an SMR drive, unfortunately. The other one, the WD60EFRX, is CMR, so it should be fine.

It looks like you're in the process of replacing them all with WD120EFBX drives? If so, replace that slow drive next, or swap it for a WD60EFRX if you have one spare.
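
If you want to double-check which pool member maps to which model before deciding what to replace, something like this should do it (sdX is a placeholder for the device in question):

# lsblk -d -o NAME,MODEL,SERIAL,SIZE
# smartctl -i /dev/sdX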

@eyedvabny
Author

I was expecting the SMR drive to be the bottleneck, and it is under the 6.8 kernel, so I could probably get even faster throughput once I replace it. But for some reason, under 6.12 it's the least-slow drive. The first two drives in the array have 100+ ms write waits.

@amotin
Member

amotin commented Feb 7, 2025

But for some reason under 6.12 it's the least-slow drive.

Considering how complicated and unpredictable SMR drive firmware can be, it is not obvious to me that the kernel version has anything to do with it, unless you tested switching back and forth a dozen times in an unpredictable order. Otherwise it could simply be that different drives started some internal housekeeping at different times.
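
A rough way to separate the two would be to repeat the same copy several times under each kernel and see whether throughput stays flat or decays over the runs, e.g. (sketch; the run$i file names are just placeholders):

# for i in $(seq 5); do rsync --info=progress2 test_file /export/media/misc/run$i; done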

@wiesl

wiesl commented Feb 7, 2025

I cannot reproduce it on a raidz2-0 pool with 10 physical hard drives. The machine has less RAM than the amount of data written.

zfs version
zfs-2.2.7-1
zfs-kmod-2.2.7-1

[root@hostname ~]# uname -a
Linux hostname 6.12.12-200.fc41.x86_64 #1 SMP PREEMPT_DYNAMIC Sat Feb  1 19:02:08 UTC 2025 x86_64 GNU/Linux

[root@hostname ~]# dd if=/dev/urandom of=/shares/path/test.bin bs=131072 count=330000
330000+0 records in
330000+0 records out
43253760000 bytes (43 GB, 40 GiB) copied, 96.8703 s, 447 MB/s
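
For a number less affected by the page cache, the same dd test can be run with an fsync at the end (a sketch, same path as above):

# dd if=/dev/urandom of=/shares/path/test.bin bs=131072 count=330000 conv=fsync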

@mattbat45

Reproducible on my F41 machine with these kernel/zfs combos:

  1. zfs-2.2.7-1.fc41.x86_64 and kernel-6.12.5-200.fc41.x86_64
  2. zfs-2.2.7-1.fc41.x86_64 with older kernel kernel-6.10.11-200.fc40.x86_64
  3. latest zfs zfs-2.3.0-1.fc41.x86_64 with latest kernel kernel-6.12.11-200.fc41.x86_64

I have a mirrored pool with 18TB Exos drives, and it exhibits performance similar to the OP's, with asyncq_wait in the hundreds of ms:

$> zpool iostat -vyl 30 1
                                                      capacity     operations     bandwidth    total_wait     disk_wait    syncq_wait    asyncq_wait  scrub   trim  rebuild
pool                                                alloc   free   read  write   read  write   read  write   read  write   read  write   read  write   wait   wait   wait
--------------------------------------------------  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----
datapool                                            38.0T  76.5T      0     56    682  44.7M   16ms  213ms   16ms   12ms    3us    1us      -  200ms      -      -      -
  mirror-0                                          5.44T  10.9T      0      8      0  6.10M      -  195ms      -   12ms      -    1us      -  177ms      -      -      -
    ata-ST18000NM000J-2TV103_...-part1                  -      -      0      4      0  3.05M      -  191ms      -   12ms      -    1us      -  174ms      -      -      -
    ata-ST18000NM000J-2TV103_...-part1                  -      -      0      4      0  3.05M      -  198ms      -   13ms      -    1us      -  180ms      -      -      -
  mirror-1                                          5.44T  10.9T      0      7      0  6.23M      -  213ms      -   11ms      -    1us      -  201ms      -      -      -
    ata-ST18000NM000J-2TV103_...-part1                  -      -      0      3      0  3.11M      -  211ms      -   11ms      -    1us      -  196ms      -      -      -
    ata-ST18000NM000J-2TV103_...-part1                  -      -      0      3      0  3.11M      -  216ms      -   11ms      -    1us      -  206ms      -      -      -
  mirror-2                                          5.45T  10.9T      0      7      0  5.95M      -  128ms      -   10ms      -    1us      -  125ms      -      -      -
    ata-ST18000NM000J-2TV103_...-part1                  -      -      0      3      0  2.97M      -  122ms      -   10ms      -    1us      -  119ms      -      -      -
    ata-ST18000NM000J-2TV103_...-part1                  -      -      0      3      0  2.97M      -  135ms      -   10ms      -    1us      -  130ms      -      -      -
  mirror-3                                          5.45T  10.9T      0      8      0  6.15M      -  254ms      -   11ms      -      -      -  235ms      -      -      -
    ata-ST18000NM000J-2TV103_...-part1                  -      -      0      4      0  3.08M      -  246ms      -   11ms      -      -      -  216ms      -      -      -
    ata-ST18000NM000J-2TV103_...-part1                  -      -      0      3      0  3.08M      -  262ms      -   12ms      -      -      -  254ms      -      -      -
  mirror-4                                          5.30T  11.1T      0      9    136  7.77M   12ms  283ms   12ms   13ms    3us      -      -  272ms      -      -      -
    ata-ST18000NM0092-3CX103_...-part1                  -      -      0      4    136  3.88M   12ms  280ms   12ms   13ms    3us      -      -  271ms      -      -      -
    ata-ST18000NM0092-3CX103_...-part1                  -      -      0      4      0  3.88M      -  286ms      -   13ms      -      -      -  274ms      -      -      -
  mirror-5                                          5.45T  10.9T      0      7    409  6.26M   12ms  227ms   12ms   12ms    3us      -      -  199ms      -      -      -
    ata-ST18000NM000J-2TV103_...-part1                  -      -      0      3      0  3.13M      -  200ms      -   11ms      -      -      -  151ms      -      -      -
    ata-ST18000NM000J-2TV103_...-part1                  -      -      0      3    409  3.13M   12ms  254ms   12ms   13ms    3us      -      -  246ms      -      -      -
  mirror-6                                          5.44T  10.9T      0      6    136  6.23M   25ms  173ms   25ms   12ms    3us      -      -  169ms      -      -      -
    ata-ST18000NM000J-2TV103_...-part1                  -      -      0      3    136  3.11M   25ms  174ms   25ms   11ms    3us      -      -  169ms      -      -      -
    ata-ST18000NM000J-2TV103_...-part1                  -      -      0      3      0  3.11M      -  173ms      -   12ms      -      -      -  169ms      -      -      -
--------------------------------------------------  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----

I don't have any kernels old enough to test where performance was back to normal, but I'm struggling to write more than ~40MB/s on a pool that used to sustain several hundred MB/s of write throughput. Read speed seems normal.
