ZFS 2.3.0 ignores zfs_arc_max, exhausts system memory #17052

Open
brian-maloney opened this issue Feb 13, 2025 · 11 comments
Labels
Type: Defect Incorrect behavior (e.g. crash, hang)

Comments

@brian-maloney

System information

Type                    Version/Name
Distribution Name       Arch Linux
Distribution Version    Rolling Release
Kernel Version          6.12.13
Architecture            x86_64
OpenZFS Version         2.3.0

Describe the problem you're observing

Possibly a duplicate of #16325, but I didn't see the behavior with the versions listed in that bug.

After upgrading from kernel 6.6.63 with ZFS 2.2.6 to kernel 6.12.13 with ZFS 2.3.0, I am experiencing issues with ARC memory utilization when my system's daily backups run. My system has 16 GB of RAM and I have zfs_arc_max set to 2147483648 (2 GiB). When the backup runs now, the ARC grows well past that limit (4.6 GiB in the capture below) and the system locks up due to memory exhaustion.

I can help the system get through the backup by using /proc/sys/vm/drop_caches to clear the cache manually. I'm attempting to tune other ZFS settings to see if there's another way around it, but I've never seen the ARC exceed zfs_arc_max by this much before.
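
For reference, the workaround looks roughly like this (standard Linux/ZFS paths on my system; the 3 written to drop_caches clears the pagecache plus reclaimable dentries/inodes):

    # configured cap vs. actual ARC size, both in bytes
    cat /sys/module/zfs/parameters/zfs_arc_max
    awk '$1 == "size" {print $3}' /proc/spl/kstat/zfs/arcstats

    # clear caches by hand while the backup is running (as root)
    echo 3 > /proc/sys/vm/drop_caches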

Describe how to reproduce the problem

This is reproducible any time I run a backup task.

Include any warning/errors/backtraces from the system logs

Here's an arc_summary -d run exhibiting the issue:


------------------------------------------------------------------------
ZFS Subsystem Report                            Wed Feb 12 14:34:22 2025
Linux 6.12.13-1-lts                                              2.3.0-1
Machine: industry (x86_64)                                       2.3.0-1

ARC status:
        Total memory size:                                      15.6 GiB
        Min target size:                                3.1 %  499.6 MiB
        Max target size:                               12.8 %    2.0 GiB
        Target size (adaptive):                       229.6 %    2.0 GiB
        Current size:                                 229.6 %    4.6 GiB
        Free memory size:                                        1.4 GiB
        Available memory size:                                 804.8 MiB

ARC structural breakdown (current size):                         4.6 GiB
        Compressed size:                               47.5 %    2.2 GiB
        Overhead size:                                 26.0 %    1.2 GiB
        Bonus size:                                     9.8 %  459.5 MiB
        Dnode size:                                    11.2 %  526.9 MiB
        Dbuf size:                                      4.9 %  229.5 MiB
        Header size:                                    0.6 %   29.1 MiB
        L2 header size:                                 0.0 %    0 Bytes
        ABD chunk waste size:                         < 0.1 %  148.0 KiB

ARC types breakdown (compressed + overhead):                     3.4 GiB
        Data size:                                      2.5 %   84.8 MiB
        Metadata size:                                 97.5 %    3.3 GiB

ARC states breakdown (compressed + overhead):                    3.4 GiB
        Anonymous data size:                            0.3 %   11.3 MiB
        Anonymous metadata size:                      < 0.1 %  942.0 KiB
        MFU data target:                               35.3 %    1.2 GiB
        MFU data size:                                  1.9 %   65.5 MiB
        MFU evictable data size:                        0.1 %    2.5 MiB
        MFU ghost data size:                                   460.4 MiB
        MFU metadata target:                           13.4 %  461.9 MiB
        MFU metadata size:                             80.3 %    2.7 GiB
        MFU evictable metadata size:                  < 0.1 %   56.0 KiB
        MFU ghost metadata size:                               329.7 MiB
        MRU data target:                               36.6 %    1.2 GiB
        MRU data size:                                  0.2 %    8.1 MiB
        MRU evictable data size:                        0.0 %    0 Bytes
        MRU ghost data size:                                   915.5 MiB
        MRU metadata target:                           14.8 %  511.6 MiB
        MRU metadata size:                             17.2 %  595.5 MiB
        MRU evictable metadata size:                  < 0.1 %  176.0 KiB
        MRU ghost metadata size:                               191.4 MiB
        Uncached data size:                             0.0 %    0 Bytes
        Uncached metadata size:                         0.0 %    0 Bytes

ARC hash breakdown:
        Elements:                                                 117.1k
        Collisions:                                                14.2k
        Chain max:                                                     3
        Chains:                                                     3.2k

ARC misc:
        Memory throttles:                                              0
        Memory direct reclaims:                                        0
        Memory indirect reclaims:                                      0
        Deleted:                                                  169.2k
        Mutex misses:                                               9.2k
        Eviction skips:                                            14.3M
        Eviction skips due to L2 writes:                               0
        L2 cached evictions:                                     0 Bytes
        L2 eligible evictions:                                   4.7 GiB
        L2 eligible MFU evictions:                     19.0 %  916.8 MiB
        L2 eligible MRU evictions:                     81.0 %    3.8 GiB
        L2 ineligible evictions:                               716.9 MiB

ARC total accesses:                                                10.1M
        Total hits:                                    96.8 %       9.8M
        Total I/O hits:                                 0.3 %      26.7k
        Total misses:                                   3.0 %     300.4k

ARC demand data accesses:                              63.7 %       6.4M
        Demand data hits:                              97.3 %       6.3M
        Demand data I/O hits:                         < 0.1 %       3.2k
        Demand data misses:                             2.7 %     172.7k

ARC demand metadata accesses:                          35.1 %       3.5M
        Demand metadata hits:                          99.2 %       3.5M
        Demand metadata I/O hits:                       0.3 %       9.7k
        Demand metadata misses:                         0.5 %      18.6k

ARC prefetch data accesses:                             0.3 %      26.6k
        Prefetch data hits:                             5.4 %       1.4k
        Prefetch data I/O hits:                       < 0.1 %          5
        Prefetch data misses:                          94.6 %      25.2k

ARC prefetch metadata accesses:                         1.0 %      99.5k
        Prefetch metadata hits:                         1.8 %       1.8k
        Prefetch metadata I/O hits:                    13.8 %      13.8k
        Prefetch metadata misses:                      84.4 %      83.9k

ARC predictive prefetches:                             99.9 %     125.9k
        Demand hits after predictive:                  62.2 %      78.3k
        Demand I/O hits after predictive:              14.7 %      18.6k
        Never demanded after predictive:               23.0 %      29.0k

ARC prescient prefetches:                               0.1 %        182
        Demand hits after prescient:                   65.4 %        119
        Demand I/O hits after prescient:               34.6 %         63
        Never demanded after prescient:                 0.0 %          0

ARC states hits of all accesses:
        Most frequently used (MFU):                    84.7 %       8.6M
        Most recently used (MRU):                      12.1 %       1.2M
        Most frequently used (MFU) ghost:               0.1 %       6.6k
        Most recently used (MRU) ghost:                 0.1 %      12.6k
        Uncached:                                       0.0 %          0

DMU predictive prefetcher calls:                                    2.7M
        Stream hits:                                   67.2 %       1.8M
        Hits ahead of stream:                           3.2 %      88.5k
        Hits behind stream:                            24.9 %     682.7k
        Stream misses:                                  4.6 %     126.5k
        Streams limit reached:                         51.6 %      65.3k
        Stream strides:                                              992
        Prefetches issued                                          26.9k

L2ARC not detected, skipping section

Solaris Porting Layer (SPL):
        # The system hostid.
        spl_hostid                                                     0
        # The system hostid file (/etc/hostid)
        spl_hostid_path                                      /etc/hostid
        # Maximum size in bytes for a kmem_alloc()
        spl_kmem_alloc_max                                       1048576
        # Warning threshold in bytes for a kmem_alloc()
        spl_kmem_alloc_warn                                        65536
        # Number of spl_kmem_cache threads
        spl_kmem_cache_kmem_threads                                    4
        # Default magazine size (2-256), set automatically (0)
        spl_kmem_cache_magazine_size                                   0
        # Maximum size of slab in MB
        spl_kmem_cache_max_size                                       32
        # Number of objects per slab
        spl_kmem_cache_obj_per_slab                                    8
        # Objects less than N bytes use the Linux slab
        spl_kmem_cache_slab_limit                                  16384
        # Cause kernel panic on assertion failures
        spl_panic_halt                                                 0
        # schedule_hrtimeout_range() delta/slack value in us, default
        spl_schedule_hrtimeout_slack_us                                0
        # Write nonzero to kick stuck taskqs to spawn more threads
        spl_taskq_kick                                                 0
        # Bind taskq thread to CPU by default
        spl_taskq_thread_bind                                          0
        # Allow dynamic taskq threads
        spl_taskq_thread_dynamic                                       1
        # Allow non-default priority for taskq threads
        spl_taskq_thread_priority                                      1
        # Create new taskq threads after N sequential tasks
        spl_taskq_thread_sequential                                    4
        # Minimum idle threads exit interval for dynamic taskqs
        spl_taskq_thread_timeout_ms                                 5000

Tunables:
        # BRT ZAP leaf blockshift
        brt_zap_default_bs                                            12
        # BRT ZAP indirect blockshift
        brt_zap_default_ibs                                           12
        # Enable prefetching of BRT ZAP entries
        brt_zap_prefetch                                               1
        # Percentage over dbuf_cache_max_bytes for direct dbuf eviction.
        dbuf_cache_hiwater_pct                                        10
        # Percentage below dbuf_cache_max_bytes when dbuf eviction stops.
        dbuf_cache_lowater_pct                                        10
        # Maximum size in bytes of the dbuf cache.
        dbuf_cache_max_bytes                        18446744073709551615
        # Set size of dbuf cache to log2 fraction of arc size.
        dbuf_cache_shift                                               5
        # Maximum size in bytes of dbuf metadata cache.
        dbuf_metadata_cache_max_bytes               18446744073709551615
        # Set size of dbuf metadata cache to log2 fraction of arc size.
        dbuf_metadata_cache_shift                                      6
        # Set size of dbuf cache mutex array as log2 shift.
        dbuf_mutex_cache_shift                                         0
        # DDT ZAP leaf blockshift
        ddt_zap_default_bs                                            15
        # DDT ZAP indirect blockshift
        ddt_zap_default_ibs                                           15
        # Override copies= for dedup objects
        dmu_ddt_copies                                                 0
        # CPU-specific allocator grabs 2^N objects at once
        dmu_object_alloc_chunk_shift                                   7
        # Limit one prefetch call to this size
        dmu_prefetch_max                                       134217728
        # Select aes implementation.
        icp_aes_impl                      cycle [fastest] generic x86_64
        # How many bytes to process while owning the FPU
        icp_gcm_avx_chunk_size                                     32736
        # Select gcm implementation.
        icp_gcm_impl                             cycle [fastest] generic
        # Alias for send_holes_without_birth_time
        ignore_hole_birth                                              1
        # Exclude dbufs on special vdevs from being cached to L2ARC if set.
        l2arc_exclude_special                                          0
        # Turbo L2ARC warmup
        l2arc_feed_again                                               1
        # Min feed interval in milliseconds
        l2arc_feed_min_ms                                            200
        # Seconds between L2ARC writing
        l2arc_feed_secs                                                1
        # Number of max device writes to precache
        l2arc_headroom                                                 8
        # Compressed l2arc_headroom multiplier
        l2arc_headroom_boost                                         200
        # Percent of ARC size allowed for L2ARC-only headers
        l2arc_meta_percent                                            33
        # Cache only MFU data from ARC into L2ARC
        l2arc_mfuonly                                                  0
        # Skip caching prefetched buffers
        l2arc_noprefetch                                               1
        # No reads during writes
        l2arc_norw                                                     0
        # Min size in bytes to write rebuild log blocks in L2ARC
        l2arc_rebuild_blocks_min_l2size                       1073741824
        # Rebuild the L2ARC when importing a pool
        l2arc_rebuild_enabled                                          1
        # TRIM ahead L2ARC write size multiplier
        l2arc_trim_ahead                                               0
        # Extra write bytes during device warmup
        l2arc_write_boost                                       33554432
        # Max write bytes per interval
        l2arc_write_max                                         33554432
        # Allocation granularity
        metaslab_aliquot                                         1048576
        # Enable metaslab group biasing
        metaslab_bias_enabled                                          1
        # Load all metaslabs when pool is first opened
        metaslab_debug_load                                            0
        # Prevent metaslabs from being unloaded
        metaslab_debug_unload                                          0
        # Max distance
        metaslab_df_max_search                                  16777216
        # When looking in size tree, use largest segment instead of exact fit
        metaslab_df_use_largest_segment                                0
        # Blocks larger than this size are sometimes forced to be gang blocks
        metaslab_force_ganging                                  16777217
        # Percentage of large blocks that will be forced to be gang blocks
        metaslab_force_ganging_pct                                     3
        # Use the fragmentation metric to prefer less fragmented metaslabs
        metaslab_fragmentation_factor_enabled                          1
        # Prefer metaslabs with lower LBAs
        metaslab_lba_weighting_enabled                                 1
        # Preload potential metaslabs during reassessment
        metaslab_preload_enabled                                       1
        # Max number of metaslabs per group to preload
        metaslab_preload_limit                                        10
        # Percentage of CPUs to run a metaslab preload taskq
        metaslab_preload_pct                                          50
        # Delay in txgs after metaslab was last used before unloading
        metaslab_unload_delay                                         32
        # Delay in milliseconds after metaslab was last used before unloading
        metaslab_unload_delay_ms                                  600000
        # Max amount of concurrent i/o for RAIDZ expansion
        raidz_expand_max_copy_bytes                            167772160
        # For testing, pause RAIDZ expansion after reflowing this many bytes
        raidz_expand_max_reflow_bytes                                  0
        # For expanded RAIDZ, aggregate reads that have more rows than this
        raidz_io_aggregate_rows                                        4
        # Ignore hole_birth txg for zfs send
        send_holes_without_birth_time                                  1
        # SPA size estimate multiplication factor
        spa_asize_inflation                                           24
        # SPA config file
        spa_config_path                             /etc/zfs/zpool.cache
        # Minimum number of CPUs per allocators
        spa_cpus_per_allocator                                         4
        # Print vdev tree to zfs_dbgmsg during pool import
        spa_load_print_vdev_tree                                       0
        # Set to traverse data on pool import
        spa_load_verify_data                                           1
        # Set to traverse metadata on pool import
        spa_load_verify_metadata                                       1
        # log2 fraction of arc that can be used by inflight I/Os when verifying pool during import
        spa_load_verify_shift                                          4
        # Number of allocators per spa
        spa_num_allocators                                             4
        # Reserved free space in pool
        spa_slop_shift                                                 5
        # Limit the number of errors which will be upgraded to the new on-disk error log when enabling head_errlog
        spa_upgrade_errlog_limit                                       0
        # Logical ashift for file-based devices
        vdev_file_logical_ashift                                       9
        # Physical ashift for file-based devices
        vdev_file_physical_ashift                                      9
        # Largest span of free chunks a remap segment can span
        vdev_removal_max_span                                      32768
        # Bypass vdev_validate
        vdev_validate_skip                                             0
        # When iterating ZAP object, prefetch it
        zap_iterate_prefetch                                           1
        # Maximum micro ZAP size before converting to a fat ZAP, in bytes
        zap_micro_max_size                                        131072
        # Enable ZAP shrinking
        zap_shrink_enabled                                             1
        # Max log2 fraction of holes in a stream
        zfetch_hole_shift                                              2
        # Max bytes to prefetch per stream
        zfetch_max_distance                                     67108864
        # Max bytes to prefetch indirects for per stream
        zfetch_max_idistance                                   134217728
        # Max request reorder distance within a stream
        zfetch_max_reorder                                      16777216
        # Max time before stream delete
        zfetch_max_sec_reap                                            2
        # Max number of streams per zfetch
        zfetch_max_streams                                             8
        # Min bytes to prefetch per stream
        zfetch_min_distance                                      4194304
        # Min time before stream reclaim
        zfetch_min_sec_reap                                            1
        # Toggle whether ABD allocations must be linear.
        zfs_abd_scatter_enabled                                        1
        # Maximum order allocation used for a scatter ABD.
        zfs_abd_scatter_max_order                                      9
        # Minimum size of scatter allocations.
        zfs_abd_scatter_min_size                                    1536
        # SPA active allocator
        zfs_active_allocator                                     dynamic
        # Enable mkdir/rmdir/mv in .zfs/snapshot
        zfs_admin_snapshot                                             0
        # Allow mounting of redacted datasets
        zfs_allow_redacted_dataset_mount                               0
        # Target average block size
        zfs_arc_average_blocksize                                   8192
        # Minimum bytes of dnodes in ARC
        zfs_arc_dnode_limit                                            0
        # Percent of ARC meta buffers for dnodes
        zfs_arc_dnode_limit_percent                                   10
        # Percentage of excess dnodes to try to unpin
        zfs_arc_dnode_reduce_percent                                  10
        # The number of headers to evict per sublist before moving to the next
        zfs_arc_evict_batch_limit                                     10
        # When full, ARC allocation waits for eviction of this % of alloc size
        zfs_arc_eviction_pct                                         200
        # Seconds before growing ARC size
        zfs_arc_grow_retry                                             0
        # System free memory I/O throttle in bytes
        zfs_arc_lotsfree_percent                                      10
        # Maximum ARC size in bytes
        zfs_arc_max                                           2147483648
        # Balance between metadata and data on ghost hits.
        zfs_arc_meta_balance                                         500
        # Minimum ARC size in bytes
        zfs_arc_min                                                    0
        # Min life of prefetch block in ms
        zfs_arc_min_prefetch_ms                                        0
        # Min life of prescient prefetched block in ms
        zfs_arc_min_prescient_prefetch_ms                              0
        # Percent of pagecache to reclaim ARC to
        zfs_arc_pc_percent                                             0
        # Number of arc_prune threads
        zfs_arc_prune_task_threads                                     1
        # log2
        zfs_arc_shrink_shift                                           0
        # Limit on number of pages that ARC shrinker can reclaim at once
        zfs_arc_shrinker_limit                                         0
        # Relative cost of ARC eviction vs other kernel subsystems
        zfs_arc_shrinker_seeks                                         2
        # System free memory target size in bytes
        zfs_arc_sys_free                                               0
        # Max number of blocks freed in one txg
        zfs_async_block_max_blocks                  18446744073709551615
        # Disable pool import at module load
        zfs_autoimport_disable                                         1
        # Enable block cloning
        zfs_bclone_enabled                                             1
        # Wait for dirty blocks when cloning
        zfs_bclone_wait_dirty                                          0
        # Select BLAKE3 implementation.
        zfs_blake3_impl                     cycle [fastest] generic sse2
        # Enable btree verification. Levels above 4 require ZFS be built with debugging
        zfs_btree_verify_intensity                                     0
        # Rate limit checksum events to this many checksum errors per second
        zfs_checksum_events_per_second                                20
        # ZIL block open timeout percentage
        zfs_commit_timeout_pct                                        10
        # Disable compressed ARC buffers
        zfs_compressed_arc_enabled                                     1
        # Used by tests to ensure certain actions happen in the middle of a condense. A maximum value of 1 should be sufficient.
        zfs_condense_indirect_commit_entry_delay_ms                    0
        # Minimum obsolete percent of bytes in the mapping to attempt condensing
        zfs_condense_indirect_obsolete_pct                            25
        # Whether to attempt condensing indirect vdev mappings
        zfs_condense_indirect_vdevs_enable                             1
        # Minimum size obsolete spacemap to attempt condensing
        zfs_condense_max_obsolete_bytes                       1073741824
        # Don't bother condensing if the mapping uses less than this amount of memory
        zfs_condense_min_mapping_bytes                            131072
        # Enable ZFS debug message log
        zfs_dbgmsg_enable                                              1
        # Maximum ZFS debug log size
        zfs_dbgmsg_maxsize                                       4194304
        # Calculate arc header index
        zfs_dbuf_state_index                                           0
        # Place DDT data into the special class
        zfs_ddt_data_is_special                                        1
        # Dead I/O check interval in milliseconds
        zfs_deadman_checktime_ms                                   60000
        # Enable deadman timer
        zfs_deadman_enabled                                            1
        # Rate limit hung IO
        zfs_deadman_events_per_second                                  1
        # Failmode for deadman timer
        zfs_deadman_failmode                                        wait
        # Pool sync expiration time in milliseconds
        zfs_deadman_synctime_ms                                   600000
        # IO expiration time in milliseconds
        zfs_deadman_ziotime_ms                                    300000
        # Min number of log entries to flush each transaction
        zfs_dedup_log_flush_entries_min                             1000
        # Number of txgs to average flow rates across
        zfs_dedup_log_flush_flow_rate_txgs                            10
        # Min time to spend on incremental dedup log flush each transaction
        zfs_dedup_log_flush_min_time_ms                             1000
        # Max number of incremental dedup log flush passes per transaction
        zfs_dedup_log_flush_passes_max                                 8
        # Max memory for dedup logs
        zfs_dedup_log_mem_max                                  167635763
        # Max memory for dedup logs, as % of total memory
        zfs_dedup_log_mem_max_percent                                  1
        # Max transactions before starting to flush dedup logs
        zfs_dedup_log_txg_max                                          8
        # Enable prefetching dedup-ed blks
        zfs_dedup_prefetch                                             0
        # Default dnode block shift
        zfs_default_bs                                                 9
        # Default dnode indirect block shift
        zfs_default_ibs                                               17
        # Transaction delay threshold
        zfs_delay_min_dirty_percent                                   60
        # How quickly delay approaches infinity
        zfs_delay_scale                                           500000
        # Delete files larger than N blocks async
        zfs_delete_blocks                                          20480
        # Enable Direct I/O
        zfs_dio_enabled                                                1
        # Rate Direct I/O write verify events to this many per second
        zfs_dio_write_verify_events_per_second                        20
        # Determines the dirty space limit
        zfs_dirty_data_max                                    1676357632
        # zfs_dirty_data_max upper bound in bytes
        zfs_dirty_data_max_max                                4190894080
        # zfs_dirty_data_max upper bound as % of RAM
        zfs_dirty_data_max_max_percent                                25
        # Max percent of RAM allowed to be dirty
        zfs_dirty_data_max_percent                                    10
        # Dirty data txg sync threshold as a percentage of zfs_dirty_data_max
        zfs_dirty_data_sync_percent                                   20
        # Set to allow raw receives without IVset guids
        zfs_disable_ivset_guid_check                                   0
        # Enable forcing txg sync to find holes
        zfs_dmu_offset_next_sync                                       1
        # Minimum number of metaslabs required to dedicate one for log blocks
        zfs_embedded_slog_min_ms                                      64
        # Seconds to expire .zfs/snapshot
        zfs_expire_snapshot                                          300
        # Percentage of length to use for the available capacity check
        zfs_fallocate_reserve_percent                                110
        # Set additional debugging flags
        zfs_flags                                                      0
        # Select fletcher 4 implementation.
        zfs_fletcher_4_impl [fastest] scalar superscalar superscalar4 sse2 ssse3
        # Enable processing of the free_bpobj
        zfs_free_bpobj_enabled                                         1
        # Set to ignore IO errors during free and permanently leak the space
        zfs_free_leak_on_eio                                           0
        # Min millisecs to free per txg
        zfs_free_min_time_ms                                        1000
        # Maximum size in bytes of ZFS ioctl output that will be logged
        zfs_history_output_max                                   1048576
        # Largest data block to write to zil
        zfs_immediate_write_sz                                     32768
        # Size in bytes of writes by zpool initialize
        zfs_initialize_chunk_size                                1048576
        # Value written during zpool initialize
        zfs_initialize_value                        16045690984833335022
        # Prevent the log spacemaps from being flushed and destroyed during pool export/destroy
        zfs_keep_log_spacemaps_at_export                               0
        # Max number of times a salt value can be used for generating encryption keys before it is rotated
        zfs_key_max_salt_uses                                  400000000
        # Whether extra ALLOC blkptrs were added to a livelist entry while it was being condensed
        zfs_livelist_condense_new_alloc                                0
        # Whether livelist condensing was canceled in the synctask
        zfs_livelist_condense_sync_cancel                              0
        # Set the livelist condense synctask to pause
        zfs_livelist_condense_sync_pause                               0
        # Whether livelist condensing was canceled in the zthr function
        zfs_livelist_condense_zthr_cancel                              0
        # Set the livelist condense zthr to pause
        zfs_livelist_condense_zthr_pause                               0
        # Size to start the next sub-livelist in a livelist
        zfs_livelist_max_entries                                  500000
        # Threshold at which livelist is disabled
        zfs_livelist_min_percent_shared                               75
        # Max instruction limit that can be specified for a channel program
        zfs_lua_max_instrlimit                                 100000000
        # Max memory limit that can be specified for a channel program
        zfs_lua_max_memlimit                                   104857600
        # Max number of dedup blocks freed in one txg
        zfs_max_async_dedup_frees                                 100000
        # Limit to the amount of nesting a path can have. Defaults to 50.
        zfs_max_dataset_nesting                                       50
        # The number of past TXGs that the flushing algorithm of the log spacemap feature uses to estimate incoming log blocks
        zfs_max_log_walking                                            5
        # Maximum number of rows allowed in the summary of the spacemap log
        zfs_max_logsm_summary_length                                  10
        # Allow importing pool with up to this number of missing top-level vdevs
        zfs_max_missing_tvds                                           0
        # Maximum size in bytes allowed for src nvlist passed with ZFS ioctls
        zfs_max_nvlist_src_size                                        0
        # Max allowed record size
        zfs_max_recordsize                                      16777216
        # Normally only consider this many of the best metaslabs in each vdev
        zfs_metaslab_find_max_tries                                  100
        # Fragmentation for metaslab to allow allocation
        zfs_metaslab_fragmentation_threshold                          70
        # How long to trust the cached max chunk size of a metaslab
        zfs_metaslab_max_size_cache_sec                             3600
        # Percentage of memory that can be used to store metaslab range trees
        zfs_metaslab_mem_limit                                        25
        # Enable segment-based metaslab selection
        zfs_metaslab_segment_weight_enabled                            1
        # Segment-based metaslab selection maximum buckets before switching
        zfs_metaslab_switch_threshold                                  2
        # Try hard to allocate before ganging
        zfs_metaslab_try_hard_before_gang                              0
        # Percentage of metaslab group size that should be considered eligible for allocations unless all metaslab groups within the metaslab class have also crossed this threshold
        zfs_mg_fragmentation_threshold                                95
        # Percentage of metaslab group size that should be free to make it eligible for allocation
        zfs_mg_noalloc_threshold                                       0
        # Minimum number of metaslabs to flush per dirty TXG
        zfs_min_metaslabs_to_flush                                     1
        # Max allowed period without a successful mmp write
        zfs_multihost_fail_intervals                                  10
        # Historical statistics for last N multihost writes
        zfs_multihost_history                                          0
        # Number of zfs_multihost_interval periods to wait for activity
        zfs_multihost_import_intervals                                20
        # Milliseconds between mmp writes to each leaf
        zfs_multihost_interval                                      1000
        # Number of sublists used in each multilist
        zfs_multilist_num_sublists                                     0
        # Set to disable scrub I/O
        zfs_no_scrub_io                                                0
        # Set to disable scrub prefetching
        zfs_no_scrub_prefetch                                          0
        # Disable cache flushes
        zfs_nocacheflush                                               0
        # Enable NOP writes
        zfs_nopwrite_enabled                                           1
        # Size of znode hold array
        zfs_object_mutex_size                                         64
        # Min millisecs to obsolete per txg
        zfs_obsolete_min_time_ms                                     500
        # Override block size estimate with fixed size
        zfs_override_estimate_recordsize                               0
        # Max number of bytes to prefetch
        zfs_pd_bytes_max                                        52428800
        # Percentage of dirtied blocks from frees in one TXG
        zfs_per_txg_dirty_frees_percent                               30
        # Disable all ZFS prefetching
        zfs_prefetch_disable                                           0
        # Historical statistics for the last N reads
        zfs_read_history                                               0
        # Include cache hits in read history
        zfs_read_history_hits                                          0
        # Max segment size in bytes of rebuild reads
        zfs_rebuild_max_segment                                  1048576
        # Automatically scrub after sequential resilver completes
        zfs_rebuild_scrub_enabled                                      1
        # Max bytes in flight per leaf vdev for sequential resilvers
        zfs_rebuild_vdev_limit                                  67108864
        # Maximum number of combinations when reconstructing split segments
        zfs_reconstruct_indirect_combinations_max                   4096
        # Set to attempt to recover from fatal errors
        zfs_recover                                                    0
        # Ignore errors during corrective receive
        zfs_recv_best_effort_corrective                                0
        # Receive queue fill fraction
        zfs_recv_queue_ff                                             20
        # Maximum receive queue length
        zfs_recv_queue_length                                   16777216
        # Maximum amount of writes to batch into one transaction
        zfs_recv_write_batch_size                                1048576
        # Ignore hard IO errors when removing device
        zfs_removal_ignore_errors                                      0
        # Pause device removal after this many bytes are copied
        zfs_removal_suspend_progress                                   0
        # Largest contiguous segment to allocate when removing device
        zfs_remove_max_segment                                  16777216
        # Issued IO percent complete after which resilvers are deferred
        zfs_resilver_defer_percent                                    10
        # Process all resilvers immediately
        zfs_resilver_disable_defer                                     0
        # Min millisecs to resilver per txg
        zfs_resilver_min_time_ms                                    3000
        # Enable block statistics calculation during scrub
        zfs_scan_blkstats                                              0
        # Scan progress on-disk checkpointing interval
        zfs_scan_checkpoint_intval                                  7200
        # Tunable to adjust bias towards more filled segments during scans
        zfs_scan_fill_weight                                           3
        # Ignore errors during resilver/scrub
        zfs_scan_ignore_errors                                         0
        # IO issuing strategy during scrubbing. 0 = default, 1 = LBA, 2 = size
        zfs_scan_issue_strategy                                        0
        # Scrub using legacy non-sequential method
        zfs_scan_legacy                                                0
        # Max gap in bytes between sequential scrub / resilver I/Os
        zfs_scan_max_ext_gap                                     2097152
        # Fraction of RAM for scan hard limit
        zfs_scan_mem_lim_fact                                         20
        # Fraction of hard limit used as soft limit
        zfs_scan_mem_lim_soft_fact                                    20
        # Tunable to report resilver performance over the last N txgs
        zfs_scan_report_txgs                                           0
        # Tunable to attempt to reduce lock contention
        zfs_scan_strict_mem_lim                                        0
        # Set to prevent scans from progressing
        zfs_scan_suspend_progress                                      0
        # Max bytes in flight per leaf vdev for scrubs and resilvers
        zfs_scan_vdev_limit                                     16777216
        # For expanded RAIDZ, automatically start a pool scrub when expansion completes
        zfs_scrub_after_expand                                         1
        # Error blocks to be scrubbed in one txg
        zfs_scrub_error_blocks_per_txg                              4096
        # Min millisecs to scrub per txg
        zfs_scrub_min_time_ms                                       1000
        # Allow sending corrupt data
        zfs_send_corrupt_data                                          0
        # Send queue fill fraction for non-prefetch queues
        zfs_send_no_prefetch_queue_ff                                 20
        # Maximum send queue length for non-prefetch queues
        zfs_send_no_prefetch_queue_length                        1048576
        # Send queue fill fraction
        zfs_send_queue_ff                                             20
        # Maximum send queue length
        zfs_send_queue_length                                   16777216
        # Send unmodified spill blocks
        zfs_send_unmodified_spill_blocks                               1
        # Select SHA256 implementation.
        zfs_sha256_impl                cycle [fastest] generic x64 ssse3
        # Select SHA512 implementation.
        zfs_sha512_impl                      cycle [fastest] generic x64
        # Rate limit slow IO
        zfs_slow_io_events_per_second                                 20
        # Include snapshot events in pool history/events
        zfs_snapshot_history_enabled                                   1
        # Disable setuid/setgid for automounts in .zfs/snapshot
        zfs_snapshot_no_setuid                                         0
        # Limit for memory used in prefetching the checkpoint space map done on each vdev while discarding the checkpoint
        zfs_spa_discard_memory_limit                            16777216
        # Small file blocks in special vdevs depends on this much free space available
        zfs_special_class_metadata_reserve_pct                        25
        # Defer frees starting in this pass
        zfs_sync_pass_deferred_free                                    2
        # Don't compress starting in this pass
        zfs_sync_pass_dont_compress                                    8
        # Rewrite new bps starting in this pass
        zfs_sync_pass_rewrite                                          2
        # Traverse prefetch number of blocks pointed by indirect block
        zfs_traverse_indirect_prefetch_limit                          32
        # Max size of TRIM commands, larger will be split
        zfs_trim_extent_bytes_max                              134217728
        # Min size of TRIM commands, smaller will be skipped
        zfs_trim_extent_bytes_min                                  32768
        # Skip metaslabs which have never been initialized
        zfs_trim_metaslab_skip                                         0
        # Max queued TRIMs outstanding per leaf vdev
        zfs_trim_queue_limit                                          10
        # Min number of txgs to aggregate frees before issuing TRIM
        zfs_trim_txg_batch                                            32
        # Historical statistics for the last N txgs
        zfs_txg_history                                              100
        # Max seconds worth of delta per txg
        zfs_txg_timeout                                                5
        # Hard limit
        zfs_unflushed_log_block_max                               131072
        # Lower-bound limit for the maximum amount of blocks allowed in log spacemap
        zfs_unflushed_log_block_min                                 1000
        # Tunable used to determine the number of blocks that can be used for the spacemap log, expressed as a percentage of the total number of metaslabs in the pool
        zfs_unflushed_log_block_pct                                  400
        # Hard limit
        zfs_unflushed_log_txg_max                                   1000
        # Specific hard-limit in memory that ZFS allows to be used for unflushed changes
        zfs_unflushed_max_mem_amt                             1073741824
        # Percentage of the overall system memory that ZFS allows to be used for unflushed changes
        zfs_unflushed_max_mem_ppm                                   1000
        # Set to prevent async unlinks (debug - leaks space into the unlinked set)
        zfs_unlink_suspend_progress                                    0
        # Place user data indirect blocks into the special class
        zfs_user_indirect_is_special                                   1
        # Max vdev I/O aggregation size
        zfs_vdev_aggregation_limit                               1048576
        # Max vdev I/O aggregation size for non-rotating media
        zfs_vdev_aggregation_limit_non_rotating                   131072
        # Max active async read I/Os per vdev
        zfs_vdev_async_read_max_active                                 3
        # Min active async read I/Os per vdev
        zfs_vdev_async_read_min_active                                 1
        # Async write concurrency max threshold
        zfs_vdev_async_write_active_max_dirty_percent                 60
        # Async write concurrency min threshold
        zfs_vdev_async_write_active_min_dirty_percent                 30
        # Max active async write I/Os per vdev
        zfs_vdev_async_write_max_active                               10
        # Min active async write I/Os per vdev
        zfs_vdev_async_write_min_active                                2
        # Default queue depth for each allocator
        zfs_vdev_def_queue_depth                                      32
        # Target number of metaslabs per top-level vdev
        zfs_vdev_default_ms_count                                    200
        # Default lower limit for metaslab size
        zfs_vdev_default_ms_shift                                     29
        # Direct I/O writes will perform for checksum verification before commiting write
        zfs_vdev_direct_write_verify                                   1
        # Use classic BIO submission method
        zfs_vdev_disk_classic                                          0
        # Maximum number of data segments to add to an IO request
        zfs_vdev_disk_max_segs                                         0
        # Defines failfast mask: 1 - device, 2 - transport, 4 - driver
        zfs_vdev_failfast_mask                                         1
        # Max active initializing I/Os per vdev
        zfs_vdev_initializing_max_active                               1
        # Min active initializing I/Os per vdev
        zfs_vdev_initializing_min_active                               1
        # Maximum number of active I/Os per vdev
        zfs_vdev_max_active                                         1000
        # Maximum ashift used when optimizing for logical -> physical sector size on new top-level vdevs
        zfs_vdev_max_auto_ashift                                      14
        # Default upper limit for metaslab size
        zfs_vdev_max_ms_shift                                         34
        # Minimum ashift used when creating new top-level vdevs
        zfs_vdev_min_auto_ashift                                       9
        # Minimum number of metaslabs per top-level vdev
        zfs_vdev_min_ms_count                                         16
        # Non-rotating media load increment for non-seeking I/Os
        zfs_vdev_mirror_non_rotating_inc                               0
        # Non-rotating media load increment for seeking I/Os
        zfs_vdev_mirror_non_rotating_seek_inc                          1
        # Rotating media load increment for non-seeking I/Os
        zfs_vdev_mirror_rotating_inc                                   0
        # Rotating media load increment for seeking I/Os
        zfs_vdev_mirror_rotating_seek_inc                              5
        # Offset in bytes from the last I/O which triggers a reduced rotating media seek increment
        zfs_vdev_mirror_rotating_seek_offset                     1048576
        # Practical upper limit of total metaslabs per top-level vdev
        zfs_vdev_ms_count_limit                                   131072
        # Number of non-interactive I/Os to allow in sequence
        zfs_vdev_nia_credit                                            5
        # Number of non-interactive I/Os before _max_active
        zfs_vdev_nia_delay                                             5
        # Timeout before determining that a device is missing
        zfs_vdev_open_timeout_ms                                    1000
        # Queue depth percentage for each top-level vdev
        zfs_vdev_queue_depth_pct                                    1000
        # Select raidz implementation.
        zfs_vdev_raidz_impl   cycle [fastest] original scalar sse2 ssse3
        # Aggregate read I/O over gap
        zfs_vdev_read_gap_limit                                    32768
        # Max active rebuild I/Os per vdev
        zfs_vdev_rebuild_max_active                                    3
        # Min active rebuild I/Os per vdev
        zfs_vdev_rebuild_min_active                                    1
        # Max active removal I/Os per vdev
        zfs_vdev_removal_max_active                                    2
        # Min active removal I/Os per vdev
        zfs_vdev_removal_min_active                                    1
        # I/O scheduler
        zfs_vdev_scheduler                                        unused
        # Max active scrub I/Os per vdev
        zfs_vdev_scrub_max_active                                      3
        # Min active scrub I/Os per vdev
        zfs_vdev_scrub_min_active                                      1
        # Max active sync read I/Os per vdev
        zfs_vdev_sync_read_max_active                                 10
        # Min active sync read I/Os per vdev
        zfs_vdev_sync_read_min_active                                 10
        # Max active sync write I/Os per vdev
        zfs_vdev_sync_write_max_active                                10
        # Min active sync write I/Os per vdev
        zfs_vdev_sync_write_min_active                                10
        # Max active trim/discard I/Os per vdev
        zfs_vdev_trim_max_active                                       2
        # Min active trim/discard I/Os per vdev
        zfs_vdev_trim_min_active                                       1
        # Aggregate write I/O over gap
        zfs_vdev_write_gap_limit                                    4096
        # Bytes to read per chunk
        zfs_vnops_read_chunk_size                                1048576
        # The size limit of write-transaction zil log data
        zfs_wrlog_data_max                                    3352715264
        # Use legacy ZFS xattr naming for writing new user namespace xattrs
        zfs_xattr_compat                                               0
        # Max event queue length
        zfs_zevent_len_max                                           512
        # Expiration time for recent zevents records
        zfs_zevent_retain_expire_secs                                900
        # Maximum recent zevents records to retain for duplicate checking
        zfs_zevent_retain_max                                       2000
        # Max number of taskq entries that are cached
        zfs_zil_clean_taskq_maxalloc                             1048576
        # Number of taskq entries that are pre-populated
        zfs_zil_clean_taskq_minalloc                                1024
        # Max percent of CPUs that are used per dp_sync_taskq
        zfs_zil_clean_taskq_nthr_pct                                 100
        # Disable xattr=sa extended attribute logging in ZIL by settng 0.
        zfs_zil_saxattr                                                1
        # Limit in bytes of ZIL log block size
        zil_maxblocksize                                          131072
        # Limit in bytes WR_COPIED size
        zil_maxcopied                                               7680
        # Disable ZIL cache flushes
        zil_nocacheflush                                               0
        # Disable intent logging replay
        zil_replay_disable                                             0
        # Limit in bytes slog sync writes per commit
        zil_slog_bulk                                           67108864
        # Log all slow ZIOs, not just those with vdevs
        zio_deadman_log_all                                            0
        # Throttle block allocations in the ZIO pipeline
        zio_dva_throttle_enabled                                       1
        # Prioritize requeued I/O
        zio_requeue_io_start_cut_in_line                               1
        # Max I/O completion time
        zio_slow_io_ms                                             30000
        # Percentage of CPUs to run an IO worker thread
        zio_taskq_batch_pct                                           80
        # Number of threads per IO worker taskqueue
        zio_taskq_batch_tpq                                            0
        # Configure IO queues for read IO
        zio_taskq_read                         fixed,1,8 null scale null
        # Configure IO queues for write IO
        zio_taskq_write                             sync null scale null
        # Number of CPUs per write issue taskq
        zio_taskq_write_tpq                                           16
        # Minimal size of block to attempt early abort
        zstd_abort_size                                           131072
        # Enable early abort attempts when using zstd
        zstd_earlyabort_pass                                           1
        # Process volblocksize blocks per thread
        zvol_blk_mq_blocks_per_thread                                  8
        # Default blk-mq queue depth
        zvol_blk_mq_queue_depth                                      128
        # Enable strict ZVOL quota enforcment
        zvol_enforce_quotas                                            1
        # Do not create zvol device nodes
        zvol_inhibit_dev                                               0
        # Major number for zvol device
        zvol_major                                                   230
        # Max number of blocks to discard
        zvol_max_discard_blocks                                    16384
        # Number of zvol taskqs
        zvol_num_taskqs                                                0
        # Timeout for ZVOL open retries
        zvol_open_timeout_ms                                        1000
        # Prefetch N bytes at zvol start+end
        zvol_prefetch_bytes                                       131072
        # Synchronously handle bio requests
        zvol_request_sync                                              0
        # Number of threads to handle I/O requests. Setto 0 to use all active CPUs
        zvol_threads                                                   0
        # Use the blk-mq API for zvols
        zvol_use_blk_mq                                                0
        # Default volmode property value
        zvol_volmode                                                   1

ZIL committed transactions:                                        52.2k
        Commit requests:                                            9.3k
        Flushes to stable storage:                                  9.3k
        Transactions to SLOG storage pool:            0 Bytes          0
        Transactions to non-SLOG storage pool:      262.1 MiB       7.4k

@brian-maloney brian-maloney added the Type: Defect Incorrect behavior (e.g. crash, hang) label Feb 13, 2025
@amotin
Member

amotin commented Feb 18, 2025

@maru-sama The fact that ARC is allowed to use almost all of RAM in 2.3 is not a bug but a feature. Once memory pressure appears from the kernel and other consumers, it should shrink.

@maru-sama

@maru-sama The fact that ARC is allowed to use almost all of RAM in 2.3 is not a bug but a feature. Once memory pressure appears from the kernel and other consumers, it should shrink.

Hello, thanks for the reply. I removed my initial comment since the behaviour I described is standard (as you said) and the original poster is seeing a different issue.

@amotin
Member

amotin commented Feb 18, 2025

@brian-maloney As far as I can see, most of your ARC size is reported as non-evictable metadata, which means something is actively referencing it. It might be metadata blocks backing dnodes, referenced by the kernel via inodes. ZFS should start the inode-pruning process for the kernel once the percentage of non-evictable metadata goes above the threshold and there is a need for eviction. So either the pruning does not start, it does not work right, or I wonder whether it simply can't because those files are actually open and the kernel doesn't want to let them go.
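
If you want to check whether that pruning is even being requested, something along these lines should show it (a rough sketch; exact kstat names can vary a bit between versions):

    # arc_prune counts how often ZFS asked the kernel to drop inodes/dentries;
    # dnode_size and evict_skip hint at how much metadata is pinned
    awk '$1 ~ /^(arc_prune|dnode_size|evict_skip)$/' /proc/spl/kstat/zfs/arcstats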

@satmandu
Contributor

@brian-maloney Do you use docker?

@brian-maloney
Author

@brian-maloney Do you use docker?

I do have Docker running on the system, but the application that triggers the memory exhaustion (duplicacy) is not running in a container. So if the mere presence of containers on the system is enough to trigger the issue, that might be relevant, but most of the time (all except for 10-15 minutes per day) the ARC max is honored; it's only while duplicacy is running that I see the problem behavior.

To @amotin's comment, it's likely that duplicacy does hold many files open while computing which chunks to check for in the backup storage location. I am not that familiar with the implementation.

@brian-maloney
Author

I did some testing with this. The dangerous interaction seems to occur only during the first phase of the backup operation, which is a depth-first listing of all files on the system (as seen in ListLocalFiles). I'm not sure what exactly about this implementation causes the behavior, but I did run an ls -R / for comparison and didn't see anything bad happen, though the two are very different operations.

During this first phase, lsof shows only DIR-type objects opened by duplicacy (aside from the binary itself). If I shepherd the process through the first phase using drop_caches, it respects the ARC max for the subsequent phases of the backup, where it is actually opening files.
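
For anyone trying to reproduce this, I watch it during that phase with something like the loop below (just a sketch; swap in whatever your backup binary is called):

    # sample ARC size and duplicacy's open descriptors every 5 seconds
    while sleep 5; do
        date +%T
        awk '$1 == "size" {print "ARC size (bytes):", $3}' /proc/spl/kstat/zfs/arcstats
        lsof -c duplicacy 2>/dev/null | wc -l
    done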

I am hopeful that these are enough clues to help someone more knowledgeable than me about ZFS internals and the 2.3.0 changes suggest either a code fix or a workaround. All of my tuning attempts so far have been completely ineffective.

Thanks in advance for any advice you can provide!

@shodanshok
Contributor

@maru-sama The fact that ARC is allowed to use almost all of RAM in 2.3 is not a bug but a feature. Once memory pressure appears from the kernel and other consumers, it should shrink.

@amotin A quick test shows that, even on 2.3, setting zfs_arc_max to a low value (e.g. 2 GB) and reading a big file does not exceed the configured maximum ARC capacity. So it seems that zfs_arc_max should be honored unless the shrinker thread fails to reclaim memory. Is my understanding wrong?
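
For reference, the quick test was roughly this (the file path is just an example; any file on a ZFS dataset larger than the cap will do):

    # cap the ARC at 2 GiB at runtime
    echo 2147483648 > /sys/module/zfs/parameters/zfs_arc_max

    # stream a large file through the ARC, then compare size against c_max
    dd if=/tank/bigfile of=/dev/null bs=1M
    awk '$1 == "size" || $1 == "c_max"' /proc/spl/kstat/zfs/arcstats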

@amotin
Member

amotin commented Feb 19, 2025

@shodanshok Yes, zfs_arc_max should still be honored; I haven't said otherwise. The change just raised the default value from 50% of RAM to somewhere around 95%.

@mabod

mabod commented Feb 20, 2025

@amotin:

The fact that ARC is allowed to use almost all of RAM in 2.3 is not a bug but a feature. Once memory pressure appears from the kernel and other consumers, it should shrink.

Does that mean that the 50% RAM limit for ARC that was in place for ZFS 2.2 and before does not apply anymore? Where can I read more about this change? I haven't found a note in the change log.

@amotin
Member

amotin commented Feb 20, 2025

Does that mean that the 50% RAM limit for ARC that was in place for ZFS 2.2 and before does not apply anymore?

Right.

Where can I read more about this change? I havent found a note in the change log.

It was changed here: #15437. It was discussed among developers and accompanied by several other changes to better coexist with the Linux kernel, including some fixes to the Linux kernel itself.
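
A quick way to see what your module actually picked: with zfs_arc_max left at its default of 0, the effective cap shows up as c_max (values in bytes):

    cat /sys/module/zfs/parameters/zfs_arc_max        # 0 means "use the built-in default"
    awk '$1 == "c_max"' /proc/spl/kstat/zfs/arcstats   # effective maximum ARC size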

@brian-maloney
Author

@brian-maloney Do you use docker?

Just to (hopefully) rule out any issues related to Docker, I stopped the Docker service (and all containers) and ran another backup. The same issue is present, so I think it's probably not Docker-related.
