
Ignore zfs_arc_shrinker_limit in direct reclaim mode #16313

Merged: 1 commit merged into openzfs:master on Aug 21, 2024

Conversation

@shodanshok (Contributor)

Motivation and Context

zfs_arc_shrinker_limit (default: 10000) avoids ARC collapse due to excessive memory reclaim. However, when the kernel is in direct reclaim mode (i.e., low on memory), limiting ARC reclaim increases OOM risk. This is especially true on systems without (or with inadequate) swap.

Description

This patch ignores zfs_arc_shrinker_limit when the kernel is in direct reclaim mode, avoiding most OOM events. It also restores the ability of "echo 3 > /proc/sys/vm/drop_caches" to correctly drop (almost) all of the ARC.

How Has This Been Tested?

Minimal testing on a Debian 12 virtual machine.

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Performance enhancement (non-breaking change which improves efficiency)
  • Code cleanup (non-breaking change which makes code smaller or more readable)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Library ABI change (libzfs, libzfs_core, libnvpair, libuutil and libzfsbootenv)
  • Documentation (a change to man pages or other documentation)

@adamdmoss (Contributor) left a comment

I run with zfs_arc_shrinker_limit=0 for the same underlying reason, and I like this nuanced approach more.

Review comment on module/os/linux/zfs/arc_os.c (outdated, resolved)
@amotin (Member) commented Jun 30, 2024

Could you share why you are changing this only for direct reclaim? From my reading of the 6.6.20 kernel code, the direct and regular reclaim paths use mostly the same code, so I'd expect them to have similar problems. What kernel are you using?

Speaking of tests, just yesterday I found a more reliable reproduction of the ARC problems we hit with the 6.6 kernel and zfs_arc_shrinker_limit=0 before disabling MGLRU. Run fio in a simple read loop, wait for the ZFS ARC to consume most of the available memory, and then try to allocate 2GB of RAM (more than is available) at once from user space (I've written a trivial program for that). Right after that I see ~800MB of active swap and ZFS evicts the ARC down to its minimum limit. 10-15 seconds later the ARC recovers to full size and the process can be repeated, but this time swap reaches 1.5GB. On the third repeat swap reaches 2GB, then drops back to 1.5GB since the system probably needs those pages. Changing zfs_arc_shrinker_limit should not affect the excessive swapping, but without the limit the ARC drops to its minimum each time. I've reported this to the MGLRU author (Yu Zhao) and hope it will result in some fixes, but there are already plenty of affected kernels. PS: I am not advocating for the limit.
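
For reference, a minimal sketch of the kind of allocation test I mean (an illustrative stand-in, not the exact trivial program mentioned above): it allocates the requested amount of anonymous memory and touches every page so the kernel is forced into reclaim.

/*
 * alloc_test.c - allocate and touch N GiB of anonymous memory.
 * Illustrative sketch only.
 *
 * Build: cc -O2 alloc_test.c -o alloc_test
 * Run:   ./alloc_test 2
 */
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int
main(int argc, char **argv)
{
	size_t gib = (argc > 1) ? strtoull(argv[1], NULL, 10) : 2;
	size_t len = gib << 30;
	long page = sysconf(_SC_PAGESIZE);
	char *buf = malloc(len);

	if (buf == NULL) {
		perror("malloc");
		return (1);
	}

	/* Touch every page so the kernel must actually back the allocation. */
	for (size_t off = 0; off < len; off += (size_t)page)
		buf[off] = 1;

	printf("Allocated and touched %zu GiB; press Enter to exit.\n", gib);
	getchar();
	free(buf);
	return (0);
}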

@shodanshok (Contributor, Author) commented Jun 30, 2024

Could you share why you are changing this only for direct reclaim? From my reading of the 6.6.20 kernel code, the direct and regular reclaim paths use mostly the same code, so I'd expect them to have similar problems. What kernel are you using?

For this specific test VM, I am using kernel 6.1.0-22-amd64, without MGLRU.

The main issue I would like to avoid (which is especially evident on MGLRU-enabled RHEL 9 kernels, such as 5.14.0-427.22.1.el9_4.x86_64) is excessive swap utilization when the ARC should be shrunk, or OOM when no swap is available. On the other hand, setting zfs_arc_shrinker_limit=0 on an MGLRU kernel causes ARC collapse even under small/moderate memory pressure.

This patch tries both to avoid OOM due to a non-shrinking ARC when no swap is available and to avoid ARC collapse due to small direct memory reclaims. I tested on a non-MGLRU kernel to keep the first tests simple, but I plan to test on RHEL 9 as well.

My understanding of reclaim is that kswapd issues indirect reclaim proactively, while direct reclaim happens only under low-memory conditions. Am I wrong?

Speaking of tests, just yesterday I found a more reliable reproduction of the ARC problems we hit with the 6.6 kernel and zfs_arc_shrinker_limit=0 before disabling MGLRU. Run fio in a simple read loop, wait for the ZFS ARC to consume most of the available memory, and then try to allocate 2GB of RAM (more than is available) at once from user space (I've written a trivial program for that). Right after that I see ~800MB of active swap and ZFS evicts the ARC down to its minimum limit. 10-15 seconds later the ARC recovers to full size and the process can be repeated, but this time swap reaches 1.5GB. On the third repeat swap reaches 2GB, then drops back to 1.5GB since the system probably needs those pages.

This matches my observations.

Thanks.

@shodanshok (Contributor, Author) commented Jun 30, 2024

I am testing another approach:

+      int64_t limit = current_is_kswapd();
+      if (!limit) {
+              limit = INT64_MAX;
+      }

-      int64_t limit = INT64_MAX;
-      if (current_is_kswapd() && zfs_arc_shrinker_limit != 0) {
-              limit = zfs_arc_shrinker_limit;
-      }

The zfs_arc_shrinker_limit default (10000) seems low to me. This patch dynamically sets the limit as requested by kswapd, which asks for either 131072 or 0 (direct reclaim), so the kernel module parameter can be removed.

Any thoughts?

EDIT: never mind, current_is_kswapd does not ask for a number of pages; it returns a flag.
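
For context: current_is_kswapd() just tests a process flag, roughly as sketched below (paraphrased from the Linux kernel headers; exact definitions vary by version), which is why it evaluates to either 0 or the PF_KSWAPD bit (0x20000 = 131072) rather than a page count.

/*
 * Paraphrased from the Linux kernel headers (include/linux/swap.h and
 * include/linux/sched.h); exact definitions may differ between versions.
 */
#define PF_KSWAPD	0x00020000	/* this task is kswapd */

static inline int
current_is_kswapd(void)
{
	return (current->flags & PF_KSWAPD);
}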

@amotin (Member) commented Jul 5, 2024

The zfs_arc_shrinker_limit default (10000) seems low to me.

The 10000 default is definitely too low, since the kernel does not always evict that amount, but uses it as a base to which it applies memory pressure, starting from 1/1024th and increasing. It means that by the time the kernel actually requests the full 10000 pages, it will already have evicted everything it can from the page cache, which is not exactly the goal here. If I were to pick a value, I would probably make it a fraction of arc_c, since no fixed value will ever be right (either too small for most systems or too big for small ones).
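
As a rough illustration of that idea (a hypothetical sketch, not a concrete proposal; the 1/128th fraction and the 10000-page floor are made-up values), the limit could be derived from the current ARC target size instead of a fixed page count:

/*
 * Hypothetical sketch: size the shrinker limit as a fraction of the
 * current ARC target size (arc_c) instead of a fixed number of pages.
 * The 1/128th fraction and the 10000-page floor are illustrative only.
 */
static int64_t
arc_shrinker_limit_pages(void)
{
	int64_t limit = (int64_t)(arc_c >> 7) / PAGE_SIZE;

	return (MAX(limit, 10000));
}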

This patch dynamically sets the limit as requested by kswapd, which asks for either 131072 or 0 (direct reclaim), so the kernel module parameter can be removed.

Reading your response above, I am still not getting why you suggest that we should ignore indirect memory pressure but partially obey direct. The algorithm is pretty much the same in both cases. Do you have actual data showing that it helps?

This matches my observations.

Good. And what does this patch do about it?

@shodanshok (Contributor, Author) commented Jul 6, 2024

The 10000 default is definitely too low, since the kernel does not always evict that amount, but uses it as a base to which it applies memory pressure, starting from 1/1024th and increasing. It means that by the time the kernel actually requests the full 10000 pages, it will already have evicted everything it can from the page cache, which is not exactly the goal here.

Even worse, if it has no more page cache to evict and no swap, it invokes the OOM killer.

If I were to pick a value, I would probably make it a fraction of arc_c, since no fixed value will ever be right (either too small for most systems or too big for small ones).

Do we have something similar in the form of zfs_arc_shrink_shift?

Reading your response above, I am still not getting why you suggest that we should ignore indirect memory pressure but partially obey direct. The algorithm is pretty much the same in both cases. Do you have actual data showing that it helps?

Maybe I am wrong, but this patch does not ignore indirect memory pressure any more than the current behavior does. If anything, the variant shown in #16313 (comment) would increase reactivity to indirect memory reclaim via kswapd (which asks for many more pages than 10000). What this patch really does is ignore the limit when direct reclaim is requested: since the kernel enters direct reclaim only when it is low on memory, the ARC is allowed to shrink as if zfs_arc_shrinker_limit=0. This greatly reduces OOM in my (limited) tests.

As a side note, it restores the correct behavior of echo 3 > /proc/sys/vm/drop_caches, which can be handy in some cases.

Good. And what does this patch do about it?

This patch does nothing to prevent excessive swapping, but it greatly helps to prevent OOM on machines without swap. That was my intent, at least - am I missing something?

Thanks.

@amotin (Member) commented Jul 8, 2024

If I were to pick a value, I would probably make it a fraction of arc_c, since no fixed value will ever be right (either too small for most systems or too big for small ones).

Do we have something similar in the form of zfs_arc_shrink_shift?

I was thinking about zfs_arc_shrink_shift also, but it is not used for Linux shrinker calls, only for ZFS's own arc_reap(), which I am not sure activates often on later Linux kernels, possibly due to an inability to predict reserved_highatomic. I haven't decided what to do about it, focusing instead on whether we should apply the limit to the count() call or somehow to the scan() call instead.
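
As background on the count()/scan() calls mentioned above: a Linux shrinker registers two callbacks, count_objects() reporting how many objects are freeable and scan_objects() actually freeing up to sc->nr_to_scan of them; the question is which of the two the limit belongs in. A generic sketch of that interface (illustrative only, not ZFS's actual arc_shrinker):

#include <linux/atomic.h>
#include <linux/kernel.h>
#include <linux/shrinker.h>

/* Hypothetical counter standing in for "number of freeable objects". */
static atomic_long_t example_freeable = ATOMIC_LONG_INIT(0);

static unsigned long
example_count_objects(struct shrinker *shrink, struct shrink_control *sc)
{
	/* Tell the kernel how many objects could be freed right now. */
	return (atomic_long_read(&example_freeable));
}

static unsigned long
example_scan_objects(struct shrinker *shrink, struct shrink_control *sc)
{
	/* Free up to sc->nr_to_scan objects and report how many were freed. */
	unsigned long freed = min_t(unsigned long, sc->nr_to_scan,
	    atomic_long_read(&example_freeable));

	atomic_long_sub(freed, &example_freeable);
	return (freed);
}

static struct shrinker example_shrinker = {
	.count_objects	= example_count_objects,
	.scan_objects	= example_scan_objects,
	.seeks		= DEFAULT_SEEKS,
};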

Reading your response above, I am still not getting why you suggest that we should ignore indirect memory pressure but partially obey direct. The algorithm is pretty much the same in both cases. Do you have actual data showing that it helps?

Maybe I am wrong, but this patch does not ignore indirect memory pressure any more than the current behavior does.

I didn't mean that the patch ignores indirect more, but that indirect kswapd reclaim now has no limits. Is that somehow more sane? The code seems to be quite shared.

@shodanshok (Contributor, Author) commented Jul 9, 2024

I didn't mean that the patch ignores indirect more, but that indirect kswapd reclaim now has no limits. Is that somehow more sane? The code seems to be quite shared.

The original patch version (the one committed) disables the limit only when current_is_kswapd returns 0, which I (perhaps wrongly) understand to mean direct reclaim.

The other version proposed in #16313 (comment) sets the limit to the value returned by current_is_kswapd which, in my testing on Debian 12 and Rocky 9, seems to always be 0 (direct reclaim) or 131072 (i.e., a limit roughly 13x higher than the default, which provides fast ARC shrinking without total collapse).

It seems to me that in no way am I ignoring the limit for indirect reclaim initiated by kswapd. Am I wrong?

Thanks.

@shodanshok (Contributor, Author)

Well, I had a basic misunderstanding of what current_is_kswapd does: this function does not return a number of pages to be freed, but rather a bitmap flag. So it should not be used as a limit - please discard the approach discussed in #16313 (comment).

That said, the patch as committed (with no limit for direct reclaim, when current_is_kswapd is 0) should be fine. Maybe I should add a commit to raise the limit to a more reasonable level, e.g. 65K-128K pages (which seems to work fine in my tests).

@shodanshok (Contributor, Author)

I resolved a conflict with the latest code.

@amotin the way I understand the code, this patch effectively limits reclaim to zfs_arc_shrinker_limit only when the current reclaim thread is kswapd and zfs_arc_shrinker_limit > 0. Direct reclaim and reclaim by other kernel threads (e.g. khugepaged) should be unlimited. Am I wrong?
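
For illustration, the behavior I describe corresponds to something like the following in the shrinker path (a sketch consistent with the diff quoted earlier in this thread, not the literal committed hunk):

/*
 * Apply zfs_arc_shrinker_limit only to kswapd-driven (indirect) reclaim;
 * direct reclaim and other kernel threads get an effectively unlimited cap.
 */
int64_t limit = INT64_MAX;

if (current_is_kswapd() && zfs_arc_shrinker_limit > 0)
	limit = zfs_arc_shrinker_limit;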

Thanks.

@amotin (Member) left a comment

Reviewing some data from our performance team with zfs_arc_shrinker_limit=0, I see a case where the system has 2/3 of its RAM free, but something (I guess it may be kswapd) requests ARC eviction. It makes me understand your wish to restrict it specifically. But without a real understanding of what is going on there, I am not exactly comfortable proposing a workaround, since it clearly cannot be called a proper solution. I'll need to investigate it. One thing specific to that system is that it is a 2-node NUMA machine, which makes me wonder if the kernel may be trying to balance it or do something similar.

Meanwhile, if we go your way, I would simplify the patch to what I proposed below. And please rebase and collapse your commits; a 5-line change is not worth 4 commits.

Review comment on module/os/linux/zfs/arc_os.c (outdated, resolved)
@amotin (Member) commented Jul 31, 2024

@shodanshok You should rebase your branch onto master, not merge.

zfs_arc_shrinker_limit (default: 10000) avoids ARC collapse
due to excessive memory reclaim. However, when the kernel is
in direct reclaim mode (ie: low on memory), limiting ARC reclaim
increases OOM risk. This is especially true on system without
(or with inadequate) swap.

This patch ignores zfs_arc_shrinker_limit when the kernel is in
direct reclaim mode, avoiding most OOM. It also restores
"echo 3 > /proc/sys/vm/drop_caches" ability to correctly drop
(almost) all ARC.

Signed-off-by: Gionatan Danti <[email protected]>
@shodanshok (Contributor, Author)

@shodanshok You should rebase your branch onto master, not merge.

@amotin you are right, sorry. I hope the last force-push (after rebase) is what you asked for. Thanks.

@shodanshok (Contributor, Author)

But without a real understanding of what is going on there, I am not exactly comfortable proposing a workaround, since it clearly cannot be called a proper solution. I'll need to investigate it.

You are right, it is not a proper solution. The real solution would be to never ignore kernel reclaim requests, but setting zfs_arc_shrinker_limit=0 on recent (MGLRU-enabled) kernels seems to cause ARC collapse (as a side note, thanks for your work in this area).

Still, limiting only kswapd reclaims should be an improvement: direct reclaims (and reclaims from kernel threads other than kswapd) are urgent ones, and the ARC should shrink promptly.

One thing specific to that system is that it is a 2-node NUMA machine, which makes me wonder if the kernel may be trying to balance it or do something similar.

NUMA may play a role, but I see similar behavior on single-socket (and even single-core) machines.

Regards.

@behlendorf added the label "Status: Design Review Needed (architecture or design is under discussion)" on Aug 16, 2024
@behlendorf (Contributor) left a comment

Do we have any updated results for how this change has been working in practice? It makes good sense to me, but I'd like to know a little more about how it's been tested.

@behlendorf added the label "Status: Accepted (ready to integrate: reviewed, tested)" and removed "Status: Design Review Needed" on Aug 19, 2024
@shodanshok (Contributor, Author)

@behlendorf I have done some testing (simulating memory pressure and reclaim) on both a Debian 12 and a Rocky 9 virtual machine, but nothing more. Is there anything specific you would like me to test on these machines?

Anyway, the whole idea is based on current_is_kswapd() returning 0 for both direct reclaims and reclaims initiated by other kernel subsystems (e.g. khugepaged). Am I reading the code right, or am I missing something?

Thanks.

@behlendorf (Contributor)

@shodanshok you've got it right. Thanks for the info, I don't have any additional specific tests. I've merged this to get some additional miles on it.

@behlendorf merged commit bbe8512 into openzfs:master on Aug 21, 2024 (23 checks passed)