
Ignore zfs_arc_shrinker_limit in direct reclaim mode #16313

Merged: 1 commit merged into openzfs:master on Aug 21, 2024

Conversation

@shodanshok (Contributor)

Motivation and Context

zfs_arc_shrinker_limit (default: 10000) avoids ARC collapse due to excessive memory reclaim. However, when the kernel is in direct reclaim mode (i.e., low on memory), limiting ARC reclaim increases OOM risk. This is especially true on systems without (or with inadequate) swap.

Description

This patch ignores zfs_arc_shrinker_limit when the kernel is in direct reclaim mode, avoiding most OOM events. It also restores the ability of "echo 3 > /proc/sys/vm/drop_caches" to correctly drop (almost) all of the ARC.

How Has This Been Tested?

Minimal testing on a Debian 12 virtual machine.

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Performance enhancement (non-breaking change which improves efficiency)
  • Code cleanup (non-breaking change which makes code smaller or more readable)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Library ABI change (libzfs, libzfs_core, libnvpair, libuutil and libzfsbootenv)
  • Documentation (a change to man pages or other documentation)

@adamdmoss (Contributor) left a comment

I run with zfs_arc_shrinker_limit=0 for the same underlying reason, and I like this nuanced approach more.

Review comment on module/os/linux/zfs/arc_os.c (outdated, resolved)
@amotin (Member) commented Jun 30, 2024

Could you share why you are changing this only for direct reclaim? From my reading of the 6.6.20 kernel code, the direct and regular reclaim paths use mostly the same code, so I'd expect them to have similar problems. What kernel are you using?

Speaking of tests, just yesterday I found a more reliable reproduction of the ARC problems we hit with the 6.6 kernel and zfs_arc_shrinker_limit=0 before disabling MGLRU. Run fio in a simple read loop, wait for the ZFS ARC to consume most of the available memory, and then try to allocate 2GB of RAM (more than is available) at once from user space (I've written a trivial program for that). Right after that I see ~800MB of active swap and ZFS evicts the ARC down to its minimum limit. 10-15 seconds later the ARC recovers to full size and the process can be repeated, but this time swap reaches 1.5GB. On the third repeat swap reaches 2GB, then drops back to 1.5GB since the system probably needs those pages. Changing zfs_arc_shrinker_limit should not affect the excessive swapping, but without the limit the ARC drops to its minimum each time. I've reported this to the MGLRU author (Yu Zhao) and hope it will result in some fixes, but there are already plenty of affected kernels. PS: I am not advocating for the limit.
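
For reference, a minimal sketch of the kind of allocation test I mean (an illustrative stand-in, not the exact trivial program mentioned above): it allocates the requested amount of anonymous memory and touches every page so the kernel is forced into reclaim.

/*
 * alloc_test.c - allocate and touch N GiB of anonymous memory.
 * Illustrative sketch only.
 *
 * Build: cc -O2 alloc_test.c -o alloc_test
 * Run:   ./alloc_test 2
 */
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int
main(int argc, char **argv)
{
	size_t gib = (argc > 1) ? strtoull(argv[1], NULL, 10) : 2;
	size_t len = gib << 30;
	long page = sysconf(_SC_PAGESIZE);
	char *buf = malloc(len);

	if (buf == NULL) {
		perror("malloc");
		return (1);
	}

	/* Touch every page so the kernel must actually back the allocation. */
	for (size_t off = 0; off < len; off += (size_t)page)
		buf[off] = 1;

	printf("Allocated and touched %zu GiB; press Enter to exit.\n", gib);
	getchar();
	free(buf);
	return (0);
}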

@shodanshok (Contributor, Author) commented Jun 30, 2024

Could you share why you are changing this only for direct reclaim? From my reading of the 6.6.20 kernel code, the direct and regular reclaim paths use mostly the same code, so I'd expect them to have similar problems. What kernel are you using?

For this specific test VM, I am using kernel 6.1.0-22-amd64, without MGLRU.

The main issue I would like to avoid (which is especially evident on MGLRU-enabled RHEL 9 kernels, such as 5.14.0-427.22.1.el9_4.x86_64) is excessive swap utilization when the ARC should be shrunk, or OOM when no swap is available. On the other hand, setting zfs_arc_shrinker_limit=0 on an MGLRU kernel causes ARC collapse even under small/moderate memory pressure.

This patch tries both to avoid OOM due to a non-shrinking ARC when no swap is available and to avoid ARC collapse due to small direct memory reclaims. I tested on a non-MGLRU kernel to keep the first tests simple, but I plan to test on RHEL 9 as well.

My understanding of reclaim is that kswapd issues indirect reclaim proactively, while direct reclaim happens only under low-memory conditions. Am I wrong?

Speaking of tests, just yesterday I found a more reliable reproduction of the ARC problems we hit with the 6.6 kernel and zfs_arc_shrinker_limit=0 before disabling MGLRU. Run fio in a simple read loop, wait for the ZFS ARC to consume most of the available memory, and then try to allocate 2GB of RAM (more than is available) at once from user space (I've written a trivial program for that). Right after that I see ~800MB of active swap and ZFS evicts the ARC down to its minimum limit. 10-15 seconds later the ARC recovers to full size and the process can be repeated, but this time swap reaches 1.5GB. On the third repeat swap reaches 2GB, then drops back to 1.5GB since the system probably needs those pages.

This matches my observations.

Thanks.

@shodanshok (Contributor, Author) commented Jun 30, 2024

I am testing another approach:

+      int64_t limit = current_is_kswapd();
+      if (!limit) {
+              limit = INT64_MAX;
+      }

-      int64_t limit = INT64_MAX;
-      if (current_is_kswapd() && zfs_arc_shrinker_limit != 0) {
-              limit = zfs_arc_shrinker_limit;
-      }

The zfs_arc_shrinker_limit default (10000) seems low to me. This patch dynamically sets the limit as requested by kswapd, which asks for either 131072 or 0 (direct reclaim), so the kernel module parameter can be removed.

Any thoughts?

EDIT: never mind, current_is_kswapd does not ask for a number of pages; it returns a flag.
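
For context: current_is_kswapd() just tests a process flag, roughly as sketched below (paraphrased from the Linux kernel headers; exact definitions vary by version), which is why it evaluates to either 0 or the PF_KSWAPD bit (0x20000 = 131072) rather than a page count.

/*
 * Paraphrased from the Linux kernel headers (include/linux/swap.h and
 * include/linux/sched.h); exact definitions may differ between versions.
 */
#define PF_KSWAPD	0x00020000	/* this task is kswapd */

static inline int
current_is_kswapd(void)
{
	return (current->flags & PF_KSWAPD);
}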

@amotin (Member) commented Jul 5, 2024

The zfs_arc_shrinker_limit default (10000) seems low to me.

The 10000 default is definitely too low, since the kernel does not always evict that amount, but uses it as a base to which it applies memory pressure, starting from 1/1024th and increasing. It means that by the time the kernel actually requests the full 10000 pages, it will already have evicted everything it can from the page cache, which is not exactly the goal here. If I were to pick a value, I would probably make it a fraction of arc_c, since no fixed value will ever be right (either too small for most systems or too big for small ones).
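
As a rough illustration of that idea (a hypothetical sketch, not a concrete proposal; the 1/128th fraction and the 10000-page floor are made-up values), the limit could be derived from the current ARC target size instead of a fixed page count:

/*
 * Hypothetical sketch: size the shrinker limit as a fraction of the
 * current ARC target size (arc_c) instead of a fixed number of pages.
 * The 1/128th fraction and the 10000-page floor are illustrative only.
 */
static int64_t
arc_shrinker_limit_pages(void)
{
	int64_t limit = (int64_t)(arc_c >> 7) / PAGE_SIZE;

	return (MAX(limit, 10000));
}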

This patch dynamically sets the limit as requested by kswapd, which asks for either 131072 or 0 (direct reclaim), so the kernel module parameter can be removed.

Reading your response above, I am still not getting why you suggest that we should ignore indirect memory pressure but partially obey direct. The algorithm is pretty much the same in both cases. Do you have actual data showing that it helps?

This matches my observations.

Good. And what does this patch do about it?

@shodanshok (Contributor, Author) commented Jul 6, 2024

The 10000 default is definitely too low, since the kernel does not always evict that amount, but uses it as a base to which it applies memory pressure, starting from 1/1024th and increasing. It means that by the time the kernel actually requests the full 10000 pages, it will already have evicted everything it can from the page cache, which is not exactly the goal here.

Even worse, if it has no more page cache to evict and no swap, it invokes the OOM killer.

If I were to pick a value, I would probably make it a fraction of arc_c, since no fixed value will ever be right (either too small for most systems or too big for small ones).

Do we have something similar in the form of zfs_arc_shrink_shift?

Reading your response above, I am still not getting why you suggest that we should ignore indirect memory pressure but partially obey direct. The algorithm is pretty much the same in both cases. Do you have actual data showing that it helps?

Maybe I am wrong, but this patch does not ignore indirect memory pressure any more than the current behavior does. If anything, the variant shown in #16313 (comment) would increase reactivity to indirect memory reclaim via kswapd (which asks for many more pages than 10000). What this patch really does is ignore the limit when direct reclaim is requested: since the kernel enters direct reclaim only when it is low on memory, the ARC is allowed to shrink as if zfs_arc_shrinker_limit=0. This greatly reduces OOM in my (limited) tests.

As a side note, it restores the correct behavior of echo 3 > /proc/sys/vm/drop_caches, which can be handy in some cases.

Good. And what does this patch do about it?

This patch does nothing to prevent excessive swapping, but it greatly helps to prevent OOM on machines without swap. That was my intent, at least - am I missing something?

Thanks.

@amotin (Member) commented Jul 8, 2024

If I were to pick a value, I would probably make it a fraction of arc_c, since no fixed value will ever be right (either too small for most systems or too big for small ones).

Do we have something similar in the form of zfs_arc_shrink_shift?

I was thinking about zfs_arc_shrink_shift also, but it is not used for Linux shrinker calls, only for ZFS's own arc_reap(), which I am not sure activates often on later Linux kernels, possibly due to an inability to predict reserved_highatomic. I haven't decided what to do about it, focusing instead on whether we should apply the limit to the count() call or somehow to the scan() call instead.
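
As background on the count()/scan() calls mentioned above: a Linux shrinker registers two callbacks, count_objects() reporting how many objects are freeable and scan_objects() actually freeing up to sc->nr_to_scan of them; the question is which of the two the limit belongs in. A generic sketch of that interface (illustrative only, not ZFS's actual arc_shrinker):

#include <linux/atomic.h>
#include <linux/kernel.h>
#include <linux/shrinker.h>

/* Hypothetical counter standing in for "number of freeable objects". */
static atomic_long_t example_freeable = ATOMIC_LONG_INIT(0);

static unsigned long
example_count_objects(struct shrinker *shrink, struct shrink_control *sc)
{
	/* Tell the kernel how many objects could be freed right now. */
	return (atomic_long_read(&example_freeable));
}

static unsigned long
example_scan_objects(struct shrinker *shrink, struct shrink_control *sc)
{
	/* Free up to sc->nr_to_scan objects and report how many were freed. */
	unsigned long freed = min_t(unsigned long, sc->nr_to_scan,
	    atomic_long_read(&example_freeable));

	atomic_long_sub(freed, &example_freeable);
	return (freed);
}

static struct shrinker example_shrinker = {
	.count_objects	= example_count_objects,
	.scan_objects	= example_scan_objects,
	.seeks		= DEFAULT_SEEKS,
};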

Reading your response above, I am still not getting why you suggest that we should ignore indirect memory pressure but partially obey direct. The algorithm is pretty much the same in both cases. Do you have actual data showing that it helps?

Maybe I am wrong, but this patch does not ignore indirect memory pressure any more than the current behavior does.

I didn't mean that the patch ignores indirect more, but that indirect kswapd reclaim now has no limits. Is that somehow more sane? The code seems to be quite shared.

@shodanshok (Contributor, Author) commented Jul 9, 2024

I didn't mean that the patch ignores indirect more, but that indirect kswapd reclaim now has no limits. Is that somehow more sane? The code seems to be quite shared.

The original patch version (the one committed) disables the limit only when current_is_kswapd returns 0, which I (perhaps wrongly) understand to mean direct reclaim.

The other version proposed in #16313 (comment) sets the limit to the value returned by current_is_kswapd which, in my testing on Debian 12 and Rocky 9, seems to always be 0 (direct reclaim) or 131072 (i.e., a limit roughly 13x higher than the default, which provides fast ARC shrinking without total collapse).

It seems to me that in no way am I ignoring the limit for indirect reclaim initiated by kswapd. Am I wrong?

Thanks.

@shodanshok (Contributor, Author)

Well, I had a basic misunderstanding of what current_is_kswapd does: this function does not return a number of pages to be freed, but rather a bitmap flag. So it should not be used as a limit - please discard the approach discussed in #16313 (comment).

That said, the patch as committed (with no limit for direct reclaim, when current_is_kswapd is 0) should be fine. Maybe I should add a commit to raise the limit to a more reasonable level, e.g. 65K-128K pages (which seems to work fine in my tests).

@shodanshok (Contributor, Author)

I resolved a conflict with the latest code.

@amotin the way I understand the code, this patch effectively limits reclaim to zfs_arc_shrinker_limit only when the current reclaim thread is kswapd and zfs_arc_shrinker_limit > 0. Direct reclaim and reclaim by other kernel threads (e.g. khugepaged) should be unlimited. Am I wrong?
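
For illustration, the behavior I describe corresponds to something like the following in the shrinker path (a sketch consistent with the diff quoted earlier in this thread, not the literal committed hunk):

/*
 * Apply zfs_arc_shrinker_limit only to kswapd-driven (indirect) reclaim;
 * direct reclaim and other kernel threads get an effectively unlimited cap.
 */
int64_t limit = INT64_MAX;

if (current_is_kswapd() && zfs_arc_shrinker_limit > 0)
	limit = zfs_arc_shrinker_limit;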

Thanks.

@amotin (Member) left a comment

Reviewing some data from our performance team with zfs_arc_shrinker_limit=0, I see a case where the system has 2/3 of its RAM free, but something (I guess it may be kswapd) requests ARC eviction. It makes me understand your wish to restrict it specifically. But without a real understanding of what is going on there, I am not exactly comfortable proposing a workaround, since it clearly cannot be called a proper solution. I'll need to investigate it. One thing specific to that system is that it is a 2-node NUMA machine, which makes me wonder if the kernel may be trying to balance it or do something similar.

Meanwhile, if we go your way, I would simplify the patch to what I proposed below. And please rebase and collapse your commits; a 5-line change is not worth 4 commits.

Review comment on module/os/linux/zfs/arc_os.c (outdated, resolved)
@amotin (Member) commented Jul 31, 2024

@shodanshok You should rebase your branch onto master, not merge.

zfs_arc_shrinker_limit (default: 10000) avoids ARC collapse
due to excessive memory reclaim. However, when the kernel is
in direct reclaim mode (ie: low on memory), limiting ARC reclaim
increases OOM risk. This is especially true on system without
(or with inadequate) swap.

This patch ignores zfs_arc_shrinker_limit when the kernel is in
direct reclaim mode, avoiding most OOM. It also restores
"echo 3 > /proc/sys/vm/drop_caches" ability to correctly drop
(almost) all ARC.

Signed-off-by: Gionatan Danti <[email protected]>
@shodanshok (Contributor, Author)

@shodanshok You should rebase your branch onto master, not merge.

@amotin you are right, sorry. I hope the last force-push (after rebase) is what you asked for. Thanks.

@shodanshok (Contributor, Author)

But without a real understanding of what is going on there, I am not exactly comfortable proposing a workaround, since it clearly cannot be called a proper solution. I'll need to investigate it.

You are right, it is not a proper solution. The real solution would be to never ignore kernel reclaim requests, but setting zfs_arc_shrinker_limit=0 on recent (MGLRU-enabled) kernels seems to cause ARC collapse (as a side note, thanks for your work in this area).

Still, limiting only kswapd reclaims should be an improvement: direct reclaims (and reclaims from kernel threads other than kswapd) are urgent ones, and the ARC should shrink promptly.

One thing specific to that system is that it is a 2-node NUMA machine, which makes me wonder if the kernel may be trying to balance it or do something similar.

NUMA may play a role, but I see similar behavior on single-socket (and even single-core) machines.

Regards.

@behlendorf added the label "Status: Design Review Needed (architecture or design is under discussion)" on Aug 16, 2024
@behlendorf (Contributor) left a comment

Do we have any updated results for how this change has been working in practice? It makes good sense to me, but I'd like to know a little more about how it's been tested.

@behlendorf added the label "Status: Accepted (ready to integrate: reviewed, tested)" and removed "Status: Design Review Needed" on Aug 19, 2024
@shodanshok (Contributor, Author)

@behlendorf I have done some testing (simulating memory pressure and reclaim) on both a Debian 12 and a Rocky 9 virtual machine, but nothing more. Is there anything specific you would like me to test on these machines?

Anyway, the whole idea is based on current_is_kswapd() returning 0 for both direct reclaims and reclaims initiated by other kernel subsystems (e.g. khugepaged). Am I reading the code right, or am I missing something?

Thanks.

@behlendorf (Contributor)

@shodanshok you've got it right. Thanks for the info, I don't have any additional specific tests. I've merged this to get some additional miles on it.

@behlendorf merged commit bbe8512 into openzfs:master on Aug 21, 2024 (23 checks passed)