Slow memory leak of Fluentd v0.14 compared to v0.12 #1941
Hmm... weird. |
@repeatedly - We are setting up soak tests for various Fluentd versions (v0.12, v0.14 and v1.1) and will keep you posted once we have some numbers for the latest version. Hopefully it is fixed there (there seem to be some fixes between those versions). |
I am on fluentd-1.0.2 and am experiencing the same issue, i.e. memory usage keeps growing. I am just tailing many files and sending to an aggregator instance. I couldn't find anything interesting in the dentry cache or in a perf report, but I will try the GC options as suggested here: https://docs.fluentd.org/v1.0/articles/performance-tuning-single-process#reduce-memory-usage |
RUBY_GC_HEAP_OLDOBJECT_LIMIT_FACTOR=0.9 had no effect. Above is the memory usage by Ruby/Fluentd after the GC change. There is a spike just after 3AM; interestingly, this is when we move older files out of the path tailed by Fluentd. Would setting 'open_on_every_update' in the in_tail plugin help at all? |
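For reference, the GC tuning from the linked guide is applied by exporting the variable in the environment that launches Fluentd; a minimal sketch (the config path is a placeholder, not taken from this thread):

```sh
# Sketch: launch fluentd with a lower old-object promotion factor.
# Values below the default 2.0 trigger major GC more often, trading
# some CPU for a smaller resident set.
RUBY_GC_HEAP_OLDOBJECT_LIMIT_FACTOR=0.9 \
  fluentd -c /etc/fluent/fluent.conf
```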
@cosmok That's interesting. How many files do you watch and could you paste your configuration to reproduce the problem? |
@repeatedly Config (simplified and scrubbed to remove sensitive details). Forwarder config (this is where the problem is):
|
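The scrubbed config itself is not reproduced above; a minimal sketch of the general shape described (tail many files and forward them to an aggregator), where every path, tag and host is a placeholder rather than the poster's value:

```
# Sketch only: illustrative tail -> forward shape, not the actual scrubbed config.
<source>
  @type tail
  path /var/log/app/*.log              # placeholder path
  pos_file /var/log/fluentd/app.pos
  tag app.*
  <parse>
    @type none
  </parse>
</source>

<match app.**>
  @type forward
  <server>
    host aggregator.example.local      # placeholder aggregator host
    port 24224
  </server>
</match>
```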
I think I'm also hitting this issue, using a container-based setup: we have one of these containers on each node of a Kubernetes cluster. We were hitting some Buffer Overflow issues. These are the latest values for the buffer plugin in the conf (it's 0.x syntax, but I've checked that it gets converted correctly): |
I've played with most of the values; the ones above are the first that gave me stability in some of the pods. The pods leaking memory seem to start doing so after logging messages about buffer overflow: |
The above happens for 1674 lines, and after all of that is logged, the memory leak shows up. I'm repeating the tests to see if this is consistent, but the memory leaked for almost 50 minutes and then started to decrease at the same pace it was increasing during the leak (about 7 MiB per minute). Please let me know if there's any more information I could provide. It's been a bit difficult for me to reproduce; it only happens in our most logging-busy cluster, but I thought the timing of the buffer overflow and the memory leak was worth mentioning. Here's a picture of the monitoring of one of these containers. The drop in CPU happens when the memory starts to grow, and the logs show that this was the moment the Buffer Overflow happened. The memory starts to decrease when the Network I/O suddenly becomes much lower. The Prometheus plugin exposes an endpoint that we use for the liveness probe, and it remained live for all of this monitored time. |
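The buffer values themselves were not captured above; a sketch of the kind of buffer tuning under discussion (v1 syntax, every number and path illustrative rather than the poster's settings):

```
# Sketch only: illustrative buffer section of a forward output.
<match kubernetes.**>
  @type forward
  <buffer>
    @type file
    path /var/log/fluentd-buffers/kubernetes.buffer
    chunk_limit_size 8M
    total_limit_size 512M
    flush_interval 5s
    flush_thread_count 2
    overflow_action block        # alternatives: throw_exception, drop_oldest_chunk
    retry_max_interval 30
  </buffer>
  <server>
    host aggregator.example.local  # placeholder
    port 24224
  </server>
</match>
```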
@rubencabrera Thanks for the detailed report. I'm now re-working on this issue and trying to reproduce the problem with your configuration. |
Thanks @repeatedly. The report above was done in the middle of an upgrade of Fluentd and all the plugins we use. The problems we had with buffering seem solved now and we don't see the containers getting restarted so often (so I presume the memory issue is under control in this scenario). I can only get a long-running window in that cluster on weekends, and this one I'll be out, but I could try again to remove the resource limit and see what happens. If the leak doesn't occur, maybe I could try to get the buffer error back to see if that's the problem. If you have any other ideas that could help, please let me know. |
@repeatedly - We just tried the latest version (memory-usage comparison charts omitted here). It almost felt like GC kicked in really late and did not bring the memory allocation down sufficiently, so we also tried setting the GC factor mentioned earlier. Any luck with the reproduction? |
@qingling128 Thanks. You use the v0.12 API with v1.x, right? @tahajahangir Do you use the fluentd v1.x series? |
We are going to test 1.2.2 next week. |
When I use jemalloc for Ruby, the memory leak disappears.
Are there some bugs in the C extensions? |
@repeatedly - We are using the output plugin API documented at https://docs.fluentd.org/v1.0/articles/api-plugin-output. In fact, I set up a stub gem and was still seeing the memory leak. The stub gem just sleeps 0.06 seconds for each chunk and does not do anything else: |
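The stub plugin's source is not included above; a minimal sketch of an output plugin that behaves as described (sleep 0.06s per chunk, discard the data). The plugin name and class here are illustrative, not necessarily the poster's actual gem:

```ruby
# Sketch only: stub buffered output that sleeps per chunk and drops the data.
require 'fluent/plugin/output'

module Fluent
  module Plugin
    class BufferOutputStub < Output
      Fluent::Plugin.register_output('buffer_output_stub', self)

      def write(chunk)
        # Pretend to ship the chunk, then discard it.
        sleep 0.06
      end
    end
  end
end
```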
@qingling128 The problem was fixed when I set LD_PRELOAD to libjemalloc, or used a Ruby package compiled with the --with-jemalloc option. You can refer to https://www.speedshop.co/2017/12/04/malloc-doubles-ruby-memory.html. |
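For anyone trying the same thing, preloading jemalloc looks roughly like this; the library path varies by distribution, and the Debian-style path below is an assumption:

```sh
# Sketch: run fluentd with jemalloc preloaded instead of glibc malloc.
LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libjemalloc.so.1 \
  fluentd -c /etc/fluent/fluent.conf
```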
I assume qingling128 uses jemalloc properly. |
@lee2014 @repeatedly - Yeah, we were using jemalloc, and I have added a step to explicitly enforce that in the container in case the package did not set it up properly. Still getting similar results. |
BTW, our configuration uses a combination of the tail, parser, systemd, record_reformer and detect_exceptions plugins. The full configuration can be found below. I'm setting up another experiment to trim that down to just a simple tail plugin (so that we can rule out some noise). Will update with collected data. |
If you have time, could you use in_dummy to check whether the leak still happens? |
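A minimal in_dummy source of the kind being suggested (tag, rate and message are illustrative):

```
# Sketch: generate synthetic load with in_dummy instead of tailing real files.
<source>
  @type dummy
  tag test.dummy
  rate 1000                                  # messages per second (illustrative)
  dummy {"message": "sample log line for leak testing"}
</source>
```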
Sure, I'll give in_dummy a try. BTW, I tried the two configuration settings below. One triggered the memory issue while the other did not. The good one uses only the in_tail plugin; the bad one uses in_tail plus the multi_format, parser, record_reformer and record_transformer plugins. I just set up a few more experiments with combinations of multi_format, parser, record_reformer and record_transformer. Hopefully that will narrow down the culprit. |
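The actual pipelines are not reproduced above; a sketch of the kind of filter chain the "bad" configuration adds on top of in_tail (plugin options and fields are illustrative):

```
# Sketch only: parser with multi_format plus record_transformer, as an
# example of the extra processing in the "bad" pipeline.
<filter app.**>
  @type parser
  key_name message
  <parse>
    @type multi_format
    <pattern>
      format json
    </pattern>
    <pattern>
      format none
    </pattern>
  </parse>
</filter>

<filter app.**>
  @type record_transformer
  <record>
    hostname "#{Socket.gethostname}"
  </record>
</filter>
```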
In fact, scratch #1941 (comment): the "good" one did not have the log generator properly set up. I just fixed the log generator and also set up another experiment. Will let it soak a bit and update the thread. |
I'm also hitting an issue: I'm using 2 instances of Fluentd in Kubernetes and each of them consumes 1 GB of memory. |
Hi @repeatedly, any luck reproducing this? |
@qingling128 I am running your log generator with the stub code but still no luck on my Ubuntu environment. How many files do you tail? I'm now testing with 9 files (9000 msg/sec). |
We are tailing 30 files, each with 30 msg/sec. |
By the way, I just started a simple setup in order to provide some reproduction steps without getting deeply involved in our infrastructure: https://github.com/qingling128/fluent-plugin-buffer-output-stub/tree/master. I only just set up the experiment, so I will let it run overnight to see if it can reproduce the issue. The configuration and setup are similar to, but not exactly the same as, what we have in Kubernetes, but I will iterate on this to try to reproduce it. |
Thanks. I will try it with a similar setting. |
Does anyone reproduce this problem in their environment without Docker/k8s? Here is my environment:
This is based on the above; I just changed:
fluentd: 1.2.4 |
I also tried to reproduce it in a non-docker & non-k8s environment with similar configurations. No luck yet. |
Hi @repeatedly @qingling128, I am using neither Docker nor k8s. My config is as per my comment here: #1941 (comment) |
Just a thought: would log rotation contribute to the issue? As I thought about the difference between the two setups (k8s vs. no k8s), this is the first thing that crossed my mind. Current GKE log rotation happens when a log file exceeds 10 MB. At a load of 100 KB/s, the log file is rotated every 10 * 1024 / 100 = 102 seconds. |
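For context, these are the in_tail parameters that interact with rotation; a sketch with illustrative values (the paths and numbers are not from this thread):

```
# Sketch: in_tail options relevant to rotated files.
<source>
  @type tail
  path /var/log/containers/*.log       # placeholder path
  pos_file /var/log/fluentd/containers.pos
  tag kube.*
  rotate_wait 5s         # keep watching a rotated file for a short grace period
  refresh_interval 60    # rescan the path pattern for added/removed files
  read_from_head true
  <parse>
    @type none
  </parse>
</source>
```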
I see. I will run a rotation script with par.rb and observe memory usage. |
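A crude way to exercise rotation while a generator keeps writing, roughly what such a script does (the interval and paths are illustrative, not the actual par.rb):

```sh
# Sketch: rotate the tailed file periodically while fluentd's in_tail watches it.
while true; do
  sleep 100
  mv /var/log/test/test.log /var/log/test/test.log.1    # rotate out
  : > /var/log/test/test.log                             # recreate an empty file
done
```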
I noticed that timers for rotated files are not released. Will fix and observe memory usage. |
Sounds promising. Keep us posted. :D Thanks a lot! |
patch is here: #2105 |
Great! Let me know if there is anything I could help test. :) |
in_tail: Fix rotation-related resource leak. fix #1941
I released v1.2.5.rc1 for testing. |
Great. I've set up some test for that version. Will keep you posted. |
Seems like that patch fixed the issue we had. Thank you so much, @repeatedly! Gem versions we tested with:
|
BTW, when are we expecting a formal release of v1.2.5? |
Released v1.2.5. Thanks for the testing. |
Thank you! |
Fluentd version: 0.14.25
Environment: running inside a debian:stretch-20180312 based container.
Dockerfile: here
We noticed a slow memory leak that built up over a month or so.
The same setup running with Fluentd 0.12.41 had stable memory usage over the same period of time.
Still investigating and trying to narrow down versions, but we wanted to create a ticket to track this.
Config: