Fluentd retains excessive amounts of memory after handling traffic peaks #1657
Comments
Did you try this setting? https://docs.fluentd.org/v0.12/articles/performance-tuning#reduce-memory-usage
Yes, even with constraints on the oldobject factor, the problem persists. In fact, even with more draconian restrictions on the garbage collector the problem persists, e.g.:

```
$ RUBY_GC_HEAP_OLDOBJECT_LIMIT_FACTOR=0.9 \
  RUBY_GC_HEAP_GROWTH_FACTOR=1.05 \
  RUBY_GC_MALLOC_LIMIT_MAX=16777216 \
  RUBY_GC_OLDMALLOC_LIMIT_MAX=16777216 \
  td-agent --no-supervisor -c test.conf &
```

Coupling that with this input pipeline:

```
$ td-agent-bit -i tcp://127.0.0.1:10130 -t test -o forward://127.0.0.1:10131 &
$ seq 1 3000 | sed 's/.*/{"num": &, "filler": "this is filler text to make the event larger"}/' > /dev/tcp/localhost/10130 &
```

still shows the same result: memory climbs during the burst and is never released afterwards.

Honestly, it seems like the real issue here is the Ruby 2.1 garbage collector; as far as I can tell, it never releases memory to the OS that it allocates, thereby damning any process that has an even momentary need for a large quantity of memory to hold on to that memory for the rest of its lifetime. (Please correct me if I'm wrong here.) If the Ruby GC issue cannot be fixed by some form of additional configuration, then perhaps Fluentd could use some type of backpressure mechanism to avoid ingesting input faster than it can truly process it, rather than accumulating large queues in memory?
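To make the retention visible while replaying the burst above, one option (not part of the original report; the `pgrep` pattern and sampling interval are assumptions) is to poll the worker's resident-set size:

```sh
# Hedged sketch: sample the resident-set size (RSS) of a td-agent process every
# 5 seconds so the before/during/after-burst memory profile becomes visible.
while sleep 5; do
  pid=$(pgrep -f 'td-agent --no-supervisor' | head -n 1)
  [ -n "$pid" ] || continue                  # no matching process yet
  rss_kb=$(ps -o rss= -p "$pid")
  printf '%s pid=%s rss=%d MiB\n' "$(date +%T)" "$pid" "$((rss_kb / 1024))"
done
```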
+1
It's a stale issue, so I'm closing it. If you still have any problem, updating Fluentd and Ruby might help.
I have re-tested this issue using the latest td-agent 3.5.1 RPM package for RHEL 7. That package includes Fluentd 1.7.4 and embeds Ruby 2.4.0. The problem remains exactly as it was originally reported. Nothing has been fixed, despite the intervening 2 years. The issue should be re-opened.

P.S. You as the Fluentd developers and project administrators collectively have the right to run your project however you see fit. However, making claims on your website about Fluentd's suitability for production use and its performance is not consistent with this kind of callous disregard for the validity of an easily reproducible issue that has serious impacts on Fluentd's production suitability. I gave very simple steps to reproduce this issue, and it only took me a few minutes to download the latest td-agent RPM, install it, and copy and paste the commands from my original report to see that the outcome remained the same. You could have trivially done the same. The fact that you could not be bothered to do so, but instead chose to try to bury and ignore this problem, speaks volumes, especially as another user indicated as recently as June 27 that it is still affecting people. If you truly feel that is the appropriate response, then you should also remove the false claims of performance and production-suitability from the Fluentd website.
Thank you for re-testing it. Then I should re-open this issue.
I'm having the same issue using the td-agent 3.8.0 RPM package for Amazon Linux 2. That package includes Fluentd 1.11.1 and embeds Ruby 2.4.10. Any news here? Work in progress?

Our Fluentd is in production with 4 aggregators ingesting at the same time into Elasticsearch. So far it is stable, but memory always settles close to 100% usage, leaving just 250-300 MB free, which I don't think makes any sense... I don't know what else to test... I've been checking this for weeks, changing many configurations and different versions, without any clue. I even tried adjusting the GC variables as @joell did in the past, but the behaviour is the same. You can see here how a new aggregator doesn't release memory until it is close to 100%. The only thing adding a new aggregator improves is a slight reduction in CPU usage. Our buffer size (which I think tracks this memory issue); as I read in other threads, with a memory buffer the behaviour is different... Total network I/O (the peaks also correlate with the amount of memory needed). Please help, thanks in advance.
@cede87: As I noted in my comment on April 9, 2017, the underlying issue here appears to be in the default Ruby garbage collector and memory allocator. The Fluentd developers could directly avoid this by applying a backpressure mechanism or by spooling incoming data to disk instead of holding it in memory. Alternatively, the Ruby memory allocation issue can be indirectly avoided by replacing or manipulating the Ruby memory allocator. One method is to replace the allocator with jemalloc (though different versions are reported to be more effective than others); this approach was documented as being done by the Fluentd devs, but as I noted in the original issue text it doesn't look like jemalloc is actually used in the build that produces the RPM. Another method would be to try to manipulate the allocator's behavior through settings like the `MALLOC_ARENA_MAX` environment variable. A summary of the underlying problem and some of the techniques you might be able to try -- including going so far as patching the Ruby garbage collector yourself -- can be found in this article. Best of luck.
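As an illustration of those allocator-level knobs, here is a minimal hedged sketch (values are illustrative, not recommendations, and `MALLOC_ARENA_MAX` only matters when glibc malloc, not jemalloc, is the active allocator):

```sh
# Hedged sketch: constrain glibc malloc before starting td-agent.
# Only relevant if glibc malloc (not jemalloc) is actually in use.
export MALLOC_ARENA_MAX=2               # cap per-thread arenas to reduce fragmentation
export MALLOC_TRIM_THRESHOLD_=131072    # return freed memory to the OS more eagerly
td-agent --no-supervisor -c test.conf
```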
@joell thanks for the quick reply. I did the same checks you did in the past, and I could verify that we are using jemalloc with our RPM installation. So if I'm not mistaken, we are not able to use the `MALLOC_ARENA_MAX` environment variable... even so, we are suffering the same problems. Any suggestion?

```
[root@x ~]# pmap 4057 | grep jemalloc
```

Thanks!
@cede87: The presence of jemalloc in your `pmap` output does indicate the library is loaded, so the glibc `MALLOC_ARENA_MAX` approach would not apply to your setup.
Glancing at package content listings online, it looks like you would see a newer jemalloc bundled with that td-agent release; the older 3.6.0 release has been reported to release memory more readily, so downgrading the bundled library may be worth trying.
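One hedged way to check which jemalloc a given td-agent installation actually bundles (the path assumes the RPM layout):

```sh
# Hedged sketch: inspect the jemalloc library shipped with the td-agent package.
LIB=/opt/td-agent/embedded/lib/libjemalloc.so    # assumed install path for the RPM
ls -l "$LIB"                                     # the symlink target often encodes the version
strings "$LIB" | grep -E '^[0-9]+\.[0-9]+\.[0-9]+' | head -n 3   # jemalloc embeds its version string
```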
@joell many, many thanks for your suggestions. I was able to change the jemalloc version from 5.x to 3.6.0 with td-agent 3.8.0 on one server (just to test). Notice the difference. So I can confirm the following:
I think the Fluentd developers should take a look at this.
Would appreciate a solution; the memory usage has peaked and isn't coming back down at all.
Hi @Adhira-Deogade, please follow my notes to downgrade the jemalloc version; a consolidated sketch of the procedure is shown below.
Note: first delete the old files in `/opt/td-agent/embedded/lib` (`cd /opt/td-agent/embedded/lib`). Note: if you run `ls` there, you should see these symbolic links:
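As a hedged, consolidated sketch of the general procedure (the download URL, paths, and configure options are assumptions, and replacing the bundled library is unsupported):

```sh
# Hedged sketch: build jemalloc 3.6.0 and repoint td-agent's bundled libjemalloc at it.
set -euo pipefail
curl -LO https://github.com/jemalloc/jemalloc/releases/download/3.6.0/jemalloc-3.6.0.tar.bz2
tar xjf jemalloc-3.6.0.tar.bz2
cd jemalloc-3.6.0
./configure --prefix=/opt/jemalloc-3.6.0
make -j"$(nproc)" && make install

# Back up the bundled library, then repoint the symlink that td-agent preloads.
cd /opt/td-agent/embedded/lib
cp -a libjemalloc.so "libjemalloc.so.bak.$(date +%F)" || true
ln -sf /opt/jemalloc-3.6.0/lib/libjemalloc.so.1 libjemalloc.so

systemctl restart td-agent   # pick up the new allocator via LD_PRELOAD
```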
I hope it helps you.
This issue has been automatically marked as stale because it has been open 90 days with no activity. Remove the stale label or comment, or this issue will be closed in 30 days.
This issue should remain open until it is resolved. It has continued to affect people since it was reported in 2017, and the current best workarounds are laborious and invasive.
This issue has been automatically marked as stale because it has been open 90 days with no activity. Remove the stale label or comment, or this issue will be closed in 30 days.
I repeat: this issue should remain open until it is resolved. It has continued to affect people since it was reported in 2017, and the current best workarounds are laborious and invasive.
This issue is still affecting us. The only workaround is changing the jemalloc version, which is what we are currently doing, even in production environments. That is not an ideal solution.
Thank you for notifying us.

td-agent loads jemalloc via:

```sh
if [ -f "${TD_AGENT_HOME}/embedded/lib/libjemalloc.so" ]; then
  export LD_PRELOAD="${TD_AGENT_HOME}/embedded/lib/libjemalloc.so"
fi
```

and, in the systemd unit template:

```
Environment=LD_PRELOAD=<%= install_path %>/embedded/lib/libjemalloc.so
```

Note that td-agent 3 uses jemalloc 4.5.0, not 5.x.
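Building on that mechanism, a hedged sketch of overriding the preloaded allocator for a systemd-managed td-agent (the drop-in path and library location are assumptions; a later `Environment=` assignment overrides the packaged one):

```sh
# Hedged sketch: swap the preloaded allocator via a systemd drop-in
# instead of editing packaged files.
sudo mkdir -p /etc/systemd/system/td-agent.service.d
sudo tee /etc/systemd/system/td-agent.service.d/jemalloc.conf <<'EOF'
[Service]
Environment=LD_PRELOAD=/opt/jemalloc-3.6.0/lib/libjemalloc.so.1
EOF
sudo systemctl daemon-reload
sudo systemctl restart td-agent
```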
Hmm, I confirmed that jemalloc 3.6.0 consumes less memory than jemalloc 5.2.1 (td-agent 4.1.0's default) in this case.

jemalloc 3.6.0:

jemalloc 5.2.1:

You can switch the jemalloc version easily via `LD_PRELOAD`. But I'm not sure it is always more efficient and worth replacing.
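For a quick one-off comparison outside the packaged service, the preload can also be set directly on the command line (the library path is an assumption):

```sh
# Hedged sketch: run a single foreground td-agent with an alternative jemalloc preloaded.
LD_PRELOAD=/opt/jemalloc-3.6.0/lib/libjemalloc.so.1 \
  td-agent --no-supervisor -c test.conf
```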
@ashie: Thank you for looking into this issue and confirming yourself what the community has been reporting. Regarding the efficiency of jemalloc 3.x vs 5.x, the common trend I've read is that 5.x may be a bit faster. However, as Fluentd is both advertised for production use and frequently used in production environments, I would argue that stability is more important than speed here. We have run into issues using Fluentd in production where its memory consumption has grown to the point where it has actively hampered other, more important production services on a host. Ultimately, we've had to move away from Fluentd for certain applications because of this bug.

For the sake of ensuring system stability, I would argue for making jemalloc 3.x the default allocator. If people need greater performance and are confident their use case will not trigger this memory consumption issue, they could use jemalloc 5.x instead via `LD_PRELOAD`. I urge that the default Fluentd configuration prioritize stability over performance.
Thanks for your opinion. I've opened an issue for td-agent: fluent/fluent-package-builder#305
I can confirm that the issue is still very much present on td-agent 4.2.0 / Ubuntu Bionic. We're using it to report Artifactory stats as part of the JFrog monitoring platform, so the configuration is pretty much default. Fluentd's memory usage keeps creeping up, so as a workaround we've applied cgroup limits to it (50% of the host's 128 GB of RAM, which is a massive 64 GB). It took only a few hours after restarting the td-agent service for it to be OOM-killed:
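For anyone applying a similar containment workaround, here is a hedged sketch using systemd's cgroup memory controls (the limits are illustrative and assume cgroup v2; a hard cap only converts unbounded growth into a contained restart, it does not fix the retention):

```sh
# Hedged sketch: bound td-agent's memory so a leak triggers a contained restart
# instead of starving the whole host.
sudo mkdir -p /etc/systemd/system/td-agent.service.d
sudo tee /etc/systemd/system/td-agent.service.d/memory.conf <<'EOF'
[Service]
# Throttle and reclaim above this soft limit.
MemoryHigh=2G
# Hard cap: the service is OOM-killed beyond this point.
MemoryMax=3G
Restart=on-failure
EOF
sudo systemctl daemon-reload
sudo systemctl restart td-agent
```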
@ashie What is the latest status on this issue? This seems like a HUGE flaw in Fluentd. The issue you linked, fluent/fluent-package-builder#305, was closed without action taken. We are facing Fluentd OOM issues in production. Please advise.
Same issue with td-agent 4.4.2 (Fluentd 1.15.3) on a UBI image (RHEL 8). I cannot believe this is not fixed - surely not production ready :(
I can confirm this behavior on td-agent 4.4.1 / Fluentd 1.13.3 (c328422) as well. The memory does not seem to be dynamically deallocated. At initial startup with no traffic, memory consumption is low. After a burst of heavy traffic followed by no traffic, memory stays at the highest point.
This issue may be related to #4174
Regarding the case where
When setting up a simple, two-part Fluentd configuration, (TCP -> forwarder) -> (forwardee -> disk), and giving it 5 million JSON objects to process all at once, resident-set memory consumption jumps from an initial 30 MB to between 200-450 MB, and does not come back down after processing is complete. This is observed using version 2.3.5-1.el7 of the TD Agent RPM package running on CentOS 7. (The version of Fluentd in that package is 0.12.36.)
Steps to reproduce:
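A hedged sketch of a comparable two-stage setup (the configuration contents below are illustrative assumptions; only the file names follow those referenced later in this report):

```sh
# Hedged sketch of a comparable reproduction: two td-agent instances,
# (TCP -> forward) -> (forward -> file), then a burst of JSON events.
cat > test-in.conf <<'EOF'
<source>
  @type tcp
  port 10130
  tag test
  format json
</source>
<match test>
  @type forward
  <server>
    host 127.0.0.1
    port 10131
  </server>
</match>
EOF

cat > test-out.conf <<'EOF'
<source>
  @type forward
  port 10131
</source>
<match test>
  @type file
  path /tmp/test-out
</match>
EOF

td-agent --no-supervisor -c test-out.conf &
td-agent --no-supervisor -c test-in.conf &
sleep 5

# Fire a large burst of JSON objects at the TCP input (bash-specific /dev/tcp).
seq 1 5000000 | sed 's/.*/{"num": &, "filler": "this is filler text to make the event larger"}/' \
  > /dev/tcp/localhost/10130
```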
As you can see from the RSS numbers, each td-agent process started out around 30 MB, and they ended at ~290 MB and ~460 MB, respectively. Neither process will release that memory if you wait a while. (In the real-world staging system we initially discovered this on, memory consumption of the `test-out.conf`-equivalent configuration reached over 3 GB, and the `test-in.conf`-equivalent was a Fluent Bit instance exhibiting a recently-fixed duplication bug.)

Reviewing a Fluentd-related Kubernetes issue during our own diagnostics, we noticed that the behavior we observed seemed similar to the Fluentd behavior described there when built without jemalloc. This led us to check whether the td-agent binary we were using was in fact linked with jemalloc. According to the FAQ, jemalloc is used when building the Treasure Data RPMs, and though we found jemalloc libraries installed on the system, we couldn't find any evidence of jemalloc in the running process's memory. Specifically, we tried the following things:
In short, this leads us to wonder... are the binaries invoked by `td-agent` actually linked with jemalloc? If they are not, is the old memory fragmentation problem that jemalloc solved what we are observing here? (And if they aren't, am I raising this issue in the wrong place, and if so, where should I raise it?)
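For context, a hedged sketch of the kind of checks that can answer that question for a running td-agent (the process selection and paths are assumptions):

```sh
# Hedged sketch: check whether a running td-agent worker actually has jemalloc mapped.
pid=$(pgrep -f 'td-agent' | head -n 1)
grep -c jemalloc "/proc/$pid/maps" || echo "jemalloc not mapped"

# Later td-agent packages load jemalloc at runtime via LD_PRELOAD (see the
# maintainer's note above), so also check the environment the process started with.
tr '\0' '\n' < "/proc/$pid/environ" | grep '^LD_PRELOAD' || echo "LD_PRELOAD not set"
```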