CPU and memory usage regression between Fluentd 1.11.2 and 1.12.3 #3389
Comments
I'm setting up more experiments for the versions between these two to narrow down the root cause.
I now suspect that #3391 might be a cause of this issue, though there is no evidence yet.
At the company where I'm currently employed, we're using fluentd on 50+ servers via the td-agent .deb packages available from http://packages.treasuredata.com/4/ubuntu/bionic/. Almost all of the servers run .deb package version 4.1.0-1, i.e. fluentd 1.12.1, and so far we have only experienced the 100% CPU issue on servers we happened to upgrade to 4.1.1-1 / 1.12.3. My point is that the issue can perhaps be narrowed down to between versions 1.12.1 and 1.12.3. When the 100% CPU issue is present, a sigdump ( Sadly, we have not found a way to reliably reproduce the issue, and I don't have anything else to add at this stage.
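For reference, the dump can be triggered by signaling the busy worker. A minimal sketch (a hypothetical helper script, assuming sigdump's default SIGCONT signal and /tmp/sigdump-&lt;pid&gt;.log output path):

```ruby
# Hypothetical helper: trigger a sigdump from the fluentd worker showing 100% CPU.
# The sigdump gem bundled with fluentd/td-agent installs a SIGCONT handler that,
# by default, writes thread backtraces to /tmp/sigdump-<pid>.log.
pid = Integer(ARGV.fetch(0)) # PID of the busy fluentd worker process
Process.kill("CONT", pid)
puts "Check /tmp/sigdump-#{pid}.log for the thread dump"
```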
@mtbtrifork Thanks for your report! It's very informative. By the way, we should discuss this cause at #3382, because it seems to share the same cause as yours, while this issue (#3389) hasn't yet been confirmed to have the same cause.
@ashie I'm sorry about the confusion. Let me correct myself: the two versions of td-agent don't ship with the same Ruby version.
Yes, so td-agent 4.1.1 can hit the Ruby issue mentioned at #3382 (comment).
Now I believe that the following similar issues (not sure whether they are the same as yours) are caused by Ruby's resolv: #3387 #3382
If you aren't using excon 0.80.0 or later, or if resolv 0.2.1 doesn't resolve your problem, we should suspect another cause.
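To try that, the gem versions can be pinned. A minimal sketch of a hypothetical Gemfile fragment (fluentd can load a Gemfile via its --gemfile option; on a td-agent deployment you would typically upgrade these gems with td-agent-gem instead):

```ruby
# Hypothetical Gemfile fragment — the versions are the ones mentioned above,
# not a confirmed fix; adjust to match your deployment.
source "https://rubygems.org"

gem "excon",  ">= 0.80.0"  # avoids the DNS-related stalls discussed in #3382
gem "resolv", ">= 0.2.1"   # standalone resolv gem with the fix referenced above
```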
Thanks, that's very helpful! I'm interested in whether v1.12.4 fixes this issue, because v1.12.0.rc0 - v1.12.3 leak tail watchers after file rotation, which can cause CPU & memory spikes: #3393
Glad that helped. I just set up an experiment with
@qingling128 How is your Fluentd 1.12.4 experiment going?
Thanks for the report!
Occasionally we see sudden memory bumps (both in #3389 (comment) and #3389 (comment)), and the timing looks arbitrary. That might mean the sudden memory bump is non-deterministic, e.g. certain agent versions could show lower memory usage during the window I happened to observe, but chances are I was just lucky. If that's the case, the test comparison probably only gives us a general trend rather than narrowing the root cause down to exactly versions A and B. I agree that we probably need to look at a few more versions near that range.
As for https://github.com/Stackdriver/kubernetes-configs/blob/1d0b24b650d7d044899c3e958faeda62acbae9c6/logging-agent.yaml#L131: could someone reproduce it with a simple fluent.conf?
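For example, a standalone starting point might look something like the sketch below (hypothetical paths and tags, not the Stackdriver config linked above): tail a frequently rotated file, discard the records, and watch the worker's CPU and RSS.

```
# Minimal repro sketch: in_tail on a log file that is rotated frequently,
# with records discarded so only the tail/rotation code path is exercised.
<source>
  @type tail
  path /var/log/soak/test.log
  pos_file /var/lib/fluentd/soak-test.pos
  tag soak.test
  read_from_head true
  <parse>
    @type none
  </parse>
</source>

<match soak.**>
  @type null
</match>
```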
Closing. If it reproduces again, please let us know.
Describe the bug
When upgrading from Fluentd 1.11.2 to the latest 1.12.3, our soak tests detected that memory and CPU usage increased by a non-trivial percentage, resulting in Fluentd being unable to keep up with the 1000 log entries / second throughput used as the soak test baseline.
The diff between the two is:
Expected behavior
CPU and memory usage is stable and log entries are not dropped.
Your Environment
Fluentd runs inside a Debian 9 based container in GKE with a fixed throughput of 1000 log entries per second per VM (3 VMs in total, hence 3000 log entries per second overall).
Your Configuration
https://github.com/Stackdriver/kubernetes-configs/blob/1d0b24b650d7d044899c3e958faeda62acbae9c6/logging-agent.yaml#L131