CPU and memory usage regression between Fluentd 1.11.2 and 1.12.3 #3389

qingling128 · 2021-05-24T18:57:03Z

Describe the bug
When upgrading Fluentd 1.11.2 to the latest 1.12.3, our soak tests detected that the memory and CPU usage increased by a non-trivial percentage, resulting in Fluentd not being able to catch up with 1000 log entries / second throughput (a baseline of the soak test).

The diff between the two is:

gem uninstall fluentd -ax --force
gem install fluentd -v 1.12.3

Expected behavior
CPU and memory usage is stable and log entries are not dropped.

Your Environment
Fluentd runs inside a Debian 9 based container in GKE with fixed throughput of 1000 log entries per second. (In total 3 VMs, which is why in total there are 3000 log entries per second).

Your Configuration
https://github.com/Stackdriver/kubernetes-configs/blob/1d0b24b650d7d044899c3e958faeda62acbae9c6/logging-agent.yaml#L131

The text was updated successfully, but these errors were encountered:

qingling128 · 2021-05-24T19:07:00Z

I'm setting up more experiments for the versions in between these two to narrow down the root cause.

ashie · 2021-05-25T09:38:00Z

I'm now suspecting that #3391 might be a cause of this issue, there is no evidence yet though.

ashie · 2021-05-25T23:24:52Z

Or #3382 #3387 may be related to this.

mtbtrifork · 2021-05-27T07:40:22Z

At the company where I'm currently employed we're using fluentd on 50+ servers via td-agent .deb packages available from http://packages.treasuredata.com/4/ubuntu/bionic/ . Almost all of the servers run .deb package version 4.1.0-1, i.e. fluentd version 1.12.1, and we have so far only experienced the 100% CPU issue on servers where we coincidentally upgraded to 4.1.1-1 / 1.12.3. My point is that the issue can perhaps be narrowed down between versions 1.12.1 and 1.12.3.

When the 100% CPU issue is present, a sigdump (kill -CONT) appears to always show a running thread with the randomisation going on in Ruby's resolv.rb, which I have seen mentioned in #3387 (comment) . I find it difficult to believe, however, that Ruby's resolv.rb code should be causing the issue, as the td-agent .deb package for 1.12.1 seems to ship with the exact same Ruby version (2.7.0, and thereby the exact same resolv.rb) as the package for 1.12.3. And, as mentioned, we believe we don't see the issue with 1.12.1.

Sadly, we have not found a way to reliably reproduce the issue, and I don't have anything else to add at this stage.

ashie · 2021-05-27T09:30:27Z

@mtbtrifork Thanks for your report! It's very informative.
I'm now suspecting Ruby or excon.
As I mentioned at #3382 (comment) there are similar reports.

BTW we should discuss about this cause at #3382, because it's same cause with yours, and this issue (#3389) isn't yet judged as same cause with it.

mtbtrifork · 2021-05-27T10:34:37Z

@ashie I'm sorry about the confusion.

Let me correct myself: The two versions of td-agent don't ship with the same Ruby version:

$ apt install td-agent=4.1.1-1
$ td-agent --version
td-agent 4.1.1 fluentd 1.12.3 (b6e76c09d60ee866de6470f2bdd2551e7c8591d3)
$ /opt/td-agent/bin/ruby --version
ruby 2.7.3p183 (2021-04-05 revision 6847ee089d) [x86_64-linux]

$ apt install td-agent=4.1.0-1
$ td-agent --version
td-agent 4.1.0 fluentd 1.12.1 (e3effa337593618cbd7f0f4ef071766df1ec69a0)
$ /opt/td-agent/bin/ruby --version
ruby 2.7.2p137 (2020-10-01 revision 5445e04352) [x86_64-linux]

ashie · 2021-05-27T11:47:05Z

Yes, so that td-agent 4.1.1 can hit Ruby's issue mentioned at #3382 (comment)

ashie · 2021-06-04T05:38:24Z

Now I believe that following similar issues (not sure it's same with yours or not) are caused by Ruby's resolv: #3387 #3382
And resolv 0.2.1 gem fixes it: ruby/resolv#9
It's triggered by excon's change: excon/excon#739, excon uses Ruby's resolv as of 0.80.0.
excon might be unexpectedly upgraded by upgrading fluentd & plugins.
So please check following things:

Check your versions of Ruby, excon and resolv (if you use resolv gem instead of Ruby's one)
Try resolv 0.2.1 gem

If you don't use excon 0.80.0 or later, or resolv 0.2.1 doesn't resolve your problem, we should suspect another cause.

qingling128 · 2021-06-08T06:42:37Z

Not sure how much it helps, but here are some results of the previous experiments (The only difference is the fluentd gem version):

ashie · 2021-06-08T07:22:48Z

Thanks, it's very helpful!
Your data indicates that it's definitely caused since v1.12.0.rc2 which introduces follow_inodes of in_tail: #3182 #3196
Although some other features are also introduced at v1.12.0.rc2, I think follow_inodes has most heavy impact to the performance in that.
Now I think it's different problem with #3382.

I'm interested in whether v1.12.4 fixes this issue or not, because v1.12.0.rc0 - v.1.12.3 leaks tail watchers after file rotation, it will cause CPU & memory spikes: #3393

qingling128 · 2021-06-08T15:06:10Z

Glad that helped. I just set up an experiment with 1.12.4 as well. Will report back once we get some results. It typically takes a few days for the issue to start revealing itself.

cosmo0920 · 2021-06-24T05:45:52Z

@qingling128 How's it going your Fluentd 1.12.4 experiment?

qingling128 · 2021-06-24T19:57:48Z

It's a bit hard to tell. Eventually all three experiments' memory usage reached a higher plateau:

ashie · 2021-06-29T03:07:36Z

Thanks for the report!
It seems 1.12.4 is same level with 1.12.3, it doesn't resolve the issue.
We should check the changes in 1.11.5 - 1.12.0.rc2 again.

qingling128 · 2021-07-01T03:50:09Z

Occasionally we see sudden bumps (both in #3389 (comment) and #3389 (comment)) of memory, and the timing looks arbitrary.

That might mean the sudden memory bump is non-deterministic. e.g. Certain agent versions could have lower memory usage during the time duration I observed it, but chances are that I was just lucky? If that's the case, the test comparison result probably only gives us a general trend, but not necessarily narrowing down the root cause to exactly version A and B.

I agree that we probably need to look at a few more versions near that range.

kenhys · 2021-07-30T08:31:05Z

As https://github.com/Stackdriver/kubernetes-configs/blob/1d0b24b650d7d044899c3e958faeda62acbae9c6/logging-agent.yaml#L131
contains many factor source/filter/match, so I've focused on only in_tail to reproduce by more simple tail test case (globing *log) in a shorter span , but no luck to monitor the problematic spike (sampling every 60sec with GC::Profiler). it is a combination of 3000 log entries/sec with dummer, rotate source periodically, remove sources.

Could someone reproduce it with simple fluent.conf?

qingling128 · 2021-08-19T21:30:37Z

Looks like 1.13.3 has better memory usage:

ashie · 2021-08-20T01:51:27Z

ref: GoogleCloudPlatform/fluent-plugin-google-cloud#468

ashie · 2021-08-20T02:10:45Z

Close.

If it reproduces again, please let us know.
We'll reopen this.

mtbtrifork mentioned this issue Jun 2, 2021

Logging from single k8s node stops and Fluentd cpu -> 100%. Log events lost. #3382

Closed

ashie self-assigned this Jul 2, 2021

qingling128 mentioned this issue Jul 19, 2021

chore(gemspec): update fluentd to 1.13.3 GoogleCloudPlatform/fluent-plugin-google-cloud#463

Closed

ashie mentioned this issue Jul 27, 2021

v1.13.3 #3470

Merged

ashie closed this as completed Aug 20, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

CPU and memory usage regression between Fluentd 1.11.2 and 1.12.3 #3389

CPU and memory usage regression between Fluentd 1.11.2 and 1.12.3 #3389

qingling128 commented May 24, 2021 •

edited

Loading

qingling128 commented May 24, 2021

ashie commented May 25, 2021

ashie commented May 25, 2021

mtbtrifork commented May 27, 2021

ashie commented May 27, 2021

mtbtrifork commented May 27, 2021

ashie commented May 27, 2021

ashie commented Jun 4, 2021 •

edited

Loading

qingling128 commented Jun 8, 2021

ashie commented Jun 8, 2021

qingling128 commented Jun 8, 2021

cosmo0920 commented Jun 24, 2021

qingling128 commented Jun 24, 2021

ashie commented Jun 29, 2021 •

edited

Loading

qingling128 commented Jul 1, 2021

kenhys commented Jul 30, 2021

qingling128 commented Aug 19, 2021

ashie commented Aug 20, 2021

ashie commented Aug 20, 2021

CPU and memory usage regression between Fluentd 1.11.2 and 1.12.3 #3389

CPU and memory usage regression between Fluentd 1.11.2 and 1.12.3 #3389

Comments

qingling128 commented May 24, 2021 • edited Loading

qingling128 commented May 24, 2021

ashie commented May 25, 2021

ashie commented May 25, 2021

mtbtrifork commented May 27, 2021

ashie commented May 27, 2021

mtbtrifork commented May 27, 2021

ashie commented May 27, 2021

ashie commented Jun 4, 2021 • edited Loading

qingling128 commented Jun 8, 2021

ashie commented Jun 8, 2021

qingling128 commented Jun 8, 2021

cosmo0920 commented Jun 24, 2021

qingling128 commented Jun 24, 2021

ashie commented Jun 29, 2021 • edited Loading

qingling128 commented Jul 1, 2021

kenhys commented Jul 30, 2021

qingling128 commented Aug 19, 2021

ashie commented Aug 20, 2021

ashie commented Aug 20, 2021

qingling128 commented May 24, 2021 •

edited

Loading

ashie commented Jun 4, 2021 •

edited

Loading

ashie commented Jun 29, 2021 •

edited

Loading