fluentd-sender(0.14.20) become cpu 100% after fluentd-server down #1665

Closed · bungoume opened this issue Aug 18, 2017 · 30 comments
Labels: moreinfo (missing version, need reproducible steps, need to investigate more), v0.14

Comments

@bungoume

After the connection to the server is lost, the fluentd sender's CPU usage climbs to 100%.
Even after the server is restored, the CPU stays pegged at 100%.

fluentd(sender) version: 0.14.20

[screenshot: CPU usage graph]
21:36 - server down
21:38 - server recovered

fluentd sender config

https://github.com/bungoume/log-sender/blob/master/fluent.tmpl#L147-L168
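
For context, a minimal out_forward sender block along these lines would look like the sketch below; the host and port match the log lines later in this thread, while the buffer values are placeholders rather than the exact settings in the linked template:

  <match **>
    @type forward
    <server>
      host log-aggregator.local   # aggregator (or ELB) endpoint seen in the logs below
      port 24224
    </server>
    <buffer>
      flush_interval 1s           # placeholder flush/retry values
      retry_max_interval 30s
    </buffer>
  </match>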

@repeatedly (Member)

Do you have reproducible steps?
I tested a destination shutdown on my Mac, but the forwarder's CPU stays low...

repeatedly added the moreinfo (missing version, need reproducible steps, need to investigate more) and v0.14 labels on Aug 22, 2017
@GLStephen

I've seen this behavior as well. The target server for a forwarder has a brief blip, and then the sending server's CPU spikes and stays high.

@repeatedly (Member)

@GLStephen Could you tell me your settings and traffic volume?
I can't reproduce it in my environment, so I want to reproduce the problem first.

@bungoume (Author)

It was reproduced when an AWS CLB (ELB) was placed in front of the aggregator (server).

log sender -> ELB -> aggregator

When the ELB->aggregator connection was closed, the sender seems to mistakenly conclude, over and over, that the aggregator server has gone down and then recovered.
As a result, the sender's CPU goes to 100%.

2017-08-28 10:03:15 +0000 [warn]: #0 [forward_any] destination node closed the connection. regard it as unavailable. host="log-aggregator.local" port=24224
2017-08-28 10:03:16 +0000 [warn]: #0 [forward_any] recovered forwarding server '' host="log-aggregator.local" port=24224
2017-08-28 10:03:16 +0000 [warn]: #0 [forward_any] destination node closed the connection. regard it as unavailable. host="log-aggregator.local" port=24224
2017-08-28 10:03:16 +0000 [warn]: #0 [forward_any] failed to flush the buffer. retry_time=1 next_retry_seconds=2017-08-28 10:03:17 +0000 chunk="557cd66b41ec2fe08ea752ca6ee2d50e" error_class=Fluent::Plugin::ForwardOutput::ConnectionClosedError error="failed to establish connection with node "
  2017-08-28 10:03:16 +0000 [warn]: #0 /usr/lib/ruby/gems/2.4.0/gems/fluentd-0.14.20/lib/fluent/plugin/out_forward.rb:606:in `send_data_actual'
  2017-08-28 10:03:16 +0000 [warn]: #0 /usr/lib/ruby/gems/2.4.0/gems/fluentd-0.14.20/lib/fluent/plugin/out_forward.rb:633:in `send_data'
  2017-08-28 10:03:16 +0000 [warn]: #0 /usr/lib/ruby/gems/2.4.0/gems/fluentd-0.14.20/lib/fluent/plugin/out_forward.rb:285:in `block in try_write'
  2017-08-28 10:03:16 +0000 [warn]: #0 /usr/lib/ruby/gems/2.4.0/gems/fluentd-0.14.20/lib/fluent/plugin/out_forward.rb:304:in `block in select_a_healthy_node'
  2017-08-28 10:03:16 +0000 [warn]: #0 /usr/lib/ruby/gems/2.4.0/gems/fluentd-0.14.20/lib/fluent/plugin/out_forward.rb:298:in `times'
  2017-08-28 10:03:16 +0000 [warn]: #0 /usr/lib/ruby/gems/2.4.0/gems/fluentd-0.14.20/lib/fluent/plugin/out_forward.rb:298:in `select_a_healthy_node'
  2017-08-28 10:03:16 +0000 [warn]: #0 /usr/lib/ruby/gems/2.4.0/gems/fluentd-0.14.20/lib/fluent/plugin/out_forward.rb:285:in `try_write'
  2017-08-28 10:03:16 +0000 [warn]: #0 /usr/lib/ruby/gems/2.4.0/gems/fluentd-0.14.20/lib/fluent/plugin/output.rb:1034:in `try_flush'
  2017-08-28 10:03:16 +0000 [warn]: #0 /usr/lib/ruby/gems/2.4.0/gems/fluentd-0.14.20/lib/fluent/plugin/output.rb:1268:in `flush_thread_run'
  2017-08-28 10:03:16 +0000 [warn]: #0 /usr/lib/ruby/gems/2.4.0/gems/fluentd-0.14.20/lib/fluent/plugin/output.rb:420:in `block (2 levels) in start'
  2017-08-28 10:03:16 +0000 [warn]: #0 /usr/lib/ruby/gems/2.4.0/gems/fluentd-0.14.20/lib/fluent/plugin_helper/thread.rb:78:in `block in thread_create'
2017-08-28 10:03:16 +0000 [warn]: #0 [forward_any] recovered forwarding server '' host="log-aggregator.local" port=24224
2017-08-28 10:03:19 +0000 [warn]: #0 [forward_any] destination node closed the connection. regard it as unavailable. host="log-aggregator.local" port=24224
2017-08-28 10:03:19 +0000 [warn]: #0 [forward_any] destination node closed the connection. regard it as unavailable. host="log-aggregator.local" port=24224
2017-08-28 10:03:19 +0000 [warn]: #0 [forward_any] destination node closed the connection. regard it as unavailable. host="log-aggregator.local" port=24224
2017-08-28 10:03:19 +0000 [warn]: #0 [forward_any] destination node closed the connection. regard it as unavailable. host="log-aggregator.local" port=24224
2017-08-28 10:03:19 +0000 [warn]: #0 [forward_any] destination node closed the connection. regard it as unavailable. host="log-aggregator.local" port=24224
2017-08-28 10:03:19 +0000 [warn]: #0 [forward_any] destination node closed the connection. regard it as unavailable. host="log-aggregator.local" port=24224
2017-08-28 10:03:19 +0000 [warn]: #0 [forward_any] destination node closed the connection. regard it as unavailable. host="log-aggregator.local" port=24224
2017-08-28 10:03:19 +0000 [warn]: #0 [forward_any] destination node closed the connection. regard it as unavailable. host="log-aggregator.local" port=24224
2017-08-28 10:03:19 +0000 [warn]: #0 [forward_any] recovered forwarding server '' host="log-aggregator.local" port=24224
2017-08-28 10:03:20 +0000 [warn]: #0 [forward_any] destination node closed the connection. regard it as unavailable. host="log-aggregator.local" port=24224
2017-08-28 10:03:21 +0000 [warn]: #0 [forward_any] recovered forwarding server '' host="log-aggregator.local" port=24224

@GLStephen

@repeatedly There are some aspects of our config I would prefer not to make public. What info do you need exactly? The basics: we are using flowcounter to forward message counts every 15 seconds, running about 1M messages per minute. We have also seen what appears to be deadlocking with 0.14 at these volumes.

@GLStephen

@repeatedly More info is here; we believe it is related to the above. We are sending about 1M events/hour.

fluent-plugins-nursery/fluent-plugin-bigquery#143

@repeatedly (Member)

@GLStephen The link above is about a deadlock. Do you hit the 100% CPU problem at the same time?

@repeatedly (Member)

@bungoume Does that mean the problem doesn't happen without the AWS CLB/ELB?

@GLStephen

@repeatedly I've seen both happen together in a forwarding pair. Our aggregator, referenced above, will deadlock, and the downstream flowcounter metrics server will hit this 100% CPU issue as described above.

@bungoume (Author)

@repeatedly Yes. When the log-sender -> ELB connection is closed, or when the log-aggregator goes down while the log-sender connects to it directly, the problem does not happen.

@repeatedly (Member)

@GLStephen Thanks. I will fix the deadlock issue first.

@bungoume So the ELB returns a wrong response? I'm not familiar with ELB connection handling, so I want to know what is returned in this situation.

@bungoume (Author)

bungoume commented Aug 29, 2017

We collected logs (fluentd log, Ruby profile, and TCP dump):
fluentd20170829.tar.gz

10.81.24.149: client (log-sender)
10.81.21.192: server (log-aggregator ELB)

Normal case

[screenshot: packet capture of the normal case]

  • the client sends the message (No. 7) and a TCP ACK (No. 10)
  • the server returns a software-ACK (No. 12)
    ex: ackVX21QxVErW1j597Kn12TZQ==
  • FIN by the client (No. 16-18)

Problem case

[screenshot: packet capture of the problem case]

  • the client sends the message (No. 538) and a TCP ACK (No. 540)
  • the server never returns a software-ACK
  • FIN by the server (No. 543-)

This may be because the TCP-level health check still succeeds.
(But this is probably not the direct cause of the 100% CPU?)
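
For what it's worth, the "software-ACK" here is the forward protocol's application-level acknowledgement, which out_forward only waits for when it is asked to. A sketch of the relevant options, assuming a setup like the one described in this thread (the values are illustrative, not the reporter's actual configuration):

  <match **>
    @type forward
    require_ack_response true    # wait for the aggregator's per-chunk ack
    ack_response_timeout 190s    # how long to wait for the ack before retrying the chunk
    <server>
      host log-aggregator.local
      port 24224
    </server>
  </match>

With this enabled, a TCP-level ACK alone (as in the problem case above) is not enough: the chunk only counts as delivered once the application-level ack arrives.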

@repeatedly (Member)

@bungoume Thanks!

@GLStephen One question: do you forward logs directly between fluentd nodes, without an ELB like bungoume uses?

@GLStephen

@repeatedly Correct, we connect directly from the forwarder to the aggregator.

@repeatedly (Member)

@bungoume Could you get a sigdump result from the problematic fluentd instance?

@okkez (Contributor)

okkez commented Sep 6, 2017

@bungoume Could you try #1684?

@bungoume (Author)

bungoume commented Sep 6, 2017

@repeatedly
I added require 'sigdump/setup' to /usr/bin/fluentd and attached the resulting dumps:
sigdump(normal)-1549.txt
sigdump(cpu100)-267.txt (sent SIGCONT twice)

I tried #1684, but it did not resolve the problem.
[screenshot: CPU usage graph]

@repeatedly (Member)

@bungoume Thanks.
I wrote a patch for this problem. Could you test it? See #1686.

@bungoume (Author)

bungoume commented Sep 7, 2017

@repeatedly Thank you!
The bug seems to have been fixed.

However, I now get a lot of logs.
log:
20170907T012240.txt

2017-09-07 01:20:08 Blocked the ELB->log-aggregator connection
2017-09-07 01:21:40 Restored the ELB->log-aggregator connection
2017-09-07 01:22:38 Stopped fluentd

@repeatedly (Member)

If you have lots of chunks, that seems normal, because such logs are generated for each chunk.

@repeatedly (Member)

I'm not sure whether this case is common or not.
If there are lots of servers behind the ELB, retried chunks are routed to live aggregators, right?

@bungoume (Author)

Yes, it does not occur often.
However, it can be triggered by a log-aggregator deployment error or a configuration bug.

It is fatal for us if all applications die because of fluentd.

@repeatedly (Member)

repeatedly commented Sep 11, 2017

it can be triggered by a log-aggregator deployment error or a configuration bug

How could we detect this on the fluentd forwarder side?
We probably can't suppress these errors, because a temporary network failure causes the same error, right?

@bungoume (Author)

I think there is no need to distinguish it from a network error.
Could you please reduce the amount of logging, including for network errors?

@bungoume (Author)

@repeatedly What is the status of this issue?
The problem also occurred with an AWS NLB and v0.14.23.

repeatedly added a commit that referenced this issue Nov 29, 2017
Don't update retry state when failed to get ack response. fix #1665
@bungoume (Author)

bungoume commented Dec 5, 2017

@repeatedly Unfortunately, this problem has reappeared on fluentd 1.0.0.rc1 (Ruby 2.4.2).

@bungoume (Author)

bungoume commented Dec 6, 2017

fluent.txt
This is the sender log when the CPU reaches 100%.

@repeatedly (Member)

Is this the same situation, or a different network sequence?

@bungoume (Author)

bungoume commented Dec 8, 2017

The last report is with an NLB.
It occurs when the log-aggregator stops due to an Elasticsearch failure.

@nathan-oakes

nathan-oakes commented Jun 1, 2018

I have noticed this same issue, but not on all nodes. The config uses 1.0 (td-agent3):
Within the datacenter: [td-agent in_tail, out_forward on nodes] -> (td-agent in_forward, out_forward on the "Forwarder") -> within an AWS VPC over an IPSEC tunnel (in_forward, out_elasticsearch, and out_s3 on the "Aggregator").
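
For reference, a minimal sketch of the "Forwarder" tier described above (an in_forward source receiving from the node-level agents, relayed with out_forward to the aggregator; the hostname and ports are placeholders, not the actual values):

  <source>
    @type forward
    port 24224                  # receives from the nodes' out_forward
  </source>

  <match **>
    @type forward
    <server>
      host aggregator.example.internal   # placeholder for the VPC-side aggregator
      port 24224
    </server>
  </match>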
