fluentd-sender(0.14.20) become cpu 100% after fluentd-server down #1665
Comments
Do you have reproducible steps?
I've seen this behavior as well. The target server for a forwarder has a blip, and then the sending server's CPU spikes and stays high.
@GLStephen Could you tell me your settings and traffic?
It was reproduced when an AWS CLB (ELB) was placed in front of the aggregator (server): log sender -> ELB -> aggregator. The problem appeared when the ELB -> aggregator connection was closed.
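For reference, a minimal sketch of the sender side in this kind of log sender -> ELB -> aggregator layout could look like the following; the host name is a placeholder and the heartbeat choice is only an assumption (a classic ELB carries TCP, not UDP):

```
# Illustrative sketch only, not the reporter's actual configuration.
<match **>
  @type forward
  heartbeat_type tcp      # TCP heartbeat; a classic ELB does not forward UDP
  <server>
    host my-log-elb.example.internal   # hypothetical ELB DNS name
    port 24224
  </server>
</match>
```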
@repeatedly there are some aspects of our config I would prefer not to make public. What info do you need exactly? The basics are: we are using flowcounter to forward message counts every 15 seconds, running about 1M messages a minute. We have also seen what appears to be deadlocking with 0.14 at these volumes.
@repeatedly more info is here; we are seeing this as related to the above, sending about 1M events/hour.
@GLStephen The above link is for the deadlock. Do you hit the 100% CPU problem together with it?
@bungoume Does that mean the problem doesn't happen without the AWS CLB/ELB?
@repeatedly I've seen both happen together in a forwarding pair. Our aggregator, referenced above, will deadlock, and the downstream flowcounter metric server will hit this 100% CPU problem as described above.
@repeatedly Yes.
@GLStephen Thanks. I will fix the deadlock issue first. @bungoume So the ELB returns a wrong response? I'm not familiar with ELB connection handling, so I want to know what is returned in this situation.
We collected logs (fluentd log, Ruby profile, and TCP dump). 10.81.24.149 is the client (log-sender).
Normal case: (capture attached)
Problem case: (capture attached)
This may be due to the TCP-level health check succeeding.
@bungoume Thanks! @GLStephen One question: do you forward logs between fluentd nodes directly, without an ELB like bungoume uses?
@repeatedly Correct, we connect directly from the forwarder to the aggregator.
@bungoume Could you get a sigdump result from the problematic fluentd instance?
@repeatedly I tried #1684 but it did not resolve the problem.
@repeatedly Thank you! However, I get a lot of logs. (2017-09-07 01:20:08: blocked the ELB -> log-agg connection.)
If you have lots of chunks, that seems normal, because such logs are generated per chunk.
I'm not sure whether this case is common or not.
Yes, it does not occur often. But it is fatal when all applications die because of fluentd.
How can this be detected on the fluentd forwarder side?
I think there is no need to distinguish it from a network error.
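As a general illustration rather than something proposed in this thread, a forwarder stuck in a retry loop can often be spotted with fluentd's bundled monitor_agent input, which exposes per-plugin retry and buffer metrics over HTTP; the bind address and port below are just the usual defaults:

```
# Illustrative sketch: enable the built-in monitoring endpoint on the forwarder.
<source>
  @type monitor_agent
  bind 0.0.0.0
  port 24220
</source>
# Poll http://<forwarder>:24220/api/plugins.json and alert when the forward
# output's retry_count keeps growing or buffer_queue_length never drains.
```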
@repeatedly what is the status of this issue?
Don't update retry state when failed to get ack response. fix #1665
@repeatedly Unfortunately, this problem has also reappeared on fluentd-1.0.0.rc1 (Ruby 2.4.2).
fluent.txt
Is this the same situation, or a different network sequence?
The last report is using an NLB.
I have noticed this same issue, but not on all nodes. The config is using 1.0 (td-agent3):
After the connection to the server is lost, the fluentd-sender CPU usage becomes 100%.
Even after the server is restored, CPU usage stays at 100%.
fluentd(sender) version: 0.14.20
21:36 - server down
21:38 - server recovered
fluentd sender config
https://github.com/bungoume/log-sender/blob/master/fluent.tmpl#L147-L168
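The linked template is not reproduced here, but as a rough sketch under the assumption of a typical forward output with acknowledgements, the relevant part of such a sender config usually looks something like this (all hosts and values below are illustrative, not taken from the linked file):

```
# Illustrative sketch only; values are not from the linked fluent.tmpl.
<match **>
  @type forward
  require_ack_response               # resend a chunk if the aggregator's ack is missing
  ack_response_timeout 190           # seconds to wait for the ack
  <server>
    host aggregator.example.internal # placeholder aggregator/ELB endpoint
    port 24224
  </server>
  <buffer>
    retry_type exponential_backoff   # back off between resend attempts
    retry_wait 1s
    retry_max_interval 60s
  </buffer>
</match>
```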