Td-agent flush thread stuck with 100% CPU in an infinite loop #3547
Comments
I think this is Yajl's internal bug :-( After digging, the problematic line is the buffer-growth loop in yajl_buf_ensure_available() (yajl_buf.c:64 in the backtrace below). This code is obviously bogus, because doubling the buffer size can silently overflow. Here is a mock code that should show the essence of this bug:

#include <stdio.h>

int main(void)
{
    unsigned int want = 4000000000;
    unsigned int used = 300000000;
    unsigned int need = 1000000000;

    /* Mirrors yajl's growth loop: double `need` until the buffer
     * has room for `want` more bytes. */
    while (want >= (need - used)) {
        need <<= 1;  /* wraps to 0 after 23 doublings */
        printf("%u %u %u\n", need, used, need - used);
    }
    return 0;
}

This code will enter an infinite loop: once need overflows to 0, 0 << 1 stays 0, and need - used wraps around to 3994967296, which is still below want, so the loop condition never becomes false.
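For illustration, a minimal overflow-safe variant of that growth loop could look like the sketch below. This is our own illustration, not the actual fix that later shipped in yajl-ruby; the function name grow_size and its bool error contract are hypothetical:

#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>  /* SIZE_MAX */

/* Hypothetical overflow-safe growth loop (illustration only, not the
 * upstream yajl-ruby patch). Assumes used <= *len on entry. Returns
 * false instead of looping forever when doubling would overflow;
 * the caller treats that like an allocation failure. */
static bool grow_size(size_t *len, size_t used, size_t want)
{
    size_t need = *len;
    while (want >= need - used) {
        if (need > SIZE_MAX / 2)
            return false;  /* doubling would overflow */
        need <<= 1;
    }
    *len = need;
    return true;
}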
@shenningz I posted a fix to eeddaann/fluent-plugin-loki#5.
Evidently the root problem is not in Fluentd itself. Please add feedback to that PR if you have any; I'll close this ticket in fluent/fluentd.
Just to avoid misunderstandings: the 32 GB of memory in the top output refers to the K8s node. The Fluentd container is constrained to 500 MB, so this is not necessarily a large-memory problem. Our analysis of the infinite loop is described here: brianmario/yajl-ruby#205
@shenningz Thank you.
@fujimotos A new yajl-ruby 1.4.2 gem has been released which supposedly fixes the issue. Can you please consider yajl-ruby 1.4.2 for the next Fluentd release, taking the mentioned Ruby dependencies into account?
Fixed in yajl-ruby 1.4.3, which is included in td-agent 4.3.2.
Describe the bug
Td-agent is deployed in a Kubernetes pod and configured to receive Fluent Bit logs and forward them to Loki and Graylog. Some time after startup, often after a target such as the Loki distributors becomes unreachable, the CPU utilization of the td-agent Ruby process climbs to 100% and stays there indefinitely. No more logs are received from Fluent Bit, and none are forwarded. This only happens with multiple forward targets, in our case Graylog and Loki; so far the issue has never occurred with only one target.
Apparently the flush_thread_0 thread is stuck at 100% CPU utilization:
top -H
top - 14:20:36 up 19 days, 22:26, 0 users, load average: 1.42, 1.48, 1.70
Threads: 17 total, 2 running, 15 sleeping, 0 stopped, 0 zombie
%Cpu(s): 15.1 us, 0.9 sy, 0.0 ni, 83.8 id, 0.0 wa, 0.0 hi, 0.1 si, 0.0 st
MiB Mem : 31657.9 total, 3179.4 free, 15111.0 used, 13367.5 buff/cache
MiB Swap: 0.0 total, 0.0 free, 0.0 used. 16346.7 avail Mem
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
41476 root 20 0 916380 148856 9632 R 99.7 0.5 9032:07 flush_thread_0
41478 root 20 0 916380 148856 9632 S 0.3 0.5 22:28.29 event_loop
1 root 20 0 308200 53128 9328 S 0.0 0.2 1:50.29 td-agent
12 root 20 0 308200 53128 9328 S 0.0 0.2 0:31.32 signal_thread.*
13 root 20 0 308200 53128 9328 S 0.0 0.2 0:00.00 socket_manager*
41473 root 20 0 916380 148856 9632 S 0.0 0.5 0:02.16 ruby
41474 root 20 0 916380 148856 9632 S 0.0 0.5 7:39.57 flush_thread_0
41475 root 20 0 916380 148856 9632 S 0.0 0.5 0:01.66 enqueue_thread
41477 root 20 0 916380 148856 9632 S 0.0 0.5 0:00.22 enqueue_thread
41479 root 20 0 916380 148856 9632 S 0.0 0.5 0:03.69 event_loop
41480 root 20 0 916380 148856 9632 S 0.0 0.5 0:02.40 event_loop
41481 root 20 0 916380 148856 9632 S 0.0 0.5 0:00.01 ruby
41482 root 20 0 916380 148856 9632 S 0.0 0.5 0:00.81 event_loop
41483 root 20 0 916380 148856 9632 S 0.0 0.5 0:00.01 ruby
41484 root 20 0 916380 148856 9632 S 0.0 0.5 0:00.00 fluent_log_eve*
43545 root 20 0 4244 3540 2976 S 0.0 0.0 0:00.01 bash
43557 root 20 0 6136 3232 2720 R 0.0 0.0 0:00.00 top
Attaching a debugger shows that it is stuck inside the libyajl.so library.
After rebuilding libyajl.so with debug symbols, we could pinpoint the exact location:
gdb -p 41476
(gdb) bt
#0 yajl_buf_ensure_available (want=1, buf=0x7fb0b441ec50) at ../../../../ext/yajl/yajl_buf.c:64
#1 yajl_buf_append (buf=0x7fb0b441ec50, data=0x7fb0bd2a517e, len=1) at ../../../../ext/yajl/yajl_buf.c:89
#2 0x00007fb0bd2a104e in yajl_gen_string (g=0x7fb0b4420210, str=0x56277d137758 "facility", len=8) at ../../../../ext/yajl/yajl_gen.c:248
#3 0x00007fb0bd29eff4 in yajl_encode_part (wrapper=wrapper@entry=0x7fb0b44201e0, obj=<optimized out>, io=io@entry=8) at ../../../../ext/yajl/yajl_ext.c:249
#4 0x00007fb0bd29f1a8 in yajl_encode_part (wrapper=wrapper@entry=0x7fb0b44201e0, obj=obj@entry=94727611734280, io=io@entry=8)
at ../../../../ext/yajl/yajl_ext.c:197
#5 0x00007fb0bd29f288 in rb_yajl_encoder_encode (argc=argc@entry=1, argv=argv@entry=0x7fb0bc5c0398, self=<optimized out>)
at ../../../../ext/yajl/yajl_ext.c:1115
#6 0x00007fb0bd29feda in rb_yajl_json_ext_hash_to_json (argc=0, argv=0x7fb0bc5c0fe8, self=<optimized out>) at ../../../../ext/yajl/yajl_ext.c:1179
...
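For context, here is a paraphrased sketch of the buffer-growth logic at yajl_buf.c:64 (frame #0 above). This reflects our reading of the observed behavior, with names abridged; it is not a verbatim copy of the yajl sources:

#include <stddef.h>
#include <stdlib.h>

/* Abridged sketch of yajl's buffer growth (not verbatim source):
 * `len` is the allocated size, `used` the bytes already written. */
struct buf {
    size_t len;
    size_t used;
    unsigned char *data;
};

static void buf_ensure_available(struct buf *b, size_t want)
{
    size_t need = b->len;
    /* For a large enough `want`, `need <<= 1` wraps to 0 here and
     * the condition never becomes false: the observed infinite loop. */
    while (want >= (need - b->used))
        need <<= 1;
    if (need != b->len) {
        b->data = realloc(b->data, need);
        b->len = need;
    }
}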
In order to provide even more details we opened a separate issue with yajl-ruby:
brianmario/yajl-ruby#205 (comment)
To Reproduce
Expected behavior
Td-agent should not get stuck consuming 100% CPU in an infinite loop when trying to flush records. Instead, when it runs into a backpressure situation, it should buffer or discard records until the target is available again (see the configuration sketch below).
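As an illustration only, a generic Fluentd buffer section of the following shape buffers to disk with bounded retries and discards the oldest chunks instead of blocking when the buffer fills up. This is a hedged sketch, not the reporter's actual configuration; the @type loki output and the path value are assumptions:

<match **>
  @type loki  # assumed output plugin; any buffered output behaves the same
  # endpoint settings for the target go here
  <buffer>
    @type file
    path /var/log/td-agent/buffer/loki   # hypothetical buffer path
    flush_interval 5s
    retry_type exponential_backoff
    retry_max_times 17                   # give up after bounded retries
    total_limit_size 256MB               # cap the on-disk buffer
    overflow_action drop_oldest_chunk    # discard instead of blocking under backpressure
  </buffer>
</match>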
Your Environment
Your Configuration
Your Error Log
As td-agent is stuck in an infinite loop, no further info, warning, or error messages are logged.
Additional context
No response