Td-agent flush thread stuck with 100% CPU in an infinite loop #3547

Closed
shenningz opened this issue Nov 2, 2021 · 6 comments · Fixed by eeddaann/fluent-plugin-loki#5

@shenningz

shenningz commented Nov 2, 2021

Describe the bug

Td-agent is deployed in a Kubernetes pod and configured to receive Fluent Bit logs and forward them to Loki and Graylog. Some time after startup, often after a target such as the Loki distributors becomes unreachable, the CPU utilization of the td-agent Ruby process goes up to 100% and stays there indefinitely. No more logs are received from Fluent Bit, and none are forwarded. This only happens with multiple forward targets, in our case Graylog and Loki; so far the issue has never occurred with only one target.

Apparently the flush_thread_0 Thread is stuck with 100% CPU utilization:

top -H
top - 14:20:36 up 19 days, 22:26, 0 users, load average: 1.42, 1.48, 1.70
Threads: 17 total, 2 running, 15 sleeping, 0 stopped, 0 zombie
%Cpu(s): 15.1 us, 0.9 sy, 0.0 ni, 83.8 id, 0.0 wa, 0.0 hi, 0.1 si, 0.0 st
MiB Mem : 31657.9 total, 3179.4 free, 15111.0 used, 13367.5 buff/cache
MiB Swap: 0.0 total, 0.0 free, 0.0 used. 16346.7 avail Mem

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
41476 root 20 0 916380 148856 9632 R 99.7 0.5 9032:07 flush_thread_0
41478 root 20 0 916380 148856 9632 S 0.3 0.5 22:28.29 event_loop
1 root 20 0 308200 53128 9328 S 0.0 0.2 1:50.29 td-agent
12 root 20 0 308200 53128 9328 S 0.0 0.2 0:31.32 signal_thread.*
13 root 20 0 308200 53128 9328 S 0.0 0.2 0:00.00 socket_manager*
41473 root 20 0 916380 148856 9632 S 0.0 0.5 0:02.16 ruby
41474 root 20 0 916380 148856 9632 S 0.0 0.5 7:39.57 flush_thread_0
41475 root 20 0 916380 148856 9632 S 0.0 0.5 0:01.66 enqueue_thread
41477 root 20 0 916380 148856 9632 S 0.0 0.5 0:00.22 enqueue_thread
41479 root 20 0 916380 148856 9632 S 0.0 0.5 0:03.69 event_loop
41480 root 20 0 916380 148856 9632 S 0.0 0.5 0:02.40 event_loop
41481 root 20 0 916380 148856 9632 S 0.0 0.5 0:00.01 ruby
41482 root 20 0 916380 148856 9632 S 0.0 0.5 0:00.81 event_loop
41483 root 20 0 916380 148856 9632 S 0.0 0.5 0:00.01 ruby
41484 root 20 0 916380 148856 9632 S 0.0 0.5 0:00.00 fluent_log_eve*
43545 root 20 0 4244 3540 2976 S 0.0 0.0 0:00.01 bash
43557 root 20 0 6136 3232 2720 R 0.0 0.0 0:00.00 top

Attaching a debugger shows that it's stuck in the libyajl.so library.
After rebuilding libyajl.so with debug symbols we could see the exact location:

gdb -p 41476
(gdb) bt
#0 yajl_buf_ensure_available (want=1, buf=0x7fb0b441ec50) at ../../../../ext/yajl/yajl_buf.c:64
#1 yajl_buf_append (buf=0x7fb0b441ec50, data=0x7fb0bd2a517e, len=1) at ../../../../ext/yajl/yajl_buf.c:89
#2 0x00007fb0bd2a104e in yajl_gen_string (g=0x7fb0b4420210, str=0x56277d137758 "facility", len=8) at ../../../../ext/yajl/yajl_gen.c:248
#3 0x00007fb0bd29eff4 in yajl_encode_part (wrapper=wrapper@entry=0x7fb0b44201e0, obj=<optimized out>, io=io@entry=8) at ../../../../ext/yajl/yajl_ext.c:249
#4 0x00007fb0bd29f1a8 in yajl_encode_part (wrapper=wrapper@entry=0x7fb0b44201e0, obj=obj@entry=94727611734280, io=io@entry=8)
at ../../../../ext/yajl/yajl_ext.c:197
#5 0x00007fb0bd29f288 in rb_yajl_encoder_encode (argc=argc@entry=1, argv=argv@entry=0x7fb0bc5c0398, self=<optimized out>)
at ../../../../ext/yajl/yajl_ext.c:1115
#6 0x00007fb0bd29feda in rb_yajl_json_ext_hash_to_json (argc=0, argv=0x7fb0bc5c0fe8, self=<optimized out>) at ../../../../ext/yajl/yajl_ext.c:1179
...

In order to provide even more details we opened a separate issue with yajl-ruby:
brianmario/yajl-ruby#205 (comment)

To Reproduce

  • deploy Td-agent as in the scenario above, with the configuration below.
  • make sure some log volume is forwarded by Fluent Bit to Fluentd (a few hundred logs/sec).
  • scale down the Loki distributors to zero in order to force a backpressure situation (see the example commands after this list).
  • wait until the Td-agent pod consumes 100% CPU (as observed against the Kubernetes resource requests and limits). This can take a few hours.
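
For the scale-down step, something like the following works; the namespace and deployment names here are placeholders, so adjust them to whatever your Loki installation uses:

# Force backpressure by removing all Loki distributors
kubectl -n loki scale deployment loki-distributor --replicas=0

# Restore them after td-agent has become stuck at 100% CPU
kubectl -n loki scale deployment loki-distributor --replicas=2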

Expected behavior

Td-agent should not get stuck consuming 100% CPU in an infinite loop when trying to flush records. Instead, when it runs into a backpressure situation, it should buffer or discard records until the target is available again.

Your Environment

- Fluentd version: 1.13.3
- TD Agent version: 4.2.0
- Operating system: Ubuntu 20.04.3 LTS
- Kernel version: 5.4.0-6-cloud-amd64 #1 SMP Debian 5.4.93-1 (2021-02-09)

Your Configuration

<system>
  log_level info
</system>
<source>
  @type monitor_agent
  bind 0.0.0.0
  port 24220
</source>
<source>
  @type prometheus
  port 24222
</source>
<source>
  @type prometheus_monitor
  <labels>
      hostname ${hostname}
  </labels>
</source>
<source>
  @type  prometheus_output_monitor
  <labels>
      hostname ${hostname}
  </labels>
</source>
<source>
  @type forward
  port 24224
  bind 0.0.0.0
</source>
<filter ** >
  @type prometheus
  <metric>
    name fluentd_input_status_num_records_total
    type counter
    desc The total number of incoming records
    <labels>
      tag ${tag}
      hostname ${hostname}
    </labels>
  </metric>
</filter>
<filter **>
  @type record_transformer
  <record>
    ingestor "#{ENV['HOSTNAME']}"
  </record>
</filter>
<match ** >
  @type              copy
  <store>
    @type                        gelf
    host                         graylog-input
    port                         12222
    protocol                     tcp
    tls                          false
    <buffer>
      @type                      file
      path                       /var/xxx/store/fluentd/Graylog 
      total_limit_size           10GB
      chunk_limit_size           10240k
      flush_at_shutdown          true
      overflow_action            drop_oldest_chunk
      retry_max_times            120
      retry_type                 periodic
      retry_wait                 10s
    </buffer>
  </store>
  <store>
    @type prometheus
    <metric>
      name fluentd_output_status_num_records_total
      type counter
      desc The total number of outgoing records
      <labels>
        tag ${tag}
        hostname ${hostname}
      </labels>
    </metric>
  </store>
  <store>
    @type                loki
    url                  "http://#{ENV['LOKI_HOST']}:#{ENV['LOKI_PORT']}"
    username             xxx
    password             xxx
    <label>
      ingestor
      abap_system
      app
      az
      landscape
      namespace
      node
      pod
      Labels
      MessageID
      MessageSource
      RFC3339_Timestamp
      SAP_EPP
      Severity
      ServiceName
      ServiceComponent
      UserName
      bosh_deployment
      bosh_director
      bosh_group
      bosh_id
      bosh_az
    </label>
    <buffer time>
      @type              file
      path               /var/xxx/store/fluentd/Loki
      total_limit_size   10GB
      chunk_limit_size   10240k
      flush_at_shutdown  true
      overflow_action    drop_oldest_chunk
      retry_max_times    120
      retry_type         periodic
      retry_wait         10s
      timekey            10
      timekey_wait       60
    </buffer>
  </store>
</match>

Your Error Log

Because Td-agent is stuck in an infinite loop, no further info/warning/error messages are logged.

Additional context

No response

@shenningz shenningz changed the title Td-agent flush thread stuck with 100% CPU in a forever-loop Td-agent flush thread stuck with 100% CPU in an infinite loop Nov 2, 2021
@fujimotos
Member

Td-agent should not get stuck eating up 100% in an infinite loop when trying to flush records.

I think this is an internal bug in Yajl :-(

After digging into yajl-ruby, I found that Yajl internally uses unsigned int
to manage buffer lengths. This means Yajl cannot handle a large amount
of memory such as @shenningz's (which is 32GB).

https://github.com/brianmario/yajl-ruby/blob/master/ext/yajl/yajl_buf.c#L64

The problematic line is here. This code is obviously bogus: if the argument
want is sufficiently large, the while loop never exits.

Here is mock code that shows the essence of this bug:

#include <stdio.h>

int main(void)
{
    /* values chosen to mimic a large append request against yajl's
       unsigned int buffer bookkeeping */
    unsigned int want = 4000000000;
    unsigned int used = 300000000;
    unsigned int need = 1000000000;

    while (want >= (need - used)) {
        need <<= 1;   /* keeps doubling until it overflows, eventually reaching 0 */
        printf("%u %u %u\n", need, used, need - used);
    }
    return 0;
}

This code enters an infinite loop: need eventually overflows to 0, and from then on need - used wraps around to a huge unsigned value, so the loop condition is always true.

@fujimotos
Member

@shenningz I posted a fix to eeddaann/fluent-plugin-loki#5.

This only happens with multiple forward targets, in our case Graylog and Loki

Evidently the root problem is that out_loki uses Yajl to encode events (Fluentd
itself has already migrated to Oj and only uses Yajl as a fallback). I posted a patch
that adds support for multiple JSON libraries to out_loki.

Please add feedback to that PR if you have any; I'm closing this ticket in fluent/fluentd.

@shenningz
Author

Just to avoid misunderstandings: the 32 GB of memory in the top output refers to the K8s node. The Fluentd container is constrained to 500 MB, so this is not necessarily a large-memory problem. Our analysis of the infinite loop is described here: brianmario/yajl-ruby#205

@fujimotos
Member

fujimotos commented Nov 4, 2021

@shenningz Thank you.

Our analysis of the infinite loop is described here: brianmario/yajl-ruby#205

So buf->len would be 0 when malloc fails, which is another failure
pattern that makes yajl enter an infinite loop :-(
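
A minimal C sketch of that second failure mode, using hypothetical values in place of buf->len and buf->used (same doubling loop as the mock above):

#include <stdio.h>

int main(void)
{
    /* hypothetical state after a failed allocation: the buffer length
       is left at 0, so doubling it never makes progress */
    unsigned int want = 1;   /* even a 1-byte append request triggers it */
    unsigned int used = 0;
    unsigned int need = 0;   /* stands in for buf->len */

    /* 1 >= (0 - 0) is always true, and 0 << 1 stays 0, so this never exits */
    while (want >= (need - used)) {
        need <<= 1;
    }

    printf("never reached\n");
    return 0;
}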

@shenningz
Author

@fujimotos A new yajl-ruby 1.4.2 gem has been released,
https://rubygems.org/gems/yajl-ruby/versions/1.4.2
which supposedly fixes the issue:
brianmario/yajl-ruby#211

Can you please consider yajl-ruby 1.4.2 for the next Fluentd release, taking into account the Ruby dependencies mentioned here?
brianmario/yajl-ruby#211 (comment)

@shenningz
Author

Fixed in yajl-ruby 1.4.3, which ships with Td-agent 4.3.2.
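
For anyone verifying this on their own install, the bundled gem version can be checked with td-agent's packaged gem wrapper (assuming the standard td-agent 4 packaging):

# List the yajl-ruby version bundled with td-agent
td-agent-gem list yajl-ruby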
