
Memory leak #3401

Open
Mosibi opened this issue May 28, 2021 · 31 comments

@Mosibi

Mosibi commented May 28, 2021

Describe the bug

I noticed a memory issue that looks to me like a leak. I originally had a more complex configuration, but boiled it down to the simple config below to rule out anything specific to my setup. Even with this minimal config, I observe the memory leakage.

[Graph: fluentd container memory usage]

I run Fluentd 1.12.3 in a container on a Kubernetes cluster, deployed via Helm.

To Reproduce
Deploy Fluentd 1.12.3 on a Kubernetes cluster via Helm with the configuration provided below, let it run for some time, and observe the memory metrics.

Expected behavior
No memory leak

Your Environment
Fluentd 1.12.3 on Kubernetes 1.17.7

Your Configuration

  <system>
    log_level info
    log_event_verbose true
    root_dir /tmp/fluentd-buffers/
  </system>

  <label @OUTPUT>
    <match **>
      @type file
      path /var/log/fluentd-output
      compress gzip
      <buffer>
        timekey 1d
        timekey_use_utc true
        timekey_wait 10m
      </buffer>
    </match>
  </label>

  ###
  # systemd/journald
  ###
  <source>
    @type systemd
    tag systemd
    path /var/log/journal
    read_from_head true

    <entry>
      fields_strip_underscores true
      fields_lowercase true
    </entry>

    @label @OUTPUT
  </source>

Additional context
Using my more complex configuration I observe the same memory growth, and there seems to be a correlation between the growth rate and the number of logs ingested. In other words, with more inputs (in my case, container logs), the (assumed) memory leak grows faster.

@kenhys
Contributor

kenhys commented May 31, 2021

It seems related to #3342.

FYI: Recently, fluent-plugin-systemd 1.0.5 was released.

It fixes "Plug a memory leaks on every reload"
fluent-plugins-nursery/fluent-plugin-systemd#91

Is it still reproducible with fluent-plugin-systemd 1.0.5?

@Mosibi
Author

Mosibi commented Jun 1, 2021

I tested with the complete config and I still see the memory leak. Today I will run with only the systemd part to see if there is any change.

@sumo-drosiek

I've been investigating a similar issue for a few days, and observed that on fluentd 1.11.5 a null output with a file buffer also constantly consumes more memory, as opposed to a null output without a buffer section. I can repeat the tests with a newer fluentd.

@Mosibi
Author

Mosibi commented Jun 1, 2021

I've been investigating a similar issue for a few days, and observed that on fluentd 1.11.5 a null output with a file buffer also constantly consumes more memory, as opposed to a null output without a buffer section. I can repeat the tests with a newer fluentd.

Ah, that is good information. Please share the results with the latest fluentd version, if you can.

Is it possible to share the configuration you are using?

@sumo-drosiek

This is my actual config

2021-06-01 12:30:47 +0000 [info]: using configuration file: <ROOT>
  <match fluentd.pod.healthcheck>
    @type relabel
    @label @FLUENT_LOG
  </match>
  <label @FLUENT_LOG>
    <match **>
      @type null
    </match>
  </label>
  <system>
    log_level info
  </system>
  <source>
    @type forward
    port 24321
    bind "0.0.0.0"
  </source>
  <match containers.**>
    @type relabel
    @label @NORMAL
  </match>
  <label @NORMAL>
    <match containers.**>
      @type copy
      <store>
        @type "null"
        <buffer>
          @type "file"
          path "/fluentd/buffer/logs.containers"
          compress gzip
          flush_interval 5s
          flush_thread_count 8
          chunk_limit_size 1m
          total_limit_size 128m
          queued_chunks_limit_size 128
          overflow_action drop_oldest_chunk
          retry_max_interval 10m
          retry_forever true
        </buffer>
      </store>
    </match>
  </label>
</ROOT>
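
For comparison, the no-buffer variant I mentioned is just the null output without the <buffer> block. A minimal sketch (not taken from the config dump above):

  <label @NORMAL>
    <match containers.**>
      @type null
    </match>
  </label>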

@Mosibi
Author

Mosibi commented Jun 1, 2021

It seems related to #3342.

FYI: Recently, fluent-plugin-systemd 1.0.5 was released.

It fixes "Plug a memory leaks on every reload"
fluent-plugin-systemd/fluent-plugin-systemd#91

Is it still reproducible with fluent-plugin-systemd 1.0.5?

Yes, also with fluent-plugin-systemd version 1.0.5 I see the same behaviour. I could attach a new graph, but it shows exactly the same pattern as the first one.

@Mosibi
Author

Mosibi commented Jun 2, 2021

Thanks to @sumo-drosiek, I can confirm that no memory leak occurs when I do not use a buffer section in the output. I am now rerunning everything with a config that does include a buffer section, to be 100% sure.

The config without buffers

  <system>
    log_level info
    log_event_verbose true
    root_dir /tmp/fluentd-buffers/
  </system>

  <label @OUTPUT>
    <match **>
      @type null
    </match>
  </label>

  ###
  # systemd/journald
  ###
  <source>
    @type systemd
    tag systemd
    path /var/log/journal
    read_from_head true

    <entry>
      fields_strip_underscores true
      fields_lowercase true
    </entry>

    @label @OUTPUT
  </source>

[Graph: memory usage with the buffer-less config]

@ashie
Member

ashie commented Jun 3, 2021

I've confirmed the issue.
I think it's not fluentd's issue, it's fluent-plugin-systemd's issue.
Please forward the report to https://github.com/fluent-plugin-systemd/fluent-plugin-systemd

Details:

diff --git a/lib/fluent/plugin/in_systemd.rb b/lib/fluent/plugin/in_systemd.rb
index adb8b3f..66f4373 100644
--- a/lib/fluent/plugin/in_systemd.rb
+++ b/lib/fluent/plugin/in_systemd.rb
@@ -120,6 +120,7 @@ module Fluent
         init_journal if @journal.wait(0) == :invalidate
         watch do |entry|
           emit(entry)
+          GC.start
         end
       end

@ashie ashie closed this as completed Jun 3, 2021
@Mosibi
Author

Mosibi commented Jun 3, 2021

@ashie I just wanted to add the results of my test run with buffering configured in an output: without buffering there is no memory leak, and with buffering a memory leak occurs.

[Graph: memory usage with vs. without output buffering]

@Mosibi
Author

Mosibi commented Jun 3, 2021

So to be complete:

No memory leak:

  <label @OUTPUT>
    <match **>
      @type null
    </match>
  </label>

Memory leak:

  <label @OUTPUT>
    <match **>
      @type null

      <buffer>
        @type "file"
        path "/var/log/fluentd-buffers/output.buffer"
        compress gzip
        flush_interval 5s
        flush_thread_count 8
        chunk_limit_size 1m
        total_limit_size 128m
        queued_chunks_limit_size 128
        overflow_action drop_oldest_chunk
        retry_max_interval 10m
        retry_forever true
      </buffer>
    </match>
  </label>
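
For completeness, a memory-buffer variant of the same output would look like the sketch below. I have not tested this one here; it simply swaps the file buffer for a memory buffer with the same limits:

  <label @OUTPUT>
    <match **>
      @type null

      <buffer>
        @type memory
        flush_interval 5s
        flush_thread_count 8
        chunk_limit_size 1m
        total_limit_size 128m
        overflow_action drop_oldest_chunk
      </buffer>
    </match>
  </label>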

@ashie
Member

ashie commented Jun 3, 2021

Thanks, further investigation may be needed on the fluentd side. Reopening.

@ashie ashie reopened this Jun 3, 2021
@Mosibi
Author

Mosibi commented Jun 10, 2021

The following graph displays memory usage over almost 7 days. This is my regular config with only the buffer sections in the Elasticsearch outputs disabled. I was already sure it was the buffer section; this is just more proof :)

[Graph: memory usage over ~7 days, complete config with buffer sections disabled]

@github-actions

github-actions bot commented Sep 8, 2021

This issue has been automatically marked as stale because it has been open 90 days with no activity. Remove stale label or comment or this issue will be closed in 30 days

@github-actions github-actions bot added the stale label Sep 8, 2021
@sumo-drosiek

@ashie any update on this?

@ashie ashie removed the stale label Sep 9, 2021
@komalpcg

I am also facing this issue. Support has suggested setting up a cron job to restart the service until a fix is available: https://stackoverflow.com/questions/69161295/gcp-monitoring-agent-increasing-memory-usage-continuously.

@github-actions

This issue has been automatically marked as stale because it has been open 90 days with no activity. Remove stale label or comment or this issue will be closed in 30 days

@github-actions github-actions bot added the stale label Dec 23, 2021
@levleontiev

The issue is still here. Please remove the stale label.

@kenhys kenhys removed the stale label Jan 21, 2022
@kvokka

kvokka commented Feb 20, 2022

Could this be connected with #3634?

@tzulin-chin

Any update on this?
I am facing this issue with a file buffer in the output config.

@kvokka

kvokka commented Mar 15, 2022

For me, the fix was setting

            prefer_oj_serializer true
            http_backend typhoeus
            buffer_type memory

in the elasticsearch match section.
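
For context, a rough sketch of where these options sit; the match pattern, host, port, and index name are placeholders, not values from this thread:

    <match containers.**>
      @type elasticsearch
      # host, port and index_name below are placeholders
      host elasticsearch.example.local
      port 9200
      index_name fluentd
      # the three settings that made the difference for me
      prefer_oj_serializer true
      http_backend typhoeus
      buffer_type memory
    </match>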

The aggregator pod still uses all the RAM you give it, right up to the limit, until memory is eventually released.
I also added:

    extraEnvVars:
    # https://brandonhilkert.com/blog/reducing-sidekiq-memory-usage-with-jemalloc/?utm_source=reddit&utm_medium=social&utm_campaign=jemalloc
    - name: LD_PRELOAD
      value: /usr/lib/x86_64-linux-gnu/libjemalloc.so
    # https://www.speedshop.co/2017/12/04/malloc-doubles-ruby-memory.html
    - name: MALLOC_ARENA_MAX
      value: "2"

This made things usable, but I still have no idea why it is so greedy with RAM.

(By the way, the forwarders behave normally.)

Of course, I also had to manually pin the elasticsearch version to 7.17.

I wasted a lot of time on this, and I'll be happy if any of these hints turn out to be useful.

@ashie
Member

ashie commented Mar 15, 2022

    # https://www.speedshop.co/2017/12/04/malloc-doubles-ruby-memory.html
    - name: MALLOC_ARENA_MAX
      value: "2"

When you use jemalloc, MALLOC_ARENA_MAX has no effect, so you can remove it.

@ashie ashie self-assigned this Mar 15, 2022
@ashie ashie added the memory label Mar 15, 2022
@sdwerwed

sdwerwed commented Jun 1, 2022

I am facing a memory leak as well.
Inputs I use:

  • forward
  • http for the liveness and readiness probes
  • systemd

Outputs:

  • null for fluent logs
  • opensearch with file buffer
  • stdout for health checks.

The file buffer is always flushed, but memory keeps increasing.
[Graph: memory usage]

Gem versions:

activesupport (7.0.2.3)
addressable (2.8.0)
aws-eventstream (1.2.0)
aws-partitions (1.579.0)
aws-sdk-core (3.130.2)
aws-sdk-kms (1.56.0)
aws-sdk-s3 (1.113.1)
aws-sdk-sqs (1.51.0)
aws-sigv4 (1.5.0)
benchmark (default: 0.1.0)
bigdecimal (default: 2.0.0)
bundler (2.3.12, 2.3.11)
cgi (default: 0.1.0.1)
concurrent-ruby (1.1.10)
cool.io (1.7.1)
csv (default: 3.1.2)
date (default: 3.0.3)
delegate (default: 0.1.0)
did_you_mean (default: 1.4.0)
digest-crc (0.6.4)
domain_name (0.5.20190701)
elastic-transport (8.0.0)
elasticsearch (8.1.2)
elasticsearch-api (8.1.2)
elasticsearch-xpack (7.17.1)
etc (default: 1.1.0)
excon (0.92.2)
faraday (1.10.0)
faraday-em_http (1.0.0)
faraday-em_synchrony (1.0.0)
faraday-excon (1.1.0)
faraday-httpclient (1.0.1)
faraday-multipart (1.0.3)
faraday-net_http (1.0.1)
faraday-net_http_persistent (1.2.0)
faraday-patron (1.0.0)
faraday-rack (1.0.0)
faraday-retry (1.0.3)
faraday_middleware-aws-sigv4 (0.6.1)
fcntl (default: 1.0.0)
ffi (1.15.5)
ffi-compiler (1.0.1)
fiddle (default: 1.0.0)
fileutils (default: 1.4.1)
fluent-config-regexp-type (1.0.0)
fluent-plugin-concat (2.5.0)
fluent-plugin-dedot_filter (1.0.0)
fluent-plugin-detect-exceptions (0.0.14)
fluent-plugin-elasticsearch (5.2.2)
fluent-plugin-grafana-loki (1.2.18)
fluent-plugin-kafka (0.17.5)
fluent-plugin-kubernetes_metadata_filter (2.10.0)
fluent-plugin-multi-format-parser (1.0.0)
fluent-plugin-opensearch (1.0.2)
fluent-plugin-prometheus (2.0.2)
fluent-plugin-record-modifier (2.1.0)
fluent-plugin-rewrite-tag-filter (2.4.0)
fluent-plugin-s3 (1.6.1)
fluent-plugin-systemd (1.0.5)
fluentd (1.14.6)
forwardable (default: 1.3.1)
getoptlong (default: 0.1.0)
http (4.4.1)
http-accept (1.7.0)
http-cookie (1.0.4)
http-form_data (2.3.0)
http-parser (1.2.3)
http_parser.rb (0.8.0)
i18n (1.10.0)
io-console (default: 0.5.6)
ipaddr (default: 1.2.2)
irb (default: 1.2.6)
jmespath (1.6.1)
json (default: 2.3.0, 2.1.0)
jsonpath (1.1.2)
kubeclient (4.9.3)
logger (default: 1.4.2)
lru_redux (1.1.0)
ltsv (0.1.2)
matrix (default: 0.2.0)
mime-types (3.4.1)
mime-types-data (3.2022.0105)
minitest (5.15.0, 5.13.0)
msgpack (1.5.1)
multi_json (1.15.0)
multipart-post (2.1.1)
mutex_m (default: 0.1.0)
net-pop (default: 0.1.0)
net-smtp (default: 0.1.0)
net-telnet (0.2.0)
netrc (0.11.0)
observer (default: 0.1.0)
oj (3.3.10)
open3 (default: 0.1.0)
opensearch-api (1.0.0)
opensearch-ruby (1.0.0)
opensearch-transport (1.0.0)
openssl (default: 2.1.3)
ostruct (default: 0.2.0)
power_assert (1.1.7)
prime (default: 0.1.1)
prometheus-client (4.0.0)
pstore (default: 0.1.0)
psych (default: 3.1.0)
public_suffix (4.0.7)
racc (default: 1.4.16)
rake (13.0.6, 13.0.1)
rdoc (default: 6.2.1.1)
readline (default: 0.0.2)
readline-ext (default: 0.1.0)
recursive-open-struct (1.1.3)
reline (default: 0.1.5)
rest-client (2.1.0)
rexml (default: 3.2.3.1)
rss (default: 0.2.8)
ruby-kafka (1.4.0)
ruby2_keywords (0.0.5)
rubygems-update (3.3.11)
sdbm (default: 1.0.0)
serverengine (2.2.5)
sigdump (0.2.4)
singleton (default: 0.1.0)
stringio (default: 0.1.0)
strptime (0.2.5)
strscan (default: 1.0.3)
systemd-journal (1.4.2)
test-unit (3.3.4)
timeout (default: 0.1.0)
tracer (default: 0.1.0)
tzinfo (2.0.4)
tzinfo-data (1.2022.1)
unf (0.1.4)
unf_ext (0.0.8.1)
uri (default: 0.10.0)
webrick (1.7.0, default: 1.6.1)
xmlrpc (0.3.0)
yajl-ruby (1.4.2)
yaml (default: 0.1.0)
zlib (default: 1.1.0)

@danielhoherd

danielhoherd commented Jun 24, 2022

I am also seeing a memory leak. Here is our Gemfile.lock, and here is a 30-day graph showing the memory leak over several daemonset restarts. Each distinct color is one instance of fluentd servicing a single node:

[Graph: 30-day memory usage across daemonset restarts]

@bysnupy

bysnupy commented Jul 7, 2022

After upgrading fluentd from v1.7.4 to v1.14.6, memory usage spiked from 400~500Mi to about 4Gi. It also keeps growing slowly but without bound, day by day, even though there were NO configuration or workload changes. The following graph covers several hours.

[Graph: memory usage over several hours]

Using plugin list:

Are there any changes between these two versions that could affect memory usage?

@madanrishi

@Mosibi
I have the same issue. I don't know if you have checked this or not, but it seems to be a Kubernetes metrics reporting issue: it reports the wrong memory figures. If you shell into the pod and run free, I think you will see more free memory than the Kubernetes metrics suggest.
Kubernetes metrics report memory usage including cache as well; maybe that's the issue.

@danielhoherd

@madanrishi do you have a link to the kubernetes issue that you think this might be?

@seyal84

seyal84 commented Sep 20, 2022

Any update on this issue, guys? I am facing a similar issue as well.

@madanrishi

madanrishi commented Sep 21, 2022

@madanrishi do you have a link to the kubernetes issue that you think this might be?
@danielhoherd
My bad on this; on further investigation, that is not the case. I still have the same issue. Restarting the statefulset fixes it, but that's not an ideal resolution.

@pbrzica

pbrzica commented Jan 13, 2023

Hi,

We have a memory leak in the following situation:

  • log_level is set to default
  • tail plugin is picking up a lot of non-matching lines (our format is set to json)

Changing the log_level so that the non-matching-pattern messages are no longer logged fixes the memory leak for us (e.g. setting log_level to error).
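
A minimal sketch of that workaround, assuming an in_tail source with a JSON parser; the path, pos_file, and tag are placeholders:

  <source>
    @type tail
    # @log_level error silences the per-line "pattern not matched" warnings,
    # which are otherwise logged at the default (warn) level for every unparseable line
    @log_level error
    # path, pos_file and tag below are placeholders
    path /var/log/app/*.log
    pos_file /var/log/fluentd/app.log.pos
    tag app.logs
    <parse>
      @type json
    </parse>
  </source>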

Environment:

  • Kernel ranging from 3.10 to 5.16
  • CentOS Linux release 7.9.2009 (Core)
  • td-agent 4.4.2 fluentd 1.15.3 (e89092c)

@yangjiel

yangjiel commented Apr 28, 2023

Facing the same issue. I have observed it with the cloudwatch_logs, log_analytics, logzio (0.0.22), and slack plugins: the memory consumption of each process keeps rising from a few hundred megabytes to several gigabytes within a few days.

td-agent 4.4.1 fluentd 1.13.3 (c328422)

@yangjiel

yangjiel commented May 11, 2023

My colleague Lester and I found an issue related to this memory problem:
#4174
