
Memory leak #3401

Open
Mosibi opened this issue May 28, 2021 · 31 comments

@Mosibi

Mosibi commented May 28, 2021

Describe the bug

I noticed a memory issue that looks to me like a leak. I originally had a more complex configuration, but boiled it down to the simple config below to rule out anything specific to my setup. Even with this minimal config, I observe the memory leakage.

[Graph: fluentd container memory usage]

I run Fluentd 1.12.3 in a container on a Kubernetes cluster, deployed via Helm.

To Reproduce
Deploy Fluentd 1.12.3 on a Kubernetes cluster via Helm with the configuration provided below, let it run for some time, and observe the memory metrics.

Expected behavior
No memory leak

Your Environment
Fluentd 1.12.3 on Kubernetes 1.17.7

Your Configuration

  <system>
    log_level info
    log_event_verbose true
    root_dir /tmp/fluentd-buffers/
  </system>

  <label @OUTPUT>
    <match **>
      @type file
      path /var/log/fluentd-output
      compress gzip
      <buffer>
        timekey 1d
        timekey_use_utc true
        timekey_wait 10m
      </buffer>
    </match>
  </label>

  ###
  # systemd/journald
  ###
  <source>
    @type systemd
    tag systemd
    path /var/log/journal
    read_from_head true

    <entry>
      fields_strip_underscores true
      fields_lowercase true
    </entry>

    @label @OUTPUT
  </source>

Additional context
Using my more complex configuration I observe the same memory growth, and there seems to be a correlation between the growth rate and the number of logs ingested. In other words, with more inputs (in my case, container logs), the (assumed) memory leak grows faster.

@kenhys
Contributor

kenhys commented May 31, 2021

It seems related to #3342.

FYI: Recently, fluent-plugin-systemd 1.0.5 was released.

It fixes "Plug a memory leaks on every reload"
fluent-plugins-nursery/fluent-plugin-systemd#91

Is it still reproducible with fluent-plugin-systemd 1.0.5?

@Mosibi
Author

Mosibi commented Jun 1, 2021

I tested with the complete config and I still see the memory leak. Today I will run with only the systemd part to see if there is any change.

@sumo-drosiek

I've been investigating a similar issue for a few days, and observed that on fluentd 1.11.5 a null output with a file buffer also constantly consumes more memory, as opposed to a null output without a buffer section. I can repeat the tests with a newer fluentd.

@Mosibi
Author

Mosibi commented Jun 1, 2021

I've been investigating a similar issue for a few days, and observed that on fluentd 1.11.5 a null output with a file buffer also constantly consumes more memory, as opposed to a null output without a buffer section. I can repeat the tests with a newer fluentd.

Ah, that is good information. Please share the results with the latest fluentd version, if you can.

Is it possible to share the configuration you are using?

@sumo-drosiek

This is my actual config

2021-06-01 12:30:47 +0000 [info]: using configuration file: <ROOT>
  <match fluentd.pod.healthcheck>
    @type relabel
    @label @FLUENT_LOG
  </match>
  <label @FLUENT_LOG>
    <match **>
      @type null
    </match>
  </label>
  <system>
    log_level info
  </system>
  <source>
    @type forward
    port 24321
    bind "0.0.0.0"
  </source>
  <match containers.**>
    @type relabel
    @label @NORMAL
  </match>
  <label @NORMAL>
    <match containers.**>
      @type copy
      <store>
        @type "null"
        <buffer>
          @type "file"
          path "/fluentd/buffer/logs.containers"
          compress gzip
          flush_interval 5s
          flush_thread_count 8
          chunk_limit_size 1m
          total_limit_size 128m
          queued_chunks_limit_size 128
          overflow_action drop_oldest_chunk
          retry_max_interval 10m
          retry_forever true
        </buffer>
      </store>
    </match>
  </label>
</ROOT>
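
For comparison, the no-buffer variant I mentioned is just the null output without the <buffer> block. A minimal sketch (not taken from the config dump above):

  <label @NORMAL>
    <match containers.**>
      @type null
    </match>
  </label>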

@Mosibi
Author

Mosibi commented Jun 1, 2021

It seems related to #3342.

FYI: Recently, fluent-plugin-systemd 1.0.5 was released.

It fixes "Plug a memory leaks on every reload"
fluent-plugin-systemd/fluent-plugin-systemd#91

Is it still reproducible with fluent-plugin-systemd 1.0.5?

Yes, also with fluent-plugin-systemd version 1.0.5 I see the same behaviour. I could attach a new graph, but it shows exactly the same pattern as the first one.

@Mosibi
Author

Mosibi commented Jun 2, 2021

Thanks to @sumo-drosiek, I can confirm that no memory leak occurs when I do not use a buffer section in the output. I am now rerunning everything with a config that does include a buffer section, to be 100% sure.

The config without buffers

  <system>
    log_level info
    log_event_verbose true
    root_dir /tmp/fluentd-buffers/
  </system>

  <label @OUTPUT>
    <match **>
      @type null
    </match>
  </label>

  ###
  # systemd/journald
  ###
  <source>
    @type systemd
    tag systemd
    path /var/log/journal
    read_from_head true

    <entry>
      fields_strip_underscores true
      fields_lowercase true
    </entry>

    @label @OUTPUT
  </source>

[Graph: memory usage with the buffer-less config]

@ashie
Member

ashie commented Jun 3, 2021

I've confirmed the issue.
I think it's not fluentd's issue, it's fluent-plugin-systemd's issue.
Please forward the report to https://github.com/fluent-plugin-systemd/fluent-plugin-systemd

Details:

diff --git a/lib/fluent/plugin/in_systemd.rb b/lib/fluent/plugin/in_systemd.rb
index adb8b3f..66f4373 100644
--- a/lib/fluent/plugin/in_systemd.rb
+++ b/lib/fluent/plugin/in_systemd.rb
@@ -120,6 +120,7 @@ module Fluent
         init_journal if @journal.wait(0) == :invalidate
         watch do |entry|
           emit(entry)
+          GC.start
         end
       end

@ashie ashie closed this as completed Jun 3, 2021
@Mosibi
Author

Mosibi commented Jun 3, 2021

@ashie I just wanted to add the results of my test run with buffering configured in an output: without buffering there is no memory leak, and with buffering a memory leak occurs.

[Graph: memory usage with vs. without output buffering]

@Mosibi
Author

Mosibi commented Jun 3, 2021

So to be complete:

No memory leak:

  <label @OUTPUT>
    <match **>
      @type null
    </match>
  </label>

Memory leak:

  <label @OUTPUT>
    <match **>
      @type null

      <buffer>
        @type "file"
        path "/var/log/fluentd-buffers/output.buffer"
        compress gzip
        flush_interval 5s
        flush_thread_count 8
        chunk_limit_size 1m
        total_limit_size 128m
        queued_chunks_limit_size 128
        overflow_action drop_oldest_chunk
        retry_max_interval 10m
        retry_forever true
      </buffer>
    </match>
  </label>
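
For completeness, a memory-buffer variant of the same output would look like the sketch below. I have not tested this one here; it simply swaps the file buffer for a memory buffer with the same limits:

  <label @OUTPUT>
    <match **>
      @type null

      <buffer>
        @type memory
        flush_interval 5s
        flush_thread_count 8
        chunk_limit_size 1m
        total_limit_size 128m
        overflow_action drop_oldest_chunk
      </buffer>
    </match>
  </label>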

@ashie
Member

ashie commented Jun 3, 2021

Thanks, further investigation may be needed on the fluentd side. Reopening.

@ashie ashie reopened this Jun 3, 2021
@Mosibi
Author

Mosibi commented Jun 10, 2021

The following graph displays memory usage over almost 7 days. This is my regular config with only the buffer sections in the Elasticsearch outputs disabled. I was already sure it was the buffer section; this is just more proof :)

[Graph: memory usage over ~7 days, complete config with buffer sections disabled]

@github-actions

github-actions bot commented Sep 8, 2021

This issue has been automatically marked as stale because it has been open 90 days with no activity. Remove stale label or comment or this issue will be closed in 30 days

@github-actions github-actions bot added the stale label Sep 8, 2021
@sumo-drosiek

@ashie any update on this?

@ashie ashie removed the stale label Sep 9, 2021
@komalpcg

I am also facing this issue. Support has suggested setting up a cron job to restart the service until a fix is available: https://stackoverflow.com/questions/69161295/gcp-monitoring-agent-increasing-memory-usage-continuously.

@github-actions

This issue has been automatically marked as stale because it has been open 90 days with no activity. Remove stale label or comment or this issue will be closed in 30 days

@github-actions github-actions bot added the stale label Dec 23, 2021
@levleontiev

The issue is still here. Please remove the stale label.

@kenhys kenhys removed the stale label Jan 21, 2022
@kvokka

kvokka commented Feb 20, 2022

Could this be connected with #3634?

@tzulin-chin

Any update on this?
I am facing this issue with a file buffer in the output config.

@kvokka

kvokka commented Mar 15, 2022

For me, the fix was setting

            prefer_oj_serializer true
            http_backend typhoeus
            buffer_type memory

in the elasticsearch match section.
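
For context, a rough sketch of where these options sit; the match pattern, host, port, and index name are placeholders, not values from this thread:

    <match containers.**>
      @type elasticsearch
      # host, port and index_name below are placeholders
      host elasticsearch.example.local
      port 9200
      index_name fluentd
      # the three settings that made the difference for me
      prefer_oj_serializer true
      http_backend typhoeus
      buffer_type memory
    </match>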

The aggregator pod still uses all the RAM you give it, right up to the limit, until memory is eventually released.
I also added:

    extraEnvVars:
    # https://brandonhilkert.com/blog/reducing-sidekiq-memory-usage-with-jemalloc/?utm_source=reddit&utm_medium=social&utm_campaign=jemalloc
    - name: LD_PRELOAD
      value: /usr/lib/x86_64-linux-gnu/libjemalloc.so
    # https://www.speedshop.co/2017/12/04/malloc-doubles-ruby-memory.html
    - name: MALLOC_ARENA_MAX
      value: "2"

This made things usable, but I still have no idea why it is so greedy with RAM.

(By the way, the forwarders behave normally.)

Of course, I also had to manually pin the elasticsearch version to 7.17.

I wasted a lot of time on this, and I'll be happy if any of these hints turn out to be useful.

@ashie
Member

ashie commented Mar 15, 2022

    # https://www.speedshop.co/2017/12/04/malloc-doubles-ruby-memory.html
    - name: MALLOC_ARENA_MAX
      value: "2"

When you use jemalloc, MALLOC_ARENA_MAX has no effect, so you can remove it.

@ashie ashie self-assigned this Mar 15, 2022
@ashie ashie added the memory label Mar 15, 2022
@sdwerwed

sdwerwed commented Jun 1, 2022

I am facing a memory leak as well.
Inputs I use:

  • forward
  • http for the liveness and readiness probes
  • systemd

Outputs:

  • null for fluent logs
  • opensearch with file buffer
  • stdout for health checks.

The file buffer is always flushed, but memory keeps increasing.
[Graph: memory usage]

Gem versions:

activesupport (7.0.2.3)
addressable (2.8.0)
aws-eventstream (1.2.0)
aws-partitions (1.579.0)
aws-sdk-core (3.130.2)
aws-sdk-kms (1.56.0)
aws-sdk-s3 (1.113.1)
aws-sdk-sqs (1.51.0)
aws-sigv4 (1.5.0)
benchmark (default: 0.1.0)
bigdecimal (default: 2.0.0)
bundler (2.3.12, 2.3.11)
cgi (default: 0.1.0.1)
concurrent-ruby (1.1.10)
cool.io (1.7.1)
csv (default: 3.1.2)
date (default: 3.0.3)
delegate (default: 0.1.0)
did_you_mean (default: 1.4.0)
digest-crc (0.6.4)
domain_name (0.5.20190701)
elastic-transport (8.0.0)
elasticsearch (8.1.2)
elasticsearch-api (8.1.2)
elasticsearch-xpack (7.17.1)
etc (default: 1.1.0)
excon (0.92.2)
faraday (1.10.0)
faraday-em_http (1.0.0)
faraday-em_synchrony (1.0.0)
faraday-excon (1.1.0)
faraday-httpclient (1.0.1)
faraday-multipart (1.0.3)
faraday-net_http (1.0.1)
faraday-net_http_persistent (1.2.0)
faraday-patron (1.0.0)
faraday-rack (1.0.0)
faraday-retry (1.0.3)
faraday_middleware-aws-sigv4 (0.6.1)
fcntl (default: 1.0.0)
ffi (1.15.5)
ffi-compiler (1.0.1)
fiddle (default: 1.0.0)
fileutils (default: 1.4.1)
fluent-config-regexp-type (1.0.0)
fluent-plugin-concat (2.5.0)
fluent-plugin-dedot_filter (1.0.0)
fluent-plugin-detect-exceptions (0.0.14)
fluent-plugin-elasticsearch (5.2.2)
fluent-plugin-grafana-loki (1.2.18)
fluent-plugin-kafka (0.17.5)
fluent-plugin-kubernetes_metadata_filter (2.10.0)
fluent-plugin-multi-format-parser (1.0.0)
fluent-plugin-opensearch (1.0.2)
fluent-plugin-prometheus (2.0.2)
fluent-plugin-record-modifier (2.1.0)
fluent-plugin-rewrite-tag-filter (2.4.0)
fluent-plugin-s3 (1.6.1)
fluent-plugin-systemd (1.0.5)
fluentd (1.14.6)
forwardable (default: 1.3.1)
getoptlong (default: 0.1.0)
http (4.4.1)
http-accept (1.7.0)
http-cookie (1.0.4)
http-form_data (2.3.0)
http-parser (1.2.3)
http_parser.rb (0.8.0)
i18n (1.10.0)
io-console (default: 0.5.6)
ipaddr (default: 1.2.2)
irb (default: 1.2.6)
jmespath (1.6.1)
json (default: 2.3.0, 2.1.0)
jsonpath (1.1.2)
kubeclient (4.9.3)
logger (default: 1.4.2)
lru_redux (1.1.0)
ltsv (0.1.2)
matrix (default: 0.2.0)
mime-types (3.4.1)
mime-types-data (3.2022.0105)
minitest (5.15.0, 5.13.0)
msgpack (1.5.1)
multi_json (1.15.0)
multipart-post (2.1.1)
mutex_m (default: 0.1.0)
net-pop (default: 0.1.0)
net-smtp (default: 0.1.0)
net-telnet (0.2.0)
netrc (0.11.0)
observer (default: 0.1.0)
oj (3.3.10)
open3 (default: 0.1.0)
opensearch-api (1.0.0)
opensearch-ruby (1.0.0)
opensearch-transport (1.0.0)
openssl (default: 2.1.3)
ostruct (default: 0.2.0)
power_assert (1.1.7)
prime (default: 0.1.1)
prometheus-client (4.0.0)
pstore (default: 0.1.0)
psych (default: 3.1.0)
public_suffix (4.0.7)
racc (default: 1.4.16)
rake (13.0.6, 13.0.1)
rdoc (default: 6.2.1.1)
readline (default: 0.0.2)
readline-ext (default: 0.1.0)
recursive-open-struct (1.1.3)
reline (default: 0.1.5)
rest-client (2.1.0)
rexml (default: 3.2.3.1)
rss (default: 0.2.8)
ruby-kafka (1.4.0)
ruby2_keywords (0.0.5)
rubygems-update (3.3.11)
sdbm (default: 1.0.0)
serverengine (2.2.5)
sigdump (0.2.4)
singleton (default: 0.1.0)
stringio (default: 0.1.0)
strptime (0.2.5)
strscan (default: 1.0.3)
systemd-journal (1.4.2)
test-unit (3.3.4)
timeout (default: 0.1.0)
tracer (default: 0.1.0)
tzinfo (2.0.4)
tzinfo-data (1.2022.1)
unf (0.1.4)
unf_ext (0.0.8.1)
uri (default: 0.10.0)
webrick (1.7.0, default: 1.6.1)
xmlrpc (0.3.0)
yajl-ruby (1.4.2)
yaml (default: 0.1.0)
zlib (default: 1.1.0)

@danielhoherd

danielhoherd commented Jun 24, 2022

I am also seeing a memory leak. Here is our Gemfile.lock, and here is a 30-day graph showing the memory leak over several daemonset restarts. Each distinct color is one instance of fluentd servicing a single node:

[Graph: 30-day memory usage across daemonset restarts]

@bysnupy

bysnupy commented Jul 7, 2022

After upgrading fluentd from v1.7.4 to v1.14.6, memory usage spiked from 400~500Mi to about 4Gi. It also keeps growing slowly but without bound, day by day, even though there were NO configuration or workload changes. The following graph covers several hours.

[Graph: memory usage over several hours]

Using plugin list:

Are there any changes between these two versions that could affect memory usage?

@madanrishi

@Mosibi
I have the same issue. I don't know if you have checked this or not, but it seems to be a Kubernetes metrics reporting issue: it reports the wrong memory figures. If you shell into the pod and run free, I think you will see more free memory than the Kubernetes metrics suggest.
Kubernetes metrics report memory usage including cache as well; maybe that's the issue.

@danielhoherd

@madanrishi do you have a link to the kubernetes issue that you think this might be?

@seyal84

seyal84 commented Sep 20, 2022

Any update on this issue, guys? I am facing a similar issue as well.

@madanrishi

madanrishi commented Sep 21, 2022

@madanrishi do you have a link to the kubernetes issue that you think this might be?
@danielhoherd
My bad on this; on further investigation, that is not the case. I still have the same issue. Restarting the statefulset fixes it, but that's not an ideal resolution.

@pbrzica

pbrzica commented Jan 13, 2023

Hi,

We have a memory leak in the following situation:

  • log_level is set to default
  • tail plugin is picking up a lot of non-matching lines (our format is set to json)

Changing the log_level so that the non-matching-pattern messages are no longer logged fixes the memory leak for us (e.g. setting log_level to error).
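
A minimal sketch of that workaround, assuming an in_tail source with a JSON parser; the path, pos_file, and tag are placeholders:

  <source>
    @type tail
    # @log_level error silences the per-line "pattern not matched" warnings,
    # which are otherwise logged at the default (warn) level for every unparseable line
    @log_level error
    # path, pos_file and tag below are placeholders
    path /var/log/app/*.log
    pos_file /var/log/fluentd/app.log.pos
    tag app.logs
    <parse>
      @type json
    </parse>
  </source>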

Environment:

  • Kernel ranging from 3.10 to 5.16
  • CentOS Linux release 7.9.2009 (Core)
  • td-agent 4.4.2 fluentd 1.15.3 (e89092c)

@yangjiel

yangjiel commented Apr 28, 2023

Facing the same issue. I have observed it with the cloudwatch_logs, log_analytics, logzio (0.0.22), and slack plugins: the memory consumption of each process keeps rising from a few hundred megabytes to several gigabytes within a few days.

td-agent 4.4.1 fluentd 1.13.3 (c328422)

@yangjiel

yangjiel commented May 11, 2023

My colleague Lester and I found an issue related to this memory problem:
#4174
