worker-0 randomly SIGKILLed #3413

Closed
alex-vmw opened this issue Jun 8, 2021 · 9 comments

Comments

alex-vmw commented Jun 8, 2021

Describe the bug
We noticed in the fluentd logs that worker-0 was SIGKILLed for no apparent reason. Right before the SIGKILL, there were messages about tailing logs that no longer exist because k8s pods were deleted, so maybe that has something to do with it.

To Reproduce
Not sure, but you can try having fluentd tail Kubernetes pod logs while constantly creating and deleting pods.
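A rough churn loop along those lines might look like the following sketch (my illustration, not part of the original report); the namespace, pod names, and iteration count are arbitrary, and it assumes kubectl access to the same cluster whose /var/log/containers files fluentd is tailing.

#!/usr/bin/env ruby
# Rough reproduction sketch: repeatedly create and delete short-lived pods so that
# in_tail keeps discovering and dropping /var/log/containers/*.log files.
require "open3"

NAMESPACE = "fluentd-churn-test" # arbitrary; create it first with `kubectl create namespace`

100.times do |i|
  pod = "churn-#{i}"
  # Start a pod that writes a handful of log lines and then exits.
  Open3.capture2e("kubectl", "run", pod, "-n", NAMESPACE,
                  "--image=busybox", "--restart=Never", "--command", "--",
                  "sh", "-c", "for n in $(seq 50); do echo churn line $n; done; sleep 5")
  sleep 10
  # Delete without waiting, so the container log file disappears while a tail watcher may still exist.
  Open3.capture2e("kubectl", "delete", "pod", pod, "-n", NAMESPACE, "--wait=false")
end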

Expected behavior
worker-0 shouldn't get SIGKILLed for no apparent reason.

Your Environment
Fluentd or td-agent version: fluentd 1.12.3.
Operating system: Ubuntu 20.04.1 LTS
Kernel version: 5.4.0-62-generic

Your Configuration

<ROOT>
  <system>
    log_level info
    rpc_endpoint "127.0.0.1:24444"
  </system>
  <source>
    @type systemd
    @id in_systemd_docker
    path "/var/log/journal"
    tag "systemd.unit"
    read_from_head false
    <storage>
      @type "local"
      persistent true
      path "/var/log/-kube-fluentd-operator--fluentd-journald-cursor.json"
    </storage>
    <entry>
      field_map {"_SYSTEMD_UNIT":"unit","MESSAGE":"log","_PID":["pid"],"_PRIORITY":"priority","_COMM":"cmd","_HOSTNAME":"hostname"}
      field_map_strict true
      fields_lowercase true
    </entry>
  </source>
  <match systemd.unit>
    @type rewrite_tag_filter
    <rule>
      key "unit"
      pattern ^(.+)$
      tag "systemd.$1"
    </rule>
  </match>
  <filter systemd.kubelet.service>
    @type parser
    reserve_data true
    key_name "log"
    emit_invalid_record_to_error false
    <parse>
      @type "kubernetes"
      time_format "%m%d %H:%M:%S.%N"
      expression /^(?<severity>\w)(?<time>\d{4} [^\s]*)\s+(?<pid>\d+)\s+(?<source>[^ \]]+)\] (?<message>.*)/m
    </parse>
  </filter>
  <source>
    @type tail
    @id in_tail_container_logs
    path "/var/log/containers/*.log"
    pos_file "/var/log/-kube-fluentd-operator--fluentd-containers.log.pos"
    tag "kubernetes.*"
    read_from_head true
    <parse>
      @type "multiline"
      format1 /^(?<partials>([^\n]+ (stdout|stderr) P [^\n]+\n)*)/
      format2 /(?<time>[^\n]+) (?<stream>stdout|stderr) F (?<log>[^\n]*)/
      format3 /|(?<json>{.*})/
      time_format "%Y-%m-%dT%H:%M:%S.%N%:z"
      unmatched_lines
    </parse>
  </source>
  <filter kubernetes.**>
    @type record_transformer
    @id filter_crio_container_logs
    enable_ruby true
    remove_keys partials
    <record>
      log ${record["partials"]&.gsub(/.+ (stdout|stderr) P (.+)\n/, '\\2')}${record["log"]}
    </record>
  </filter>
  <filter kubernetes.**>
    @type parser
    @id filter_docker_container_logs
    key_name "json"
    remove_key_name_field true
    reserve_data true
    emit_invalid_record_to_error false
    <parse>
      @type "json"
      time_format "%Y-%m-%dT%H:%M:%S.%NZ"
      time_type string
    </parse>
  </filter>
  <source>
    @type tail
    @id in_tail_startupscript
    path "/var/log/startupscript.log"
    pos_file "/var/log/-kube-fluentd-operator--fluentd-startupscript.log.pos"
    tag "startupscript"
    <parse>
      @type "syslog"
      unmatched_lines
    </parse>
  </source>
  <source>
    @type tail
    @id in_tail_docker
    path "/var/log/docker.log"
    pos_file "/var/log/-kube-fluentd-operator--fluentd-docker.log.pos"
    tag "docker"
    <parse>
      @type "regexp"
      expression /^time="(?<time>[^)]*)" level=(?<severity>[^ ]*) msg="(?<message>[^"]*)"( err="(?<error>[^"]*)")?( statusCode=($<status_code>\d+))?/
      unmatched_lines
    </parse>
  </source>
  <source>
    @type tail
    @id in_tail_etcd
    path "/var/log/etcd.log"
    pos_file "/var/log/-kube-fluentd-operator--fluentd-etcd.log.pos"
    tag "k8s.etcd"
    <parse>
      @type "none"
      unmatched_lines
    </parse>
  </source>
  <source>
    @type tail
    @id in_tail_kubelet
    multiline_flush_interval 5s
    path "/var/log/kubelet.log"
    pos_file "/var/log/-kube-fluentd-operator--fluentd-kubelet.log.pos"
    tag "k8s.kubelet"
    <parse>
      @type "kubernetes"
      unmatched_lines
      expression /^(?<severity>\w)(?<time>\d{4} [^\s]*)\s+(?<pid>\d+)\s+(?<source>[^ \]]+)\] (?<message>.*)/m
      time_format "%m%d %H:%M:%S.%N"
    </parse>
  </source>
  <source>
    @type tail
    @id in_tail_kube_proxy
    multiline_flush_interval 5s
    path "/var/log/kube-proxy.log"
    pos_file "/var/log/-kube-fluentd-operator--fluentd-kube-proxy.log.pos"
    tag "k8s.kube-proxy"
    <parse>
      @type "kubernetes"
      unmatched_lines
      expression /^(?<severity>\w)(?<time>\d{4} [^\s]*)\s+(?<pid>\d+)\s+(?<source>[^ \]]+)\] (?<message>.*)/m
      time_format "%m%d %H:%M:%S.%N"
    </parse>
  </source>
  <source>
    @type tail
    @id in_tail_kube_apiserver
    multiline_flush_interval 5s
    path "/var/log/kube-apiserver.log"
    pos_file "/var/log/-kube-fluentd-operator--fluentd-kube-apiserver.log.pos"
    tag "k8s.kube-apiserver"
    <parse>
      @type "kubernetes"
      unmatched_lines
      expression /^(?<severity>\w)(?<time>\d{4} [^\s]*)\s+(?<pid>\d+)\s+(?<source>[^ \]]+)\] (?<message>.*)/m
      time_format "%m%d %H:%M:%S.%N"
    </parse>
  </source>
  <source>
    @type tail
    @id in_tail_kube_controller_manager
    multiline_flush_interval 5s
    path "/var/log/kube-controller-manager.log"
    pos_file "/var/log/-kube-fluentd-operator--fluentd-kube-controller-manager.log.pos"
    tag "k8s.kube-controller-manager"
    <parse>
      @type "kubernetes"
      unmatched_lines
      expression /^(?<severity>\w)(?<time>\d{4} [^\s]*)\s+(?<pid>\d+)\s+(?<source>[^ \]]+)\] (?<message>.*)/m
      time_format "%m%d %H:%M:%S.%N"
    </parse>
  </source>
  <source>
    @type tail
    @id in_tail_kube_scheduler
    multiline_flush_interval 5s
    path "/var/log/kube-scheduler.log"
    pos_file "/var/log/-kube-fluentd-operator--fluentd-kube-scheduler.log.pos"
    tag "k8s.kube-scheduler"
    <parse>
      @type "kubernetes"
      unmatched_lines
      expression /^(?<severity>\w)(?<time>\d{4} [^\s]*)\s+(?<pid>\d+)\s+(?<source>[^ \]]+)\] (?<message>.*)/m
      time_format "%m%d %H:%M:%S.%N"
    </parse>
  </source>
  <source>
    @type tail
    @id in_tail_rescheduler
    multiline_flush_interval 5s
    path "/var/log/rescheduler.log"
    pos_file "/var/log/-kube-fluentd-operator--fluentd-rescheduler.log.pos"
    tag "k8s.rescheduler"
    <parse>
      @type "kubernetes"
      unmatched_lines
      expression /^(?<severity>\w)(?<time>\d{4} [^\s]*)\s+(?<pid>\d+)\s+(?<source>[^ \]]+)\] (?<message>.*)/m
      time_format "%m%d %H:%M:%S.%N"
    </parse>
  </source>
  <source>
    @type tail
    @id in_tail_glbc
    multiline_flush_interval 5s
    path "/var/log/glbc.log"
    pos_file "/var/log/-kube-fluentd-operator--fluentd-glbc.log.pos"
    tag "k8s.glbc"
    <parse>
      @type "kubernetes"
      unmatched_lines
      expression /^(?<severity>\w)(?<time>\d{4} [^\s]*)\s+(?<pid>\d+)\s+(?<source>[^ \]]+)\] (?<message>.*)/m
      time_format "%m%d %H:%M:%S.%N"
    </parse>
  </source>
  <source>
    @type tail
    @id in_tail_cluster_autoscaler
    multiline_flush_interval 5s
    path "/var/log/cluster-autoscaler.log"
    pos_file "/var/log/-kube-fluentd-operator--fluentd-cluster-autoscaler.log.pos"
    tag "k8s.cluster-autoscaler"
    <parse>
      @type "kubernetes"
      unmatched_lines
      expression /^(?<severity>\w)(?<time>\d{4} [^\s]*)\s+(?<pid>\d+)\s+(?<source>[^ \]]+)\] (?<message>.*)/m
      time_format "%m%d %H:%M:%S.%N"
    </parse>
  </source>
  <source>
    @type tail
    @id in_tail_kube_apiserver_audit
    multiline_flush_interval 5s
    path "/var/log/kubernetes/kube-apiserver-audit.log"
    pos_file "/var/log/-kube-fluentd-operator--kube-apiserver-audit.log.pos"
    tag "k8s.kube-apiserver-audit"
    <parse>
      @type "multiline"
      format_firstline "/^\\S+\\s+AUDIT:/"
      format1 /^(?<time>\S+) AUDIT:(?: (?:id="(?<id>(?:[^"\\]|\\.)*)"|ip="(?<ip>(?:[^"\\]|\\.)*)"|method="(?<method>(?:[^"\\]|\\.)*)"|user="(?<user>(?:[^"\\]|\\.)*)"|groups="(?<groups>(?:[^"\\]|\\.)*)"|as="(?<as>(?:[^"\\]|\\.)*)"|asgroups="(?<asgroups>(?:[^"\\]|\\.)*)"|namespace="(?<namespace>(?:[^"\\]|\\.)*)"|uri="(?<uri>(?:[^"\\]|\\.)*)"|response="(?<response>(?:[^"\\]|\\.)*)"|\w+="(?:[^"\\]|\\.)*"))*/
      time_format "%Y-%m-%dT%T.%L%Z"
      unmatched_lines
    </parse>
  </source>
  <filter kubernetes.**>
    @type kubernetes_metadata
    @id filter_kube_metadata
  </filter>
  <filter kubernetes.**>
    @type kubernetes_metadata
  </filter>
  <filter kubernetes.**>
    @type record_transformer
    enable_ruby true
    <record>
      kubernetes_namespace_container_name ${record["kubernetes"]["namespace_name"]}.${record["kubernetes"]["pod_name"]}.${record["kubernetes"]["container_name"]}
      container_info ${record["docker"]["container_id"]}-${record["stream"]}
    </record>
  </filter>
  <match kubernetes.**>
    @type rewrite_tag_filter
    <rule>
      key "kubernetes_namespace_container_name"
      pattern ^(.+)$
      tag "kube.$1"
    </rule>
  </match>
  <filter kube.*.*.*>
    @type record_modifier
    remove_keys "dummy_"
    <record>
      dummy_ ${record["kubernetes"]&.delete("master_url"); record["kubernetes"]&.delete("namespace_id"); if record["kubernetes"]&.has_key?("labels"); record["kubernetes"]["labels"].delete("pod-template-generation"); record["kubernetes"]["labels"].delete("controller-revision-hash");  record["kubernetes"]["labels"].delete("pod-template-hash"); end; nil}
    </record>
  </filter>
  <filter kube.*.*.*>
    @type record_transformer
    remove_keys kubernetes_namespace_container_name
  </filter>
  <filter kube.kube-system.**>
    @type parser
    reserve_data true
    key_name "log"
    emit_invalid_record_to_error false
    <parse>
      @type "kubernetes"
      time_format "%m%d %H:%M:%S.%N"
      expression /^(?<severity>\w)(?<time>\d{4} [^\s]*)\s+(?<pid>\d+)\s+(?<source>[^ \]]+)\] (?<message>.*)/m
    </parse>
  </filter>
  <source>
    @type prometheus
  </source>
  <source>
    @type prometheus_monitor
  </source>
  <source>
    @type prometheus_output_monitor
  </source>
  <match systemd.** kube.kube-system.**>
    @type copy
    <store>
      @id X
      @type "vmware_loginsight"
      host "X.X.X.X"
      include_tag_key true
      log_text_keys ["log","msg","message"]
      port 9000
      scheme "http"
      serializer "json"
      ssl_verify false
      tag_key "tag"
    </store>
  </match>
  <match **>
    @type null
  </match>
</ROOT>

Your Error Log

2021-06-08 16:15:48 +0000 [info]: #0 [filter_kube_metadata] stats - namespace_cache_size: 8, pod_cache_size: 272, pod_cache_watch_updates: 691, pod_cache_watch_delete_ignored: 175, namespace_cache_api_updates: 46, pod_cache_api_updates: 46, id_cache_miss: 46, namespace_cache_watch_deletes_ignored: 169, namespace_cache_watch_misses: 342, namespace_cache_watch_ignored: 190, pod_cache_host_updates: 7897, pod_cache_watch_ignored: 169, pod_watch_gone_errors: 27, pod_watch_gone_notices: 27
2021-06-08 16:15:48 +0000 [info]: #0 stats - namespace_cache_size: 8, pod_cache_size: 272, pod_cache_watch_updates: 691, pod_cache_watch_delete_ignored: 175, namespace_cache_api_updates: 46, pod_cache_api_updates: 46, id_cache_miss: 46, namespace_cache_watch_deletes_ignored: 179, namespace_cache_watch_misses: 362, namespace_cache_watch_ignored: 201, pod_cache_host_updates: 7629, pod_cache_watch_ignored: 169, pod_watch_gone_errors: 26, pod_watch_gone_notices: 26
2021-06-08 16:15:49 +0000 [info]: #0 [filter_kube_metadata] stats - namespace_cache_size: 1, pod_cache_size: 272, pod_cache_watch_updates: 17493, pod_cache_watch_delete_ignored: 864, namespace_cache_api_updates: 1, pod_cache_api_updates: 1, id_cache_miss: 1, namespace_cache_watch_deletes_ignored: 652, namespace_cache_watch_misses: 1365, namespace_cache_watch_ignored: 703, pod_cache_host_updates: 33058, pod_cache_watch_ignored: 381, pod_watch_gone_errors: 117, pod_watch_gone_notices: 117
2021-06-08 16:15:49 +0000 [info]: #0 stats - namespace_cache_size: 1, pod_cache_size: 272, pod_cache_watch_updates: 17497, pod_cache_watch_delete_ignored: 865, namespace_cache_api_updates: 1, pod_cache_api_updates: 1, id_cache_miss: 1, namespace_cache_watch_deletes_ignored: 650, namespace_cache_watch_misses: 1362, namespace_cache_watch_ignored: 701, pod_cache_host_updates: 33897, pod_cache_watch_ignored: 381, pod_watch_gone_errors: 120, pod_watch_gone_notices: 120
2021-06-08 16:16:18 +0000 [info]: #0 [filter_kube_metadata] stats - namespace_cache_size: 8, pod_cache_size: 272, pod_cache_watch_updates: 691, pod_cache_watch_delete_ignored: 175, namespace_cache_api_updates: 46, pod_cache_api_updates: 46, id_cache_miss: 46, namespace_cache_watch_deletes_ignored: 169, namespace_cache_watch_misses: 342, namespace_cache_watch_ignored: 190, pod_cache_host_updates: 7897, pod_cache_watch_ignored: 169, pod_watch_gone_errors: 27, pod_watch_gone_notices: 27
2021-06-08 16:16:18 +0000 [info]: #0 stats - namespace_cache_size: 8, pod_cache_size: 272, pod_cache_watch_updates: 691, pod_cache_watch_delete_ignored: 175, namespace_cache_api_updates: 46, pod_cache_api_updates: 46, id_cache_miss: 46, namespace_cache_watch_deletes_ignored: 179, namespace_cache_watch_misses: 362, namespace_cache_watch_ignored: 201, pod_cache_host_updates: 7629, pod_cache_watch_ignored: 169, pod_watch_gone_errors: 26, pod_watch_gone_notices: 26
2021-06-08 16:16:48 +0000 [info]: #0 [filter_kube_metadata] stats - namespace_cache_size: 8, pod_cache_size: 272, pod_cache_watch_updates: 691, pod_cache_watch_delete_ignored: 175, namespace_cache_api_updates: 46, pod_cache_api_updates: 46, id_cache_miss: 46, namespace_cache_watch_deletes_ignored: 169, namespace_cache_watch_misses: 342, namespace_cache_watch_ignored: 190, pod_cache_host_updates: 7897, pod_cache_watch_ignored: 169, pod_watch_gone_errors: 27, pod_watch_gone_notices: 27
2021-06-08 16:16:48 +0000 [info]: #0 stats - namespace_cache_size: 8, pod_cache_size: 272, pod_cache_watch_updates: 691, pod_cache_watch_delete_ignored: 175, namespace_cache_api_updates: 46, pod_cache_api_updates: 46, id_cache_miss: 46, namespace_cache_watch_deletes_ignored: 179, namespace_cache_watch_misses: 362, namespace_cache_watch_ignored: 201, pod_cache_host_updates: 7629, pod_cache_watch_ignored: 169, pod_watch_gone_errors: 26, pod_watch_gone_notices: 26
2021-06-08 16:16:49 +0000 [info]: #0 [in_tail_container_logs] following tail of /var/log/containers/decc-ara-gulel-nimbus-457p2_decc-ara-gulel-nimbus_gulel-container-f8e1921b-1a46304c12b181dd5c628649daf598b22a303fe4a08d4fbd5f6333597c95706b.log
2021-06-08 16:16:49 +0000 [info]: #0 disable filter chain optimization because [Fluent::Plugin::RecordTransformerFilter, Fluent::Plugin::KubernetesMetadataFilter, Fluent::Plugin::KubernetesMetadataFilter, Fluent::Plugin::RecordTransformerFilter] uses `#filter_stream` method.
2021-06-08 16:16:49 +0000 [info]: #0 disable filter chain optimization because [Fluent::Plugin::RecordTransformerFilter] uses `#filter_stream` method.
2021-06-08 16:16:49 +0000 [warn]: #0 /var/log/containers/p4login-1623078060-zpl24_ara-cat-beta_p4login-18ff96509abecc02391857da8e6b820eee62274ab68b6adc6c443e432e9d36d6.log not found. Continuing without tailing it.
2021-06-08 16:16:49 +0000 [warn]: #0 stat() for /var/log/containers/p4login-1623078060-zpl24_ara-cat-beta_p4login-18ff96509abecc02391857da8e6b820eee62274ab68b6adc6c443e432e9d36d6.log failed with ENOENT. Drop tail watcher for now.
2021-06-08 16:16:49 +0000 [info]: #0 [in_tail_container_logs] following tail of /var/log/containers/test-829-sleeper-545888cddc-7hxzs_decc-ara-runway-argocd-stress-test-1_sleeper-0f7b7cb26656a58423dd88671730e13076d1e9130a938eb057b065d214b87bc9.log
2021-06-08 16:16:49 +0000 [info]: #0 disable filter chain optimization because [Fluent::Plugin::RecordTransformerFilter, Fluent::Plugin::KubernetesMetadataFilter, Fluent::Plugin::KubernetesMetadataFilter, Fluent::Plugin::RecordTransformerFilter] uses `#filter_stream` method.
2021-06-08 16:16:49 +0000 [info]: #0 disable filter chain optimization because [Fluent::Plugin::RecordTransformerFilter] uses `#filter_stream` method.
2021-06-08 16:16:49 +0000 [info]: #0 [in_tail_container_logs] following tail of /var/log/containers/csp-billing-mvc-68fb6d8577-p6f5j_ara-csp-rrodolfo_csp-billing-mvc-a46591834fdb5cf3728620465efbd0b063fd3c87f110c6692262ff8df25f5f60.log
2021-06-08 16:16:49 +0000 [info]: #0 disable filter chain optimization because [Fluent::Plugin::RecordTransformerFilter, Fluent::Plugin::KubernetesMetadataFilter, Fluent::Plugin::KubernetesMetadataFilter, Fluent::Plugin::RecordTransformerFilter] uses `#filter_stream` method.
2021-06-08 16:16:49 +0000 [info]: #0 disable filter chain optimization because [Fluent::Plugin::RecordTransformerFilter] uses `#filter_stream` method.
2021-06-08 16:17:18 +0000 [info]: #0 [filter_kube_metadata] stats - namespace_cache_size: 9, pod_cache_size: 272, namespace_cache_api_updates: 49, pod_cache_api_updates: 49, id_cache_miss: 49, pod_cache_watch_updates: 691, pod_cache_watch_delete_ignored: 175, namespace_cache_watch_deletes_ignored: 169, namespace_cache_watch_misses: 342, namespace_cache_watch_ignored: 190, pod_cache_host_updates: 7897, pod_cache_watch_ignored: 169, pod_watch_gone_errors: 27, pod_watch_gone_notices: 27
2021-06-08 16:17:18 +0000 [info]: #0 stats - namespace_cache_size: 9, pod_cache_size: 272, namespace_cache_api_updates: 49, pod_cache_api_updates: 49, id_cache_miss: 49, pod_cache_watch_updates: 691, pod_cache_watch_delete_ignored: 175, namespace_cache_watch_deletes_ignored: 179, namespace_cache_watch_misses: 362, namespace_cache_watch_ignored: 201, pod_cache_host_updates: 7629, pod_cache_watch_ignored: 169, pod_watch_gone_errors: 26, pod_watch_gone_notices: 26
2021-06-08 16:17:28 +0000 [warn]: #0 stat() for /var/log/containers/decc-ara-gulel-nimbus-bszrh_decc-ara-gulel-nimbus_gulel-container-46fc33f9-c5ec018fc6d5c41a6a9114f80e227612cc256878cac2c30a2463b98532d2e5ff.log failed with ENOENT. Drop tail watcher for now.
2021-06-08 16:17:28 +0000 [info]: #0 [in_tail_container_logs] following tail of /var/log/containers/decc-ara-gulel-nimbus-9gtf6_decc-ara-gulel-nimbus_gulel-container-c235795f-dbcc61b373eb557565f5439cf6b6d0074557b4bb1ba4eba69f9dba125bdb229e.log
2021-06-08 16:17:28 +0000 [info]: #0 disable filter chain optimization because [Fluent::Plugin::RecordTransformerFilter, Fluent::Plugin::KubernetesMetadataFilter, Fluent::Plugin::KubernetesMetadataFilter, Fluent::Plugin::RecordTransformerFilter] uses `#filter_stream` method.
2021-06-08 16:17:28 +0000 [info]: #0 disable filter chain optimization because [Fluent::Plugin::RecordTransformerFilter] uses `#filter_stream` method.
2021-06-08 16:17:28 +0000 [info]: #0 [in_tail_container_logs] following tail of /var/log/containers/decc-ara-gulel-nimbus-vw8ht_decc-ara-gulel-nimbus_gulel-container-bf44a93d-1bad109b217083a49e3ca74a8c8a504784e79d3a5125dfed64b6d7ab7d247da4.log
2021-06-08 16:17:28 +0000 [info]: #0 disable filter chain optimization because [Fluent::Plugin::RecordTransformerFilter, Fluent::Plugin::KubernetesMetadataFilter, Fluent::Plugin::KubernetesMetadataFilter, Fluent::Plugin::RecordTransformerFilter] uses `#filter_stream` method.
2021-06-08 16:17:29 +0000 [info]: #0 disable filter chain optimization because [Fluent::Plugin::RecordTransformerFilter] uses `#filter_stream` method.
2021-06-08 16:17:29 +0000 [warn]: #0 /var/log/containers/decc-ara-gulel-nimbus-xc2pf_decc-ara-gulel-nimbus_gulel-container-de82d83d-5475370a2b4d9fcc82862d727f3f2fa413ca669fa2079c617dc0a99a4092a6ba.log not found. Continuing without tailing it.
2021-06-08 16:17:29 +0000 [warn]: #0 stat() for /var/log/containers/decc-ara-gulel-nimbus-xc2pf_decc-ara-gulel-nimbus_gulel-container-de82d83d-5475370a2b4d9fcc82862d727f3f2fa413ca669fa2079c617dc0a99a4092a6ba.log failed with ENOENT. Drop tail watcher for now.
2021-06-08 16:17:29 +0000 [info]: #0 [in_tail_container_logs] following tail of /var/log/containers/decc-ara-gulel-nimbus-355ba802-tqvrk_decc-ara-gulel-worker_gulel-container-355ba802-ae54c881e04c6fc7d3b08283c581b56c0baa1cbd01de10351f35cd73ea68c7b8.log
2021-06-08 16:17:29 +0000 [info]: #0 disable filter chain optimization because [Fluent::Plugin::RecordTransformerFilter, Fluent::Plugin::KubernetesMetadataFilter, Fluent::Plugin::KubernetesMetadataFilter, Fluent::Plugin::RecordTransformerFilter] uses `#filter_stream` method.
2021-06-08 16:17:29 +0000 [info]: #0 disable filter chain optimization because [Fluent::Plugin::RecordTransformerFilter] uses `#filter_stream` method.
2021-06-08 16:17:30 +0000 [info]: fluent/log.rb:329:info: Worker 0 finished unexpectedly with signal SIGKILL
ashie (Member) commented Jun 9, 2021

Probably the same as #3341.

alex-vmw (Author) commented Jun 9, 2021

> Probably the same as #3341.

Hi @ashie, I think this issue is different because I specifically checked the logs to make sure there was no gracefulReload requested (or any other restarts, etc.). I can provide the full logs if you'd like.

ashie (Member) commented Jun 9, 2021

Don't worry, the comment was only meant to add a reference to a similar issue.
Sorry for the terse explanation.

ashie (Member) commented Jun 21, 2021

I checked the implementation, but fluentd itself doesn't send a KILL signal to workers, as I mentioned at #3341 (comment).
So the KILL signal is sent by something else, most likely the OOM killer.
Please check the memory usage of your fluentd worker process and the OOM killer's log (typically in /var/log/messages).
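A minimal sketch of those two checks (my addition, not from the thread), assuming the workers can be found by their --under-supervisor argument and that the kernel log lives at /var/log/messages or /var/log/syslog:

# Print the resident set size of each fluentd worker and scan the kernel log for
# OOM-killer activity. The pgrep pattern and log paths are assumptions.
worker_pids = `pgrep -f 'fluentd.*--under-supervisor'`.split.map(&:to_i)

worker_pids.each do |pid|
  rss_kb = File.read("/proc/#{pid}/status")[/^VmRSS:\s+(\d+)\s+kB/, 1]
  puts "fluentd worker #{pid}: VmRSS = #{rss_kb} kB"
end

kernel_log = ["/var/log/messages", "/var/log/syslog"].find { |f| File.readable?(f) }
if kernel_log
  File.foreach(kernel_log) { |line| puts line if line =~ /oom-killer|out of memory/i }
else
  puts "no readable kernel log found here; try `dmesg | grep -i oom` on the node"
end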

I think #3341 is also caused by the OOM killer, because gracefulReload consumes more memory.
On gracefulReload, fluentd tries to create a new Fluent::Engine for the new config without replacing the process. The old Fluent::Engine instance keeps running during this, so the process holds two engines while the gracefulReload is in progress. (When multiple gracefulReloads are attempted in a short time, it might hold three or more engines, though I haven't confirmed that yet.)

A known method to reduce fluentd's memory usage is to use jemalloc (3.x is better).
See #1657 for more detail.
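For illustration (my addition): jemalloc is typically enabled for fluentd by preloading the shared library, for example via LD_PRELOAD, as the official images do. A small sketch to confirm it is actually mapped into the running process; the library path in the comment is an assumption and varies by base image.

# Confirm that a jemalloc shared object is mapped into the current process. Run it
# the same way fluentd is launched, e.g.
#   LD_PRELOAD=/usr/lib/libjemalloc.so ruby check_jemalloc.rb
# (the library path is an assumption and depends on your base image).
jemalloc_maps = File.read("/proc/self/maps").lines.grep(/jemalloc/)

if jemalloc_maps.empty?
  warn "jemalloc does not appear to be preloaded (no libjemalloc mapping found)"
else
  puts "jemalloc is loaded from #{jemalloc_maps.first.split.last}"
end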

cosmo0920 (Contributor) commented Jun 22, 2021

> Right before the SIGKILL, there were messages about tailing logs that no longer exist because k8s pods were deleted, so maybe that has something to do with it.

I'm not familiar with the details of your environment, but are you running Fluentd via something like a Kubernetes Fluentd operator, with CPU and memory resource limits set?
As ashie says, gracefulReload consumes several times the usual memory because of the extra Fluent::RootAgent instances.

ashie (Member) commented Jun 22, 2021

BTW, one detail in my explanation above was wrong:

- Fluent::Engine
+ Fluent::RootAgent

def reload_config(conf, supervisor: false)
  # configure first to reduce down time while restarting
  new_agent = RootAgent.new(log: log, system_config: @system_config)
  ret = Fluent::StaticConfigAnalysis.call(conf, workers: system_config.workers)
  ret.all_plugins.each do |plugin|
    if plugin.respond_to?(:reloadable_plugin?) && !plugin.reloadable_plugin?
      raise Fluent::ConfigError, "Unreloadable plugin plugin: #{Fluent::Plugin.lookup_type_from_class(plugin.class)}, plugin_id: #{plugin.plugin_id}, class_name: #{plugin.class})"
    end
  end

  # Assign @root_agent to new root_agent
  # for https://github.com/fluent/fluentd/blob/fcef949ce40472547fde295ddd2cfe297e1eddd6/lib/fluent/plugin_helper/event_emitter.rb#L50
  old_agent, @root_agent = @root_agent, new_agent
  begin
    @root_agent.configure(conf)
  rescue
    @root_agent = old_agent
    raise
  end

  unless @suppress_config_dump
    $log.info :supervisor, "using configuration file: #{conf.to_s.rstrip}"
  end

  # supervisor doesn't handle actual data. so the following code is unnecessary.
  if supervisor
    old_agent.shutdown # to close thread created in #configure
    return
  end

  stop_phase(old_agent)
  $log.info 'restart fluentd worker', worker: worker_id
  start_phase(new_agent)
end

alex-vmw (Author) commented

> A known method to reduce fluentd's memory usage is to use jemalloc (3.x is better).

@ashie and @cosmo0920, thanks for all the details. I have now implemented jemalloc 3.6.0 and will monitor the memory utilization of the fluentd pods to make sure they are not killed by the Kubernetes OOM killer.
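One way to watch for exactly that outcome is to poll the pods' container statuses for a last termination reason of OOMKilled; a rough sketch (not from the original comment), assuming a hypothetical namespace and label selector:

require "json"

# Poll fluentd pods and report containers whose last termination reason was OOMKilled.
# The namespace and label selector are assumptions; adjust them to your deployment.
loop do
  pods = JSON.parse(`kubectl get pods -n kube-system -l app=fluentd -o json`)
  pods.fetch("items", []).each do |pod|
    Array(pod.dig("status", "containerStatuses")).each do |cs|
      reason = cs.dig("lastState", "terminated", "reason")
      next unless reason
      puts "#{pod.dig("metadata", "name")}/#{cs["name"]}: restarts=#{cs["restartCount"]}, lastTermination=#{reason}"
    end
  end
  sleep 60
end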

cosmo0920 (Contributor) commented

> A known method to reduce fluentd's memory usage is to use jemalloc (3.x is better).
>
> @ashie and @cosmo0920, thanks for all the details. I have now implemented jemalloc 3.6.0 and will monitor the memory utilization of the fluentd pods to make sure they are not killed by the Kubernetes OOM killer.

Thanks for the information. For a high-load environment, jemalloc 3.6.0 is better for memory consumption than jemalloc 5.2.1 or plain Linux malloc, right?

ashie (Member) commented Jul 16, 2021

As I mentioned before, we think this is triggered by the OOM killer; the cause is high memory usage.
We already have several issues about high memory usage and memory leaks, e.g. #1657.
So I'm closing this issue.
