worker-0 randomly SIGKILLed #3413

Closed
alex-vmw opened this issue Jun 8, 2021 · 9 comments

Comments

alex-vmw commented Jun 8, 2021

Describe the bug
We noticed in the fluentd logs that worker-0 was SIGKILLed for no apparent reason. Right before the SIGKILL, there were messages about tailing logs that no longer exist because k8s pods were deleted, so maybe that has something to do with it.

To Reproduce
Not sure, but you can try having fluentd tail Kubernetes pod logs while constantly creating and deleting pods.
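A rough churn loop along those lines might look like the following sketch (my illustration, not part of the original report); the namespace, pod names, and iteration count are arbitrary, and it assumes kubectl access to the same cluster whose /var/log/containers files fluentd is tailing.

#!/usr/bin/env ruby
# Rough reproduction sketch: repeatedly create and delete short-lived pods so that
# in_tail keeps discovering and dropping /var/log/containers/*.log files.
require "open3"

NAMESPACE = "fluentd-churn-test" # arbitrary; create it first with `kubectl create namespace`

100.times do |i|
  pod = "churn-#{i}"
  # Start a pod that writes a handful of log lines and then exits.
  Open3.capture2e("kubectl", "run", pod, "-n", NAMESPACE,
                  "--image=busybox", "--restart=Never", "--command", "--",
                  "sh", "-c", "for n in $(seq 50); do echo churn line $n; done; sleep 5")
  sleep 10
  # Delete without waiting, so the container log file disappears while a tail watcher may still exist.
  Open3.capture2e("kubectl", "delete", "pod", pod, "-n", NAMESPACE, "--wait=false")
end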

Expected behavior
worker-0 shouldn't get SIGKILLed for no apparent reason.

Your Environment
Fluentd or td-agent version: fluentd 1.12.3.
Operating system: Ubuntu 20.04.1 LTS
Kernel version: 5.4.0-62-generic

Your Configuration

<ROOT>
  <system>
    log_level info
    rpc_endpoint "127.0.0.1:24444"
  </system>
  <source>
    @type systemd
    @id in_systemd_docker
    path "/var/log/journal"
    tag "systemd.unit"
    read_from_head false
    <storage>
      @type "local"
      persistent true
      path "/var/log/-kube-fluentd-operator--fluentd-journald-cursor.json"
    </storage>
    <entry>
      field_map {"_SYSTEMD_UNIT":"unit","MESSAGE":"log","_PID":["pid"],"_PRIORITY":"priority","_COMM":"cmd","_HOSTNAME":"hostname"}
      field_map_strict true
      fields_lowercase true
    </entry>
  </source>
  <match systemd.unit>
    @type rewrite_tag_filter
    <rule>
      key "unit"
      pattern ^(.+)$
      tag "systemd.$1"
    </rule>
  </match>
  <filter systemd.kubelet.service>
    @type parser
    reserve_data true
    key_name "log"
    emit_invalid_record_to_error false
    <parse>
      @type "kubernetes"
      time_format "%m%d %H:%M:%S.%N"
      expression /^(?<severity>\w)(?<time>\d{4} [^\s]*)\s+(?<pid>\d+)\s+(?<source>[^ \]]+)\] (?<message>.*)/m
    </parse>
  </filter>
  <source>
    @type tail
    @id in_tail_container_logs
    path "/var/log/containers/*.log"
    pos_file "/var/log/-kube-fluentd-operator--fluentd-containers.log.pos"
    tag "kubernetes.*"
    read_from_head true
    <parse>
      @type "multiline"
      format1 /^(?<partials>([^\n]+ (stdout|stderr) P [^\n]+\n)*)/
      format2 /(?<time>[^\n]+) (?<stream>stdout|stderr) F (?<log>[^\n]*)/
      format3 /|(?<json>{.*})/
      time_format "%Y-%m-%dT%H:%M:%S.%N%:z"
      unmatched_lines
    </parse>
  </source>
  <filter kubernetes.**>
    @type record_transformer
    @id filter_crio_container_logs
    enable_ruby true
    remove_keys partials
    <record>
      log ${record["partials"]&.gsub(/.+ (stdout|stderr) P (.+)\n/, '\\2')}${record["log"]}
    </record>
  </filter>
  <filter kubernetes.**>
    @type parser
    @id filter_docker_container_logs
    key_name "json"
    remove_key_name_field true
    reserve_data true
    emit_invalid_record_to_error false
    <parse>
      @type "json"
      time_format "%Y-%m-%dT%H:%M:%S.%NZ"
      time_type string
    </parse>
  </filter>
  <source>
    @type tail
    @id in_tail_startupscript
    path "/var/log/startupscript.log"
    pos_file "/var/log/-kube-fluentd-operator--fluentd-startupscript.log.pos"
    tag "startupscript"
    <parse>
      @type "syslog"
      unmatched_lines
    </parse>
  </source>
  <source>
    @type tail
    @id in_tail_docker
    path "/var/log/docker.log"
    pos_file "/var/log/-kube-fluentd-operator--fluentd-docker.log.pos"
    tag "docker"
    <parse>
      @type "regexp"
      expression /^time="(?<time>[^)]*)" level=(?<severity>[^ ]*) msg="(?<message>[^"]*)"( err="(?<error>[^"]*)")?( statusCode=($<status_code>\d+))?/
      unmatched_lines
    </parse>
  </source>
  <source>
    @type tail
    @id in_tail_etcd
    path "/var/log/etcd.log"
    pos_file "/var/log/-kube-fluentd-operator--fluentd-etcd.log.pos"
    tag "k8s.etcd"
    <parse>
      @type "none"
      unmatched_lines
    </parse>
  </source>
  <source>
    @type tail
    @id in_tail_kubelet
    multiline_flush_interval 5s
    path "/var/log/kubelet.log"
    pos_file "/var/log/-kube-fluentd-operator--fluentd-kubelet.log.pos"
    tag "k8s.kubelet"
    <parse>
      @type "kubernetes"
      unmatched_lines
      expression /^(?<severity>\w)(?<time>\d{4} [^\s]*)\s+(?<pid>\d+)\s+(?<source>[^ \]]+)\] (?<message>.*)/m
      time_format "%m%d %H:%M:%S.%N"
    </parse>
  </source>
  <source>
    @type tail
    @id in_tail_kube_proxy
    multiline_flush_interval 5s
    path "/var/log/kube-proxy.log"
    pos_file "/var/log/-kube-fluentd-operator--fluentd-kube-proxy.log.pos"
    tag "k8s.kube-proxy"
    <parse>
      @type "kubernetes"
      unmatched_lines
      expression /^(?<severity>\w)(?<time>\d{4} [^\s]*)\s+(?<pid>\d+)\s+(?<source>[^ \]]+)\] (?<message>.*)/m
      time_format "%m%d %H:%M:%S.%N"
    </parse>
  </source>
  <source>
    @type tail
    @id in_tail_kube_apiserver
    multiline_flush_interval 5s
    path "/var/log/kube-apiserver.log"
    pos_file "/var/log/-kube-fluentd-operator--fluentd-kube-apiserver.log.pos"
    tag "k8s.kube-apiserver"
    <parse>
      @type "kubernetes"
      unmatched_lines
      expression /^(?<severity>\w)(?<time>\d{4} [^\s]*)\s+(?<pid>\d+)\s+(?<source>[^ \]]+)\] (?<message>.*)/m
      time_format "%m%d %H:%M:%S.%N"
    </parse>
  </source>
  <source>
    @type tail
    @id in_tail_kube_controller_manager
    multiline_flush_interval 5s
    path "/var/log/kube-controller-manager.log"
    pos_file "/var/log/-kube-fluentd-operator--fluentd-kube-controller-manager.log.pos"
    tag "k8s.kube-controller-manager"
    <parse>
      @type "kubernetes"
      unmatched_lines
      expression /^(?<severity>\w)(?<time>\d{4} [^\s]*)\s+(?<pid>\d+)\s+(?<source>[^ \]]+)\] (?<message>.*)/m
      time_format "%m%d %H:%M:%S.%N"
    </parse>
  </source>
  <source>
    @type tail
    @id in_tail_kube_scheduler
    multiline_flush_interval 5s
    path "/var/log/kube-scheduler.log"
    pos_file "/var/log/-kube-fluentd-operator--fluentd-kube-scheduler.log.pos"
    tag "k8s.kube-scheduler"
    <parse>
      @type "kubernetes"
      unmatched_lines
      expression /^(?<severity>\w)(?<time>\d{4} [^\s]*)\s+(?<pid>\d+)\s+(?<source>[^ \]]+)\] (?<message>.*)/m
      time_format "%m%d %H:%M:%S.%N"
    </parse>
  </source>
  <source>
    @type tail
    @id in_tail_rescheduler
    multiline_flush_interval 5s
    path "/var/log/rescheduler.log"
    pos_file "/var/log/-kube-fluentd-operator--fluentd-rescheduler.log.pos"
    tag "k8s.rescheduler"
    <parse>
      @type "kubernetes"
      unmatched_lines
      expression /^(?<severity>\w)(?<time>\d{4} [^\s]*)\s+(?<pid>\d+)\s+(?<source>[^ \]]+)\] (?<message>.*)/m
      time_format "%m%d %H:%M:%S.%N"
    </parse>
  </source>
  <source>
    @type tail
    @id in_tail_glbc
    multiline_flush_interval 5s
    path "/var/log/glbc.log"
    pos_file "/var/log/-kube-fluentd-operator--fluentd-glbc.log.pos"
    tag "k8s.glbc"
    <parse>
      @type "kubernetes"
      unmatched_lines
      expression /^(?<severity>\w)(?<time>\d{4} [^\s]*)\s+(?<pid>\d+)\s+(?<source>[^ \]]+)\] (?<message>.*)/m
      time_format "%m%d %H:%M:%S.%N"
    </parse>
  </source>
  <source>
    @type tail
    @id in_tail_cluster_autoscaler
    multiline_flush_interval 5s
    path "/var/log/cluster-autoscaler.log"
    pos_file "/var/log/-kube-fluentd-operator--fluentd-cluster-autoscaler.log.pos"
    tag "k8s.cluster-autoscaler"
    <parse>
      @type "kubernetes"
      unmatched_lines
      expression /^(?<severity>\w)(?<time>\d{4} [^\s]*)\s+(?<pid>\d+)\s+(?<source>[^ \]]+)\] (?<message>.*)/m
      time_format "%m%d %H:%M:%S.%N"
    </parse>
  </source>
  <source>
    @type tail
    @id in_tail_kube_apiserver_audit
    multiline_flush_interval 5s
    path "/var/log/kubernetes/kube-apiserver-audit.log"
    pos_file "/var/log/-kube-fluentd-operator--kube-apiserver-audit.log.pos"
    tag "k8s.kube-apiserver-audit"
    <parse>
      @type "multiline"
      format_firstline "/^\\S+\\s+AUDIT:/"
      format1 /^(?<time>\S+) AUDIT:(?: (?:id="(?<id>(?:[^"\\]|\\.)*)"|ip="(?<ip>(?:[^"\\]|\\.)*)"|method="(?<method>(?:[^"\\]|\\.)*)"|user="(?<user>(?:[^"\\]|\\.)*)"|groups="(?<groups>(?:[^"\\]|\\.)*)"|as="(?<as>(?:[^"\\]|\\.)*)"|asgroups="(?<asgroups>(?:[^"\\]|\\.)*)"|namespace="(?<namespace>(?:[^"\\]|\\.)*)"|uri="(?<uri>(?:[^"\\]|\\.)*)"|response="(?<response>(?:[^"\\]|\\.)*)"|\w+="(?:[^"\\]|\\.)*"))*/
      time_format "%Y-%m-%dT%T.%L%Z"
      unmatched_lines
    </parse>
  </source>
  <filter kubernetes.**>
    @type kubernetes_metadata
    @id filter_kube_metadata
  </filter>
  <filter kubernetes.**>
    @type kubernetes_metadata
  </filter>
  <filter kubernetes.**>
    @type record_transformer
    enable_ruby true
    <record>
      kubernetes_namespace_container_name ${record["kubernetes"]["namespace_name"]}.${record["kubernetes"]["pod_name"]}.${record["kubernetes"]["container_name"]}
      container_info ${record["docker"]["container_id"]}-${record["stream"]}
    </record>
  </filter>
  <match kubernetes.**>
    @type rewrite_tag_filter
    <rule>
      key "kubernetes_namespace_container_name"
      pattern ^(.+)$
      tag "kube.$1"
    </rule>
  </match>
  <filter kube.*.*.*>
    @type record_modifier
    remove_keys "dummy_"
    <record>
      dummy_ ${record["kubernetes"]&.delete("master_url"); record["kubernetes"]&.delete("namespace_id"); if record["kubernetes"]&.has_key?("labels"); record["kubernetes"]["labels"].delete("pod-template-generation"); record["kubernetes"]["labels"].delete("controller-revision-hash");  record["kubernetes"]["labels"].delete("pod-template-hash"); end; nil}
    </record>
  </filter>
  <filter kube.*.*.*>
    @type record_transformer
    remove_keys kubernetes_namespace_container_name
  </filter>
  <filter kube.kube-system.**>
    @type parser
    reserve_data true
    key_name "log"
    emit_invalid_record_to_error false
    <parse>
      @type "kubernetes"
      time_format "%m%d %H:%M:%S.%N"
      expression /^(?<severity>\w)(?<time>\d{4} [^\s]*)\s+(?<pid>\d+)\s+(?<source>[^ \]]+)\] (?<message>.*)/m
    </parse>
  </filter>
  <source>
    @type prometheus
  </source>
  <source>
    @type prometheus_monitor
  </source>
  <source>
    @type prometheus_output_monitor
  </source>
  <match systemd.** kube.kube-system.**>
    @type copy
    <store>
      @id X
      @type "vmware_loginsight"
      host "X.X.X.X"
      include_tag_key true
      log_text_keys ["log","msg","message"]
      port 9000
      scheme "http"
      serializer "json"
      ssl_verify false
      tag_key "tag"
    </store>
  </match>
  <match **>
    @type null
  </match>
</ROOT>

Your Error Log

2021-06-08 16:15:48 +0000 [info]: #0 [filter_kube_metadata] stats - namespace_cache_size: 8, pod_cache_size: 272, pod_cache_watch_updates: 691, pod_cache_watch_delete_ignored: 175, namespace_cache_api_updates: 46, pod_cache_api_updates: 46, id_cache_miss: 46, namespace_cache_watch_deletes_ignored: 169, namespace_cache_watch_misses: 342, namespace_cache_watch_ignored: 190, pod_cache_host_updates: 7897, pod_cache_watch_ignored: 169, pod_watch_gone_errors: 27, pod_watch_gone_notices: 27
2021-06-08 16:15:48 +0000 [info]: #0 stats - namespace_cache_size: 8, pod_cache_size: 272, pod_cache_watch_updates: 691, pod_cache_watch_delete_ignored: 175, namespace_cache_api_updates: 46, pod_cache_api_updates: 46, id_cache_miss: 46, namespace_cache_watch_deletes_ignored: 179, namespace_cache_watch_misses: 362, namespace_cache_watch_ignored: 201, pod_cache_host_updates: 7629, pod_cache_watch_ignored: 169, pod_watch_gone_errors: 26, pod_watch_gone_notices: 26
2021-06-08 16:15:49 +0000 [info]: #0 [filter_kube_metadata] stats - namespace_cache_size: 1, pod_cache_size: 272, pod_cache_watch_updates: 17493, pod_cache_watch_delete_ignored: 864, namespace_cache_api_updates: 1, pod_cache_api_updates: 1, id_cache_miss: 1, namespace_cache_watch_deletes_ignored: 652, namespace_cache_watch_misses: 1365, namespace_cache_watch_ignored: 703, pod_cache_host_updates: 33058, pod_cache_watch_ignored: 381, pod_watch_gone_errors: 117, pod_watch_gone_notices: 117
2021-06-08 16:15:49 +0000 [info]: #0 stats - namespace_cache_size: 1, pod_cache_size: 272, pod_cache_watch_updates: 17497, pod_cache_watch_delete_ignored: 865, namespace_cache_api_updates: 1, pod_cache_api_updates: 1, id_cache_miss: 1, namespace_cache_watch_deletes_ignored: 650, namespace_cache_watch_misses: 1362, namespace_cache_watch_ignored: 701, pod_cache_host_updates: 33897, pod_cache_watch_ignored: 381, pod_watch_gone_errors: 120, pod_watch_gone_notices: 120
2021-06-08 16:16:18 +0000 [info]: #0 [filter_kube_metadata] stats - namespace_cache_size: 8, pod_cache_size: 272, pod_cache_watch_updates: 691, pod_cache_watch_delete_ignored: 175, namespace_cache_api_updates: 46, pod_cache_api_updates: 46, id_cache_miss: 46, namespace_cache_watch_deletes_ignored: 169, namespace_cache_watch_misses: 342, namespace_cache_watch_ignored: 190, pod_cache_host_updates: 7897, pod_cache_watch_ignored: 169, pod_watch_gone_errors: 27, pod_watch_gone_notices: 27
2021-06-08 16:16:18 +0000 [info]: #0 stats - namespace_cache_size: 8, pod_cache_size: 272, pod_cache_watch_updates: 691, pod_cache_watch_delete_ignored: 175, namespace_cache_api_updates: 46, pod_cache_api_updates: 46, id_cache_miss: 46, namespace_cache_watch_deletes_ignored: 179, namespace_cache_watch_misses: 362, namespace_cache_watch_ignored: 201, pod_cache_host_updates: 7629, pod_cache_watch_ignored: 169, pod_watch_gone_errors: 26, pod_watch_gone_notices: 26
2021-06-08 16:16:48 +0000 [info]: #0 [filter_kube_metadata] stats - namespace_cache_size: 8, pod_cache_size: 272, pod_cache_watch_updates: 691, pod_cache_watch_delete_ignored: 175, namespace_cache_api_updates: 46, pod_cache_api_updates: 46, id_cache_miss: 46, namespace_cache_watch_deletes_ignored: 169, namespace_cache_watch_misses: 342, namespace_cache_watch_ignored: 190, pod_cache_host_updates: 7897, pod_cache_watch_ignored: 169, pod_watch_gone_errors: 27, pod_watch_gone_notices: 27
2021-06-08 16:16:48 +0000 [info]: #0 stats - namespace_cache_size: 8, pod_cache_size: 272, pod_cache_watch_updates: 691, pod_cache_watch_delete_ignored: 175, namespace_cache_api_updates: 46, pod_cache_api_updates: 46, id_cache_miss: 46, namespace_cache_watch_deletes_ignored: 179, namespace_cache_watch_misses: 362, namespace_cache_watch_ignored: 201, pod_cache_host_updates: 7629, pod_cache_watch_ignored: 169, pod_watch_gone_errors: 26, pod_watch_gone_notices: 26
2021-06-08 16:16:49 +0000 [info]: #0 [in_tail_container_logs] following tail of /var/log/containers/decc-ara-gulel-nimbus-457p2_decc-ara-gulel-nimbus_gulel-container-f8e1921b-1a46304c12b181dd5c628649daf598b22a303fe4a08d4fbd5f6333597c95706b.log
2021-06-08 16:16:49 +0000 [info]: #0 disable filter chain optimization because [Fluent::Plugin::RecordTransformerFilter, Fluent::Plugin::KubernetesMetadataFilter, Fluent::Plugin::KubernetesMetadataFilter, Fluent::Plugin::RecordTransformerFilter] uses `#filter_stream` method.
2021-06-08 16:16:49 +0000 [info]: #0 disable filter chain optimization because [Fluent::Plugin::RecordTransformerFilter] uses `#filter_stream` method.
2021-06-08 16:16:49 +0000 [warn]: #0 /var/log/containers/p4login-1623078060-zpl24_ara-cat-beta_p4login-18ff96509abecc02391857da8e6b820eee62274ab68b6adc6c443e432e9d36d6.log not found. Continuing without tailing it.
2021-06-08 16:16:49 +0000 [warn]: #0 stat() for /var/log/containers/p4login-1623078060-zpl24_ara-cat-beta_p4login-18ff96509abecc02391857da8e6b820eee62274ab68b6adc6c443e432e9d36d6.log failed with ENOENT. Drop tail watcher for now.
2021-06-08 16:16:49 +0000 [info]: #0 [in_tail_container_logs] following tail of /var/log/containers/test-829-sleeper-545888cddc-7hxzs_decc-ara-runway-argocd-stress-test-1_sleeper-0f7b7cb26656a58423dd88671730e13076d1e9130a938eb057b065d214b87bc9.log
2021-06-08 16:16:49 +0000 [info]: #0 disable filter chain optimization because [Fluent::Plugin::RecordTransformerFilter, Fluent::Plugin::KubernetesMetadataFilter, Fluent::Plugin::KubernetesMetadataFilter, Fluent::Plugin::RecordTransformerFilter] uses `#filter_stream` method.
2021-06-08 16:16:49 +0000 [info]: #0 disable filter chain optimization because [Fluent::Plugin::RecordTransformerFilter] uses `#filter_stream` method.
2021-06-08 16:16:49 +0000 [info]: #0 [in_tail_container_logs] following tail of /var/log/containers/csp-billing-mvc-68fb6d8577-p6f5j_ara-csp-rrodolfo_csp-billing-mvc-a46591834fdb5cf3728620465efbd0b063fd3c87f110c6692262ff8df25f5f60.log
2021-06-08 16:16:49 +0000 [info]: #0 disable filter chain optimization because [Fluent::Plugin::RecordTransformerFilter, Fluent::Plugin::KubernetesMetadataFilter, Fluent::Plugin::KubernetesMetadataFilter, Fluent::Plugin::RecordTransformerFilter] uses `#filter_stream` method.
2021-06-08 16:16:49 +0000 [info]: #0 disable filter chain optimization because [Fluent::Plugin::RecordTransformerFilter] uses `#filter_stream` method.
2021-06-08 16:17:18 +0000 [info]: #0 [filter_kube_metadata] stats - namespace_cache_size: 9, pod_cache_size: 272, namespace_cache_api_updates: 49, pod_cache_api_updates: 49, id_cache_miss: 49, pod_cache_watch_updates: 691, pod_cache_watch_delete_ignored: 175, namespace_cache_watch_deletes_ignored: 169, namespace_cache_watch_misses: 342, namespace_cache_watch_ignored: 190, pod_cache_host_updates: 7897, pod_cache_watch_ignored: 169, pod_watch_gone_errors: 27, pod_watch_gone_notices: 27
2021-06-08 16:17:18 +0000 [info]: #0 stats - namespace_cache_size: 9, pod_cache_size: 272, namespace_cache_api_updates: 49, pod_cache_api_updates: 49, id_cache_miss: 49, pod_cache_watch_updates: 691, pod_cache_watch_delete_ignored: 175, namespace_cache_watch_deletes_ignored: 179, namespace_cache_watch_misses: 362, namespace_cache_watch_ignored: 201, pod_cache_host_updates: 7629, pod_cache_watch_ignored: 169, pod_watch_gone_errors: 26, pod_watch_gone_notices: 26
2021-06-08 16:17:28 +0000 [warn]: #0 stat() for /var/log/containers/decc-ara-gulel-nimbus-bszrh_decc-ara-gulel-nimbus_gulel-container-46fc33f9-c5ec018fc6d5c41a6a9114f80e227612cc256878cac2c30a2463b98532d2e5ff.log failed with ENOENT. Drop tail watcher for now.
2021-06-08 16:17:28 +0000 [info]: #0 [in_tail_container_logs] following tail of /var/log/containers/decc-ara-gulel-nimbus-9gtf6_decc-ara-gulel-nimbus_gulel-container-c235795f-dbcc61b373eb557565f5439cf6b6d0074557b4bb1ba4eba69f9dba125bdb229e.log
2021-06-08 16:17:28 +0000 [info]: #0 disable filter chain optimization because [Fluent::Plugin::RecordTransformerFilter, Fluent::Plugin::KubernetesMetadataFilter, Fluent::Plugin::KubernetesMetadataFilter, Fluent::Plugin::RecordTransformerFilter] uses `#filter_stream` method.
2021-06-08 16:17:28 +0000 [info]: #0 disable filter chain optimization because [Fluent::Plugin::RecordTransformerFilter] uses `#filter_stream` method.
2021-06-08 16:17:28 +0000 [info]: #0 [in_tail_container_logs] following tail of /var/log/containers/decc-ara-gulel-nimbus-vw8ht_decc-ara-gulel-nimbus_gulel-container-bf44a93d-1bad109b217083a49e3ca74a8c8a504784e79d3a5125dfed64b6d7ab7d247da4.log
2021-06-08 16:17:28 +0000 [info]: #0 disable filter chain optimization because [Fluent::Plugin::RecordTransformerFilter, Fluent::Plugin::KubernetesMetadataFilter, Fluent::Plugin::KubernetesMetadataFilter, Fluent::Plugin::RecordTransformerFilter] uses `#filter_stream` method.
2021-06-08 16:17:29 +0000 [info]: #0 disable filter chain optimization because [Fluent::Plugin::RecordTransformerFilter] uses `#filter_stream` method.
2021-06-08 16:17:29 +0000 [warn]: #0 /var/log/containers/decc-ara-gulel-nimbus-xc2pf_decc-ara-gulel-nimbus_gulel-container-de82d83d-5475370a2b4d9fcc82862d727f3f2fa413ca669fa2079c617dc0a99a4092a6ba.log not found. Continuing without tailing it.
2021-06-08 16:17:29 +0000 [warn]: #0 stat() for /var/log/containers/decc-ara-gulel-nimbus-xc2pf_decc-ara-gulel-nimbus_gulel-container-de82d83d-5475370a2b4d9fcc82862d727f3f2fa413ca669fa2079c617dc0a99a4092a6ba.log failed with ENOENT. Drop tail watcher for now.
2021-06-08 16:17:29 +0000 [info]: #0 [in_tail_container_logs] following tail of /var/log/containers/decc-ara-gulel-nimbus-355ba802-tqvrk_decc-ara-gulel-worker_gulel-container-355ba802-ae54c881e04c6fc7d3b08283c581b56c0baa1cbd01de10351f35cd73ea68c7b8.log
2021-06-08 16:17:29 +0000 [info]: #0 disable filter chain optimization because [Fluent::Plugin::RecordTransformerFilter, Fluent::Plugin::KubernetesMetadataFilter, Fluent::Plugin::KubernetesMetadataFilter, Fluent::Plugin::RecordTransformerFilter] uses `#filter_stream` method.
2021-06-08 16:17:29 +0000 [info]: #0 disable filter chain optimization because [Fluent::Plugin::RecordTransformerFilter] uses `#filter_stream` method.
2021-06-08 16:17:30 +0000 [info]: fluent/log.rb:329:info: Worker 0 finished unexpectedly with signal SIGKILL
ashie (Member) commented Jun 9, 2021

Probably the same as #3341.

alex-vmw (Author) commented Jun 9, 2021

> Probably the same as #3341.

Hi @ashie, I think this issue is different because I specifically checked the logs to make sure there was no gracefulReload requested (or any other restarts, etc.). I can provide the full logs if you'd like.

ashie (Member) commented Jun 9, 2021

Don't worry, the comment was only meant to add a reference to a similar issue.
Sorry for the terse explanation.

ashie (Member) commented Jun 21, 2021

I checked the implementation, but fluentd itself doesn't send a KILL signal to workers, as I mentioned at #3341 (comment).
So the KILL signal is sent by something else, most likely the OOM killer.
Please check the memory usage of your fluentd worker process and the OOM killer's log (typically in /var/log/messages).
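A minimal sketch of those two checks (my addition, not from the thread), assuming the workers can be found by their --under-supervisor argument and that the kernel log lives at /var/log/messages or /var/log/syslog:

# Print the resident set size of each fluentd worker and scan the kernel log for
# OOM-killer activity. The pgrep pattern and log paths are assumptions.
worker_pids = `pgrep -f 'fluentd.*--under-supervisor'`.split.map(&:to_i)

worker_pids.each do |pid|
  rss_kb = File.read("/proc/#{pid}/status")[/^VmRSS:\s+(\d+)\s+kB/, 1]
  puts "fluentd worker #{pid}: VmRSS = #{rss_kb} kB"
end

kernel_log = ["/var/log/messages", "/var/log/syslog"].find { |f| File.readable?(f) }
if kernel_log
  File.foreach(kernel_log) { |line| puts line if line =~ /oom-killer|out of memory/i }
else
  puts "no readable kernel log found here; try `dmesg | grep -i oom` on the node"
end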

I think #3341 is also caused by the OOM killer, because gracefulReload consumes more memory.
On gracefulReload, fluentd tries to create a new Fluent::Engine for the new config without replacing the process. The old Fluent::Engine instance keeps running during this, so the process holds two engines while the gracefulReload is in progress. (When multiple gracefulReloads are attempted in a short time, it might hold three or more engines, though I haven't confirmed that yet.)

A known method to reduce fluentd's memory usage is to use jemalloc (3.x is better).
See #1657 for more detail.
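For illustration (my addition): jemalloc is typically enabled for fluentd by preloading the shared library, for example via LD_PRELOAD, as the official images do. A small sketch to confirm it is actually mapped into the running process; the library path in the comment is an assumption and varies by base image.

# Confirm that a jemalloc shared object is mapped into the current process. Run it
# the same way fluentd is launched, e.g.
#   LD_PRELOAD=/usr/lib/libjemalloc.so ruby check_jemalloc.rb
# (the library path is an assumption and depends on your base image).
jemalloc_maps = File.read("/proc/self/maps").lines.grep(/jemalloc/)

if jemalloc_maps.empty?
  warn "jemalloc does not appear to be preloaded (no libjemalloc mapping found)"
else
  puts "jemalloc is loaded from #{jemalloc_maps.first.split.last}"
end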

cosmo0920 (Contributor) commented Jun 22, 2021

> Right before the SIGKILL, there were messages about tailing logs that no longer exist because k8s pods were deleted, so maybe that has something to do with it.

I'm not familiar with the details of your environment, but are you running Fluentd via something like a Kubernetes Fluentd operator, with CPU and memory resource limits set?
As ashie says, gracefulReload consumes several times the usual memory because of the extra Fluent::RootAgent instances.

ashie (Member) commented Jun 22, 2021

BTW, one detail in my explanation above was wrong:

- Fluent::Engine
+ Fluent::RootAgent

def reload_config(conf, supervisor: false)
  # configure first to reduce down time while restarting
  new_agent = RootAgent.new(log: log, system_config: @system_config)
  ret = Fluent::StaticConfigAnalysis.call(conf, workers: system_config.workers)
  ret.all_plugins.each do |plugin|
    if plugin.respond_to?(:reloadable_plugin?) && !plugin.reloadable_plugin?
      raise Fluent::ConfigError, "Unreloadable plugin plugin: #{Fluent::Plugin.lookup_type_from_class(plugin.class)}, plugin_id: #{plugin.plugin_id}, class_name: #{plugin.class})"
    end
  end

  # Assign @root_agent to new root_agent
  # for https://github.com/fluent/fluentd/blob/fcef949ce40472547fde295ddd2cfe297e1eddd6/lib/fluent/plugin_helper/event_emitter.rb#L50
  old_agent, @root_agent = @root_agent, new_agent
  begin
    @root_agent.configure(conf)
  rescue
    @root_agent = old_agent
    raise
  end

  unless @suppress_config_dump
    $log.info :supervisor, "using configuration file: #{conf.to_s.rstrip}"
  end

  # supervisor doesn't handle actual data. so the following code is unnecessary.
  if supervisor
    old_agent.shutdown # to close thread created in #configure
    return
  end

  stop_phase(old_agent)
  $log.info 'restart fluentd worker', worker: worker_id
  start_phase(new_agent)
end

alex-vmw (Author) commented

> A known method to reduce fluentd's memory usage is to use jemalloc (3.x is better).

@ashie and @cosmo0920, thanks for all the details. I have now implemented jemalloc 3.6.0 and will monitor the memory utilization of the fluentd pods to make sure they are not killed by the Kubernetes OOM killer.
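One way to watch for exactly that outcome is to poll the pods' container statuses for a last termination reason of OOMKilled; a rough sketch (not from the original comment), assuming a hypothetical namespace and label selector:

require "json"

# Poll fluentd pods and report containers whose last termination reason was OOMKilled.
# The namespace and label selector are assumptions; adjust them to your deployment.
loop do
  pods = JSON.parse(`kubectl get pods -n kube-system -l app=fluentd -o json`)
  pods.fetch("items", []).each do |pod|
    Array(pod.dig("status", "containerStatuses")).each do |cs|
      reason = cs.dig("lastState", "terminated", "reason")
      next unless reason
      puts "#{pod.dig("metadata", "name")}/#{cs["name"]}: restarts=#{cs["restartCount"]}, lastTermination=#{reason}"
    end
  end
  sleep 60
end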

cosmo0920 (Contributor) commented

> A known method to reduce fluentd's memory usage is to use jemalloc (3.x is better).
>
> @ashie and @cosmo0920, thanks for all the details. I have now implemented jemalloc 3.6.0 and will monitor the memory utilization of the fluentd pods to make sure they are not killed by the Kubernetes OOM killer.

Thanks for the information. For a high-load environment, jemalloc 3.6.0 is better for memory consumption than jemalloc 5.2.1 or plain Linux malloc, right?

ashie (Member) commented Jul 16, 2021

As I mentioned before, we think this is triggered by the OOM killer; the cause is high memory usage.
We already have several issues about high memory usage and memory leaks, e.g. #1657.
So I'm closing this issue.
