Describe the bug
We're using fluentd in k8s to tail all container logs and send them to a MongoDB database. Generally, it works as expected.
After running for a long time, fluentd randomly stops tailing a file, so we lost important log data for more than a day. Other fluentd instances (on other nodes) worked as expected, and the affected node worked as expected for everything except that one tailed file. This is not the first time it has happened (we last noticed it on May 6th).
The tailed file is the log file of a pod that produces a huge amount of logs. Before the issue, the pod had been running for 3-4 days straight without any logging problems. The last log that came through from that file was at "2022-06-25T19:17:35.751Z". Before that moment, every rotation of the file produced a pair of messages: detected rotation + following tail. After it, every rotation logged only: detected rotation.
Before (the last such pair; every earlier rotation looked the same):
2022-06-25T19:13:05.741980613Z stdout F 2022-06-25 19:13:05 +0000 [info]: #0 [in_tail_container_logs] detected rotation of /var/log/containers/api-6d756f4555-jt7zz_default_api-c7e8458c243f365986db0861cb3f73de09d1d4b61a045f52276465c6c3559290.log; waiting 5 seconds
2022-06-25T19:13:05.743587912Z stdout F 2022-06-25 19:13:05 +0000 [info]: #0 [in_tail_container_logs] following tail of /var/log/containers/api-6d756f4555-jt7zz_default_api-c7e8458c243f365986db0861cb3f73de09d1d4b61a045f52276465c6c3559290.log
After (the first occurrence; every later rotation looked the same):
2022-06-25T19:17:36.744586203Z stdout F 2022-06-25 19:17:36 +0000 [info]: #0 [in_tail_container_logs] detected rotation of /var/log/containers/api-6d756f4555-jt7zz_default_api-c7e8458c243f365986db0861cb3f73de09d1d4b61a045f52276465c6c3559290.log; waiting 5 seconds
Note that the file at /var/log/containers/api-6d756f4555-jt7zz_default_api-c7e8458c243f365986db0861cb3f73de09d1d4b61a045f52276465c6c3559290.log looked healthy and was constantly receiving logs (it was rotated about once every 12 minutes). We mitigated by restarting that api pod - that changed the pod name and therefore the log file name, and fluentd followed the new file fine afterwards. The problem lasted from that point in time until the mitigation, and only for that file. There was no need to restart fluentd.
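For reference, the "waiting 5 seconds" in those messages comes from in_tail's rotate_wait parameter, which we leave at its default. A minimal sketch of the same source with it set explicitly (illustration only; the full config is under "Your Configuration" below):

<source>
  @type tail
  path /var/log/containers/*.log
  pos_file /var/log/fluentd.log.pos
  tag kubernetes.*
  # rotate_wait: how long in_tail keeps reading the rotated-out file after
  # "detected rotation"; the default is 5 seconds, hence "waiting 5 seconds".
  rotate_wait 5
  <parse>
    @type cri
  </parse>
</source>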
To Reproduce
Can't repro.
Expected behavior
Not to stop tailing.
Your Configuration
Cannot repro, so I'm putting the whole config in (the db config is replaced at deploy time).
# Inputs from container logs
<source>
  @type tail
  @id in_tail_container_logs
  path /var/log/containers/*.log
  pos_file /var/log/fluentd.log.pos
  read_from_head true
  tag kubernetes.*
  <parse>
    @type cri
  </parse>
</source>

# Merge logs split into multiple lines
<filter kubernetes.**>
  @type concat
  key message
  use_partial_cri_logtag true
  partial_cri_logtag_key logtag
  partial_cri_stream_key stream
  separator ""
</filter>

# Enriches records with Kubernetes metadata
<filter kubernetes.**>
  @type kubernetes_metadata
</filter>

# Prettify kubernetes metadata
<filter kubernetes.**>
  @type record_transformer
  enable_ruby true
  <record>
    nodeName ${record.dig("kubernetes", "host")}
    namespaceName ${record.dig("kubernetes", "namespace_name")}
    podName ${record.dig("kubernetes", "pod_name")}
    containerName ${record.dig("kubernetes", "container_name")}
    containerImage ${record.dig("kubernetes", "container_image")}
  </record>
  remove_keys docker,kubernetes
</filter>

# Expands inner json
<filter kubernetes.**>
  @type parser
  format json
  key_name message
  reserve_data true
  remove_key_name_field true
  emit_invalid_record_to_error false
  time_format %Y-%m-%dT%H:%M:%S.%NZ
  time_key time
  keep_time_key true
</filter>

# Mongodb keys should not have dollar or a dot inside
<filter kubernetes.**>
  @type rename_key
  replace_rule1 \$ [dollar]
</filter>
<filter kubernetes.**>
  @type rename_key
  replace_rule1 \. [dot]
</filter>

# Outputs to log db
<match kubernetes.**>
  @type mongo
  connection_string "#{ENV['MONGO_ANALYTICS_DB_HOST']}"
  collection logs
  <buffer>
    flush_thread_count 8
    flush_interval 1s
  </buffer>
</match>
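Beyond the config above, a hedged sketch of in_tail settings that are sometimes suggested for wildcard paths with frequent rotation (follow_inodes needs fluentd v1.12+; we have not verified that any of these prevent the stall, they are assumptions to experiment with):

<source>
  @type tail
  @id in_tail_container_logs
  path /var/log/containers/*.log
  pos_file /var/log/fluentd.log.pos
  read_from_head true
  tag kubernetes.*
  # Track files by inode so a rotated-in file is distinguished from the old one.
  follow_inodes true
  # Periodically clean stale entries out of the pos file.
  pos_file_compaction_interval 72h
  # Disable the inotify-based watcher and rely on timer-based polling,
  # in case inotify events are being missed after rotation.
  enable_stat_watcher false
  <parse>
    @type cri
  </parse>
</source>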
Your Error Log
Not sure what you mean by "all" - we have a ton of logs, most of them lost. The best logs I could find are quoted in the 'Describe the bug' section above; ask if you need anything specific.
Additional context
Running the docker image fluent/fluentd:v1.13-1 on Kubernetes 1.21 on DigitalOcean, deployed as a DaemonSet.
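To gather more evidence the next time this happens, one option (an assumption on our side, not something we run today) is to raise the log level of just the tail input so file watching and rotation handling may be logged in more detail:

<source>
  @type tail
  @id in_tail_container_logs
  # @log_level is the standard per-plugin override of fluentd's global log level.
  @log_level debug
  path /var/log/containers/*.log
  pos_file /var/log/fluentd.log.pos
  read_from_head true
  tag kubernetes.*
  <parse>
    @type cri
  </parse>
</source>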
Ping - any advice on this one? It happens unexpectedly from time to time and is very painful: it costs us a lot of logs and messes up our analytics...