
Fluentd in_tail pos_file problem causing log stop sending #3586

Closed
jasminehung opened this issue Dec 28, 2021 · 1 comment
Labels: bug

jasminehung commented Dec 28, 2021

Describe the bug

My Fluentd daemonset handles hundreds of containers per node. For a few specific containers (always the same ones, referred to as myapp in this post), logs randomly stop being sent to Elasticsearch every few days.
Sometimes the logs resume on their own after a few hours, but sometimes we have to restart the myapp pods or the Fluentd pods to get shipping working again.

To Reproduce

Run a k8s pod myapp that stably writes 1,000,000 events in JSON format per hour (using the CRI container runtime as the logging driver), without generating bursts of log.

Expected behavior

Fluentd is expected to ship logs continuously and stably.

Your Environment

- docker image: fluentd-kubernetes-daemonset:v1.14-debian-elasticsearch7-1 (Fluentd version: 1.14.3)

Your Configuration

<source>
  @type tail
  path /var/log/containers/*.log
  pos_file /var/log/fluentd-docker-log.pos
  time_format %Y-%m-%dT%H:%M:%S
  tag kubernetes.*
  skip_refresh_on_startup true
  read_from_head true
  <parse>
    @type cri
  </parse>
</source>
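
(For context, the cri parser consumes container runtime log lines in the CRI format: an RFC3339Nano timestamp, the stream name, a partial/full flag, and the message. A representative myapp line would look roughly like the one below; the JSON payload is illustrative only.)

2021-12-27T07:37:52.123456789Z stdout F {"level":"info","message":"handled event"}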

Your Error Log

There is no error log when this happens, but there are some info logs showing unusual behavior:
2021-12-27 07:37:52 +0000 [info]: #0 detected rotation of /var/log/containers/myapp-765f4884fb-qfzhw_myapp-07e1fdd304cd14fa49ad23e034ef806d16c9b5b08c84fa4f269594fc176464a8.log; waiting 5 seconds
2021-12-27 07:37:52 +0000 [info]: #0 following tail of /var/log/containers/myapp-765f4884fb-qfzhw_myapp-07e1fdd304cd14fa49ad23e034ef806d16c9b5b08c84fa4f269594fc176464a8.log
---- At this moment, myapp logs stop being sent to Elasticsearch (no more "following tail" info logs for /var/log/containers/myapp) ----
2021-12-27 07:38:43 +0000 [info]: #0 detected rotation of /var/log/containers/simm-natseventhandler-p2-765f4884fb-qfzhw_simm-a_simm-natseventhandler-p2-07e1fdd304cd14fa49ad23e034ef806d16c9b5b08c84fa4f269594fc176464a8.log; waiting 5 seconds
2021-12-27 07:39:47 +0000 [info]: #0 detected rotation of /var/log/containers/simm-natseventhandler-p2-765f4884fb-qfzhw_simm-a_simm-natseventhandler-p2-07e1fdd304cd14fa49ad23e034ef806d16c9b5b08c84fa4f269594fc176464a8.log; waiting 5 seconds
2021-12-27 07:40:51 +0000 [info]: #0 detected rotation of /var/log/containers/simm-natseventhandler-p2-765f4884fb-qfzhw_simm-a_simm-natseventhandler-p2-07e1fdd304cd14fa49ad23e034ef806d16c9b5b08c84fa4f269594fc176464a8.log; waiting 5 seconds


In the pos file, we found that the records for myapp are strange:
1. There are duplicate rows tracking the myapp container (will Fluentd use the last line or the first? Both? Neither?). For the other hundreds of normal pods there is only one row per container.
2. Records with position "ffffffffffffffff 0000000000000000" keep appearing (see the annotated entry after the excerpt below). Does this mean pos_file is no longer able to track the container?

Results of querying "myapp" from the pos file: 
Line 241: /var/log/containers/myapp-765f4884fb-qfzhw_myapp-07e1fdd304cd14fa49ad23e034ef806d16c9b5b08c84fa4f269594fc176464a8.log 0000000000bb1725 00000000000a0142
Line 260: /var/log/containers/myapp-765f4884fb-qfzhw_myapp-07e1fdd304cd14fa49ad23e034ef806d16c9b5b08c84fa4f269594fc176464a8.log ffffffffffffffff 0000000000000000
Line 292: /var/log/containers/myapp-765f4884fb-qfzhw_myapp-07e1fdd304cd14fa49ad23e034ef806d16c9b5b08c84fa4f269594fc176464a8.log 0000000000bf8c99 00000000000a0027
Line 302: /var/log/containers/myapp-765f4884fb-qfzhw_myapp-07e1fdd304cd14fa49ad23e034ef806d16c9b5b08c84fa4f269594fc176464a8.log ffffffffffffffff 0000000000000000
Line 561: /var/log/containers/myapp-765f4884fb-qfzhw_myapp-07e1fdd304cd14fa49ad23e034ef806d16c9b5b08c84fa4f269594fc176464a8.log 0000000000afaeea 00000000000a001c
Line 785: /var/log/containers/myapp-765f4884fb-qfzhw_myapp-07e1fdd304cd14fa49ad23e034ef806d16c9b5b08c84fa4f269594fc176464a8.log ffffffffffffffff 0000000000000000
Line 804: /var/log/containers/myapp-765f4884fb-qfzhw_myapp-07e1fdd304cd14fa49ad23e034ef806d16c9b5b08c84fa4f269594fc176464a8.log 000000000016e1a8 00000000000a0019
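
Our reading of the pos file layout (based on how the in_tail plugin appears to store its state; treat this as an assumption): each entry is <path> <read offset, 16-digit hex> <inode, 16-digit hex>, and an offset of ffffffffffffffff with inode 0000000000000000 seems to be the marker in_tail writes when it unwatches a file. Annotated, the entry on line 241 above would read:

<path to myapp container log>  0000000000bb1725 (read offset, hex)  00000000000a0142 (inode, hex)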

Additional context

Adding pos_file_compaction_interval and setting it to 3m seems to fix the missing-log problem, but it still does not explain the root cause. A sketch of the workaround is shown below.
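
The workaround applied to the source block above (a sketch; the rest of the configuration is unchanged, and the 3m interval is simply the value we chose):

<source>
  @type tail
  path /var/log/containers/*.log
  pos_file /var/log/fluentd-docker-log.pos
  time_format %Y-%m-%dT%H:%M:%S
  tag kubernetes.*
  skip_refresh_on_startup true
  read_from_head true
  # periodically compact the pos file, dropping unwatched/duplicated entries
  pos_file_compaction_interval 3m
  <parse>
    @type cri
  </parse>
</source>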

ashie commented Jul 8, 2022

I think it has the same cause as #3661, so I'm closing this issue and will continue investigating it in #3661.

ashie closed this as completed on Jul 8, 2022