
Fluentd in_tail pos_file problem causing log stop sending #3586

Closed
jasminehung opened this issue Dec 28, 2021 · 1 comment
Labels: bug

jasminehung commented Dec 28, 2021

Describe the bug

My Fluentd daemonset handles hundreds of containers per node. For a few specific containers (always the same ones, referred to as myapp in this post), logs randomly stop being sent to Elasticsearch every few days.
Sometimes the logs resume on their own after a few hours, but sometimes we have to restart the myapp pods or the Fluentd pods to get shipping working again.

To Reproduce

Run a k8s pod myapp that stably writes 1,000,000 events in JSON format per hour (using the CRI container runtime as the logging driver), without generating bursts of log.

Expected behavior

Fluentd is expected to ship logs continuously and stably.

Your Environment

- docker image: fluentd-kubernetes-daemonset:v1.14-debian-elasticsearch7-1 (Fluentd version: 1.14.3)

Your Configuration

<source>
  @type tail
  path /var/log/containers/*.log
  pos_file /var/log/fluentd-docker-log.pos
  time_format %Y-%m-%dT%H:%M:%S
  tag kubernetes.*
  skip_refresh_on_startup true
  read_from_head true
  <parse>
    @type cri
  </parse>
</source>
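
(For context, the cri parser consumes container runtime log lines in the CRI format: an RFC3339Nano timestamp, the stream name, a partial/full flag, and the message. A representative myapp line would look roughly like the one below; the JSON payload is illustrative only.)

2021-12-27T07:37:52.123456789Z stdout F {"level":"info","message":"handled event"}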

Your Error Log

There is no error log when this happens, but there are some info logs showing unusual behavior:
2021-12-27 07:37:52 +0000 [info]: #0 detected rotation of /var/log/containers/myapp-765f4884fb-qfzhw_myapp-07e1fdd304cd14fa49ad23e034ef806d16c9b5b08c84fa4f269594fc176464a8.log; waiting 5 seconds
2021-12-27 07:37:52 +0000 [info]: #0 following tail of /var/log/containers/myapp-765f4884fb-qfzhw_myapp-07e1fdd304cd14fa49ad23e034ef806d16c9b5b08c84fa4f269594fc176464a8.log
---- At this moment, myapp logs stop being sent to Elasticsearch (no more "following tail" info logs for /var/log/containers/myapp) ----
2021-12-27 07:38:43 +0000 [info]: #0 detected rotation of /var/log/containers/simm-natseventhandler-p2-765f4884fb-qfzhw_simm-a_simm-natseventhandler-p2-07e1fdd304cd14fa49ad23e034ef806d16c9b5b08c84fa4f269594fc176464a8.log; waiting 5 seconds
2021-12-27 07:39:47 +0000 [info]: #0 detected rotation of /var/log/containers/simm-natseventhandler-p2-765f4884fb-qfzhw_simm-a_simm-natseventhandler-p2-07e1fdd304cd14fa49ad23e034ef806d16c9b5b08c84fa4f269594fc176464a8.log; waiting 5 seconds
2021-12-27 07:40:51 +0000 [info]: #0 detected rotation of /var/log/containers/simm-natseventhandler-p2-765f4884fb-qfzhw_simm-a_simm-natseventhandler-p2-07e1fdd304cd14fa49ad23e034ef806d16c9b5b08c84fa4f269594fc176464a8.log; waiting 5 seconds


In the pos file, we found that the records for myapp are strange:
1. There are duplicate rows tracking the myapp container (will Fluentd use the last line or the first? Both? Neither?). For the other hundreds of normal pods there is only one row per container.
2. Records with position "ffffffffffffffff 0000000000000000" keep appearing (see the annotated entry after the excerpt below). Does this mean pos_file is no longer able to track the container?

Results of querying "myapp" from the pos file: 
Line 241: /var/log/containers/myapp-765f4884fb-qfzhw_myapp-07e1fdd304cd14fa49ad23e034ef806d16c9b5b08c84fa4f269594fc176464a8.log 0000000000bb1725 00000000000a0142
Line 260: /var/log/containers/myapp-765f4884fb-qfzhw_myapp-07e1fdd304cd14fa49ad23e034ef806d16c9b5b08c84fa4f269594fc176464a8.log ffffffffffffffff 0000000000000000
Line 292: /var/log/containers/myapp-765f4884fb-qfzhw_myapp-07e1fdd304cd14fa49ad23e034ef806d16c9b5b08c84fa4f269594fc176464a8.log 0000000000bf8c99 00000000000a0027
Line 302: /var/log/containers/myapp-765f4884fb-qfzhw_myapp-07e1fdd304cd14fa49ad23e034ef806d16c9b5b08c84fa4f269594fc176464a8.log ffffffffffffffff 0000000000000000
Line 561: /var/log/containers/myapp-765f4884fb-qfzhw_myapp-07e1fdd304cd14fa49ad23e034ef806d16c9b5b08c84fa4f269594fc176464a8.log 0000000000afaeea 00000000000a001c
Line 785: /var/log/containers/myapp-765f4884fb-qfzhw_myapp-07e1fdd304cd14fa49ad23e034ef806d16c9b5b08c84fa4f269594fc176464a8.log ffffffffffffffff 0000000000000000
Line 804: /var/log/containers/myapp-765f4884fb-qfzhw_myapp-07e1fdd304cd14fa49ad23e034ef806d16c9b5b08c84fa4f269594fc176464a8.log 000000000016e1a8 00000000000a0019
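
Our reading of the pos file layout (based on how the in_tail plugin appears to store its state; treat this as an assumption): each entry is <path> <read offset, 16-digit hex> <inode, 16-digit hex>, and an offset of ffffffffffffffff with inode 0000000000000000 seems to be the marker in_tail writes when it unwatches a file. Annotated, the entry on line 241 above would read:

<path to myapp container log>  0000000000bb1725 (read offset, hex)  00000000000a0142 (inode, hex)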

Additional context

Adding pos_file_compaction_interval and setting it to 3m seems to fix the missing-log problem, but it still does not explain the root cause. A sketch of the workaround is shown below.
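
The workaround applied to the source block above (a sketch; the rest of the configuration is unchanged, and the 3m interval is simply the value we chose):

<source>
  @type tail
  path /var/log/containers/*.log
  pos_file /var/log/fluentd-docker-log.pos
  time_format %Y-%m-%dT%H:%M:%S
  tag kubernetes.*
  skip_refresh_on_startup true
  read_from_head true
  # periodically compact the pos file, dropping unwatched/duplicated entries
  pos_file_compaction_interval 3m
  <parse>
    @type cri
  </parse>
</source>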

ashie commented Jul 8, 2022

I think it has the same cause as #3661, so I'm closing this issue and will continue investigating it in #3661.

ashie closed this as completed on Jul 8, 2022