Fluentd in_tail "unreadable" file causes "following tail of <file>" to stop and no logs pushed #3614

salavessa · 2022-02-01T17:16:01Z

Describe the bug

After a warning of an "unreadable" (likely due to rotation), no more logs were pushed (in_tail + pos_file).
Reloading config or restarting fluentd sorts the issue.
All other existing files being tracked continued to work as expected.

To Reproduce

Not able to reproduce at will.

Expected behavior

Logs to be pushed as usual after file rotation as fluentd recovers from the temporary "unreadable" file.

Your Environment

- Fluentd version: 1.14.4
- TD Agent version: 4.2.0
- Operating system: Ubuntu 20.04.3 LTS
- Kernel version: 5.11.0-1022-aws

Your Configuration

<source>
  @type tail
  path /var/log/containers/*.log
  pos_file /var/log/es-containers.log.pos
  tag kubernetes.*
  <parse>
    @type regexp
    expression /^(?<time>[^ ]+) (?<stream>[^ ]+) (?<logtag>[^ ]+) (?<log>.+)$/
    time_key time
    time_format %Y-%m-%dT%H:%M:%S.%N%z
  </parse>
  read_from_head true
</source>

Your Error Log

Relevant log entries with some context. When "detected rotation of ..." isn't followed by a "following tail of ..." then log file contents aren't being processed/pushed:

2022-02-01 01:26:33 +0000 [info]: #0 detected rotation of /var/log/containers/gateway-mtcw6_atom_pcms-gateway-e54ea9ee115f1127682b0d92e37404e4b1693c14edeea93e485f70b3110eecfc.log; waiting 5 seconds
2022-02-01 01:26:33 +0000 [info]: #0 following tail of /var/log/containers/gateway-mtcw6_atom_pcms-gateway-e54ea9ee115f1127682b0d92e37404e4b1693c14edeea93e485f70b3110eecfc.log
2022-02-01 01:32:53 +0000 [warn]: #0 /var/log/containers/gateway-mtcw6_atom_pcms-gateway-e54ea9ee115f1127682b0d92e37404e4b1693c14edeea93e485f70b3110eecfc.log unreadable. It is excluded and would be examined next time.
2022-02-01 01:32:54 +0000 [info]: #0 detected rotation of /var/log/containers/gateway-mtcw6_atom_pcms-gateway-e54ea9ee115f1127682b0d92e37404e4b1693c14edeea93e485f70b3110eecfc.log; waiting 5 seconds
2022-02-01 01:32:54 +0000 [info]: #0 following tail of /var/log/containers/gateway-mtcw6_atom_pcms-gateway-e54ea9ee115f1127682b0d92e37404e4b1693c14edeea93e485f70b3110eecfc.log
2022-02-01 01:38:04 +0000 [info]: #0 detected rotation of /var/log/containers/gateway-mtcw6_atom_pcms-gateway-e54ea9ee115f1127682b0d92e37404e4b1693c14edeea93e485f70b3110eecfc.log; waiting 5 seconds
2022-02-01 01:44:44 +0000 [info]: #0 detected rotation of /var/log/containers/gateway-mtcw6_atom_pcms-gateway-e54ea9ee115f1127682b0d92e37404e4b1693c14edeea93e485f70b3110eecfc.log; waiting 5 seconds
2022-02-01 01:53:15 +0000 [info]: #0 detected rotation of /var/log/containers/gateway-mtcw6_atom_pcms-gateway-e54ea9ee115f1127682b0d92e37404e4b1693c14edeea93e485f70b3110eecfc.log; waiting 5 seconds
[...]
---- after issuing a config reload (kill -SIGUSR2 <pid>) it starts to work fine again, i.e. "following tail of ..." 
2022-02-01 11:36:19 +0000 [info]: Reloading new config
2022-02-01 11:36:19 +0000 [info]: using configuration file: <ROOT>
[...]
2022-02-01 11:36:19 +0000 [info]: shutting down input plugin type=:tail plugin_id="object:c6c0"
[...]
2022-02-01 11:36:19 +0000 [info]: adding source type="tail"
2022-02-01 11:36:19 +0000 [info]: #0 shutting down fluentd worker worker=0
2022-02-01 11:36:19 +0000 [info]: #0 shutting down input plugin type=:tail plugin_id="object:c6c0"
[...]
2022-02-01 11:36:20 +0000 [info]: #0 restart fluentd worker worker=0
---- the entry below seems to be related with the actual underlying issue... the ruby object which stopped pushing logs has now been terminated as a new one was created
2022-02-01 11:36:20 +0000 [warn]: #0 /var/log/containers/gateway-mtcw6_atom_pcms-gateway-e54ea9ee115f1127682b0d92e37404e4b1693c14edeea93e485f70b3110eecfc.log already exists. use latest one: deleted #<struct Fluent::Plugin::TailInput::Entry path="/var/log/containers/gateway-mtcw6_atom_pcms-gateway-e54ea9ee115f1127682b0d92e37404e4b1693c14edeea93e485f70b3110eecfc.log", pos=10740032, ino=1797715, seek=1530>
2022-02-01 11:36:20 +0000 [info]: #0 following tail of /var/log/containers/gateway-mtcw6_atom_pcms-gateway-e54ea9ee115f1127682b0d92e37404e4b1693c14edeea93e485f70b3110eecfc.log
2022-02-01 11:37:30 +0000 [info]: #0 detected rotation of /var/log/containers/gateway-mtcw6_atom_pcms-gateway-e54ea9ee115f1127682b0d92e37404e4b1693c14edeea93e485f70b3110eecfc.log; waiting 5 seconds
2022-02-01 11:37:30 +0000 [info]: #0 following tail of /var/log/containers/gateway-mtcw6_atom_pcms-gateway-e54ea9ee115f1127682b0d92e37404e4b1693c14edeea93e485f70b3110eecfc.log
2022-02-01 11:43:20 +0000 [info]: #0 detected rotation of /var/log/containers/gateway-mtcw6_atom_pcms-gateway-e54ea9ee115f1127682b0d92e37404e4b1693c14edeea93e485f70b3110eecfc.log; waiting 5 seconds
2022-02-01 11:43:20 +0000 [info]: #0 following tail of /var/log/containers/gateway-mtcw6_atom_pcms-gateway-e54ea9ee115f1127682b0d92e37404e4b1693c14edeea93e485f70b3110eecfc.log

Additional context

This issue seems be related with #3586 but unfortunately I didn't check the pos file while the issue was happening so can't tell if it presented unexpected values for the failing file.

The text was updated successfully, but these errors were encountered:

mizunashi-mana · 2022-02-14T02:30:29Z

We have encountered the same problem twice. Last month we upgraded fluentd from 1.13.3 (td-agent 4.2.0) to 1.14.3 (td-agent 4.3.0), and then we have got the problem.

In our case, we are tracking more than 3 log files with the tail plugin, and only the file with the highest log flow has this problem. From the metrics produced by prometheus plugin, the file position always fluctuates by at least 400000/sec and the rotation happens at least once every 10 minutes.

The input source is set as follows:

<source>
  @type tail
  format ...
  path /var/log/server.log
  pos_file /var/log/td-agent/server.log.pos
  tag input.server.log
</source>

I will also share the record of the investigation when the problem occurs. First, fluentd suddenly, without warning, stops tracking new rotated files and only logs the following in the stdout log.

detected rotation of /var/log/server.log; waiting 5 seconds

Normally, it would have been followed by an announcement that a new file was being tracked, as follows

The following tail of /var/log/server.log

However, we have not seen this announcement at all after the problem occurs. Also, when I view the file descriptors that fluentd is tracking, I see the following:

$ sudo ls "/proc/$(pgrep -f 'td-agent.*--under-supervisor')/fd" -l | grep server
lr-x------ 1 td-agent td-agent 64 Feb 5 01:41 49 -> /var/log/server.3.log

This is the rotated file, and we see that fluentd is still tracking the old file, not the new one. Also, looking at the position file, we see the following:

$ sudo cat /var/log/td-agent/server.log.pos 
/var/log/server.log 000000002001f2d0 000000003a000042
/var/log/server.log 0000000000000000 0000000000000000

As far as the situation is concerned, it seems that something triggers fluentd to stop tracking new files altogether during rotate. We have about 300 servers running 24/7 and have only had two problems in the past month, so it seems to be pretty rare for problems to occur. However, since we haven't had any problems with fluentd 1.13.3 for 6 months, it seems likely that there was some kind of regression in fluentd 1.13.3 -> 1.14.3.

We are planning to downgrade fluentd.

ashie · 2022-02-14T04:10:49Z

Memo: From v1.14.0 to v1.14.3, following changes are made to in_tail:

mizunashi-mana · 2022-02-22T04:47:22Z

We downgraded td-agent 4.3.0 (fluentd v1.14.3) to td-agent 4.2.0 (fluentd v1.13.3), and still have problems.
We broadened monitoring targets to 500 servers (approx 1500 files) from 200 servers (approx 400 files) as a result of this issue. Then, we found the 2 servers (2 files) having this issue after a week from td-agent downgraded.
Upgrading fluentd 1.13.3 -> 1.14.3 does not seem to be the cause, sorry.

It seemed that some of our log files were affected and some were not. The differences are as follows.

rotation policy; the unaffected files were using a time based rotation policy (e.g. *.log-%Y%m%d-%s), while the affected files were using a fixed window policy (e.g. *.%i.log). The fixed window policy takes longer to rotate than the time based rotation policy (because it requires renaming all the previous files).
rotation period; unaffected files are rotated in 30 minutes, while affected files are rotated in about 7 minutes. However, it may be just a matter of probability since the more rotations we do, the higher the probability that we will get this problem.

We are planning to change rotation policy and retry upgrading td-agent. And, we will comment some updates if we get.

mizunashi-mana · 2022-03-17T03:07:28Z

We enabled debug logging with fluentd 1.13.3 and encountered this issue. The state was same:

fluentd had been reporting detected rotation of /var/log/server.log; waiting 5 seconds without The following tail of /var/log/server.log.
fluentd tracked the old file descriptor.
position file had two entries; a correct position and /var/log/server.log 0000000000000000 0000000000000000.
fluentd_tail_file_inode metrics had been keeping the same inode from this issue occurred.

And, fluentd reported Skip update_watcher because watcher has been already updated by other inotify event following detected rotation of /var/log/server.log; waiting 5 seconds. The message seems to be reported by https://github.com/fluent/fluentd/blob/v1.13.3/lib/fluent/plugin/in_tail.rb#L482. It seems to be strongly related position file having two entries.

We searched some other debug logs related this issue but we had no such logs. fluentd reported no other debug logs but only pairs of detected rotation of /var/log/server.log; waiting 5 seconds and The following tail of /var/log/server.log.

ashie · 2022-03-17T03:17:34Z

Thanks for your report!

And, fluentd reported Skip update_watcher because watcher has been already updated by other inotify event following detected rotation of /var/log/server.log; waiting 5 seconds. The message seems to be reported by https://github.com/fluent/fluentd/blob/v1.13.3/lib/fluent/plugin/in_tail.rb#L482. It seems to be strongly related position file having two entries.

To tell the truth, I was suspecting it but I couldn't confirm it because I can't yet reproduce it.
Your report it very helpful for me.

Although I'm not yet sure the mechanism of this issue, you might be able to avoid this issue by disabling stat watcher (enable_stat_watcher false).

ashie · 2022-07-08T14:24:39Z

Similar reports:

issue	date	user	fluentd version	read_from_head	follow_inodes	work around
#3586	2021-12-28	@jasminehung	v1.14.3	true	true	`pos_file_compaction_interval 3m`
#3614	2022-02-02	@salavessa	v1.14.4	true	false	`enable_stat_watcher false`
	2022-02-14	@mizunashi-mana	1.14.3, v1.13.3	false	false
#3661	2022-03-08	@shenmuxiaosen	v1.13.3	false	false	v1.12.4 was stable
	2022-07-08	@ljuaneda	?	?	?	v1.11.5 was stable
#3800	2022-06-27	@bjg2	v1.13.3	true	false
#3838	2022-08-01	@bertpassek	v1.15.1	?	?

ashie · 2022-07-08T14:28:40Z

From #3586

Your Error Log

There's no error log when this happens, but there were some info logs with special behavior:
2021–12–27 07:37:52 +0000 [info]: #0 detected rotation of /var/log/containers/myapp-765f4884fb-qfzhw_myapp-07e1fdd304cd14fa49ad23e034ef806d16c9b5b08c84fa4f269594fc176464a8.log; waiting 5 seconds
2021–12–27 07:37:52 +0000 [info]: #0 following tail of /var/log/containers/myapp-765f4884fb-qfzhw_myapp-07e1fdd304cd14fa49ad23e034ef806d16c9b5b08c84fa4f269594fc176464a8.log
---- at list moment myapp log stop being sent to elasticsearch (no more "following tail" info log for /var/log/containers/myapp) ------
2021–12–27 07:38:43 +0000 [info]: #0 detected rotation of /var/log/containers/simm-natseventhandler-p2–765f4884fb-qfzhw_simm-a_simm-natseventhandler-p2–07e1fdd304cd14fa49ad23e034ef806d16c9b5b08c84fa4f269594fc176464a8.log; waiting 5 seconds
2021–12–27 07:39:47 +0000 [info]: #0 detected rotation of /var/log/containers/simm-natseventhandler-p2–765f4884fb-qfzhw_simm-a_simm-natseventhandler-p2–07e1fdd304cd14fa49ad23e034ef806d16c9b5b08c84fa4f269594fc176464a8.log; waiting 5 seconds
2021–12–27 07:40:51 +0000 [info]: #0 detected rotation of /var/log/containers/simm-natseventhandler-p2–765f4884fb-qfzhw_simm-a_simm-natseventhandler-p2–07e1fdd304cd14fa49ad23e034ef806d16c9b5b08c84fa4f269594fc176464a8.log; waiting 5 seconds


In the pos file we found the records for myapp is strange:
1. There are duplicated rows tracking for myapp container (Will fluentd use the last or first line? or both? or none? ). For the other hundreds of normal pods, there's only 1 row for each container. 
2. Record with position "ffffffffffffffff 0000000000000000" keep appearing. Does this mean pos_file is not able to track the container?

Results of querying "myapp" from the pos file: 
Line 241: /var/log/containers/myapp-765f4884fb-qfzhw_myapp-07e1fdd304cd14fa49ad23e034ef806d16c9b5b08c84fa4f269594fc176464a8.log 0000000000bb1725 00000000000a0142
Line 260: /var/log/containers/myapp-765f4884fb-qfzhw_myapp-07e1fdd304cd14fa49ad23e034ef806d16c9b5b08c84fa4f269594fc176464a8.log ffffffffffffffff 0000000000000000
Line 292: /var/log/containers/myapp-765f4884fb-qfzhw_myapp-07e1fdd304cd14fa49ad23e034ef806d16c9b5b08c84fa4f269594fc176464a8.log 0000000000bf8c99 00000000000a0027
Line 302: /var/log/containers/myapp-765f4884fb-qfzhw_myapp-07e1fdd304cd14fa49ad23e034ef806d16c9b5b08c84fa4f269594fc176464a8.log ffffffffffffffff 0000000000000000
Line 561: /var/log/containers/myapp-765f4884fb-qfzhw_myapp-07e1fdd304cd14fa49ad23e034ef806d16c9b5b08c84fa4f269594fc176464a8.log 0000000000afaeea 00000000000a001c
Line 785: /var/log/containers/myapp-765f4884fb-qfzhw_myapp-07e1fdd304cd14fa49ad23e034ef806d16c9b5b08c84fa4f269594fc176464a8.log ffffffffffffffff 0000000000000000
Line 804: /var/log/containers/myapp-765f4884fb-qfzhw_myapp-07e1fdd304cd14fa49ad23e034ef806d16c9b5b08c84fa4f269594fc176464a8.log 000000000016e1a8 00000000000a0019

Additional context

Adding pos_file_compaction_interval and set it to 3m seems to fix the log missing problem, but still not able to explain the root cause of the case.

ashie · 2022-07-08T15:00:26Z

There are several similar reports and most of them said:

When tailing is stopped,

202x–xx–xx xx:xx:xx +0000 [info]: #0 detected rotation of /var/log/containersxxx.log; waiting 5 seconds

was observed but

202x–xx–xx xx:xx:xx +0000 [info]: #0 following tail of /var/log/containers/xxx.log

wasn't observed.

In addition, a user reports that following debug message is observed when log level is debug:

202x–xx–xx xx:xx:xx +0000 [debug]: #0 Skip update_watcher because watcher has been already updated by other inotify event

So, it seems caused by the following code:

fluentd/lib/fluent/plugin/in_tail.rb

Lines 489 to 498 in 438a82a

    
           log.info("detected rotation of #{target_info.path}; waiting #{@rotate_wait} seconds") 
        
           if @pf 
        
             pe_inode = pe.read_inode 
        
             target_info_from_position_entry = TargetInfo.new(target_info.path, pe_inode) 
        
             unless pe_inode == @pf[target_info_from_position_entry].read_inode 
        
               log.debug "Skip update_watcher because watcher has been already updated by other inotify event" 
        
               return 
        
             end 
        
           end

salavessa · 2022-07-08T15:55:30Z

For what's worth: after adding enable_stat_watcher false back in March (as mentioned by @ashie here) none of my 120+ servers experienced this issue again.

wangchaoforever · 2022-07-15T07:12:05Z

@ashie Is this one fixed?

In #3614 some users reported that in_tail rarely stops tailing. It seems it's caused when skipping update_watcher due to unexpected duplicate pos_file entries. The root cause is unknown yet. To investigate it, show more infomation when it happens. Ideally it should't happen so that log level "warn" is desired for it. Signed-off-by: Takuro Ashie <[email protected]>

ashie · 2022-08-01T09:18:27Z

At v1.15.1 we added a log message like the following to investigate this issue:

2022-08-01 17:57:19 +0900 [warn]: #0 Skip update_watcher because watcher has been already updated by other inotify event path="logs/hige.log" inode=59511737 inode_in_pos_file=0

If you see such log message, please report it.

vparfonov · 2022-08-01T09:51:03Z

Hello, @ashie! We have something similar to: "no more log pushed" and found duplication in pos file.
Like this:

/var/log/pods/pod-l4wrv/service/0.log	ffffffffffffffff	000000008b00ad36

/var/log/pods/pod-5bmg8/service/0.log	000000000333fd2e	000000005a00673b
/var/log/pods/pod-5bmg8/service/0.log	0000000004b2a353	000000005a00673e
/var/log/pods/pod-5bmg8/service/0.log	0000000003666a64	000000005a00673c
/var/log/pods/pod-5bmg8/service/0.log	0000000006a53eab	000000005a00673e
/var/log/pods/pod-5bmg8/service/0.log	0000000004a21b23	000000005a01e248
/var/log/pods/pod-5bmg8/service/0.log	0000000000000000	0000000000000000

Maybe this can help you.

ashie · 2022-08-01T09:53:44Z

@vparfonov Could you provide your in_tail config? Do you use follow_inode true?

vparfonov · 2022-08-01T10:02:41Z

@ashie, looks like not, we use default setting for follow_inode

<source>
  @type tail
  @id container-input
  path "/var/log/pods/**/*.log"
  exclude_path ["/var/log/pods/openshift-logging_collector-*/*/*.log", "/var/log/pods/openshift-logging_elasticsearch-*/*/*.log", "/var/log/pods/openshift-logging_kibana-*/*/*.log"]
  pos_file "/var/lib/fluentd/pos/es-containers.log.pos"
  refresh_interval 5
  rotate_wait 5
  tag kubernetes.*
  read_from_head "true"
  skip_refresh_on_startup true
  @label @MEASURE
  <parse>
    @type multi_format
    <pattern>
      format json
      time_format '%Y-%m-%dT%H:%M:%S.%N%Z'
      keep_time_key true
    </pattern>
    <pattern>
      format regexp
      expression /^(?<time>[^\s]+) (?<stream>stdout|stderr)( (?<logtag>.))? (?<log>.*)$/
      time_format '%Y-%m-%dT%H:%M:%S.%N%:z'
      keep_time_key true
    </pattern>
  </parse>
</source>

artheus · 2022-08-22T14:36:11Z

I looked a bit at this now. What seems to happen is that in the in_tail.rb it checks if File.readable? which is false in the case of a broken symlink.
So in the case of log rotation, the symlink to the container log file will be broken for a while, in Kubernetes environments.
If there are no log lines output within the timeout of the detected log rotation, the in_tail will assume the file to be readable, but no file is actually created yet.
This could easily be fixed by polling the symlink, until it's no longer broken (or until some kind of max_retry is hit)
I also think it should be announced somewhere that one should install capng to get the capabilities lib working properly. Which it was not, for us. And we're using the official docker images for running fluentd.

EDIT: This should be reproducible in a local k8s environment by creating a tiny app that outputs logs every 10th - 30th second. And running fluentd in_tail the log files there.

artheus · 2022-08-23T08:35:48Z

It is really weird however, that the re-examination done on the next refresh does not always pick up on that the symlinks are no longer broken, and that the file should now be readable.
I have however found that in most of the cases where we will get no more logs, after a detected log rotation, the "detected rotation of ..." log entry was sent twice for the same file, on the same node, and in other cases when FluentD is able to pick up tailing the file again later, the log line appears only once.

artheus · 2022-08-23T11:24:15Z

I created a test case in test/plugin/test_in_tail.rb for just playing around with this exact event:

sub_test_case "multiple log rotations" do
    data(
      "128" => ["128-0.log", 128, "msg"],
      "256" => ["256-0.log", 256, "msg"],
      "512" => ["512-0.log", 512, "msg"],
      "1024" => ["1024-0.log", 1024, "msg"],
      "2048" => ["2048-0.log", 2048, "msg"],
      "4096" => ["4096-0.log", 4096, "msg"],
      "8192" => ["8192-0.log", 8192, "msg"],
    )

    def test_reproduce_err_after_rotations(data)
      file, num_lines, msg = data

      File.open("#{@tmp_dir}/#{file}", 'wb') do |f|
        num_lines.times do
          f.puts "#{msg}\n"
        end
      end

      conf = config_element("", "", {
        "path" => "#{@tmp_dir}/*.log.link",
        "pos_file" => "#{@tmp_dir}/tail.pos",
        "refresh_interval" => "1s",
        "read_from_head" => "true",
        "format" => "none",
        "rotate_wait" => "1s",
        "follow_inodes" => "true",
        "tag" => "t1",
      })

      link_name="#{@tmp_dir}/#{num_lines}.log.link"

      File.symlink(file, link_name)

      dl_opts = {log_level: Fluent::Log::LEVEL_DEBUG}
      logdev = $stdout
      logger = ServerEngine::DaemonLogger.new(logdev, dl_opts)
      log_instance = Fluent::Log.new(logger)

      rotations = 5
      rot_now = 1

      d = create_driver(conf, false)
      d.instance.log = log_instance
      d.run(timeout: 30) do
        sleep 1

        assert_equal(num_lines, d.record_count)

        # rotate logs
        while rot_now <= rotations do
          sleep 2

          puts "unlink #{link_name}"
          File.unlink(link_name)
          puts "symlink #{num_lines}-#{rot_now}.log #{link_name}"
          File.symlink("#{num_lines}-#{rot_now}.log", link_name)

          sleep 1

          File.open("#{@tmp_dir}/#{num_lines}-#{rot_now}.log", 'wb') do |f|
            num_lines.times do
              f.puts "#{msg}\n"
            end
          end

          assert_equal(num_lines*rot_now, d.record_count)

          rot_now = rot_now + 1
        end
      end
    end
  end

In this case, it seems to be working properly. But maybe we can help each-other in reproducing the error?

UPDATE: Changed to multiple files of different sizes, and changed log rotation to how kubelet actually does it.
UPDATE 2: Code updated! Now it seems I've got something that somehow reproduces the error, in this very state the tests works fine. But if you comment out the "follow_inodes" => "true", line, the error comes up and the log line Skip update_watcher because watcher has been.... mentioned by @ashie above, is logged all the time. So I think that the follow_inodes option is not only important for preventing duplicate log messages, but also for tail on wildcards, and symlinks to work properly!

SriramDuvvuri · 2022-08-30T05:26:25Z

HI All,

We are facing a similar kind of issue where the td-agent.log & .pos files are not updating properly.
In some cases , failing after when log rotation occurs, file gets created but nothing is generated as shown below. Even though if we restart the td-agent, the problem is not fixed

As per some previous comments, I have tried modifying the config with the below parameters, but nothing changes

enable_stat_watcher false
read_from_head true
follow_inodes true
refresh_interval 5
rotate_wait 5
skip_refresh_on_startup true
pos_file_compaction_interval 30m

As we have systems over RHEL 7/6 we are using fluentd version using 1.11.5 td-agent 3.8.1/3.8.0, in our environment
Here is the config we are using, please provide us your inputs here.

[root]:/var/log/td-agent> ls -ltr
-rw-r--r-- 1 td-agent td-agent       68 Aug 29 20:00 s2.pos
-rw-r--r-- 1 td-agent td-agent      384 Aug 29 20:00 b1.pos
-rw-r--r-- 1 td-agent td-agent    37509 Aug 29 20:00 a3.pos
-rw-r--r-- 1 td-agent td-agent    41629 Aug 29 20:00 xx.pos
-rw-r----- 1 td-agent td-agent 18562883 Aug 29 20:38 **td-agent.log-20220830**
-rw-r----- 1 td-agent td-agent        **0 Aug 30 03:50 td-agent.log**
****

Config:

<source>
@type tail
  path /home/server/xx*.log
  exclude_path ["/home/server/xxyy*.log"]
  pos_file /var/log/td-agent/xxlog.pos
  tag xx.log
  enable_stat_watcher false
  read_from_head true
  follow_inodes true
  refresh_interval 5
  rotate_wait 5
  skip_refresh_on_startup true
  pos_file_compaction_interval 30m
  <parse>
    @type multiline
    format_firstline /<20/
    format1 /<(?<logDate>.*?)>-<(?<logThread>.*?)>-<(?<level>.*?)>-<(?<logClass>.*?)>-(?<logMessage>.*)/
    time_key logDate
    time_format %Y/%m/%d-%H:%M:%S.%N
  </parse>
</source>

PS. the same config is working fine in few other instances. I am not sure what was the problem here.

Appreciate your help in advance.

artheus · 2022-09-13T09:24:36Z

We just ran into the same problem again. Without the log line Skip update_watcher because watcher has been.... showing up.
Now we only get these:

2022-09-13 06:36:37 +0000 [warn]: #0 [in_tail_container_logs] got incomplete line at shutdown from /var/log/containers/redacted-service-name-6b6754b75c-swn6z_backend_redacted-service-name-13f36bd8f73334e603ee6fe80642d971f6ea0b25ce966c6ec494c5b226e4d93c.log: ""

Is it wrong that we are running this in a k8s DaemonSet? Should we install FluentD/FluentBit directly on the vm's instead?
Are there any special capabilities needed for the Pods created by the DaemonSet?
I'm starting to feel a bit lost in this problem...

artheus · 2023-10-07T10:25:11Z

Here's the pos file migration script I wrote. (Seems I forgot that I just wrote it in bash, and not Golang)

It requires sqlite3 to be installed, for the script to work. Installation guides can be found here: https://www.tutorialspoint.com/sqlite/sqlite_installation.htm

#!/usr/bin/env bash

info() {
  printf "[%s] INFO - %s\n" "$(date --iso-8601=seconds )" "$@"
}

readonly DB='/opt/fluent-bit-db/log-tracking.db'
readonly FLUENTD_LOG_POS="/var/log/fluentd-containers.log.pos"

if [[ ! -f "$FLUENTD_LOG_POS" ]]; then
  info "No FluentD log tracking file to migrate from"
  exit
fi

if [[ ! -f "$DB" ]]; then
  sqlite3 "$DB" "CREATE TABLE main.in_tail_files (id INTEGER PRIMARY KEY, name TEXT, offset INTEGER, inode INTEGER, created INTEGER, rotated INTEGER);"
else
  info "fluent-bit database already exists, will not do migration"
  exit
fi

while read -r line; do
  IFS=$'\t' read -r -a parts <<< "$line"

  filename="${parts[0]}"
  offset="$((16#${parts[1]}))"
  inode="$((16#${parts[2]}))"
  now="$(date +%s)"

  sqlite3 "$DB" "INSERT INTO in_tail_files (name, offset, inode, created, rotated) VALUES ('$filename', $offset, $inode, $now, 0)"
done < <(sort "$FLUENTD_LOG_POS")

There are no security stuff like escaping values for INSERT INTO or anything, but it's intended usage is in controlled environments. I am guessing there are plenty of things to improve on, but this has worked well on our internal k8s testing nodes, where we've used it as an init container for the fluent-bit pod which takes over for Fluentd.

Fix #3614 Although known stall issues of in_tail on `follow_inode` case are fixed in v1.16.2, it has still a similar problem on `!follow_inode` case. In this case, a tail watcher is possible to mark the position entry as `unwatched` if it's tansitioned to `rotate_wait` state by `refresh_watcher` even if another newer tail watcher is managing it. It's hard to occur in usual because `stat_watcher` will be called immediately after the file is changed while `refresh_wather` is called every 60 seconds in default. However, there is a rare possibility that this order might be swapped especillay if in_tail is busy on processing large amount of logs. Because in_tail is single threadied, event queues such as timers or inotify will be stucked in this case. There is no such problem on `follow_inode` case because position entries are always marked as `unwatched` before entering `rotate_wait` state. Signed-off-by: Takuro Ashie <[email protected]>

Fix #3614 Although known stall issues of in_tail on `follow_inode` case are fixed in v1.16.2, it has still a similar problem on `!follow_inode` case. In this case, a tail watcher is possible to mark the position entry as `unwatched` if it's tansitioned to `rotate_wait` state by `refresh_watcher` even if another newer tail watcher is managing it. It's hard to occur in usual because `stat_watcher` will be called immediately after the file is changed while `refresh_wather` is called every 60 seconds by default. However, there is a rare possibility that this order might be swapped especillay if in_tail is busy on processing large amount of logs. Because in_tail is single threadied, event queues such as timers or inotify will be stucked in this case. There is no such problem on `follow_inode` case because position entries are always marked as `unwatched` before entering `rotate_wait` state. Signed-off-by: Takuro Ashie <[email protected]>

Fix #3614 Although known stall issues of in_tail on `follow_inode` case are fixed in v1.16.2, it has still a similar problem on `!follow_inode` case. In this case, a tail watcher is possible to mark the position entry as `unwatched` if it's tansitioned to `rotate_wait` state by `refresh_watcher` even if another newer tail watcher is managing it. It's hard to occur in usual because `stat_watcher` will be called immediately after the file is changed while `refresh_wather` is called every 60 seconds by default. However, there is a rare possibility that this order might be swapped especillay if in_tail is busy on processing large amount of logs. Because in_tail is single threadied, event queues such as timers or inotify will be stucked in this case. There is no such problem on `follow_inode` case because position entries are always marked as `unwatched` before entering `rotate_wait` state. --------- Signed-off-by: Takuro Ashie <[email protected]> Co-authored-by: Daijiro Fukuda <[email protected]>

Fix fluent#3614 Although known stall issues of in_tail on `follow_inode` case are fixed in v1.16.2, it has still a similar problem on `!follow_inode` case. In this case, a tail watcher is possible to mark the position entry as `unwatched` if it's tansitioned to `rotate_wait` state by `refresh_watcher` even if another newer tail watcher is managing it. It's hard to occur in usual because `stat_watcher` will be called immediately after the file is changed while `refresh_wather` is called every 60 seconds by default. However, there is a rare possibility that this order might be swapped especillay if in_tail is busy on processing large amount of logs. Because in_tail is single threadied, event queues such as timers or inotify will be stucked in this case. There is no such problem on `follow_inode` case because position entries are always marked as `unwatched` before entering `rotate_wait` state. --------- Signed-off-by: Takuro Ashie <[email protected]> Co-authored-by: Daijiro Fukuda <[email protected]>

daipom · 2023-12-18T02:02:27Z

Thank you all very much!
Thanks to all your help, we were able to identify the logic problems and fix them!

The problem that in_tail stops collecting is fixed at Fluentd v1.16.3, Fluent Package v5.0.2 and td-agent v4.5.2.
Please use them!
If you still have problems, we'd be very grateful if you report to us.

alexei-g-aloteq · 2023-12-24T07:31:09Z

No, it is not fixed.
We've switched to image 1.16.3-debian-elasticsearch8-amd64-1.0, but still see almost daily outage.

uristernik · 2023-12-28T20:57:48Z

We are still seeing this issue as well. We are using this image v1.16.3-debian-forward-1.0.

We were seeing the following message If you keep getting this message, please restart Fluentd and like suggested here we changed follow_inodes to true and set rotate_wait to 0, but we are still seeing loads of Skip update_watcher because watcher has been already updated by other inotify event

We also noticed a pattern of memory leaking and gradual increase in CPU usage until a restart occurs. We are using fluentd as a daemonset on a kubernetes cluster. Here is our in_tail config:

<source>
  @type tail
  @id in_tail_container_logs
  path /var/log/containers/*.log
  pos_file /var/log/fluentd-containers.log.pos
  tag kubernetes.*
  read_from_head true
  follow_inodes true
  rotate_wait 0
  exclude_path ["/var/log/containers/fluentd*.log"]
  <parse>
    @type multi_format
    <pattern>
      format json
      time_key time
      time_type string
      time_format "%Y-%m-%dT%H:%M:%S.%NZ"
      keep_time_key true
    </pattern>
    
    <pattern>
      format /^(?<time>.+?) (?<stream>stdout|stderr) (?<logtag>[FP]) (?<log>.+)$/
      time_format "%Y-%m-%dT%H:%M:%S.%N%:z"
    </pattern>
    
  </parse>
  emit_unmatched_lines true
</source>

<filter kubernetes.**>
  @type concat
  key log
  partial_key logtag
  partial_value P
  separator ""
</filter>

uristernik · 2024-01-01T12:55:39Z

We are still seeing this issue as well. We are using this image v1.16.3-debian-forward-1.0.

We were seeing the following message If you keep getting this message, please restart Fluentd and like suggested here we changed follow_inodes to true and set rotate_wait to 0, but we are still seeing loads of Skip update_watcher because watcher has been already updated by other inotify event

We also noticed a pattern of memory leaking and gradual increase in CPU usage until a restart occurs. We are using fluentd as a daemonset on a kubernetes cluster. Here is our in_tail config:
<source>
  @type tail
  @id in_tail_container_logs
  path /var/log/containers/*.log
  pos_file /var/log/fluentd-containers.log.pos
  tag kubernetes.*
  read_from_head true
  follow_inodes true
  rotate_wait 0
  exclude_path ["/var/log/containers/fluentd*.log"]
  <parse>
    @type multi_format
    <pattern>
      format json
      time_key time
      time_type string
      time_format "%Y-%m-%dT%H:%M:%S.%NZ"
      keep_time_key true
    </pattern>
    
    <pattern>
      format /^(?<time>.+?) (?<stream>stdout|stderr) (?<logtag>[FP]) (?<log>.+)$/
      time_format "%Y-%m-%dT%H:%M:%S.%N%:z"
    </pattern>
    
  </parse>
  emit_unmatched_lines true
</source>

<filter kubernetes.**>
  @type concat
  key log
  partial_key logtag
  partial_value P
  separator ""
</filter>

In an environment where we have high volatility (we constantly deploy new code -> deployments are restarted very frequently -> pods are created -> lots of files to tail) we see a clear leak pattern.
Like mentioned above this is coupled with a huge amount of:

 Skip update_watcher because watcher has been already updated by other inotify event path="/var/log/containers/...

Any suggestion to mitigate this? @ashie @daipom
Maybe setting pos_file_compaction_interval to some value so we would stop watching rotated files? (of pods that no longer exist)

alexei-g-aloteq · 2024-01-03T09:02:03Z

BTW, any good alternatives for Fluentd?
It does not seem to be reliable at the moment.
This issue kept unfixed for years.

zmedico · 2024-01-03T21:29:57Z

I've observed this issue with v1.16.3. I suspect that in_tail handling of file rotations is unlikely to ever reach a satisfactory level of reliability, and something like the docker fluentd logging driver (which unfortunately breaks kubectl logs) is the only way to reliably avoid problems with log file rotation.

I suspect that my open_on_every_update true config might be making problems worse (not to mention that it introduces a race to collect log entries during log rotation), so I'm going to try disabling that to see if it helps.

zmedico · 2024-01-04T00:44:14Z

Setting open_on_every_update false did not seem to help for me. Interestingly, the issue did not appear to correlate with a log rotation, since the log file that it was not following was still older than the fluentd process. Fluentd was following the file through two levels of symlinks, and the symlink timestamps are older than the actual log file, so I don't know how to explain why it triggered the "unreadable" log message.

daipom · 2024-01-05T00:44:20Z

Thanks for reporting.

We are still seeing this issue as well. We are using this image v1.16.3-debian-forward-1.0.

We were seeing the following message If you keep getting this message, please restart Fluentd and like suggested here we changed follow_inodes to true and set rotate_wait to 0, but we are still seeing loads of Skip update_watcher because watcher has been already updated by other inotify event

We also noticed a pattern of memory leaking and gradual increase in CPU usage until a restart occurs. We are using fluentd as a daemonset on a kubernetes cluster. Here is our in_tail config:

So, there are still problems with both follow_inodes false and follow_inodes true, such as collection stops and resource leaks.

ashie · 2024-01-05T03:09:21Z

#4326 would be a possible remaining cause of this issue.
On the other hand it might not be related with above additional reports because #4326 says no error log is found.

ashie · 2024-01-05T03:15:23Z

BTW it would be better to open a new issue to treat remaining things.
This issue is too long and mixed various causes.
It seems hard to collect & track informations to debug remaining causes in this issue.

uristernik · 2024-01-07T07:08:03Z

@ashie @daipom Thanks for the responses and attention.

Should I open a new issue for that?

If you have any suggestion I'll be more than happy to test different configs.
I added the pos_file_compaction_interval 20m to the config but still seeing that Fluentd is leaking CPU


<source>
  @type tail
  @id in_tail_container_logs
  path /var/log/containers/*.log
  pos_file /var/log/fluentd-containers.log.pos
  tag kubernetes.*
  read_from_head true
  follow_inodes true
  rotate_wait 0
  exclude_path ["/var/log/containers/fluentd*.log", "/var/log/containers/*kube-system*.log", "/var/log/containers/*calico-system*.log", "/var/log/containers/prometheus-node-exporter*.log", "/var/log/containers/opentelemetry-agent*.log"]
  pos_file_compaction_interval 20m
  <parse>
    @type multi_format
    <pattern>
      format json
      time_key time
      time_type string
      time_format "%Y-%m-%dT%H:%M:%S.%NZ"
      keep_time_key true
    </pattern>
    <pattern>
      format /^(?<time>.+?) (?<stream>stdout|stderr) (?<logtag>[FP]) (?<log>.+)$/
      time_format "%Y-%m-%dT%H:%M:%S.%N%:z"
    </pattern>
  </parse>
  emit_unmatched_lines true
</source>

daipom · 2024-01-09T08:48:54Z

@uristernik Thanks. Could you please open a new issue?

uristernik · 2024-01-15T09:43:09Z

@daipom Done, hopefully I described the issue clear enough. Please correct me if I wasn't accurate enough

daipom · 2024-01-16T09:06:35Z

@uristernik Thanks! I will check it!

daipom · 2024-01-16T09:07:56Z

If you all still have a similar issue, please report it on the new issue!

ashie added the bug Something isn't working label Feb 14, 2022

ashie self-assigned this Feb 14, 2022

ashie mentioned this issue Mar 9, 2022

In_tail stop working after log rotation in K8s #3661

Closed

ashie added this to Fluentd Kanban May 18, 2022

ashie mentioned this issue Jul 8, 2022

Working fluentd stopped following tail for one file #3800

Closed

ashie moved this to In-Progress in Fluentd Kanban Jul 9, 2022

ashie mentioned this issue Jul 25, 2022

in_tail: Show more information on skipping update_watcher #3829

Merged

ashie mentioned this issue Aug 1, 2022

input tail plugin stops working after file rotation #3838

Closed

This was referenced Aug 30, 2022

td-agent.log & .pos file are not updating #3879

Closed

in_tail: Fluentd process goes unresponsive/hung and not tailing any of the log files #3882

Closed

ashie mentioned this issue Oct 18, 2023

in_tail: Fix a stall bug on !follow_inode case #4327

Merged

daipom closed this as completed in #4327 Oct 27, 2023

daipom mentioned this issue Oct 30, 2023

-p overwrites the default plugin path #4269

Closed

daipom mentioned this issue Nov 9, 2023

in_tail: add warning for silent stop on !follow_inodes case #4339

Merged

daipom modified the milestones: v1.17.0, v1.16.3 Dec 18, 2023

daipom reopened this Jan 5, 2024

uristernik mentioned this issue Jan 15, 2024

File watchers might not be handled properly causing gradual increase in CPU/Memory usage #4381

Open

daipom closed this as completed Jan 16, 2024

Fluentd in_tail "unreadable" file causes "following tail of <file>" to stop and no logs pushed #3614

Fluentd in_tail "unreadable" file causes "following tail of <file>" to stop and no logs pushed #3614

Comments

salavessa commented Feb 1, 2022

Describe the bug

To Reproduce

Expected behavior

Your Environment

Your Configuration

Your Error Log

Additional context

mizunashi-mana commented Feb 14, 2022

ashie commented Feb 14, 2022 • edited Loading

mizunashi-mana commented Feb 22, 2022

mizunashi-mana commented Mar 17, 2022

ashie commented Mar 17, 2022

ashie commented Jul 8, 2022 • edited Loading

ashie commented Jul 8, 2022

Your Error Log

Additional context

ashie commented Jul 8, 2022

salavessa commented Jul 8, 2022

wangchaoforever commented Jul 15, 2022

ashie commented Aug 1, 2022

vparfonov commented Aug 1, 2022

ashie commented Aug 1, 2022 • edited Loading

vparfonov commented Aug 1, 2022

artheus commented Aug 22, 2022 • edited Loading

artheus commented Aug 23, 2022

artheus commented Aug 23, 2022 • edited Loading

SriramDuvvuri commented Aug 30, 2022 • edited by ashie Loading

Config:

artheus commented Sep 13, 2022

artheus commented Oct 7, 2023 • edited Loading

daipom commented Dec 18, 2023

alexei-g-aloteq commented Dec 24, 2023

uristernik commented Dec 28, 2023

uristernik commented Jan 1, 2024

alexei-g-aloteq commented Jan 3, 2024

zmedico commented Jan 3, 2024

zmedico commented Jan 4, 2024

daipom commented Jan 5, 2024

ashie commented Jan 5, 2024

ashie commented Jan 5, 2024

uristernik commented Jan 7, 2024 • edited Loading

daipom commented Jan 9, 2024

uristernik commented Jan 15, 2024

daipom commented Jan 16, 2024

daipom commented Jan 16, 2024

ashie commented Feb 14, 2022 •

edited

Loading

ashie commented Jul 8, 2022 •

edited

Loading

ashie commented Aug 1, 2022 •

edited

Loading

artheus commented Aug 22, 2022 •

edited

Loading

artheus commented Aug 23, 2022 •

edited

Loading

SriramDuvvuri commented Aug 30, 2022 •

edited by ashie

Loading

artheus commented Oct 7, 2023 •

edited

Loading

uristernik commented Jan 7, 2024 •

edited

Loading