1.12.0: tail plugin crashes worker upon ENOENT (cannot stat() file) #3274

jgehrcke · 2021-03-02T11:30:34Z

Describe the bug

The in_tail plugin can discover a new log file and then crash the worker when the subsequent stat() system call fails with ENOENT:

[info]: gem 'fluentd' version '1.12.0'

...

[error]: #0 /usr/local/bundle/gems/fluentd-1.12.0/lib/fluent/root_agent.rb:200:in `block in start'
[error]: #0 /usr/local/bundle/gems/fluentd-1.12.0/lib/fluent/plugin/in_tail.rb:234:in `start'
[error]: #0 /usr/local/bundle/gems/fluentd-1.12.0/lib/fluent/plugin/in_tail.rb:353:in `refresh_watchers'
[error]: #0 /usr/local/bundle/gems/fluentd-1.12.0/lib/fluent/plugin/in_tail.rb:390:in `start_watchers'
[error]: #0 /usr/local/bundle/gems/fluentd-1.12.0/lib/fluent/plugin/in_tail.rb:390:in `each_value'
[error]: #0 /usr/local/bundle/gems/fluentd-1.12.0/lib/fluent/plugin/in_tail.rb:409:in `block in start_watchers'
[error]: #0 /usr/local/bundle/gems/fluentd-1.12.0/lib/fluent/plugin/in_tail.rb:409:in `stat'
[error]: #0 unexpected error error_class=Errno::ENOENT error="No such file or directory @ rb_file_s_stat - /var/log/containers/foo-55fc795779-l66hb_foo_foo-ee7d580da4f6fac071b1e0fa8533fcada239a88404ed2fcde16480fbcd4a39fb.log"

I think this is a classic race condition. Because this is part of the inotify-based watch system, there is always a chance that the file has already disappeared again between the notification that a new file exists and the code accessing it.
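The race can be shown deterministically in plain Ruby, independent of Fluentd. A path obtained from a directory listing is only a snapshot, so a later File.stat on it raises Errno::ENOENT if the file was removed in between. This is an illustrative sketch, not in_tail's code: the file name is made up, and the explicit File.delete stands in for whatever removes the log file in a real deployment.

```ruby
require "tmpdir"

# Returns the exception class raised by File.stat on a path that vanished
# after it was discovered, or nil if stat unexpectedly succeeded.
def stat_after_disappearance
  Dir.mktmpdir do |dir|
    path = File.join(dir, "short-lived.log") # hypothetical file name
    File.write(path, "one line\n")

    discovered = Dir.glob(File.join(dir, "*.log")) # step 1: discovery
    File.delete(path)                              # step 2: file goes away

    begin
      File.stat(discovered.first)                  # step 3: stat the stale path
      nil
    rescue SystemCallError => e
      e.class
    end
  end
end

puts stat_after_disappearance.inspect
```

On Linux this should report Errno::ENOENT, the same error class as in the log above.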

To Reproduce
Run in an environment where log files pop up and quickly disappear again (I guess; no clear repro on our side).
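One way to approximate such an environment might be a small script that keeps creating and deleting files in a directory watched by in_tail. This is an untested assumption: it has not been confirmed to trigger the crash. The function name, file names, and timing are all hypothetical.

```ruby
require "tmpdir"
require "fileutils"

# Create `count` log files in `dir`, deleting each one shortly after creation.
# Returns the number of files that were created and removed.
def churn_log_files(dir, count:, lifetime: 0.01)
  FileUtils.mkdir_p(dir)
  count.times do |i|
    path = File.join(dir, "short-lived-#{i}.log") # hypothetical naming scheme
    File.write(path, "line from job #{i}\n")
    sleep(lifetime) # keep the file around just long enough to be noticed
    File.delete(path)
  end
  count
end

if __FILE__ == $PROGRAM_NAME
  # Pass the directory that in_tail watches as the first argument;
  # without one, a throwaway temporary directory is used.
  dir = ARGV.fetch(0) { Dir.mktmpdir("in_tail_churn") }
  puts "churning log files in #{dir}"
  churn_log_files(dir, count: 100)
end
```

Whether the race is hit probably depends on how the file lifetime compares to in_tail's refresh timing, so the `lifetime` value may need tuning.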

Expected behavior

At the system call boundary between application and OS the individual system calls like stat() / open() should always be wrapped with local error handling, expecting the system call to fail.

The failed operation can then be retried, degrading gracefully without affecting the other operations performed by the worker process.

An appropriate action would be to log a warning/error showing how the system call failed.
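A minimal sketch of that behavior in plain Ruby is shown below. It is not the actual in_tail code or the patch in any PR; `stat_or_nil` and `stat_all` are hypothetical helpers, and a standard Logger stands in for the plugin's logger. Each stat() is wrapped individually, so a vanished file produces a warning and is skipped while the remaining files are still processed.

```ruby
require "logger"
require "tmpdir"

# Hypothetical helper, not Fluentd's API: stat a path and return nil
# (after logging a warning) if the file disappeared in the meantime.
def stat_or_nil(path, log)
  File.stat(path)
rescue Errno::ENOENT => e
  log.warn("stat() failed for #{path}: #{e.message}; skipping")
  nil
end

# Stat every path independently so one vanished file does not abort the rest.
def stat_all(paths, log)
  paths.each_with_object({}) do |path, stats|
    st = stat_or_nil(path, log)
    stats[path] = st if st
  end
end

log = Logger.new($stderr)
Dir.mktmpdir do |dir|
  present = File.join(dir, "present.log")
  missing = File.join(dir, "missing.log") # never created, simulates a vanished file
  File.write(present, "hello\n")

  stats = stat_all([present, missing], log)
  puts stats.keys.map { |p| File.basename(p) }.inspect
end
```

Only Errno::ENOENT is rescued here to keep the example narrow. Other failures (for example Errno::EACCES) would still propagate, and whether those should be handled the same way is a separate design decision.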

Your Environment

Fluentd 1.12.0 on Linux

Not retried with 1.12.1 -- same problem seems to be there, based on commits and changelog.

Your Configuration

...

Your Error Log

see above

Additional context

This could be related to Log forwarding stopped after "No such file or directory" error in in_tail #3224.
This might be a regression introduced via "Handle linux capability if available" #3155 or "in_unix: Use v1 plugin API" #2992.

The stat() call that crashes the worker is https://github.com/fluent/fluentd/blame/65a9edf4e05cf64c7ed0de56e12b3c9a0774e0d6/lib/fluent/plugin/in_tail.rb#L409

Looks like even before the last refactor this stat() wasn't protected:
https://github.com/fluent/fluentd/pull/3196/files#diff-456fdeb51bc472beb48891caac0d063e0073655dba7ac2b72e6fdc67dc6ac802R409

The worker subsequently restarts. As this affects the super early worker startup phase and results in an immediate crash, potentially a crash loop, I am not sure if this might result in data loss. #3224 has more detail about that.

Signed-off-by: Dr. Jan-Philip Gehrcke <[email protected]>

jgehrcke · 2021-03-02T11:56:02Z

Attempt to fix: #3275 -- quick review and feedback appreciated. Thanks!

chadlwilson · 2021-03-02T14:45:16Z

Thanks for your work here @jgehrcke - we noticed lost logs after the 1.12 upgrade due to issues with in_tail and immediately reverted. We didn't have the time or depth to dig into what the problem was, but race conditions on log file create/delete seem possible.

We have a number of very short-lived cron jobs whose logs I imagine can be created and then disappear before fluentd has time to tail them.

mk23w-vmware · 2021-03-18T03:48:15Z

Is there a timeline for when this bug fix will be released in a Fluentd Ruby Gem (https://rubygems.org/gems/fluentd)?

We're hitting this issue regularly and are eager to pick up the fix.

Also, since we're using https://github.com/fluent/fluentd-docker-image, we'd prefer a released Fluentd Ruby Gem over simply applying a patch to our existing Fluentd instance.

ashie · 2021-03-18T04:19:03Z

We are planning to release 1.12.2 by the end of this month.

ashie · 2021-04-02T04:46:20Z

We released v1.12.2.

This was referenced Mar 2, 2021

ci instability: check cortex distributor logs: query Expectation not fulfilled within 30 s opstrace/opstrace#432

Closed

1.12.0: tail plugin crashes worker upon ENOENT (cannot stat() file) #3273

Closed

jgehrcke added a commit to jgehrcke/fluentd that referenced this issue Mar 2, 2021

in_tail: tw setup: expect ENOENT during stat() (fix fluent#3274)

c33ccd2

Signed-off-by: Dr. Jan-Philip Gehrcke <[email protected]>

jgehrcke mentioned this issue Mar 2, 2021

in_tail: expect ENOENT during stat (try to fix #3274 and #3224) #3275

Merged

jgehrcke mentioned this issue Mar 2, 2021

systemlogs: revert fluentd bump (see #432) opstrace/opstrace#438

Merged

ashie closed this as completed in #3275 Mar 4, 2021

ashie added a commit that referenced this issue Mar 4, 2021

Merge pull request #3275 from jgehrcke/jp/in_tail_stat_enoent

4437fd9

in_tail: expect ENOENT during stat (try to fix #3274 and #3224)

ashie added a commit to ashie/fluentd that referenced this issue Mar 4, 2021

Revert "in_tail: tw setup: expect ENOENT during stat() (fix fluent#3274…

808c8bd

…)" This reverts commit c33ccd2.

ashie added a commit to ashie/fluentd that referenced this issue Mar 12, 2021

Revert "in_tail: tw setup: expect ENOENT during stat() (fix fluent#3274…

176b4c1

…)" This reverts commit c33ccd2.

ashie mentioned this issue Mar 24, 2021

Fluentd not picking new log files #3239

Closed

This was referenced Apr 9, 2021

The newest td-agent 4.1.0, Fluentd 1.12.1 does not detect log rotation #3324

Closed

in_tail throws error and crashes process #3327

Closed
