1.12.0: tail plugin crashes worker upon ENOENT (cannot stat() file) #3274
Comments
Signed-off-by: Dr. Jan-Philip Gehrcke <[email protected]>
Attempt to fix: #3275 -- quick review and feedback appreciated. Thanks!
Thanks for your work here @jgehrcke - we noticed lost logs after [...]. We have a number of very short-lived cron jobs whose logs, I imagine, can be created and then disappear before fluentd has time to tail them.
…)" This reverts commit c33ccd2.
Is there a timeline for when this bug fix will be released in a Fluentd Ruby Gem (https://rubygems.org/gems/fluentd)? We're hitting this issue regularly and are eager to pick up the fix. Also, since we're using https://github.com/fluent/fluentd-docker-image, we'd prefer a Fluentd Ruby Gem release over simply applying a patch to our existing Fluentd instance.
We are planning to release 1.12.2 by the end of this month.
We released v1.12.2.
Describe the bug
The in_tail plugin might discover a new log file and then crash the worker on the stat() system call due to ENOENT.
I think this is a classic race condition. As this is part of the inotify-based watch system, there is of course a chance that the file has already disappeared again between the notification that a new file exists and the code accessing it.
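To make the race concrete, here is a minimal Ruby sketch that is not taken from in_tail: it creates a file, deletes it, and only then calls File.stat, which is the order of events a watcher sees when a short-lived log vanishes before it is inspected. The helper name stat_after_removal and the file name are made up for illustration.

```ruby
require "tmpdir"
require "fileutils"

# Hypothetical helper (not Fluentd code): simulates "file discovered,
# then removed before stat()" and reports what stat() does.
def stat_after_removal
  Dir.mktmpdir do |dir|
    path = File.join(dir, "short-lived.log")
    FileUtils.touch(path)   # the file appears; a watcher would be notified here
    File.delete(path)       # ...and it disappears before anyone inspects it
    begin
      File.stat(path)       # an unguarded call like this is what takes the worker down
      :ok
    rescue Errno::ENOENT => e
      e.class               # stat() on a vanished path raises Errno::ENOENT
    end
  end
end

puts stat_after_removal
```

Without the rescue clause the exception would propagate to the caller, which matches the crash described in this report.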
To Reproduce
Run in an environment where log files pop up and quickly disappear again (this is a guess; we have no clear repro on our side).
Expected behavior
At the system call boundary between application and OS, individual system calls like stat()/open() should always be wrapped with local error handling that expects the system call to fail.
The operation can then be retried in a graceful-degradation fashion, without affecting the other operations performed by the worker process.
An appropriate action would be to log a warning/error showing how the system call failed.
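As a rough illustration of the behavior requested here, the sketch below wraps File.stat in local error handling, logs a warning, and lets the caller skip the path. It is not the actual fix from #3275; safe_stat is a hypothetical helper, and which errno values to rescue is an assumption.

```ruby
require "logger"

# Hypothetical helper, not part of Fluentd: stat a path, but treat a
# vanished (or unreadable) file as a non-fatal condition.
def safe_stat(path, logger)
  File.stat(path)
rescue Errno::ENOENT, Errno::EACCES => e
  # Log how the system call failed and let the caller carry on.
  logger.warn("stat() failed for #{path}: #{e.message}; skipping")
  nil
end

logger = Logger.new($stderr)
[".", "/nonexistent/short-lived.log"].each do |path|
  stat = safe_stat(path, logger)
  next if stat.nil?   # file is gone; other watched paths are unaffected
  puts "#{path}: inode=#{stat.ino} size=#{stat.size}"
end
```

Returning nil keeps the failure local to one path. A caller that wants retries could simply call safe_stat again on the next refresh cycle.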
Your Environment
Fluentd 1.12.0 on Linux
Not retried with 1.12.1 -- same problem seems to be there, based on commits and changelog.
Your Configuration
...
Your Error Log
see above
Additional context
The stat() call that crashes the worker is https://github.com/fluent/fluentd/blame/65a9edf4e05cf64c7ed0de56e12b3c9a0774e0d6/lib/fluent/plugin/in_tail.rb#L409
Looks like even before the last refactor this stat() wasn't protected:
https://github.com/fluent/fluentd/pull/3196/files#diff-456fdeb51bc472beb48891caac0d063e0073655dba7ac2b72e6fdc67dc6ac802R409
The worker subsequently restarts. As this affects the super early worker startup phase and results in an immediate crash, potentially a crash loop, I am not sure if this might result in data loss. #3224 has more detail about that.