Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Restoring buffer metadata crashes worker with empty buffer and/or corrupt metadata #1760

Closed
pdecat opened this issue Nov 23, 2017 · 7 comments

Comments

@pdecat
Copy link

pdecat commented Nov 23, 2017

A bit of context first: I'm deploying fluentd on kubernetes using the fluent/fluentd:v0.14.23 docker image as three pods managed by ReplicationControllers with PersistentVolumes for storing buffers (cannot use a StatefulSet here but that's another story).

Since I have upgraded from v0.14.22 to v0.14.23, only two of those pods are running fine.
The third one is having its worker to crash loop as soon as it starts, apparently when it tries to read its buffers metadata:

I  2017-11-23 12:15:59 +0000 [info]: #0 starting fluentd worker pid=657 ppid=5 worker=0
I  2017-11-23 12:15:59 +0000 [error]: #0 unexpected error error_class=TypeError error="no implicit conversion of Symbol into Integer"
I    2017-11-23 12:15:59 +0000 [error]: #0 /usr/lib/ruby/gems/2.3.0/gems/fluentd-0.14.23/lib/fluent/plugin/buffer/file_chunk.rb:219:in `[]'
I    2017-11-23 12:15:59 +0000 [error]: #0 /usr/lib/ruby/gems/2.3.0/gems/fluentd-0.14.23/lib/fluent/plugin/buffer/file_chunk.rb:219:in `restore_metadata'
I    2017-11-23 12:15:59 +0000 [error]: #0 /usr/lib/ruby/gems/2.3.0/gems/fluentd-0.14.23/lib/fluent/plugin/buffer/file_chunk.rb:322:in `load_existing_staged_chunk'
I    2017-11-23 12:15:59 +0000 [error]: #0 /usr/lib/ruby/gems/2.3.0/gems/fluentd-0.14.23/lib/fluent/plugin/buffer/file_chunk.rb:51:in `initialize'
I    2017-11-23 12:15:59 +0000 [error]: #0 /usr/lib/ruby/gems/2.3.0/gems/fluentd-0.14.23/lib/fluent/plugin/buf_file.rb:144:in `new'
|    2017-11-23 12:15:59 +0000 [error]: #0 /usr/lib/ruby/gems/2.3.0/gems/fluentd-0.14.23/lib/fluent/plugin/buf_file.rb:144:in `block in resume'
I    2017-11-23 12:15:59 +0000 [error]: #0 /usr/lib/ruby/gems/2.3.0/gems/fluentd-0.14.23/lib/fluent/plugin/buf_file.rb:133:in `glob'
I    2017-11-23 12:15:59 +0000 [error]: #0 /usr/lib/ruby/gems/2.3.0/gems/fluentd-0.14.23/lib/fluent/plugin/buf_file.rb:133:in `resume'
I    2017-11-23 12:15:59 +0000 [error]: #0 /usr/lib/ruby/gems/2.3.0/gems/fluentd-0.14.23/lib/fluent/plugin/buffer.rb:171:in `start'
I    2017-11-23 12:15:59 +0000 [error]: #0 /usr/lib/ruby/gems/2.3.0/gems/fluentd-0.14.23/lib/fluent/plugin/buf_file.rb:120:in `start'
I    2017-11-23 12:15:59 +0000 [error]: #0 /usr/lib/ruby/gems/2.3.0/gems/fluentd-0.14.23/lib/fluent/plugin/output.rb:415:in `start'
I    2017-11-23 12:15:59 +0000 [error]: #0 /usr/lib/ruby/gems/2.3.0/gems/fluentd-0.14.23/lib/fluent/root_agent.rb:165:in `block in start'
I    2017-11-23 12:15:59 +0000 [error]: #0 /usr/lib/ruby/gems/2.3.0/gems/fluentd-0.14.23/lib/fluent/root_agent.rb:154:in `block (2 levels) in lifecycle'
I    2017-11-23 12:15:59 +0000 [error]: #0 /usr/lib/ruby/gems/2.3.0/gems/fluentd-0.14.23/lib/fluent/root_agent.rb:153:in `each'
I    2017-11-23 12:15:59 +0000 [error]: #0 /usr/lib/ruby/gems/2.3.0/gems/fluentd-0.14.23/lib/fluent/root_agent.rb:153:in `block in lifecycle'
I    2017-11-23 12:15:59 +0000 [error]: #0 /usr/lib/ruby/gems/2.3.0/gems/fluentd-0.14.23/lib/fluent/root_agent.rb:140:in `each'
I    2017-11-23 12:15:59 +0000 [error]: #0 /usr/lib/ruby/gems/2.3.0/gems/fluentd-0.14.23/lib/fluent/root_agent.rb:140:in `lifecycle'
I    2017-11-23 12:15:59 +0000 [error]: #0 /usr/lib/ruby/gems/2.3.0/gems/fluentd-0.14.23/lib/fluent/root_agent.rb:164:in `start'
I    2017-11-23 12:15:59 +0000 [error]: #0 /usr/lib/ruby/gems/2.3.0/gems/fluentd-0.14.23/lib/fluent/engine.rb:274:in `start'
I    2017-11-23 12:15:59 +0000 [error]: #0 /usr/lib/ruby/gems/2.3.0/gems/fluentd-0.14.23/lib/fluent/engine.rb:219:in `run'
I    2017-11-23 12:15:59 +0000 [error]: #0 /usr/lib/ruby/gems/2.3.0/gems/fluentd-0.14.23/lib/fluent/supervisor.rb:774:in `run_engine'
I    2017-11-23 12:15:59 +0000 [error]: #0 /usr/lib/ruby/gems/2.3.0/gems/fluentd-0.14.23/lib/fluent/supervisor.rb:523:in `block in run_worker'
I    2017-11-23 12:15:59 +0000 [error]: #0 /usr/lib/ruby/gems/2.3.0/gems/fluentd-0.14.23/lib/fluent/supervisor.rb:699:in `main_process'
I    2017-11-23 12:15:59 +0000 [error]: #0 /usr/lib/ruby/gems/2.3.0/gems/fluentd-0.14.23/lib/fluent/supervisor.rb:518:in `run_worker'
I    2017-11-23 12:15:59 +0000 [error]: #0 /usr/lib/ruby/gems/2.3.0/gems/fluentd-0.14.23/lib/fluent/command/fluentd.rb:316:in `<top (required)>'
I    2017-11-23 12:15:59 +0000 [error]: #0 /usr/lib/ruby/2.3.0/rubygems/core_ext/kernel_require.rb:55:in `require'
I    2017-11-23 12:15:59 +0000 [error]: #0 /usr/lib/ruby/2.3.0/rubygems/core_ext/kernel_require.rb:55:in `require'
I    2017-11-23 12:15:59 +0000 [error]: #0 /usr/lib/ruby/gems/2.3.0/gems/fluentd-0.14.23/bin/fluentd:5:in `<top (required)>'
I    2017-11-23 12:15:59 +0000 [error]: #0 /usr/bin/fluentd:22:in `load'
I    2017-11-23 12:15:59 +0000 [error]: #0 /usr/bin/fluentd:22:in `<main>'
I  2017-11-23 12:15:59 +0000 [error]: #0 unexpected error error_class=TypeError error="no implicit conversion of Symbol into Integer"
I    2017-11-23 12:15:59 +0000 [error]: #0 suppressed same stacktrace
I  2017-11-23 12:15:59 +0000 [info]: fluent/log.rb:316:info: Worker 0 finished unexpectedly with status 1 

This happens at https://github.com/fluent/fluentd/blob/v0.14.23/lib/fluent/plugin/buffer/file_chunk.rb#L219

The buffers of this pod look corrupt:

# ls -latr /var/log/pods/                                 
total 168                                     
drwxrwS---    2 root     fluent       16384 Nov 15 11:46 lost+found                          
drwxrwsr-x    3 root     fluent      151552 Nov 16 14:40 .                                   
-rw-rw-r--    1 fluent   fluent          75 Nov 16 14:40 buffer.b55e1a98d6e429d3dda1b152fd10035da.log.meta                                                                                
-rw-rw-r--    1 fluent   fluent           0 Nov 16 14:40 buffer.b55e1a98d6e429d3dda1b152fd10035da.log                                                                                     
-rw-rw-r--    1 fluent   fluent          76 Nov 16 14:40 buffer.b55e1a98689733872d3b28dbb9c3483e4.log.meta                                                                                
-rw-rw-r--    1 fluent   fluent           0 Nov 16 14:40 buffer.b55e1a98689733872d3b28dbb9c3483e4.log                                                                                     
drwxr-xr-x    1 root     root          4096 Nov 23 12:12 ..                                  
# hexdump /var/log/pods/buffer.b55e1a98d6e429d3dda1b152fd10035da.log.meta                                                                              
0000000 0000 0000 0000 0000 0000 0000 0000 0000
*
0000040
hexdump /var/log/pods/buffer.b55e1a98689733872d3b28dbb9c3483e4.log.meta
0000000 0000 0000 0000 0000 0000 0000 0000 0000
*
0000040

I've deleted those buffers to workaround this issue for now and the pod is back to normal.

Not sure the upgrade is the culprit, perhaps it was just the kill that happened at the wrong time.

Maybe the restore_metadata could add checks to avoid this issue.

@repeatedly
Copy link
Member

Maybe the restore_metadata could add checks to avoid this issue.

It seems good.
I'm considering adding backup_dir, default is /tmp, to <system>. If the problem happens during restore, move broken files into backup_dir.

@rhefner1
Copy link

I just hit this issue as part of a Kubernetes deployment. It would be really great if fluentd could handle this kind of failure without going into a crash backoff loop until I manually SSH in and delete the corrupted buffers.

@f0
Copy link

f0 commented Feb 26, 2018

run into the same, also a Kubernetes Deploymdent

@repeatedly
Copy link
Member

I wrote a patch for avoiding this problem: #1874

I want to know which storage use for file buffer. Local storage or network/distributed storage like NFS?

@rhefner1
Copy link

@repeatedly I use local storage for the fluentd buffer before sending to elasticsearch.

@f0
Copy link

f0 commented Feb 28, 2018

Local storage

@pdecat
Copy link
Author

pdecat commented Feb 28, 2018

Google Persistent Disk storage mounted into pods via Persistent Volume Claims, closer in behavor to local storage than distributed storage.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants