Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

crashes cause bad chunks that fail to send #1638

Open
notslang opened this issue Jul 20, 2017 · 11 comments
Open

crashes cause bad chunks that fail to send #1638

notslang opened this issue Jul 20, 2017 · 11 comments
Labels
bug Something isn't working

Comments

@notslang
Copy link

I had a power failure and upon restarting the server, this happened:

2017-07-18 18:05:27 +0000 [info]: adding match pattern="**" type="null"
2017-07-18 18:05:27 +0000 [info]: adding source type="forward"
2017-07-18 18:05:27 +0000 [info]: #0 starting fluentd worker pid=20 ppid=16 worker=0
2017-07-18 18:05:31 +0000 [info]: #0 delayed_commit_timeout is overwritten by ack_response_timeout
2017-07-18 18:05:31 +0000 [info]: #0 [input1] listening port port=24224 bind="0.0.0.0"
2017-07-18 18:05:31 +0000 [info]: #0 fluentd worker is now running worker=0
2017-07-18 22:25:08 +0000 [warn]: #0 failed to flush the buffer. retry_time=0 next_retry_seconds=2017-07-18 22:25:09 +0000 chunk="5541970ab3b8c59d8f012cd3222cc93c" error_class=TypeError error="can't dup NilClass"
  2017-07-18 22:25:08 +0000 [warn]: #0 /usr/lib/ruby/gems/2.3.0/gems/fluentd-0.14.17/lib/fluent/plugin/out_forward.rb:614:in `dup'
  2017-07-18 22:25:08 +0000 [warn]: #0 /usr/lib/ruby/gems/2.3.0/gems/fluentd-0.14.17/lib/fluent/plugin/out_forward.rb:614:in `send_data_actual'
  2017-07-18 22:25:08 +0000 [warn]: #0 /usr/lib/ruby/gems/2.3.0/gems/fluentd-0.14.17/lib/fluent/plugin/out_forward.rb:631:in `send_data'
  2017-07-18 22:25:08 +0000 [warn]: #0 /usr/lib/ruby/gems/2.3.0/gems/fluentd-0.14.17/lib/fluent/plugin/out_forward.rb:285:in `block in try_write'
  2017-07-18 22:25:08 +0000 [warn]: #0 /usr/lib/ruby/gems/2.3.0/gems/fluentd-0.14.17/lib/fluent/plugin/out_forward.rb:304:in `block in select_a_healthy_node'
  2017-07-18 22:25:08 +0000 [warn]: #0 /usr/lib/ruby/gems/2.3.0/gems/fluentd-0.14.17/lib/fluent/plugin/out_forward.rb:298:in `times'
  2017-07-18 22:25:08 +0000 [warn]: #0 /usr/lib/ruby/gems/2.3.0/gems/fluentd-0.14.17/lib/fluent/plugin/out_forward.rb:298:in `select_a_healthy_node'
  2017-07-18 22:25:08 +0000 [warn]: #0 /usr/lib/ruby/gems/2.3.0/gems/fluentd-0.14.17/lib/fluent/plugin/out_forward.rb:285:in `try_write'
  2017-07-18 22:25:08 +0000 [warn]: #0 /usr/lib/ruby/gems/2.3.0/gems/fluentd-0.14.17/lib/fluent/plugin/output.rb:1034:in `try_flush'
  2017-07-18 22:25:08 +0000 [warn]: #0 /usr/lib/ruby/gems/2.3.0/gems/fluentd-0.14.17/lib/fluent/plugin/output.rb:1268:in `flush_thread_run'
  2017-07-18 22:25:08 +0000 [warn]: #0 /usr/lib/ruby/gems/2.3.0/gems/fluentd-0.14.17/lib/fluent/plugin/output.rb:420:in `block (2 levels) in start'
  2017-07-18 22:25:08 +0000 [warn]: #0 /usr/lib/ruby/gems/2.3.0/gems/fluentd-0.14.17/lib/fluent/plugin_helper/thread.rb:78:in `block in thread_create'
2017-07-18 22:25:09 +0000 [warn]: #0 retry succeeded. chunk_id="5549b016b0815316776e7f21d3a793d8"
2017-07-18 22:25:09 +0000 [warn]: #0 failed to flush the buffer. retry_time=0 next_retry_seconds=2017-07-18 22:25:10 +0000 chunk="5541970ab3b8c59d8f012cd3222cc93c" error_class=TypeError error="can't dup NilClass"
  2017-07-18 22:25:09 +0000 [warn]: #0 suppressed same stacktrace
2017-07-18 22:25:10 +0000 [warn]: #0 failed to flush the buffer. retry_time=0 next_retry_seconds=2017-07-18 22:25:10 +0000 chunk="5541970ab3b8c59d8f012cd3222cc93c" error_class=TypeError error="can't dup NilClass"
  2017-07-18 22:25:10 +0000 [warn]: #0 suppressed same stacktrace
2017-07-18 22:25:11 +0000 [warn]: #0 failed to flush the buffer. retry_time=1 next_retry_seconds=2017-07-18 22:25:11 +0000 chunk="5541970ab3b8c59d8f012cd3222cc93c" error_class=TypeError error="can't dup NilClass"
  2017-07-18 22:25:11 +0000 [warn]: #0 suppressed same stacktrace
2017-07-18 22:25:13 +0000 [warn]: #0 failed to flush the buffer. retry_time=2 next_retry_seconds=2017-07-18 22:25:13 +0000 chunk="5541970ab3b8c59d8f012cd3222cc93c" error_class=TypeError error="can't dup NilClass"
  2017-07-18 22:25:13 +0000 [warn]: #0 suppressed same stacktrace
2017-07-18 22:25:17 +0000 [warn]: #0 failed to flush the buffer. retry_time=3 next_retry_seconds=2017-07-18 22:25:17 +0000 chunk="5541970ab3b8c59d8f012cd3222cc93c" error_class=TypeError error="can't dup NilClass"
  2017-07-18 22:25:17 +0000 [warn]: #0 suppressed same stacktrace
2017-07-18 22:25:25 +0000 [warn]: #0 failed to flush the buffer. retry_time=4 next_retry_seconds=2017-07-18 22:25:25 +0000 chunk="5541970ab3b8c59d8f012cd3222cc93c" error_class=TypeError error="can't dup NilClass"
  2017-07-18 22:25:25 +0000 [warn]: #0 suppressed same stacktrace
2017-07-18 22:25:41 +0000 [warn]: #0 failed to flush the buffer. retry_time=5 next_retry_seconds=2017-07-18 22:25:41 +0000 chunk="5541970ab3b8c59d8f012cd3222cc93c" error_class=TypeError error="can't dup NilClass"
  2017-07-18 22:25:41 +0000 [warn]: #0 suppressed same stacktrace
2017-07-18 22:26:11 +0000 [warn]: #0 failed to flush the buffer. retry_time=6 next_retry_seconds=2017-07-18 22:26:11 +0000 chunk="5541970ab3b8c59d8f012cd3222cc93c" error_class=TypeError error="can't dup NilClass"
  2017-07-18 22:26:11 +0000 [warn]: #0 suppressed same stacktrace
2017-07-18 22:27:09 +0000 [warn]: #0 failed to flush the buffer. retry_time=7 next_retry_seconds=2017-07-18 22:27:09 +0000 chunk="5541970ab3b8c59d8f012cd3222cc93c" error_class=TypeError error="can't dup NilClass"
  2017-07-18 22:27:09 +0000 [warn]: #0 suppressed same stacktrace
2017-07-18 22:29:15 +0000 [warn]: #0 failed to flush the buffer. retry_time=8 next_retry_seconds=2017-07-18 22:29:15 +0000 chunk="5541970ab3b8c59d8f012cd3222cc93c" error_class=TypeError error="can't dup NilClass"
  2017-07-18 22:29:15 +0000 [warn]: #0 suppressed same stacktrace
2017-07-18 22:33:24 +0000 [warn]: #0 failed to flush the buffer. retry_time=9 next_retry_seconds=2017-07-18 22:33:24 +0000 chunk="5541970ab3b8c59d8f012cd3222cc93c" error_class=TypeError error="can't dup NilClass"
  2017-07-18 22:33:24 +0000 [warn]: #0 suppressed same stacktrace

Given the fact that this hasn't happened in the last month, and the chunk that it's referencing buffer.q5541970ab3b8c59d8f012cd3222cc93c.log was created right around the time the power failed, I think that it is somehow corrupted by an incomplete write. This is a problem because the error message makes no sense and crashes shouldn't cause corruption / data loss for things that are already written to the disk.

It doesn't even start sending the other chunks that aren't corrupted, I need to shut down fluentd, remove the bad chunk, and restart fluentd to get it to continue sending.

I can send you the bad chunk and meta file if you want.

@repeatedly repeatedly added the bug Something isn't working label Jul 21, 2017
@repeatedly
Copy link
Member

around the time the power failed

Does this mean fluentd running server hit unexpected machine shutdown, right?
Hmm... in such cases, automatic recover is hard, so need to check files and corrupted chunks should be routed to other place, e.g. local file.
Is the corrupted file meta data or actual chunk file?

@notslang
Copy link
Author

notslang commented Jul 21, 2017

Does this mean fluentd running server hit unexpected machine shutdown, right?

Yes - the power cut off, so nothing had time to shut down.

I don't know if it's the metadata file or the actual chunk file. However, it shouldn't be too hard to just omit whatever incomplete data was written to the end of the file, right?

I don't expect fluentd to save data that was submitted between the last time data was successfully flushed to the file buffer and the crash. However, it shouldn't cause the entire chunk to fail and the error/warning that it prints when it encounters bad data should be better... If I didn't know about that power failure I would be really confused.

@notslang
Copy link
Author

I got this error again after running out of disk space on a node.

@mchesler
Copy link

I'm experiencing this issue as well. It's occurring on fluentd restart, not a machine crash. It doesn't happen consistently, but is happening frequently.

@repeatedly
Copy link
Member

@mchesler It means you hit disk space issue and it causes broken chunk?

BTW, I will add backup feature to move bad chunks to other place.

@mchesler
Copy link

@repeatedly None of the machines where this has happened to me have com anywhere close to filling their disks, they’ve simply had the fluentd process restarted.

@johanneswuerbach
Copy link

Experiencing the same, the fluent processes were OOMKilled before. @repeatedly did you had any chance to work on the "move bad chunks to other place" feature or could you point me the code handling this?

@kumaravel29
Copy link

I am using Elasticsearch plugin to send data to the ES with file buffer. In case of any exception in indexing the document to elasticsearch, the chunk is retained for retry, but this chunk can never be successful. Because of multiple retry attempts happening, the other chunks are not processed causing consistency in data flow.
Is there a way to clear or move the bad chunk?

@naegelin
Copy link

+1 we experience the exact same issues as @kumaravel29 . The problem is actually so pervasive that we're in the process of ripping td-agent out of our infrastructure.

@ku-s-h
Copy link

ku-s-h commented Oct 24, 2018

@kumaravel29 try using tags: https://github.com/uken/fluent-plugin-elasticsearch#retry_tag

@kenhys
Copy link
Contributor

kenhys commented Oct 29, 2020

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

8 participants