crashes cause bad chunks that fail to send #1638

notslang · 2017-07-20T01:19:29Z

I had a power failure and upon restarting the server, this happened:

2017-07-18 18:05:27 +0000 [info]: adding match pattern="**" type="null"
2017-07-18 18:05:27 +0000 [info]: adding source type="forward"
2017-07-18 18:05:27 +0000 [info]: #0 starting fluentd worker pid=20 ppid=16 worker=0
2017-07-18 18:05:31 +0000 [info]: #0 delayed_commit_timeout is overwritten by ack_response_timeout
2017-07-18 18:05:31 +0000 [info]: #0 [input1] listening port port=24224 bind="0.0.0.0"
2017-07-18 18:05:31 +0000 [info]: #0 fluentd worker is now running worker=0
2017-07-18 22:25:08 +0000 [warn]: #0 failed to flush the buffer. retry_time=0 next_retry_seconds=2017-07-18 22:25:09 +0000 chunk="5541970ab3b8c59d8f012cd3222cc93c" error_class=TypeError error="can't dup NilClass"
  2017-07-18 22:25:08 +0000 [warn]: #0 /usr/lib/ruby/gems/2.3.0/gems/fluentd-0.14.17/lib/fluent/plugin/out_forward.rb:614:in `dup'
  2017-07-18 22:25:08 +0000 [warn]: #0 /usr/lib/ruby/gems/2.3.0/gems/fluentd-0.14.17/lib/fluent/plugin/out_forward.rb:614:in `send_data_actual'
  2017-07-18 22:25:08 +0000 [warn]: #0 /usr/lib/ruby/gems/2.3.0/gems/fluentd-0.14.17/lib/fluent/plugin/out_forward.rb:631:in `send_data'
  2017-07-18 22:25:08 +0000 [warn]: #0 /usr/lib/ruby/gems/2.3.0/gems/fluentd-0.14.17/lib/fluent/plugin/out_forward.rb:285:in `block in try_write'
  2017-07-18 22:25:08 +0000 [warn]: #0 /usr/lib/ruby/gems/2.3.0/gems/fluentd-0.14.17/lib/fluent/plugin/out_forward.rb:304:in `block in select_a_healthy_node'
  2017-07-18 22:25:08 +0000 [warn]: #0 /usr/lib/ruby/gems/2.3.0/gems/fluentd-0.14.17/lib/fluent/plugin/out_forward.rb:298:in `times'
  2017-07-18 22:25:08 +0000 [warn]: #0 /usr/lib/ruby/gems/2.3.0/gems/fluentd-0.14.17/lib/fluent/plugin/out_forward.rb:298:in `select_a_healthy_node'
  2017-07-18 22:25:08 +0000 [warn]: #0 /usr/lib/ruby/gems/2.3.0/gems/fluentd-0.14.17/lib/fluent/plugin/out_forward.rb:285:in `try_write'
  2017-07-18 22:25:08 +0000 [warn]: #0 /usr/lib/ruby/gems/2.3.0/gems/fluentd-0.14.17/lib/fluent/plugin/output.rb:1034:in `try_flush'
  2017-07-18 22:25:08 +0000 [warn]: #0 /usr/lib/ruby/gems/2.3.0/gems/fluentd-0.14.17/lib/fluent/plugin/output.rb:1268:in `flush_thread_run'
  2017-07-18 22:25:08 +0000 [warn]: #0 /usr/lib/ruby/gems/2.3.0/gems/fluentd-0.14.17/lib/fluent/plugin/output.rb:420:in `block (2 levels) in start'
  2017-07-18 22:25:08 +0000 [warn]: #0 /usr/lib/ruby/gems/2.3.0/gems/fluentd-0.14.17/lib/fluent/plugin_helper/thread.rb:78:in `block in thread_create'
2017-07-18 22:25:09 +0000 [warn]: #0 retry succeeded. chunk_id="5549b016b0815316776e7f21d3a793d8"
2017-07-18 22:25:09 +0000 [warn]: #0 failed to flush the buffer. retry_time=0 next_retry_seconds=2017-07-18 22:25:10 +0000 chunk="5541970ab3b8c59d8f012cd3222cc93c" error_class=TypeError error="can't dup NilClass"
  2017-07-18 22:25:09 +0000 [warn]: #0 suppressed same stacktrace
2017-07-18 22:25:10 +0000 [warn]: #0 failed to flush the buffer. retry_time=0 next_retry_seconds=2017-07-18 22:25:10 +0000 chunk="5541970ab3b8c59d8f012cd3222cc93c" error_class=TypeError error="can't dup NilClass"
  2017-07-18 22:25:10 +0000 [warn]: #0 suppressed same stacktrace
2017-07-18 22:25:11 +0000 [warn]: #0 failed to flush the buffer. retry_time=1 next_retry_seconds=2017-07-18 22:25:11 +0000 chunk="5541970ab3b8c59d8f012cd3222cc93c" error_class=TypeError error="can't dup NilClass"
  2017-07-18 22:25:11 +0000 [warn]: #0 suppressed same stacktrace
2017-07-18 22:25:13 +0000 [warn]: #0 failed to flush the buffer. retry_time=2 next_retry_seconds=2017-07-18 22:25:13 +0000 chunk="5541970ab3b8c59d8f012cd3222cc93c" error_class=TypeError error="can't dup NilClass"
  2017-07-18 22:25:13 +0000 [warn]: #0 suppressed same stacktrace
2017-07-18 22:25:17 +0000 [warn]: #0 failed to flush the buffer. retry_time=3 next_retry_seconds=2017-07-18 22:25:17 +0000 chunk="5541970ab3b8c59d8f012cd3222cc93c" error_class=TypeError error="can't dup NilClass"
  2017-07-18 22:25:17 +0000 [warn]: #0 suppressed same stacktrace
2017-07-18 22:25:25 +0000 [warn]: #0 failed to flush the buffer. retry_time=4 next_retry_seconds=2017-07-18 22:25:25 +0000 chunk="5541970ab3b8c59d8f012cd3222cc93c" error_class=TypeError error="can't dup NilClass"
  2017-07-18 22:25:25 +0000 [warn]: #0 suppressed same stacktrace
2017-07-18 22:25:41 +0000 [warn]: #0 failed to flush the buffer. retry_time=5 next_retry_seconds=2017-07-18 22:25:41 +0000 chunk="5541970ab3b8c59d8f012cd3222cc93c" error_class=TypeError error="can't dup NilClass"
  2017-07-18 22:25:41 +0000 [warn]: #0 suppressed same stacktrace
2017-07-18 22:26:11 +0000 [warn]: #0 failed to flush the buffer. retry_time=6 next_retry_seconds=2017-07-18 22:26:11 +0000 chunk="5541970ab3b8c59d8f012cd3222cc93c" error_class=TypeError error="can't dup NilClass"
  2017-07-18 22:26:11 +0000 [warn]: #0 suppressed same stacktrace
2017-07-18 22:27:09 +0000 [warn]: #0 failed to flush the buffer. retry_time=7 next_retry_seconds=2017-07-18 22:27:09 +0000 chunk="5541970ab3b8c59d8f012cd3222cc93c" error_class=TypeError error="can't dup NilClass"
  2017-07-18 22:27:09 +0000 [warn]: #0 suppressed same stacktrace
2017-07-18 22:29:15 +0000 [warn]: #0 failed to flush the buffer. retry_time=8 next_retry_seconds=2017-07-18 22:29:15 +0000 chunk="5541970ab3b8c59d8f012cd3222cc93c" error_class=TypeError error="can't dup NilClass"
  2017-07-18 22:29:15 +0000 [warn]: #0 suppressed same stacktrace
2017-07-18 22:33:24 +0000 [warn]: #0 failed to flush the buffer. retry_time=9 next_retry_seconds=2017-07-18 22:33:24 +0000 chunk="5541970ab3b8c59d8f012cd3222cc93c" error_class=TypeError error="can't dup NilClass"
  2017-07-18 22:33:24 +0000 [warn]: #0 suppressed same stacktrace

Given the fact that this hasn't happened in the last month, and the chunk that it's referencing buffer.q5541970ab3b8c59d8f012cd3222cc93c.log was created right around the time the power failed, I think that it is somehow corrupted by an incomplete write. This is a problem because the error message makes no sense and crashes shouldn't cause corruption / data loss for things that are already written to the disk.

It doesn't even start sending the other chunks that aren't corrupted, I need to shut down fluentd, remove the bad chunk, and restart fluentd to get it to continue sending.

I can send you the bad chunk and meta file if you want.

The text was updated successfully, but these errors were encountered:

repeatedly · 2017-07-21T03:25:41Z

around the time the power failed

Does this mean fluentd running server hit unexpected machine shutdown, right?
Hmm... in such cases, automatic recover is hard, so need to check files and corrupted chunks should be routed to other place, e.g. local file.
Is the corrupted file meta data or actual chunk file?

notslang · 2017-07-21T03:55:17Z

Does this mean fluentd running server hit unexpected machine shutdown, right?

Yes - the power cut off, so nothing had time to shut down.

I don't know if it's the metadata file or the actual chunk file. However, it shouldn't be too hard to just omit whatever incomplete data was written to the end of the file, right?

I don't expect fluentd to save data that was submitted between the last time data was successfully flushed to the file buffer and the crash. However, it shouldn't cause the entire chunk to fail and the error/warning that it prints when it encounters bad data should be better... If I didn't know about that power failure I would be really confused.

notslang · 2017-09-10T18:10:48Z

I got this error again after running out of disk space on a node.

mchesler · 2017-11-27T15:45:10Z

I'm experiencing this issue as well. It's occurring on fluentd restart, not a machine crash. It doesn't happen consistently, but is happening frequently.

repeatedly · 2017-11-28T08:49:24Z

@mchesler It means you hit disk space issue and it causes broken chunk?

BTW, I will add backup feature to move bad chunks to other place.

mchesler · 2017-11-28T11:05:23Z

@repeatedly None of the machines where this has happened to me have com anywhere close to filling their disks, they’ve simply had the fluentd process restarted.

johanneswuerbach · 2018-03-09T09:34:47Z

Experiencing the same, the fluent processes were OOMKilled before. @repeatedly did you had any chance to work on the "move bad chunks to other place" feature or could you point me the code handling this?

kumaravel29 · 2018-05-01T20:10:24Z

I am using Elasticsearch plugin to send data to the ES with file buffer. In case of any exception in indexing the document to elasticsearch, the chunk is retained for retry, but this chunk can never be successful. Because of multiple retry attempts happening, the other chunks are not processed causing consistency in data flow.
Is there a way to clear or move the bad chunk?

naegelin · 2018-06-11T09:13:12Z

+1 we experience the exact same issues as @kumaravel29 . The problem is actually so pervasive that we're in the process of ripping td-agent out of our infrastructure.

ku-s-h · 2018-10-24T11:13:40Z

@kumaravel29 try using tags: https://github.com/uken/fluent-plugin-elasticsearch#retry_tag

kenhys · 2020-10-29T07:47:33Z

FYI: since v1.2.0, https://github.com/fluent/fluentd/blob/master/CHANGELOG.md#release-v120---20180430
output: Backup for broken chunks is supported. It may help you.

repeatedly added the bug Something isn't working label Jul 21, 2017

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

crashes cause bad chunks that fail to send #1638

crashes cause bad chunks that fail to send #1638

notslang commented Jul 20, 2017

repeatedly commented Jul 21, 2017

notslang commented Jul 21, 2017 •

edited

Loading

notslang commented Sep 10, 2017

mchesler commented Nov 27, 2017

repeatedly commented Nov 28, 2017

mchesler commented Nov 28, 2017

johanneswuerbach commented Mar 9, 2018

kumaravel29 commented May 1, 2018

naegelin commented Jun 11, 2018

ku-s-h commented Oct 24, 2018

kenhys commented Oct 29, 2020

crashes cause bad chunks that fail to send #1638

crashes cause bad chunks that fail to send #1638

Comments

notslang commented Jul 20, 2017

repeatedly commented Jul 21, 2017

notslang commented Jul 21, 2017 • edited Loading

notslang commented Sep 10, 2017

mchesler commented Nov 27, 2017

repeatedly commented Nov 28, 2017

mchesler commented Nov 28, 2017

johanneswuerbach commented Mar 9, 2018

kumaravel29 commented May 1, 2018

naegelin commented Jun 11, 2018

ku-s-h commented Oct 24, 2018

kenhys commented Oct 29, 2020

notslang commented Jul 21, 2017 •

edited

Loading