Add more checks for buffer corruption on startup #3970

fujimotos · 2022-11-25T08:32:34Z

Is your feature request related to a problem? Please describe.

Currently when starting up Fluentd outputs, we try to check if each buffer chunk
is non-empty, and if it has some bytes, we assume it contains valid data.

This model turned out to have a few issues:

* If the data is in fact corrupted, the behavior is undefined. It is likely to cause
  many kinds of errors in various parts of the pipeline.
* It is also hard to tell from td-agent.log which chunks were corrupted.
  This matters because users will probably want to recover the lost data.

We should perform more rigorous buffer checks on startup,
so that Fluentd can handle corrupted chunks gracefully.

Describe the solution you'd like

* Perform more sanity checks on buffer chunks on startup.
* Emit more error logs regarding the corrupted chunks.

Describe alternatives you've considered

N/A

Additional context

No response

daipom · 2022-11-25T09:53:43Z

Thanks for summarizing this issue!
I'm willing to consider this issue in December.
Here are my impressions at present.

> It is also hard to tell which chunks were corrupted from td-agent.log.

I feel the following log should be at the info level, since these chunks could be a problem after an abnormal system termination, such as a machine power failure.

fluentd/lib/fluent/plugin/buf_file.rb, line 148 in 981decb:

    log.debug { "restoring buffer file: path = #{path}" }

> Perform more sanity checks on buffer chunks on startup.

FileChunk initialization already has a check, but it can only detect corruption of the metadata.
If the chunk body file is corrupt, processing simply continues to the next step.

fluentd/lib/fluent/plugin/buffer/file_chunk.rb, lines 339 to 345 in 981decb:

    begin
      restore_metadata(@meta.read)
    rescue => e
      @chunk.close
      @meta.close
      raise FileChunkError, "staged meta file is broken. #{e.message}"
    end

In MessagePackEventStream, the size (the number of records) is initially taken from the metadata and is updated after unpacking.
I wonder whether this size could be used to detect corruption, but the description suggests that the metadata values are not very reliable, so it depends on that point.

fluentd/lib/fluent/event.rb, lines 236 to 247 in 981decb:

    def ensure_unpacked!(unpacker: nil)
      return if @unpacked_times && @unpacked_records
      @unpacked_times = []
      @unpacked_records = []
      (unpacker || Fluent::MessagePackFactory.msgpack_unpacker).feed_each(@data) do |time, record|
        @unpacked_times << time
        @unpacked_records << record
      end
      # @size should be updated always right after unpack.
      # The real size of unpacked objects are correct, rather than given size.
      @size = @unpacked_times.size
    end
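
The size comparison mentioned above could look roughly like the following. This is only a sketch: `count_mismatch?` is a hypothetical helper that does not exist in Fluentd, it assumes the record count stored in the metadata can be trusted (which the issue description puts in doubt), and plain arrays of `[time, record]` pairs stand in for an unpacked event stream.

```ruby
# Hypothetical helper, not part of Fluentd: compare the record count claimed
# by the chunk metadata with the number of entries the stream actually yields.
def count_mismatch?(expected_size, event_stream)
  actual = 0
  event_stream.each { |_time, _record| actual += 1 }
  actual != expected_size
end

# Plain arrays of [time, record] pairs stand in for an unpacked event stream.
intact    = [[1669365154, { "message" => "a" }], [1669365155, { "message" => "b" }]]
truncated = intact.first(1)

puts count_mismatch?(2, intact)     # false
puts count_mismatch?(2, truncated)  # true
```

A mismatch would only be a hint, not proof, because either the metadata or the body could be the damaged part.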

in_forward has the feature of checking the stream, but it is disabled by default, so we may not notice that corrupted data is being sent.

fluentd/lib/fluent/plugin/in_forward.rb, lines 367 to 377 in 981decb:

    def check_and_skip_invalid_event(tag, es, remote_host)
      new_es = Fluent::MultiEventStream.new
      es.each { |time, record|
        if invalid_event?(tag, time, record)
          log.warn "skip invalid event:", host: remote_host, tag: tag, time: time, record: record
          next
        end
        new_es.add(time, record)
      }
      new_es
    end

daipom · 2022-12-28T08:48:04Z

I am examining this issue.

The most important thing is to detect file corruption at abnormal system termination, such as a machine power failure.
To handle this, we should improve the process of loading the existing chunk files at startup.

> I feel the next log should be the info level since these chunks could be a problem in the case of abnormal system termination, such as a machine power failure.

fluentd/lib/fluent/plugin/buf_file.rb, line 148 in 981decb:

    log.debug { "restoring buffer file: path = #{path}" }

This log should be at least info level when flush_at_shutdown is true.
In that case, warn would also be reasonable.

Without this log, even if we notice on the destination server that we have received corrupted data, we cannot tell which chunks may have been corrupted.
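
As a rough illustration of the log level idea, the sketch below picks the level from a `flush_at_shutdown` flag. It uses Ruby's standard `Logger` rather than Fluentd's own logger; `log_restoring` and the chunk path are hypothetical names chosen for this example, and the actual change in Fluentd may look different.

```ruby
require "logger"

# Hypothetical helper, not Fluentd code: when flush_at_shutdown is true,
# a chunk left over at startup is unexpected, so log it at warn level.
# Otherwise keep the existing debug level.
def log_restoring(logger, path, flush_at_shutdown:)
  message = "restoring buffer file: path = #{path}"
  if flush_at_shutdown
    logger.warn(message)
  else
    logger.debug(message)
  end
end

logger = Logger.new($stdout)
# The path below is made up for the example.
log_restoring(logger, "/var/log/fluent/buffer/example.log", flush_at_shutdown: true)
log_restoring(logger, "/var/log/fluent/buffer/example.log", flush_at_shutdown: false)
```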

And, if possible, we should check for the corruption of the chunk's body when loading existing chunks.

I am currently making this modification.

daipom · 2023-01-26T10:59:21Z

I have created some PRs for this issue.

* Add logs to identify the time period of potentially broken data
  * buffer: warning message for restoring buffer with flush_at_shutdown #4027
  * buffer: add log for time periods of restored chunks which may be broken #4028
* Back up a broken chunk file to check the content
  * buffer: backup broken file chunk #4025
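
To make the backup idea concrete, here is a minimal sketch of moving a chunk file that failed validation into a separate directory for later inspection. `backup_broken_chunk`, the file name, and the directory layout are assumptions made for this example; they do not describe how #4025 is implemented.

```ruby
require "fileutils"
require "tmpdir"

# Hypothetical helper, not Fluentd's implementation: move a chunk file that
# failed validation into a backup directory and return the new location.
def backup_broken_chunk(chunk_path, backup_dir)
  FileUtils.mkdir_p(backup_dir)
  dest = File.join(backup_dir, File.basename(chunk_path))
  FileUtils.mv(chunk_path, dest)
  dest
end

Dir.mktmpdir do |dir|
  # A made-up chunk file with arbitrary bytes standing in for broken data.
  chunk = File.join(dir, "buffer.b0123.log")
  File.binwrite(chunk, "\x00\x00 broken bytes")

  saved = backup_broken_chunk(chunk, File.join(dir, "backup"))
  puts File.exist?(saved)  # true
  puts File.exist?(chunk)  # false
end
```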

About adding more checks for buffer corruption, I consider the following:

* It is difficult to determine whether the content of a chunk file is broken.
  * Its format depends on the plugin, so we cannot define in general what should be considered broken.
* I guess we can check whether the data can be unpacked as MessagePack, but only when the output plugin does not use a custom format.
  * Some kinds of corruption do not cause an unpacking error but instead produce nil records.
  * So we should probably iterate over all records with each and check for nil or otherwise incorrect records.
  * The performance impact would be negligible if this is done only for flush_at_shutdown true and buffer::resume().
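
The record scan described in the list above might be sketched as follows. The validity rule (a non-nil time and a Hash record) and the helper names are assumptions for illustration only; a real check would also have to unpack the chunk with MessagePack first, which is left out here so that the example needs nothing beyond plain Ruby.

```ruby
# Hypothetical validity rule: the time must not be nil and the record must
# be a Hash. A real check in Fluentd might need a different rule.
def corrupted_entry?(time, record)
  time.nil? || !record.is_a?(Hash)
end

# Walk all entries and return the index of the first suspicious one,
# or nil when every entry passes the rule above.
def first_corrupted_index(entries)
  index = 0
  entries.each do |time, record|
    return index if corrupted_entry?(time, record)
    index += 1
  end
  nil
end

# Arrays of [time, record] pairs stand in for an already unpacked chunk.
good = [[1674730761, { "message" => "ok" }], [1674730762, { "message" => "ok" }]]
bad  = [[1674730761, { "message" => "ok" }], [nil, nil]]

p first_corrupted_index(good)  # nil
p first_corrupted_index(bad)   # 1
```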

daipom · 2023-02-17T03:35:18Z

> I have created some PRs for this issue.

All PRs are merged, thanks for the reviews!
I will add documents and a release note about #4025 soon.

> About adding more checks for buffer corruption, I consider the following:
>
> * It is difficult to determine if the content of the chunk file is broken or not.
>   * Its format depends on the plugin, so we can not define what should be considered broken in general.
> * I guess we can check if we can unpack the data as MessagePack only when the output plugin does not use a custom format.
>   * There is a way of corruption such that unpacking does not result in an error, but returns `nil` records.
>   * So we probably should `each` all records and check for `nil` or other incorrect records.
>   * This impact on performance would be ignorable if only for `flush_at_shutdown true` and `buffer::resume()`.

I want to work on other issues now, so I won't be able to work on this for a while.

daipom · 2023-03-29T07:52:50Z

Added documentation.

buffer: backup corrupted chunk files at resuming fluentd-docs-gitbook#448

The following feature would be helpful, but I will not be able to work on it for a while.

> About adding more checks for buffer corruption, I consider the following:
>
> * It is difficult to determine if the content of the chunk file is broken or not.
>   * Its format depends on the plugin, so we can not define what should be considered broken in general.
> * I guess we can check if we can unpack the data as MessagePack only when the output plugin does not use a custom format.
>   * There is a way of corruption such that unpacking does not result in an error, but returns `nil` records.
>   * So we probably should `each` all records and check for `nil` or other incorrect records.
>   * This impact on performance would be ignorable if only for `flush_at_shutdown true` and `buffer::resume()`.

cosmo0920 · 2024-04-08T06:33:21Z

> The following feature would be helpful, but I will not be able to work on it for a while.
>
> About adding more checks for buffer corruption, I consider the following:
>
> * It is difficult to determine if the content of the chunk file is broken or not.
>   * Its format depends on the plugin, so we can not define what should be considered broken in general.
> * I guess we can check if we can unpack the data as MessagePack only when the output plugin does not use a custom format.
>   * There is a way of corruption such that unpacking does not result in an error, but returns `nil` records.
>   * So we probably should `each` all records and check for `nil` or other incorrect records.
>   * This impact on performance would be ignorable if only for `flush_at_shutdown true` and `buffer::resume()`.

I'm not sure it's possible, but if we could add checksums for the buffer contents, it would help to verify the correctness of the buffers. This is already implemented in chunkio, which is used in Fluent Bit's filesystem buffering mechanism.

The main issue with the current implementation is that there is no mechanism to detect buffer corruption.
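
A checksum-based check could be sketched like this, using CRC32 from Ruby's standard `zlib` library. The framing (a 4-byte big-endian CRC32 followed by the payload) and the names `seal` and `unseal` are invented for this example; this is not the on-disk format of Fluentd or chunkio.

```ruby
require "zlib"

# Hypothetical framing for illustration only: a 4-byte big-endian CRC32
# of the payload, followed by the payload bytes themselves.
def seal(payload)
  bytes = payload.b
  [Zlib.crc32(bytes)].pack("N") + bytes
end

# Return the payload when the stored checksum matches, otherwise nil.
def unseal(data)
  return nil if data.bytesize < 4
  expected = data.byteslice(0, 4).unpack1("N")
  payload = data.byteslice(4, data.bytesize - 4)
  Zlib.crc32(payload) == expected ? payload : nil
end

sealed = seal("event data")
puts unseal(sealed) == "event data".b  # true

# Flip every bit of the last byte to simulate corruption on disk.
damaged = sealed.dup
last = damaged.bytesize - 1
damaged.setbyte(last, damaged.getbyte(last) ^ 0xFF)
p unseal(damaged)                      # nil
```

A checksum like this detects that a chunk body changed after it was written, regardless of the format the output plugin uses, but it cannot repair the data.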

daipom · 2024-04-08T10:21:18Z

> I'm not sure it's possible, but if we could add checksums for the buffer contents, it would help to verify the correctness of the buffers. This is already implemented in chunkio, which is used in Fluent Bit's filesystem buffering mechanism.
>
> The main issue with the current implementation is that there is no mechanism to detect buffer corruption.

I agree.

I remember that when I previously worked on this issue, I did not consider such a new mechanism because it would be expensive to implement and would have a large impact on the existing logic.
However, if we could add such a checksum mechanism, it would be helpful!

fujimotos added the enhancement Feature request or improve operations label Nov 25, 2022

fujimotos added this to Fluentd Kanban Nov 25, 2022

fujimotos moved this to To-Do in Fluentd Kanban Nov 25, 2022

daipom self-assigned this Dec 28, 2022

This was referenced Jan 26, 2023

buffer: backup broken file chunk #4025

Merged

buffer: warning message for restoring buffer with flush_at_shutdown #4027

Merged

buffer: add log for time periods of restored chunks which may be broken #4028

Merged

ashie moved this from To-Do to Work-In-Progress in Fluentd Kanban Feb 14, 2023

daipom mentioned this issue Feb 17, 2023

process_partial_cri error="undefined method `split' for nil:NilClass" fluent-plugins-nursery/fluent-plugin-concat#119

Closed

daipom mentioned this issue Feb 27, 2023

Proposing a new Fluentd maintainer #4069

Merged

daipom removed this from Fluentd Kanban Mar 29, 2023
