Delete irrecoverable chunks on crc32 checksum error #107

Open · wants to merge 1 commit into base: master

Conversation

drbugfinder-work

In Fluent Bit we see many corrupted chunks (resulting from previous OOM kills in Kubernetes) that remain on disk indefinitely, even though the storage.delete_irrecoverable_chunks parameter is set.
[Screenshot from 2024-08-22 at 15:09:35]

It looks like the CIO_ERR_BAD_CHECKSUM error, once set, is currently not handled as a case for deletion.

This PR should correct this behavior.
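For illustration, the intended change can be thought of as adding CIO_ERR_BAD_CHECKSUM to the set of error codes that qualify a chunk for deletion. The sketch below is not the actual patch: the helper function and the numeric values of the error constants are hypothetical stand-ins defined locally so the example compiles on its own.

```c
/* Illustrative sketch only -- not the real chunkio code.
 * CIO_ERR_BAD_CHECKSUM is the error discussed in this PR; the other
 * names and all numeric values are placeholders for this example. */
#include <stdio.h>

#define CIO_ERR_BAD_FILE_SIZE  1   /* hypothetical value */
#define CIO_ERR_BAD_LAYOUT     2   /* hypothetical value */
#define CIO_ERR_BAD_CHECKSUM   3   /* hypothetical value */

/* Decide whether a chunk error should count as irrecoverable, i.e.
 * make the chunk eligible for removal when the user has enabled
 * storage.delete_irrecoverable_chunks. */
static int chunk_is_irrecoverable(int err)
{
    switch (err) {
    case CIO_ERR_BAD_FILE_SIZE:
    case CIO_ERR_BAD_LAYOUT:
    case CIO_ERR_BAD_CHECKSUM:   /* the case this PR proposes to add */
        return 1;
    default:
        return 0;
    }
}

int main(void)
{
    printf("bad checksum irrecoverable? %d\n",
           chunk_is_irrecoverable(CIO_ERR_BAD_CHECKSUM));
    return 0;
}
```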

cc @edsiper

@leonardo-albertovich
Collaborator

If I remember correctly, checksum errors are not considered irrecoverable, since the chunk can still be loaded and some sort of forensics could be performed (that was my rationale for leaving them out when I implemented the feature).

@leonardo-albertovich
Collaborator

A better question would be "why does your system produce corrupted chunks?", and I think having access to your configuration, logs and corrupted chunks would be invaluable for actually fixing the root cause. Otherwise we'd probably make the issue much harder to diagnose and fix in the future.

@drbugfinder-work
Author

Hi @leonardo-albertovich !

From our perspective, chunks with a bad checksum are irrecoverable and should be deleted if that parameter is set, as they cannot be processed by Fluent Bit anymore.

I absolutely understand your point, and I also wanted to start a separate discussion about possible ways to encode these chunks and send them (e.g., as a log payload) to be processed later.

We believe these corrupt chunks should definitely be deleted, especially since the option's name suggests it does exactly that.

As I mentioned, in our case, these corrupt chunks appear when Fluent Bit Kubernetes pods are OOM-killed, so there’s no straightforward way to perform a soft shutdown of Fluent Bit.

Given that the filesystem storage scenario implies a persistent volume in K8s (or a host mount), storage fills up over time because these corrupt chunks are never deleted automatically. This results in the need for manual intervention, scripting, etc., to identify and remove the bad chunks.
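For context, the kind of setup being described looks roughly like the sketch below. The paths and values are placeholders, not the actual configuration from this report; it simply shows filesystem buffering on a mounted volume with CRC32 checksums and the deletion option enabled.

```text
# Minimal illustrative configuration (placeholder paths/values)
[SERVICE]
    storage.path                          /var/fluent-bit/buffer
    storage.checksum                      on
    storage.delete_irrecoverable_chunks   on

[INPUT]
    Name          tail
    Path          /var/log/containers/*.log
    storage.type  filesystem
```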

As I said, I think it would be best to delete them when this delete option is set, and to implement another option to forward/copy them to a specified target.

@leonardo-albertovich
Collaborator

I think we shouldn't, but I see what you mean, and I guess that since the user has already opted in to deleting them it would be acceptable to do so.

Could you please chime in @edsiper?

Btw, I noticed something obvious yet valuable in your description: the corrupted chunks are caused by the process being OOM-killed. I know it's obvious, but I wasn't aware of it being the root of this case, and it makes sense not to look further for the corruption source (in this case).

Do I make sense?
