scan: delete corrupted chunks #67

jamiees2 · 2021-08-22T12:11:00Z

When we scan for chunks on startup, we load the list into memory, but exclude all chunks which are ultimately corrupted. This means that the library consumers are never actually informed about the corrupted chunks, and as a result, can't act on them to delete them.

This just deletes corrupted chunks on scan, but I'm interested in if we want to place this behind a config flag (and enable it in fluent bit), or want to do something more.

This does fix our immediate issue with 5 minutes of corrupted chunks on startup though :)

jamiees2 · 2021-08-22T12:12:49Z

Before:

ESC[1mFluent Bit v1.8.3ESC[0m
* ESC[1mESC[93mCopyright (C) 2019-2021 The Fluent Bit AuthorsESC[0m
* ESC[1mESC[93mCopyright (C) 2015-2018 Treasure DataESC[0m
* Fluent Bit is a CNCF sub-project under the umbrella of Fluentd
* https://fluentbit.io

[2021/08/17 19:06:36] [ info] [engine] started (pid=22959)
[2021/08/17 19:06:36] [error] [storage] format check failed: forward.0/24108-1627036093.415782167.flb
[2021/08/17 19:06:36] [error] [storage] format check failed: forward.0/12071-1626681987.260484982.flb
[2021/08/17 19:06:36] [error] [storage] format check failed: forward.0/27785-1628312566.811558862.flb
[2021/08/17 19:06:36] [error] [storage] format check failed: forward.0/12071-1626635752.528011015.flb
[2021/08/17 19:06:36] [error] [storage] format check failed: forward.0/12071-1626680693.425352689.flb
[2021/08/17 19:06:36] [error] [storage] format check failed: forward.0/12071-1626596863.504571931.flb
[2021/08/17 19:06:36] [error] [storage] format check failed: forward.0/12071-1626652882.737676217.flb
[2021/08/17 19:06:36] [error] [storage] format check failed: forward.0/12071-1626636338.941414083.flb
[2021/08/17 19:06:36] [error] [storage] format check failed: forward.0/12071-1626684643.644410485.flb

for 5 minutes

After:

[2021/08/22 11:45:59] [ info] [engine] started (pid=16928)
[2021/08/22 11:45:59] [debug] [engine] coroutine stack size: 24576 bytes (24.0K)
[2021/08/22 11:45:59] [debug] [storage] [cio scan] opening path /var/log/td-agent-bit/cache/
[2021/08/22 11:45:59] [debug] [storage] [cio stream] new stream registered: forward.0
[2021/08/22 11:45:59] [debug] [storage] [cio scan] opening stream forward.0
[2021/08/22 11:45:59] [debug] [storage] [cio file] invalid crc32 at 10083-1626637721.508111593.flb//var/log/td-agent-bit/cache//forward.0/10083-1626637721.508111593.flb
[2021/08/22 11:45:59] [error] [storage] format check failed: forward.0/10083-1626637721.508111593.flb
[2021/08/22 11:45:59] [debug] [storage] [cio scan] Found corrupted chunk 10083-1626637721.508111593.flb
[2021/08/22 11:45:59] [debug] [storage] [cio file] invalid crc32 at 10083-1626687577.352016009.flb//var/log/td-agent-bit/cache//forward.0/10083-1626687577.352016009.flb
[2021/08/22 11:45:59] [error] [storage] format check failed: forward.0/10083-1626687577.352016009.flb
[2021/08/22 11:45:59] [debug] [storage] [cio scan] Found corrupted chunk 10083-1626687577.352016009.flb
[2021/08/22 11:45:59] [debug] [storage] [cio file] invalid crc32 at 17840-1628486295.657410446.flb//var/log/td-agent-bit/cache//forward.0/17840-1628486295.657410446.flb
[2021/08/22 11:45:59] [error] [storage] format check failed: forward.0/17840-1628486295.657410446.flb
[2021/08/22 11:45:59] [debug] [storage] [cio scan] Found corrupted chunk 17840-1628486295.657410446.flb
[2021/08/22 11:45:59] [debug] [storage] [cio file] invalid crc32 at 10083-1626677477.220090314.flb//var/log/td-agent-bit/cache//forward.0/10083-1626677477.220090314.flb
[2021/08/22 11:45:59] [error] [storage] format check failed: forward.0/10083-1626677477.220090314.flb
[2021/08/22 11:45:59] [debug] [storage] [cio scan] Found corrupted chunk 10083-1626677477.220090314.flb
[2021/08/22 11:45:59] [debug] [storage] [cio file] invalid crc32 at 27637-1627293232.563310409.flb//var/log/td-agent-bit/cache//forward.0/27637-1627293232.563310409.flb
[2021/08/22 11:45:59] [error] [storage] format check failed: forward.0/27637-1627293232.563310409.flb
[2021/08/22 11:45:59] [debug] [storage] [cio scan] Found corrupted chunk 27637-1627293232.563310409.flb
[2021/08/22 11:45:59] [debug] [storage] [cio file] invalid crc32 at 10083-1626599501.871938221.flb//var/log/td-agent-bit/cache//forward.0/10083-1626599501.871938221.flb
[2021/08/22 11:45:59] [error] [storage] format check failed: forward.0/10083-1626599501.871938221.flb
[2021/08/22 11:45:59] [debug] [storage] [cio scan] Found corrupted chunk 10083-1626599501.871938221.flb

The startup still took 5+ minutes, but the next startup was super fast :D

jamiees2 · 2021-08-24T22:24:32Z

hey @edsiper, is there a chance you can take a look at this PR soon/propose an alternative way to address this bug?

edsiper · 2021-08-25T00:15:56Z

I am thinking about what would be the ideal use case. Note that a corrupted chunk can have "many" valid records, what we do in Fluent Bit is process all valid chunks that exist in a corrupted chunk. If we just skip that functionality we will be in a worse situation...

jamiees2 · 2021-08-26T17:03:24Z

so I spent some time looking through this today, I believe this isn't correct, as far as I can tell the execution flow seems to be the following:

On startup, fluent bit calls cio_load, which eventually calls out to cio_scan_stream_files with a directory. cio_scan_stream_files loops through the files in this directory and runs cio_chunk_open over each of them.

cio_chunk_open takes this chunk, allocates a struct cio_chunk for it, and starts populating it, and adds it to the struct cio_stream -> chunks list It eventually populates the backend using cio_file_open if the stream is backed by the filesystem.

cio_file_open will check if it should open the file, try to open it, and then try to mmap it. If any of these steps fail, it will return NULL, and populate err.

If cio_chunk_open receives a NULL as a return value, it will delete the chunk from the struct cio_stream -> chunks list and free everything. So as a result, from fluent bit's perspective, the file is never loaded at all, and all records of it are cleaned up.

As a result, my belief is that we actually never ever load these files, since the only way we scan them is through cio_scan_stream_files, which means we can't use them to load data.

I'm wondering if there's something I'm missing here, could you potentially link me to some code?

edsiper · 2021-10-07T22:34:28Z

If cio_chunk_open receives a NULL as a return value, it will delete the chunk from the struct cio_stream -> chunks list and free everything. So as a result, from fluent bit's perspective, the file is never loaded at all, and all records of it are cleaned up.

which line is that ?, what conditions would trigger that scenario ?

jamiees2 · 2021-10-08T11:28:28Z

I'm referring to these lines:
https://github.com/edsiper/chunkio/blob/510a277bc7c290c7a00012b2a6df5555e35947f6/src/cio_chunk.c#L78
https://github.com/edsiper/chunkio/blob/510a277bc7c290c7a00012b2a6df5555e35947f6/src/cio_chunk.c#L85-L90

In cio_file.c, cio_file_open returns NULL if mmap_file returns an error, which includes CIO_CORRUPTED, see https://github.com/edsiper/chunkio/blob/510a277bc7c290c7a00012b2a6df5555e35947f6/src/cio_file.c#L615-L620

CIO_CORRUPTED is what is returned for chunks which fail the crc32 format check, so my belief is that these files are not loaded again at all.

delete corrupted chunks

ca48b75

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

scan: delete corrupted chunks #67

scan: delete corrupted chunks #67

jamiees2 commented Aug 22, 2021

jamiees2 commented Aug 22, 2021 •

edited

Loading

jamiees2 commented Aug 24, 2021

edsiper commented Aug 25, 2021

jamiees2 commented Aug 26, 2021 •

edited

Loading

edsiper commented Oct 7, 2021

jamiees2 commented Oct 8, 2021

scan: delete corrupted chunks #67

Are you sure you want to change the base?

scan: delete corrupted chunks #67

Conversation

jamiees2 commented Aug 22, 2021

jamiees2 commented Aug 22, 2021 • edited Loading

jamiees2 commented Aug 24, 2021

edsiper commented Aug 25, 2021

jamiees2 commented Aug 26, 2021 • edited Loading

edsiper commented Oct 7, 2021

jamiees2 commented Oct 8, 2021

jamiees2 commented Aug 22, 2021 •

edited

Loading

jamiees2 commented Aug 26, 2021 •

edited

Loading