Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

scan: delete corrupted chunks #67

Open
wants to merge 1 commit into
base: master
Choose a base branch
from
Open

Conversation

jamiees2
Copy link

When we scan for chunks on startup, we load the list into memory, but exclude all chunks which are ultimately corrupted. This means that the library consumers are never actually informed about the corrupted chunks, and as a result, can't act on them to delete them.

This just deletes corrupted chunks on scan, but I'm interested in if we want to place this behind a config flag (and enable it in fluent bit), or want to do something more.

This does fix our immediate issue with 5 minutes of corrupted chunks on startup though :)

@jamiees2
Copy link
Author

jamiees2 commented Aug 22, 2021

Before:

ESC[1mFluent Bit v1.8.3ESC[0m
* ESC[1mESC[93mCopyright (C) 2019-2021 The Fluent Bit AuthorsESC[0m
* ESC[1mESC[93mCopyright (C) 2015-2018 Treasure DataESC[0m
* Fluent Bit is a CNCF sub-project under the umbrella of Fluentd
* https://fluentbit.io

[2021/08/17 19:06:36] [ info] [engine] started (pid=22959)
[2021/08/17 19:06:36] [error] [storage] format check failed: forward.0/24108-1627036093.415782167.flb
[2021/08/17 19:06:36] [error] [storage] format check failed: forward.0/12071-1626681987.260484982.flb
[2021/08/17 19:06:36] [error] [storage] format check failed: forward.0/27785-1628312566.811558862.flb
[2021/08/17 19:06:36] [error] [storage] format check failed: forward.0/12071-1626635752.528011015.flb
[2021/08/17 19:06:36] [error] [storage] format check failed: forward.0/12071-1626680693.425352689.flb
[2021/08/17 19:06:36] [error] [storage] format check failed: forward.0/12071-1626596863.504571931.flb
[2021/08/17 19:06:36] [error] [storage] format check failed: forward.0/12071-1626652882.737676217.flb
[2021/08/17 19:06:36] [error] [storage] format check failed: forward.0/12071-1626636338.941414083.flb
[2021/08/17 19:06:36] [error] [storage] format check failed: forward.0/12071-1626684643.644410485.flb

for 5 minutes

After:

[2021/08/22 11:45:59] [ info] [engine] started (pid=16928)
[2021/08/22 11:45:59] [debug] [engine] coroutine stack size: 24576 bytes (24.0K)
[2021/08/22 11:45:59] [debug] [storage] [cio scan] opening path /var/log/td-agent-bit/cache/
[2021/08/22 11:45:59] [debug] [storage] [cio stream] new stream registered: forward.0
[2021/08/22 11:45:59] [debug] [storage] [cio scan] opening stream forward.0
[2021/08/22 11:45:59] [debug] [storage] [cio file] invalid crc32 at 10083-1626637721.508111593.flb//var/log/td-agent-bit/cache//forward.0/10083-1626637721.508111593.flb
[2021/08/22 11:45:59] [error] [storage] format check failed: forward.0/10083-1626637721.508111593.flb
[2021/08/22 11:45:59] [debug] [storage] [cio scan] Found corrupted chunk 10083-1626637721.508111593.flb
[2021/08/22 11:45:59] [debug] [storage] [cio file] invalid crc32 at 10083-1626687577.352016009.flb//var/log/td-agent-bit/cache//forward.0/10083-1626687577.352016009.flb
[2021/08/22 11:45:59] [error] [storage] format check failed: forward.0/10083-1626687577.352016009.flb
[2021/08/22 11:45:59] [debug] [storage] [cio scan] Found corrupted chunk 10083-1626687577.352016009.flb
[2021/08/22 11:45:59] [debug] [storage] [cio file] invalid crc32 at 17840-1628486295.657410446.flb//var/log/td-agent-bit/cache//forward.0/17840-1628486295.657410446.flb
[2021/08/22 11:45:59] [error] [storage] format check failed: forward.0/17840-1628486295.657410446.flb
[2021/08/22 11:45:59] [debug] [storage] [cio scan] Found corrupted chunk 17840-1628486295.657410446.flb
[2021/08/22 11:45:59] [debug] [storage] [cio file] invalid crc32 at 10083-1626677477.220090314.flb//var/log/td-agent-bit/cache//forward.0/10083-1626677477.220090314.flb
[2021/08/22 11:45:59] [error] [storage] format check failed: forward.0/10083-1626677477.220090314.flb
[2021/08/22 11:45:59] [debug] [storage] [cio scan] Found corrupted chunk 10083-1626677477.220090314.flb
[2021/08/22 11:45:59] [debug] [storage] [cio file] invalid crc32 at 27637-1627293232.563310409.flb//var/log/td-agent-bit/cache//forward.0/27637-1627293232.563310409.flb
[2021/08/22 11:45:59] [error] [storage] format check failed: forward.0/27637-1627293232.563310409.flb
[2021/08/22 11:45:59] [debug] [storage] [cio scan] Found corrupted chunk 27637-1627293232.563310409.flb
[2021/08/22 11:45:59] [debug] [storage] [cio file] invalid crc32 at 10083-1626599501.871938221.flb//var/log/td-agent-bit/cache//forward.0/10083-1626599501.871938221.flb
[2021/08/22 11:45:59] [error] [storage] format check failed: forward.0/10083-1626599501.871938221.flb
[2021/08/22 11:45:59] [debug] [storage] [cio scan] Found corrupted chunk 10083-1626599501.871938221.flb

The startup still took 5+ minutes, but the next startup was super fast :D

@jamiees2
Copy link
Author

hey @edsiper, is there a chance you can take a look at this PR soon/propose an alternative way to address this bug?

@edsiper
Copy link
Member

edsiper commented Aug 25, 2021

I am thinking about what would be the ideal use case. Note that a corrupted chunk can have "many" valid records, what we do in Fluent Bit is process all valid chunks that exist in a corrupted chunk. If we just skip that functionality we will be in a worse situation...

@jamiees2
Copy link
Author

jamiees2 commented Aug 26, 2021

so I spent some time looking through this today, I believe this isn't correct, as far as I can tell the execution flow seems to be the following:

On startup, fluent bit calls cio_load, which eventually calls out to cio_scan_stream_files with a directory. cio_scan_stream_files loops through the files in this directory and runs cio_chunk_open over each of them.

cio_chunk_open takes this chunk, allocates a struct cio_chunk for it, and starts populating it, and adds it to the struct cio_stream -> chunks list It eventually populates the backend using cio_file_open if the stream is backed by the filesystem.

cio_file_open will check if it should open the file, try to open it, and then try to mmap it. If any of these steps fail, it will return NULL, and populate err.

If cio_chunk_open receives a NULL as a return value, it will delete the chunk from the struct cio_stream -> chunks list and free everything. So as a result, from fluent bit's perspective, the file is never loaded at all, and all records of it are cleaned up.

As a result, my belief is that we actually never ever load these files, since the only way we scan them is through cio_scan_stream_files, which means we can't use them to load data.

I'm wondering if there's something I'm missing here, could you potentially link me to some code?

@edsiper
Copy link
Member

edsiper commented Oct 7, 2021

If cio_chunk_open receives a NULL as a return value, it will delete the chunk from the struct cio_stream -> chunks list and free everything. So as a result, from fluent bit's perspective, the file is never loaded at all, and all records of it are cleaned up.

which line is that ?, what conditions would trigger that scenario ?

@jamiees2
Copy link
Author

jamiees2 commented Oct 8, 2021

I'm referring to these lines:
https://github.com/edsiper/chunkio/blob/510a277bc7c290c7a00012b2a6df5555e35947f6/src/cio_chunk.c#L78
https://github.com/edsiper/chunkio/blob/510a277bc7c290c7a00012b2a6df5555e35947f6/src/cio_chunk.c#L85-L90

In cio_file.c, cio_file_open returns NULL if mmap_file returns an error, which includes CIO_CORRUPTED, see https://github.com/edsiper/chunkio/blob/510a277bc7c290c7a00012b2a6df5555e35947f6/src/cio_file.c#L615-L620

CIO_CORRUPTED is what is returned for chunks which fail the crc32 format check, so my belief is that these files are not loaded again at all.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants