[storage layer] Fluent-bit crashes on high contention if corrupted chunks exist #1950

@dgrala

Bug Report

Describe the bug
The crash may be related to corrupted chunk files in storage, or simply to a large backlog there. On startup, Fluent Bit tries to load all the files and crashes with a segfault shortly afterwards.

To Reproduce

[2020/02/12 23:24:15] [error] [storage] format check failed: systemd.2/8-1581461952.525678621.flb
[2020/02/12 23:24:15] [error] [storage] format check failed: systemd.2/8-1581467643.662417426.flb
[2020/02/12 23:24:25] [error] [storage] format check failed: tail.0/163-1581549864.900113196.flb
[2020/02/12 23:24:25] [error] [storage] [cio file] cannot map chunk: tail.0/163-1581549864.900113196.flb
[2020/02/12 23:24:25] [error] [storage] format check failed: tail.0/163-1581549865.134681941.flb
[2020/02/12 23:24:25] [error] [storage] [cio file] cannot map chunk: tail.0/163-1581549865.134681941.flb
[engine] caught signal (SIGSEGV)
#0  0x5636a8cf8f0e      in  cio_file_st_get_meta_len() at lib/chunkio/include/chunkio/cio_file_st.h:72
#1  0x5636a8cf8f42      in  cio_file_st_get_content() at lib/chunkio/include/chunkio/cio_file_st.h:93
#2  0x5636a8cf93f1      in  cio_chunk_get_content() at lib/chunkio/src/cio_chunk.c:193
#3  0x5636a8a964f5      in  flb_input_chunk_flush() at src/flb_input_chunk.c:550
#4  0x5636a8a7c823      in  flb_engine_dispatch() at src/flb_engine_dispatch.c:146
#5  0x5636a8a79b22      in  flb_engine_flush() at src/flb_engine.c:85
#6  0x5636a8a7b318      in  flb_engine_handle_event() at src/flb_engine.c:247
#7  0x5636a8a7b318      in  flb_engine_start() at src/flb_engine.c:489
#8  0x5636a89ec813      in  main() at src/fluent-bit.c:853
#9  0x7f75cf5ecb96      in  ???() at ???:0
#10 0x5636a89ea9d9      in  ???() at ???:0
#11 0xffffffffffffffff  in  ???() at ???:0
Aborted (core dumped)
  • Steps to reproduce the problem:
    In one instance we had 1.2GB of files stuck in storage, and the crash reproduced consistently. I moved the files out of storage and ran again, and it didn't crash. When I moved them back, it didn't seem to crash either. It might be hard to reproduce, but it seems to happen regularly in our prod environment.
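
For triage, the affected chunk names can be pulled out of a log like the one above. This is a minimal Python sketch, assuming the two error message shapes shown in this report ("format check failed" and "cannot map chunk") are the only ones of interest. The helper name `failing_chunks` is hypothetical and not part of Fluent Bit.

```python
import re

# Matches the two storage error shapes that appear in the log in this report.
CHUNK_ERROR = re.compile(
    r"\[error\] \[storage\] (?:\[cio file\] )?"
    r"(?P<reason>format check failed|cannot map chunk): "
    r"(?P<chunk>\S+\.flb)"
)


def failing_chunks(log_text):
    """Return {chunk_path: [reasons]} in the order chunks first appear."""
    found = {}
    for line in log_text.splitlines():
        match = CHUNK_ERROR.search(line)
        if match:
            found.setdefault(match.group("chunk"), []).append(match.group("reason"))
    return found


# Three lines copied from the log in this report.
sample = """\
[2020/02/12 23:24:15] [error] [storage] format check failed: systemd.2/8-1581461952.525678621.flb
[2020/02/12 23:24:25] [error] [storage] format check failed: tail.0/163-1581549864.900113196.flb
[2020/02/12 23:24:25] [error] [storage] [cio file] cannot map chunk: tail.0/163-1581549864.900113196.flb
"""

for chunk, reasons in failing_chunks(sample).items():
    print(chunk, reasons)
```

The resulting paths are relative to storage.path, so they can be used to decide which files to move aside before restarting.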

Expected behavior
No crashes

Screenshots

Your Environment

  • Version used: 1.3.2
  • Configuration:
  td-agent-bit.conf: |
    [SERVICE]
        # Rely on the supervisor service (eg kubelet) to restart
        # the fluentbit daemon when the configuration changes.
        Config_Watch              on
        # Given we run in a container stay in the foreground.
        Daemon                    off
        Flush                     1
        HTTP_Server               on
        HTTP_Listen               0.0.0.0
        HTTP_Port                 2020
        Log_Level                 warning
        Parsers_File              parsers.conf
        storage.path              /var/lib/fluentbit/storage/
        storage.sync              full
        storage.checksum          on
        # from https://github.com/fluent/fluent-bit/issues/1362#issuecomment-500166931
        storage.backlog.mem_limit 100M

Problematic input (one of 3):

[INPUT]
        Name              tail
        Tag               kube.<namespace_name>.<pod_name>.<container_name>.<docker_id>
        Tag_Regex         (?<pod_name>[a-z0-9](?:[-a-z0-9]*[a-z0-9])?(?:\\.[a-z0-9]([-a-z0-9]*[a-z0-9])?)*)_(?<namespace_name>[^_]+)_(?<container_name>.+)-(?<docker_id>[a-z0-9]{64})\.log$
        Path              /var/log/containers/*.log
        Parser            docker
        DB                /var/lib/fluentbit/input_tail_kube.db
        Docker_Mode       On
        Mem_Buf_Limit     50MB
        Buffer_Chunk_Size 1MB
        Buffer_Max_Size   1MB
        Skip_Long_Lines   On
        storage.type      filesystem
        Refresh_Interval  10
  • Environment name and version (e.g. Kubernetes? What version?):
    k8s

  • Server type and version:

  • Operating System and version:
    NAME="Ubuntu"
    VERSION="18.04.3 LTS (Bionic Beaver)"

  • Filters and plugins:

Additional context
Ideally we would see no crashes. It seems that after any crash, the storage files are corrupted and cannot be loaded by subsequent runs.
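
To check whether leftover chunk files look damaged before restarting, something like the following Python sketch could be run against the storage directory. The header layout here is an assumption recalled from chunkio's cio_file_st.h (a 2-byte identifier 0xC1 0x00 and a 24-byte minimum header) and should be verified against the chunkio version bundled with the Fluent Bit build in use. The helper names are hypothetical. It does not verify the CRC32, and it is not claimed to match Fluent Bit's own format check.

```python
import os

# ASSUMPTION: layout recalled from chunkio's cio_file_st.h; verify before relying on it.
MAGIC = b"\xc1\x00"      # assumed 2-byte file identifier
MIN_HEADER_BYTES = 24    # assumed: 2 id + 4 CRC32 + 16 padding + 2 metadata length


def header_problem(path):
    """Return a short reason if the chunk header looks wrong, else None."""
    with open(path, "rb") as handle:
        header = handle.read(MIN_HEADER_BYTES)
    if len(header) < MIN_HEADER_BYTES:
        return "truncated header (%d bytes)" % len(header)
    if header[:2] != MAGIC:
        return "unexpected identifier bytes"
    return None


def scan_storage(storage_path):
    """Walk storage_path and map each suspicious .flb file to its reason."""
    suspicious = {}
    for root, _dirs, files in os.walk(storage_path):
        for name in files:
            if name.endswith(".flb"):
                full = os.path.join(root, name)
                reason = header_problem(full)
                if reason:
                    suspicious[full] = reason
    return suspicious


if __name__ == "__main__":
    # storage.path taken from the [SERVICE] section in this report.
    for path, reason in scan_storage("/var/lib/fluentbit/storage/").items():
        print(path, "->", reason)
```

This only reports files; moving or deleting them is left as a manual step, as in the reproduction notes above.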

Labels: bug, fixed, waiting-for-user (Waiting for more information, tests or requested changes)
