[extension/filestorage] Recover from add. bbolt goroutine panic #48565

Open: briandavis-viz wants to merge 7 commits into open-telemetry:main from briandavis-viz:fix/filestorage-recreate-goroutine-panic-35899

Conversation

@briandavis-viz

Contributor

Description

Extends the recreate option to recover from bbolt corruption that panics in goroutines spawned by bbolt.Open (notably the freepages "multiple references" panic). Adds a subprocess pre-check that isolates such panics, while keeping the existing defer/recover for main-goroutine panics.

May 20 12:46:16 hostname-02 otelcol-contrib[118358]: {"level":"info","ts":"2026-05-20T12:46:16.852-0600","msg":"Starting stanza receiver","resource":{"service.instance.id":"24e88d05-3ffb-4c6c-94d7-a452bb930007","service.name":"otelcol-contrib","service.version":"0.136.0"},"otelcol.component.id":"filelog/k3s_service","otelcol.component.kind":"receiver","otelcol.signal":"logs"}
May 20 12:46:16 hostname-02 otelcol-contrib[118358]: panic: freepages: failed to get all reachable pages (page 2: multiple references (stack: [2]))
May 20 12:46:16 hostname-02 otelcol-contrib[118358]: goroutine 379 [running]:
May 20 12:46:16 hostname-02 otelcol-contrib[118358]: go.etcd.io/bbolt.(*DB).freepages.func2()
May 20 12:46:16 hostname-02 otelcol-contrib[118358]: #011go.etcd.io/bbolt@v1.4.3/db.go:1248 +0x8d
May 20 12:46:16 hostname-02 otelcol-contrib[118358]: created by go.etcd.io/bbolt.(*DB).freepages in goroutine 1
May 20 12:46:16 hostname-02 otelcol-contrib[118358]: #011go.etcd.io/bbolt@v1.4.3/db.go:1246 +0x1dc
May 20 12:46:16 hostname-02 systemd[1]: otelcol-contrib.service: Main process exited, code=exited, status=2/INVALIDARGUMENT

Link to tracking issue

Additional fix for #35899.

Testing

Ran this on the host that was actively having the issues, and it worked as intended for renaming the corrupted file.

May 20 15:26:12 hostname-02 otelcol-contrib[366526]: {"level":"warn","ts":"2026-05-20T15:26:12.902-0600","msg":"Renamed corrupt bbolt database; a fresh database will be created","resource":{"service.instance.id":"768b24a2-7247-48f6-ba65-7cc668d7ed0d","service.name":"otelcontribcol","service.version":"0.152.0-dev"},"otelcol.component.id":"file_storage/fingerprint","otelcol.component.kind":"extension","original":"/var/lib/otelcol/fingerprint/receiver_filelog_k3s_service","backup":"/var/lib/otelcol/fingerprint/receiver_filelog_k3s_service.2026-05-20T15:26:12.902.backup"}

Documentation

Enhanced existing documentation and added additional notes for related components where applicable, while reviewing code against my own specific configuration.

@briandavis-viz

Contributor Author

Hey @atoulme and @VihasMakwana, this is an additional fix which covers a different scenario, building on top of #41802.

# These lines will be padded with 2 spaces and then inserted directly into the document.
# Use pipe (|) for multiline entries.
subtext: |
  The previous implementation used `defer recover()` in the calling goroutine, which correctly catches

Contributor

I'm not sure if we need all this detail here

component: extension/file_storage

# A brief description of the change. Surround your text with quotes ("") if it needs to start with a backtick (`).
note: Extend the `recreate` option to also recover from bbolt corruption that panics in goroutines spawned by `bbolt.Open`.

Contributor

Can you provide a changelog message that is more focused on the user? For example, describe the situation in which this was a problem.

Contributor Author

How's this?

Comment thread extension/storage/filestorage/README.md Outdated

If the database fails to open due to corruption (resulting in a panic), the corrupted file will be automatically renamed to `{filename}.{ISO 8601 timestamp}.backup` and a new data file will be created from scratch. This allows the collector to continue operating even when encountering certain bbolt panics. If no corruption is detected, the existing database continues to be used normally.
There may still be scenarios where manually removing or renaming the file may be required, and this feature flag is not a panacea for all bbolt panics you can encounter.
**How it works.** bbolt can panic in two different ways during `bbolt.Open` when the database is corrupt:

Contributor

This provides too many internal details that I guess are not interesting for users. I think we should keep the same kind of documentation as before.

Contributor Author

Updated to end-user language.

if errors.As(runErr, &exitErr) {
	switch exitErr.ExitCode() {
	case precheckExitPanic:
		return errDBCorruption

Contributor

Can this happen because of something different? My concern is that this exit code could be caused by something unrelated, and we would end up losing data.

@swiatekm left a comment

Contributor

I would really prefer to avoid this precheck hack. Is there no other binary we can use to probe this? The bbolt cli has some recovery related commands in it: https://github.com/etcd-io/bbolt/tree/main/cmd/bbolt/command. Surely one of them is immune to this panic.

@briandavis-viz

Contributor Author

> I would really prefer to avoid this precheck hack. Is there no other binary we can use to probe this? The bbolt cli has some recovery related commands in it: etcd-io/bbolt@main/cmd/bbolt/command. Surely one of them is immune to this panic.

@swiatekm Before submitting this, I used `bbolt check <filename>` (https://github.com/etcd-io/bbolt/blob/main/cmd/bbolt/command/command_check.go) to validate the file. It output the same information as the panic seen when running within otel.

With feedback from both of you, I realized I can import bbolt and perform the check in-process, which simplifies things a bit and should address @iblancasa's concern as well. However, there's a problem: the current version of bbolt doesn't support in-process checks.

Our friend @VihasMakwana appears to have already fixed these issues upstream to support this, but the fixes don't look to be released yet, and consequently the dependency isn't updated in otel yet.

etcd-io/bbolt#1153 - moves the freepages panic from the spawned goroutine to the parent function so recover() works.
etcd-io/bbolt#1164 - adds defer recover() to tx.check, converting the panic into a panicked{} channel error.

From my current understanding, we can't do this until that has changed.

I have the code ready to go for when it is updated in https://github.com/open-telemetry/opentelemetry-collector-contrib/compare/main...briandavis-viz:opentelemetry-collector-contrib:wait-for-v1.5.x?expand=1

@VihasMakwana / @swiatekm / @iblancasa, how would you like to move forward?

@iblancasa

Contributor

I think we should add something to support the last released version until etcd-io/bbolt#1190 is finished. We can create a follow-up PR to migrate when that happens.

@swiatekm

Contributor

If what we need right now is spawning a collector subprocess with a secret init function, then I'm against it and would rather wait for the bbolt release.

@VihasMakwana

Contributor

@swiatekm commented Jun 2, 2026

Contributor

> How do you guys feel to pinning bbolt version to https://github.com/etcd-io/bbolt/releases/tag/v1.5.0-rc.0 or https://github.com/etcd-io/bbolt/releases/tag/v1.5.0-beta.0?

I'd rather wait until there's a final release. But we can definitely work on PRs using the beta in the meantime.

5 participants