[extension/filestorage] Recover from add. bbolt goroutine panic#48565
[extension/filestorage] Recover from add. bbolt goroutine panic#48565briandavis-viz wants to merge 7 commits into
Conversation
# Conflicts: # extension/storage/filestorage/extension.go
|
Hey @atoulme and @VihasMakwana, this is an additional fix which covers a different scenario, building on top of #41802. |
| # These lines will be padded with 2 spaces and then inserted directly into the document. | ||
| # Use pipe (|) for multiline entries. | ||
| subtext: | | ||
| The previous implementation used `defer recover()` in the calling goroutine, which correctly catches |
There was a problem hiding this comment.
I'm not sure if we need all this detail here
| component: extension/file_storage | ||
|
|
||
| # A brief description of the change. Surround your text with quotes ("") if it needs to start with a backtick (`). | ||
| note: Extend the `recreate` option to also recover from bbolt corruption that panics in goroutines spawned by `bbolt.Open`. |
There was a problem hiding this comment.
Can you provide a changelog message more focused on user? Like under which situation this was a problem or something like that.
|
|
||
| If the database fails to open due to corruption (resulting in a panic), the corrupted file will be automatically renamed to `{filename}.{ISO 8601 timestamp}.backup` and a new data file will be created from scratch. This allows the collector to continue operating even when encountering certain bbolt panics. If no corruption is detected, the existing database continues to be used normally. | ||
| There may still be scenarios where manually removing or renaming the file may be required, and this feature flag is not a panacea for all bbolt panics you can encounter. | ||
| **How it works.** bbolt can panic in two different ways during `bbolt.Open` when the database is corrupt: |
There was a problem hiding this comment.
This provides too many internal details that I guess are not interesting for users. I think we should keep the same kind of documentation as before.
There was a problem hiding this comment.
Updated to end-user language.
| if errors.As(runErr, &exitErr) { | ||
| switch exitErr.ExitCode() { | ||
| case precheckExitPanic: | ||
| return errDBCorruption |
There was a problem hiding this comment.
Can this happen because something different? My concern is that this can be because something unrelated and we end up losing data.
swiatekm
left a comment
There was a problem hiding this comment.
I would really prefer to avoid this precheck hack. Is there no other binary we can use to probe this? The bbolt cli has some recovery related commands in it: https://github.com/etcd-io/bbolt/tree/main/cmd/bbolt/command. Surely one of them is immune to this panic.
@swiatekm Before submitting this, I leveraged With both of your feedback, I realized I can import bbolt and perform the check in-process, which simplifies things a bit and should address @iblancasa's concern as well. However, there's a problem. The current versioned of bbolt doesn't support in-process checks right now... Our friend @VihasMakwana looks to have already fixed these issues upstream to support this, but it doesn't look like it's released, and subsequently isn't updated in otel yet. etcd-io/bbolt#1153 - moves the freepages panic from the spawned goroutine to the parent function so recover() works. From my current understanding, we can't do this until that has changed. I have the code ready to go for when it is updated in https://github.com/open-telemetry/opentelemetry-collector-contrib/compare/main...briandavis-viz:opentelemetry-collector-contrib:wait-for-v1.5.x?expand=1 @VihasMakwana / @swiatekm / @iblancasa, how would you like to move forward? |
|
I think we should add something to support last released version until etcd-io/bbolt#1190 is finished. We can create a follow up pr to migrate when that happens. |
|
If what we need right now is spawning a collector subprocess with a secret init function, then I'm against and would rather wait for bbolt release. |
|
How do you guys feel to pinning bbolt version to https://github.com/etcd-io/bbolt/releases/tag/v1.5.0-rc.0 or https://github.com/etcd-io/bbolt/releases/tag/v1.5.0-beta.0? |
I'd rather wait until there's a final release. But we can definitely work on PRs using the beta in the meantime. |
Description
Extends the recreate option to recover from bbolt corruption that panics in goroutines spawned by bbolt.Open (notably the freepages "multiple references" panic). Adds a subprocess pre-check that isolates such panics, while keeping the existing defer/recover for main-goroutine panics.
Link to tracking issue
Additional fix for #35899.
Testing
Ran this on the host that was actively having the issues, and it worked as intended for renaming the corrupted file.
Documentation
Enhanced existing documentation and added additional notes for related components where applicable, while reviewing code against my own specific configuration.