Add WAL Replay support for crash recovery #1954

Merged
jmacd merged 27 commits into open-telemetry:main from AaronRM:wal-replay
Feb 6, 2026

Contributor

@AaronRM AaronRM commented Feb 4, 2026

Change Summary

Add WAL replay support for crash recovery in Quiver. On engine startup, QuiverEngine::open() now replays any WAL entries that were written but not yet finalized to segments, so data that reached the WAL before a crash is recovered. The implementation includes a new MultiFileWalReader that reads entries across rotated WAL files in global position order, and a ReplayBundle type that decodes WAL entries back into RecordBundle implementations for replay through the normal ingest path. The replay logic respects the persisted cursor to skip already-finalized entries. It handles truncated entries (crash mid-write) and corrupted entries (CRC mismatch) by stopping replay at the first invalid entry rather than failing startup.
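
For readers unfamiliar with the pattern, the replay rules described above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: `WalEntry`, `StopReason`, `ReplayOutcome`, `entry`, and `replay` are hypothetical names, the entry layout is invented, CRC-32 (IEEE) stands in for whichever checksum Quiver actually uses, and the cursor is assumed to be the position of the first entry that has not been finalized.

```rust
/// One WAL record as seen by this sketch (hypothetical layout, not Quiver's format).
#[derive(Debug, Clone)]
pub struct WalEntry {
    /// Global position of the entry across all WAL files.
    pub position: u64,
    /// Payload bytes; `None` models an entry truncated by a crash mid-write.
    pub payload: Option<Vec<u8>>,
    /// Checksum stored next to the payload when it was written.
    pub stored_crc: u32,
}

/// Bitwise CRC-32 (IEEE, reflected polynomial 0xEDB88320); an assumed stand-in checksum.
pub fn crc32(data: &[u8]) -> u32 {
    let mut crc = 0xFFFF_FFFFu32;
    for &byte in data {
        crc ^= u32::from(byte);
        for _ in 0..8 {
            // All-ones mask when the low bit is set, zero otherwise.
            let mask = (crc & 1).wrapping_neg();
            crc = (crc >> 1) ^ (0xEDB8_8320 & mask);
        }
    }
    !crc
}

/// Builds a well-formed entry whose stored checksum matches its payload.
pub fn entry(position: u64, payload: &[u8]) -> WalEntry {
    WalEntry { position, payload: Some(payload.to_vec()), stored_crc: crc32(payload) }
}

/// Why replay stopped.
#[derive(Debug, PartialEq, Eq)]
pub enum StopReason {
    EndOfLog,
    Truncated,
    CrcMismatch,
}

/// Positions that were replayed, plus the reason replay ended.
#[derive(Debug)]
pub struct ReplayOutcome {
    pub replayed: Vec<u64>,
    pub stop: StopReason,
}

/// Replays entries at or after `cursor`, stopping (not failing) at the first invalid entry.
pub fn replay(entries: &[WalEntry], cursor: u64) -> ReplayOutcome {
    let mut replayed = Vec::new();
    for e in entries {
        // Entries before the cursor are assumed to be finalized already.
        if e.position < cursor {
            continue;
        }
        let payload = match &e.payload {
            Some(bytes) => bytes,
            None => return ReplayOutcome { replayed, stop: StopReason::Truncated },
        };
        if crc32(payload) != e.stored_crc {
            return ReplayOutcome { replayed, stop: StopReason::CrcMismatch };
        }
        // The real engine would feed the decoded bundle through the ingest path here.
        replayed.push(e.position);
    }
    ReplayOutcome { replayed, stop: StopReason::EndOfLog }
}
```

In this sketch an invalid entry ends replay but still returns everything recovered before it, which mirrors the stated goal of not failing startup.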

What issue does this PR close?

How are these changes tested?

  • Added unit tests for MultiFileWalReader covering single-file reads, multi-file iteration, mid-stream starts, and WAL position preservation
  • Added unit tests for ReplayBundle verifying IPC payload decoding, multi-slot reconstruction, timestamp handling, and error cases
  • Added tests for end-to-end WAL replay scenarios including recovery of unfinalized bundles, cursor-based deduplication, empty/missing WAL handling, segment finalization during replay, multi-file replay after rotation, and graceful recovery from truncated and corrupted WAL entries
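
The multi-file and mid-stream cases in the list above could be pictured with a sketch like the one below. It is an illustration under assumptions, not the MultiFileWalReader from this PR: `WalFile` and `read_from` are hypothetical, each file is modeled as a list of global positions, a lower rotation id is assumed to mean an older file, and an unreadable rotated file is skipped with a warning on stderr.

```rust
/// A rotated WAL file in this sketch (hypothetical model).
pub struct WalFile {
    /// Assumed ordering key: lower ids are older files.
    pub rotation_id: u64,
    /// Global positions stored in the file, or an error if it cannot be opened.
    pub entries: Result<Vec<u64>, String>,
}

/// Returns `(rotation_id, position)` pairs at or after `start`, in global position order.
pub fn read_from(files: &[WalFile], start: u64) -> Vec<(u64, u64)> {
    // Directory listings are not guaranteed to be ordered, so sort by rotation id first.
    let mut ordered: Vec<&WalFile> = files.iter().collect();
    ordered.sort_by_key(|f| f.rotation_id);

    let mut out = Vec::new();
    for file in ordered {
        match &file.entries {
            Ok(positions) => {
                // A mid-stream start simply drops positions before `start`.
                out.extend(
                    positions
                        .iter()
                        .copied()
                        .filter(|&p| p >= start)
                        .map(|p| (file.rotation_id, p)),
                );
            }
            Err(err) => {
                // Skip the unreadable file instead of aborting the whole read.
                eprintln!(
                    "failed to open rotated WAL file (rotation_id={}), skipping: {}",
                    file.rotation_id, err
                );
            }
        }
    }
    out
}
```

Skipping an unreadable file leaves a gap in the positions returned; whether a gap should end replay is a policy choice this sketch does not make.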

Are there any user-facing changes?

No.

@github-actions github-actions Bot added the rust Pull requests that update Rust code label Feb 4, 2026

codecov Bot commented Feb 4, 2026

Codecov Report

❌ Patch coverage is 90.53926% with 100 lines in your changes missing coverage. Please review.
✅ Project coverage is 85.61%. Comparing base (0060032) to head (d01da2a).
⚠️ Report is 12 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1954      +/-   ##
==========================================
+ Coverage   85.44%   85.61%   +0.16%     
==========================================
  Files         509      511       +2     
  Lines      157876   161437    +3561     
==========================================
+ Hits       134905   138220    +3315     
- Misses      22437    22683     +246     
  Partials      534      534              
Components            Coverage Δ
otap-dataflow         87.38% <90.53%> (+0.19%) ⬆️
query_abstraction     80.61% <ø> (ø)
query_engine          90.23% <ø> (ø)
syslog_cef_receivers  ∅ <ø> (∅)
otel-arrow-go         53.50% <ø> (ø)
quiver                91.61% <90.53%> (-0.02%) ⬇️

@AaronRM AaronRM marked this pull request as ready for review February 5, 2026 01:10
@AaronRM AaronRM requested a review from a team as a code owner February 5, 2026 01:10
Member

@lalitb lalitb left a comment

Looks good overall. Only concern is the stale/missing sidecar edge cases; they’re rare, but worth a follow-up.

Contributor Author

AaronRM commented Feb 5, 2026

> Looks good overall. Only concern is the stale/missing sidecar edge cases; they’re rare, but worth a follow-up.

Thanks @lalitb! I've incorporated your feedback and added tests for the edge cases around a corrupt or missing cursor sidecar.

Comment on lines +778 to +783
tracing::warn!(
path = %path.display(),
rotation_id,
error = %e,
"failed to open rotated WAL file, skipping"
);
Contributor

@andborja @cijothomas Re #1973
I see why @AaronRM is using tracing directly.
This may require us to separate the macros so both can use them or @andborja to come up with a fallback for when messages do not have your preferred instrumentation. Otherwise, you will have a different syntax for event_name from this kind of statement.

                    otel_warn!(
                        FAILED_TO_OPEN_WAL_ROTATED,
                        path = %path.display(),
                        rotation_id,
                        error = %e,
                        "failed to open rotated WAL file, skipping"
                    );

Member

> see why @AaronRM is using tracing directly.

Oh, why? Why not use the macros we have?

Contributor Author

@cijothomas One of the design goals for quiver was to preserve the option to live as an independent crate after incubating in otel-arrow. To preserve that option, we wouldn't want to take a compile-time dependency on otap-dataflow macros from quiver.

We could refactor such that quiver doesn't use tracing directly, but instead invokes callbacks provided at runtime by the calling context (e.g. the durable_buffer processor could provide callbacks to the otel macros).
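
A rough sketch of what that callback approach might look like, using only the standard library. Nothing here is an existing quiver or otap-dataflow API: `Level`, `EventSink`, `Engine`, `collecting_sink`, and the event names are all hypothetical, and a real integration would have the caller's closure forward to its own logging macros.

```rust
use std::sync::{Arc, Mutex};

/// Severity passed to the caller-provided sink (hypothetical).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Level {
    Info,
    Warn,
}

/// Callback supplied at runtime: (level, event name, message).
pub type EventSink = Arc<dyn Fn(Level, &'static str, &str) + Send + Sync>;

/// Stand-in for a library type that reports events without depending on a logging crate.
pub struct Engine {
    sink: EventSink,
}

impl Engine {
    pub fn new(sink: EventSink) -> Self {
        Self { sink }
    }

    /// Pretends to open a rotated WAL file; `readable` simulates the outcome.
    pub fn open_rotated(&self, path: &str, readable: bool) -> bool {
        if readable {
            (self.sink)(Level::Info, "wal.rotated.opened", &format!("opened rotated WAL file {path}"));
        } else {
            (self.sink)(
                Level::Warn,
                "wal.rotated.open_failed",
                &format!("failed to open rotated WAL file {path}, skipping"),
            );
        }
        readable
    }
}

/// Recorded events, shared between the sink and whoever inspects them.
pub type EventLog = Arc<Mutex<Vec<(Level, String, String)>>>;

/// A sink that records events in memory, e.g. for tests; a host could log instead.
pub fn collecting_sink() -> (EventSink, EventLog) {
    let log: EventLog = Arc::new(Mutex::new(Vec::new()));
    let captured = Arc::clone(&log);
    let sink: EventSink = Arc::new(move |level: Level, name: &'static str, msg: &str| {
        captured
            .lock()
            .unwrap()
            .push((level, name.to_string(), msg.to_string()));
    });
    (sink, log)
}
```

The trade-off in this design is that the library gives up structured fields unless the callback signature carries them, so the signature would likely need more thought than this sketch gives it.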

Member

Makes sense! otap-dataflow has its own guidance (https://github.com/open-telemetry/otel-arrow/blob/main/rust/otap-dataflow/docs/telemetry/events-guide.md#event-naming), but it cannot enforce that guidance on the libraries it uses (such as tonic). So quiver is like any other library and may not follow otap-dataflow guidance. But if you can, follow the same guidance. Specifically, use Events (aka Logs) with an EventName.

@jmacd jmacd added this pull request to the merge queue Feb 6, 2026
Merged via the queue into open-telemetry:main with commit e95eee9 Feb 6, 2026
62 checks passed
@AaronRM AaronRM deleted the wal-replay branch February 6, 2026 18:47
Development

Successfully merging this pull request may close these issues.

[otap-df-quiver] WAL entries not replayed on restart

5 participants