Add WAL Replay support for crash recovery #1954
Codecov Report
❌ Patch coverage is
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1954      +/-   ##
==========================================
+ Coverage   85.44%   85.61%   +0.16%
==========================================
  Files         509      511       +2
  Lines      157876   161437    +3561
==========================================
+ Hits       134905   138220    +3315
- Misses      22437    22683     +246
  Partials      534      534
… by wal_position_start
lalitb left a comment
Looks good overall. Only concern is the stale/missing sidecar edge cases; they’re rare, but worth a follow-up.
Thanks @lalitb! I've incorporated your feedback and also added some additional tests for the edge cases around corrupt & missing cursor sidecar.
tracing::warn!(
    path = %path.display(),
    rotation_id,
    error = %e,
    "failed to open rotated WAL file, skipping"
);
@andborja @cijothomas Re #1973
I see why @AaronRM is using tracing directly.
This may require us to separate the macros so both can use them or @andborja to come up with a fallback for when messages do not have your preferred instrumentation. Otherwise, you will have a different syntax for event_name from this kind of statement.
otel_warn!(
    FAILED_TO_OPEN_WAL_ROTATED,
    path = %path.display(),
    rotation_id,
    error = %e,
    "failed to open rotated WAL file, skipping"
);
> I see why @AaronRM is using tracing directly.
Oh, why? Why not use the macros we have?
@cijothomas One of the design goals for quiver was to preserve the option to live as an independent crate after incubating in otel-arrow. To preserve that option, we wouldn't want to take a compile-time dependency on otap-dataflow macros from quiver.
We could refactor such that quiver doesn't use tracing directly, but instead invokes callbacks provided at runtime by the calling context (e.g. the durable_buffer processor could provide callbacks to the otel macros).
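A minimal sketch of what such a runtime callback could look like, assuming a hypothetical `Event` struct, `EventSink` callback type, and `WalScanner` owner (none of these names exist in quiver today; only the event name and message are taken from the snippet above). The library emits structured events through the callback and never links against a logging macro; the host decides how to forward them.

```rust
use std::sync::{Arc, Mutex};

/// Severity of a diagnostic event (hypothetical type, for illustration only).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Level {
    Warn,
    Error,
}

/// A structured event the library hands to the host instead of logging it.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct Event {
    pub level: Level,
    pub name: &'static str,
    pub message: &'static str,
    pub fields: Vec<(&'static str, String)>,
}

/// Callback supplied at runtime by the calling context (assumed shape).
pub type EventSink = Arc<dyn Fn(&Event) + Send + Sync>;

/// Hypothetical library component that reports problems through the sink.
pub struct WalScanner {
    sink: EventSink,
}

impl WalScanner {
    pub fn new(sink: EventSink) -> Self {
        Self { sink }
    }

    /// Reports a rotated WAL file that could not be opened.
    pub fn report_open_failure(&self, path: &str, rotation_id: u64, error: &str) {
        (self.sink)(&Event {
            level: Level::Warn,
            name: "FAILED_TO_OPEN_WAL_ROTATED",
            message: "failed to open rotated WAL file, skipping",
            fields: vec![
                ("path", path.to_string()),
                ("rotation_id", rotation_id.to_string()),
                ("error", error.to_string()),
            ],
        });
    }
}

/// Example host-side sink that only collects events in memory.
/// A real host could forward each event to its own logging macros instead.
pub fn collecting_sink() -> (EventSink, Arc<Mutex<Vec<Event>>>) {
    let seen = Arc::new(Mutex::new(Vec::new()));
    let seen_for_sink = Arc::clone(&seen);
    let sink: EventSink = Arc::new(move |event: &Event| {
        seen_for_sink.lock().unwrap().push(event.clone());
    });
    (sink, seen)
}
```

With this shape, the durable_buffer processor would pass in a sink that maps `Event::name` to its own event naming, while a standalone user of the crate could pass a sink that calls tracing or does nothing.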
Makes sense! otap-dataflow has its own guidance (https://github.com/open-telemetry/otel-arrow/blob/main/rust/otap-dataflow/docs/telemetry/events-guide.md#event-naming), but it cannot enforce that guidance on the libraries it uses (like tonic, etc.). So quiver is like any other library and may not follow otap-dataflow guidance. But if you can, use the same guidance. Specifically, use Events (aka Logs) with an EventName.
Change Summary
Add WAL replay support for crash recovery in Quiver. On engine startup, QuiverEngine::open() now replays any WAL entries that were written but not yet finalized to segments, so that data which reached the WAL before a crash is recovered. The implementation includes a new MultiFileWalReader that reads entries across rotated WAL files in global position order, and a ReplayBundle type that decodes WAL entries back into RecordBundle implementations for replay through the normal ingest path. The replay logic respects the persisted cursor to skip already-finalized entries and handles edge cases like truncated entries (crash mid-write) and corrupted entries (CRC mismatch) by stopping replay at the first invalid entry rather than failing startup.
What issue does this PR close?
How are these changes tested?
Are there any user-facing changes?
No.
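For readers skimming the Change Summary, the replay behavior it describes can be sketched roughly as follows. This is an illustration under stated assumptions, not the code in this PR: the entry framing (`[len][crc32][payload]`), the `WalFile` struct, the `StopReason` enum, and the exact cursor semantics (an entry is skipped when its end position is at or before the cursor) are all hypothetical stand-ins.

```rust
/// Bitwise CRC-32 (IEEE, reflected polynomial 0xEDB88320).
fn crc32(data: &[u8]) -> u32 {
    let mut crc = 0xFFFF_FFFFu32;
    for &byte in data {
        crc ^= byte as u32;
        for _ in 0..8 {
            let mask = (crc & 1).wrapping_neg();
            crc = (crc >> 1) ^ (0xEDB8_8320 & mask);
        }
    }
    !crc
}

/// Assumed framing: [payload_len: u32 LE][crc32(payload): u32 LE][payload].
fn encode_entry(payload: &[u8]) -> Vec<u8> {
    let mut out = Vec::with_capacity(8 + payload.len());
    out.extend_from_slice(&(payload.len() as u32).to_le_bytes());
    out.extend_from_slice(&crc32(payload).to_le_bytes());
    out.extend_from_slice(payload);
    out
}

/// One rotated WAL file, modeled in memory (hypothetical type).
struct WalFile {
    /// Global position of the first byte of this file.
    start_position: u64,
    bytes: Vec<u8>,
}

#[derive(Debug, PartialEq, Eq)]
enum StopReason {
    EndOfLog,
    Truncated,
    CrcMismatch,
}

/// Replays payloads past `cursor` in global position order and stops at the
/// first invalid entry instead of returning an error.
fn replay(mut files: Vec<WalFile>, cursor: u64) -> (Vec<Vec<u8>>, StopReason) {
    files.sort_by_key(|f| f.start_position);
    let mut replayed = Vec::new();
    for file in &files {
        let mut offset = 0usize;
        while offset < file.bytes.len() {
            let rest = &file.bytes[offset..];
            if rest.len() < 8 {
                return (replayed, StopReason::Truncated); // header cut short
            }
            let len = u32::from_le_bytes([rest[0], rest[1], rest[2], rest[3]]) as usize;
            let crc = u32::from_le_bytes([rest[4], rest[5], rest[6], rest[7]]);
            if rest.len() < 8 + len {
                return (replayed, StopReason::Truncated); // payload cut short
            }
            let payload = &rest[8..8 + len];
            if crc32(payload) != crc {
                return (replayed, StopReason::CrcMismatch);
            }
            let end_position = file.start_position + (offset + 8 + len) as u64;
            if end_position > cursor {
                replayed.push(payload.to_vec()); // not yet finalized
            }
            offset += 8 + len;
        }
    }
    (replayed, StopReason::EndOfLog)
}
```

In this sketch everything decoded before the invalid entry is still returned, which mirrors the stated choice of stopping replay rather than failing startup.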