pruner: fall back to disk snapshot root when journal is missing#300
Conversation
When geth is killed uncleanly (SIGKILL before BlockChain.Stop writes the snapshot journal), prune-state fails with:

WARN Loaded snapshot journal diskroot=XXX diffs=missing
ERROR head doesn't match snapshot: have XXX, want YYY

NewPruner now reads the persisted disk snapshot root via rawdb.ReadSnapshotRoot and retries snapshot initialisation with that root when the normal head-based init fails. Prune() then uses the disk root as the pruning target directly, bypassing the requirement for 128 in-memory diff layers that cannot exist when the journal was not written.

Normal flow (clean shutdown, journal present) is unchanged.

Made-with: Cursor
📝 Walkthrough

Pruner gained a new …
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant NewPruner
    participant DB
    participant SnapTree
    participant Prune
    NewPruner->>SnapTree: try create snapshot tree
    alt success
        SnapTree-->>NewPruner: snapshot tree ready
    else failure
        NewPruner->>DB: read persisted disk-layer root
        alt diskRoot non-empty
            DB-->>NewPruner: diskRoot
            NewPruner->>SnapTree: reconstruct from diskRoot
            SnapTree-->>NewPruner: snapshot tree restored
        else empty
            DB-->>NewPruner: empty -> return error
        end
    end
    Note right of Prune: Later pruning flow
    Prune->>SnapTree: get current DiskRoot()
    alt DiskRoot == targetRoot
        Prune-->>Prune: skip Cap (no abort)
    else
        Prune->>SnapTree: perform Cap and prune steps
    end
```
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~20 minutes
🧹 Nitpick comments (1)
core/state/pruner/pruner.go (1)
324-330: Minor: Log message could be more precise for the diskRoot case.

When diskRoot is used and the trie node exists, the code reaches line 328 and logs "Selecting user-specified state as the pruning target" because layers is empty. This is technically correct but slightly misleading, since the root came from the diskRoot fallback, not from user specification.

This is a minor cosmetic issue, since lines 276-277 already log the diskRoot case explicitly, making it clear in the overall log output.
🔧 Optional: More precise logging
```diff
 } else {
 	if len(layers) > 0 {
 		log.Info("Selecting bottom-most difflayer as the pruning target", "root", root, "height", p.headHeader.Number.Uint64()-127)
+	} else if p.diskRoot != (common.Hash{}) {
+		log.Info("Confirmed disk snapshot root as pruning target", "root", root)
 	} else {
 		log.Info("Selecting user-specified state as the pruning target", "root", root)
 	}
 }
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@core/state/pruner/pruner.go` around lines 324 - 330, The log message is misleading when the trie root came from the diskRoot fallback; update the else branch that currently checks "if len(layers) > 0" to distinguish the diskRoot case by checking the existing diskRoot (or add a diskRootUsed boolean) and log e.g. "Selecting diskRoot fallback as the pruning target" when diskRoot was used and layers is empty, otherwise keep the "user-specified state" message; refer to symbols layers, diskRoot, root and p.headHeader.Number for locating and implementing the conditional logging change.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: bfddd551-fb26-4ba6-834f-01d5e6691ffe
📒 Files selected for processing (1)
core/state/pruner/pruner.go
Two follow-up fixes to the journal-missing fallback (56ae344):

1. Skip snaptree.Cap(root, 0) when root is already the disk layer. Cap requires a diffLayer as its target; calling it on a disk-layer-only tree (which is exactly what the fallback produces) returns "snapshot is disk layer" and aborts after all the heavy bloom-filter and DB-sweep work is done. Guard with DiskRoot() != root.

2. Add log lines around the fallback snapshot.New() call to make it visible when snapshot generation must be resumed (async=false blocks until generation finishes, which can take hours for large state).
Code review

Found 1 issue:

go-ethereum/core/state/pruner/pruner.go, lines 271 to 277 in 56ae344
go-ethereum/core/state/pruner/pruner.go, lines 188 to 193 in 56ae344

The error is thrown here, in go-ethereum/core/state/snapshot/snapshot.go, lines 375 to 379 in 56ae344.

A possible fix is to guard the Cap call:

```go
if snaptree.DiskRoot() != root {
	if err := snaptree.Cap(root, 0); err != nil {
		return err
	}
}
```

Similarly, …

🤖 Generated with Claude Code - If this code review was useful, please react with 👍. Otherwise, react with 👎.
Thanks for the review! This is a valid finding for the first commit (56ae344).

Cap fix:

```go
if snaptree.DiskRoot() != root {
	if err := snaptree.Cap(root, 0); err != nil {
		return err
	}
}
```

When the fallback path is taken, the snapshot tree has only a disk layer at snapDiskRoot, so Cap would fail with "snapshot is disk layer".

Journal does not need a guard.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@core/state/pruner/pruner.go`:
- Around line 98-119: The snapshot initialization currently treats any error
from snapshot.New(...) as a missing journal and falls back to
rawdb.ReadSnapshotRoot; change this so the fallback to reading snapDiskRoot only
happens when the original error indicates a missing journal (detect that
specific error or sentinel), otherwise return the original error from
snapshot.New; ensure you capture and log the original init error (from
snapshot.New) with context (e.g. include the error when calling
log.Warn/log.Info), and keep the existing flow that uses rawdb.ReadSnapshotRoot,
snaptree assignment, and Prune() only in the genuine journal-missing branch (use
functions snapshot.New, rawdb.ReadSnapshotRoot, snaptree, headBlock.Root() as
anchors).
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 5122485d-1bfd-4704-8874-f6f45a814fde
📒 Files selected for processing (1)
core/state/pruner/pruner.go
```diff
 	if err != nil {
-		return nil, err // The relevant snapshot(s) might not exist
+		// The snapshot journal may be missing because geth was not shut down
+		// cleanly (SIGKILL before BlockChain.Stop could write the journal).
+		// Fall back: initialise the snapshot tree with the persisted disk
+		// snapshot root so that Prune() can still target that state.
+		snapDiskRoot = rawdb.ReadSnapshotRoot(db)
+		if snapDiskRoot == (common.Hash{}) {
+			return nil, err // No snapshot at all — nothing we can do.
+		}
+		log.Warn("Snapshot journal missing, falling back to snapshot disk-layer root",
+			"snapDiskRoot", snapDiskRoot, "chainHead", headBlock.Root())
+		// If the snapshot was mid-generation when the node was killed, New will
+		// resume and wait for generation to finish (async=false). This can take
+		// a long time for large state; the log below makes that visible.
+		log.Info("Loading snapshot from disk-layer root (may wait for snapshot generation to finish)...",
+			"snapDiskRoot", snapDiskRoot)
+		snaptree, err = snapshot.New(db, trie.NewDatabase(db), 256, snapDiskRoot, false, false, false)
+		if err != nil {
+			return nil, err
+		}
+		log.Info("Snapshot ready", "snapDiskRoot", snapDiskRoot)
 	}
```
Don’t treat every snapshot init failure as “missing journal”.
At Line 98, fallback is triggered for any snapshot.New(...) error. That can hide unrelated snapshot failures and route pruning through a stale/incorrect recovery path. Gate fallback to the journal-missing case only, and log the original init error.
Suggested hardening
```diff
 snaptree, err := snapshot.New(db, trie.NewDatabase(db), 256, headBlock.Root(), false, false, false)
 var snapDiskRoot common.Hash
 if err != nil {
+	// Only fallback when journal is missing; keep other init errors fatal.
+	// Prefer a typed/sentinel error check if snapshot package exposes one.
+	if !strings.Contains(strings.ToLower(err.Error()), "journal") {
+		return nil, err
+	}
 	snapDiskRoot = rawdb.ReadSnapshotRoot(db)
 	if snapDiskRoot == (common.Hash{}) {
 		return nil, err // No snapshot at all — nothing we can do.
 	}
 	log.Warn("Snapshot journal missing, falling back to snapshot disk-layer root",
-		"snapDiskRoot", snapDiskRoot, "chainHead", headBlock.Root())
+		"snapDiskRoot", snapDiskRoot, "chainHead", headBlock.Root(), "initErr", err)
```
The fallback is intentionally broad — in practice, the only scenario where snapshot.New(headBlock.Root()) fails but snapshot.New(snapDiskRoot) succeeds is the journal-missing case. For other failures (e.g. DB corruption), the second snapshot.New call will also fail and the error will surface. No behavior change is needed.
@curryxbo, that's a fair point — the second snapshot.New call acts as a natural filter for non-journal failures (DB corruption, missing trie, etc.), and the error will surface from there.
The one residual risk worth noting: if the first call fails for a reason unrelated to the journal (e.g., a partial/inconsistent state around headBlock.Root()), but snapDiskRoot is an older yet structurally intact root, the second call could succeed and pruning would proceed silently against that stale target rather than surfacing the original error. This is admittedly low-probability in practice.
If you're comfortable accepting that edge case given the simplicity of the current approach, that's a reasonable call. Acknowledged — no change needed.
Remove the snapshot journal fallback introduced in PR #300.

The fallback was designed to handle unclean shutdown (SIGKILL) before prune-state, but this is not triggered in normal operation: pm2 stop, systemctl stop, and kill all send SIGINT/SIGTERM, which geth handles gracefully, writing the snapshot journal via BlockChain.Stop().

The correct workflow if the node crashed is to restart it first (which rebuilds the snapshot with rebuild=true), shut down cleanly, then prune. Running prune-state directly after a crash is unsafe regardless.

Restoring upstream geth behavior: if snapshot.New fails, return the error and require the user to restart the node cleanly before pruning.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Now that BlockChain.Stop() correctly translates zkRoot to mptRoot before calling snaps.Journal() (previous commit), the snapshot journal is always written on clean shutdown. The PR #300 fallback (snapDiskRoot, Cap skip, 128-layer bypass) was a workaround for the journal silently failing to write — that root cause is fixed, so the workaround is no longer needed.

Restore the upstream geth pruner behavior:
- snapshot.New fails → return error (user should restart node first)
- Always require 128 diff layers for the standard HEAD-127 target
- Always call Cap(root, 0) without conditional skip

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
PR #300 added a fallback path in NewPruner that retried snapshot initialisation with the persisted disk snapshot root when the journal was missing, and bypassed the 128-layer requirement in Prune(). This was a workaround for symptoms caused by ZK/MPT root mismatches and missing diff layers, not a fix for the root cause.

Revert pruner.go to upstream geth behavior. The actual root causes are addressed in the following commit.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
* ci: support multi-platform Docker image build (amd64 + arm64) (#298)
  * ci: support multi-platform Docker image build (amd64 + arm64)
    Use docker/build-push-action with QEMU and buildx to build multi-arch images. Mac arm64 users can now pull and run the image natively.
    Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
  * ci: add workflow_dispatch for manual Docker image build
    Allow manually triggering the Docker build from GitHub Actions UI with a tag name input, useful for re-building existing tags.
    Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
  * ci: fix incorrect COMMIT and VERSION on manual dispatch
    Use git rev-parse HEAD for COMMIT and stripped version for VERSION build-arg, so they are correct in both tag-push and workflow_dispatch.
    Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
  ---------
  Co-authored-by: fletcher.fan <fletcher.fan@bitget.com>
  Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
* Fix RLP decoding for MorphTx (#299)
  * implement version-aware RLP decoding for MorphTx
  * fix morph tx
* pruner: fall back to disk snapshot root when journal is missing (#300)
  * pruner: fall back to disk snapshot root when journal is missing
    When geth is killed uncleanly (SIGKILL before BlockChain.Stop writes the snapshot journal), prune-state fails with:
    WARN Loaded snapshot journal diskroot=XXX diffs=missing
    ERROR head doesn't match snapshot: have XXX, want YYY
    NewPruner now reads the persisted disk snapshot root via rawdb.ReadSnapshotRoot and retries snapshot initialisation with that root when the normal head-based init fails. Prune() then uses the disk root as the pruning target directly, bypassing the requirement for 128 in-memory diff layers that cannot exist when the journal was not written.
    Normal flow (clean shutdown, journal present) is unchanged.
    Made-with: Cursor
  * pruner: fix Cap panic on disk-layer-only tree and add generation wait log
    Two follow-up fixes to the journal-missing fallback (56ae344):
    1. Skip snaptree.Cap(root, 0) when root is already the disk layer. Cap requires a diffLayer as its target; calling it on a disk-layer-only tree (which is exactly what the fallback produces) returns "snapshot is disk layer" and aborts after all the heavy bloom-filter and DB-sweep work is done. Guard with DiskRoot() != root.
    2. Add log lines around the fallback snapshot.New() call to make it visible when snapshot generation must be resumed (async=false blocks until generation finishes, which can take hours for large state).
  * pruner: rename diskRoot to snapDiskRoot to avoid confusion with diskStateRoot
  ---------
  Co-authored-by: corey <corey.zhang@bitget.com>
* Revert "pruner: fall back to disk snapshot root when journal is missing (#300)" (#309)
  This reverts commit b3c5552.
  Co-authored-by: corey <corey.zhang@bitget.com>
* tracers: fix Morph fee-token tracing paths (#308)
  * tracers: fix Morph fee-token tracing paths
    Keep Morph fee-token system calls bracketed consistently, forward system-call hooks through mux tracers, and make traceCall precredit alt-fee balances so tracing matches execution more closely.
    Constraint: Preserve user-visible tracer output while keeping prestate and traceCall behavior correct for Morph fee-token transactions
    Confidence: medium
    Scope-risk: moderate
    Not-tested: Full eth/tracers/internal/tracetest suite still has pre-existing fixture and VM failures on this branch
  * tracers: fix prestateTracer account discovery when DisableStorage is set
  * tracers: harden Morph fee-token trace edge cases
    Prevent flatCallTracer from touching hidden system-call frames and keep traceCall's synthetic fee-token precredits out of prestate views so debug RPCs stay stable and prestate output matches chain state.
    Constraint: Preserve Morph alt-fee trace execution without leaking synthetic prestate or hidden system-call frames
    Confidence: high
    Scope-risk: moderate
  * core: require balanced system-call trace hooks
    Only bracket fee-token helper calls when both start and end hooks are present so partial tracer wiring cannot leak system-call depth across a trace.
    Constraint: Preserve existing V2-over-legacy hook selection while restoring balanced start/end semantics
    Confidence: high
    Scope-risk: narrow
    Not-tested: Full core package outside TestStartSystemCallTrace
* fix: handle nil parameter in morph_diskRoot RPC to prevent panic (#311)
  When morph_diskRoot is called without parameters, blockNrOrHash is nil, causing a nil pointer dereference crash. Default to latest block when no parameter is provided, consistent with other eth RPC methods.
* pruner: use teeWriter and HEAD as prune target, fix genesis root validation (#310)
  * pruner: use teeWriter and HEAD as prune target, fix genesis root validation
    - Add teeWriter to persist trie nodes to disk during GenerateTrie, ensuring pruning works correctly even after unclean shutdowns.
    - Use HEAD directly as the pruning target instead of HEAD-127, eliminating unnecessary height rollback on L2 chains where reorgs don't occur.
    - Resolve genesis root via ReadDiskStateRoot in extractGenesis so that zkTrie roots (overridden via GenesisStateRoot) are correctly mapped to the actual MPT disk root.
    Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
  * handle error
  * remove accidentally committed local test zip
    Keep local-test.zip untracked and out of repository history from this point.
    Made-with: Cursor
  * pruner: enforce genesis disk-root mapping
    Treat missing or invalid disk-state-root mapping for genesis as an explicit error during pruning instead of silently falling back.
    Made-with: Cursor
  ---------
  Co-authored-by: corey <corey.zhang@bitget.com>
  Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
---------
Co-authored-by: fletcher.fan <fletcher.fan@bitget.com>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
Co-authored-by: Segue <huoda.china@163.com>
Co-authored-by: corey <coreyx1992@gmail.com>
Co-authored-by: corey <corey.zhang@bitget.com>
Co-authored-by: panos <pan107104@outlook.com>
1. Purpose or design rationale of this PR
...
2. PR title
Your PR title must follow conventional commits (as we are doing squash merge for each PR), so it must start with one of the following types:
3. Deployment tag versioning
Has the version in params/version.go been updated?

4. Breaking change label

Does this PR have the breaking-change label?