Skip to content

op-node/recover-mode: handle l1 origin close to tip gracefully#18556

Closed
geoknee wants to merge 13 commits intodevelopfrom
gk/recover-mode
Closed

op-node/recover-mode: handle l1 origin close to tip gracefully#18556
geoknee wants to merge 13 commits intodevelopfrom
gk/recover-mode

Conversation

@geoknee
Copy link
Contributor

@geoknee geoknee commented Dec 9, 2025

Closes #18350

Adds an action test showing how recover mode can work to recover a chain, with explanatory notes about how operators should not use span batches. This can be developed into a better runbook for recovering from a sequencing window expiry incident.

Run env.RunFaultProofProgram after computing l2SafeHead
(l2SafeHead.Number.Uint64()/2) and replace the FromGenesis call with
RunFaultProofProgram. Fix minor comment typos and wrap a long log line.
@codecov
Copy link

codecov bot commented Dec 9, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 66.58%. Comparing base (e355cc4) to head (e1bfc22).
⚠️ Report is 5 commits behind head on develop.

❗ There is a different number of reports uploaded between BASE (e355cc4) and HEAD (e1bfc22). Click for more details.

HEAD has 8 uploads less than BASE
Flag BASE (e355cc4) HEAD (e1bfc22)
cannon-go-tests-64 3 1
contracts-bedrock-tests 6 0
Additional details and impacted files
@@             Coverage Diff             @@
##           develop   #18556      +/-   ##
===========================================
- Coverage    75.34%   66.58%   -8.77%     
===========================================
  Files          189       55     -134     
  Lines        11228     4031    -7197     
===========================================
- Hits          8460     2684    -5776     
+ Misses        2622     1203    -1419     
+ Partials       146      144       -2     
Flag Coverage Δ
cannon-go-tests-64 66.58% <ø> (-0.82%) ⬇️
contracts-bedrock-tests ?

Flags with carried forward coverage won't be shown. Click here to find out more.
see 139 files with indirect coverage changes

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

@geoknee geoknee marked this pull request as ready for review December 9, 2025 15:00
@geoknee geoknee requested review from a team as code owners December 9, 2025 15:00
Comment on lines +138 to +144
if errors.Is(err, ethereum.NotFound) {
// If the next origin is not found, it means we are at the end of the chain.
// In this case, we set the next origin to an empty block reference.
los.log.Error("next L1 origin not found, recover mode likely brought L1 origin up to the tip of the chain", "error", err)
los.nextOrigin = eth.L1BlockRef{}
return los.currentOrigin, los.nextOrigin, nil
} else if err != nil {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The action test added in this PR fails without this patch.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The failure is due to this error

    l2_sequencer.go:111:
        	Error Trace:	/Users/georgeknee/code/ethereum-optimism/optimism/op-e2e/actions/helpers/l2_sequencer.go:111
        	            				/Users/georgeknee/code/ethereum-optimism/optimism/op-e2e/actions/proofs/sequence_window_expiry_test.go:90
        	            				/Users/georgeknee/code/ethereum-optimism/optimism/op-e2e/actions/proofs/helpers/matrix.go:47
        	Error:      	Received unexpected error:
        	            	EOF
        	Test:       	Test_ProgramAction_SequenceWindowExpired/HonestClaim-jovian
        	Messages:   	failed to start block building

I'm not sure what this would mean outside of the action test environment, but I think it is probably related to what we saw with #18350.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is just because the action test helper being called requires block building to succeed, and it doesn't because of the unavailability of the next l1 origin.

So it does not really tally with the sequencer drift violation, I haven't yet manage to repro that. But still, it is better to avoid the temporary errors and continue block building with the old origin when we get into this situation.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Modified the test to not get stuck on this, if it can't build a block we just wait until more l1 blocks are available. This means the test actually passes without the patch. But I added an acceptance test which still fails without the patch.

Comment on lines +69 to +70
// Set recover mode on the sequencer:
env.Sequencer.ActSetRecoverMode(t, true)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I manually confirmed that the test fails without recover mode enabled.

Comment on lines +23 to +35
// It seems more difficult (almost impossible) to recover from sequencing window expiry with span batches,
// since the singular batches within are invalidated _atomically_.
// That is to say, if the oldest batch in the span batch fails the sequencing window check
// (l1 origin + seq window < l1 inclusion)
// All following batches are invalidated / dropped as well.
// https://github.com/ethereum-optimism/optimism/blob/73339162d78a1ebf2daadab01736382eed6f4527/op-node/rollup/derive/batches.go#L96-L100
//
// If the same blocks were batched with singular batches, the validation rules are different
// https://github.com/ethereum-optimism/optimism/blob/73339162d78a1ebf2daadab01736382eed6f4527/op-node/rollup/derive/batches.go#L83-L86
// In the case of recover mode, the noTxPool=true condition means autoderviation actually fills
// the gap with identical blocks anyway, meaning the following batches are actually still valid.
bc := helpers.NewBatcherCfg()
bc.ForceSubmitSingularBatch = true
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I manually confirmed the test fails without forcing singular batches.

Replace manual error assertions around FindL1Origin with
requireL1OriginAt and remove the now-unused derive import.
@geoknee geoknee marked this pull request as draft December 9, 2025 22:49
Set default L1 block time to 12s to match action helper assumptions.
Increase sequencer window size in the test to 50. Compute drift from
UnsafeL2 headers (use UnsafeL2.L1Origin). Simplify L1 mining to always
BatchMineAndSync and remove the extra numL1Blocks > 10 lag guard.
switch {
case drift == 0:
t.Fatal("drift is zero, this implies the unsafe l2 head is pinned to the l1 head")
case drift > int(tp.MaxSequencerDrift):
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note that tp.MaxSequencerDrift is probably a complete red herring her since this was changed to a protocol constant with Fjord https://specs.optimism.io/protocol/fjord/derivation.html?highlight=drift#constant-maximum-sequencer-drift.

@geoknee
Copy link
Contributor Author

geoknee commented Dec 12, 2025

In favour of #18589

@geoknee geoknee closed this Dec 12, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

possible bug in op-node sequencer recover mode

1 participant

Comments