This repository was archived by the owner on Jan 16, 2026. It is now read-only.

fix(supervisor/core): l1_exhaust and reorg race condition#2779

Merged
dhyaniarun1993 merged 27 commits into main from fix/reorg_race_condition
Aug 28, 2025

Conversation

@dhyaniarun1993
Collaborator

Closes #2777

@dhyaniarun1993 dhyaniarun1993 self-assigned this Aug 27, 2025
@dhyaniarun1993 dhyaniarun1993 added K-fix Kind: fix W-supervisor Workstream: supervisor labels Aug 27, 2025
@claude
Contributor

claude bot commented Aug 27, 2025

Claude finished @dhyaniarun1993's task


PR Review: Race Condition Fix

Todo List:

  • Read and analyze the modified file (crates/supervisor/core/src/syncnode/node.rs)
  • Understand the race condition being fixed
  • Evaluate code quality and implementation approach
  • Check for potential bugs or issues
  • Assess performance considerations
  • Review security concerns
  • Evaluate test coverage needs
  • Provide comprehensive feedback

Review Summary

This PR fixes a race condition between L1 exhaustion and reorganization events in the supervisor core. The implementation adds defensive checks to ensure that when a parent hash mismatch is detected (indicating a potential reorg), the system validates whether the reorg has been properly handled before proceeding.
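The defensive check described above can be sketched as follows. All types and method names here are illustrative stand-ins, not the actual API in crates/supervisor/core/src/syncnode/node.rs:

```rust
// Hypothetical sketch of the defensive check: on a parent-hash mismatch,
// consult the database before providing the next L1 block.
#[derive(Clone, Debug, PartialEq)]
struct BlockInfo {
    number: u64,
    hash: u64,        // simplified hash type for illustration
    parent_hash: u64,
}

struct DbProvider {
    blocks: Vec<BlockInfo>,
}

impl DbProvider {
    // Look up the stored source block at a given height, if any.
    fn get_source_block(&self, number: u64) -> Option<&BlockInfo> {
        self.blocks.iter().find(|b| b.number == number)
    }
}

// Decide whether it is safe to provide `new_source` to the client:
// if the stored block's hash matches `new_source.parent_hash`, any reorg
// has already been handled; if the block is unknown, the state was
// already rewound. A mismatch means the reorg is still in flight, so the
// L1 block is not provided.
fn should_provide_l1(db: &DbProvider, source_number: u64, new_source: &BlockInfo) -> bool {
    match db.get_source_block(source_number) {
        Some(stored) => stored.hash == new_source.parent_hash,
        None => true,
    }
}
```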

✅ Strengths

  1. Addresses a Real Concurrency Issue: The race condition between l1_exhaust and reorg events is a legitimate concern in blockchain synchronization systems.

  2. Comprehensive Test Coverage: The PR adds 4 new test cases covering all scenarios:

    • Reorg handled correctly (hash match)
    • Source block not found (reorg handled)
    • Hash mismatch (reorg not handled - skips providing L1 block)
    • Database error handling
  3. Defensive Programming: The solution uses a defensive approach by cross-validating database state before proceeding.

  4. Proper Error Handling: Different error conditions are handled appropriately with meaningful log messages.

⚠️ Issues & Concerns

Code Quality

  • crates/supervisor/core/src/syncnode/node.rs:130: Variable name new_source could be more descriptive (e.g., next_l1_block)
  • crates/supervisor/core/src/syncnode/node.rs:133: The debug log message could be more specific about what action will be taken

Potential Race Condition Still Exists

The most significant concern is that this fix may not fully resolve the race condition:

  • crates/supervisor/core/src/syncnode/node.rs:142-154: The check self.db_provider.get_source_block(derived_ref_pair.source.number) introduces a TOCTOU (Time-of-Check-Time-of-Use) vulnerability
  • Between checking the database state and calling self.client.provide_l1(new_source) (line 171), a reorg could still occur
  • The fundamental issue is that there's no synchronization mechanism preventing concurrent reorg processing
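The TOCTOU window can be made concrete with a small sketch (illustrative types only): the state observed at check time can be invalidated by a concurrent reorg before it is used.

```rust
use std::sync::{Arc, Mutex};

// Illustrative TOCTOU window: `check` reads the canonical head, but a
// concurrent reorg can rewrite it before `act_if_unchanged` runs.
#[derive(Clone, Debug, PartialEq)]
struct Canonical {
    number: u64,
    hash: u64,
}

// Time of check: snapshot the current canonical head.
fn check(state: &Arc<Mutex<Canonical>>) -> Canonical {
    state.lock().unwrap().clone()
}

// Time of use: only act if the state observed at check time still holds.
// Returns false when a reorg slipped into the gap.
fn act_if_unchanged(state: &Arc<Mutex<Canonical>>, observed: &Canonical) -> bool {
    *state.lock().unwrap() == *observed
}
```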

Performance Considerations

  • crates/supervisor/core/src/syncnode/node.rs:142: Additional database lookup on every parent hash mismatch adds latency
  • Consider whether this lookup could be optimized or cached

Logic Concerns

  • crates/supervisor/core/src/syncnode/node.rs:144: The condition block.hash != new_source.parent_hash may be overly restrictive in some edge cases
  • crates/supervisor/core/src/syncnode/node.rs:153: Early return without providing L1 block could lead to missed block processing if the timing is just slightly off

🔧 Suggested Improvements

  1. Add Synchronization: Consider using a mutex or other synchronization primitive to ensure atomic operations between reorg handling and L1 block processing.

  2. Retry Logic: Instead of completely skipping L1 block provision, consider implementing a brief retry mechanism with backoff.

  3. Metrics: Add metrics to track how often this race condition occurs in practice.

  4. Documentation: Add inline comments explaining the race condition scenario and why this check is necessary.
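Suggestion 1 could look roughly like the following minimal sketch, where one lock serializes reorg handling and L1 provision. Field and method names are hypothetical, and a real async supervisor would likely use async-aware primitives instead of std::sync::Mutex:

```rust
use std::sync::Mutex;

// Minimal sketch: holding one lock for the whole check-and-update closes
// the check-then-act gap between reorg handling and L1 provision.
struct ChainState {
    head_number: u64,
    head_hash: u64,
}

struct Supervisor {
    state: Mutex<ChainState>,
}

impl Supervisor {
    // Provide the next L1 block only if it extends the current head.
    // The lock makes this atomic with respect to `handle_reorg`.
    fn provide_l1(&self, number: u64, parent_hash: u64, hash: u64) -> bool {
        let mut s = self.state.lock().unwrap();
        if number == s.head_number + 1 && parent_hash == s.head_hash {
            s.head_number = number;
            s.head_hash = hash;
            true
        } else {
            false
        }
    }

    // Rewind to a new canonical head under the same lock.
    fn handle_reorg(&self, number: u64, hash: u64) {
        let mut s = self.state.lock().unwrap();
        s.head_number = number;
        s.head_hash = hash;
    }
}
```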

🛡️ Security Assessment

  • No security vulnerabilities identified
  • The defensive checks actually improve system reliability by preventing inconsistent state

✅ Test Coverage

Excellent test coverage with comprehensive scenarios. The tests properly mock database responses and verify the expected behavior in each case.

📊 Performance Impact

  • Minor Impact: Additional database lookup on parent hash mismatches
  • Benefit: Prevents potentially more expensive reorg resolution conflicts

Overall Assessment

This is a good incremental improvement that adds defensive checks against a real race condition. However, it may not completely eliminate the underlying concurrency issue due to the inherent TOCTOU problem. The implementation is well-tested and follows good error handling practices.

Recommendation: ✅ Approve with suggestions for future enhancement

Consider this a stepping stone toward a more robust solution that might involve architectural changes to ensure atomic reorg and L1 block processing.


@codecov

codecov bot commented Aug 27, 2025

Codecov Report

❌ Patch coverage is 98.55769% with 3 lines in your changes missing coverage. Please review.
✅ Project coverage is 86.0%. Comparing base (3b0a6a2) to head (26d1e55).
⚠️ Report is 2 commits behind head on main.
✅ All tests successful. No failed tests found.

Files with missing lines                         | Patch % | Lines
crates/supervisor/core/src/syncnode/resetter.rs  | 98.5%   | 3 Missing ⚠️

☔ View full report in Codecov by Sentry.

Contributor

@emhane emhane left a comment


ideally we only want info log messages in the main program loop, and anywhere else only trace, debug, warn or error. see how it's done for sigp/discv5.

Contributor

@emhane emhane left a comment


lgtm

@dhyaniarun1993 dhyaniarun1993 marked this pull request as ready for review August 28, 2025 08:50
Copilot AI review requested due to automatic review settings August 28, 2025 08:50

Contributor

Copilot AI left a comment


Pull Request Overview

This PR fixes a race condition between l1_exhaust and reorg events by adding canonical block validation to prevent resets with non-canonical source blocks.

Key Changes

  • Added canonical block validation to the reset process using L1 provider
  • Modified log levels from warn to debug for parent hash mismatches
  • Updated test infrastructure to include L1 provider mocking

Reviewed Changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.

File Description
crates/supervisor/core/src/syncnode/resetter.rs Added L1 provider field and canonical block validation logic with comprehensive test coverage
crates/supervisor/core/src/syncnode/node.rs Updated constructor to pass L1 provider to resetter and adjusted logging levels


#[derive(Debug)]
pub(super) struct Resetter<DB, C> {
    client: Arc<C>,
    l1_provider: RootProvider<Ethereum>,
Contributor


isn't l1_provider essentially an abstraction over db_provider? wondering if we can make the field naming here more helpful or add some docs

Collaborator Author


Nope, those are two different data providers:
db_provider - interface to interact with the supervisor database
l1_provider - interface to interact with the L1 layer

Contributor


ah, i see. but the db also has l1 blocks right? perhaps rename l1_provider to l1_rpc_client?

Collaborator Author


Yep, the database also has L1 blocks, but they are more like a derivation mapping.

Contributor

@emhane emhane left a comment


was meant to be a comment not request change explicitly, my bad

// check if the source of valid local_safe is canonical
let source = self.db_provider.derived_to_source(local_safe.id())?;
if !self.is_canonical(chain_id, source.id()).await? {
    warn!(target: "supervisor::syncnode_resetter", %chain_id, %source, "Source block for the valid local safe is not canonical");
Collaborator


Can you explain what issue was happening and how erroring during the reset resolves it?

Collaborator Author


Since we depend on L1 polling to detect reorgs, there is a scenario where an L1 reorg has occurred but is not yet known to the supervisor. Adding a canonical check during the reset makes sure that we only reset the node to a valid canonical state.
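That canonical check can be sketched as follows, with hypothetical types: `L1View` stands in for the L1 RPC client, and the real code compares the L1 view against the derivation mapping in the database.

```rust
// Sketch: before resetting, verify that the derived block's source is
// still canonical on L1.
struct L1View {
    // hash of the canonical block at each height (index = block number)
    hashes: Vec<u64>,
}

impl L1View {
    fn is_canonical(&self, number: u64, hash: u64) -> bool {
        self.hashes.get(number as usize) == Some(&hash)
    }
}

// Withhold the reset until the source block is canonical on L1: a
// non-canonical source means the supervisor's state will first be
// rewound by reorg handling.
fn safe_to_reset(l1: &L1View, source_number: u64, source_hash: u64) -> bool {
    l1.is_canonical(source_number, source_hash)
}
```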

Collaborator Author


The race condition between l1_reorg and l1_exhaust or an op-node reset is handled by withholding the reset information until the supervisor has processed the L1 reorg and its state has been rewound.

@dhyaniarun1993 dhyaniarun1993 added this pull request to the merge queue Aug 28, 2025
Merged via the queue into main with commit 2ba747f Aug 28, 2025
44 checks passed
@dhyaniarun1993 dhyaniarun1993 deleted the fix/reorg_race_condition branch August 28, 2025 12:37
theochap pushed a commit to ethereum-optimism/optimism that referenced this pull request Dec 10, 2025
theochap pushed a commit to ethereum-optimism/optimism that referenced this pull request Jan 14, 2026



Development

Successfully merging this pull request may close these issues.

fix(supervisor): Race condition between reorg handling and l1_exhaust

4 participants