
Conversation

@julienrbrt (Member) commented Nov 20, 2025

PR adding retry logic and better payload verification (as found in the audit).
It was one-shot by Claude Sonnet, so I'm still checking it :D Done and updated.

@claude bot (Contributor) commented Nov 20, 2025

Claude finished @julienrbrt's task.


Comprehensive Code Review

I've completed a thorough review of the retry logic and payload verification implementation. Overall, this is a solid improvement that implements retry logic for transient failures in line with the Engine API specification. The test coverage is excellent. However, I've identified several areas for improvement across code quality, design, and maintainability.


🔴 High Priority Issues

1. Unreachable Code in retryWithBackoffOnPayloadStatus (execution/evm/execution.go:110)

Issue: Line 110 is unreachable. The for loop will always return early via one of these paths:

  • Success (line 83)
  • Non-retriable error (lines 88, 93)
  • Max retries exceeded (line 98)
return fmt.Errorf("max retries (%d) exceeded for %s", maxRetries, operation)

Fix: Remove line 110 entirely.

Rationale: Dead code creates confusion and suggests the logic may not be fully understood. The gemini-code-assist reviewer also caught this.
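For reference, a minimal sketch (simplified names, not the exact PR implementation) of how the helper can be shaped so that every exit path is an explicit return inside the loop, leaving no trailing statement behind to become dead code:

package evm

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// ErrPayloadSyncing mirrors the transient-status sentinel the PR defines;
// it is repeated here only to keep the sketch self-contained.
var ErrPayloadSyncing = errors.New("execution client payload status is SYNCING/ACCEPTED")

// retryOnPayloadStatus retries op with exponential backoff while the error is
// the transient ErrPayloadSyncing. Because the loop has no condition, the
// compiler does not require a return statement after it.
func retryOnPayloadStatus(ctx context.Context, op func() error, maxRetries int, backoff time.Duration, operation string) error {
	for attempt := 0; ; attempt++ {
		err := op()
		if err == nil {
			return nil // success
		}
		if !errors.Is(err, ErrPayloadSyncing) {
			return err // permanent failure (e.g. INVALID): fail immediately
		}
		if attempt == maxRetries {
			return fmt.Errorf("max retries (%d) exceeded for %s: %w", maxRetries, operation, err)
		}
		select {
		case <-ctx.Done():
			return ctx.Err() // caller cancelled or deadline exceeded
		case <-time.After(backoff):
			backoff *= 2 // exponential backoff before the next attempt
		}
	}
}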


2. Inconsistent Error Handling Between InitChain and ExecuteTxs/setFinal

Problem: The retry pattern is clean in InitChain (lines 189-214) but becomes convoluted in ExecuteTxs and setFinal. Specifically:

  • In ExecuteTxs (lines 293-332): The forkchoiceUpdatedV3 RPC call happens, then the closure captures and modifies payloadID from the outer scope
  • Similar issue with newPayloadV4 call (lines 342-368)
  • In setFinal (lines 399-416): Same pattern

Recommendation: Refactor to move the entire RPC operation inside the retry closure for consistency and clarity, matching the InitChain pattern:

// Current (complex):
var payloadID *engine.PayloadID
err = retryWithBackoffOnPayloadStatus(ctx, func() error {
    var forkchoiceResult engine.ForkChoiceResponse
    err := c.engineClient.CallContext(...)
    // ... validation ...
    payloadID = forkchoiceResult.PayloadID  // Modifying captured variable
    return nil
}, ...)

// Better (consistent with InitChain):
var payloadID *engine.PayloadID
err = retryWithBackoffOnPayloadStatus(ctx, func() error {
    var forkchoiceResult engine.ForkChoiceResponse
    err := c.engineClient.CallContext(...)
    if err != nil {
        return fmt.Errorf("forkchoice update failed: %w", err)
    }
    
    if err := validatePayloadStatus(forkchoiceResult.PayloadStatus); err != nil {
        // ... logging ...
        return err
    }
    
    if forkchoiceResult.PayloadID == nil {
        // ... logging ...
        return fmt.Errorf("returned nil PayloadID")
    }
    
    payloadID = forkchoiceResult.PayloadID
    return nil
}, ...)

This is the same feedback from gemini-code-assist and I agree it would significantly improve maintainability.


3. Missing Retry Logic for engine_getPayloadV4 (execution/evm/execution.go:336)

Question from Review: Should line 336's engine_getPayloadV4 call also have retry logic?

err = c.engineClient.CallContext(ctx, &payloadResult, "engine_getPayloadV4", *payloadID)

Analysis: The ExecutionPayloadEnvelope returned by engine_getPayloadV4 does not carry a payload status, so the status-based retry cannot be applied directly; however, the RPC itself could still fail transiently while the EL is syncing.

Recommendation: Add retry wrapper for consistency and resilience, or document why it's safe to skip (e.g., payload is already built, just retrieval).


🟡 Medium Priority Issues

4. Function Parameter operation Could Be More Descriptive

The operation parameter in retryWithBackoffOnPayloadStatus is used for error messages. Consider more descriptive operation names:

  • Current: "InitChain", "ExecuteTxs forkchoice", "ExecuteTxs newPayload"
  • Better: "InitChain.forkchoiceUpdatedV3", "ExecuteTxs.forkchoiceUpdatedV3", "ExecuteTxs.newPayloadV4"

This provides clearer context in logs when debugging.


5. Error Channel Usage Pattern (errorCh chan<- error)

Observation: The errorCh parameter is added to executor.go, submitter.go, and syncer.go but the actual retry logic doesn't use it.

Question: What's the intended use of errorCh? Should permanent failures (ErrInvalidPayloadStatus) be sent to this channel to trigger node shutdown? The comment says "Channel to report critical execution client failures" but the retry logic doesn't leverage it.

Suggestion: If the intent is to report INVALID payloads as critical failures, consider:

if errors.Is(err, ErrInvalidPayloadStatus) {
    select {
    case s.errorCh <- err:
    default:
    }
    return err
}

Otherwise, clarify in comments when errorCh should be used.


🟢 Code Quality & Best Practices

6. Excellent Test Coverage

The execution_status_test.go file is comprehensive:

  • ✅ Status validation for all Engine API states (VALID, INVALID, SYNCING, ACCEPTED)
  • ✅ Retry scenarios (success on first attempt, after retries, exhaustion)
  • ✅ Context cancellation and timeout handling
  • ✅ RPC error handling (both wrapped and unwrapped)
  • ✅ Proper use of table-driven tests

Praise: This is production-quality test coverage.


7. Constants and Documentation

  • ✅ Well-defined constants (MaxPayloadStatusRetries, InitialRetryBackoff)
  • ✅ Clear error types (ErrInvalidPayloadStatus, ErrPayloadSyncing)
  • ✅ Comprehensive function documentation explaining Engine API spec compliance

8. Concurrency Safety

The mutex usage in setFinal (lines 381-396) properly protects shared state before constructing the forkchoice state. Good pattern.


🔒 Security Review

9. Logging of Sensitive Data

Check: The logging statements include latestValidHash and validationError from the Engine API. These are safe to log and don't contain secrets. ✅

10. Context Handling

✅ Proper context propagation and cancellation handling throughout
✅ Context timeouts respected in retry logic


⚡ Performance Considerations

11. Exponential Backoff Configuration

Current settings:

  • Max retries: 3
  • Initial backoff: 1 second
  • Backoff multiplier: 2x

Timeline: 1s + 2s + 4s = 7 seconds maximum delay

Assessment: Reasonable for block production use case. However:

Consideration: If the EL is syncing from a cold start, 3 retries may not be enough. The PR description mentions "preventing unnecessary node shutdowns during transient execution engine sync operations."

Question: Should these be configurable? Or should there be different retry policies (see the sketch after this list) for:

  • InitChain (node startup - more retries?)
  • ExecuteTxs (block production - fast failure?)
  • SetFinal (finalization - more retries?)
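If configurability is wanted, one lightweight option is a small per-operation policy, sketched below (hypothetical names and defaults, not part of the PR; assumes it sits next to the existing constants in the same package with time imported):

// RetryPolicy groups the knobs that are currently package-level constants
// (MaxPayloadStatusRetries, InitialRetryBackoff) so they can vary per operation.
type RetryPolicy struct {
	MaxRetries     int
	InitialBackoff time.Duration
}

var (
	// Node startup can tolerate a longer wait while the EL finishes syncing.
	initChainRetry = RetryPolicy{MaxRetries: 10, InitialBackoff: time.Second}
	// Block production should fail fast so the caller can react quickly.
	executeTxsRetry = RetryPolicy{MaxRetries: 3, InitialBackoff: time.Second}
	// Finalization sits in between.
	setFinalRetry = RetryPolicy{MaxRetries: 5, InitialBackoff: time.Second}
)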

📋 Architecture & Design

12. Response to Previous Feedback

Alex's suggestion (alpe): Extract retry logic to utility or use decorator pattern

Julien's response: "This is a separate Go module; as it depends only on core, we have nothing else available."

My take: While I understand the module isolation constraint, consider:

  1. The retry logic is generic enough that it could live in a pkg/retry or internal/retry package within the same module
  2. However, since it's currently only used in one place and specifically tailored to Engine API payload status semantics, the current approach is acceptable
  3. If retry patterns emerge elsewhere in the execution layer, then extract to a shared package

Decision: Current approach is fine given the constraints. Consider extraction if usage expands.


🎯 Summary & Recommendations

Overall Assessment: This PR successfully implements Engine API-compliant retry logic with excellent test coverage. The core logic is sound, but consistency and maintainability could be improved.

Required Changes:

  1. ✅ Remove unreachable code (line 110)
  2. ✅ Refactor ExecuteTxs and setFinal retry patterns to match InitChain consistency
  3. ⚠️ Evaluate and add retry logic for engine_getPayloadV4 or document why it's safe to skip

Recommended Improvements:
4. Clarify errorCh usage pattern or leverage it for critical failures
5. More descriptive operation names in retry error messages
6. Consider whether retry configuration should vary by operation

Strengths:

  • ✅ Proper Engine API specification compliance
  • ✅ Comprehensive test coverage
  • ✅ Clear documentation and error types
  • ✅ Appropriate concurrency safety measures

Files Reviewed

  • execution/evm/execution.go - Main implementation (454 additions)
  • execution/evm/execution_status_test.go - Test coverage (253 additions)
  • block/internal/executing/executor.go - Error channel integration
  • block/internal/submitting/submitter.go - Error channel integration
  • block/internal/syncing/syncer.go - Error channel integration
  • CHANGELOG.md - Documentation

@github-actions bot (Contributor) commented Nov 20, 2025

The latest Buf updates on your PR. Results from workflow CI / buf-check (pull_request).

Build: ✅ passed · Format: ⏩ skipped · Lint: ✅ passed · Breaking: ✅ passed · Updated (UTC): Nov 24, 2025, 9:33 AM

@gemini-code-assist bot (Contributor)

Summary of Changes

Hello @julienrbrt, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly enhances the reliability and resilience of the EVM execution client by introducing robust payload status validation and a sophisticated retry mechanism. By adhering to the Engine API specification, the system now intelligently handles transient SYNCING or ACCEPTED states with exponential backoff, preventing premature node shutdowns. Conversely, it ensures immediate failure for permanent INVALID statuses, streamlining error handling and improving overall stability.

Highlights

  • Improved EVM Payload Status Validation: Enhanced the validation of EVM execution client payload statuses to align with the Engine API specification, ensuring more robust error handling.
  • Retry Logic for Transient States: Implemented exponential backoff retry logic for SYNCING and ACCEPTED payload statuses in InitChain, ExecuteTxs, and SetFinal methods, preventing unnecessary node shutdowns during temporary execution engine unavailability.
  • Immediate Failure on Invalid Status: Configured the system to immediately fail on INVALID payload statuses, preventing unnecessary retries for permanent errors and improving efficiency.
  • New Utility Functions for Status Handling: Introduced validatePayloadStatus to categorize Engine API payload statuses and retryWithBackoff to provide a generic, context-aware mechanism for retrying operations with exponential backoff (a sketch of validatePayloadStatus follows this list).
  • Comprehensive Testing: Added a new dedicated test file with unit tests for the payload status validation and retry logic, covering various scenarios including successful retries, exhaustion of retries, immediate failures, and context cancellation/timeout.
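As a rough sketch of the categorization described in the highlight above (types and constants from go-ethereum's beacon/engine package; the exact signature and error wrapping in the PR may differ), validatePayloadStatus could look roughly like this:

// Sketch only: maps an Engine API payload status onto the error classes the
// retry helper understands. ErrInvalidPayloadStatus and ErrPayloadSyncing are
// the error types introduced by the PR.
func validatePayloadStatus(status engine.PayloadStatusV1) error {
	switch status.Status {
	case engine.VALID:
		return nil // accepted: nothing to retry
	case engine.INVALID:
		// permanent failure: fail immediately, never retry
		validationErr := ""
		if status.ValidationError != nil {
			validationErr = *status.ValidationError
		}
		return fmt.Errorf("%w: latestValidHash=%v validationError=%s",
			ErrInvalidPayloadStatus, status.LatestValidHash, validationErr)
	case engine.SYNCING, engine.ACCEPTED:
		// transient: the EL is still catching up; retry with backoff
		return ErrPayloadSyncing
	default:
		return fmt.Errorf("unexpected payload status %q", status.Status)
	}
}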

@gemini-code-assist bot left a comment

Code Review

This pull request introduces retry logic for handling transient SYNCING statuses from the EVM execution client, which is a great improvement for node stability. The core changes involve a new retryWithBackoff helper and its integration into InitChain, ExecuteTxs, and SetFinal.

The overall approach is solid, and the new tests are comprehensive. However, I've identified a few areas for improvement:

  • The implementation of the retry logic in ExecuteTxs and setFinal is unnecessarily complex and inconsistent with the cleaner pattern used in InitChain. I've left comments suggesting a refactoring to improve clarity and robustness.
  • The retryWithBackoff helper contains some unreachable code that should be removed.
  • A minor cleanup in validatePayloadStatus is also suggested.

Addressing these points will make the new retry mechanism more maintainable and consistent across the codebase.

@codecov

codecov bot commented Nov 20, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 64.82%. Comparing base (bfa745c) to head (a00d4ea).
⚠️ Report is 1 commit behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #2863      +/-   ##
==========================================
- Coverage   64.93%   64.82%   -0.11%     
==========================================
  Files          81       81              
  Lines        7350     7350              
==========================================
- Hits         4773     4765       -8     
- Misses       2036     2043       +7     
- Partials      541      542       +1     
Flag: combined · Coverage: 64.82% <ø> (Δ -0.11% ⬇️)


@julienrbrt julienrbrt marked this pull request as draft November 20, 2025 09:07
@julienrbrt julienrbrt marked this pull request as ready for review November 20, 2025 10:31
@alpe (Contributor) left a comment

The retry/backoff code looks good.
It looks similar to other retry/backoff code that already exists. Would it make sense to extract it to some utility?

Have you thought about encapsulating the retry logic in a decorator around engineClient.CallContext? This way, the retry would be hidden and the code would be slightly more readable.
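For illustration, such a decorator could look roughly like the sketch below (assuming the wrapped client is a go-ethereum rpc.Client; the actual engineClient type and retry conditions in this repo may differ):

// retryingEngineClient wraps an RPC client and retries CallContext on
// transport-level errors with exponential backoff. Hypothetical sketch.
type retryingEngineClient struct {
	inner      *rpc.Client
	maxRetries int
	backoff    time.Duration
}

func (r *retryingEngineClient) CallContext(ctx context.Context, result any, method string, args ...any) error {
	delay := r.backoff
	var err error
	for attempt := 0; ; attempt++ {
		if err = r.inner.CallContext(ctx, result, method, args...); err == nil {
			return nil
		}
		if attempt == r.maxRetries {
			return fmt.Errorf("%s failed after %d retries: %w", method, r.maxRetries, err)
		}
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-time.After(delay):
			delay *= 2 // exponential backoff before retrying the call
		}
	}
}

One caveat: the VALID/INVALID/SYNCING verdict arrives in the response body rather than as an RPC error, so a decorator at this level would still need the payload-status check on top; it mainly hides retries of transport failures.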

// get payload
var payloadResult engine.ExecutionPayloadEnvelope
- err = c.engineClient.CallContext(ctx, &payloadResult, "engine_getPayloadV4", *forkchoiceResult.PayloadID)
+ err = c.engineClient.CallContext(ctx, &payloadResult, "engine_getPayloadV4", *payloadID)
Contributor commented on this diff:

Should there be a retry for this call, too?

julienrbrt (Member Author) replied:

I have checked and ExecutionPayloadEnvelope doesn't return a payload status, so it cannot be retried with the same logic. I have updated the API naming to clarify the retry function usage (a00d4ea).

@julienrbrt (Member Author) commented:

The retry/backoff code looks good.
It looks similar to other retry/backoff code that already exists. Would it make sense to extract it to some utility?

This is a separate Go module; as it depends only on core, we have nothing else available.

Let me check for your other point.

tac0turtle previously approved these changes Nov 24, 2025
@tac0turtle (Contributor) left a comment

utACK, there is the one question from alex

@julienrbrt julienrbrt added this pull request to the merge queue Nov 24, 2025
Merged via the queue into main with commit c161967 Nov 24, 2025
28 of 30 checks passed
@julienrbrt julienrbrt deleted the julien/ai-execution-evm branch November 24, 2025 10:56
alpe added a commit that referenced this pull request Nov 24, 2025
* main:
  chore: remove extra github action yml file (#2882)
  fix(execution/evm): verify payload status (#2863)
  feat: fetch included da height from store (#2880)
  chore: better output on errors (#2879)
  refactor!: create da client and split cache interface (#2878)
  chore!: rename `evm-single` and `grpc-single` (#2839)
  build(deps): Bump golang.org/x/crypto from 0.42.0 to 0.45.0 in /tools/da-debug in the go_modules group across 1 directory (#2876)
  chore: parallel cache de/serialization (#2868)
  chore: bump blob size (#2877)