Conversation

@julienrbrt
Member

Overview

Once SaveBlockData has succeeded, we must update the height and the state.
context.TODO() should be reverted to the real context (e.ctx) once #2274 has been implemented; until then, it prevents context cancellation from messing up state.

@gemini-code-assist
Contributor

Summary of Changes

Hello @julienrbrt, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request refactors the block execution logic to enhance cancellation robustness. It temporarily modifies the context used for updating block height and state, switching from the executor's context to context.TODO(). This change ensures that these critical post-save operations are not prematurely cancelled, addressing a potential issue until a more comprehensive solution is implemented.

Highlights

  • Improved Cancellation Handling: Temporarily modifies the context used for critical state updates to prevent premature cancellation, ensuring data consistency after block saving.
  • Context Replacement: Replaced the executor's context (e.ctx) with context.TODO() for the e.store.SetHeight and e.updateState calls within the produceBlock method.
  • Temporary Solution: This change is explicitly noted as a temporary measure to prevent context cancellation from affecting state updates, pending the implementation of a more comprehensive solution tracked in issue [Epic]: Atomic behaviour #2274.

@claude
Contributor

claude bot commented Oct 7, 2025

Code Review

I've reviewed this PR that improves cancellation handling in the block executor. Here's my comprehensive analysis:


Summary

This PR switches from e.ctx to context.TODO() for two critical post-save operations (SetHeight and updateState) to prevent context cancellation from leaving the system in an inconsistent state. This is a temporary measure until issue #2274 implements proper atomic behavior.


Review Findings

✅ Positive Aspects

  1. Clear Intent & Documentation: The inline comment explicitly explains why context.TODO() is being used and references the tracking issue ([Epic]: Atomic behaviour #2274), making the temporary nature clear to future maintainers.

  2. Correct Problem Identification: The change correctly identifies that after SaveBlockData succeeds, the height and state updates are critical operations that should complete even if the parent context is cancelled.

  3. Minimal Scope: The change is surgical and only affects the two specific operations that need protection.


⚠️ Areas of Concern

1. Incomplete Cancellation Protection (block/internal/executing/executor.go:386-392)

While SetHeight and updateState now use context.TODO(), the broadcasting operations immediately following still use the cancellable context:

// Lines 395-397
g, ctx := errgroup.WithContext(e.ctx)  // Still uses cancellable context
g.Go(func() error { return e.headerBroadcaster.WriteToStoreAndBroadcast(ctx, header) })
g.Go(func() error { return e.dataBroadcaster.WriteToStoreAndBroadcast(ctx, data) })

Analysis:

  • If cancellation occurs after state updates but before broadcasting, the node will have committed state locally but failed to propagate to the network
  • The code comment on line 400 says "don't fail block production on broadcast error" which suggests this is intentional
  • However, there's an inconsistency: broadcast errors are logged but the node's local state has moved forward

Recommendation: Consider one of these approaches:

  • Document why broadcast failure is acceptable after state commit (e.g., broadcasts will be retried via sync mechanisms)
  • OR use context.TODO() for broadcasts as well if they're critical to consistency
  • OR implement a background retry mechanism for failed broadcasts

2. Observability Gap (block/internal/executing/executor.go:386-392)

When using context.TODO(), there's no timeout or cancellation signal. If these operations hang indefinitely, the executor goroutine will be stuck.

Current behavior:

if err := e.store.SetHeight(context.TODO(), newHeight); err != nil {
    return fmt.Errorf("failed to update store height: %w", err)
}

Recommendation: Add timeout-based context instead of context.TODO():

ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
defer cancel()

if err := e.store.SetHeight(ctx, newHeight); err != nil {
    return fmt.Errorf("failed to update store height: %w", err)
}

if err := e.updateState(ctx, newState); err != nil {
    return fmt.Errorf("failed to update state: %w", err)
}

This provides:

  • Protection against indefinite hangs
  • Bounded resource usage
  • Better operational visibility if these operations start timing out

3. Error Recovery Strategy (block/internal/executing/executor.go:386-392)

If SetHeight or updateState fails after SaveBlockData succeeds, the system is in an inconsistent state:

  • Block data is saved to disk
  • But the height marker and in-memory state haven't been updated
  • On restart, initializeState will reload the old state and try to produce the same height block again

Questions:

  • Does SaveBlockData allow overwriting the same height?
  • What happens when the node restarts and tries to produce the same block again?

Recommendation: Consider adding recovery logic or documenting the expected behavior during restart scenarios.
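
One possible shape for such recovery logic is sketched below. It is purely illustrative and makes no claim about how ev-node's store or initializeState behave: the store type, its fields, and reconcile are all hypothetical, and a real implementation would also need to rebuild the state for each recovered height (for example by re-executing the block).

```go
package main

import "fmt"

// store is a hypothetical in-memory model of the node's persistence:
// blocks keyed by height plus a separate height marker.
type store struct {
	blocks map[uint64]string
	height uint64
}

// reconcile advances the height marker over any blocks that were saved
// but never acknowledged, and returns how many heights it recovered.
func (s *store) reconcile() uint64 {
	var recovered uint64
	for {
		if _, ok := s.blocks[s.height+1]; !ok {
			return recovered
		}
		s.height++
		recovered++
	}
}

func main() {
	// Block 10 was saved, then the height update failed and the node restarted.
	s := &store{blocks: map[uint64]string{9: "b9", 10: "b10"}, height: 9}
	fmt.Println(s.reconcile(), s.height)
}
```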

4. Testing Implications

This change modifies critical error paths but I don't see corresponding test updates mentioned.

Recommendation: Ensure tests cover:

  • Context cancellation during various stages of produceBlock
  • Cancellation after SaveBlockData but before SetHeight
  • Verification that state updates complete despite parent context cancellation
  • Restart scenarios after partial failures

📋 Code Quality & Style

Go Conventions: Code follows Go idioms
Error Handling: Proper error wrapping with context
Logging: No changes to logging needed
⚠️ Context Handling: Uses context.TODO() which is generally a code smell, though justified here as temporary


🔒 Security & Resilience

No Security Issues: No credential exposure or validation gaps introduced
⚠️ Resilience Concern: Potential for indefinite hangs without timeout on critical path
Concurrency: Atomic state operations remain safe


🎯 Recommendation

APPROVE with suggestions for immediate merge, but recommend follow-up improvements:

  1. Short-term (before merge if possible):

    • Replace context.TODO() with context.WithTimeout(context.Background(), reasonable_timeout)
    • Add a comment explaining the broadcast failure acceptance or fix the inconsistency
  2. Medium-term (track in [Epic]: Atomic behaviour #2274 or separate issue):

    • Implement proper atomic behavior for the entire block production flow
    • Add comprehensive tests for cancellation scenarios
    • Document restart/recovery behavior when state updates fail
  3. Long-term (part of [Epic]: Atomic behaviour #2274):

    • Implement true transactional behavior using database transactions or two-phase commit
    • Ensure SaveBlockData, SetHeight, and UpdateState are atomic
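
To make the long-term idea concrete, the following sketch stages all three writes in a batch and applies them in one step. The kv and batch types are hypothetical and in-memory; they do not represent ev-node's store API, and a real database batch is what would actually provide atomicity and durability.

```go
package main

import (
	"errors"
	"fmt"
)

// kv is a hypothetical in-memory key-value store.
type kv struct{ data map[string]string }

// batch stages writes and applies them only on Commit.
type batch struct {
	target *kv
	staged map[string]string
}

func (s *kv) NewBatch() *batch {
	return &batch{target: s, staged: map[string]string{}}
}

func (b *batch) Put(k, v string) { b.staged[k] = v }

// Commit applies every staged write. In this toy model nothing can fail
// halfway; a real store would rely on its database for that guarantee.
func (b *batch) Commit() {
	for k, v := range b.staged {
		b.target.data[k] = v
	}
}

// commitBlock stages block data, height, and state together. A failure
// before Commit leaves the store untouched.
func commitBlock(s *kv, height, block, state string, failBeforeCommit bool) error {
	b := s.NewBatch()
	b.Put("block/"+height, block)
	b.Put("height", height)
	b.Put("state", state)
	if failBeforeCommit {
		return errors.New("interrupted before commit")
	}
	b.Commit()
	return nil
}

func main() {
	s := &kv{data: map[string]string{}}

	// An interruption before Commit writes nothing at all.
	err := commitBlock(s, "10", "b10", "s10", true)
	fmt.Println(err, len(s.data))

	// A successful commit makes block, height, and state visible together.
	err = commitBlock(s, "10", "b10", "s10", false)
	fmt.Println(err, s.data["height"])
}
```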

The core change is sound and improves the current situation by preventing mid-operation cancellation corruption. The temporary nature is well-documented and tracked in #2274.


@github-actions
Contributor

github-actions bot commented Oct 7, 2025

The latest Buf updates on your PR. Results from workflow CI and Release / buf-check (pull_request).

Build | Format | Lint | Breaking | Updated (UTC)
✅ passed | ⏩ skipped | ✅ passed | ✅ passed | Oct 7, 2025, 1:38 PM

Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request addresses a potential state inconsistency issue by ensuring that state and height updates are not cancelled after a block has been saved. This is achieved by using context.TODO() for these critical operations. My review suggests a small improvement to use context.Background() instead, as it's more idiomatic for this use case. I've also proposed an improved comment to clarify the reasoning for this change.

Comment on lines +384 to 392
  // Once the SaveBlockData has been saved we must update the height and the state.
  // context.TODO() should be reverted to the real context (e.ctx) once https://github.com/evstack/ev-node/issues/2274 has been implemented, this prevents context cancellation
  if err := e.store.SetHeight(context.TODO(), newHeight); err != nil {
      return fmt.Errorf("failed to update store height: %w", err)
  }

- if err := e.updateState(e.ctx, newState); err != nil {
+ if err := e.updateState(context.TODO(), newState); err != nil {
      return fmt.Errorf("failed to update state: %w", err)
  }

medium

While context.TODO() works, context.Background() is more idiomatic for creating a non-cancellable context. It more clearly communicates the intent to run a critical operation that must not be cancelled.

I've updated the code to use context.Background() and improved the comment to clarify the rationale.

Suggested change
// Once the SaveBlockData has been saved we must update the height and the state.
// context.TODO() should be reverted to the real context (e.ctx) once https://github.com/evstack/ev-node/issues/2274 has been implemented, this prevents context cancellation
if err := e.store.SetHeight(context.TODO(), newHeight); err != nil {
return fmt.Errorf("failed to update store height: %w", err)
}
if err := e.updateState(e.ctx, newState); err != nil {
if err := e.updateState(context.TODO(), newState); err != nil {
return fmt.Errorf("failed to update state: %w", err)
}
// Once block data is saved, height and state must be updated even if the context is cancelled.
// Using context.Background() prevents inconsistent state. See issue #2274.
if err := e.store.SetHeight(context.Background(), newHeight); err != nil {
return fmt.Errorf("failed to update store height: %w", err)
}
if err := e.updateState(context.Background(), newState); err != nil {
return fmt.Errorf("failed to update state: %w", err)
}

@codecov

codecov bot commented Oct 7, 2025

Codecov Report

❌ Patch coverage is 0% with 2 lines in your changes missing coverage. Please review.
✅ Project coverage is 62.17%. Comparing base (a491aee) to head (a5ea75e).
⚠️ Report is 1 commit behind head on main.

Files with missing lines Patch % Lines
block/internal/executing/executor.go 0.00% 0 Missing and 2 partials ⚠️
Additional details and impacted files
@@           Coverage Diff           @@
##             main    #2741   +/-   ##
=======================================
  Coverage   62.17%   62.17%           
=======================================
  Files          79       79           
  Lines        8497     8497           
=======================================
  Hits         5283     5283           
  Misses       2721     2721           
  Partials      493      493           
Flag Coverage Δ
combined 62.17% <0.00%> (ø)

@julienrbrt julienrbrt added this pull request to the merge queue Oct 7, 2025
Merged via the queue into main with commit 184f42f Oct 7, 2025
32 of 33 checks passed
@julienrbrt julienrbrt deleted the julien/ctx branch October 7, 2025 15:10
@github-project-automation github-project-automation bot moved this to Done in Evolve Oct 7, 2025
alpe added a commit that referenced this pull request Oct 9, 2025
* main:
  feat(store)!: add batching for atomicity  (#2746)
  refactor(apps): rollback cmd updates (#2744)
  chore: add makefile for tools (#2743)
  chore: fix markdown lint (#2742)
  build(deps): Bump the all-go group across 5 directories with 6 updates (#2738)
  refactor(block): improve cancellation (#2741)
  chore: make the prompt go oriented  (#2739)
  perf(block): use `sync/atomic` instead of mutexes (#2735)