
Conversation

@alpe (Contributor) commented Nov 12, 2025

Implement failover via RAFT

  • Improve Cache startup/shutdown with parallelization
  • Publish to RAFT cluster in executor
  • Sync DB after each block created in executor
  • Add new RaftReceiver to sync when in aggregator follower mode
  • Introduce failoverState to switch between follower/leader mode
  • Provide RAFT node details via http endpoint

github-actions bot commented Nov 12, 2025

The latest Buf updates on your PR. Results from workflow CI / buf-check (pull_request).

Build: ✅ passed | Format: ⏩ skipped | Lint: ✅ passed | Breaking: ✅ passed | Updated (UTC): Nov 25, 2025, 8:22 AM

codecov bot commented Nov 12, 2025

Codecov Report

❌ Patch coverage is 41.44144% with 520 lines in your changes missing coverage. Please review.
✅ Project coverage is 62.25%. Comparing base (8cd0fb8) to head (5de9f0e).

Files with missing lines                   Patch %   Lines
pkg/raft/node.go                           12.50%    168 Missing ⚠️
pkg/raft/node_mock.go                      45.08%    74 Missing and 21 partials ⚠️
block/internal/syncing/raft_retriever.go   0.00%     60 Missing ⚠️
node/full.go                               32.81%    36 Missing and 7 partials ⚠️
node/failover.go                           74.45%    22 Missing and 13 partials ⚠️
block/internal/syncing/syncer.go           28.88%    30 Missing and 2 partials ⚠️
block/internal/executing/executor.go       6.66%     24 Missing and 4 partials ⚠️
pkg/raft/election.go                       79.26%    12 Missing and 5 partials ⚠️
pkg/rpc/server/http.go                     6.66%     13 Missing and 1 partial ⚠️
block/internal/syncing/assert.go           57.89%    4 Missing and 4 partials ⚠️
... and 9 more
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #2836      +/-   ##
==========================================
- Coverage   64.76%   62.25%   -2.51%     
==========================================
  Files          81       88       +7     
  Lines        7350     8063     +713     
==========================================
+ Hits         4760     5020     +260     
- Misses       2049     2467     +418     
- Partials      541      576      +35     
Flag       Coverage Δ
combined   62.25% <41.44%> (-2.51%) ⬇️

Flags with carried forward coverage won't be shown.

alpe added 4 commits November 17, 2025 10:55
* main:
  fix: remove duplicate error logging in light node shutdown (#2841)
  chore: fix incorrect function name in comment (#2840)
  chore: remove sequencer go.mod (#2837)
* main:
  build(deps): Bump the go_modules group across 2 directories with 3 updates (#2846)
  build(deps): Bump github.com/dvsekhvalnov/jose2go from 1.7.0 to 1.8.0 in /test/e2e (#2851)
  build(deps): Bump github.com/consensys/gnark-crypto from 0.18.0 to 0.18.1 in /test/e2e (#2844)
  build(deps): Bump github.com/cometbft/cometbft from 0.38.17 to 0.38.19 in /test/e2e (#2843)
  build(deps): Bump github.com/dvsekhvalnov/jose2go from 1.6.0 to 1.7.0 in /test/e2e (#2845)
@alpe changed the title from "[WIP] feat: High availability via RAFT" to "feat: High availability via RAFT" on Nov 17, 2025
alpe added 4 commits November 18, 2025 11:00
(cherry picked from commit c44cd77e665f6d5d463295c6ed61c59a56d88db3)
@alpe marked this pull request as ready for review November 19, 2025 09:54
@alpe (author) commented Nov 19, 2025

@claude review again

claude bot commented Nov 19, 2025

Claude finished @alpe's task


Code Review: High Availability via RAFT

I've completed a comprehensive review of this RAFT-based high availability implementation. This is a significant feature adding distributed consensus for failover support.


1. Code Quality & Architecture

Excellent Design Choices

  • Clean abstraction: The pkg/raft package properly encapsulates RAFT complexity with well-defined interfaces (Runnable, RaftNode)
  • Failover pattern: The DynamicLeaderElection in pkg/raft/election.go is well-structured for managing leader/follower transitions
  • Modular integration: RAFT integration via interfaces (common.RaftNode) keeps coupling low

Architectural Concerns

pkg/raft/node.go:211 - JSON serialization bottleneck

data, err := json.Marshal(state) // todo:use protobuf
  • Issue: JSON adds significant overhead for 100ms block times. At high throughput, this will cause latency issues.
  • Impact: Every block broadcast serializes via JSON; deserialization happens on all followers
  • Fix: Priority TODO - migrate to protobuf. The TODO comment acknowledges this, but it should be addressed before merging given the 100ms target block time mentioned in review comments.
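
A minimal sketch of the direction this points in, assuming a hypothetical generated protobuf message pb.RaftBlockState that mirrors the Go struct; the package path and helper names are illustrative, not part of this PR:

package raftcodec

import (
    "fmt"

    "google.golang.org/protobuf/proto"

    // hypothetical generated package for an illustrative raft_block_state.proto
    pb "example.com/evnode/gen/raft/v1"
)

// encodeBlockState replaces json.Marshal on the broadcast hot path with a
// compact binary protobuf encoding.
func encodeBlockState(state *pb.RaftBlockState) ([]byte, error) {
    return proto.Marshal(state)
}

// decodeBlockState is the follower-side counterpart used when applying raft
// log entries in the FSM.
func decodeBlockState(data []byte) (*pb.RaftBlockState, error) {
    var state pb.RaftBlockState
    if err := proto.Unmarshal(data, &state); err != nil {
        return nil, fmt.Errorf("unmarshal block state: %w", err)
    }
    return &state, nil
}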

pkg/raft/node.go:274-276 - Insufficient documentation

func (n *Node) SetApplyCallback(ch chan<- RaftApplyMsg) {
	n.fsm.applyCh = ch
}
  • Issue: The godoc says "channel must have sufficient buffer space" but doesn't specify how much
  • Concern: In production, undersized buffers will silently drop updates (line 296)
  • Fix: Document recommended buffer size or provide a constructor/config option
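
A possible shape of that fix; the constant value and helper are assumptions, not the PR's API:

// DefaultApplyBufferSize is an illustrative default: large enough to absorb a
// short burst of applied raft log entries while the consumer catches up.
const DefaultApplyBufferSize = 128

// NewApplyChannel returns a buffered channel sized for SetApplyCallback.
// Callers that must never drop updates should size it to the maximum number
// of blocks they expect to lag behind the leader.
func NewApplyChannel() chan RaftApplyMsg {
    return make(chan RaftApplyMsg, DefaultApplyBufferSize)
}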

node/failover.go - Naming confusion

  • newSyncMode and newAggregatorMode are constructors but names don't follow New* convention
  • Consider: NewSyncMode and NewAggregatorMode for consistency with Go conventions

block/internal/syncing/raft_retriever.go:0% - Zero test coverage

  • This is a critical path for follower synchronization
  • No tests for the raft apply loop, error handling, or block consumption
  • Risk: Follower nodes may fail silently in edge cases

2. Error Handling & Resilience

Strong Points

  • Proper context cancellation throughout
  • Good use of errors.Join for multi-error scenarios
  • Retry logic with backoff in executor

Critical Issues

pkg/raft/election.go:96-104 - Double-signing prevention logic

if !raftSynced || !runnable.IsSynced(d.node.GetState()) {
    d.logger.Info().Msg("became leader, but not synced. Pass on leadership")
    if err := d.node.leadershipTransfer(); err != nil && !errors.Is(err, raft.ErrNotLeader) {
        return err // ⚠️ Terminates the election loop entirely
    }
    continue
}
  • Critical: On leadership transfer failure, the entire node stops (election loop exits)
  • Impact: A transient RAFT error can take down the node permanently
  • Risk: If no suitable leader exists, this node crashes instead of retrying
  • Fix: Log error but continue the loop; let RAFT retry the election

block/internal/executing/executor.go:452 - No rollback on RAFT broadcast failure

if err := e.raftNode.Broadcast(e.ctx, raftState); err != nil {
    return fmt.Errorf("failed to propose block to raft: %w", err)
}
if err := batch.Commit(); err != nil { // ⚠️ Committed to local store even if RAFT fails
    return fmt.Errorf("failed to commit batch: %w", err)
}
  • Issue: Block is committed to local DB after RAFT broadcast fails
  • Impact: Leader's local state diverges from cluster state
  • Risk: On restart, leader may try to produce blocks from inconsistent state
  • Fix: Broadcast to RAFT before committing to local store, or implement compensating rollback
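
A sketch of the reordering (Option A in the summary below); it reuses the identifiers from the snippet above and is not a drop-in fix:

// Sketch: get quorum acknowledgement from the raft cluster first, then commit
// locally. A failed Broadcast leaves the local store untouched, so the block
// can be re-proposed without divergence.
if e.raftNode != nil {
    if err := e.raftNode.Broadcast(e.ctx, raftState); err != nil {
        return fmt.Errorf("failed to propose block to raft: %w", err)
    }
}
if err := batch.Commit(); err != nil {
    // The cluster already accepted the block but the local commit failed;
    // treat this as fatal and resynchronize from the raft state on restart.
    return fmt.Errorf("failed to commit batch: %w", err)
}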

block/internal/syncing/syncer.go:181 - Incomplete catch-up check

return s.headerStore.Store().Height() > currentHeight ||
    s.dataStore.Store().Height() > currentHeight ||
    s.raftRetriever != nil && s.raftRetriever.Height() > currentHeight
  • Observation: Per your comment response, DA is not checked here by design (performance)
  • Concern: If DA has newer state but P2P/RAFT don't, node will think it's synced
  • Clarification needed: Confirm this is intentional based on sync strategy priority

3. Security

Concerns

block/internal/syncing/raft_retriever.go:112-114 - Weak proposer validation

if err := assertExpectedProposer(r.genesis, header.ProposerAddress); err != nil {
    r.logger.Debug().Err(err).Msg("unexpected proposer")
    return nil // ⚠️ Silently ignores blocks from wrong proposer
}
  • Issue: Malicious RAFT follower could inject blocks from unauthorized proposers
  • Impact: Security relies solely on RAFT cluster membership, not cryptographic verification
  • Fix: Log at WARN level minimum; consider returning error to trigger cluster investigation
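
A sketch of that change, keeping the existing helper and only adjusting severity and the return value; returning an error here (rather than nil) is a suggestion, not the PR's current behaviour:

if err := assertExpectedProposer(r.genesis, header.ProposerAddress); err != nil {
    // Surface misbehaviour loudly instead of silently dropping the block.
    r.logger.Warn().Err(err).Msg("rejecting block from unexpected proposer")
    return fmt.Errorf("unexpected proposer: %w", err)
}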

pkg/raft/node.go:285 - No validation in FSM.Apply

func (f *FSM) Apply(log *raft.Log) interface{} {
    var state RaftBlockState
    if err := json.Unmarshal(log.Data, &state); err != nil {
        f.logger.Error().Err(err).Msg("unmarshal block state")
        return err
    }
    if err := f.state.Load().assertValid(state); err != nil { // ⚠️ Only basic validation
        return err
    }
  • Concern: assertValid only checks height/timestamp monotonicity (types.go:15-23)
  • Missing: No signature verification, no hash chain validation
  • Risk: Compromised leader can inject invalid blocks that followers accept via RAFT
  • Fix: Add cryptographic validation in FSM before accepting state updates

Secrets Management - No issues found

  • JWT secrets, signing keys properly externalized
  • No hardcoded credentials

4. Performance & Resource Efficiency

Bottlenecks

pkg/raft/node.go:211 - JSON overhead (repeated from Architecture)

  • At 100ms block time (600 blocks/min), JSON marshaling/unmarshaling will add 10-50ms latency per block
  • Measurement needed: Profile actual overhead under load

block/internal/syncing/syncer.go:355 - Aggressive polling

ticker := time.NewTicker(10 * time.Millisecond)
  • Issue: 100 checks/sec for pending events is CPU-intensive
  • Better: Use notification channels or increase interval to 50-100ms
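
A minimal sketch of the notification-based variant with a slow fallback tick; pendingCh and processPendingEvents are placeholders, not the syncer's actual fields:

// Sketch: wake on work instead of polling 100 times per second.
ticker := time.NewTicker(100 * time.Millisecond) // fallback only
defer ticker.Stop()
for {
    select {
    case <-ctx.Done():
        return
    case <-pendingCh: // hypothetical channel signalled when new events are queued
    case <-ticker.C: // safety net in case a notification is missed
    }
    s.processPendingEvents() // hypothetical: drain whatever is queued
}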

node/failover.go:187-231 - Shutdown complexity

  • Sequential shutdown of multiple services with 3s timeout each
  • Risk: Total shutdown time could exceed 15 seconds in worst case
  • Better: Use errgroup for parallel shutdown with shared timeout
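
A sketch using errgroup (golang.org/x/sync/errgroup) under one shared deadline; the services slice and its Stop signature are assumptions about the failover code, not its actual shape:

// Sketch: stop everything in parallel under a single shared timeout rather
// than sequentially with a 3s timeout per service.
shutdownCtx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
defer cancel()

g, gCtx := errgroup.WithContext(shutdownCtx)
for _, svc := range services {
    svc := svc // capture for Go versions before 1.22
    g.Go(func() error { return svc.Stop(gCtx) })
}
if err := g.Wait(); err != nil {
    logger.Warn().Err(err).Msg("shutdown finished with errors")
}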

Resource Leaks

pkg/raft/node.go:92-95 - No cleanup on error

r, err := raft.NewRaft(raftConfig, fsm, logStore, stableStore, snapshotStore, transport)
if err != nil {
    return nil, fmt.Errorf("create raft: %w", err)
}
  • Issue: If NewRaft fails, bolt stores and transport are not closed
  • Fix: Defer cleanup or use named returns with error handling
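
A sketch of the cleanup, assuming logStore and stableStore are separate raft-boltdb stores and transport is the raft network transport (all of which expose Close):

r, err := raft.NewRaft(raftConfig, fsm, logStore, stableStore, snapshotStore, transport)
if err != nil {
    // Release everything opened earlier in the constructor so a failed
    // initialization does not leak bolt file handles or listeners.
    _ = logStore.Close()
    _ = stableStore.Close()
    _ = transport.Close()
    return nil, fmt.Errorf("create raft: %w", err)
}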

5. Testing & Reliability

Test Coverage Issues (41.44% patch coverage)

Major gaps:

  1. pkg/raft/node.go: 12.50% - Core RAFT logic barely tested

    • Snapshot/restore not tested
    • Peer add/remove not tested
    • Bootstrap scenarios limited
  2. block/internal/syncing/raft_retriever.go: 0% - Completely untested

    • Follower sync path is critical
    • Error handling unverified
  3. node/failover.go: 74.45% - Mode switching needs more coverage

    • Leader→Follower transition edge cases
    • Concurrent state changes

Positive Testing

  • pkg/raft/election_test.go has good coverage of leadership scenarios
  • test/e2e/failover_e2e_test.go validates end-to-end failover flow
  • E2E test checks for double-signing - excellent!

Missing Test Scenarios

  1. Network partition during leader election
  2. Follower crashes during RAFT apply
  3. RAFT log divergence after network split
  4. Leader produces block but crashes before broadcasting
  5. Race between P2P sync and RAFT sync

6. Documentation & Developer Experience

Strengths

  • An ADR appears to exist (referenced in commit messages)
  • Configuration well-documented in config.go
  • E2E test serves as good integration example

Gaps

Configuration defaults unclear

  • config/defaults.go should set sensible RAFT defaults, but the diff under review does not show them
  • Users need guidance on:
    • Minimum cluster size (3? 5?)
    • Heartbeat timeout recommendations
    • Snapshot frequency tuning

Migration path undocumented

  • How do existing single-node deployments upgrade to RAFT cluster?
  • Can RAFT be enabled without data migration?

Operational runbooks missing

  • How to add/remove nodes from cluster?
  • How to recover from split-brain?
  • Monitoring/alerting recommendations?

7. Specific Code Review Comments

pkg/raft/election.go:95-104 (CRITICAL)

Double-signing prevention stops node entirely on transfer failure

if err := d.node.leadershipTransfer(); err != nil && !errors.Is(err, raft.ErrNotLeader) {
    // the leadership transfer can fail due to no suitable leader. Better stop than double sign on old state
    return err // ❌ Terminates node
}

Recommendation: Change to:

if err := d.node.leadershipTransfer(); err != nil {
    d.logger.Warn().Err(err).Msg("leadership transfer failed, will retry on next election")
    // Continue participating in elections
}
continue

block/internal/executing/executor.go:434-460 (HIGH)

Commit order creates inconsistency risk

Current flow:

  1. Broadcast to RAFT (can fail)
  2. Commit to local DB (always succeeds)

If step 1 fails but step 2 succeeds, leader has uncommitted block locally.

Recommendation:

  • Option A: Move RAFT broadcast before batch.Commit()
  • Option B: Add compensating rollback on RAFT failure
  • Option C: Make batch.Commit() idempotent so retry is safe

block/internal/syncing/raft_retriever.go:99-128 (MEDIUM)

No test coverage for critical follower path

This code consumes blocks from RAFT leader. Zero tests means:

  • Unmarshaling errors not tested
  • Validation bypass scenarios not tested
  • Event processing failures not tested

Recommendation: Add unit tests covering:

func TestRaftRetrieverConsumeBlock(t *testing.T) {
    // Test invalid header
    // Test unexpected proposer
    // Test event processing failure
    // Test context cancellation
}

node/full.go:216-220 (LOW)

Initialization order unclear

if e.raftNode != nil {
    // ensure node is fully synced before producing any blocks
    if raftState := e.raftNode.GetState(); raftState.Height != 0 && raftState.Height != state.LastBlockHeight {
        return fmt.Errorf("invalid state: node is not synced with the chain: raft %d != %d state", raftState.Height, state.LastBlockHeight)
    }
}

Question: When does this check trigger? On first start raftState.Height is 0, so check passes. On restart after failover, heights should match. When would they differ?


Summary & Recommendations

Critical (Must Fix Before Merge)

  1. Fix election loop termination (election.go:101) - node stops on transfer failure
  2. Fix commit ordering (executor.go:452) - RAFT broadcast after DB commit creates inconsistency
  3. Add RAFT retriever tests (raft_retriever.go) - 0% coverage on critical path
  4. Migrate JSON to protobuf (node.go:211) - performance bottleneck at 100ms block time

High Priority (Fix Soon After Merge)

  1. Improve FSM validation - add signature/hash validation in RAFT state machine
  2. Fix resource leaks - cleanup bolt stores on initialization failure
  3. Add split-brain tests - network partition scenarios
  4. Document operational procedures - runbooks for add/remove nodes, recovery

Medium Priority (Technical Debt)

  1. Increase pending poll interval - 10ms too aggressive, use 50-100ms
  2. Add buffer size guidance - document SetApplyCallback buffer requirements
  3. Improve proposer validation logging - silent ignore → warning
  4. Parallel shutdown - reduce failover downtime

Low Priority (Nice to Have)

  1. Follow naming conventions - newSyncMode → NewSyncMode
  2. Add migration docs - upgrade path for existing deployments
  3. Clarify initialization checks - when does raft height != state height?

Overall Assessment

This is a solid foundational implementation of RAFT-based HA. The architecture is clean, the integration is thoughtful, and the E2E test validates the core failover scenario.

However, there are critical correctness issues around error handling and commit ordering that could cause cluster divergence or node crashes in production. The low test coverage (41%) and zero coverage on follower sync paths are concerning for a consensus system.

Recommendation: Address the 4 critical issues before merging. The performance optimization (JSON→protobuf) is essential for the stated 100ms block time goal but could be done in a follow-up PR with benchmarks.

Great work overall - this is a complex feature and the implementation shows careful consideration of the distributed systems challenges. With the fixes above, this will be production-ready.


@evstack deleted a comment from claude bot Nov 19, 2025
@alpe requested a review from julienrbrt November 19, 2025 11:35
@julienrbrt (Member) left a comment:

First walkthrough; I'll go read about github.com/hashicorp/raft and come back to review election.go and node.go.

return nil
}

// Height returns the current height stored
Member:

Why do we need to know the height of the p2p (go header) store? (I am still reviewing, this may get clearer). We can have the app height from the evolve store.

Contributor Author:

When the node switches from sync to aggregator mode, the internal state is key to preventing double signing.
The Syncer now has an isCatchingUpState method that checks the stores for any height greater than the current one.
It is called within the leader election loop to transfer leadership away in case the node is not fully synced yet.

}

// SetApplyCallback sets a callback to be called when log entries are applied
func (n *Node) SetApplyCallback(ch chan<- RaftApplyMsg) {
Member:

nit: what is this for? the go doc is very light

Contributor Author:

The channel is passed by the syncer to receive first level state updates from within the raft cluster. This should be the fastest communication channel available.

}()

// Check raft leadership if raft is enabled
if e.raftNode != nil && !e.raftNode.IsLeader() {
Member (@julienrbrt), Nov 19, 2025:

Unrelated: I wonder how this will play with different sequencers.
In #2797 you can get to that path without a node key (to sign). I suppose we'll need to add a condition for based sequencing.

Contributor Author:

Yes, I was only preparing for the single sequencer. Based sequencing would not work with RAFT as there are no aggregators.

leaderFactory := func() (raftpkg.Runnable, error) {
logger.Info().Msg("Starting aggregator-MODE")
nodeConfig.Node.Aggregator = true
nodeConfig.P2P.Peers = "" // peers are not supported in aggregator mode
Member:

not sure I understand this. is the aggregator broadcasting to no one?

Contributor:

The aggregator is required to broadcast to at least one node that is part of a larger mesh, otherwise p2p will not work.

Contributor Author:

This is more about who calls whom: the aggregator gets called, not the other way around. Starting all nodes with a p2p-peer setup makes sense, though. When an HA cluster is set up, the RAFT leader gets the aggregator role and I clear the peers when the p2p stack is restarted.
There is an error thrown somewhere when peers are not empty.

node/full.go (outdated)
func initRaftNode(nodeConfig config.Config, logger zerolog.Logger) (*raftpkg.Node, error) {
raftDir := nodeConfig.Raft.RaftDir
if raftDir == "" {
raftDir = filepath.Join(nodeConfig.RootDir, "raft")
Member:

nit: we should be using DefaultConfig() value if empty.

bc *block.Components
}

func newSyncMode(
Member:

nit: I was a tiny bit confused that this was moved here instead of full.go

Contributor Author:

These are the constructors. Naming could be better, I guess.

}
return setupFailoverState(nodeConfig, nodeKey, database, genesis, logger, mainKV, rktStore, blockComponentsFn, raftNode)
}
func newAggregatorMode(
Member:

ditto

return fmt.Errorf("not leader")
}

data, err := json.Marshal(state) // todo:use protobuf
Member:

why the todo? size?

Contributor:

We should migrate to protobuf here. JSON will cause overhead; at 100ms block times we need to minimise it as much as possible.

* main:
  chore: reduce log noise (#2864)
  fix: sync service for non zero height starts with empty store (#2834)
  build(deps): Bump golang.org/x/crypto from 0.43.0 to 0.45.0 in /execution/evm (#2861)
  chore: minor improvement for docs (#2862)
alpe added 3 commits November 20, 2025 17:24
* main:
  chore: bump da (#2866)
  chore: bump  core (#2865)
* main:
  chore: fix some comments (#2874)
  chore: bump node in evm-single (#2875)
  refactor(syncer,cache): use compare and swap loop and add comments (#2873)
  refactor: use state da height as well (#2872)
  refactor: retrieve highest da height in cache (#2870)
  chore: change from event count to start and end height (#2871)
github-merge-queue bot pushed a commit that referenced this pull request Nov 21, 2025
## Overview

Speed up cache write/loads via parallel execution.  

Pulled from  #2836
github-merge-queue bot pushed a commit that referenced this pull request Nov 21, 2025
## Overview

Minor updates to make it easier to trace errors

Extracted from #2836
alpe added 5 commits November 24, 2025 16:21
* main:
  chore: remove extra github action yml file (#2882)
  fix(execution/evm): verify payload status (#2863)
  feat: fetch included da height from store (#2880)
  chore: better output on errors (#2879)
  refactor!: create da client and split cache interface (#2878)
  chore!: rename `evm-single` and `grpc-single` (#2839)
  build(deps): Bump golang.org/x/crypto from 0.42.0 to 0.45.0 in /tools/da-debug in the go_modules group across 1 directory (#2876)
  chore: parallel cache de/serialization (#2868)
  chore: bump blob size (#2877)