Conversation

@tac0turtle
Contributor

Overview

Sync service fix for when we are not at genesis but have an empty store.

In 1.0.0-beta.9 we fixed writing to the p2p store, but when the store is empty AND we are not at genesis, the store never gets initialised. This PR aims to fix that.

@github-actions
Contributor

github-actions bot commented Nov 12, 2025

The latest Buf updates on your PR. Results from workflow CI / buf-check (pull_request).

Build      Format      Lint       Breaking   Updated (UTC)
✅ passed   ⏩ skipped   ✅ passed   ✅ passed   Nov 20, 2025, 8:23 AM

@claude
Contributor

claude bot commented Nov 12, 2025

Claude finished @tac0turtle's task


Comprehensive Code Review

I've completed a thorough review of PR #2834 focusing on the sync service fix for non-zero height starts with empty store. Below is my analysis organized by key focus areas.


1. Code Quality & Architecture

✅ Strengths

  • Clean atomic store initialization: The storeInitialized atomic boolean pattern (pkg/sync/sync_service.go:58, 140-147) properly prevents race conditions during concurrent initialization
  • Improved bootstrap logic: The initFromP2PWithRetry method (lines 315-381) now intelligently detects starting height from local store instead of requiring a trusted hash
  • Proper separation of concerns: New RPC endpoints follow existing patterns and are well-structured

⚠️ Concerns

Critical: Incomplete store initialization atomicity (pkg/sync/sync_service.go:341-346)

if syncService.storeInitialized.CompareAndSwap(false, true) {
    if _, err := syncService.initStore(ctx, trusted); err != nil {
        syncService.storeInitialized.Store(false)  // ⚠️ Race window
        return false, fmt.Errorf("failed to initialize the store: %w", err)
    }
}

Issue: There's a potential race between checking the atomic and initializing. If initFromP2PWithRetry runs concurrently with WriteToStoreAndBroadcast, both could attempt initialization. Consider using a mutex for the entire initialization section.

Moderate: Inconsistent error handling in P2P store info (pkg/rpc/server/server.go:238-248)

if head, err := store.Head(ctx); err == nil {
    snapshot.Head = toP2PStoreEntry(head)
} else if !errors.Is(err, goheader.ErrEmptyStore) && !errors.Is(err, goheader.ErrNotFound) {
    return nil, fmt.Errorf("failed to read %s head: %w", label, err)
}

Concern: Silent failures for ErrEmptyStore and ErrNotFound mean clients receive incomplete data without indication. The P2PStoreSnapshot should include a status field or the API should distinguish between "empty" and "error".
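A sketch of the suggested status field, with hypothetical names; the real `P2PStoreSnapshot` is a protobuf message, so this only illustrates the classification logic a handler could apply:

```go
package main

// P2PStoreStatus distinguishes an intentionally empty snapshot from a
// populated one. All names here are illustrative, not the actual API.
type P2PStoreStatus string

const (
	StoreStatusOK    P2PStoreStatus = "ok"
	StoreStatusEmpty P2PStoreStatus = "empty"
)

type entry struct {
	Height uint64
	Hash   string
}

type snapshotWithStatus struct {
	Label  string
	Status P2PStoreStatus
	Head   *entry // nil when Status == StoreStatusEmpty
}

// classify sets Status based on whether a head entry was found, so clients
// can tell "store is empty" apart from "field was silently omitted".
func classify(label string, head *entry) snapshotWithStatus {
	s := snapshotWithStatus{Label: label, Head: head, Status: StoreStatusOK}
	if head == nil {
		s.Status = StoreStatusEmpty
	}
	return s
}
```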

Minor: Magic number for timeout (pkg/sync/sync_service.go:357)

timeoutTimer := time.NewTimer(time.Minute * 10)

Should be a configurable constant.


2. Error Handling & Resilience

✅ Strengths

  • Exponential backoff: Well-implemented in initFromP2PWithRetry with sensible max backoff (lines 354-379)
  • Context-aware shutdown: Proper use of context in Stop() method
  • Graceful degradation: The replay mechanism (block/internal/common/replay.go:70-79) handles execution layer being ahead without failing

⚠️ Concerns

Critical: Silent error in broadcast during initialization (pkg/sync/sync_service.go:162-169)

if (firstStart && errors.Is(err, pubsub.ValidationError{Reason: pubsub.RejectValidationIgnored})) ||
    ((storeInitialized) && errors.Is(err, pubsub.ValidationError{Reason: pubsub.RejectValidationFailed})) {
    return nil  // ⚠️ Silently swallowing errors
}

Issue: Complex conditional logic silently swallows broadcast errors. This makes debugging difficult. At minimum, add detailed debug logging explaining why each error type is safe to ignore.
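A sketch of what that logging could look like. `validationError` is a local stand-in for `pubsub.ValidationError` (from go-libp2p-pubsub) so the example is self-contained; the reason strings and log messages are assumptions:

```go
package main

import (
	"errors"
	"log/slog"
)

// validationError stands in for pubsub.ValidationError; Reason mirrors the
// pubsub.RejectValidationIgnored / RejectValidationFailed constants.
type validationError struct{ Reason string }

func (e validationError) Error() string { return "validation error: " + e.Reason }

// classifyBroadcastErr decides whether a broadcast error is safe to ignore
// and logs why, instead of silently returning nil.
func classifyBroadcastErr(err error, firstStart, storeInitialized bool) (ignored bool) {
	var ve validationError
	if !errors.As(err, &ve) {
		return false
	}
	switch {
	case firstStart && ve.Reason == "ignored":
		slog.Debug("ignoring broadcast error on first start: message was already seen locally",
			"reason", ve.Reason)
		return true
	case storeInitialized && ve.Reason == "failed":
		slog.Debug("ignoring broadcast validation failure: store already initialized, header will arrive via sync",
			"reason", ve.Reason)
		return true
	}
	return false
}
```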

Moderate: No validation of heightToQuery (pkg/sync/sync_service.go:327-335)

switch {
case errors.Is(headErr, header.ErrNotFound), errors.Is(headErr, header.ErrEmptyStore):
    heightToQuery = syncService.genesis.InitialHeight
case headErr != nil:
    return false, fmt.Errorf("failed to inspect local store head: %w", headErr)
default:
    heightToQuery = head.Height()  // ⚠️ No validation
}

Issue: Should validate that head.Height() >= genesis.InitialHeight. Corrupted store could return invalid height.
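The validation itself is small; a sketch with illustrative names:

```go
package main

import "fmt"

// validateStoreHeight guards against a corrupted store returning a height
// below the genesis initial height. Names are illustrative.
func validateStoreHeight(headHeight, initialHeight uint64) (uint64, error) {
	if headHeight < initialHeight {
		return 0, fmt.Errorf("store head height %d is below genesis initial height %d: store may be corrupted",
			headHeight, initialHeight)
	}
	return headHeight, nil
}
```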


3. Security

✅ Strengths

  • No exposed secrets: No hardcoded credentials or keys
  • Proper input validation: RPC endpoints validate identifiers before processing

⚠️ Concerns

Low: Potential DoS via GetP2PStoreInfo (pkg/rpc/server/server.go:185-211)

The endpoint has no rate limiting and could be expensive if stores are very large. Consider:

  • Adding request context timeout
  • Limiting response size
  • Rate limiting at HTTP layer

Low: Genesis DA height validation (pkg/rpc/server/server.go:175-177)

if len(resp.Msg.GetValue()) != 8 {
    return nil, connect.NewError(connect.CodeNotFound, errors.New("invalid metadata value"))
}

Should use CodeInvalidArgument or CodeDataLoss instead of CodeNotFound for data corruption.


4. Performance & Resource Efficiency

✅ Strengths

  • Non-blocking syncer start: Proper use of goroutines and channels
  • Efficient store querying: Single lookup for height determination

⚠️ Concerns

Moderate: Unbounded retry loop (pkg/sync/sync_service.go:360-380)

for {
    ok, err := tryInit(ctx)
    if ok {
        return nil
    }
    // ... retry logic
}

Issue: While there's a 10-minute timeout, the loop could perform many retries with small backoff. Consider:

  • Max retry count
  • Logging every Nth retry to avoid log spam
  • Backoff jitter to prevent thundering herd

Minor: Multiple store.Head() calls (pkg/sync/sync_service.go:234, 327)

The initStore and initFromP2PWithRetry both call store.Head(). These could be combined in some code paths.


5. Testing & Reliability

✅ Strengths

  • Excellent test coverage: New test TestHeaderSyncServiceInitFromHigherHeight specifically validates the fix (pkg/sync/sync_service_test.go:117-168)
  • Comprehensive RPC tests: Good mock-based testing for new endpoints
  • Integration test exists: TestHeaderSyncServiceRestart validates restart scenarios

⚠️ Concerns

Critical: Missing test case - concurrent initialization

Should add a test that attempts concurrent calls to WriteToStoreAndBroadcast during initFromP2PWithRetry to validate the atomicity guarantees.
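Such a test could spin up many goroutines racing through the initialization path and assert that exactly one init runs. The sketch below uses a plain mutex-plus-flag stand-in for the service's guard; in the real test the goroutines would call `WriteToStoreAndBroadcast` and `initFromP2PWithRetry` concurrently:

```go
package main

import (
	"sync"
	"sync/atomic"
)

// concurrentInitOnce launches n goroutines that all race to initialize
// through the same guard and reports how many initializations actually ran.
func concurrentInitOnce(n int) int64 {
	var (
		mu    sync.Mutex
		done  bool
		inits int64
		wg    sync.WaitGroup
	)
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			mu.Lock()
			defer mu.Unlock()
			if !done {
				atomic.AddInt64(&inits, 1)
				done = true
			}
		}()
	}
	wg.Wait()
	return atomic.LoadInt64(&inits)
}
```

Run with `-race` so the detector also flags any unlocked access to the shared state.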

Moderate: Test data realism (pkg/sync/sync_service_test.go:158-165)

headerConfig := types.HeaderConfig{
    Height:   genesisDoc.InitialHeight + 5,  // ⚠️ Jumping 5 heights
    // ...
}

This test creates a header at height 6 when store is empty. While this validates the fix, should also test with heights 1, 2, 3... to ensure no intermediate state issues.

Low: Client test uses wrong peer IDs (pkg/rpc/client/client_test.go:208-211)

require.Equal(t, "3tSMH9AUGpeoe4", resultPeers[0].Id)
// ... but setup was:
{
    ID:    "3bM8hezDN5",  // ⚠️ Mismatch

Appears to be deterministic hashing, but test expectations don't match setup. Fragile.


6. Documentation & Developer Experience

✅ Strengths

  • CHANGELOG is clear: Well-documented breaking changes and fixed issues
  • Good function comments: initFromP2PWithRetry has excellent documentation
  • Proto documentation: RPC methods have clear descriptions

⚠️ Concerns

Moderate: Missing migration guide

The removal of TrustedHash is a BREAKING CHANGE. CHANGELOG mentions it (line 29), but should include:

  • Migration steps for users upgrading
  • How to handle existing config files with trusted_hash set
  • Impact on light clients (if any)

Minor: Incomplete TODO (pkg/sync/sync_service.go:36)

// TODO: when we add pruning we can remove this
const ninetyNineYears = 99 * 365 * 24 * time.Hour

Should reference a tracking issue for pruning feature.

Minor: Confusing protobuf field numbering (proto/evnode/v1/state_rpc.proto:68, 82-83)

message GetGenesisDaHeightResponse {
  uint64 height = 3;  // ⚠️ Why 3? Should be 1
}

message P2PStoreSnapshot {
  string        label  = 1;
  uint64        height = 2;
  P2PStoreEntry head   = 4;  // ⚠️ Skipped 3
  P2PStoreEntry tail   = 6;  // ⚠️ Skipped 5
}

While protobuf allows non-sequential numbering, this suggests removed fields. Should document or use sequential numbering for clarity.


7. Specific File Reviews

pkg/sync/sync_service.go (Core Fix)

Rating: 🟡 Good with critical concerns

The fix correctly addresses the empty store initialization issue. However:

  • Line 341-346: Race condition risk (see section 1)
  • Line 327-335: Should validate retrieved height
  • Line 357: Hardcoded timeout should be configurable

pkg/rpc/server/server.go (New Endpoints)

Rating: 🟢 Good

Well-structured additions. Minor suggestions:

  • Line 113-129: DA height fetch could be extracted to helper function (repeated pattern)
  • Line 175-177: Wrong error code for invalid data

block/internal/common/replay.go (Relaxed Validation)

Rating: 🟢 Excellent

The change to allow execution ahead of target (lines 70-79) is well-reasoned and properly documented. Good defensive programming.


Summary & Recommendations

Must Fix Before Merge

  1. Sync service atomicity: Address the race condition in store initialization (use mutex or redesign)
  2. Add concurrent initialization test: Validate thread safety
  3. Improve error handling: Add detailed logging for swallowed broadcast errors
  4. Fix protobuf field numbering: Use sequential or document why skipped

Should Fix (High Priority)

  1. Add migration guide: Document TrustedHash removal migration path
  2. Validate height from store: Ensure retrieved height >= genesis height
  3. Make timeout configurable: Extract magic number to config

Nice to Have

  1. Add P2PStoreSnapshot status field: Indicate empty vs error states
  2. Extract DA height fetching: Reduce code duplication
  3. Add retry count limit: Prevent excessive retry attempts
  4. Add backoff jitter: Prevent thundering herd

Overall Assessment

This PR successfully fixes the sync service initialization issue for non-zero height starts with empty stores. The approach is sound, removing the problematic TrustedHash requirement in favor of intelligent height detection from local store state. Test coverage is good and the CHANGELOG is clear.

However, there are critical thread-safety concerns around the store initialization atomicity that must be addressed before merging. The current implementation has a race window that could cause issues in production.

Recommendation: Request changes to address the critical atomicity issue, then approve with remaining items as follow-up improvements.


@codecov

codecov bot commented Nov 12, 2025

Codecov Report

❌ Patch coverage is 68.86792% with 33 lines in your changes missing coverage. Please review.
✅ Project coverage is 64.89%. Comparing base (a75b4bc) to head (6e6d2e8).
⚠️ Report is 3 commits behind head on main.

Files with missing lines    Patch %   Lines
pkg/sync/sync_service.go    21.87%    24 Missing and 1 partial ⚠️
pkg/rpc/server/server.go    86.36%    3 Missing and 3 partials ⚠️
pkg/rpc/client/client.go    66.66%    1 Missing and 1 partial ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #2834      +/-   ##
==========================================
+ Coverage   64.67%   64.89%   +0.22%     
==========================================
  Files          81       81              
  Lines        7173     7242      +69     
==========================================
+ Hits         4639     4700      +61     
- Misses       1995     1998       +3     
- Partials      539      544       +5     
Flag       Coverage Δ
combined   64.89% <68.86%> (+0.22%) ⬆️


@tac0turtle tac0turtle marked this pull request as ready for review November 12, 2025 14:14
@tac0turtle tac0turtle changed the title fix: sync service fix: sync service for non zero height starts Nov 12, 2025
@tac0turtle tac0turtle changed the title fix: sync service for non zero height starts fix: sync service for non zero height starts with empty store Nov 12, 2025
@github-actions
Contributor

github-actions bot commented Nov 12, 2025

PR Preview Action v1.6.2
Preview removed because the pull request was closed.
2025-11-20 11:29 UTC

alpe
alpe previously approved these changes Nov 12, 2025
Contributor

@alpe alpe left a comment

utACK 👍

@tac0turtle
Contributor Author

Testing a few other fixes, so I'll hold off on merging this.


## Overview



Add a store-info command to inspect the p2p store and see what is present.

## Overview


This PR removes the trusted hash approach to sync. That approach works for celestia-node, since it does not reconstruct state and can therefore jump to a height closer to the head. With Evolve this assumption is incorrect: we need to reconstruct state, meaning we either sync from genesis or download a db snapshot and start the node from there.
heightToQuery uint64
)

if syncService.conf.Node.TrustedHash != "" {
Contributor Author

I removed this because we cannot start syncing from a random height while we are in the process of reconstructing state. While this is useful for light mode, we should aim to simplify the system before working on light mode, as it is not a priority today.

Instead we check the p2p store to see if we have a height; if not, we go to genesis.

Member

Why is that, actually? You should be able to ignore the previous state checks for that block and just start from an arbitrary block?

Contributor Author

In the future when we begin pruning state, I believe with how goheader works we won't be able to use the genesis height as our trusted height if it is not present. In that case we should use what is present in the store. We could also opt to use the trusted hash, but possibly from the other store.

The trusted hash approach is useful for things like the light node, but because that isn't a concern today, we should optimize for what is actually being used.

Member

Okay, so it cannot be used to fetch that height from the p2p peers?

Contributor Author

It will request the height from a peer, you are correct, but we can't do much with that info. If we are syncing from p2p or DA then we already have a store, so it's best to trust what we have. If we have an empty store, we will either sync from genesis or grab a hosted snapshot, so we will have something in the store either way.

Member

Right, I mean the use case where you want to start syncing from a specific height with an empty state. I still do not understand why it wouldn't be supported.
We'd init the store at that height and, like with a pruned state, you won't be able to query prior to that trusted height, but that wouldn't differ from a pruned state.

Anyway, I agree that it is fine to delete it to simplify, but I believe it shouldn't take much to eventually make it work. In my head it'd simply work like this:

  • Fetch height X (as trusted height) from a peer
  • Init the store at that height
  • While syncing (I believe the edge case where you've just synced a node and it immediately fails over to this one as executor (raft) does not need to be supported), our ev-node store would be empty, but we can have a dummy X-1 height and run X-1 -> X verification for this special case
  • X -> X+1 would then work fine.

Contributor Author

In this scenario you will need to provide some sort of state, otherwise syncing won't work since we won't be able to verify apphashes and other hashes. If we want to support this then we need a form of state sync, which I'd argue is overkill. I think it's better to provide a protocol that queries for the latest snapshot from an s3 bucket or something similar. There isn't a trust issue here because if the downloaded snapshot is tampered with, syncing will fail.

Member

@julienrbrt julienrbrt left a comment

first walkthrough

@tac0turtle tac0turtle marked this pull request as draft November 15, 2025 20:27
@tac0turtle tac0turtle marked this pull request as ready for review November 18, 2025 11:07
julienrbrt
julienrbrt previously approved these changes Nov 18, 2025
Member

@julienrbrt julienrbrt left a comment

ACK!

@tac0turtle tac0turtle added this pull request to the merge queue Nov 20, 2025
@tac0turtle tac0turtle removed this pull request from the merge queue due to a manual request Nov 20, 2025
@tac0turtle tac0turtle merged commit 9d4c64c into main Nov 20, 2025
56 of 59 checks passed
@tac0turtle tac0turtle deleted the marko/sync_service_fix branch November 20, 2025 11:28
alpe added a commit that referenced this pull request Nov 20, 2025
* main:
  chore: reduce log noise (#2864)
  fix: sync service for non zero height starts with empty store (#2834)
  build(deps): Bump golang.org/x/crypto from 0.43.0 to 0.45.0 in /execution/evm (#2861)
  chore: minor improvement for docs (#2862)