
feat(evm/sync): implement dynamic state sync orchestration #5051

Draft

powerslider wants to merge 58 commits into master from powerslider/dynamic-state-sync-poc

Conversation

@powerslider
Contributor

Why this should be merged

DO NOT MERGE!!!

This is only a PoC branch that implements the full dynamic state sync orchestration flow. All of these changes will later be split into smaller, review-friendly parts with narrow responsibility scopes.

How this works

  • Add dynamic engine components: coordinator, block queue, pivot policy (throttling sketched below), sync target, and dynamic executor.
  • Extend engine client to start static/dynamic executors and route engine block events (accept/reject/verify) to the active executor.
  • Delegate wrapped block lifecycle hooks to sync client in coreth and subnet-evm, with deferred execution support.
  • Add UpdateTarget to types.Syncer and provide no-op compatibility implementations for existing syncers.
  • Plumb dynamic sync config through coreth and subnet-evm:
    • state-sync-dynamic-enabled
    • state-sync-pivot-interval (default 10000)
  • Document dynamic mode behavior in sync/config docs.
  • Add and refine tests for dynamic flow, queue/coordinator/pivot behavior, and sync VM dynamic mode coverage.
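
For orientation, a minimal sketch of what the interval-based pivot throttling above can look like; the type and method names are illustrative assumptions, not the actual API in `evm/sync/engine/pivot_policy.go`:

```go
package engine

// intervalPivotPolicy throttles sync-target updates: an accepted block only
// triggers a pivot once the chain has advanced by at least pivotInterval
// heights since the previous pivot. Names are illustrative.
type intervalPivotPolicy struct {
	pivotInterval uint64 // e.g. the state-sync-pivot-interval value (default 10000)
	lastPivot     uint64 // height at which the last pivot fired
}

// ShouldPivot reports whether an accepted block at the given height should
// advance the sync target, recording the pivot if so.
func (p *intervalPivotPolicy) ShouldPivot(height uint64) bool {
	if height < p.lastPivot+p.pivotInterval {
		return false // too soon since the last pivot: suppress the update
	}
	p.lastPivot = height
	return true
}
```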

Signed-off-by: Tsvetan Dimitrov (tsvetan.dimitrov@avalabs.org)

How this was tested

Unit tests for now; e2e tests will follow.

Need to be documented in RELEASES.md?

Not for now.

@powerslider self-assigned this Mar 5, 2026
@powerslider requested a review from a team as a code owner March 5, 2026 12:40
@powerslider added the DO NOT MERGE label Mar 5, 2026
Copilot AI review requested due to automatic review settings March 5, 2026 12:40
Contributor

Copilot AI left a comment


Pull request overview

Implements a proof-of-concept “dynamic” state sync orchestration path for EVM-based VMs (coreth + subnet-evm), allowing the state sync target to be updated during sync and deferring engine block lifecycle operations until sync finalization.

Changes:

  • Introduces dynamic sync orchestration primitives in evm/sync/engine (coordinator, queue, pivot policy, dynamic executor, sync target) and extends the engine client/executor interfaces to handle engine block events.
  • Plumbs new dynamic state sync config flags (state-sync-dynamic-enabled, state-sync-pivot-interval) through coreth and subnet-evm, and documents behavior.
  • Updates/expands unit tests across sync engine, syncers, and VM sync flows; updates types.Syncer with UpdateTarget plus compatibility implementations.

Reviewed changes

Copilot reviewed 41 out of 41 changed files in this pull request and generated 6 comments.

Summary per file:

- `graft/subnet-evm/plugin/evm/wrapped_block.go`: Routes engine Accept/Reject/Verify into the sync client with optional deferral.
- `graft/subnet-evm/plugin/evm/vm.go`: Plumbs dynamic sync config into the sync engine client config.
- `graft/subnet-evm/plugin/evm/syncervm_test.go`: Adds static vs dynamic state sync mode coverage and renames expected→want fields.
- `graft/subnet-evm/plugin/evm/config/default_config.go`: Adds defaults for dynamic sync enablement and pivot interval.
- `graft/subnet-evm/plugin/evm/config/config.md`: Documents new dynamic sync config keys.
- `graft/subnet-evm/plugin/evm/config/config.go`: Adds dynamic sync fields to VM config struct.
- `graft/evm/sync/types/types.go`: Extends types.Syncer with UpdateTarget.
- `graft/evm/sync/handlers/code_request_test.go`: Test renames (expected→want).
- `graft/evm/sync/handlers/block_request_test.go`: Test renames (expected→want).
- `graft/evm/sync/evmstate/sync_test.go`: Test renames (expected→want) and minor variable naming cleanup.
- `graft/evm/sync/evmstate/state_syncer.go`: Adds no-op UpdateTarget for compatibility.
- `graft/evm/sync/evmstate/firewood_syncer.go`: Adds no-op UpdateTarget for compatibility (and imports message).
- `graft/evm/sync/engine/sync_target.go`: Adds internal non-serializable Syncable used to advance targets.
- `graft/evm/sync/engine/registry_test.go`: Updates registry tests for new interface and expected→want naming.
- `graft/evm/sync/engine/registry.go`: Adds UpdateSyncTarget fanout to registered syncers.
- `graft/evm/sync/engine/pivot_policy_test.go`: Adds tests for pivot throttling behavior.
- `graft/evm/sync/engine/pivot_policy.go`: Adds pivot throttling policy used for forwarding target updates.
- `graft/evm/sync/engine/executor_static.go`: Extends static executor with no-op engine event handlers.
- `graft/evm/sync/engine/executor_dynamic_test.go`: Adds tests for dynamic executor deferral + target update behavior.
- `graft/evm/sync/engine/executor_dynamic.go`: Implements dynamic executor with block deferral and target update forwarding.
- `graft/evm/sync/engine/doubles_test.go`: Adds mock EthBlockWrapper and updates test syncer shim for new interface.
- `graft/evm/sync/engine/coordinator_test.go`: Adds tests for coordinator state handling, lifecycle, and queue replay.
- `graft/evm/sync/engine/coordinator.go`: Adds coordinator orchestration, queue replay, and target update logic.
- `graft/evm/sync/engine/client.go`: Extends engine client with dynamic/static executors, engine block event routing, and new config fields.
- `graft/evm/sync/engine/block_queue_test.go`: Adds tests for queue ordering, pruning, dedupe, and concurrency.
- `graft/evm/sync/engine/block_queue.go`: Implements a concurrent block operation queue with verify dedupe + pruning.
- `graft/evm/sync/code/syncer.go`: Adds no-op UpdateTarget for compatibility.
- `graft/evm/sync/client/client_test.go`: Test renames (expected→want).
- `graft/evm/sync/block/syncer_test.go`: Refactors parameterized tests and adds "no network requests when on disk" coverage.
- `graft/evm/sync/block/syncer.go`: Adds no-op UpdateTarget for compatibility.
- `graft/evm/sync/README.md`: Documents static vs dynamic engine execution modes and behavior.
- `graft/evm/message/block_sync_summary_test.go`: Test renames (expected→want).
- `graft/coreth/plugin/evm/wrapped_block.go`: Routes engine Accept/Reject/Verify into the sync client with optional deferral.
- `graft/coreth/plugin/evm/vmtest/test_syncervm.go`: Adds coreth VM integration test coverage for dynamic state sync mode.
- `graft/coreth/plugin/evm/vm.go`: Plumbs dynamic sync config into the sync engine client config.
- `graft/coreth/plugin/evm/config/default_config.go`: Adds defaults for dynamic sync enablement and pivot interval.
- `graft/coreth/plugin/evm/config/config.md`: Documents new dynamic sync config keys.
- `graft/coreth/plugin/evm/config/config.go`: Adds dynamic sync fields to VM config struct.
- `graft/coreth/plugin/evm/atomic/sync/syncer_test.go`: Test renames (expected→want).
- `graft/coreth/plugin/evm/atomic/sync/syncer.go`: Adds no-op UpdateTarget for compatibility.
- `graft/coreth/plugin/evm/atomic/sync/summary_test.go`: Test renames (expected→want).


Comment thread graft/evm/sync/README.md
Comment on lines +71 to +76
### Engine execution modes

`evm/sync/engine` has two execution modes:

- Static mode: default behavior, equivalent to the original state sync flow.
- Dynamic mode: syncers continue running while accepted blocks can advance the sync target (with pivot throttling), and block operations (`Accept`, `Reject`, `Verify`) are deferred and replayed in FIFO order after sync finalization.

Copilot AI Mar 5, 2026


The PR description explicitly says "DO NOT MERGE!!!" but the code/docs here introduce user-facing configuration and behavior changes. Before landing to main, please ensure the PR metadata matches intent (remove the DO NOT MERGE note / split into smaller PRs as described) so reviewers and release tooling don’t treat this as merge-ready feature work.

Comment on lines +95 to +99
// Notify sync client that engine accepted a block.
// If the block was enqueued for deferred processing, skip immediate execution.
if client := b.vm.SyncerClient(); client != nil {
    deferred, err := client.OnEngineAccept(b)
    if err != nil {

Copilot AI Mar 5, 2026


The deferred-operation hook runs on every Accept call, including when deferred operations are later replayed by the coordinator. During replay, calling Accept will re-enter OnEngineAccept and can re-enqueue the same operation instead of executing it, potentially causing an infinite loop / unbounded queue growth. Consider adding a replay marker (e.g., a context value set by the coordinator) so wrappedBlock skips calling the sync client for internal replays, or otherwise bypass the OnEngine* hooks during deferred operation execution.
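
A minimal sketch of the context-value marker suggested here (hypothetical names, not code from the PR):

```go
package engine

import "context"

// replayKey is an unexported context key marking lifecycle calls that
// originate from the coordinator's deferred-operation replay, so that
// wrappedBlock can skip the OnEngine* hooks for them.
type replayKey struct{}

// WithReplay marks ctx as a coordinator-driven replay.
func WithReplay(ctx context.Context) context.Context {
	return context.WithValue(ctx, replayKey{}, true)
}

// IsReplay reports whether ctx carries the replay marker.
func IsReplay(ctx context.Context) bool {
	v, _ := ctx.Value(replayKey{}).(bool)
	return v
}
```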

Comment on lines +86 to +90
// Notify sync client that engine accepted a block.
// If the block was enqueued for deferred processing, skip immediate execution.
if client := b.vm.SyncerClient(); client != nil {
    deferred, err := client.OnEngineAccept(b)
    if err != nil {

Copilot AI Mar 5, 2026


Same issue as coreth: the sync-client notification is invoked for every Accept call, including when deferred operations are replayed. Without a way to distinguish engine-triggered lifecycle calls from coordinator replays, replay will re-enter OnEngineAccept and can re-queue instead of executing, leading to a non-terminating replay loop. Add a replay marker (context value) or bypass the sync hook during deferred operation execution.

Comment on lines +237 to +244
switch op.operation {
case OpAccept:
    err = op.block.Accept(ctx)
case OpReject:
    err = op.block.Reject(ctx)
case OpVerify:
    err = op.block.Verify(ctx)
}

Copilot AI Mar 5, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

executeBlockOperations calls op.block.Accept/Reject/Verify directly. For real VM blocks (wrappedBlock), these methods now call the sync client's OnEngine* hooks first; during deferred replay this will re-enter the executor and can re-enqueue operations instead of executing them (potentially never draining the queue). To make replay actually execute, pass a dedicated replay context (or otherwise disable/bypass OnEngine* during replay).

Comment thread graft/evm/sync/engine/executor_dynamic.go Outdated
Comment thread graft/evm/sync/engine/block_queue.go Outdated
…ate sync

The static code queue assumes a single sync session from start to finish.
Dynamic state sync needs to pivot to a new sync target at any time, which
requires discarding stale in-flight code hashes from the abandoned session
and starting a fresh session without restarting the entire syncer.

Introduce SessionedQueue, an event-driven code queue that emits
session-tagged events (SessionStart, SessionEnd, CodeHash) so the syncer
can distinguish current-session work from stale hashes. The syncer's new
syncFromEvents loop manages per-session worker groups via sessionRunner,
starting and tearing down workers on session boundaries.

Key design decisions:
- Session lifecycle is managed by session.Manager[T], a generic
  coordinator that tracks monotonic session IDs, serializes
  Start/RequestPivot/RestartIfPending transitions under a mutex, and
  cancels the current session context with ErrPivotRequested on pivot.
- PivotTo sends the SessionEnd boundary event before any irreversible
  mutations (DB marker cleanup, session-manager pivot) so that a timeout
  or send failure leaves durable state unchanged.
- Boundary events use a configurable send timeout (default 5s) to fail
  fast under backpressure rather than blocking the producer indefinitely.

Also clean up syncer and session tests: remove over-abstracted test
helpers and use table-driven tests only where assertion patterns are
uniform.
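
To make the event protocol concrete, here is a sketch of what the session-tagged events could look like; every name below is an assumption inferred from the description above, not the PR's actual definitions:

```go
package code

// Hash stands in for common.Hash to keep the sketch self-contained.
type Hash [32]byte

// EventKind distinguishes session boundaries from code-hash work items.
type EventKind int

const (
	SessionStart EventKind = iota // a new sync session has begun
	SessionEnd                    // the current session is being torn down
	CodeHash                      // a code hash to fetch within the session
)

// Event tags each item with the monotonic session ID that produced it, so
// syncFromEvents can discard hashes from an abandoned session after a pivot
// instead of fetching stale work.
type Event struct {
	Kind    EventKind
	Session uint64 // monotonic ID from session.Manager
	Hash    Hash   // set only for CodeHash events
}
```
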
During StateExecutingBatch, block.Accept/Reject/Verify calls re-enter
the dynamic executor via OnEngineAccept/Reject/Verify. The previous
implementation re-enqueued these operations, creating an infinite loop
(Accept -> OnEngineAccept -> enqueue -> dequeue -> Accept ...).

- Return deferred=false during StateExecutingBatch so the block operation
  executes directly without re-enqueueing (dispatch sketched after this list).
- Also fix the blockQueue doc
  comment to accurately describe pruning behavior (removeBelowHeight,
  not full drain).
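
A sketch of the resulting dispatch, under assumed state and type names, showing why replay now terminates:

```go
package engine

import "context"

// Illustrative subset of the dynamic executor's states.
type execState int

const (
	StateSyncing        execState = iota // sync in progress: defer block ops
	StateExecutingBatch                  // queue replay: execute inline
)

// Block is the minimal surface this sketch needs.
type Block interface {
	Accept(context.Context) error
}

type dynamicExecutor struct {
	state   execState
	enqueue func(Block) error // enqueues a deferred Accept operation
}

// OnEngineAccept reports whether the Accept should be deferred. During queue
// replay it returns deferred=false so the operation executes directly;
// re-enqueueing here would loop Accept -> OnEngineAccept -> enqueue forever.
func (e *dynamicExecutor) OnEngineAccept(b Block) (deferred bool, err error) {
	if e.state == StateSyncing {
		return true, e.enqueue(b)
	}
	return false, nil // StateExecutingBatch and terminal states run inline
}
```
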
…lization

- Add CAS-guarded state transitions (markAborted, beginFinalizing) with
  sticky terminal states, serialize UpdateSyncTarget via updateMu, and
  ensure queue pruning only occurs after successful syncer fanout
  (transitions sketched below).
- Add targetEpoch for tracking successful commit target updates.
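
A compressed sketch of CAS-guarded, sticky-terminal transitions of this kind (state and method names are assumptions):

```go
package engine

import "sync/atomic"

// Coordinator states; terminal states are sticky.
const (
	stateSyncing int32 = iota
	stateFinalizing
	stateAborted
)

type coordinator struct {
	state atomic.Int32 // zero value is stateSyncing
}

// markAborted transitions syncing -> aborted; it fails once finalization has
// begun, so a terminal state is never overwritten.
func (c *coordinator) markAborted() bool {
	return c.state.CompareAndSwap(stateSyncing, stateAborted)
}

// beginFinalizing transitions syncing -> finalizing; a concurrent abort and
// finalize race resolves to whichever CAS lands first.
func (c *coordinator) beginFinalizing() bool {
	return c.state.CompareAndSwap(stateSyncing, stateFinalizing)
}
```
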
Implement UpdateTarget for the block syncer so dynamic state sync can notify
it of newer targets mid-sync. The syncer finishes its current pass, then
optionally runs one catch-up pass when the drift exceeds blocksToFetch.
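
Roughly, assuming a fixed 256-block window and illustrative field names, the drift check could look like:

```go
package block

import "sync/atomic"

const blocksToFetch = 256 // fixed fetch window of the main pass

// syncer sketches only the fields the drift check needs.
type syncer struct {
	latestTarget atomic.Uint64 // newest height seen via UpdateTarget
	fetchedTo    uint64        // highest height fetched by the main pass
}

// UpdateTarget records a newer target mid-sync without interrupting the
// current pass; the syncer consults it when the pass finishes.
func (s *syncer) UpdateTarget(height uint64) {
	s.latestTarget.Store(height)
}

// needsCatchUp reports whether one extra pass should run: only when the
// target drifted beyond the window the main pass already covered.
func (s *syncer) needsCatchUp() bool {
	return s.latestTarget.Load() > s.fetchedTo+blocksToFetch
}
```
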
…und static syncer

- Introduce HashDBDynamicSyncer as a pivot-loop wrapper that rebuilds a
  fresh HashDBSyncer on each target update, keeping static sync unaware
  of dynamic pivot logic
- Rename stateSync to HashDBSyncer.
- Split sync_test.go into hashdb_syncer_test.go and hashdb_dynamic_syncer_test.go.
- Extract AssertDBConsistency and helpers into synctest package.

This change introduces dynamic state sync support to the current hashdb
static syncer implementation. It will mostly be used to test the entire
dynamic state sync flow end to end. Whether this change stays or is
removed in favor of the Firewood syncer implementation will be decided
later.
The atomic syncer syncs to a fixed initial target and does not pivot,
relying on batch replay to fill the gap between its target and the
coordinator's advancing commit target. This requires two infrastructure
changes:

- TargetReporter interface - syncers report their target height so the
  coordinator prunes only down to the slowest syncer, preserving blocks
  the atomic syncer needs for gap filling during batch replay.
- Partial data tolerance in OnFinishAfterCommit - skip ApplyToSharedMemory
  when the atomic trie does not cover the full commit target range,
  deferring to inline application during batch replay and cursor cleanup
  on VM restart.
- Forward UpdateTarget to the underlying merklesync.Syncer which natively
  supports target updates by re-queuing completed work items.
- Firewood uses a regular code.Queue in both static and dynamic modes since it
  does not need the SessionedQueue cancel-restart protocol (for now, maybe
  ChangeProof implementation might change that!).
- Restructure client.go syncer wiring into per-backend helpers with
  firewood as the top-level branch.
- Extract CodeRequestQueue interface into the types package.
- Restore WithFinalizeCodeQueue for static HashDB path that was dropped during
  merge. Fix subnet-evm JSON config double-bracing in syncervm_test.go.
Signed-off-by: Tsvetan Dimitrov <tsvetan.dimitrov23@gmail.com>
- Run bazel-generate-metadata to update BUILD.bazel files after adding
  common.Hash import to types/types.go and other sync package changes.
Signed-off-by: Tsvetan Dimitrov <tsvetan.dimitrov23@gmail.com>
Contributor

@alarso16 left a comment


Overall, the distinction between a pivot/session/pivot_session/pivot_interval is super poorly defined. When things are necessary vs. when they are not is confusing to me, largely because of the huge diff.

I left a few large structural comments, but didn't look closely at the implementation, since there's no point in this large-PR format.

// Persist the block to the raw DB. During dynamic sync the commit-target
// block may not have been fetched by the block syncer if the target
// advanced beyond its initial fetch window.
rawdb.WriteBlock(bc.db, block)
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Interesting edge case. I do think this should be handled by the block syncer

Contributor Author


The block syncer is the right layer conceptually, but there's a timing gap that makes it insufficient on its own.

The block syncer runs concurrently with the state syncer and finishes independently. Its UpdateTarget records the new height, but the catch-up pass only triggers when drift exceeds blocksToFetch (256). With small advances during dynamic sync, no catch-up fires. You can envision it the following way:

  1. Block syncer starts, fetches blocks 1-256
  2. Block syncer's UpdateTarget records height 265
  3. Block syncer checks: drift (9) < blocksToFetch (256), no catch-up
  4. Block syncer finishes Sync, returns nil
  5. State syncer still running...
  6. Another block arrives, UpdateTarget(268)
  7. State syncer finishes
  8. Coordinator sets commitTarget = 268
  9. AcceptSync -> ResetToStateSyncedBlock(block 268) -> needs block 268 in DB

At step 9, the block syncer is long gone. We could add a targeted fetch of the latest target block before the syncer returns (between steps 3 and 4), but that only covers the target at the time the syncer finishes, not later pivots like step 6. The commit target can advance after the block syncer finishes from late block injections while the state syncer is still running.

ResetToStateSyncedBlock is the one place that has the correct block object and runs at the exact moment the block is needed. The write is two lines (WriteBlock + WriteCanonicalHash) and is idempotent if the block syncer did fetch it. I'd rather have a reliable two-line write in the right place at the right time than a more complex block syncer change that still needs a fallback for the timing gap. I hope this makes it clear. I am open to other suggestions, but at the time being I don't see a better practical solution.
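
For reference, a sketch of that two-line write under assumed libevm import paths (rawdb.WriteBlock and rawdb.WriteCanonicalHash are the standard go-ethereum-style accessors):

```go
package core

import (
	"github.com/ava-labs/libevm/core/rawdb"
	"github.com/ava-labs/libevm/core/types"
	"github.com/ava-labs/libevm/ethdb"
)

// writeSyncTargetBlock persists the commit-target block and its canonical
// mapping so ResetToStateSyncedBlock can proceed even when the block syncer
// never fetched this height. Re-writing data the block syncer already
// stored is harmless, which makes the fallback idempotent.
func writeSyncTargetBlock(db ethdb.KeyValueWriter, block *types.Block) {
	rawdb.WriteBlock(db, block)
	rawdb.WriteCanonicalHash(db, block.Hash(), block.NumberU64())
}
```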


// DrainAcceptorQueue blocks until all pending accepted blocks have been
// fully processed by the async acceptor.
DrainAcceptorQueue()
Contributor


I'm not sure this can be exposed in its current form. The behavior of calling a sync.WaitGroup.Wait multiple times or concurrently with Add is super suspicious.

// PivotSession represents one sync session inside a DynamicSyncer. When the
// target changes, the current session is cancelled and Rebuild creates a
// fresh session for the new target.
type PivotSession interface {
Contributor


This type doesn't make any sense to me. Is this just for your janky implementations, and will eventually be completely removed for real ones?

Contributor Author


I'm sorry, but I don't agree at all with this. I think PivotSession is a real abstraction, not a temporary hack. It exists because we have two concrete implementations today (HashDB and atomic) that share the same session-restart loop, but differ in what happens between sessions:

  • HashDB: pivots the code queue, wipes snapshot, rebuilds the inner syncer
  • Atomic: commits progress, resets the trie to last committed, rebuilds the inner syncer

Without the interface, the DynamicSyncer would need to know about code queues, snapshots, and atomic tries. PivotSession keeps the loop generic and the session-specific cleanup where it belongs.

Firewood doesn't need this since merklesync handles target updates internally. But firewood doesn't go through DynamicSyncer at all. It has its own UpdateTarget path. PivotSession only applies to syncers that use the session-restart model.

Even if we move the EVM state syncer to change proofs via firewood, the atomic syncer will still need the session-restart model because the atomic trie is a separate data structure with its own sync protocol. PivotSession would still serve that use case. Also, the interface costs four methods and saves duplicating the loop, the mutex, the target tracking, and the cancel coordination in two places. If it ends up with only one consumer long-term, we can inline it, simplify it, or change it to fit those needs at that time. For the time being I think this is the solid option to go with.
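
To make the shape concrete, here is one hypothetical four-method reading of PivotSession inferred from this thread; the PR's actual signatures may differ:

```go
package sync

import "context"

// PivotSession, hypothetical shape. DynamicSyncer owns the restart loop,
// mutex, target tracking, and cancellation; everything backend-specific
// (code queues, snapshots, atomic tries) hides behind the implementations.
type PivotSession interface {
	// Sync runs the inner syncer until completion or cancellation.
	Sync(ctx context.Context) error
	// Target reports the height this session is syncing toward.
	Target() uint64
	// Cancel aborts the running session so a pivot can begin.
	Cancel()
	// Rebuild performs session-specific cleanup (pivot the code queue and
	// wipe the snapshot for HashDB; commit progress and reset the trie for
	// atomic) and returns a fresh session for the new target.
	Rebuild(ctx context.Context, newTarget uint64) (PivotSession, error)
}
```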

Comment thread graft/evm/sync/types/types.go Outdated
// This is defined here to avoid circular dependencies with the leaf package.
// TargetReporter reports the height a syncer is working toward. The
// coordinator uses the minimum across all reporters to preserve queued
// blocks that slower syncers still need. Syncers that do not implement
Contributor


This is an important edge case to acknowledge, but I don't think that requiring syncers to track when they should pivot makes sense. If any of the syncers are "too slow", then the sync process CAN'T finish - it's impossible, no way to get around it. This adds a lot of unneeded complexity.

…e tip

After batch replay, the blockchain's lastAccepted advances but
chain.State still reports the commit-target height from AcceptSync.
This causes the engine to see a stale tip during bootstrapping.

- Propagate the blockchain's actual last accepted block into chain.State
  after draining the acceptor queue, so both layers agree on the height.

Signed-off-by: Tsvetan Dimitrov (tsvetan.dimitrov@avalabs.org)
…ether

With the atomic syncer using the session-restart model, all dynamic
syncers pivot to the same target. MinTargetHeight always equals the
commit target, making the TargetReporter interface and the minimum-
height prune check dead code.

- Remove TargetReporter, MinTargetHeight, and the associated coordinator
  logic. The block queue prune now uses the pivot target height directly.
…astructure

The pivot session now owns the code queue and code syncer directly,
creating fresh instances per session. This eliminates the need for
SessionedQueue, session.Manager, and the syncFromEvents event-driven
code syncer, which existed solely to keep the code syncer alive across
pivots.

With the restart approach, code already fetched survives in the DB
(content-addressed), unfetched markers are recovered automatically by
the new queue, and the code syncer exits cleanly via context cancellation.

Signed-off-by: Tsvetan Dimitrov (tsvetan.dimitrov@avalabs.org)
- Refresh rpcchainvm client-side chain.State cache on StateSyncDone to
  prevent nonce mismatch crash when verifying pre-sync blocks against
  post-sync state.
- Expose StateSyncTargetHeight() from proposervm so the bootstrapper
  fetches only blocks above the sync target, allowing pivot triggers
  to fire while state sync is still running.
- The bootstrapper's sequential ancestor walk can't converge on a live
  chain.
- Transition directly to NormalOp where the consensus engine
  fetches missing blocks from peers and the dynamic executor defers
  them for pivot triggers.
- Skip proposervm lastAccepted rollback so the consensus engine starts
  at the sync target height.
- Block fetching walked ancestors toward genesis instead of stopping
  at the sync target because the lastAccepted height was stale at 0.
- Use the sync target as the floor for ancestor discovery.
- Transitioning to NormalOp while state sync is running cancels the
  sync context and crashes with "invalid new chain".
- The bootstrapper's execute loop already drives pivots through the
  dynamic executor without needing the NormalOp transition.
- Skip re-syncing unchanged storage tries on pivot by checking if their
  trie nodes already exist in the local DB (presence check sketched below).
- Preserve segment markers across root changes so partially synced tries
  can resume.
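
A sketch of the presence check the first bullet describes, assuming a hash-addressed trie scheme and libevm-style import paths:

```go
package evmstate

import (
	"github.com/ava-labs/libevm/common"
	"github.com/ava-labs/libevm/core/rawdb"
	"github.com/ava-labs/libevm/ethdb"
)

// canSkipStorageTrie reports whether a storage trie can be skipped on pivot.
// In a hash-addressed scheme trie nodes are keyed by content hash, so if the
// root node already exists locally the whole trie is identical to one we
// already hold and re-syncing it would be wasted work.
func canSkipStorageTrie(db ethdb.KeyValueReader, root common.Hash) bool {
	return rawdb.HasLegacyTrieNode(db, root)
}
```
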

Labels

DO NOT MERGE: This PR must not be merged in its current state
