perf(p2p): Implement adaptive peer timeouts for snap sync #6117

Draft
pablodeymo wants to merge 26 commits into main from refactor/adaptive-peer-timeouts

@pablodeymo
Contributor

Summary

This PR implements adaptive timeouts for peer requests in snap sync. Instead of using a fixed 15-second timeout for all peers, we track each peer's response latency and set timeouts dynamically based on their historical performance.

Related Roadmap Item: 1.7 Peer Connection Optimization

Motivation

Currently, all peer requests use a fixed PEER_REPLY_TIMEOUT = 15 seconds. This causes issues:

  • Fast peers are underutilized: even when a peer normally responds in 50ms, a single stalled request to it is not abandoned until the full 15s has elapsed
  • Slow peer detection is delayed: it takes 15s to detect an unresponsive peer
  • Network conditions are ignored: peers have different latencies depending on geographic location, load, etc.

Proposed Solution

Track each peer's response latency using an Exponential Moving Average (EMA) and set timeouts to 3x their average response time.
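As a rough sketch of the arithmetic, the update and the resulting timeout can be written as follows. The seeding rule (the first sample initializes the average) is an assumption, since the description does not say how the EMA starts; the bounds and sample threshold are the ones listed under Key Components.

```latex
% \ell_n : latency of the n-th response, \bar{\ell}_n : running average
\bar{\ell}_1 = \ell_1, \qquad
\bar{\ell}_n = \alpha\,\ell_n + (1-\alpha)\,\bar{\ell}_{n-1}, \qquad \alpha = 0.2

% Timeout after n samples
T_n =
\begin{cases}
  \min\bigl(\max(3\,\bar{\ell}_n,\ 2\,\mathrm{s}),\ 30\,\mathrm{s}\bigr) & n \ge 3 \\
  15\,\mathrm{s} & n < 3
\end{cases}
```

For example, if the running average is 200ms and the next response takes 1s, the new average is 0.2 × 1000ms + 0.8 × 200ms = 360ms. Three times that is 1.08s, which is below the 2s minimum, so the timeout would be 2s.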

Key Components

  1. LatencyTracker struct - EMA calculation with α=0.2

    • Needs 3+ samples before using adaptive timeout
    • Falls back to 15s default otherwise
  2. Timeout Bounds:

    • Minimum: 2 seconds (don't time out too aggressively)
    • Maximum: 30 seconds (don't wait forever)
    • Multiplier: 3x average latency
  3. Changes Required:

    • Add latency: LatencyTracker to PeerData
    • Add record_latency() and get_peer_timeout() to PeerTable
    • Update make_request() to measure and record latency
    • Update ~10 call sites to use adaptive timeouts
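The components above could look roughly like the following sketch. It is not the code from this PR: the field layout, the `u64` peer id, the `PeerLatencies` stand-in for `PeerTable`, the `timed` helper, and the choice to seed the EMA with the first sample are all assumptions made for illustration. Only the constants (α=0.2, 3 samples, 3x multiplier, 2s/15s/30s) come from the description.

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

// Values taken from the PR description.
const EMA_ALPHA: f64 = 0.2;
const MIN_SAMPLES: u32 = 3;
const TIMEOUT_MULTIPLIER: f64 = 3.0;
const DEFAULT_TIMEOUT: Duration = Duration::from_secs(15);
const MIN_TIMEOUT: Duration = Duration::from_secs(2);
const MAX_TIMEOUT: Duration = Duration::from_secs(30);

/// Hypothetical layout; the real LatencyTracker may differ.
#[derive(Debug, Default, Clone)]
pub struct LatencyTracker {
    ema_secs: Option<f64>,
    samples: u32,
}

impl LatencyTracker {
    /// Fold one observed response latency into the moving average.
    pub fn record(&mut self, latency: Duration) {
        let x = latency.as_secs_f64();
        // Assumption: the first sample seeds the average.
        self.ema_secs = Some(match self.ema_secs {
            Some(prev) => EMA_ALPHA * x + (1.0 - EMA_ALPHA) * prev,
            None => x,
        });
        self.samples = self.samples.saturating_add(1);
    }

    /// 3x the average, clamped to [2s, 30s]; 15s until 3 samples exist.
    pub fn timeout(&self) -> Duration {
        match self.ema_secs {
            Some(ema) if self.samples >= MIN_SAMPLES => {
                // Clamp in f64 first so the conversion cannot overflow.
                let secs = (ema * TIMEOUT_MULTIPLIER)
                    .clamp(MIN_TIMEOUT.as_secs_f64(), MAX_TIMEOUT.as_secs_f64());
                Duration::from_secs_f64(secs)
            }
            _ => DEFAULT_TIMEOUT,
        }
    }
}

/// Stand-in for PeerTable, keyed by a hypothetical u64 peer id.
#[derive(Default)]
pub struct PeerLatencies {
    peers: HashMap<u64, LatencyTracker>,
}

impl PeerLatencies {
    pub fn record_latency(&mut self, peer: u64, latency: Duration) {
        self.peers.entry(peer).or_default().record(latency);
    }

    /// Unknown peers get the default timeout.
    pub fn get_peer_timeout(&self, peer: u64) -> Duration {
        self.peers
            .get(&peer)
            .map_or(DEFAULT_TIMEOUT, LatencyTracker::timeout)
    }
}

/// Run `f` and return its result together with the elapsed wall time.
/// A synchronous simplification of what make_request() would measure.
pub fn timed<T>(f: impl FnOnce() -> T) -> (T, Duration) {
    let start = Instant::now();
    let out = f();
    (out, start.elapsed())
}
```

A call site would then ask `get_peer_timeout(peer)` before sending, wrap the request in something like `timed`, and pass the measured duration to `record_latency(peer, elapsed)` on success. Whether timed-out requests should also feed the average is a design question this sketch leaves open.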

Example Impact

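| Peer Type | Avg Latency | Current Timeout | 3x Avg Latency | Adaptive Timeout |
|---|---|---|---|---|
| Fast peer | 50ms | 15s | 150ms | 2s (raised to minimum) |
| Normal peer | 200ms | 15s | 600ms | 2s (raised to minimum) |
| Slow peer | 500ms | 15s | 1.5s | 2s (raised to minimum) |
| Very slow peer | 5s | 15s | 15s | 15s (within bounds) |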

Expected Impact: 20-30% improvement in peer utilization, faster detection of slow/dead peers.

Implementation Plan

See docs/plans/adaptive_peer_timeouts.md for the detailed implementation plan including:

  • Code examples for each component
  • Unit test strategy
  • Metrics to add
  • Rollout strategy
  • Risk assessment

Files to Modify

| File | Changes |
|---|---|
| discv4/peer_table.rs | Add LatencyTracker, update PeerData, add methods |
| snap/constants.rs | Add timeout constants |
| peer_handler.rs | Update make_request(), update call sites |
| snap/client.rs | Update call sites (~8 locations) |

Testing Plan

  • Unit tests for LatencyTracker EMA calculation
  • Unit tests for timeout bounds (min/max clamping)
  • Integration tests with mock peers of varying latencies
  • Manual testing on Sepolia/Holesky
  • Mainnet validation

Checklist

  • Implementation plan created
  • LatencyTracker struct implemented
  • PeerData updated with latency tracking
  • PeerTable methods added
  • make_request() updated to record latency
  • Call sites updated to use adaptive timeouts
  • Constants added
  • Unit tests added
  • Integration tests added
  • Metrics added (optional)
  • Documentation updated
  • CHANGELOG entry added

…tory

Reorganize state_healing.rs and storage_healing.rs into a shared
sync/healing/ module structure with clearer naming conventions:

- Create sync/healing/ directory with mod.rs, types.rs, state.rs, storage.rs
- Rename MembatchEntryValue to HealingQueueEntry
- Rename MembatchEntry to StorageHealingQueueEntry
- Rename Membatch type to StorageHealingQueue
- Rename children_not_in_storage_count to missing_children_count
- Rename membatch variables to healing_queue throughout
- Extract shared HealingQueueEntry and StateHealingQueue types to types.rs
- Update sync.rs imports to use new healing module

Reorganize snap protocol code for better maintainability:

- Split rlpx/snap.rs into rlpx/snap/ directory:
  - codec.rs: RLP encoding/decoding for snap messages
  - messages.rs: Snap protocol message types
  - mod.rs: Module re-exports

- Split snap.rs into snap/ directory:
  - constants.rs: Snap sync constants and configuration
  - server.rs: Snap protocol server implementation
  - mod.rs: Module re-exports

- Move snap server tests to dedicated tests/ directory
- Update imports in p2p.rs, peer_handler.rs, and code_collector.rs

Document the phased approach for reorganizing snap sync code:
- Phase 1: rlpx/snap module split
- Phase 2: snap module split with server extraction
- Phase 3: healing module unification

Split the large sync.rs (1631 lines) into focused modules:

- sync/full.rs (~260 lines): Full sync implementation
  - sync_cycle_full(), add_blocks_in_batch(), add_blocks()

- sync/snap_sync.rs (~1100 lines): Snap sync implementation
  - sync_cycle_snap(), snap_sync(), SnapBlockSyncState
  - store_block_bodies(), update_pivot(), block_is_stale()
  - validate_state_root(), validate_storage_root(), validate_bytecodes()
  - insert_accounts(), insert_storages() (both rocksdb and non-rocksdb)

- sync.rs (~285 lines): Orchestration layer
  - Syncer struct with start_sync() and sync_cycle()
  - SyncMode, SyncError, AccountStorageRoots types
  - Re-exports for public API

…p/client.rs

Move all snap protocol client-side request methods from peer_handler.rs
to a dedicated snap/client.rs module:
- request_account_range and request_account_range_worker
- request_bytecodes
- request_storage_ranges and request_storage_ranges_worker
- request_state_trienodes
- request_storage_trienodes

Also moves related types: DumpError, RequestMetadata, SnapClientError,
RequestStateTrieNodesError, RequestStorageTrieNodes.

This reduces peer_handler.rs from 2,060 to 670 lines (~68% reduction),
leaving it focused on ETH protocol methods (block headers/bodies).

Added SnapClientError variant to SyncError for proper error handling.
Updated plan_snap_sync.md to mark Phase 4 as complete.

…napError type

Implement Phase 5 of snap sync refactoring plan - Error Handling.

- Create snap/error.rs with unified SnapError enum covering all snap protocol errors
- Update server functions (process_account_range_request, process_storage_ranges_request,
  process_byte_codes_request, process_trie_nodes_request) to return Result<T, SnapError>
- Remove SnapClientError and RequestStateTrieNodesError, consolidate into SnapError
- Keep RequestStorageTrieNodesError struct for request ID tracking in storage healing
- Add From<SnapError> for PeerConnectionError to support error propagation in message handlers
- Update sync module to use SyncError::Snap variant
- Update healing modules (state.rs, storage.rs) to use new error types
- Move DumpError struct to error.rs module
- Update test return types to use SnapError
- Mark Phase 5 as completed in plan document

All phases of the snap sync refactoring are now complete.

Change missing_children_count from u64 to usize in HealingQueueEntry
and node_missing_children function to match StorageHealingQueueEntry
and be consistent with memory structure counting conventions.

Resolve conflicts from #5977 and #6018 merge to main:
- Keep modular sync structure (sync.rs delegates to full.rs and snap_sync.rs)
- Keep snap client code in snap/client.rs (removed from peer_handler.rs)
- Add InsertingAccountRanges metric from #6018 to snap_sync.rs
- Remove unused info import from peer_handler.rs
@github-actions github-actions Bot added the performance label (Block execution throughput and performance in general) Feb 3, 2026
@pablodeymo pablodeymo changed the base branch from main to refactor/snapsync-healing-unification February 3, 2026 20:54

github-actions Bot commented Feb 3, 2026

Lines of code report

Total lines added: 4055
Total lines removed: 2327
Total lines changed: 6382

Detailed view
+------------------------------------------------------+-------+-------+
| File                                                 | Lines | Diff  |
+------------------------------------------------------+-------+-------+
| ethrex/crates/networking/p2p/peer_handler.rs         | 546   | -1187 |
+------------------------------------------------------+-------+-------+
| ethrex/crates/networking/p2p/rlpx/error.rs           | 131   | +11   |
+------------------------------------------------------+-------+-------+
| ethrex/crates/networking/p2p/rlpx/snap/codec.rs      | 284   | +284  |
+------------------------------------------------------+-------+-------+
| ethrex/crates/networking/p2p/rlpx/snap/messages.rs   | 64    | +64   |
+------------------------------------------------------+-------+-------+
| ethrex/crates/networking/p2p/rlpx/snap/mod.rs        | 7     | +7    |
+------------------------------------------------------+-------+-------+
| ethrex/crates/networking/p2p/snap/client.rs          | 1195  | +1195 |
+------------------------------------------------------+-------+-------+
| ethrex/crates/networking/p2p/snap/constants.rs       | 22    | +22   |
+------------------------------------------------------+-------+-------+
| ethrex/crates/networking/p2p/snap/error.rs           | 104   | +104  |
+------------------------------------------------------+-------+-------+
| ethrex/crates/networking/p2p/snap/mod.rs             | 11    | +11   |
+------------------------------------------------------+-------+-------+
| ethrex/crates/networking/p2p/snap/server.rs          | 153   | +153  |
+------------------------------------------------------+-------+-------+
| ethrex/crates/networking/p2p/sync.rs                 | 241   | -1140 |
+------------------------------------------------------+-------+-------+
| ethrex/crates/networking/p2p/sync/full.rs            | 243   | +243  |
+------------------------------------------------------+-------+-------+
| ethrex/crates/networking/p2p/sync/healing/mod.rs     | 7     | +7    |
+------------------------------------------------------+-------+-------+
| ethrex/crates/networking/p2p/sync/healing/state.rs   | 386   | +386  |
+------------------------------------------------------+-------+-------+
| ethrex/crates/networking/p2p/sync/healing/storage.rs | 615   | +615  |
+------------------------------------------------------+-------+-------+
| ethrex/crates/networking/p2p/sync/healing/types.rs   | 9     | +9    |
+------------------------------------------------------+-------+-------+
| ethrex/crates/networking/p2p/sync/snap_sync.rs       | 944   | +944  |
+------------------------------------------------------+-------+-------+

@pablodeymo pablodeymo added the syncing (Snap sync, Full sync) and snapsync labels and removed the syncing label Feb 4, 2026
@pablodeymo pablodeymo mentioned this pull request Feb 5, 2026
1 task
Base automatically changed from refactor/snapsync-healing-unification to main February 6, 2026 21:52

github-actions Bot commented Feb 9, 2026

Benchmark Block Execution Results Comparison Against Main

| Command | Mean [s] | Min [s] | Max [s] | Relative |
|---|---|---|---|---|
| base | 65.891 ± 0.155 | 65.668 | 66.151 | 1.00 |
| head | 66.124 ± 0.326 | 65.578 | 66.589 | 1.00 ± 0.01 |
