fix: shard channel saturation causing upload degradation at 1000-node scale #80
Conversation
After ~18-20 hours of running a 1000-node testnet with continuous uploads, identity exchange timeouts accumulate and upload times degrade from ~3 min to 13+ min. Root cause: the per-shard inbound message channels (32 slots) fill up because the shard consumer holds write locks across async calls, blocking all 8 consumers. Messages — including identity announces — are dropped, causing 15-second timeouts on every new connection.

**Fix 1: Increase channel capacity**
- `MESSAGE_RECV_CHANNEL_CAPACITY`: 256 → 2048 (256 per shard vs 32)
- `MIN_SHARD_CHANNEL_CAPACITY`: 16 → 128

**Fix 2: Don't hold write locks across `.await`**
- Scope the `peer_to_channel` write lock to a block that drops immediately after extracting `is_new_peer` and `inserted`
- The async `register_connection_peer_id()` call now runs without holding any write locks, eliminating head-of-line blocking across all shards
- Trades a microsecond consistency window for dramatically better throughput under sustained load

**Evidence from testnet rc-17 (1000 nodes, 20+ hours):**
- Nodes showed 6,000-18,000+ dropped messages (`total_drops` counter)
- Identity exchange timeouts: 119 in a single upload attempt
- Upload degradation: 229s → 527s over 9 hours (hourly averages)
- CPU was not saturated (20-39%), confirming lock contention, not compute
- Bandwidth dropped 40-70% as nodes processed fewer messages

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
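The mechanism behind Fix 2 can be illustrated outside the crate itself. In the sketch below, std threads and a 50 ms sleep stand in for the tokio shard consumers and the async `register_connection_peer_id()` call; all names and numbers are illustrative, not the crate's real code. Holding the write lock across the slow call serializes every consumer sharing the lock, while scoping the lock lets the slow work overlap:

```rust
use std::sync::{Arc, RwLock};
use std::thread;
use std::time::{Duration, Instant};

// Illustrative sketch: 4 "shard consumers" each update a shared map and
// then do 50 ms of slow work (the stand-in for an async registration).
fn run_consumers(scope_the_lock: bool) -> Duration {
    let peer_to_channel = Arc::new(RwLock::new(Vec::<u64>::new()));
    let start = Instant::now();
    let handles: Vec<_> = (0..4u64)
        .map(|shard| {
            let map = Arc::clone(&peer_to_channel);
            thread::spawn(move || {
                if scope_the_lock {
                    // Fixed pattern: lock scoped to the map update only.
                    {
                        map.write().unwrap().push(shard);
                    } // write lock dropped here
                    thread::sleep(Duration::from_millis(50)); // slow work, no lock
                } else {
                    // Bad pattern: write lock held across the slow work,
                    // serializing every consumer that shares the lock.
                    let mut guard = map.write().unwrap();
                    guard.push(shard);
                    thread::sleep(Duration::from_millis(50));
                }
            })
        })
        .collect();
    for h in handles {
        h.join().unwrap();
    }
    start.elapsed()
}

fn main() {
    let held = run_consumers(false); // ~4 x 50 ms: fully serialized
    let scoped = run_consumers(true); // ~50 ms: the sleeps overlap
    println!("held-across: {:?}, scoped: {:?}", held, scoped);
    assert!(scoped < held);
}
```

The same head-of-line effect is what starved all 8 shard consumers in the testnet: one shard's registration stalled every other shard's map access.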
```diff
@@ -1743,17 +1747,12 @@ impl TransportHandle {
             .insert(*app_id);
     }
```
**Consistency window broader than described — stale peer state, not just omission**

The PR description characterises the window as "hole-punch lookups may briefly miss a peer", which is the *omission* direction (transport unaware of new peer). There is also a *commission* direction: `peer_to_channel` and `channel_to_peers` are now updated in two separate lock acquisitions, and a concurrent `remove_channel_mappings_static` call can race between them.

Concrete failure path: (1) Shard acquires `p2c.write()`, inserts `peer_X → {channel_X}`, drops the lock. (2) Tokio schedules the disconnect handler; `remove_channel_mappings_static` acquires `p2c.write()` (sees the entry) then `c2p.write()` — but `c2p` has no `channel_X` entry yet, so `c2p.remove(channel_id)` returns `None` and cleanup exits **without** removing `peer_X` from `p2c`. (3) The shard resumes and inserts the now-stale `channel_X → peer_X` entry into `c2p`.

Both maps permanently show the peer as connected until restart or reconnect. `is_peer_connected()` / `connected_peers()` / `peer_count()` return incorrect results, and a spurious `PeerConnected` event is emitted. The race window is extremely narrow and the performance trade-off is well-justified. A future hardening option is to update both maps in one scoped block:
```rust
let (is_new_peer, inserted) = {
    let mut p2c = peer_to_channel.write().await;
    let mut c2p = channel_to_peers.write().await; // hold both briefly
    let is_new = !p2c.contains_key(app_id);
    let channels = p2c.entry(*app_id).or_default();
    let ins = channels.insert(channel_id.clone());
    if ins {
        c2p.entry(channel_id.clone()).or_default().insert(*app_id);
    }
    (is_new, ins)
}; // both locks dropped; register_connection_peer_id runs without any lock
```
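One design point worth noting about the hardening snippet above: it acquires `peer_to_channel` before `channel_to_peers`, the same order the failure path describes for `remove_channel_mappings_static`. Keeping that order identical on every path is what makes the two-lock block deadlock-free and keeps the maps mutually consistent. A self-contained sketch of the invariant (illustrative names and std types, not the crate's real code):

```rust
use std::collections::{HashMap, HashSet};
use std::sync::{Arc, RwLock};
use std::thread;

type PeerId = u64;
type ChannelId = u64;
type P2c = RwLock<HashMap<PeerId, HashSet<ChannelId>>>;
type C2p = RwLock<HashMap<ChannelId, HashSet<PeerId>>>;

fn insert_mapping(p2c: &P2c, c2p: &C2p, peer: PeerId, channel: ChannelId) {
    let mut p = p2c.write().unwrap(); // lock order: p2c first...
    let mut c = c2p.write().unwrap(); // ...then c2p
    p.entry(peer).or_default().insert(channel);
    c.entry(channel).or_default().insert(peer);
}

fn remove_mapping(p2c: &P2c, c2p: &C2p, peer: PeerId, channel: ChannelId) {
    let mut p = p2c.write().unwrap(); // same order as insert_mapping: p2c first
    let mut c = c2p.write().unwrap();
    let peer_empty = match p.get_mut(&peer) {
        Some(chs) => {
            chs.remove(&channel);
            chs.is_empty()
        }
        None => false,
    };
    if peer_empty {
        p.remove(&peer);
    }
    let chan_empty = match c.get_mut(&channel) {
        Some(peers) => {
            peers.remove(&peer);
            peers.is_empty()
        }
        None => false,
    };
    if chan_empty {
        c.remove(&channel);
    }
}

fn main() {
    let p2c: Arc<P2c> = Arc::new(RwLock::new(HashMap::new()));
    let c2p: Arc<C2p> = Arc::new(RwLock::new(HashMap::new()));
    let mut handles = Vec::new();
    for peer in 0..8u64 {
        let (p, c) = (Arc::clone(&p2c), Arc::clone(&c2p));
        handles.push(thread::spawn(move || {
            insert_mapping(&p, &c, peer, peer + 100);
            remove_mapping(&p, &c, peer, peer + 100);
        }));
    }
    for h in handles {
        h.join().unwrap();
    }
    // Both maps are updated under the same two-lock critical section,
    // so they can never disagree: after all churn, both are empty.
    assert!(p2c.read().unwrap().is_empty());
    assert!(c2p.read().unwrap().is_empty());
}
```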
(Review comment on `src/transport_handle.rs`, lines 1733-1748.)
Summary
After ~18-20 hours of running a 1000-node testnet with continuous uploads, identity exchange
timeouts accumulate and upload times degrade from ~3 min to 13+ min. This PR addresses two
root causes in the sharded inbound message dispatcher introduced in #70.
Root Cause
1. **Per-shard channel capacity too small** — `MESSAGE_RECV_CHANNEL_CAPACITY / 8 = 32` slots per shard. Under sustained 1000-node load, shard consumers can't drain fast enough and the dispatcher drops messages via `try_send`, including identity announces.
2. **Write lock held across `.await`** — `run_shard_consumer` holds `peer_to_channel.write()` across the async `register_connection_peer_id()` call. All 8 shard consumers contend on the same global `RwLock`, so one shard's async registration blocks all others, causing head-of-line blocking.

Changes
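The `try_send` drop path from the first root cause can be sketched as follows. The constant names and the warn-every-64th-drop cadence follow the PR; the shard-selection hash, message type, and use of a std `sync_channel` are stand-ins, not the crate's real implementation:

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};
use std::sync::mpsc::{sync_channel, SyncSender, TrySendError};

const NUM_SHARDS: usize = 8;
const MESSAGE_RECV_CHANNEL_CAPACITY: usize = 2048; // was 256 before the fix

// Hash the sender address to pick a shard (illustrative hash choice).
fn shard_for(addr: &str) -> usize {
    let mut h = DefaultHasher::new();
    addr.hash(&mut h);
    (h.finish() as usize) % NUM_SHARDS
}

/// Try to enqueue into a shard channel; count the drop if it is full.
fn dispatch(tx: &SyncSender<Vec<u8>>, msg: Vec<u8>, total_drops: &mut u64) {
    match tx.try_send(msg) {
        Ok(()) => {}
        Err(TrySendError::Full(_)) => {
            *total_drops += 1;
            if *total_drops % 64 == 1 {
                eprintln!("shard channel full, total_drops={total_drops}");
            }
        }
        Err(TrySendError::Disconnected(_)) => {}
    }
}

fn main() {
    let per_shard = MESSAGE_RECV_CHANNEL_CAPACITY / NUM_SHARDS; // 256 now, 32 before
    let (tx, _rx) = sync_channel::<Vec<u8>>(per_shard);
    let mut total_drops = 0u64;
    // With no consumer draining, everything past `per_shard` is dropped.
    for i in 0..(per_shard + 10) {
        dispatch(&tx, vec![(i % 256) as u8], &mut total_drops);
    }
    assert!(shard_for("203.0.113.7:9000") < NUM_SHARDS);
    assert_eq!(total_drops, 10);
}
```

At the old 32-slot budget, a consumer stalled behind another shard's lock only had to fall 32 messages behind before identity announces started being silently dropped.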
Fix 1: Increase channel capacity
- `MESSAGE_RECV_CHANNEL_CAPACITY`: 256 → 2048 (256 per shard vs 32)
- `MIN_SHARD_CHANNEL_CAPACITY`: 16 → 128

Fix 2: Don't hold write locks across `.await`
- Scope the `peer_to_channel` write lock to a block that drops immediately after extracting `is_new_peer` and `inserted`
- `register_connection_peer_id()` now runs without holding any write locks
- Trades a microsecond consistency window for dramatically better throughput under sustained load
Evidence from testnet `ant-rc-17` (1000 nodes, 33+ hours)
- Nodes showed 6,000-18,000+ dropped messages (`total_drops` counter in "shard channel full" warnings)

Timeline
Test plan
- `cargo check` — clean build
- `cargo test --lib` — 282 passed, 0 failed

🤖 Generated with Claude Code
Greptile Summary
This PR fixes two root causes of shard channel saturation at 1000-node scale: it raises `MESSAGE_RECV_CHANNEL_CAPACITY` from 256 → 2048 (giving 256 slots per shard vs. 32), and scopes the `peer_to_channel` write lock to a tight block so it is no longer held across the async `register_connection_peer_id()` call, eliminating the head-of-line blocking that was starving all 8 shard consumers. Both changes are well-motivated by the testnet evidence provided.

Confidence Score: 4/5
Safe to merge with awareness — the core fix is well-motivated and the only remaining concern is a narrow consistency race that the PR itself acknowledges.
Both changes are directly validated by testnet evidence. The one new concern (stale p2c/c2p state when a disconnect races the split lock acquisitions) requires extremely precise scheduling and results in degraded reliability rather than data loss; the PR explicitly traded this window for dramatically better throughput. The stale comment is cosmetic. No P0/P1 issues found.
`src/transport_handle.rs` — the split p2c/c2p lock acquisition in `run_shard_consumer` and the now-inaccurate memory comment in `start_message_receiving_system`.
Important Files Changed
Sequence Diagram
```mermaid
sequenceDiagram
    participant T as Transport
    participant D as Dispatcher Task
    participant S0 as Shard 0 Consumer
    participant S1 as Shard 1 Consumer
    participant State as Shared State
    T->>D: message (addr, bytes)
    D->>D: hash addr to shard index
    alt shard channel has capacity
        D->>S0: try_send succeeds
    else shard channel full
        D->>D: increment drop_counter, warn every 64th drop
    end
    T->>D: message (addr2, bytes2)
    D->>S1: try_send succeeds
    par Shard consumers run in parallel
        S0->>State: p2c.write() insert peer-channel, drop lock immediately
        S0->>State: c2p.write() insert channel-peer, drop lock
        S0->>T: register_connection_peer_id() with no lock held
        S0->>State: peer_user_agents.write() if new peer, broadcast PeerConnected
    and
        S1->>State: p2c.write() insert peer-channel, drop lock immediately
        S1->>State: c2p.write() insert channel-peer, drop lock
        S1->>T: register_connection_peer_id() with no lock held
    end
```

Comments Outside Diff (1)
`src/transport_handle.rs`, lines 1584-1589 — The comment was accurate when `MESSAGE_RECV_CHANNEL_CAPACITY` was 256 (per_shard = 32, aggregate = 8 × 32 = 256 ≈ old single-channel capacity). After raising the constant to 2048, the aggregate shard capacity is 8 × 256 = 2048 — an 8× increase. The phrase "keeping memory usage comparable" no longer holds.
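The reviewer's arithmetic, checked mechanically. Only the values of `MESSAGE_RECV_CHANNEL_CAPACITY` come from the PR; the per-shard division by 8 is assumed from the description:

```rust
const NUM_SHARDS: usize = 8;
const OLD_TOTAL: usize = 256;  // MESSAGE_RECV_CHANNEL_CAPACITY before the fix
const NEW_TOTAL: usize = 2048; // MESSAGE_RECV_CHANNEL_CAPACITY after the fix

fn main() {
    let old_per_shard = OLD_TOTAL / NUM_SHARDS;
    let new_per_shard = NEW_TOTAL / NUM_SHARDS;
    assert_eq!(old_per_shard, 32);
    assert_eq!(new_per_shard, 256);
    // Old aggregate matched the former single channel: 8 * 32 = 256.
    assert_eq!(NUM_SHARDS * old_per_shard, OLD_TOTAL);
    // New aggregate is 8 * 256 = 2048 slots, an 8x increase, which is why
    // the "keeping memory usage comparable" comment no longer holds.
    assert_eq!(NUM_SHARDS * new_per_shard, 8 * OLD_TOTAL);
}
```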
Reviews (1): Last reviewed commit: "fix: shard channel saturation causing up..."