fix: shard channel saturation causing upload degradation at 1000-node scale#80

Merged
jacderida merged 1 commit intorc-2026.4.1from
fix/shard-channel-saturation-rc
Apr 13, 2026

Conversation

jacderida (Collaborator) commented Apr 13, 2026

Summary

After ~18-20 hours of running a 1000-node testnet with continuous uploads, identity exchange
timeouts accumulate and upload times degrade from ~3 min to 13+ min. This PR addresses two
root causes in the sharded inbound message dispatcher introduced in #70.

Root Cause

  1. Per-shard channel capacity too small: MESSAGE_RECV_CHANNEL_CAPACITY / 8 = 32 slots per
    shard. Under sustained 1000-node load, shard consumers can't drain fast enough and the
    dispatcher drops messages via try_send, including identity announces.

  2. Write lock held across .await: run_shard_consumer holds peer_to_channel.write() across
    the async register_connection_peer_id() call. All 8 shard consumers contend on the same
    global RwLock, so one shard's async registration blocks all the others, causing
    head-of-line blocking.
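Root cause 1 in miniature: the sketch below models stalled shard consumers with bounded channels that are never drained, using `std::sync::mpsc::sync_channel` as a stand-in for the async shard channels. The hash function, address format, and `simulate_drops` helper are illustrative, not the PR's actual code.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};
use std::sync::mpsc::{sync_channel, Receiver, SyncSender, TrySendError};

const SHARDS: usize = 8;
const PER_SHARD_CAPACITY: usize = 32; // the old, too-small per-shard budget

/// Hash a peer address to a shard index (illustrative, not the PR's hash).
fn shard_index(addr: &str) -> usize {
    let mut h = DefaultHasher::new();
    addr.hash(&mut h);
    (h.finish() as usize) % SHARDS
}

/// Feed `n` messages through stalled shard channels and count drops.
fn simulate_drops(n: usize) -> u64 {
    // One bounded channel per shard; consumers are deliberately absent,
    // modeling shard consumers that cannot drain fast enough.
    let shards: Vec<(SyncSender<String>, Receiver<String>)> =
        (0..SHARDS).map(|_| sync_channel(PER_SHARD_CAPACITY)).collect();
    let mut total_drops = 0u64;
    for i in 0..n {
        let addr = format!("10.0.{}.{}", i / 256, i % 256);
        let (tx, _rx) = &shards[shard_index(&addr)];
        if let Err(TrySendError::Full(_)) = tx.try_send(addr) {
            // In the real dispatcher a dropped message can be an identity
            // announce, which later surfaces as a timeout on the peer.
            total_drops += 1;
        }
    }
    total_drops
}

fn main() {
    // At most 8 * 32 = 256 messages can sit buffered; the rest are dropped.
    let drops = simulate_drops(10_000);
    assert!(drops >= 10_000 - 256);
    println!("total_drops = {drops}");
}
```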

Changes

Fix 1: Increase channel capacity

  • MESSAGE_RECV_CHANNEL_CAPACITY: 256 → 2048 (256 per shard vs 32)
  • MIN_SHARD_CHANNEL_CAPACITY: 16 → 128
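The arithmetic behind the new constants, as a compile-checked sketch. The shard count of 8 is inferred from the "/ 8" in the root-cause description, and the floor-via-`.max()` shape is an assumption about the surrounding code, not a quote of it:

```rust
const MESSAGE_DISPATCH_SHARDS: usize = 8;          // inferred from the "/ 8" above
const MESSAGE_RECV_CHANNEL_CAPACITY: usize = 2048; // raised from 256
const MIN_SHARD_CHANNEL_CAPACITY: usize = 128;     // raised from 16

/// Per-shard budget: global capacity split across shards, floored at the minimum.
fn per_shard_capacity(global: usize) -> usize {
    (global / MESSAGE_DISPATCH_SHARDS).max(MIN_SHARD_CHANNEL_CAPACITY)
}

fn main() {
    // New: 2048 / 8 = 256 slots per shard (was 256 / 8 = 32).
    assert_eq!(per_shard_capacity(MESSAGE_RECV_CHANNEL_CAPACITY), 256);
    // A small global capacity now floors at 128 instead of 16.
    assert_eq!(per_shard_capacity(256), 128);
}
```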

Fix 2: Don't hold write locks across .await

  • Scope the peer_to_channel write lock to a block that drops immediately after extracting
    is_new_peer and inserted
  • register_connection_peer_id() now runs without holding any write locks
  • Trades a microsecond consistency window (hole-punch lookups may briefly miss a peer) for
    dramatically better throughput under sustained load
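The scoping pattern in miniature: this sketch uses `std::sync::RwLock` and a synchronous stub (`record_channel`, `register_connection_peer_id_stub` are both illustrative names) in place of the real async lock and `register_connection_peer_id()`; the point is the guard's lifetime, which ends at the closing brace of the block.

```rust
use std::collections::{HashMap, HashSet};
use std::sync::RwLock;

/// Stand-in for the async register_connection_peer_id(); in the real code
/// this awaits, which is exactly why no lock may be held across it.
fn register_connection_peer_id_stub(_app_id: u64, _channel_id: &str) {}

fn record_channel(
    peer_to_channel: &RwLock<HashMap<u64, HashSet<String>>>,
    app_id: u64,
    channel_id: &str,
) -> (bool, bool) {
    // The write guard lives only inside this block and is dropped at the
    // closing brace -- before the slow registration call below.
    let (is_new_peer, inserted) = {
        let mut p2c = peer_to_channel.write().unwrap();
        let is_new = !p2c.contains_key(&app_id);
        let ins = p2c.entry(app_id).or_default().insert(channel_id.to_string());
        (is_new, ins)
    };
    // Runs with no lock held, so other shard consumers never queue behind it.
    register_connection_peer_id_stub(app_id, channel_id);
    (is_new_peer, inserted)
}

fn main() {
    let p2c = RwLock::new(HashMap::new());
    assert_eq!(record_channel(&p2c, 42, "chan-1"), (true, true));
    assert_eq!(record_channel(&p2c, 42, "chan-1"), (false, false));
}
```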

Evidence from testnet ant-rc-17 (1000 nodes, 33+ hours)

  • Shard drops: 6,000–18,000+ per node (total_drops counter in "shard channel full" warnings)
  • Identity timeouts: 119 in a single client upload attempt (peer discovery phase)
  • Upload degradation: hourly avg 229s → 527s over 9 hours, accelerating after hour 30
  • CPU not saturated: 20–39% utilization, confirming the bottleneck is lock contention, not compute
  • Bandwidth dropped 40–70%: nodes processed fewer messages as channels overflowed
  • All 997 node processes remained healthy — the issue is purely in message processing throughput

Timeline

| Hours Running | 50MB Upload Avg | Status |
|---|---|---|
| 11–17 | 230–270s | Stable |
| 18–22 | ~300s | Gradual drift |
| 28+ | 350s+ | Accelerating |
| 33 | 527s+ | Severely degraded |

Test plan

  • cargo check — clean build
  • cargo test --lib — 282 passed, 0 failed
  • Deploy to 1000-node testnet with increased client load (10 VMs) to verify fix

🤖 Generated with Claude Code

Greptile Summary

This PR fixes two root causes of shard channel saturation at 1000-node scale: it raises MESSAGE_RECV_CHANNEL_CAPACITY from 256 → 2048 (giving 256 slots per shard vs. 32), and scopes the peer_to_channel write lock to a tight block so it is no longer held across the async register_connection_peer_id() call, eliminating the head-of-line blocking that was starving all 8 shard consumers. Both changes are well-motivated by the testnet evidence provided.

Confidence Score: 4/5

Safe to merge with awareness — the core fix is well-motivated and the only remaining concern is a narrow consistency race that the PR itself acknowledges.

Both changes are directly validated by testnet evidence. The one new concern (stale p2c/c2p state when a disconnect races the split lock acquisitions) requires extremely precise scheduling and results in degraded reliability rather than data loss; the PR explicitly traded this window for dramatically better throughput. The stale comment is cosmetic. No P0/P1 issues found.

Key area to watch: src/transport_handle.rs — the split p2c/c2p lock acquisition in run_shard_consumer and the now-inaccurate memory comment in start_message_receiving_system.

Important Files Changed

| Filename | Overview |
|---|---|
| src/transport_handle.rs | Adds MIN_SHARD_CHANNEL_CAPACITY = 128 and scopes peer_to_channel.write() to a short block in run_shard_consumer so it is dropped before the async register_connection_peer_id() call; channel_to_peers is now updated in a separate lock acquisition, introducing a narrow window where a concurrent disconnect can leave both maps stale. |
| src/network.rs | Raises MESSAGE_RECV_CHANNEL_CAPACITY from 256 to 2048; straightforward constant change with clear production justification. |

Sequence Diagram

```mermaid
sequenceDiagram
    participant T as Transport
    participant D as Dispatcher Task
    participant S0 as Shard 0 Consumer
    participant S1 as Shard 1 Consumer
    participant State as Shared State

    T->>D: message (addr, bytes)
    D->>D: hash addr to shard index
    alt shard channel has capacity
        D->>S0: try_send succeeds
    else shard channel full
        D->>D: increment drop_counter, warn every 64th drop
    end
    T->>D: message (addr2, bytes2)
    D->>S1: try_send succeeds

    par Shard consumers run in parallel
        S0->>State: p2c.write() insert peer-channel, drop lock immediately
        S0->>State: c2p.write() insert channel-peer, drop lock
        S0->>T: register_connection_peer_id() with no lock held
        S0->>State: peer_user_agents.write() if new peer, broadcast PeerConnected
    and
        S1->>State: p2c.write() insert peer-channel, drop lock immediately
        S1->>State: c2p.write() insert channel-peer, drop lock
        S1->>T: register_connection_peer_id() with no lock held
    end
```

Comments Outside Diff (1)

  1. src/transport_handle.rs, lines 1584-1589

    P2 Stale comment — memory footprint is now 8× larger, not "comparable"

    The comment was accurate when MESSAGE_RECV_CHANNEL_CAPACITY was 256 (per_shard = 32, aggregate = 8 × 32 = 256 ≈ old single-channel capacity). After raising the constant to 2048, the aggregate shard capacity is 8 × 256 = 2048 — an 8× increase. The phrase "keeping memory usage comparable" no longer holds.

    Suggested replacement:

```suggestion
        // Per-shard capacity: divide the global channel budget across shards so
        // each consumer has enough headroom for bursts. Raised to 2048 total
        // (256 per shard) to eliminate the channel-full drops observed at
        // 1000-node scale. Floor at `MIN_SHARD_CHANNEL_CAPACITY` for small
        // global-capacity configurations.
        let per_shard_capacity = (MESSAGE_RECV_CHANNEL_CAPACITY / MESSAGE_DISPATCH_SHARDS)
            .max(MIN_SHARD_CHANNEL_CAPACITY);
```

Reviews (1): Last reviewed commit: "fix: shard channel saturation causing up..."

Greptile also left 1 inline comment on this PR.

… scale

After ~18-20 hours of running a 1000-node testnet with continuous uploads,
identity exchange timeouts accumulate and upload times degrade from ~3 min
to 13+ min. Root cause: the per-shard inbound message channels (32 slots)
fill up because the shard consumer holds write locks across async calls,
blocking all 8 consumers. Messages — including identity announces — are
dropped, causing 15-second timeouts on every new connection.

Fix 1: Increase channel capacity
- MESSAGE_RECV_CHANNEL_CAPACITY: 256 -> 2048 (256 per shard vs 32)
- MIN_SHARD_CHANNEL_CAPACITY: 16 -> 128

Fix 2: Don't hold write locks across .await
- Scope the peer_to_channel write lock to a block that drops immediately
  after extracting is_new_peer and inserted
- The async register_connection_peer_id() call now runs without holding
  any write locks, eliminating head-of-line blocking across all shards
- Trades a microsecond consistency window for dramatically better
  throughput under sustained load

Evidence from testnet rc-17 (1000 nodes, 20+ hours):
- Nodes showed 6,000-18,000+ dropped messages (total_drops counter)
- Identity exchange timeouts: 119 in a single upload attempt
- Upload degradation: 229s -> 527s over 9 hours (hourly averages)
- CPU was not saturated (20-39%), confirming lock contention not compute
- Bandwidth dropped 40-70% as nodes processed fewer messages

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Comment thread: src/transport_handle.rs, lines 1733 to 1748

@@ -1743,17 +1747,12 @@ impl TransportHandle {
.insert(*app_id);
}

P2 Consistency window broader than described — stale peer state, not just omission

The PR description characterises the window as "hole-punch lookups may briefly miss a peer", which is the omission direction (transport unaware of new peer). There is also a commission direction: peer_to_channel and channel_to_peers are now updated in two separate lock acquisitions, and a concurrent remove_channel_mappings_static call can race between them.

Concrete failure path: (1) Shard acquires p2c.write(), inserts peer_X → {channel_X}, drops the lock. (2) Tokio schedules the disconnect handler; remove_channel_mappings_static acquires p2c.write() (sees the entry) then c2p.write() — but c2p has no channel_X entry yet, so c2p.remove(channel_id) returns None and cleanup exits without removing peer_X from p2c. (3) The shard resumes and inserts the now-stale channel_X → peer_X entry into c2p.

Both maps permanently show the peer as connected until restart or reconnect. is_peer_connected() / connected_peers() / peer_count() return incorrect results, and a spurious PeerConnected event is emitted. The race window is extremely narrow and the performance trade-off is well-justified. A future hardening option is to update both maps in one scoped block:

```rust
let (is_new_peer, inserted) = {
    let mut p2c = peer_to_channel.write().await;
    let mut c2p = channel_to_peers.write().await;   // hold both briefly
    let is_new = !p2c.contains_key(app_id);
    let channels = p2c.entry(*app_id).or_default();
    let ins = channels.insert(channel_id.clone());
    if ins {
        c2p.entry(channel_id.clone()).or_default().insert(*app_id);
    }
    (is_new, ins)
}; // both locks dropped; register_connection_peer_id runs without any lock
```

@jacderida jacderida merged commit 04327f5 into rc-2026.4.1 Apr 13, 2026
4 of 5 checks passed