fix: shard channel saturation causing upload degradation at 1000-node scale#80

Merged
jacderida merged 1 commit intorc-2026.4.1from
fix/shard-channel-saturation-rc
Apr 13, 2026

Conversation

jacderida (Collaborator) commented Apr 13, 2026

Summary

After ~18-20 hours of running a 1000-node testnet with continuous uploads, identity exchange
timeouts accumulate and upload times degrade from ~3 min to 13+ min. This PR addresses two
root causes in the sharded inbound message dispatcher introduced in #70.

Root Cause

  1. Per-shard channel capacity too small: MESSAGE_RECV_CHANNEL_CAPACITY / 8 = 32 slots per
    shard. Under sustained 1000-node load, shard consumers can't drain fast enough and the
    dispatcher drops messages via try_send, including identity announces.

  2. Write lock held across .await: run_shard_consumer holds peer_to_channel.write() across
    the async register_connection_peer_id() call. All 8 shard consumers contend on the same
    global RwLock, so one shard's async registration blocks all the others, causing
    head-of-line blocking.
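Root cause 1 in miniature: the sketch below models stalled shard consumers with bounded channels that are never drained, using `std::sync::mpsc::sync_channel` as a stand-in for the async shard channels. The hash function, address format, and `simulate_drops` helper are illustrative, not the PR's actual code.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};
use std::sync::mpsc::{sync_channel, Receiver, SyncSender, TrySendError};

const SHARDS: usize = 8;
const PER_SHARD_CAPACITY: usize = 32; // the old, too-small per-shard budget

/// Hash a peer address to a shard index (illustrative, not the PR's hash).
fn shard_index(addr: &str) -> usize {
    let mut h = DefaultHasher::new();
    addr.hash(&mut h);
    (h.finish() as usize) % SHARDS
}

/// Feed `n` messages through stalled shard channels and count drops.
fn simulate_drops(n: usize) -> u64 {
    // One bounded channel per shard; consumers are deliberately absent,
    // modeling shard consumers that cannot drain fast enough.
    let shards: Vec<(SyncSender<String>, Receiver<String>)> =
        (0..SHARDS).map(|_| sync_channel(PER_SHARD_CAPACITY)).collect();
    let mut total_drops = 0u64;
    for i in 0..n {
        let addr = format!("10.0.{}.{}", i / 256, i % 256);
        let (tx, _rx) = &shards[shard_index(&addr)];
        if let Err(TrySendError::Full(_)) = tx.try_send(addr) {
            // In the real dispatcher a dropped message can be an identity
            // announce, which later surfaces as a timeout on the peer.
            total_drops += 1;
        }
    }
    total_drops
}

fn main() {
    // At most 8 * 32 = 256 messages can sit buffered; the rest are dropped.
    let drops = simulate_drops(10_000);
    assert!(drops >= 10_000 - 256);
    println!("total_drops = {drops}");
}
```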

Changes

Fix 1: Increase channel capacity

  • MESSAGE_RECV_CHANNEL_CAPACITY: 256 → 2048 (256 per shard vs 32)
  • MIN_SHARD_CHANNEL_CAPACITY: 16 → 128
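The arithmetic behind the new constants, as a compile-checked sketch. The shard count of 8 is inferred from the "/ 8" in the root-cause description, and the floor-via-`.max()` shape is an assumption about the surrounding code, not a quote of it:

```rust
const MESSAGE_DISPATCH_SHARDS: usize = 8;          // inferred from the "/ 8" above
const MESSAGE_RECV_CHANNEL_CAPACITY: usize = 2048; // raised from 256
const MIN_SHARD_CHANNEL_CAPACITY: usize = 128;     // raised from 16

/// Per-shard budget: global capacity split across shards, floored at the minimum.
fn per_shard_capacity(global: usize) -> usize {
    (global / MESSAGE_DISPATCH_SHARDS).max(MIN_SHARD_CHANNEL_CAPACITY)
}

fn main() {
    // New: 2048 / 8 = 256 slots per shard (was 256 / 8 = 32).
    assert_eq!(per_shard_capacity(MESSAGE_RECV_CHANNEL_CAPACITY), 256);
    // A small global capacity now floors at 128 instead of 16.
    assert_eq!(per_shard_capacity(256), 128);
}
```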

Fix 2: Don't hold write locks across .await

  • Scope the peer_to_channel write lock to a block that drops immediately after extracting
    is_new_peer and inserted
  • register_connection_peer_id() now runs without holding any write locks
  • Trades a microsecond consistency window (hole-punch lookups may briefly miss a peer) for
    dramatically better throughput under sustained load
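The scoping pattern in miniature: this sketch uses `std::sync::RwLock` and a synchronous stub (`record_channel`, `register_connection_peer_id_stub` are both illustrative names) in place of the real async lock and `register_connection_peer_id()`; the point is the guard's lifetime, which ends at the closing brace of the block.

```rust
use std::collections::{HashMap, HashSet};
use std::sync::RwLock;

/// Stand-in for the async register_connection_peer_id(); in the real code
/// this awaits, which is exactly why no lock may be held across it.
fn register_connection_peer_id_stub(_app_id: u64, _channel_id: &str) {}

fn record_channel(
    peer_to_channel: &RwLock<HashMap<u64, HashSet<String>>>,
    app_id: u64,
    channel_id: &str,
) -> (bool, bool) {
    // The write guard lives only inside this block and is dropped at the
    // closing brace -- before the slow registration call below.
    let (is_new_peer, inserted) = {
        let mut p2c = peer_to_channel.write().unwrap();
        let is_new = !p2c.contains_key(&app_id);
        let ins = p2c.entry(app_id).or_default().insert(channel_id.to_string());
        (is_new, ins)
    };
    // Runs with no lock held, so other shard consumers never queue behind it.
    register_connection_peer_id_stub(app_id, channel_id);
    (is_new_peer, inserted)
}

fn main() {
    let p2c = RwLock::new(HashMap::new());
    assert_eq!(record_channel(&p2c, 42, "chan-1"), (true, true));
    assert_eq!(record_channel(&p2c, 42, "chan-1"), (false, false));
}
```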

Evidence from testnet ant-rc-17 (1000 nodes, 33+ hours)

  • Shard drops: 6,000–18,000+ per node (total_drops counter in "shard channel full" warnings)
  • Identity timeouts: 119 in a single client upload attempt (peer discovery phase)
  • Upload degradation: hourly avg 229s → 527s over 9 hours, accelerating after hour 30
  • CPU not saturated: 20–39% utilization, confirming the bottleneck is lock contention, not compute
  • Bandwidth dropped 40–70%: nodes processed fewer messages as channels overflowed
  • All 997 node processes remained healthy — the issue is purely in message processing throughput

Timeline

| Hours Running | 50MB Upload Avg | Status |
|---|---|---|
| 11–17 | 230–270s | Stable |
| 18–22 | ~300s | Gradual drift |
| 28+ | 350s+ | Accelerating |
| 33 | 527s+ | Severely degraded |

Test plan

  • cargo check — clean build
  • cargo test --lib — 282 passed, 0 failed
  • Deploy to 1000-node testnet with increased client load (10 VMs) to verify fix

🤖 Generated with Claude Code

Greptile Summary

This PR fixes two root causes of shard channel saturation at 1000-node scale: it raises MESSAGE_RECV_CHANNEL_CAPACITY from 256 → 2048 (giving 256 slots per shard vs. 32), and scopes the peer_to_channel write lock to a tight block so it is no longer held across the async register_connection_peer_id() call, eliminating the head-of-line blocking that was starving all 8 shard consumers. Both changes are well-motivated by the testnet evidence provided.

Confidence Score: 4/5

Safe to merge with awareness — the core fix is well-motivated and the only remaining concern is a narrow consistency race that the PR itself acknowledges.

Both changes are directly validated by testnet evidence. The one new concern (stale p2c/c2p state when a disconnect races the split lock acquisitions) requires extremely precise scheduling and results in degraded reliability rather than data loss; the PR explicitly traded this window for dramatically better throughput. The stale comment is cosmetic. No P0/P1 issues found.

Key area to watch: src/transport_handle.rs — the split p2c/c2p lock acquisition in run_shard_consumer and the now-inaccurate memory comment in start_message_receiving_system.

Important Files Changed

| Filename | Overview |
|---|---|
| src/transport_handle.rs | Adds MIN_SHARD_CHANNEL_CAPACITY = 128 and scopes peer_to_channel.write() to a short block in run_shard_consumer so it is dropped before the async register_connection_peer_id() call; channel_to_peers is now updated in a separate lock acquisition, introducing a narrow window where a concurrent disconnect can leave both maps stale. |
| src/network.rs | Raises MESSAGE_RECV_CHANNEL_CAPACITY from 256 to 2048; straightforward constant change with clear production justification. |

Sequence Diagram

```mermaid
sequenceDiagram
    participant T as Transport
    participant D as Dispatcher Task
    participant S0 as Shard 0 Consumer
    participant S1 as Shard 1 Consumer
    participant State as Shared State

    T->>D: message (addr, bytes)
    D->>D: hash addr to shard index
    alt shard channel has capacity
        D->>S0: try_send succeeds
    else shard channel full
        D->>D: increment drop_counter, warn every 64th drop
    end
    T->>D: message (addr2, bytes2)
    D->>S1: try_send succeeds

    par Shard consumers run in parallel
        S0->>State: p2c.write() insert peer-channel, drop lock immediately
        S0->>State: c2p.write() insert channel-peer, drop lock
        S0->>T: register_connection_peer_id() with no lock held
        S0->>State: peer_user_agents.write() if new peer, broadcast PeerConnected
    and
        S1->>State: p2c.write() insert peer-channel, drop lock immediately
        S1->>State: c2p.write() insert channel-peer, drop lock
        S1->>T: register_connection_peer_id() with no lock held
    end
```

Comments Outside Diff (1)

  1. src/transport_handle.rs, lines 1584-1589

    P2 Stale comment — memory footprint is now 8× larger, not "comparable"

    The comment was accurate when MESSAGE_RECV_CHANNEL_CAPACITY was 256 (per_shard = 32, aggregate = 8 × 32 = 256 ≈ old single-channel capacity). After raising the constant to 2048, the aggregate shard capacity is 8 × 256 = 2048 — an 8× increase. The phrase "keeping memory usage comparable" no longer holds.

    Suggested replacement:

```suggestion
        // Per-shard capacity: divide the global channel budget across shards so
        // each consumer has enough headroom for bursts. Raised to 2048 total
        // (256 per shard) to eliminate the channel-full drops observed at
        // 1000-node scale. Floor at `MIN_SHARD_CHANNEL_CAPACITY` for small
        // global-capacity configurations.
        let per_shard_capacity = (MESSAGE_RECV_CHANNEL_CAPACITY / MESSAGE_DISPATCH_SHARDS)
            .max(MIN_SHARD_CHANNEL_CAPACITY);
```

Reviews (1): Last reviewed commit: "fix: shard channel saturation causing up..."

Greptile also left 1 inline comment on this PR.

… scale

After ~18-20 hours of running a 1000-node testnet with continuous uploads,
identity exchange timeouts accumulate and upload times degrade from ~3 min
to 13+ min. Root cause: the per-shard inbound message channels (32 slots)
fill up because the shard consumer holds write locks across async calls,
blocking all 8 consumers. Messages — including identity announces — are
dropped, causing 15-second timeouts on every new connection.

Fix 1: Increase channel capacity
- MESSAGE_RECV_CHANNEL_CAPACITY: 256 -> 2048 (256 per shard vs 32)
- MIN_SHARD_CHANNEL_CAPACITY: 16 -> 128

Fix 2: Don't hold write locks across .await
- Scope the peer_to_channel write lock to a block that drops immediately
  after extracting is_new_peer and inserted
- The async register_connection_peer_id() call now runs without holding
  any write locks, eliminating head-of-line blocking across all shards
- Trades a microsecond consistency window for dramatically better
  throughput under sustained load

Evidence from testnet rc-17 (1000 nodes, 20+ hours):
- Nodes showed 6,000-18,000+ dropped messages (total_drops counter)
- Identity exchange timeouts: 119 in a single upload attempt
- Upload degradation: 229s -> 527s over 9 hours (hourly averages)
- CPU was not saturated (20-39%), confirming lock contention not compute
- Bandwidth dropped 40-70% as nodes processed fewer messages

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Comment thread: src/transport_handle.rs, lines 1733 to 1748

@@ -1743,17 +1747,12 @@ impl TransportHandle {
.insert(*app_id);
}

P2 Consistency window broader than described — stale peer state, not just omission

The PR description characterises the window as "hole-punch lookups may briefly miss a peer", which is the omission direction (transport unaware of new peer). There is also a commission direction: peer_to_channel and channel_to_peers are now updated in two separate lock acquisitions, and a concurrent remove_channel_mappings_static call can race between them.

Concrete failure path: (1) Shard acquires p2c.write(), inserts peer_X → {channel_X}, drops the lock. (2) Tokio schedules the disconnect handler; remove_channel_mappings_static acquires p2c.write() (sees the entry) then c2p.write() — but c2p has no channel_X entry yet, so c2p.remove(channel_id) returns None and cleanup exits without removing peer_X from p2c. (3) The shard resumes and inserts the now-stale channel_X → peer_X entry into c2p.

Both maps permanently show the peer as connected until restart or reconnect. is_peer_connected() / connected_peers() / peer_count() return incorrect results, and a spurious PeerConnected event is emitted. The race window is extremely narrow and the performance trade-off is well-justified. A future hardening option is to update both maps in one scoped block:

```rust
let (is_new_peer, inserted) = {
    let mut p2c = peer_to_channel.write().await;
    let mut c2p = channel_to_peers.write().await;   // hold both briefly
    let is_new = !p2c.contains_key(app_id);
    let channels = p2c.entry(*app_id).or_default();
    let ins = channels.insert(channel_id.clone());
    if ins {
        c2p.entry(channel_id.clone()).or_default().insert(*app_id);
    }
    (is_new, ins)
}; // both locks dropped; register_connection_peer_id runs without any lock
```

@jacderida jacderida merged commit 04327f5 into rc-2026.4.1 Apr 13, 2026
4 of 5 checks passed