Skip to content

fix: increase handshake channel capacity from 32 to 1024#59

Merged
jacderida merged 1 commit into
rc-2026.4.1from
fix/handshake-channel-capacity
Apr 14, 2026
Merged

fix: increase handshake channel capacity from 32 to 1024#59
jacderida merged 1 commit into
rc-2026.4.1from
fix/handshake-channel-capacity

Conversation

@jacderida
Copy link
Copy Markdown
Contributor

@jacderida jacderida commented Apr 14, 2026

Summary

  • The bounded mpsc::channel(32) in nat_traversal_api.rs carries completed connections from the accept loop to the reader task spawn path. When the consumer stalls under write-lock contention, the channel fills and all subsequent connection handoffs block silently.
  • Increase capacity from 32 to 1024 to provide sufficient buffer for transient stalls.

Context

In the ant-rc-18 testnet (1000 nodes, 17+ hours), the channel filled and 1,079 connections were accepted at the QUIC level but never forwarded to spawn reader tasks. This stalled identity exchange and degraded upload times from ~175s to 358s+.

Companion PR

WithAutonomi/saorsa-core#82 — the primary fix, which spawns the accept loop's registration work into a separate task so it never blocks on write locks. This channel capacity increase provides defense in depth.

Test plan

  • cargo check passes

🤖 Generated with Claude Code

Greptile Summary

Increases the bounded mpsc handshake channel capacity from 32 to 1024 in both NatTraversalEndpoint constructors to prevent the accept pipeline from stalling when the consumer (accept_connection_direct) is briefly blocked on write-lock contention in saorsa-core. The PR explicitly describes this as defense-in-depth alongside the primary fix in saorsa-core#82.

Confidence Score: 5/5

  • Safe to merge — one-line capacity change with clear rationale, no logic or security concerns.
  • The diff is two identical single-line changes, each accompanied by thorough inline comments explaining the production evidence. The bounded channel is correct for the design (backpressure when the consumer is healthy), and 1024 provides ample headroom for the stalls addressed by the companion fix. The only finding is a P2 style suggestion to extract a named constant.
  • No files require special attention.

Important Files Changed

Filename Overview
src/nat_traversal_api.rs Increases bounded mpsc handshake channel capacity from 32 to 1024 at two constructor sites (lines 1482 and 1958); change is minimal, well-commented, and directly addresses the observed testnet backlog.

Sequence Diagram

sequenceDiagram
    participant QL as Quinn Accept
    participant AL as spawn_accept_loop (tokio::spawn)
    participant HS as Handshake Task (tokio::spawn)
    participant CH as mpsc::channel(1024)
    participant ACD as accept_connection_direct

    QL->>AL: endpoint.accept() → Connecting
    AL->>HS: tokio::spawn handshake task
    HS->>HS: connecting.await (TLS/PQC handshake)
    HS->>CH: tx.send(Ok((addr, conn))).await
    Note over CH: buffered capacity: 1024\n(was 32 — filled under lock contention)
    ACD->>CH: rx.recv().await
    CH-->>ACD: Ok((addr, conn))
    ACD-->>ACD: return connection to caller
    Note over ACD: stalls under saorsa-core write-lock\n→ channel fills → new handoffs block
Loading
Prompt To Fix All With AI
This is a comment left during a code review.
Path: src/nat_traversal_api.rs
Line: 1482

Comment:
**Capacity duplicated in two constructors**

The value `1024` appears at both line 1482 and line 1958. A named constant would make future adjustments atomic and signal intent:

```rust
/// Buffer for completed handshakes. Sized above the peak backlog observed
/// in the ant-rc-18 testnet (1 079 connections) so transient consumer
/// stalls don't block the accept loop.
const HANDSHAKE_CHANNEL_CAPACITY: usize = 1024;
```

Then `mpsc::channel(HANDSHAKE_CHANNEL_CAPACITY)` at both sites.

How can I resolve this? If you propose a fix, please make it concise.

Reviews (1): Last reviewed commit: "fix: increase handshake channel capacity..." | Re-trigger Greptile

Greptile also left 1 inline comment on this PR.

The bounded mpsc channel that carries completed connections from the
accept loop (nat_traversal_api) to the reader task spawn path
(p2p_endpoint) had a capacity of 32. When the consumer stalled
briefly under write-lock contention in saorsa-core's accept loop,
the channel filled and all subsequent connection handoffs blocked
indefinitely — the send result was silently discarded with `let _ =`.

After 15+ hours in a 1000-node testnet, this caused 1,079 connections
to be accepted at the QUIC level but never forwarded to spawn reader
tasks, stalling identity exchange and degrading upload times from
~175s to 358s+.

Increase to 1024 to provide sufficient buffer for transient stalls.
The primary fix is in saorsa-core (spawning registration work off
the accept loop), but the larger buffer provides defense in depth.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Comment thread src/nat_traversal_api.rs
// can stall briefly under write-lock contention in saorsa-core's accept loop.
// A small buffer (32) caused the pipeline to back up after 15+ hours in a
// 1000-node testnet, blocking all new connection handoffs.
let (hs_tx, hs_rx) = mpsc::channel(1024);
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Capacity duplicated in two constructors

The value 1024 appears at both line 1482 and line 1958. A named constant would make future adjustments atomic and signal intent:

/// Buffer for completed handshakes. Sized above the peak backlog observed
/// in the ant-rc-18 testnet (1 079 connections) so transient consumer
/// stalls don't block the accept loop.
const HANDSHAKE_CHANNEL_CAPACITY: usize = 1024;

Then mpsc::channel(HANDSHAKE_CHANNEL_CAPACITY) at both sites.

Prompt To Fix With AI
This is a comment left during a code review.
Path: src/nat_traversal_api.rs
Line: 1482

Comment:
**Capacity duplicated in two constructors**

The value `1024` appears at both line 1482 and line 1958. A named constant would make future adjustments atomic and signal intent:

```rust
/// Buffer for completed handshakes. Sized above the peak backlog observed
/// in the ant-rc-18 testnet (1 079 connections) so transient consumer
/// stalls don't block the accept loop.
const HANDSHAKE_CHANNEL_CAPACITY: usize = 1024;
```

Then `mpsc::channel(HANDSHAKE_CHANNEL_CAPACITY)` at both sites.

How can I resolve this? If you propose a fix, please make it concise.

@jacderida jacderida merged commit 7986b01 into rc-2026.4.1 Apr 14, 2026
8 of 9 checks passed
jacderida added a commit that referenced this pull request Apr 14, 2026
…-capacity"

This reverts commit 7986b01, reversing
changes made to f2b30ad.
jacderida added a commit that referenced this pull request Apr 14, 2026
…city

revert: undo PR #59 handshake channel capacity increase
@jacderida jacderida mentioned this pull request Apr 14, 2026
2 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants