fix: increase handshake channel capacity from 32 to 1024#59
Merged
Conversation
The bounded mpsc channel that carries completed connections from the accept loop (nat_traversal_api) to the reader task spawn path (p2p_endpoint) had a capacity of 32. When the consumer stalled briefly under write-lock contention in saorsa-core's accept loop, the channel filled and all subsequent connection handoffs blocked indefinitely — the send result was silently discarded with `let _ =`. After 15+ hours in a 1000-node testnet, this caused 1,079 connections to be accepted at the QUIC level but never forwarded to spawn reader tasks, stalling identity exchange and degrading upload times from ~175s to 358s+. Increase to 1024 to provide sufficient buffer for transient stalls. The primary fix is in saorsa-core (spawning registration work off the accept loop), but the larger buffer provides defense in depth. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
| // can stall briefly under write-lock contention in saorsa-core's accept loop. | ||
| // A small buffer (32) caused the pipeline to back up after 15+ hours in a | ||
| // 1000-node testnet, blocking all new connection handoffs. | ||
| let (hs_tx, hs_rx) = mpsc::channel(1024); |
There was a problem hiding this comment.
Capacity duplicated in two constructors
The value 1024 appears at both line 1482 and line 1958. A named constant would make future adjustments atomic and signal intent:
/// Buffer for completed handshakes. Sized above the peak backlog observed
/// in the ant-rc-18 testnet (1 079 connections) so transient consumer
/// stalls don't block the accept loop.
const HANDSHAKE_CHANNEL_CAPACITY: usize = 1024;Then mpsc::channel(HANDSHAKE_CHANNEL_CAPACITY) at both sites.
Prompt To Fix With AI
This is a comment left during a code review.
Path: src/nat_traversal_api.rs
Line: 1482
Comment:
**Capacity duplicated in two constructors**
The value `1024` appears at both line 1482 and line 1958. A named constant would make future adjustments atomic and signal intent:
```rust
/// Buffer for completed handshakes. Sized above the peak backlog observed
/// in the ant-rc-18 testnet (1 079 connections) so transient consumer
/// stalls don't block the accept loop.
const HANDSHAKE_CHANNEL_CAPACITY: usize = 1024;
```
Then `mpsc::channel(HANDSHAKE_CHANNEL_CAPACITY)` at both sites.
How can I resolve this? If you propose a fix, please make it concise.
mickvandijke
approved these changes
Apr 14, 2026
2 tasks
jacderida
added a commit
that referenced
this pull request
Apr 14, 2026
…city revert: undo PR #59 handshake channel capacity increase
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
mpsc::channel(32)innat_traversal_api.rscarries completed connections from the accept loop to the reader task spawn path. When the consumer stalls under write-lock contention, the channel fills and all subsequent connection handoffs block silently.Context
In the ant-rc-18 testnet (1000 nodes, 17+ hours), the channel filled and 1,079 connections were accepted at the QUIC level but never forwarded to spawn reader tasks. This stalled identity exchange and degraded upload times from ~175s to 358s+.
Companion PR
WithAutonomi/saorsa-core#82 — the primary fix, which spawns the accept loop's registration work into a separate task so it never blocks on write locks. This channel capacity increase provides defense in depth.
Test plan
cargo checkpasses🤖 Generated with Claude Code
Greptile Summary
Increases the bounded
mpschandshake channel capacity from 32 to 1024 in bothNatTraversalEndpointconstructors to prevent the accept pipeline from stalling when the consumer (accept_connection_direct) is briefly blocked on write-lock contention insaorsa-core. The PR explicitly describes this as defense-in-depth alongside the primary fix in saorsa-core#82.Confidence Score: 5/5
Important Files Changed
Sequence Diagram
sequenceDiagram participant QL as Quinn Accept participant AL as spawn_accept_loop (tokio::spawn) participant HS as Handshake Task (tokio::spawn) participant CH as mpsc::channel(1024) participant ACD as accept_connection_direct QL->>AL: endpoint.accept() → Connecting AL->>HS: tokio::spawn handshake task HS->>HS: connecting.await (TLS/PQC handshake) HS->>CH: tx.send(Ok((addr, conn))).await Note over CH: buffered capacity: 1024\n(was 32 — filled under lock contention) ACD->>CH: rx.recv().await CH-->>ACD: Ok((addr, conn)) ACD-->>ACD: return connection to caller Note over ACD: stalls under saorsa-core write-lock\n→ channel fills → new handoffs blockPrompt To Fix All With AI
Reviews (1): Last reviewed commit: "fix: increase handshake channel capacity..." | Re-trigger Greptile