fix: spawn accept loop registration to prevent handshake channel stall #82
The accept loop in TransportHandle took two write locks inline (peers + active_connections) for every accepted connection. Under 1000-node scale, contention on these locks caused the loop to fall behind draining the bounded handshake channel (cap 32 in saorsa-transport). Once the channel filled, all new connection handoffs blocked, reader tasks were never spawned, and identity exchange timed out, degrading upload times from ~175s to 358s+ after 17 hours of operation.

Fix: spawn the registration work (write locks) into a separate task so the accept loop immediately returns to draining the channel. This mirrors the sharding fix already applied to the message receiving system (same file, same root cause pattern).

Includes a stress test that floods a node with 40 concurrent connections and asserts >90% complete identity exchange.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
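The shape of the fix can be sketched as follows. This is a minimal illustration, not the code in src/transport_handle.rs: it assumes std threads and a std bounded channel in place of tokio tasks and the saorsa-transport handshake channel, and `Tables`, `accept_loop`, and `run_demo` are hypothetical names.

```rust
use std::collections::HashSet;
use std::sync::mpsc::{sync_channel, Receiver};
use std::sync::{Arc, RwLock};
use std::thread::{self, JoinHandle};

/// Hypothetical stand-in for the shared connection tables.
#[derive(Default)]
pub struct Tables {
    pub peers: RwLock<HashSet<u64>>,
    pub active_connections: RwLock<HashSet<u64>>,
}

/// Drains `handshakes` and hands each registration to its own thread, so the
/// loop itself never waits on a write lock. Returns the registration handles.
pub fn accept_loop(handshakes: Receiver<u64>, tables: Arc<Tables>) -> Vec<JoinHandle<()>> {
    let mut registrations = Vec::new();
    // `recv` returns Err once every sender is dropped, which ends the loop.
    while let Ok(channel_id) = handshakes.recv() {
        let tables = Arc::clone(&tables);
        registrations.push(thread::spawn(move || {
            // Both write locks are taken here, off the accept path.
            tables.peers.write().unwrap().insert(channel_id);
            tables.active_connections.write().unwrap().insert(channel_id);
        }));
    }
    registrations
}

/// Pushes `n` connection ids through a bounded channel of capacity `cap`
/// and returns how many ended up registered.
pub fn run_demo(n: u64, cap: usize) -> usize {
    let (tx, rx) = sync_channel::<u64>(cap);
    let tables = Arc::new(Tables::default());
    let loop_tables = Arc::clone(&tables);
    let acceptor = thread::spawn(move || accept_loop(rx, loop_tables));
    for id in 0..n {
        tx.send(id).unwrap(); // blocks only while the channel is full
    }
    drop(tx); // close the channel so the accept loop exits
    for handle in acceptor.join().unwrap() {
        handle.join().unwrap();
    }
    let count = tables.active_connections.read().unwrap().len();
    count
}
```

Calling `run_demo(40, 32)` mirrors the numbers in this PR (40 connections, channel capacity 32). The point of the design is that the loop body only receives and spawns, so lock contention delays individual registrations rather than the draining of the channel.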
force-pushed from bafc148 to 4dccb64
Re: Doc/constant mismatch (tests/accept_loop_stress.rs:21)
Fixed: updated the module doc to say 40 and corrected the tolerance comment from "5% rounded up" to "~7.5% tolerance".

Re: Stale entry window widened by spawned task (src/transport_handle.rs:1545)
Good observation. The stale entry window is wider with the spawned task, but the cleanup-on-next-send path already handles this. Investigating whether the lifecycle monitor's

Re: Dropped JoinHandle (src/transport_handle.rs:1545)
Fixed: the spawned task's handle is now awaited in a second lightweight task that logs a warning on failure.
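The dropped-handle fix described above might look roughly like this. It is a sketch under assumptions: std threads stand in for tokio tasks, a thread panic stands in for task failure, and `spawn_supervised`, `demo`, and the shared `warnings` list are hypothetical names added so the behaviour can be observed.

```rust
use std::sync::{Arc, Mutex};
use std::thread::{self, JoinHandle};

/// Spawns `work`, then spawns a second lightweight thread that waits on the
/// first handle and records a warning if the work panicked.
pub fn spawn_supervised<F>(
    label: &'static str,
    warnings: Arc<Mutex<Vec<String>>>,
    work: F,
) -> JoinHandle<()>
where
    F: FnOnce() + Send + 'static,
{
    let worker = thread::spawn(work);
    thread::spawn(move || {
        // `join` returns Err when the worker panicked.
        if worker.join().is_err() {
            let message = format!("warning: {label} task failed");
            eprintln!("{message}");
            warnings.lock().unwrap().push(message);
        }
    })
}

/// Runs one successful and one failing registration and returns the warnings.
pub fn demo() -> Vec<String> {
    let warnings = Arc::new(Mutex::new(Vec::new()));
    let ok = spawn_supervised("registration-ok", Arc::clone(&warnings), || {});
    let bad = spawn_supervised("registration-bad", Arc::clone(&warnings), || {
        panic!("simulated registration failure");
    });
    ok.join().unwrap();
    bad.join().unwrap();
    let collected = warnings.lock().unwrap().clone();
    collected
}
```

The supervising thread does nothing but wait, so it adds no contention of its own; its only job is to make a failed registration visible instead of silently discarding the handle.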
…-stall Revert PR #82: accept loop write lock stall fix
Summary
The accept loop in TransportHandle took two write locks inline (peers + active_connections) for every accepted connection. Under 1000-node scale, contention on these locks caused the loop to fall behind draining the bounded handshake channel (cap 32 in saorsa-transport). Once full, all new connection handoffs blocked, reader tasks were never spawned, and identity exchange timed out.
Fix: spawn the registration work (write locks) into a separate task so the accept loop immediately returns to draining the channel.
Includes a stress test (tests/accept_loop_stress.rs) that floods a node with 40 concurrent connections and asserts >90% complete identity exchange.
Evidence from ant-rc-18 testnet (1000 nodes, 17+ hours)
Upload times degraded from ~175s to 358s+ after 17 hours. Root cause investigation:
nat_traversal_api.rs:4062 and p2p_endpoint.rs:2476
Companion PR
saorsa-labs/saorsa-transport
fix/handshake-channel-capacity: increases the handshake channel from 32 to 1024 as a belt-and-suspenders measure.
Test plan
- cargo check passes
- cargo test --test accept_loop_stress passes (40 concurrent connections, >90% identity exchange success)
🤖 Generated with Claude Code
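For reference, the pass criterion in the test plan works out as below. The actual assertion in tests/accept_loop_stress.rs is not shown on this page, so this is only an assumed reading of "more than 90% of 40 connections"; `meets_threshold` and `min_passing` are hypothetical helpers.

```rust
/// Connections opened by the stress test, as stated in the test plan.
pub const CONNECTIONS: usize = 40;

/// True when strictly more than 90% of `total` attempts succeeded.
/// Integer arithmetic avoids floating-point rounding at the boundary.
pub fn meets_threshold(successes: usize, total: usize) -> bool {
    successes * 10 > total * 9
}

/// Smallest success count that passes for `total` attempts.
pub fn min_passing(total: usize) -> usize {
    (0..=total)
        .find(|&s| meets_threshold(s, total))
        .unwrap_or(total)
}
```

Under this reading, `min_passing(CONNECTIONS)` is 37, which allows 3 failures out of 40 (7.5%) and lines up with the "~7.5% tolerance" wording in the review reply.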
Greptile Summary
This PR fixes the accept loop stall that caused 1,079 dropped connections and identity exchange timeouts on the ant-rc-18 testnet by moving the two blocking write-lock operations (peers + active_connections) off the hot accept path into a detached tokio::spawn task, the same pattern already used by the sharded message dispatcher. The accompanying stress test (tests/accept_loop_stress.rs) validates ≥90% identity exchange success under 40 concurrent connections.
Confidence Score: 5/5
Important Files Changed
Sequence Diagram
sequenceDiagram
    participant T as Transport (QUIC)
    participant AL as Accept Loop
    participant RT as Registration Task (spawned)
    participant LM as Lifecycle Monitor
    T->>AL: accept_any() returns remote_sock
    AL->>AL: rate_limiter.check_ip()
    AL->>RT: tokio::spawn (peers.clone, active_connections.clone)
    Note over AL: immediately loops back to accept_any()
    AL->>T: accept_any() [ready for next connection]
    par Registration Task runs concurrently
        RT->>RT: "register_new_channel(&peers, channel_id, addr)"
        RT->>RT: active_connections.insert(channel_id)
    and Lifecycle Monitor handles transport events
        T->>LM: "ConnectionEvent::Established { remote_address }"
        LM->>LM: active_connections.insert(channel_id)
        LM->>LM: peers.insert or update status
        LM->>T: send_to_peer_optimized (identity announce)
        T-->>LM: ConnectionEvent::Lost / Failed
        LM->>LM: active_connections.remove(channel_id)
        LM->>LM: peers.remove(channel_id)
    end
    Note over RT,LM: ⚠ If Lost fires before RT runs,<br/>RT re-inserts a stale entry
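The stale-entry ordering flagged in the diagram's final note can be reproduced deterministically with a single-threaded model. This is a sketch only: `Registry` and its methods are hypothetical, and the `send` method is an assumed reading of the "cleanup-on-next-send" path mentioned in the review reply, not the project's actual implementation.

```rust
use std::collections::HashSet;

/// Hypothetical registry: `live` is what the transport actually has open,
/// `active` is the bookkeeping set that registration and cleanup touch.
#[derive(Default)]
pub struct Registry {
    pub live: HashSet<u64>,
    pub active: HashSet<u64>,
}

impl Registry {
    /// Transport opened the connection.
    pub fn on_established(&mut self, id: u64) {
        self.live.insert(id);
    }

    /// Lifecycle monitor path for Lost / Failed: drop the bookkeeping entry.
    pub fn on_lost(&mut self, id: u64) {
        self.live.remove(&id);
        self.active.remove(&id);
    }

    /// Spawned registration task: inserts without knowing about earlier events.
    pub fn register(&mut self, id: u64) {
        self.active.insert(id);
    }

    /// Send path: a failed send removes the stale entry it tripped over.
    pub fn send(&mut self, id: u64) -> Result<(), &'static str> {
        if self.live.contains(&id) {
            return Ok(());
        }
        self.active.remove(&id);
        Err("connection closed")
    }
}

/// Returns (entry is stale after late registration, entry remains after send).
pub fn stale_then_cleaned() -> (bool, bool) {
    let mut r = Registry::default();
    r.on_established(7);
    r.on_lost(7); // Lost fires first...
    r.register(7); // ...then the late registration re-inserts the id.
    let stale_before = r.active.contains(&7);
    let _ = r.send(7); // the next send fails and cleans up
    (stale_before, r.active.contains(&7))
}
```

In this model the stale entry lives only until the next send to that id, which is the trade-off the review reply accepts: a wider window in exchange for an accept loop that never blocks on registration.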
Reviews (1): Last reviewed commit: "fix: spawn accept loop registration to p..."