
fix: spawn accept loop registration to prevent handshake channel stall#82

Merged

jacderida merged 1 commit into rc-2026.4.1 from fix/accept-loop-write-lock-stall on Apr 14, 2026
Conversation

@jacderida (Collaborator) commented Apr 14, 2026

Summary

  • The accept loop in TransportHandle took two write locks inline (peers + active_connections) for every accepted connection. Under 1000-node scale, contention on these locks caused the loop to fall behind draining the bounded handshake channel (cap 32 in saorsa-transport). Once full, all new connection handoffs blocked, reader tasks were never spawned, and identity exchange timed out.
  • Fix: spawn the registration work into a separate task so the accept loop immediately returns to draining the channel. Same pattern as the sharding fix already applied to the message receiving system.
  • Includes a stress test (tests/accept_loop_stress.rs) that floods a node with 40 concurrent connections and asserts >90% complete identity exchange.

Evidence from ant-rc-18 testnet (1000 nodes, 17+ hours)

Upload times degraded from ~175s to 358s+ after 17 hours. Root cause investigation:

  • Shard channel full warnings: 0 (confirms the PR #80 fix for shard channel saturation at 1000-node scale is working)
  • Identity exchange timeouts: 208 in 30 minutes on a single client
  • Genesis bootstrap node: 7,253 "Accepted connection (unified path)" vs 6,174 "spawning reader task" — 1,079 connections lost to the stalled channel
  • Node-side comparison showed "Accepted connection" logged but reader task never spawned, matching the bounded channel(32) between nat_traversal_api.rs:4062 and p2p_endpoint.rs:2476

Companion PR

saorsa-labs/saorsa-transport fix/handshake-channel-capacity — increases the handshake channel from 32 to 1024 as a belt-and-suspenders measure.

Test plan

  • cargo check passes
  • cargo test --test accept_loop_stress passes (40 concurrent connections, >90% identity exchange success)

🤖 Generated with Claude Code

Greptile Summary

This PR fixes the accept loop stall that caused 1,079 dropped connections and identity exchange timeouts on the ant-rc-18 testnet by moving the two blocking write-lock operations (peers + active_connections) off the hot accept path into a detached tokio::spawn task — the same pattern already used by the sharded message dispatcher. The accompanying stress test (tests/accept_loop_stress.rs) validates ≥90% identity exchange success under 40 concurrent connections.

Confidence Score: 5/5

  • Safe to merge — all findings are P2 style/hardening suggestions; no blocking defects introduced.
  • The fix is minimal and well-motivated by concrete testnet evidence. All remaining comments are P2: a doc inconsistency in the test, a slightly wider (but pre-existing) stale-entry window, and a dropped JoinHandle. None of these block correctness or production reliability at the scale this PR targets.
  • No files require special attention.

Important Files Changed

Filename | Overview
src/transport_handle.rs | Accept loop now spawns registration work off the hot path; introduces a slightly wider stale-entry window vs. the previous inline approach, and drops the spawned JoinHandle silently.
tests/accept_loop_stress.rs | New stress test for the accept loop; contains a minor doc inconsistency (module comment says 50 clients, constant is 40) but the test logic and assertions are sound.

Sequence Diagram

sequenceDiagram
    participant T as Transport (QUIC)
    participant AL as Accept Loop
    participant RT as Registration Task (spawned)
    participant LM as Lifecycle Monitor

    T->>AL: accept_any() returns remote_sock
    AL->>AL: rate_limiter.check_ip()
    AL->>RT: tokio::spawn (peers.clone, active_connections.clone)
    Note over AL: immediately loops back to accept_any()
    AL->>T: accept_any() [ready for next connection]

    par Registration Task runs concurrently
        RT->>RT: "register_new_channel(&peers, channel_id, addr)"
        RT->>RT: active_connections.insert(channel_id)
    and Lifecycle Monitor handles transport events
        T->>LM: "ConnectionEvent::Established { remote_address }"
        LM->>LM: active_connections.insert(channel_id)
        LM->>LM: peers.insert or update status
        LM->>T: send_to_peer_optimized (identity announce)
        T-->>LM: ConnectionEvent::Lost / Failed
        LM->>LM: active_connections.remove(channel_id)
        LM->>LM: peers.remove(channel_id)
    end

    Note over RT,LM: ⚠ If Lost fires before RT runs,<br/>RT re-inserts a stale entry
This is a comment left during a code review.
Path: tests/accept_loop_stress.rs
Line: 21

Comment:
**Doc/constant mismatch**

The module-level doc says "floods it with **50** concurrent client connections" but `NUM_CLIENTS` is 40. The assertion comment on line 141 also refers to "5% rounded up" but `40 / 20 + 1 = 3` is actually 7.5% + rounding. These inconsistencies don't affect correctness but will confuse future readers.

```suggestion
//! This test creates one server node and floods it with 40 concurrent
```


---

This is a comment left during a code review.
Path: src/transport_handle.rs
Line: 1538-1545

Comment:
**Stale entry window widened by spawned task**

The spawned task's window is larger than the previous inline version: the Tokio scheduler can now delay it until after the connection lifecycle monitor has fully processed an `Established` + `Lost`/`Failed` pair. When that happens, the task re-inserts a disconnected channel into both `peers` and `active_connections`, and nothing will ever remove it (the lifecycle monitor already fired its `Lost` event). The stale channel then shows up in `is_connection_active()` and `list_active_connections()` until the next send attempt to it cleans it up.

The lifecycle monitor at `connection_lifecycle_monitor_with_rx` (lines ~1989–2008) already handles inbound registration via `ConnectionEvent::Established`, including the same `peers` insert if the channel is unknown. If `accept_any()` and `ConnectionEvent::Established` always fire for the same connections, the register-in-spawned-task path is redundant and could be dropped entirely, eliminating the race.


---

This is a comment left during a code review.
Path: src/transport_handle.rs
Line: 1540-1545

Comment:
**Dropped `JoinHandle` silences task failures**

The handle returned by `tokio::spawn` is immediately dropped, so any panic or unexpected error inside the registration task is swallowed without logging. The codebase's strict no-panic rule makes this unlikely to matter in practice, but it diverges from the careful error-reporting pattern used elsewhere (e.g., the shard consumer tasks which are collected into `recv_handles` and awaited on shutdown).

Consider at minimum logging a warning if the spawned task's handle can be joined in a background collection, or wrapping the body with `if let Err(e) = ...` on the write calls (even though they can't currently fail).


Reviews (1): Last reviewed commit: "fix: spawn accept loop registration to p..."

Greptile also left 3 inline comments on this PR.

The accept loop in TransportHandle took two write locks inline
(peers + active_connections) for every accepted connection. Under
1000-node scale, contention on these locks caused the loop to fall
behind draining the bounded handshake channel (cap 32 in
saorsa-transport). Once the channel filled, all new connection
handoffs blocked, reader tasks were never spawned, and identity
exchange timed out — degrading upload times from ~175s to 358s+
after 17 hours of operation.

Fix: spawn the registration work (write locks) into a separate task
so the accept loop immediately returns to draining the channel.
This mirrors the sharding fix already applied to the message
receiving system (same file, same root cause pattern).

Includes a stress test that floods a node with 40 concurrent
connections and asserts >90% complete identity exchange.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@jacderida jacderida force-pushed the fix/accept-loop-write-lock-stall branch from bafc148 to 4dccb64 Compare April 14, 2026 14:04
@jacderida (Collaborator, Author) commented:

Re: Doc/constant mismatch (tests/accept_loop_stress.rs:21)

Fixed — updated the module doc to say 40 and corrected the tolerance comment from "5% rounded up" to "~7.5% tolerance".

Re: Stale entry window widened by spawned task (src/transport_handle.rs:1545)

Good observation. The stale entry window is wider with the spawned task, but the cleanup-on-next-send path already handles this. Investigating whether the lifecycle monitor's ConnectionEvent::Established makes this registration redundant is a worthwhile follow-up, but it's a larger change that shouldn't gate this fix. Leaving as-is for this PR.

Re: Dropped JoinHandle (src/transport_handle.rs:1545)

Fixed — the spawned task's handle is now awaited in a second lightweight task that logs a warning on failure.

@jacderida jacderida merged commit 3667104 into rc-2026.4.1 Apr 14, 2026
6 of 7 checks passed
jacderida added a commit that referenced this pull request Apr 14, 2026
…-lock-stall"

This reverts commit 3667104, reversing
changes made to 7c9775a.
jacderida added a commit that referenced this pull request Apr 14, 2026
…-stall

Revert PR #82: accept loop write lock stall fix
