
fix: spawn accept loop registration to prevent handshake channel stall#82

Merged

jacderida merged 1 commit into rc-2026.4.1 from fix/accept-loop-write-lock-stall on Apr 14, 2026
Conversation

@jacderida (Collaborator) commented Apr 14, 2026

Summary

  • The accept loop in TransportHandle took two write locks inline (peers + active_connections) for every accepted connection. Under 1000-node scale, contention on these locks caused the loop to fall behind draining the bounded handshake channel (cap 32 in saorsa-transport). Once full, all new connection handoffs blocked, reader tasks were never spawned, and identity exchange timed out.
  • Fix: spawn the registration work into a separate task so the accept loop immediately returns to draining the channel. Same pattern as the sharding fix already applied to the message receiving system.
  • Includes a stress test (tests/accept_loop_stress.rs) that floods a node with 40 concurrent connections and asserts >90% complete identity exchange.

Evidence from ant-rc-18 testnet (1000 nodes, 17+ hours)

Upload times degraded from ~175s to 358s+ after 17 hours. Root cause investigation:

  • Shard channel full warnings: 0 (confirms the PR #80 fix for shard channel saturation at 1000-node scale is working)
  • Identity exchange timeouts: 208 in 30 minutes on a single client
  • Genesis bootstrap node: 7,253 "Accepted connection (unified path)" vs 6,174 "spawning reader task" — 1,079 connections lost to the stalled channel
  • Node-side comparison showed "Accepted connection" logged but reader task never spawned, matching the bounded channel(32) between nat_traversal_api.rs:4062 and p2p_endpoint.rs:2476

Companion PR

saorsa-labs/saorsa-transport fix/handshake-channel-capacity — increases the handshake channel from 32 to 1024 as a belt-and-suspenders measure.

Test plan

  • cargo check passes
  • cargo test --test accept_loop_stress passes (40 concurrent connections, >90% identity exchange success)

🤖 Generated with Claude Code

Greptile Summary

This PR fixes the accept loop stall that caused 1,079 dropped connections and identity exchange timeouts on the ant-rc-18 testnet by moving the two blocking write-lock operations (peers + active_connections) off the hot accept path into a detached tokio::spawn task — the same pattern already used by the sharded message dispatcher. The accompanying stress test (tests/accept_loop_stress.rs) validates ≥90% identity exchange success under 40 concurrent connections.

Confidence Score: 5/5

  • Safe to merge — all findings are P2 style/hardening suggestions; no blocking defects introduced.
  • The fix is minimal and well-motivated by concrete testnet evidence. All remaining comments are P2: a doc inconsistency in the test, a slightly wider (but pre-existing) stale-entry window, and a dropped JoinHandle. None of these block correctness or production reliability at the scale this PR targets.
  • No files require special attention.

Important Files Changed

Filename | Overview
src/transport_handle.rs | Accept loop now spawns registration work off the hot path; introduces a slightly wider stale-entry window vs. the previous inline approach, and drops the spawned JoinHandle silently.
tests/accept_loop_stress.rs | New stress test for the accept loop; contains a minor doc inconsistency (module comment says 50 clients, constant is 40) but the test logic and assertions are sound.

Sequence Diagram

sequenceDiagram
    participant T as Transport (QUIC)
    participant AL as Accept Loop
    participant RT as Registration Task (spawned)
    participant LM as Lifecycle Monitor

    T->>AL: accept_any() returns remote_sock
    AL->>AL: rate_limiter.check_ip()
    AL->>RT: tokio::spawn (peers.clone, active_connections.clone)
    Note over AL: immediately loops back to accept_any()
    AL->>T: accept_any() [ready for next connection]

    par Registration Task runs concurrently
        RT->>RT: "register_new_channel(&peers, channel_id, addr)"
        RT->>RT: active_connections.insert(channel_id)
    and Lifecycle Monitor handles transport events
        T->>LM: "ConnectionEvent::Established { remote_address }"
        LM->>LM: active_connections.insert(channel_id)
        LM->>LM: peers.insert or update status
        LM->>T: send_to_peer_optimized (identity announce)
        T-->>LM: ConnectionEvent::Lost / Failed
        LM->>LM: active_connections.remove(channel_id)
        LM->>LM: peers.remove(channel_id)
    end

    Note over RT,LM: ⚠ If Lost fires before RT runs,<br/>RT re-inserts a stale entry
This is a comment left during a code review.
Path: tests/accept_loop_stress.rs
Line: 21

Comment:
**Doc/constant mismatch**

The module-level doc says "floods it with **50** concurrent client connections" but `NUM_CLIENTS` is 40. The assertion comment on line 141 also refers to "5% rounded up" but `40 / 20 + 1 = 3` is actually 7.5% + rounding. These inconsistencies don't affect correctness but will confuse future readers.

```suggestion
//! This test creates one server node and floods it with 40 concurrent
```


---

This is a comment left during a code review.
Path: src/transport_handle.rs
Line: 1538-1545

Comment:
**Stale entry window widened by spawned task**

The spawned task's window is larger than the previous inline version: the Tokio scheduler can now delay it until after the connection lifecycle monitor has fully processed an `Established` + `Lost`/`Failed` pair. When that happens, the task re-inserts a disconnected channel into both `peers` and `active_connections`, and nothing will ever remove it (the lifecycle monitor already fired its `Lost` event). The stale channel then shows up in `is_connection_active()` and `list_active_connections()` until the next send attempt to it cleans it up.

The lifecycle monitor at `connection_lifecycle_monitor_with_rx` (lines ~1989–2008) already handles inbound registration via `ConnectionEvent::Established`, including the same `peers` insert if the channel is unknown. If `accept_any()` and `ConnectionEvent::Established` always fire for the same connections, the register-in-spawned-task path is redundant and could be dropped entirely, eliminating the race.


---

This is a comment left during a code review.
Path: src/transport_handle.rs
Line: 1540-1545

Comment:
**Dropped `JoinHandle` silences task failures**

The handle returned by `tokio::spawn` is immediately dropped, so any panic or unexpected error inside the registration task is swallowed without logging. The codebase's strict no-panic rule makes this unlikely to matter in practice, but it diverges from the careful error-reporting pattern used elsewhere (e.g., the shard consumer tasks which are collected into `recv_handles` and awaited on shutdown).

Consider at minimum logging a warning if the spawned task's handle can be joined in a background collection, or wrapping the body with `if let Err(e) = ...` on the write calls (even though they can't currently fail).


Reviews (1): Last reviewed commit: "fix: spawn accept loop registration to p..."

Greptile also left 3 inline comments on this PR.

The accept loop in TransportHandle took two write locks inline
(peers + active_connections) for every accepted connection. Under
1000-node scale, contention on these locks caused the loop to fall
behind draining the bounded handshake channel (cap 32 in
saorsa-transport). Once the channel filled, all new connection
handoffs blocked, reader tasks were never spawned, and identity
exchange timed out — degrading upload times from ~175s to 358s+
after 17 hours of operation.

Fix: spawn the registration work (write locks) into a separate task
so the accept loop immediately returns to draining the channel.
This mirrors the sharding fix already applied to the message
receiving system (same file, same root cause pattern).

Includes a stress test that floods a node with 40 concurrent
connections and asserts >90% complete identity exchange.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@jacderida jacderida force-pushed the fix/accept-loop-write-lock-stall branch from bafc148 to 4dccb64 Compare April 14, 2026 14:04
@jacderida (Collaborator, Author) commented:

Re: Doc/constant mismatch (tests/accept_loop_stress.rs:21)

Fixed — updated the module doc to say 40 and corrected the tolerance comment from "5% rounded up" to "~7.5% tolerance".

Re: Stale entry window widened by spawned task (src/transport_handle.rs:1545)

Good observation. The stale entry window is wider with the spawned task, but the cleanup-on-next-send path already handles this. Investigating whether the lifecycle monitor's ConnectionEvent::Established makes this registration redundant is a worthwhile follow-up, but it's a larger change that shouldn't gate this fix. Leaving as-is for this PR.

Re: Dropped JoinHandle (src/transport_handle.rs:1545)

Fixed — the spawned task's handle is now awaited in a second lightweight task that logs a warning on failure.

@jacderida jacderida merged commit 3667104 into rc-2026.4.1 Apr 14, 2026
6 of 7 checks passed
jacderida added a commit that referenced this pull request Apr 14, 2026
…-lock-stall"

This reverts commit 3667104, reversing
changes made to 7c9775a.
jacderida added a commit that referenced this pull request Apr 14, 2026
…-stall

Revert PR #82: accept loop write lock stall fix
