Revert PR #82: accept loop write lock stall fix (#83)
Conversation
```rust
let channel_id = remote_sock.to_string();
let remote_addr = MultiAddr::quic(remote_sock);
// PeerConnected is emitted later when the peer's identity is
// authenticated via a signed message — not at transport level.
register_new_channel(&peers, &channel_id, &remote_addr).await;
active_connections.write().await.insert(channel_id);
```
Accept loop write lock stall re-introduced
This restores the exact code that PR #82 identified and fixed: register_new_channel acquires a peers write lock and the following line acquires an active_connections write lock, both inline inside the accept loop. At 1000-node scale, lock contention on these two guards causes the loop to fall behind draining dual.accept_any(), which sits behind the bounded handshake channel (cap 32 in saorsa-transport). Once that channel fills, all new connection handoffs block — identity exchange times out and upload performance degrades (documented: 175 s → 358 s+ after 17 h of operation).
The PR description says PR #82 "introduced a regression" but does not describe it. Without knowing what that regression was, it is impossible to evaluate whether reverting (and re-exposing this documented production failure mode) represents the right tradeoff, or whether a targeted fix to #82's implementation would be safer.
Prompt To Fix With AI
This is a comment left during a code review.
Path: src/transport_handle.rs
Line: 1531-1536
Comment: **Accept loop write lock stall re-introduced** (full text in the review comment above).
How can I resolve this? If you propose a fix, please make it concise.
Summary

- Reverts PR #82 (`fix/accept-loop-write-lock-stall`), which introduced a regression.
- Restores `src/transport_handle.rs` to its state before "fix: spawn accept loop registration to prevent handshake channel stall" (#82).
- Removes the `tests/accept_loop_stress.rs` integration test added by "fix: spawn accept loop registration to prevent handshake channel stall" (#82).

Test plan
🤖 Generated with Claude Code
Greptile Summary
This PR reverts PR #82, restoring inline write lock acquisition inside the accept loop and deleting the stress test that guarded against the related stall bug. The PR description states PR #82 "introduced a regression" but does not describe it, making it impossible to evaluate whether the tradeoff is sound.
Two write locks (`peers` + `active_connections`) held inline in the accept loop saturate the bounded handshake channel (cap 32) under 1000-node load, causing identity exchange timeouts and >2× upload time degradation after extended operation.

Confidence Score: 4/5
Merging re-introduces a documented production stall bug and removes its only regression guard; safe to merge only if the undocumented regression from PR #82 is confirmed to be more severe.
Two P1 findings: the accept loop inline write lock pattern is a known production failure mode at 1000-node scale, and the stress test deletion removes the sole automated safety net. The PR description does not document what regression motivated the revert, making it impossible to confirm the tradeoff is intentional and sound.
src/transport_handle.rs (accept loop hot path) and tests/accept_loop_stress.rs (deleted regression guard)
Important Files Changed
Flowchart
```mermaid
%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[accept loop: dual.accept_any] -->|connection arrives| B{rate limit OK?}
    B -->|no| A
    B -->|yes| C["(after revert) await register_new_channel\nwrite lock: peers"]
    C --> D["await active_connections.write().insert\nwrite lock: active_connections"]
    D --> A
    E["PR #82 approach (reverted)"] -->|spawn task| F["register_new_channel\nwrite lock: peers"]
    F --> G["active_connections.write().insert\nwrite lock: active_connections"]
    E -->|immediately returns| H[accept loop continues draining]
    style C fill:#f99,stroke:#c00
    style D fill:#f99,stroke:#c00
    style E fill:#9f9,stroke:#0a0
    style H fill:#9f9,stroke:#0a0
```

Comments Outside Diff (1)
`tests/accept_loop_stress.rs`

This test was the only automated signal that the accept loop can handle concurrent connection pressure without stalling identity exchange. Removing it means the bug it guards against (accept loop falling behind draining the handshake channel) can re-emerge silently. If the revert is intentional pending a revised fix for the underlying contention problem, the test should be retained (even temporarily disabled) so it can be re-enabled when the fix lands — deleting it entirely removes the institutional memory of how to reproduce the bug.
Reviews (1): Last reviewed commit: "Revert "Merge pull request #82 from saor..."