Revert PR #82: accept loop write lock stall fix by jacderida · Pull Request #83 · saorsa-labs/saorsa-core

jacderida · 2026-04-14T22:02:15Z

Summary

Reverts #82 ("fix: spawn accept loop registration to prevent handshake channel stall", branch fix/accept-loop-write-lock-stall), which introduced a regression
Restores src/transport_handle.rs to its pre-#82 state
Removes the tests/accept_loop_stress.rs integration test added by #82

Test plan

CI passes on the RC branch after merge
No accept loop stalls observed in testnet after revert

🤖 Generated with Claude Code

Greptile Summary

This PR reverts PR #82, restoring inline write lock acquisition inside the accept loop and deleting the stress test that guarded against the related stall bug. The PR description states PR #82 "introduced a regression" but does not describe it, making it impossible to evaluate whether the tradeoff is sound.

The reverted code re-exposes a documented production failure mode: two write locks (peers + active_connections) held inline in the accept loop saturate the bounded handshake channel (cap 32) under 1000-node load, causing identity exchange timeouts and >2× upload time degradation after extended operation.
The deleted stress test was the only automated guard for this failure mode; removing it eliminates the ability to detect the bug in CI if it re-emerges.
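The saturation described above can be illustrated with a minimal sketch. It assumes only the channel capacity of 32 quoted in the review; it uses the standard library's `sync_channel` as a stand-in for the real handshake channel in `saorsa-transport`, and the function name `handoffs_before_full` is hypothetical.

```rust
use std::sync::mpsc::{sync_channel, TrySendError};

/// Pushes handoffs into a bounded channel that nothing drains and returns
/// how many were accepted before the channel reported `Full`.
/// Stand-in for the handshake channel; not the project's actual type.
pub fn handoffs_before_full(capacity: usize) -> usize {
    // `_rx` is kept alive (but never read) to model a stalled accept loop.
    let (tx, _rx) = sync_channel::<usize>(capacity);
    let mut accepted = 0;
    loop {
        match tx.try_send(accepted) {
            Ok(()) => accepted += 1,
            // The buffer is full: every further handoff would have to wait.
            Err(TrySendError::Full(_)) => return accepted,
            Err(TrySendError::Disconnected(_)) => {
                unreachable!("receiver is still alive")
            }
        }
    }
}
```

With a capacity of 32, a consumer that stops draining leaves room for exactly 32 pending handoffs; anything beyond that has to wait, which is the condition the review associates with identity exchange timeouts.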

Confidence Score: 4/5

Merging re-introduces a documented production stall bug and removes its only regression guard; safe to merge only if the undocumented regression from PR #82 is confirmed to be more severe.

Two P1 findings: the accept loop inline write lock pattern is a known production failure mode at 1000-node scale, and the stress test deletion removes the sole automated safety net. The PR description does not document what regression motivated the revert, making it impossible to confirm the tradeoff is intentional and sound.

src/transport_handle.rs (accept loop hot path) and tests/accept_loop_stress.rs (deleted regression guard)

Important Files Changed

Filename	Overview
src/transport_handle.rs	Reverts spawned-task registration back to inline write lock acquisition in the accept loop hot path, re-introducing the documented handshake channel stall at 1000-node scale
tests/accept_loop_stress.rs	Deletes the stress test that was the sole regression guard for accept loop stalls under concurrent connection pressure; no replacement test is introduced

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[accept loop: dual.accept_any] -->|connection arrives| B{rate limit OK?}
    B -->|no| A
    B -->|yes| C["(after revert) await register_new_channel\nwrite lock: peers"]
    C --> D["await active_connections.write().insert\nwrite lock: active_connections"]
    D --> A

    E["PR #82 approach (reverted)"] -->|spawn task| F["register_new_channel\nwrite lock: peers"]
    F --> G["active_connections.write().insert\nwrite lock: active_connections"]
    E -->|immediately returns| H[accept loop continues draining]

    style C fill:#f99,stroke:#c00
    style D fill:#f99,stroke:#c00
    style E fill:#9f9,stroke:#0a0
    style H fill:#9f9,stroke:#0a0

Comments Outside Diff (1)

tests/accept_loop_stress.rs

Regression guard deleted with no replacement

This test was the only automated signal that the accept loop can handle concurrent connection pressure without stalling identity exchange. Removing it means the bug it guards against (accept loop falling behind draining the handshake channel) can re-emerge silently. If the revert is intentional pending a revised fix for the underlying contention problem, the test should be retained (even temporarily disabled) so it can be re-enabled when the fix lands — deleting it entirely removes the institutional memory of how to reproduce the bug.
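One way to retain a test while keeping it out of the default run is Rust's `#[ignore]` attribute. The sketch below is an assumption about how that could look; the original contents of tests/accept_loop_stress.rs are not shown in this PR, so `accepted_connections` and the test name are hypothetical placeholders for the real workload.

```rust
use std::sync::mpsc::sync_channel;
use std::thread;

/// Hypothetical placeholder for the deleted test's workload: pushes
/// `attempts` handoffs through a bounded channel (cap 32) and counts how
/// many the consumer drained.
fn accepted_connections(attempts: usize) -> usize {
    let (tx, rx) = sync_channel::<usize>(32);
    let producer = thread::spawn(move || {
        for i in 0..attempts {
            tx.send(i).unwrap();
        }
        // `tx` is dropped here, which ends the consumer's iteration.
    });
    let accepted = rx.iter().count();
    producer.join().unwrap();
    accepted
}

#[test]
#[ignore = "disabled pending a revised fix for accept loop lock contention"]
fn accept_loop_handles_concurrent_connections() {
    assert_eq!(accepted_connections(1000), 1000);
}
```

An ignored test is still compiled, so it cannot silently rot, and it can be run on demand with `cargo test -- --ignored`.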

Reviews (1): Last reviewed commit: "Revert "Merge pull request #82 from saor..."

Greptile also left 1 inline comment on this PR.

greptile-apps · 2026-04-14T22:05:06Z

+                let channel_id = remote_sock.to_string();
+                let remote_addr = MultiAddr::quic(remote_sock);
+                // PeerConnected is emitted later when the peer's identity is
+                // authenticated via a signed message — not at transport level.
+                register_new_channel(&peers, &channel_id, &remote_addr).await;
+                active_connections.write().await.insert(channel_id);


Accept loop write lock stall re-introduced

This restores the exact code that PR #82 identified and fixed: register_new_channel acquires a peers write lock and the following line acquires an active_connections write lock, both inline inside the accept loop. At 1000-node scale, lock contention on these two guards causes the loop to fall behind draining dual.accept_any(), which sits behind the bounded handshake channel (cap 32 in saorsa-transport). Once that channel fills, all new connection handoffs block — identity exchange times out and upload performance degrades (documented: 175 s → 358 s+ after 17 h of operation).

The PR description says PR #82 "introduced a regression" but does not describe it. Without knowing what that regression was, it is impossible to evaluate whether reverting (and re-exposing this documented production failure mode) represents the right tradeoff, or whether a targeted fix to #82's implementation would be safer.
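The difference between inline and spawned registration can be sketched as follows. This is an analogy under stated assumptions, not the code from PR #82: it uses standard library threads and `std::sync::RwLock` instead of the project's async runtime, the map types are invented, and `drain_while_lock_held` is a hypothetical name. Only `peers`, `active_connections`, and the capacity of 32 come from the review.

```rust
use std::collections::{HashMap, HashSet};
use std::sync::mpsc::sync_channel;
use std::sync::{Arc, RwLock};
use std::thread;

/// Drains `n` handoffs while another party holds the `peers` write lock.
/// Registration is handed to worker threads, so the loop itself never
/// waits on a lock. Returns (drained while locked, registered afterwards).
pub fn drain_while_lock_held(n: usize) -> (usize, usize) {
    let peers: Arc<RwLock<HashMap<String, String>>> =
        Arc::new(RwLock::new(HashMap::new()));
    let active_connections: Arc<RwLock<HashSet<String>>> =
        Arc::new(RwLock::new(HashSet::new()));
    let (tx, rx) = sync_channel::<String>(32);

    // Simulate contention: the write lock is held for the whole drain.
    let guard = peers.write().unwrap();

    let producer = thread::spawn(move || {
        for i in 0..n {
            tx.send(format!("127.0.0.1:{}", 10_000 + i)).unwrap();
        }
    });

    let mut workers = Vec::new();
    let mut drained = 0;
    for channel_id in rx {
        let peers = Arc::clone(&peers);
        let active_connections = Arc::clone(&active_connections);
        // Registration waits for the locks on its own thread.
        workers.push(thread::spawn(move || {
            peers
                .write()
                .unwrap()
                .insert(channel_id.clone(), "quic".to_string());
            active_connections.write().unwrap().insert(channel_id);
        }));
        drained += 1;
    }
    producer.join().unwrap();

    // Release the lock so the pending registrations can complete.
    drop(guard);
    for worker in workers {
        worker.join().unwrap();
    }
    let registered = active_connections.read().unwrap().len();
    (drained, registered)
}
```

In this sketch the loop drains every handoff even though the lock is never free during the drain; with inline registration the first iteration would block until the lock was released. The sketch does not model whatever regression prompted the revert, which the PR does not describe, and spawning one worker per connection has costs of its own that are not explored here.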

Revert "Merge pull request #82 from saorsa-labs/fix/accept-loop-write…

bb28d23

…-lock-stall" This reverts commit 3667104, reversing changes made to 7c9775a.

jacderida merged commit 8c68682 into rc-2026.4.1 Apr 14, 2026
1 of 2 checks passed

greptile-apps bot reviewed Apr 14, 2026

jacderida merged 1 commit into rc-2026.4.1 from revert/accept-loop-write-lock-stall
