Skip to content

Revert PR #82: accept loop write lock stall fix#83

Merged
jacderida merged 1 commit intorc-2026.4.1from
revert/accept-loop-write-lock-stall
Apr 14, 2026
Merged

Revert PR #82: accept loop write lock stall fix#83
jacderida merged 1 commit intorc-2026.4.1from
revert/accept-loop-write-lock-stall

Conversation

@jacderida
Copy link
Copy Markdown
Collaborator

@jacderida jacderida commented Apr 14, 2026

Summary

Test plan

  • CI passes on the RC branch after merge
  • No accept loop stalls observed in testnet after revert

🤖 Generated with Claude Code

Greptile Summary

This PR reverts PR #82, restoring inline write lock acquisition inside the accept loop and deleting the stress test that guarded against the related stall bug. The PR description states PR #82 "introduced a regression" but does not describe it, making it impossible to evaluate whether the tradeoff is sound.

  • The reverted code re-exposes a documented production failure mode: two write locks (peers + active_connections) held inline in the accept loop saturate the bounded handshake channel (cap 32) under 1000-node load, causing identity exchange timeouts and >2× upload time degradation after extended operation.
  • The deleted stress test was the only automated guard for this failure mode; removing it eliminates the ability to detect the bug in CI if it re-emerges.

Confidence Score: 4/5

Merging re-introduces a documented production stall bug and removes its only regression guard; safe to merge only if the undocumented regression from PR #82 is confirmed to be more severe.

Two P1 findings: the accept loop inline write lock pattern is a known production failure mode at 1000-node scale, and the stress test deletion removes the sole automated safety net. The PR description does not document what regression motivated the revert, making it impossible to confirm the tradeoff is intentional and sound.

src/transport_handle.rs (accept loop hot path) and tests/accept_loop_stress.rs (deleted regression guard)

Important Files Changed

Filename Overview
src/transport_handle.rs Reverts spawned-task registration back to inline write lock acquisition in the accept loop hot path, re-introducing the documented handshake channel stall at 1000-node scale
tests/accept_loop_stress.rs Deletes the stress test that was the sole regression guard for accept loop stalls under concurrent connection pressure; no replacement test is introduced

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[accept loop: dual.accept_any] -->|connection arrives| B{rate limit OK?}
    B -->|no| A
    B -->|yes| C["(after revert) await register_new_channel\nwrite lock: peers"]
    C --> D["await active_connections.write().insert\nwrite lock: active_connections"]
    D --> A

    E["PR #82 approach (reverted)"] -->|spawn task| F["register_new_channel\nwrite lock: peers"]
    F --> G["active_connections.write().insert\nwrite lock: active_connections"]
    E -->|immediately returns| H[accept loop continues draining]

    style C fill:#f99,stroke:#c00
    style D fill:#f99,stroke:#c00
    style E fill:#9f9,stroke:#0a0
    style H fill:#9f9,stroke:#0a0
Loading

Comments Outside Diff (1)

  1. tests/accept_loop_stress.rs

    P1 Regression guard deleted with no replacement

    This test was the only automated signal that the accept loop can handle concurrent connection pressure without stalling identity exchange. Removing it means the bug it guards against (accept loop falling behind draining the handshake channel) can re-emerge silently. If the revert is intentional pending a revised fix for the underlying contention problem, the test should be retained (even temporarily disabled) so it can be re-enabled when the fix lands — deleting it entirely removes the institutional memory of how to reproduce the bug.

    Prompt To Fix With AI
    This is a comment left during a code review.
    Path: tests/accept_loop_stress.rs
    Line: 1
    
    Comment:
    **Regression guard deleted with no replacement**
    
    This test was the only automated signal that the accept loop can handle concurrent connection pressure without stalling identity exchange. Removing it means the bug it guards against (accept loop falling behind draining the handshake channel) can re-emerge silently. If the revert is intentional pending a revised fix for the underlying contention problem, the test should be retained (even temporarily disabled) so it can be re-enabled when the fix lands — deleting it entirely removes the institutional memory of how to reproduce the bug.
    
    How can I resolve this? If you propose a fix, please make it concise.
Prompt To Fix All With AI
This is a comment left during a code review.
Path: src/transport_handle.rs
Line: 1531-1536

Comment:
**Accept loop write lock stall re-introduced**

This restores the exact code that PR #82 identified and fixed: `register_new_channel` acquires a `peers` write lock and the following line acquires an `active_connections` write lock, both inline inside the accept loop. At 1000-node scale, lock contention on these two guards causes the loop to fall behind draining `dual.accept_any()`, which sits behind the bounded handshake channel (cap 32 in `saorsa-transport`). Once that channel fills, all new connection handoffs block — identity exchange times out and upload performance degrades (documented: 175 s → 358 s+ after 17 h of operation).

The PR description says PR #82 "introduced a regression" but does not describe it. Without knowing what that regression was, it is impossible to evaluate whether reverting (and re-exposing this documented production failure mode) represents the right tradeoff, or whether a targeted fix to #82's implementation would be safer.

How can I resolve this? If you propose a fix, please make it concise.

---

This is a comment left during a code review.
Path: tests/accept_loop_stress.rs
Line: 1

Comment:
**Regression guard deleted with no replacement**

This test was the only automated signal that the accept loop can handle concurrent connection pressure without stalling identity exchange. Removing it means the bug it guards against (accept loop falling behind draining the handshake channel) can re-emerge silently. If the revert is intentional pending a revised fix for the underlying contention problem, the test should be retained (even temporarily disabled) so it can be re-enabled when the fix lands — deleting it entirely removes the institutional memory of how to reproduce the bug.

How can I resolve this? If you propose a fix, please make it concise.

Reviews (1): Last reviewed commit: "Revert "Merge pull request #82 from saor..." | Re-trigger Greptile

Greptile also left 1 inline comment on this PR.

…-lock-stall"

This reverts commit 3667104, reversing
changes made to 7c9775a.
@jacderida jacderida merged commit 8c68682 into rc-2026.4.1 Apr 14, 2026
1 of 2 checks passed
Comment thread src/transport_handle.rs
Comment on lines +1531 to +1536
let channel_id = remote_sock.to_string();
let remote_addr = MultiAddr::quic(remote_sock);
// PeerConnected is emitted later when the peer's identity is
// authenticated via a signed message — not at transport level.
register_new_channel(&peers, &channel_id, &remote_addr).await;
active_connections.write().await.insert(channel_id);
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Accept loop write lock stall re-introduced

This restores the exact code that PR #82 identified and fixed: register_new_channel acquires a peers write lock and the following line acquires an active_connections write lock, both inline inside the accept loop. At 1000-node scale, lock contention on these two guards causes the loop to fall behind draining dual.accept_any(), which sits behind the bounded handshake channel (cap 32 in saorsa-transport). Once that channel fills, all new connection handoffs block — identity exchange times out and upload performance degrades (documented: 175 s → 358 s+ after 17 h of operation).

The PR description says PR #82 "introduced a regression" but does not describe it. Without knowing what that regression was, it is impossible to evaluate whether reverting (and re-exposing this documented production failure mode) represents the right tradeoff, or whether a targeted fix to #82's implementation would be safer.

Prompt To Fix With AI
This is a comment left during a code review.
Path: src/transport_handle.rs
Line: 1531-1536

Comment:
**Accept loop write lock stall re-introduced**

This restores the exact code that PR #82 identified and fixed: `register_new_channel` acquires a `peers` write lock and the following line acquires an `active_connections` write lock, both inline inside the accept loop. At 1000-node scale, lock contention on these two guards causes the loop to fall behind draining `dual.accept_any()`, which sits behind the bounded handshake channel (cap 32 in `saorsa-transport`). Once that channel fills, all new connection handoffs block — identity exchange times out and upload performance degrades (documented: 175 s → 358 s+ after 17 h of operation).

The PR description says PR #82 "introduced a regression" but does not describe it. Without knowing what that regression was, it is impossible to evaluate whether reverting (and re-exposing this documented production failure mode) represents the right tradeoff, or whether a targeted fix to #82's implementation would be safer.

How can I resolve this? If you propose a fix, please make it concise.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant