
fix: release peers write guard before identity announce send#86

Open
grumbach wants to merge 1 commit into main from fix/connection-monitor-lock-across-await

Conversation


@grumbach grumbach commented Apr 15, 2026

Summary

connection_lifecycle_monitor_with_rx held a peers write guard across dual_node.send_to_peer_optimized(...).await on every ConnectionEvent::Established. A slow QUIC send therefore starved every concurrent reader of peers for the full network round-trip, including the 8 message-dispatch shard consumers, DHT queries, and the accept loop.

This is a latent hazard, not the rc-17/rc-19 regression (that was fixed in rc-18 via the mpsc shard-channel serialisation in #80). It is the next logical cleanup of the same class of bug: one slow remote cascading into multi-second reader starvation.

Fix

Extract the Established arm body into TransportHandle::handle_connection_established and scope the peers write guard tightly around the synchronous map mutation so it is released before the identity-announce send runs.

Operation order is preserved: create_identity_announce_bytes is now invoked via a FnOnce() -> Option<Vec<u8>> closure called after the map update, so any future side effects on node_identity still follow channel registration.

1. active_connections.write().insert(...)        // released immediately
2. peers.write() + update-or-insert              // scoped, released here ← FIX
3. make_announce_bytes() (sync closure)
4. send_announce(...).await                      // no locks held
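The four-step order above can be sketched in simplified form. This is an illustrative sketch, not the PR's actual code: `std::sync::RwLock` stands in for `tokio::sync::RwLock`, the map contents and the `make_announce_bytes` closure are hypothetical stand-ins, and the final `.await` send is elided. The point is step 2: the write guard lives in its own block and is dropped at the closing brace, before any send could run.

```rust
use std::collections::HashMap;
use std::sync::{Arc, RwLock};

// Sketch of the guard-scoping pattern. Types and names are stand-ins;
// the real code uses tokio's async RwLock and a QUIC send.
fn handle_connection_established(
    peers: &Arc<RwLock<HashMap<String, u64>>>,
    peer_id: &str,
    now: u64,
) -> Option<Vec<u8>> {
    // Step 2: scoped write guard — update-or-insert, then release.
    {
        let mut guard = peers.write().expect("peers lock poisoned");
        guard
            .entry(peer_id.to_string())
            .and_modify(|connected_at| *connected_at = now)
            .or_insert(now);
    } // ← guard dropped here, before the send (the FIX)

    // Step 3: synchronous closure, invoked after the map update so any
    // future side effects still follow channel registration.
    let make_announce_bytes = || Some(b"announce".to_vec());
    make_announce_bytes()
    // Step 4: send_announce(bytes).await would run here with no locks held.
}

fn main() {
    let peers = Arc::new(RwLock::new(HashMap::new()));
    let bytes = handle_connection_established(&peers, "peer-a", 42);
    assert_eq!(bytes.as_deref(), Some(b"announce".as_slice()));
    assert_eq!(peers.read().unwrap().get("peer-a"), Some(&42));
}
```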

Regression tests

Two tests fail on reverted code and pass on the fix:

  • reader_not_blocked_by_slow_identity_announce: parks a fake send for 1200ms and asserts a concurrent peers.read() completes within 400ms. Unfixed: times out at 400ms. Fixed: completes in microseconds.
  • concurrent_sends_run_in_parallel_not_serialised: spawns 8 concurrent helper calls with 400ms sends and asserts total wall-clock < 1600ms (half of fully-serialised). Unfixed: 3.22s (serialised on write lock). Fixed: 0.40s (truly parallel).
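The intent of the first regression test can be illustrated with a thread-based sketch (plain `std` threads and a sleep stand in for the tokio runtime and the 1200ms parked send; all names here are hypothetical): the helper mutates `peers` under a scoped guard, then "sends" with no lock held, so a concurrent reader completes immediately instead of waiting out the send.

```rust
use std::sync::{Arc, RwLock};
use std::thread;
use std::time::{Duration, Instant};

fn main() {
    let peers = Arc::new(RwLock::new(0u64));

    let p = Arc::clone(&peers);
    let helper = thread::spawn(move || {
        // Fixed behaviour: mutate under a tightly scoped guard...
        {
            let mut g = p.write().unwrap();
            *g += 1;
        } // guard released here
        // ...then perform the slow "send" with no lock held.
        thread::sleep(Duration::from_millis(300));
    });

    // Let the helper reach its send phase, then try to read `peers`.
    thread::sleep(Duration::from_millis(50));
    let start = Instant::now();
    let _v = *peers.read().unwrap(); // must not block on the sleeping sender
    assert!(
        start.elapsed() < Duration::from_millis(200),
        "reader was starved by the slow send"
    );
    helper.join().unwrap();
}
```

Moving the write into its own block before the sleep is exactly what makes the reader's latency independent of the send duration; hold the guard across the sleep instead and the read blocks for the full 300ms.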

Plus four contract tests: success path, send-error preservation, reconnect connected_at advancement, missing-announce-bytes skip. And one disjoint-key concurrent-correctness test.

10-run stability verified on both regression tests (0 flakes).

Test plan

  • cargo test --lib — 289/289 pass
  • cargo test --tests — 344/344 pass (lib + all integration suites)
  • cargo fmt -- --check
  • cargo clippy --lib --tests -- -D warnings -D clippy::unwrap_used -D clippy::expect_used
  • cargo doc --no-deps --lib
  • 10-run stability on both regression tests (1.20s and 0.40s, zero variance)
  • Baseline reproduced on reverted code (both regression tests fail)

Review

Reviewed across two rounds by three independent hostile reviewers:

| Reviewer    | Round 1                           | Round 2      |
|-------------|-----------------------------------|--------------|
| Concurrency | MINOR-POLISH                      | MINOR-POLISH |
| Test rigor  | MINOR-POLISH                      | MINOR-POLISH |
| Regression  | MAJOR-REWORK (misread, retracted) | SHIP         |

Every actionable finding addressed. Full report available on request.

Scope

Out of scope (deferred to follow-up PRs)

  1. Pre-existing inter-map windows between active_connections and peers writes on both Established and Lost/Failed arms (two separate .write().await calls are not atomic — a concurrent reader can observe one map updated but not the other). Flagged by Reviewer A as MAJOR but predates this PR.
  2. Zombie peer state when send_to_peer_optimized fails — the peer is still marked Connected even though the remote will never authenticate. Contract question.
  3. connected_at on reconnect — currently bumps on every Established; arguably should mean "first connected".
  4. Larger DashMap/scc migration for the three hot maps. Separate design PR; uses scc (not DashMap) per reviewed research.
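Item 1's inter-map window can be shown with a minimal sketch (two boolean `RwLock`s stand in for the real `active_connections` and `peers` maps; this is illustrative only, not the PR's code): between the two separate writes there is a real interleaving point where a reader observes one structure updated and the other not.

```rust
use std::sync::RwLock;

fn main() {
    // Stand-ins for the two maps; `true` means "this map saw the update".
    let active_connections = RwLock::new(false);
    let peers = RwLock::new(false);

    // First write: active_connections is updated and its guard dropped.
    *active_connections.write().unwrap() = true;

    // The window: a concurrent reader running here sees the maps disagree —
    // active_connections updated, peers not yet. Two separate write()
    // acquisitions are not one atomic operation.
    assert!(*active_connections.read().unwrap());
    assert!(!*peers.read().unwrap());

    // Second write closes the window.
    *peers.write().unwrap() = true;
    assert!(*peers.read().unwrap());
}
```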

Greptile Summary

This PR extracts the ConnectionEvent::Established handling from connection_lifecycle_monitor_with_rx into TransportHandle::handle_connection_established, scoping the peers write guard to the synchronous map mutation only so it is released before the identity-announce .await. The accompanying tests confirm both that a concurrent reader is not blocked during a slow send and that N parallel sends actually run in parallel.

Confidence Score: 5/5

Safe to merge — fix is logically correct, well-tested, and all remaining observations are P2 style notes.

The peers write guard is now demonstrably scoped to the synchronous map mutation only. The let-chain syntax is valid under edition = "2024". Both regression tests correctly reproduce the bug on reverted code and pass on the fix. No P0 or P1 issues found.

No files require special attention.

Important Files Changed

| Filename                | Overview |
|-------------------------|----------|
| src/transport_handle.rs | Correct guard-scoping fix with comprehensive tests; let-chain syntax is valid under edition = "2024". No issues found. |

Sequence Diagram

```mermaid
sequenceDiagram
    participant EL as Event Loop
    participant HCE as handle_connection_established
    participant AC as active_connections
    participant P as peers RwLock
    participant R as Concurrent Reader
    participant QUIC as QUIC send

    Note over EL,QUIC: BEFORE fix
    EL->>HCE: ConnectionEvent::Established
    HCE->>AC: write + insert (guard dropped at semicolon)
    HCE->>P: write guard acquired
    Note over HCE,P: map update synchronous
    HCE->>QUIC: send_to_peer_optimized await (guard still held)
    R->>P: read await BLOCKED for full network RTT
    QUIC-->>HCE: Ok
    HCE->>P: guard finally dropped

    Note over EL,QUIC: AFTER fix
    EL->>HCE: ConnectionEvent::Established
    HCE->>AC: write + insert (guard dropped at semicolon)
    HCE->>P: write guard acquired in scoped block
    Note over HCE,P: map update synchronous
    HCE->>P: guard dropped before any await
    R->>P: read succeeds immediately
    HCE->>HCE: make_announce_bytes sync call
    HCE->>QUIC: send_to_peer_optimized await (no lock held)
```

Reviews (1): Last reviewed commit: "fix: release peers write guard before id..."

Copilot AI review requested due to automatic review settings April 15, 2026 08:42

Copilot AI left a comment


Pull request overview

Refactors connection-established handling to avoid holding the peers write lock across an awaited identity-announce send, preventing reader starvation under slow QUIC sends.

Changes:

  • Extracts ConnectionEvent::Established logic into TransportHandle::handle_connection_established with tight scoping of the peers write guard.
  • Preserves operation ordering while moving identity announce creation/sending outside of lock scope via injected closures.
  • Adds regression + contract tests validating non-blocking reads and parallel send behavior.

Comment thread src/transport_handle.rs
});

// Wait until the send phase is in progress, then try to read `peers`.
send_started.notified().await;

Copilot AI Apr 15, 2026


send_started.notified().await has no timeout. If handle_connection_established panics or the send closure is never invoked (e.g., future refactor changes the send conditions), this test can hang indefinitely and stall CI. Wrap the notification wait in tokio::time::timeout with a clear failure message, and consider aborting/joining helper_task on timeout so the failure is reported promptly.

Suggested change — replace `send_started.notified().await;` with:

```rust
if timeout(READ_BUDGET, send_started.notified()).await.is_err() {
    helper_task.abort();
    let _ = helper_task.await;
    panic!(
        "timed out after {READ_BUDGET:?} waiting for handle_connection_established \
         to enter the send phase and notify the test"
    );
}
```
