
fix: release peers write guard before identity announce send#86

Open
grumbach wants to merge 1 commit into main from fix/connection-monitor-lock-across-await

Conversation


@grumbach grumbach commented Apr 15, 2026

Summary

connection_lifecycle_monitor_with_rx held a peers write guard across dual_node.send_to_peer_optimized(...).await on every ConnectionEvent::Established. A slow QUIC send therefore starved every concurrent reader of peers for the full network round-trip, including the 8 message-dispatch shard consumers, DHT queries, and the accept loop.

This is a latent hazard, not the rc-17/rc-19 regression (that was fixed in rc-18 via the mpsc shard-channel serialisation in #80). It is the next logical cleanup of the same class of bug: one slow remote cascading into multi-second reader starvation.

Fix

Extract the Established arm body into TransportHandle::handle_connection_established and scope the peers write guard tightly around the synchronous map mutation so it is released before the identity-announce send runs.

Operation order is preserved: create_identity_announce_bytes is now invoked via a FnOnce() -> Option<Vec<u8>> closure called after the map update, so any future side effects on node_identity still follow channel registration.

1. active_connections.write().insert(...)        // released immediately
2. peers.write() + update-or-insert              // scoped, released here ← FIX
3. make_announce_bytes() (sync closure)
4. send_announce(...).await                      // no locks held
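The four-step order above can be sketched in simplified form. This is an illustrative sketch, not the PR's actual code: `std::sync::RwLock` stands in for `tokio::sync::RwLock`, the map contents and the `make_announce_bytes` closure are hypothetical stand-ins, and the final `.await` send is elided. The point is step 2: the write guard lives in its own block and is dropped at the closing brace, before any send could run.

```rust
use std::collections::HashMap;
use std::sync::{Arc, RwLock};

// Sketch of the guard-scoping pattern. Types and names are stand-ins;
// the real code uses tokio's async RwLock and a QUIC send.
fn handle_connection_established(
    peers: &Arc<RwLock<HashMap<String, u64>>>,
    peer_id: &str,
    now: u64,
) -> Option<Vec<u8>> {
    // Step 2: scoped write guard — update-or-insert, then release.
    {
        let mut guard = peers.write().expect("peers lock poisoned");
        guard
            .entry(peer_id.to_string())
            .and_modify(|connected_at| *connected_at = now)
            .or_insert(now);
    } // ← guard dropped here, before the send (the FIX)

    // Step 3: synchronous closure, invoked after the map update so any
    // future side effects still follow channel registration.
    let make_announce_bytes = || Some(b"announce".to_vec());
    make_announce_bytes()
    // Step 4: send_announce(bytes).await would run here with no locks held.
}

fn main() {
    let peers = Arc::new(RwLock::new(HashMap::new()));
    let bytes = handle_connection_established(&peers, "peer-a", 42);
    assert_eq!(bytes.as_deref(), Some(b"announce".as_slice()));
    assert_eq!(peers.read().unwrap().get("peer-a"), Some(&42));
}
```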

Regression tests

Two tests fail on reverted code and pass on the fix:

  • reader_not_blocked_by_slow_identity_announce: parks a fake send for 1200ms and asserts a concurrent peers.read() completes within 400ms. Unfixed: times out at 400ms. Fixed: completes in microseconds.
  • concurrent_sends_run_in_parallel_not_serialised: spawns 8 concurrent helper calls with 400ms sends and asserts total wall-clock < 1600ms (half of fully-serialised). Unfixed: 3.22s (serialised on write lock). Fixed: 0.40s (truly parallel).
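The intent of the first regression test can be illustrated with a thread-based sketch (plain `std` threads and a sleep stand in for the tokio runtime and the 1200ms parked send; all names here are hypothetical): the helper mutates `peers` under a scoped guard, then "sends" with no lock held, so a concurrent reader completes immediately instead of waiting out the send.

```rust
use std::sync::{Arc, RwLock};
use std::thread;
use std::time::{Duration, Instant};

fn main() {
    let peers = Arc::new(RwLock::new(0u64));

    let p = Arc::clone(&peers);
    let helper = thread::spawn(move || {
        // Fixed behaviour: mutate under a tightly scoped guard...
        {
            let mut g = p.write().unwrap();
            *g += 1;
        } // guard released here
        // ...then perform the slow "send" with no lock held.
        thread::sleep(Duration::from_millis(300));
    });

    // Let the helper reach its send phase, then try to read `peers`.
    thread::sleep(Duration::from_millis(50));
    let start = Instant::now();
    let _v = *peers.read().unwrap(); // must not block on the sleeping sender
    assert!(
        start.elapsed() < Duration::from_millis(200),
        "reader was starved by the slow send"
    );
    helper.join().unwrap();
}
```

Moving the write into its own block before the sleep is exactly what makes the reader's latency independent of the send duration; hold the guard across the sleep instead and the read blocks for the full 300ms.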

Plus four contract tests: success path, send-error preservation, reconnect connected_at advancement, missing-announce-bytes skip. And one disjoint-key concurrent-correctness test.

10-run stability verified on both regression tests (0 flakes).

Test plan

  • cargo test --lib — 289/289 pass
  • cargo test --tests — 344/344 pass (lib + all integration suites)
  • cargo fmt -- --check
  • cargo clippy --lib --tests -- -D warnings -D clippy::unwrap_used -D clippy::expect_used
  • cargo doc --no-deps --lib
  • 10-run stability on both regression tests (1.20s and 0.40s, zero variance)
  • Baseline reproduced on reverted code (both regression tests fail)

Review

Reviewed across two rounds by three independent hostile reviewers:

| Reviewer    | Round 1                           | Round 2      |
|-------------|-----------------------------------|--------------|
| Concurrency | MINOR-POLISH                      | MINOR-POLISH |
| Test rigor  | MINOR-POLISH                      | MINOR-POLISH |
| Regression  | MAJOR-REWORK (misread, retracted) | SHIP         |

Every actionable finding addressed. Full report available on request.

Scope

Out of scope (deferred to follow-up PRs)

  1. Pre-existing inter-map windows between active_connections and peers writes on both Established and Lost/Failed arms (two separate .write().await calls are not atomic — a concurrent reader can observe one map updated but not the other). Flagged by Reviewer A as MAJOR but predates this PR.
  2. Zombie peer state when send_to_peer_optimized fails — the peer is still marked Connected even though the remote will never authenticate. Contract question.
  3. connected_at on reconnect — currently bumps on every Established; arguably should mean "first connected".
  4. Larger DashMap/scc migration for the three hot maps. Separate design PR; uses scc (not DashMap) per reviewed research.
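Item 1's inter-map window can be shown with a minimal sketch (two boolean `RwLock`s stand in for the real `active_connections` and `peers` maps; this is illustrative only, not the PR's code): between the two separate writes there is a real interleaving point where a reader observes one structure updated and the other not.

```rust
use std::sync::RwLock;

fn main() {
    // Stand-ins for the two maps; `true` means "this map saw the update".
    let active_connections = RwLock::new(false);
    let peers = RwLock::new(false);

    // First write: active_connections is updated and its guard dropped.
    *active_connections.write().unwrap() = true;

    // The window: a concurrent reader running here sees the maps disagree —
    // active_connections updated, peers not yet. Two separate write()
    // acquisitions are not one atomic operation.
    assert!(*active_connections.read().unwrap());
    assert!(!*peers.read().unwrap());

    // Second write closes the window.
    *peers.write().unwrap() = true;
    assert!(*peers.read().unwrap());
}
```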

Greptile Summary

This PR extracts the ConnectionEvent::Established handling from connection_lifecycle_monitor_with_rx into TransportHandle::handle_connection_established, scoping the peers write guard to the synchronous map mutation only so it is released before the identity-announce .await. The accompanying tests confirm both that a concurrent reader is not blocked during a slow send and that N parallel sends actually run in parallel.

Confidence Score: 5/5

Safe to merge — fix is logically correct, well-tested, and all remaining observations are P2 style notes.

The peers write guard is now demonstrably scoped to the synchronous map mutation only. The let-chain syntax is valid under edition = "2024". Both regression tests correctly reproduce the bug on reverted code and pass on the fix. No P0 or P1 issues found.

No files require special attention.

Important Files Changed

| Filename                | Overview |
|-------------------------|----------|
| src/transport_handle.rs | Correct guard-scoping fix with comprehensive tests; let-chain syntax is valid under edition = "2024". No issues found. |

Sequence Diagram

```mermaid
sequenceDiagram
    participant EL as Event Loop
    participant HCE as handle_connection_established
    participant AC as active_connections
    participant P as peers RwLock
    participant R as Concurrent Reader
    participant QUIC as QUIC send

    Note over EL,QUIC: BEFORE fix
    EL->>HCE: ConnectionEvent::Established
    HCE->>AC: write + insert (guard dropped at semicolon)
    HCE->>P: write guard acquired
    Note over HCE,P: map update synchronous
    HCE->>QUIC: send_to_peer_optimized await (guard still held)
    R->>P: read await BLOCKED for full network RTT
    QUIC-->>HCE: Ok
    HCE->>P: guard finally dropped

    Note over EL,QUIC: AFTER fix
    EL->>HCE: ConnectionEvent::Established
    HCE->>AC: write + insert (guard dropped at semicolon)
    HCE->>P: write guard acquired in scoped block
    Note over HCE,P: map update synchronous
    HCE->>P: guard dropped before any await
    R->>P: read succeeds immediately
    HCE->>HCE: make_announce_bytes sync call
    HCE->>QUIC: send_to_peer_optimized await (no lock held)
```

Reviews (1): Last reviewed commit: "fix: release peers write guard before id..."

Copilot AI review requested due to automatic review settings April 15, 2026 08:42

Copilot AI left a comment


Pull request overview

Refactors connection-established handling to avoid holding the peers write lock across an awaited identity-announce send, preventing reader starvation under slow QUIC sends.

Changes:

  • Extracts ConnectionEvent::Established logic into TransportHandle::handle_connection_established with tight scoping of the peers write guard.
  • Preserves operation ordering while moving identity announce creation/sending outside of lock scope via injected closures.
  • Adds regression + contract tests validating non-blocking reads and parallel send behavior.

Comment thread src/transport_handle.rs
});

// Wait until the send phase is in progress, then try to read `peers`.
send_started.notified().await;

Copilot AI Apr 15, 2026


send_started.notified().await has no timeout. If handle_connection_established panics or the send closure is never invoked (e.g., future refactor changes the send conditions), this test can hang indefinitely and stall CI. Wrap the notification wait in tokio::time::timeout with a clear failure message, and consider aborting/joining helper_task on timeout so the failure is reported promptly.

Suggested change — replace `send_started.notified().await;` with:

```rust
if timeout(READ_BUDGET, send_started.notified()).await.is_err() {
    helper_task.abort();
    let _ = helper_task.await;
    panic!(
        "timed out after {READ_BUDGET:?} waiting for handle_connection_established \
         to enter the send phase and notify the test"
    );
}
```
