fix: release peers write guard before identity announce send#86
Open
fix: release peers write guard before identity announce send#86
Conversation
`connection_lifecycle_monitor_with_rx` held a `peers: Arc<RwLock<_>>` write guard across `dual_node.send_to_peer_optimized(...).await` on every `ConnectionEvent::Established`. A slow QUIC send therefore starved every concurrent reader of `peers` for the full network round-trip, including the 8 message-dispatch shard consumers, DHT queries, and the accept loop. Extract the Established arm body into `handle_connection_established` and scope the `peers` write guard tightly around the synchronous map mutation so it is dropped before the send. Operation order is preserved: the synthesis of announce bytes runs via a `FnOnce` closure invoked after the map update, so any future side effects on `node_identity` still follow channel registration. Add two regression tests that both fail on reverted code and pass on the fix: - `reader_not_blocked_by_slow_identity_announce`: parks a fake send for 1200ms and asserts a concurrent `peers.read()` completes within 400ms. Unfixed: times out at 400ms. Fixed: completes in microseconds. - `concurrent_sends_run_in_parallel_not_serialised`: spawns 8 concurrent helper calls with 400ms sends and asserts total wall-clock < 1600ms (half of fully-serialised). Unfixed: 3.22s. Fixed: 0.40s. Plus four contract tests covering success, send-error, reconnect, and missing-announce-bytes paths, and one concurrent-correctness test for disjoint-key parallelism. Reviewed across two rounds by three independent hostile reviewers (concurrency, test rigor, regression/contract). Final verdicts: MINOR-POLISH / MINOR-POLISH / SHIP. All actionable findings addressed. Out-of-scope follow-ups (separate PRs): - Inter-map windows between active_connections and peers writes on both Established and Lost/Failed arms. - Zombie peer state when send_to_peer_optimized fails. - `connected_at` semantics on reconnect.
Contributor
There was a problem hiding this comment.
Pull request overview
Refactors connection-established handling to avoid holding the peers write lock across an awaited identity-announce send, preventing reader starvation under slow QUIC sends.
Changes:
- Extracts
ConnectionEvent::Establishedlogic intoTransportHandle::handle_connection_establishedwith tight scoping of thepeerswrite guard. - Preserves operation ordering while moving identity announce creation/sending outside of lock scope via injected closures.
- Adds regression + contract tests validating non-blocking reads and parallel send behavior.
| }); | ||
|
|
||
| // Wait until the send phase is in progress, then try to read `peers`. | ||
| send_started.notified().await; |
There was a problem hiding this comment.
send_started.notified().await has no timeout. If handle_connection_established panics or the send closure is never invoked (e.g., future refactor changes the send conditions), this test can hang indefinitely and stall CI. Wrap the notification wait in tokio::time::timeout with a clear failure message, and consider aborting/joining helper_task on timeout so the failure is reported promptly.
Suggested change
| send_started.notified().await; | |
| if timeout(READ_BUDGET, send_started.notified()).await.is_err() { | |
| helper_task.abort(); | |
| let _ = helper_task.await; | |
| panic!( | |
| "timed out after {READ_BUDGET:?} waiting for handle_connection_established \ | |
| to enter the send phase and notify the test" | |
| ); | |
| } |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
connection_lifecycle_monitor_with_rxheld apeerswrite guard acrossdual_node.send_to_peer_optimized(...).awaiton everyConnectionEvent::Established. A slow QUIC send therefore starved every concurrent reader ofpeersfor the full network round-trip, including the 8 message-dispatch shard consumers, DHT queries, and the accept loop.This is a latent hazard, not the rc-17/rc-19 regression (that was fixed in rc-18 via the mpsc shard-channel serialisation in #80). It is the next logical cleanup of the same class of bug: one slow remote cascading into multi-second reader starvation.
Fix
Extract the
Establishedarm body intoTransportHandle::handle_connection_establishedand scope thepeerswrite guard tightly around the synchronous map mutation so it is released before the identity-announce send runs.Operation order is preserved:
create_identity_announce_bytesis now invoked via aFnOnce() -> Option<Vec<u8>>closure called after the map update, so any future side effects onnode_identitystill follow channel registration.Regression tests
Two tests fail on reverted code and pass on the fix:
reader_not_blocked_by_slow_identity_announce: parks a fake send for 1200ms and asserts a concurrentpeers.read()completes within 400ms. Unfixed: times out at 400ms. Fixed: completes in microseconds.concurrent_sends_run_in_parallel_not_serialised: spawns 8 concurrent helper calls with 400ms sends and asserts total wall-clock < 1600ms (half of fully-serialised). Unfixed: 3.22s (serialised on write lock). Fixed: 0.40s (truly parallel).Plus four contract tests: success path, send-error preservation, reconnect
connected_atadvancement, missing-announce-bytes skip. And one disjoint-key concurrent-correctness test.10-run stability verified on both regression tests (0 flakes).
Test plan
cargo test --lib— 289/289 passcargo test --tests— 344/344 pass (lib + all integration suites)cargo fmt -- --checkcargo clippy --lib --tests -- -D warnings -D clippy::unwrap_used -D clippy::expect_usedcargo doc --no-deps --libReview
Reviewed across two rounds by three independent hostile reviewers:
Every actionable finding addressed. Full report available on request.
Scope
src/transport_handle.rs(+533 / -44)Cargo.toml/Cargo.lockchangesOut of scope (deferred to follow-up PRs)
active_connectionsandpeerswrites on bothEstablishedandLost/Failedarms (two separate.write().awaitcalls are not atomic — a concurrent reader can observe one map updated but not the other). Flagged by Reviewer A as MAJOR but predates this PR.send_to_peer_optimizedfails — the peer is still markedConnectedeven though the remote will never authenticate. Contract question.connected_aton reconnect — currently bumps on everyEstablished; arguably should mean "first connected".scc(not DashMap) per reviewed research.Greptile Summary
This PR extracts the
ConnectionEvent::Establishedhandling fromconnection_lifecycle_monitor_with_rxintoTransportHandle::handle_connection_established, scoping thepeerswrite guard to the synchronous map mutation only so it is released before the identity-announce.await. The accompanying tests confirm both that a concurrent reader is not blocked during a slow send and that N parallel sends actually run in parallel.Confidence Score: 5/5
Safe to merge — fix is logically correct, well-tested, and all remaining observations are P2 style notes.
The peers write guard is now demonstrably scoped to the synchronous map mutation only. The let-chain syntax is valid under edition = "2024". Both regression tests correctly reproduce the bug on reverted code and pass on the fix. No P0 or P1 issues found.
No files require special attention.
Important Files Changed
Sequence Diagram
sequenceDiagram participant EL as Event Loop participant HCE as handle_connection_established participant AC as active_connections participant P as peers RwLock participant R as Concurrent Reader participant QUIC as QUIC send Note over EL,QUIC: BEFORE fix EL->>HCE: ConnectionEvent::Established HCE->>AC: write + insert (guard dropped at semicolon) HCE->>P: write guard acquired Note over HCE,P: map update synchronous HCE->>QUIC: send_to_peer_optimized await (guard still held) R->>P: read await BLOCKED for full network RTT QUIC-->>HCE: Ok HCE->>P: guard finally dropped Note over EL,QUIC: AFTER fix EL->>HCE: ConnectionEvent::Established HCE->>AC: write + insert (guard dropped at semicolon) HCE->>P: write guard acquired in scoped block Note over HCE,P: map update synchronous HCE->>P: guard dropped before any await R->>P: read succeeds immediately HCE->>HCE: make_announce_bytes sync call HCE->>QUIC: send_to_peer_optimized await (no lock held)Reviews (1): Last reviewed commit: "fix: release peers write guard before id..." | Re-trigger Greptile