release: v0.32.0 by jacderida · Pull Request #61 · WithAutonomi/saorsa-transport

jacderida · 2026-04-14T21:39:30Z

Summary

Merges RC branch rc-2026.4.1 into main for the v0.32.0 release
Includes all PRs merged to the RC branch (refactor: remove dead connect_dual_stack and try_connect_family methods #36–feat: increase ACK timeout size budget from 1ms/KB to 4ms/KB #58), with PR fix: increase handshake channel capacity from 32 to 1024 #59 reverted (revert: undo PR #59 handshake channel capacity increase #60)
Version bumped to 0.32.0 with updated Cargo.lock

Key changes

NAT traversal fixes for large-scale testnets (990+ nodes)
PUNCH_ME_NOW_NACK coordinator signalling
Quality-aware coordinator selection with RTT weighting
Lock-free ConnectionRouter with atomic selection stats
UPnP IGD port mapping for NAT traversal
Tightened network timeouts for faster connection establishment
Relay session improvements (cleanup, rotation, health monitoring)
ACK timeout size budget increase (1ms/KB → 4ms/KB)

Test plan

CI passes
Tag v0.32.0 after merge

🤖 Generated with Claude Code

Greptile Summary

This release PR merges the rc-2026.4.1 branch into main for v0.32.0, incorporating a large batch of NAT traversal improvements: UPnP IGD port mapping, a lock-free ConnectionRouter, Tier-4 coordinator back-pressure via RelaySlotTable, PUNCH_ME_NOW_NACK signalling for faster coordinator rotation, quality-aware DHT-referrer coordinator selection, tightened connection timeouts, and relay session lifecycle improvements.

P1 — broken test: src/connection_strategy.rs::test_default_config asserts holepunch_timeout = 8s and relay_timeout = 10s, but the Default impl in this PR sets them to 3s and 5s. CI will fail until one side is corrected.
P2 — stale NACK accumulation: a NACK not consumed before try_hole_punch is cancelled by the outer timeout lingers in nack_set and may prematurely abort the next coordinator rotation for the same target.
P2 — rate-limit relaxation: max_coordination_per_window was raised 60× (5 → 300/min per connection); the RelaySlotTable provides the real back-pressure, but the per-connection guard is now largely inert.

Confidence Score: 4/5

Safe to merge after fixing the test mismatch in connection_strategy.rs; all other findings are P2.

One P1 finding — the test_default_config assertions diverge from the Default impl and will break CI. The remaining comments are P2 quality/hardening concerns that do not block correctness.

src/connection_strategy.rs — test_default_config will fail due to holepunch_timeout/relay_timeout mismatch between Default impl and assertions.

Important Files Changed

Filename	Overview
src/connection_strategy.rs	Tightened strategy timeouts and added `transition_to_next_relay`; but `test_default_config` asserts 8s/10s for holepunch/relay while the `Default` impl produces 3s/5s — this test will fail.
src/upnp.rs	New best-effort UPnP IGD port-mapping service; cleanly feature-gated, validates external IP for public routability before surfacing as a candidate, and gracefully degrades when no gateway is present.
src/relay_slot_table.rs	New node-wide Tier-4 back-pressure table for coordinator relay sessions; keyed by (initiator_addr, target_peer_id), amortized inline sweep, released on connection Drop — well-tested.
src/connection_router.rs	Refactored to lock-free interior mutability: `RouterStats` now uses `AtomicU64`, constrained transport uses `OnceLock`, and all public methods take `&self`.
src/p2p_endpoint.rs	Major additions: quality-aware coordinator rotation, per-target DashMap for peer IDs and preferred coordinators, NACK fast-exit path, relay timeout, and reachability tracking; stale NACK accumulation across rotations is a minor edge case.
src/connection/nat_traversal.rs	Adds `RelaySlotTable` integration into `BootstrapCoordinator`, `Drop` impl for slot release, and `PortMapped` candidate source; `max_coordination_per_window` raised 60× from 5→300 deserves review.
src/frame/nat_traversal_unified.rs	New `PunchMeNowNack` frame (0x3d7e99) with clean encode/decode and correct `SIZE_BOUND` constant.
src/node.rs	Replaced speculative `has_public_ip` with freshness-aware `can_receive_direct` based on actual inbound connection evidence and `DIRECT_REACHABILITY_TTL`.
src/reachability.rs	New module providing `ReachabilityScope`, `TraversalMethod`, and `socket_addr_scope` helpers; cleanly separates address classification from node-level policy.
src/masque/relay_server.rs	Added `spawn_cleanup_task` with Weak-reference lifecycle and two covering tests.
src/config/nat_timeouts.rs	Raised `DEFAULT_SEND_ACK_TIMEOUT` from 1s → 5s and `FAST_SEND_ACK_TIMEOUT` from 500ms → 2500ms to handle cross-region hole-punched paths; rationale well-documented.

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[connect_with_fallback] --> B{Direct connect}
    B -->|success| Z[Connected - Direct]
    B -->|fail| C{Preferred coordinators from DHT?}
    C -->|Yes: N coordinators| D[Rotate through coordinators\n1.5s quick timeout each except last]
    C -->|No| E[Single coordinator\nfull strategy timeout]
    D --> F{try_hole_punch}
    E --> F
    F -->|NACK received| G[consume_nack - abort early\nrotate to next coordinator]
    G --> D
    F -->|timeout / error| H{More coordinators in rotation?}
    H -->|Yes| D
    H -->|No - should_retry?| F
    H -->|No - exhausted| I[Post-punch direct retry 1s]
    F -->|success| Z2[Connected - HolePunch]
    I -->|success| Z2
    I -->|fail| J{Try relay}
    J -->|5s timeout| K[transition_to_next_relay]
    K -->|more relays| J
    K -->|exhausted| L[Connection Failed]
    J -->|success| Z3[Connected - Relay]
    subgraph Coordinator back-pressure
        M[PUNCH_ME_NOW arrives at coordinator]
        M --> N{RelaySlotTable at capacity?}
        N -->|No - slot acquired| O[Forward to target]
        N -->|Yes - refused| P[Silent drop - initiator times out]
        O --> Q{Target found?}
        Q -->|No| R[Send PUNCH_ME_NOW_NACK]
        Q -->|Yes| S[Relay frame to target]
    end

Prompt To Fix All With AI

This is a comment left during a code review.
Path: src/connection_strategy.rs
Line: 489-490

Comment:
**Test-vs-implementation mismatch: CI will fail**

The `Default` impl (line 183–184) sets `holepunch_timeout = 3s` and `relay_timeout = 5s`, but this test asserts `8s` and `10s` respectively. These values have never matched in this PR, so `cargo test` will fail.

Either the test needs to be updated to the values the `Default` impl actually produces, or the `Default` impl needs to be corrected to the intended values:

```suggestion
        assert_eq!(config.holepunch_timeout, Duration::from_secs(3));
        assert_eq!(config.relay_timeout, Duration::from_secs(5));
```

How can I resolve this? If you propose a fix, please make it concise.

---

This is a comment left during a code review.
Path: src/p2p_endpoint.rs
Line: 93-104

Comment:
**Docstring references a stale `8s` timeout**

The comment says "the worst-case wait for K preferred coordinators is roughly `(K-1) * 1.5s + 8s`", but the `holepunch_timeout` default in `StrategyConfig` is now `3s` (set in this PR, `connection_strategy.rs` line 183). The `8s` figure is stale — either the default was intended to stay at 8s (in which case `Default` needs fixing) or the docstring should be updated to reflect `3s`.

How can I resolve this? If you propose a fix, please make it concise.

---

This is a comment left during a code review.
Path: src/connection/nat_traversal.rs
Line: 484

Comment:
**Per-connection coordination rate limit raised 60×**

`max_coordination_per_window` went from 5 → 300 PUNCH_ME_NOW requests per 60-second window *per connection*. While the new `RelaySlotTable` provides the actual node-wide back-pressure at 32 concurrent sessions (which is the right control), this effectively neuters the per-connection DoS guard: a single misbehaving peer can now issue 300 relay-probe bursts per minute before the per-connection rate limiter fires. Consider whether the Tier-4 cap alone is sufficient or whether a more moderate per-connection limit (e.g. 60–120/min) is preferable to retain defence-in-depth.

How can I resolve this? If you propose a fix, please make it concise.

---

This is a comment left during a code review.
Path: src/p2p_endpoint.rs
Line: 2263-2275

Comment:
**Stale NACKs may prematurely abort the next coordinator rotation**

When `try_hole_punch` is cancelled (because the outer `timeout(attempt_timeout, …)` fires), any NACK that arrived just before the cancellation stays unconsumed in `self.inner.nack_set`. On the *next* call to `try_hole_punch` (with a different coordinator), `nack_notify` may immediately fire and `consume_nack` will return `true` — returning `Err(...)` even though the *new* coordinator hasn't sent a NACK. Consider clearing the NACK entry for the target when rotating coordinators, or scoping NACKs by `(coordinator_addr, target_peer_id)` so a stale NACK from coordinator A doesn't abort coordinator B.

How can I resolve this? If you propose a fix, please make it concise.

_{Reviews (1): Last reviewed commit: "style: apply rustfmt formatting" | Re-trigger Greptile}

Greptile also left 4 inline comments on this PR.

Reduce connection strategy timeouts based on RTT analysis (300ms worst-case cross-globe RTT). The previous values were overly conservative, causing up to 89s worst-case connection time to unreachable peers. Changes: - Direct connection: 5s → 2s (1-2 RTTs + margin) - Hole-punch round: 15s → 3s (3 RTTs + margin) - Max hole-punch rounds: 3 → 2 - Relay establishment: 30s → 10s - Post-hole-punch direct retry: 3s → 1s (warm NAT pinhole) - Default send ACK timeout: 1s → 500ms - Remove redundant hardcoded 15s deadline in try_hole_punch; outer strategy.holepunch_timeout() is now the single source of truth Worst-case connection path drops from 89s to ~21s. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

After lowering DEFAULT_SEND_ACK_TIMEOUT to 500ms, both constants held the same value, making TimeoutConfig::fast() identical to default for send_ack_timeout. Lower the fast variant to 250ms to match the halved-interval pattern used throughout the fast profile. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

perf: tighten network timeouts for faster connections

Observed hole-punch success times of 1-6s across regions (London → Tokyo coordinator → Hetzner NAT node). The 3s holepunch_timeout was cutting off punches that needed 3-6s of cross-continental coordination, causing fallback to relay which then also struggled. - ipv4_timeout: 2s → 3s (cross-region direct needs margin) - ipv6_timeout: 2s → 3s - holepunch_timeout: 3s → 8s (2x margin over observed 1-6s range) Worst-case connection path: 3+3+8+1+8+1+10 ≈ 34s Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

hole_punch_target_peer_id was a single shared field on P2pEndpoint. When multiple concurrent dial_candidate calls ran in parallel (the DHT lookup uses ALPHA=3 parallel queries), they all overwrote the same field. try_hole_punch then used find_connection_by_peer_id which returned whatever connection matched the LAST peer ID written, not the one being hole-punched. This caused hole-punches to "succeed" but return connections to the wrong peer. The DHT layer then couldn't use the connection (address mismatch), re-dialled, and timed out — blocking uploads indefinitely. Fix: replace the single shared Option<[u8; 32]> with a per-target DashMap<SocketAddr, [u8; 32]>. Each concurrent dial gets its own entry keyed by the target address it's connecting to. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Log the actual peer ID value at three points: 1. dial_candidate: what's inserted into the DashMap 2. try_hole_punch: what's read from the DashMap 3. send_coordination_request_with_peer_id: what goes on the wire This will show whether the DashMap fix is working or if the peer ID is being corrupted elsewhere in the path. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

…y diagnostics Add a per-target preferred coordinator DashMap so the DHT referrer (the node that returned the target in a FindNode response) is used as the first-choice coordinator for PUNCH_ME_NOW relay. This ensures the coordinator has an active connection to the target. Also improve the "No connection found" log to include hex-encoded peer IDs of all known connections, making it possible to definitively confirm when a coordinator cannot relay. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

The fixed 500ms send_ack_timeout causes all large chunk transfers to fail. QUIC slow-start over a high-RTT path (e.g. London→Tokyo, 250ms RTT) needs 9-10 round trips to ramp the congestion window enough for 4MB, taking 2-3 seconds minimum. The 500ms timeout fires well before the data is delivered. Add a size-proportional budget: base_timeout + (payload_bytes / 1024)ms. This gives ~500ms for small DHT messages (still catches dead connections quickly) and ~4.5s for a 4MB chunk. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

The PUNCH_ME_NOW relay path queues frames directly on the connection's pending queue without checking if the connection is alive. If the connection has silently died (error set or drained), the frame goes into a queue that will never be transmitted and no error is surfaced. Add conn.error and is_drained() checks before queuing, with logging to diagnose why some PUNCH_ME_NOW frames are queued but never arrive at the coordinator. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Frames queued on a live connection sometimes don't arrive at the coordinator. The previous diagnostic confirmed the connection is alive at queue time. This log covers the next step: whether poll_transmit actually pops the frame from the pending queue and encodes it into a QUIC packet. Only logs relay frames (target_peer_id is Some), not direct punches. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

The previous logic skipped overwriting a DashMap entry when the existing connection had no close_reason. But connections can become zombies — their driver is no longer polling them, yet close_reason() still returns None. Frames queued on zombie connections are never encoded into QUIC packets and silently lost. Diagnostic data from a 50-node testnet showed that only 1 out of 5 coordinators was actually encoding PUNCH_ME_NOW frames. The other 4 had stale Connection objects in the DashMap that accepted frames into their pending queues but had no active driver to transmit them. Always overwrite with the newest connection, which is the one most likely to have an active driver. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Before queuing a PUNCH_ME_NOW on a coordinator connection from the DashMap, verify the low-level endpoint still tracks it. Zombie connections pass close_reason() and is_drained() checks but their driver is no longer polling, so frames queued on them are never encoded into QUIC packets. When a zombie is detected, remove it from the DashMap and fall through to establish a new connection. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

The DashMap can hold a connection whose handle index differs from the endpoint's current connection to the same address. This happens when the coordinator reconnects — the endpoint gets a new connection but the DashMap keeps the old one. Frames encoded on the old connection use stale connection IDs that the coordinator drops silently. Compare the DashMap connection's handle_index() against the endpoint's connection_stable_id_for_addr(). If they differ, log the mismatch (STALE) and remove the DashMap entry, falling through to establish a new connection. Logging distinguishes three cases: - STALE: DashMap and endpoint both have connections, but different ones - ORPHAN: DashMap has a connection but the endpoint doesn't - verified: handles match, connection is the same one being driven Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Previously, round 2 switched to coordinator_candidates[1] — a random fallback that likely has no connection to the target. The preferred coordinator (index 0) was chosen because the DHT referrer has a known connection to the target. If round 1 timed out, the relay was sent but the return connection didn't arrive in time. Retrying with the same coordinator gives it another chance rather than wasting 5 seconds on a coordinator that can't relay. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Deletes the unused duplicate connect_with_fallback on NatTraversalEndpoint plus its supporting "PATH_CHALLENGE hole-punch" chain (attempt_hole_punching, attempt_quic_hole_punching, get_candidate_pairs_for_addr, calculate_candidate_pair_priority, create_path_challenge_packet, store_successful_candidate_pair, get_successful_candidate_address) and the orphaned successful_candidates DashMap field. The deleted code was reachable only from the duplicate connect_with_fallback (also removed). The actual production hole-punch path lives in P2pEndpoint::connect_with_fallback_inner, which is what LinkTransport::dial_addr (link_transport_impl.rs) and the saorsa-transport example bin both call. The duplicate had no in-tree or downstream consumers — verified across this crate, saorsa-core, and ant-node. The deleted hole-punch implementation could not have worked even if it had been wired in: 1. attempt_quic_hole_punching bound a fresh std::net::UdpSocket to the local candidate address — always fails on a real node because Quinn already owns the port (UDP binds are exclusive). 2. The "QUIC packet" it sent was a hand-rolled byte sequence (0x40 [0,0,0,1] 0x1a <8 random>) that is not a valid encrypted QUIC packet, so any receiving Quinn endpoint silently dropped it during connection-id lookup. 3. The success branch then waited 100 ms blocking on recv_from for a "response" no compliant peer would ever send. The #[allow(dead_code)] markers on every function in the chain disguised that nothing in the path could ever succeed, actively misleading anyone debugging hole-punch issues via grep. Comment blocks left in place point readers at the production path. Also fixes a pre-existing nonminimal_bool clippy warning in the same file (`!target_peer_id.is_some()` -> `target_peer_id.is_none()`) so the file now passes `cargo clippy --lib -- -D warnings`. The fix is a single character but is called out here so it isn't hidden inside the bulk deletion. The `!` indicates removal of a `pub fn`. No in-tree or downstream consumer depended on it, but the public API surface is technically narrower. Verified: cargo test --lib 1459/1459 passed (3 ignored, unchanged) cargo clippy --lib -- -D warnings clean rustfmt --edition 2024 --check clean Downstream saorsa-core test suite still green Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Adds an optional UPnP IGD client that opportunistically asks the local router to forward the endpoint's UDP port and surfaces the resulting public address as a high-priority candidate alongside locally-discovered and peer-observed candidates. The integration is strictly additive: failure is silent (no router, UPnP disabled, refused mapping, untrusted external IP) and behaviour matches the pre-UPnP build exactly when the gateway is unavailable. The endpoint construction never blocks on the probe — discovery runs in a background task with a 2s deadline. Key design points: * Single-shot probe per session. A router that did not answer SSDP is left alone for the rest of the session — no aggressive re-probing. * Lease renewed at half the lease duration (default 1h, so refresh every 30 minutes). Crash-path safety: 1h lease bounds the worst-case leak. * Gateway-claimed external IPs are validated as plausibly public before use. RFC1918, CGNAT (100.64/10), loopback, link-local, documentation, broadcast and multicast all rejected so a misbehaving router cannot poison candidate discovery. * Endpoint owns the service exclusively; the discovery manager only holds a read-only UpnpStateRx (watch::Receiver). This lets the shutdown path move out the service and call DeletePortMapping directly with a 500ms budget. * PortMapped slots into priority calculation between Host (126) and ServerReflexive (100) at 110, and gets discovery priority 70_000 so it outranks the bound-address promotion (60_000). Drive-by: clippy caught a pre-existing `!is_some()` in the PUNCH_ME_NOW coordination path that was blocking warnings-as-errors after a clippy upgrade — replaced with `.is_none()`. Tests: 1468 lib tests pass with --features upnp. New unit tests cover config defaults, the disabled path, the IP plausibility classifier across IPv4/IPv6 special ranges, and the discovery integration with a pinned UpnpStateRx. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Removes `pub struct CandidatePair` and `pub enum CandidatePairState` from `nat_traversal_api.rs`. They were only ever used by the dead hole-punch chain deleted in 5280123, and `connection/nat_traversal.rs` defines its own unrelated `pub(super) CandidatePair` for the live coordinator path. With the chain gone, both types are orphans in the public API surface. Also tightens the two tombstone comments — drops the box-drawing borders, fixes the misleading "implemented elsewhere in this file" phrasing (the production flow spans `p2p_endpoint.rs` and the PUNCH_ME_NOW helpers later in this file), and cross-links them. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Pre-existing formatting drift caught by `cargo fmt --all -- --check` on this branch but unrelated to the hole-punch chain removal: - `src/endpoint.rs:387` — collapse a `.as_ref().map(...)` chain that rustfmt now wants on one line. - `src/p2p_endpoint.rs:1822` — split a multi-arg `info!` so each positional argument sits on its own line. No semantic change. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

The IPv4 plausibility check rejects RFC 5737 documentation space (192.0.2/24, 198.51.100/24, 203.0.113/24) so a misbehaving router cannot poison NAT traversal candidate discovery by returning a documentation address as its "external" IP. The IPv6 side of the classifier was missing the equivalent rejection of RFC 3849 `2001:db8::/32`, leaving a small but real asymmetry in the stated threat model. Split `is_plausibly_public` into symmetric v4/v6 helpers, add an `is_ipv6_documentation` check backed by named constants for the prefix bytes, and replace the hand-rolled `is_ipv6_link_local` with stdlib's `Ipv6Addr::is_unicast_link_local()` (stable since 1.76 — well under the project MSRV of 1.88). Fixes up the test that would have caught this: the positive assertion in `accepts_global_unicast_ipv6_and_rejects_link_local` was using `2001:db8::1` as its "global unicast" example, which accidentally hid the gap. Swap it for `2606:4700:4700::1111` (Cloudflare DNS), add `rejects_ipv6_documentation_range` with a neighbouring `/32` as a positive control, and add `rejects_ipv6_multicast_and_unspecified` for basic coverage completeness. Drive-bys from the deep review of PR #40: * `CandidateDiscoveryManager::set_upnp_state_rx` downgraded from `pub` to `pub(crate)` — it's an internal wiring hook used only by the endpoint constructor, not a public API. * Clarifying comment on `local_socket_for_mapping` explaining why the sync `std::net::UdpSocket::bind`/`connect` pair is safe to call inside the tokio task (pure kernel route lookup, no wire I/O, runs once per session). Tests: 1470/1470 lib tests pass with `--features upnp`, and the UPnP module now has 11 passing tests (up from 9) across both the real and stub backends. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

…ch-chain refactor!: remove dead NatTraversalEndpoint hole-punch chain

PR #41 removed the dead `NatTraversalEndpoint::attempt_hole_punching` chain (`get_candidate_pairs_for_addr`, `calculate_candidate_pair_priority`, `attempt_quic_hole_punching`, `create_path_challenge_packet`, `store_successful_candidate_pair`, `get_successful_candidate_address`) as unreachable code gated behind `#[allow(dead_code)]` that could not have worked in production. PR #40 had added `CandidateSource::PortMapped` handling inside two of those exact functions (`is_local_side` in `get_candidate_pairs_for_addr` and a type-preference slot in `calculate_candidate_pair_priority`). Those additions patched dead code and are safely dropped — the live production pairing path in `P2pEndpoint::connect_with_fallback_inner` drives the coordinator-mediated PUNCH_ME_NOW flow and consumes priorities from `crate::connection::nat_traversal::calculate_candidate_priority`, which works off `CandidateType`, not `CandidateSource`. The mapping `CandidateSource::PortMapped -> CandidateType::ServerReflexive` in `classify_candidate_type` at `src/connection/nat_traversal.rs:307` (introduced by PR #40 and untouched by PR #41) is what carries the PortMapped variant through the production pairing formula. No further plumbing is required. Extended PR #41's tombstone comment with a note explaining why PortMapped needs no replacement pairing logic at this site, so the next person grepping for `PortMapped` in `nat_traversal_api.rs` lands on the explanation directly. Validation: - cargo fmt --all -- --check clean - cargo clippy --features upnp --all-targets -- -D warnings clean - cargo clippy --no-default-features --features platform-verifier,network-discovery --all-targets -- -D warnings clean - cargo nextest run --features upnp --lib: 1470/1470 passed Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

feat: best-effort UPnP IGD port mapping

`P2pEndpoint::send()` previously took a write lock on the `ConnectionRouter` just to call `select_engine_for_addr`, which only needed to bump a stats counter. At high node counts (1000-node testnet) this serialised every outbound send through a single exclusive lock and was a dominant contention point. Convert `RouterStats::{quic,constrained,fallback}_selections` to `AtomicU64` so the `select_engine*` family can take `&self`, then switch the send path to `router.read().await`. Other counters stay plain `u64` because they are only mutated on `&mut self` connect/accept/event-loop paths. Hand-implement `Clone` for `RouterStats` since `AtomicU64` isn't `Clone`, loading each counter with `Relaxed` ordering (stats are monotonic and don't synchronise other state). BREAKING: `RouterStats::{quic,constrained,fallback}_selections` are now `AtomicU64` instead of `u64`. External readers must use `.load(Ordering::Relaxed)`. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

…nect_transport, bench cleanup Three follow-ups from the PR #42 review: 1. `connection_router.rs`: the fallback-adjustment path decremented `{quic,constrained}_selections` via a plain `load` + `store` — not the "CAS loop" the comment claimed. Under concurrent callers (which this PR is designed to enable) that pattern silently loses updates, including increments from other concurrent `select_engine_detailed` callers. Replace with `fetch_update(Relaxed, Relaxed, saturating_sub)`. 2. `p2p_endpoint.rs`: `connect_transport` held an exclusive write lock on the router just to call `select_engine_for_addr`, even though that method is now `&self`. Downgrade engine selection to a read lock; only the Constrained branch re-acquires a write lock for `router.connect(addr)`. The QUIC branch now holds no router lock. 3. `benches/connection_router.rs`: the previous PR updated test `let mut router`s but missed the bench, leaving 9 unused_mut warnings that broke `clippy --all-targets -D warnings` on CI. Line 187 (`bench_constrained_connect`) correctly retains `mut` for its `router.connect` call. Verified: fmt clean, clippy clean, connection_router (45) + constrained_integration (24) + p2p_endpoint (20) tests all pass. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Follow-up to PR #42 review feedback. The previous commit dropped the write-lock from the `send` hot path but kept `Arc<RwLock<ConnectionRouter>>` in `P2pEndpoint` and `pub AtomicU64` fields on `RouterStats`. This finishes the job: 1. `RouterStats` fields are now private. External consumers read via accessor methods (`quic_selections() -> u64`, etc.) or capture a plain-`u64` point-in-time view via the new `RouterStatsSnapshot` value type returned from `RouterStats::snapshot()`. The manual `Clone` impl is gone — `RouterStatsSnapshot` replaces it for the "take a copy" use-case. `P2pEndpoint::routing_stats()` now returns `RouterStatsSnapshot` and is no longer `async`. 2. Every counter in `RouterStats` is an `AtomicU64` (not just the selection counters). This lets every mutating method on `ConnectionRouter` run under `&self`. 3. `ConnectionRouter` is now fully interior-mutable: - `constrained_transport: Option<ConstrainedTransport>` → `OnceLock<ConstrainedTransport>` (lazy init under `&self`, once set never revoked). - `next_quic_id: u64` → `AtomicU64` (`fetch_add`). - `set_quic_endpoint` deleted along with its redundant caller in `P2pEndpoint::new` — `with_full_config` already installs it. - `connect`, `connect_async`, `connect_peer`, `connect_quic_async`, `connect_constrained`, `accept_quic`, `poll_events`, and `process_constrained_incoming` all take `&self`. 4. `P2pEndpoint::router` is now `Arc<ConnectionRouter>` instead of `Arc<RwLock<ConnectionRouter>>`. `send()` and `connect_transport()` no longer `.read().await`/`.write().await` on the router at all — the previously-remaining write lock in `connect_transport`'s constrained branch is gone. `P2pEndpoint::router()` now returns `&Arc<ConnectionRouter>` synchronously. 5. Added a compile-time `assert_send_sync::<ConnectionRouter>()` so any future change that accidentally adds a non-`Sync` field fails the build immediately instead of surfacing a confusing error at the `P2pEndpoint` clone site. 6. Documented the TOCTOU assumption in `connect_transport`: engine selection and engine-use are two separate `&self` calls, but the race is closed by construction (QUIC endpoint fixed at construction, constrained transport lazy-init via `OnceLock` and never torn down). Comment explains what a future change must do if that invariant is relaxed. BREAKING: - `RouterStats::{quic_connections,constrained_connections,*_bytes_*, connection_failures,quic_selections,constrained_selections, fallback_selections,events_processed}` are no longer `pub` fields. Use the same-named accessor methods (returning `u64`), or `RouterStats::snapshot()` for a plain-`u64` struct. - `ConnectionRouter::set_quic_endpoint` has been removed. - `P2pEndpoint::router()` is no longer `async` and returns `&Arc<ConnectionRouter>` instead of an `RwLockReadGuard`. - `P2pEndpoint::routing_stats()` is no longer `async` and returns `RouterStatsSnapshot` instead of `RouterStats`. - Every method on `ConnectionRouter` now takes `&self` — external callers holding a `&mut ConnectionRouter` should switch to `&self` (or `Arc<ConnectionRouter>`). Verified: fmt clean, clippy --all-targets -D warnings clean, connection_router (45) + constrained_integration (24) + p2p_endpoint (20) tests all pass. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

…-lock-send perf!: read-lock router on send via atomic selection stats

The per-connection rate limit of 5 coordination requests per 60-second window is too restrictive for bootstrap nodes acting as coordinators. On a network with 40+ NAT-restricted nodes, a client needs to send ~80 PUNCH_ME_NOW frames (2 rounds per target) during initial connection establishment. The low limit causes 50% of relay requests to be silently dropped, preventing hole punches from succeeding. Raise to 50 per minute to allow clients to hole-punch to all targets in a single burst while still providing protection against amplification attacks. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

The per-connection rate limit of 5 coordination requests per 60-second window is too restrictive for nodes acting as hole-punch coordinators. Any node can be selected as a coordinator — either a bootstrap node during initial discovery, or a regular network node chosen as a DHT referrer. On a network with many NAT-restricted nodes, a peer may need to send dozens of PUNCH_ME_NOW frames through a single coordinator during connection establishment. The low limit causes relay requests to be silently dropped, preventing hole punches from succeeding and forcing expensive MASQUE relay fallback. Observed 50% rejection rate on a 52-node testnet. Raise to 50 per minute to allow burst hole-punching while still providing protection against amplification attacks. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

…eout Adds two coordinated changes that move coordinator load off the small set of bootstrap nodes that today serve as the de-facto hole-punch coordinator for every cold-starting peer. Tier 2 — list-form preferred coordinators with rotation: - hole_punch_preferred_coordinators changes from DashMap<SocketAddr, SocketAddr> to DashMap<SocketAddr, Vec<SocketAddr>>. Callers (saorsa-core's DHT layer) supply a ranked list best-first; the existing single-coordinator setter remains as a thin wrapper. - A new merge_preferred_coordinators helper inserts the ranked list at the front of coordinator_candidates, deduping any pre-existing copies. Pure function, unit-tested. - The hole-punch loop in connect_with_fallback_inner rotates through the first N candidates (where N = preferred list length). All but the final attempt use a new short timeout (PER_COORDINATOR_QUICK_HOLEPUNCH_TIMEOUT = 1.5s) so a busy or unreachable coordinator is abandoned quickly; the final attempt uses the strategy's full hole-punch timeout to give it time to complete the punch. When no preferred list is set, the legacy single-coordinator retry behaviour is preserved. - Bounds-safe rotation via .get() instead of indexing. Tier 4 (lite) — coordinator-side back-pressure with silent refusal: - New NatTraversalConfig fields coordinator_max_active_relays (default 32) and coordinator_relay_slot_timeout (default 5s), validated in 1..=256 and 100ms..=60s respectively. - Plumbed through TransportConfig via setters mirroring allow_loopback, into NatTraversalState::new and BootstrapCoordinator::new. - BootstrapCoordinator gains a relay_slots HashMap keyed by (initiator_peer_id, target_peer_id) recording arrival Instant, plus a backpressure_refusals stat. - process_punch_me_now_frame inline-sweeps stale slots before the back-pressure check (so ghost slots from crashed peers cannot leak the counter), then either accepts and inserts/refreshes the slot or silently returns Ok(None) and increments the refusal stat. Re-arming the same (initiator, target) pair refreshes the slot timestamp without consuming additional capacity. - Only the relay branch (frames carrying target_peer_id) consumes a slot — non-relay echo frames are unchanged. The two tiers compose: a coordinator at capacity silently drops the relay; the initiator's per-attempt timeout (Tier 2) drives it to its next preferred coordinator. No new wire frame is needed, so this is backwards-compatible with older peers. Tests added: - 5 unit tests for merge_preferred_coordinators (front insertion, ordering preservation, dedup, empty handling) - 5 unit tests for BootstrapCoordinator back-pressure (under-cap accept, at-cap silent refuse, slot re-arm without capacity consumption, sweep reclaims stale slots, non-relay frames don't consume slots) Test files (security_regression_tests.rs, relay_queue_tests.rs) updated to include the two new fields in their NatTraversalConfig struct literals. This is the second of three PRs landing smarter hole-punch coordinator selection. The first (saorsa-core) added round-aware referrer ranking in the DHT layer; the third will wire saorsa-core to call the new list-form set_hole_punch_preferred_coordinators API. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

…lotTable Addresses PR #43 deep-review findings H1, H2, H3, M1, M2, L1–L6. ## H1 — node-wide scope (breaking) Back-pressure state moves out of BootstrapCoordinator into a new shared `RelaySlotTable` (`src/relay_slot_table.rs`) that every QUIC connection spawned by a NatTraversalEndpoint references via Arc. The 32-slot cap is now enforced *across all connections at the node*, not per-connection — which is what the PR description originally promised but the code didn't deliver. One table is instantiated in `create_inner_endpoint` and injected into both server-side and client-side TransportConfig. The old `TransportConfig::{coordinator_max_active_relays, coordinator_relay_slot_timeout}` fields are gone; they are replaced by `TransportConfig::relay_slot_table: Option<Arc<RelaySlotTable>>`. BREAKING: `TransportConfig::coordinator_max_active_relays` and `coordinator_relay_slot_timeout` setters removed. Callers using `NatTraversalConfig` should switch to `coordinator_max_active_relays` (unchanged) and the renamed `coordinator_relay_slot_idle_timeout`. ## H2 — explicit release on connection close `impl Drop for BootstrapCoordinator` calls `RelaySlotTable::release_for_initiator(addr)`, reclaiming every slot owned by the closed connection immediately. The idle timeout is now an honest safety net for crashes/NAT rebinds, not the only release path. The coordinator cannot directly observe hole-punch outcomes (traffic flows initiator↔target, bypassing the coordinator), so the three release mechanisms are: connection close (fast), re-arm refresh, and idle sweep. All three are documented on the config field. ## H3 — drop dead `from_peer` from slot key Slot key is now `(SocketAddr, [u8; 32])` — `(initiator_addr, target_peer_id)`. The old `from_peer = derive_peer_id_from_connection()` was a connection-id hash that was constant per BootstrapCoordinator instance, so keying on it added zero discrimination in production. The socket address is stable across rounds within a session and distinct across initiators, which is exactly what dedup needs. ## M1 — rotation / max_holepunch_rounds interaction documented New paragraph on `set_hole_punch_preferred_coordinators` explains how the rotation index and strategy round counter advance together, and the edge case where a caller raising `max_holepunch_rounds` above K gives the final coordinator extra retries. ## M2 — fields plumbed through `P2pConfig::nat` `NatConfig` gains `coordinator_max_active_relays` and `coordinator_relay_slot_idle_timeout`; `to_nat_config()` now reads from `self.nat.*` instead of hardcoding defaults, so downstream P2pConfig callers can tune the cap. ## Minor fixes - L1: misleading "Coordination completed" trace on the refusal path is replaced with an accurate "refused by node-wide back-pressure" trace. - L2: RelaySlotTable warns at the first refusal and every 16 thereafter so operators see a log line at the start of a storm. - L3: debug_assert! in the connect loop proves `coordinator_candidates[idx] == strategy_coordinator` while rotating. - L4: sweep is amortized — `sweep_if_due` runs the `retain` only if the previous sweep was at least 100ms ago, bounding per-frame cost. - L5: `merge_preferred_coordinators` builds the merged list with one allocation instead of `Vec::insert(0, ..)` in a loop (O(N+M) not O(N·M)). - L6: test helpers use `VarInt::from_u32` directly; dead `unwrap_or` fallback gone. ## Tests - 6 new `RelaySlotTable` unit tests (under-cap, at-cap refuse, re-arm, idle sweep, release_for_initiator, refusal counter). - 4 new `BootstrapCoordinator`-integration tests verifying the shared table is consulted for the relay branch, the non-relay branch doesn't consume a slot, capacity refusal is silent, and — new for H2 — that dropping the coordinator releases exactly the slots it owned while leaving other initiators' slots alone. - Old 5 per-coordinator back-pressure tests are gone (the data structure they exercised no longer exists). - Existing relay_queue_tests.rs and security_regression_tests.rs struct literals updated for the renamed field. 1485 lib tests + 24 integration tests pass; cargo fmt clean; cargo clippy --all-targets -- -D warnings -D clippy::unwrap_used -D clippy::expect_used clean. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

…otation feat(nat): rotate hole-punch coordinators with capped per-attempt timeout

On larger networks with many NAT-restricted nodes, 50 per minute is still too low during the initial bootstrap burst when all nodes are hole-punching to each other simultaneously. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

…ions 500ms was too tight for hole-punched and cross-region connections where the path includes 3+ RTTs (open_uni + data + peer ACK). Cross-continent RTTs of 200-300ms meant the identity announce frequently failed silently (fire-and-forget, never retried), leaving both peers stuck in a 15s identity exchange timeout. 5 seconds is generous enough for any real connection path while still detecting dead connections quickly (well under the 15s identity timeout). Fast profile raised proportionally (250ms -> 2.5s).

…ing it The accept loop was closing the NEWER incoming connection when a live connection to the same address existed, keeping the older one. This is backwards: the newer connection is the one the remote peer just completed a handshake on and is actively using. Closing it kills their identity exchange, causing 15s timeouts. This was the root cause of the regression from PR #43 (coordinator rotation): when the rotation opened a new connection to a coordinator that already had an old connection, the acceptor closed the new one, leaving neither side with a working connection. Fix: replace old connection with new one (consistent with add_connection which also always overwrites with the newer connection). Also reset the emitted set so the replacement gets a reader task and PeerConnected event. Defense in depth: try_hole_punch now checks the DashMap in addition to connected_peers before opening a new coordinator connection, preventing unnecessary duplicates from being created in the first place.

…ed direct reachability Port of David Irvine reachability fix (a5b9db46 from ant-quic). Key changes: - New src/reachability.rs module with ReachabilityScope, TraversalMethod, socket_addr_scope() classifier, and DIRECT_REACHABILITY_TTL (15min) - NodeStatus: has_public_ip -> has_global_address (address property, not reachability proof), added direct_reachability_scope field - can_receive_direct now requires peer-verified evidence (active direct incoming connections or fresh scope-aware timestamps) - can_help_traversal() simplified to just can_receive_direct - TraversalMethod moved from node_event to reachability module - PeerConnection gains traversal_method and side fields - P2pEvent::PeerConnected gains traversal_method field - EndpointStats tracks active_direct_incoming_connections and last_direct_{loopback,local,global}_at timestamps - detect_nat_type simplified to soft debug hint only - Removed get_nat_stats() / nat_stats() placeholder methods - link_transport Capabilities gains direct_reachability_scope

…stdoc - Remove record_direct_incoming_stats() which double-counted active_direct_incoming_connections (event_callback already handles it via spawn_connection_handler -> ConnectionEstablished) - Only set supports_relay/supports_coordination when side.is_client() (we connected to them, proving they accept inbound). A peer that connected to us only proves they can make outbound connections. - Fix all pre-existing broken rustdoc links (LinkTransport::dial -> dial_addr, UpnpMappingService/RelaySlotTable scope issues, escaped brackets in relay_server.rs) - Add missing PortPrediction test coverage

… first

The cleanup_expired_sessions() method existed but was never called periodically. Add spawn_cleanup_task() which uses a Weak<Self> reference and tokio::time::interval to reap timed-out sessions at the configured cleanup_interval (default 60s). The task stops automatically when the server Arc is dropped. Wire up the cleanup task at both relay server creation sites in nat_traversal_api.rs so it starts as soon as the node boots.

…load balancing

…flight DHT queries The accept loop was closing the old connection immediately with "superseded" when replacing it with a newer connection to the same address. This caused the remote peer to tear down all state on that connection, including in-flight DHT FindNode queries and quote requests. The lost responses forced the DHT to re-walk, triggering more connections, more supersedes, and a cascading slowdown. On a 60-node testnet with 80% port-restricted NAT, this caused: - 264 superseded connections per upload - Quote collection taking 8+ minutes (vs expected ~1.5 minutes) - 212 hole-punch timeout events Fix: instead of closing the old connection synchronously, spawn a task that waits 5 seconds before closing it. This gives in-flight operations time to complete on the old connection while new operations use the replacement. The DashMap is updated immediately so sends go on the new connection. Tested on 60-node testnet (48 NAT / 12 standard): - 0 supersede-related disruptions - Upload times back to ~1m 45s steady state - 8 consecutive 50MB uploads with 0 failures Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

…lookups saorsa-core stores hole_punch_target_peer_ids and hole_punch_preferred_coordinators under IPv4 keys (e.g. 63.177.242.27:10008), but connect_with_fallback receives IPv6-mapped addresses ([::ffff:63.177.242.27]:10008) from dual-stack sockets. The DashMap lookup misses, causing: 1. target_peer_id falls back to wire_id_from_addr (address-based ID) 2. The coordinator can't match the address-based ID against its peer connection table (which stores real ML-DSA peer IDs) 3. PUNCH_ME_NOW relay fails with "No connection found" 4. Hole-punch times out after 2 rounds → MASQUE relay fallback At 60 nodes this was masked because bootstrap coordinators (public IP) had connections to most peers and the address fallback happened to work. At 990 nodes, NAT nodes are selected as coordinators more often and the address-based wire ID never matches. Fix: normalize all addresses to plain IPv4 when inserting into and looking up from both hole-punch DashMaps, using the existing normalize_socket_addr() helper. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Six fixes for issues discovered on a 990-node testnet with 20% NAT simulation: 1. False symmetric NAT detection: is_symmetric_nat() now only checks observed addresses from known peers (bootstrap nodes), preventing relay-allocated ports from triggering false positives. 2. Relay address pollution: poll_discovery filters observations whose port doesn't match the local listen port, preventing MASQUE relay ports from being published in the DHT as the node's own address. 3. Simultaneous-connect tie-breaking: accept loop uses deterministic peer ID comparison (lower ID keeps outbound) to resolve duplicate connections from hole-punch races, preventing in-flight request loss. 4. Connection cleanup race: do_cleanup_connection checks whether the connection at an address is still live before removing, preventing a dying reader task from nuking a replacement connection. 5. Timeout reductions: hole-punch per-round 8s→3s, relay 10s→5s, MASQUE relay session establishment hard-capped at 5s. 6. Coordinator accumulation: set_hole_punch_preferred_coordinator appends referrers instead of overwriting, giving the rotation loop multiple coordinator options. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

When a coordinator receives a PUNCH_ME_NOW relay request but cannot find the target peer among its connections, it now sends a NACK frame back to the requester instead of silently dropping the request. This allows the requester to immediately rotate to the next coordinator (sub-RTT) instead of waiting for the full 3-second hole-punch timeout. On a 990-node testnet with 20% NAT, this eliminates ~5 minutes of dead time during DHT bootstrap where the client was waiting for timeouts on coordinators that didn't have the target. Implementation: - New QUIC extension frame PUNCH_ME_NOW_NACK (type 0x3d7e99) - Coordinator sends NACK via pending_relay_events back to requester - NACK flows: endpoint → high-level endpoint → channel → DashSet + Notify - try_hole_punch poll loop wakes on NACK and returns error immediately - Backward compatible: old nodes ignore unknown frame type per RFC 9000 Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

feat: NACK + NAT traversal fixes for large-scale testnets

The per-stream ACK timeout after finish() used a size-proportional budget of 1 ms per KB on top of the base send_ack_timeout. For large payloads (e.g. 4 MB chunks) on bandwidth-constrained residential connections, this budget was insufficient — a 2.8 MB chunk only got ~2.8 s of budget, totalling ~7.8 s with the 5 s base, while the actual transfer can take 10-20 s on a slow uplink. Increase the coefficient to ~4 ms/KB (~250 KB/s floor). A 2.8 MB chunk now gets ~11 s of budget, totalling ~16 s with the base timeout. This eliminates chunk PUT timeouts on slow connections while having no effect on fast connections (where the ACK arrives well before the timeout). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

feat: increase ACK timeout size budget from 1ms/KB to 4ms/KB

The bounded mpsc channel that carries completed connections from the accept loop (nat_traversal_api) to the reader task spawn path (p2p_endpoint) had a capacity of 32. When the consumer stalled briefly under write-lock contention in saorsa-core's accept loop, the channel filled and all subsequent connection handoffs blocked indefinitely — the send result was silently discarded with `let _ =`. After 15+ hours in a 1000-node testnet, this caused 1,079 connections to be accepted at the QUIC level but never forwarded to spawn reader tasks, stalling identity exchange and degrading upload times from ~175s to 358s+. Increase to 1024 to provide sufficient buffer for transient stalls. The primary fix is in saorsa-core (spawning registration work off the accept loop), but the larger buffer provides defense in depth. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

fix: increase handshake channel capacity from 32 to 1024

…-capacity" This reverts commit 7986b01, reversing changes made to f2b30ad.

…city revert: undo PR #59 handshake channel capacity increase

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

greptile-apps · 2026-04-14T21:45:42Z

+        assert_eq!(config.holepunch_timeout, Duration::from_secs(8));
+        assert_eq!(config.relay_timeout, Duration::from_secs(10));


Test-vs-implementation mismatch: CI will fail

The Default impl (line 183–184) sets holepunch_timeout = 3s and relay_timeout = 5s, but this test asserts 8s and 10s respectively. These values have never matched in this PR, so cargo test will fail.

Either the test needs to be updated to the values the Default impl actually produces, or the Default impl needs to be corrected to the intended values:

Suggested change

assert_eq!(config.holepunch_timeout, Duration::from_secs(8));

assert_eq!(config.relay_timeout, Duration::from_secs(10));

assert_eq!(config.holepunch_timeout, Duration::from_secs(3));

assert_eq!(config.relay_timeout, Duration::from_secs(5));

Prompt To Fix With AI

This is a comment left during a code review. Path: src/connection_strategy.rs Line: 489-490 Comment: **Test-vs-implementation mismatch: CI will fail** The `Default` impl (line 183–184) sets `holepunch_timeout = 3s` and `relay_timeout = 5s`, but this test asserts `8s` and `10s` respectively. These values have never matched in this PR, so `cargo test` will fail. Either the test needs to be updated to the values the `Default` impl actually produces, or the `Default` impl needs to be corrected to the intended values: ```suggestion assert_eq!(config.holepunch_timeout, Duration::from_secs(3)); assert_eq!(config.relay_timeout, Duration::from_secs(5)); ``` How can I resolve this? If you propose a fix, please make it concise.

greptile-apps · 2026-04-14T21:45:43Z

+const POST_HOLEPUNCH_DIRECT_RETRY_TIMEOUT: Duration = Duration::from_secs(1);
+
+/// Per-attempt hole-punch timeout used when rotating through a list of
+/// preferred coordinators. Kept short so a busy or unreachable coordinator
+/// is abandoned quickly and the next one in the list is tried; the *last*
+/// coordinator in the rotation falls back to the strategy's full
+/// hole-punch timeout to give it time to actually complete the punch.
+///
+/// Tuned for the Tier 2 + Tier 4 (lite) coordinator-rotation flow: 1.5s
+/// is comfortably above one round-trip on most internet links but well
+/// below the strategy default (~8s), so the worst-case wait for K
+/// preferred coordinators is roughly `(K-1) * 1.5s + 8s` instead of


Docstring references a stale 8s timeout

The comment says "the worst-case wait for K preferred coordinators is roughly (K-1) * 1.5s + 8s", but the holepunch_timeout default in StrategyConfig is now 3s (set in this PR, connection_strategy.rs line 183). The 8s figure is stale — either the default was intended to stay at 8s (in which case Default needs fixing) or the docstring should be updated to reflect 3s.

Prompt To Fix With AI

This is a comment left during a code review. Path: src/p2p_endpoint.rs Line: 93-104 Comment: **Docstring references a stale `8s` timeout** The comment says "the worst-case wait for K preferred coordinators is roughly `(K-1) * 1.5s + 8s`", but the `holepunch_timeout` default in `StrategyConfig` is now `3s` (set in this PR, `connection_strategy.rs` line 183). The `8s` figure is stale — either the default was intended to stay at 8s (in which case `Default` needs fixing) or the docstring should be updated to reflect `3s`. How can I resolve this? If you propose a fix, please make it concise.

greptile-apps · 2026-04-14T21:45:44Z

            rate_window: Duration::from_secs(60),
            coordination_requests: VecDeque::new(),
-            max_coordination_per_window: 5, // Max 5 coordination requests per 60 seconds
+            max_coordination_per_window: 300, // Max 300 coordination requests per 60 seconds


Per-connection coordination rate limit raised 60×

max_coordination_per_window went from 5 → 300 PUNCH_ME_NOW requests per 60-second window per connection. While the new RelaySlotTable provides the actual node-wide back-pressure at 32 concurrent sessions (which is the right control), this effectively neuters the per-connection DoS guard: a single misbehaving peer can now issue 300 relay-probe bursts per minute before the per-connection rate limiter fires. Consider whether the Tier-4 cap alone is sufficient or whether a more moderate per-connection limit (e.g. 60–120/min) is preferable to retain defence-in-depth.

Prompt To Fix With AI

This is a comment left during a code review. Path: src/connection/nat_traversal.rs Line: 484 Comment: **Per-connection coordination rate limit raised 60×** `max_coordination_per_window` went from 5 → 300 PUNCH_ME_NOW requests per 60-second window *per connection*. While the new `RelaySlotTable` provides the actual node-wide back-pressure at 32 concurrent sessions (which is the right control), this effectively neuters the per-connection DoS guard: a single misbehaving peer can now issue 300 relay-probe bursts per minute before the per-connection rate limiter fires. Consider whether the Tier-4 cap alone is sufficient or whether a more moderate per-connection limit (e.g. 60–120/min) is preferable to retain defence-in-depth. How can I resolve this? If you propose a fix, please make it concise.

greptile-apps · 2026-04-14T21:45:45Z

+                    // Check if the NACK is for our target
+                    if let Some(ref pid) = target_peer_id {
+                        if self.inner.consume_nack(pid) {
+                            info!(
+                                "try_hole_punch: NACK received from coordinator for target {} — aborting immediately",
+                                target
+                            );
+                            return Err(EndpointError::Connection(
+                                format!("Coordinator NACK: target {} not found", target),
+                            ));
+                        }
+                    }
+                }


Stale NACKs may prematurely abort the next coordinator rotation

When try_hole_punch is cancelled (because the outer timeout(attempt_timeout, …) fires), any NACK that arrived just before the cancellation stays unconsumed in self.inner.nack_set. On the next call to try_hole_punch (with a different coordinator), nack_notify may immediately fire and consume_nack will return true — returning Err(...) even though the new coordinator hasn't sent a NACK. Consider clearing the NACK entry for the target when rotating coordinators, or scoping NACKs by (coordinator_addr, target_peer_id) so a stale NACK from coordinator A doesn't abort coordinator B.

Prompt To Fix With AI

This is a comment left during a code review. Path: src/p2p_endpoint.rs Line: 2263-2275 Comment: **Stale NACKs may prematurely abort the next coordinator rotation** When `try_hole_punch` is cancelled (because the outer `timeout(attempt_timeout, …)` fires), any NACK that arrived just before the cancellation stays unconsumed in `self.inner.nack_set`. On the *next* call to `try_hole_punch` (with a different coordinator), `nack_notify` may immediately fire and `consume_nack` will return `true` — returning `Err(...)` even though the *new* coordinator hasn't sent a NACK. Consider clearing the NACK entry for the target when rotating coordinators, or scoping NACKs by `(coordinator_addr, target_peer_id)` so a stale NACK from coordinator A doesn't abort coordinator B. How can I resolve this? If you propose a fix, please make it concise.

github-actions · 2026-04-14T21:55:55Z

Benchmark Results

Performance Comparison

Benchmark	Baseline	Current	Change	Status

Summary

Configuration

Regression threshold: >10% slower
Improvement threshold: >10% faster
Measurements: Mean execution time

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

mickvandijke and others added 30 commits April 4, 2026 17:57

Merge pull request #38 from saorsa-labs/perf/tighten-network-timeouts

24bec69

perf: tighten network timeouts for faster connections

Merge pull request #41 from saorsa-labs/refactor/remove-dead-hole-pun…

9ba3ecc

…ch-chain refactor!: remove dead NatTraversalEndpoint hole-punch chain

Merge pull request #40 from saorsa-labs/feat/upnp-port-mapping

3d73241

feat: best-effort UPnP IGD port mapping

Merge pull request #42 from saorsa-labs/perf/router-stats-atomic-read…

6bfd936

…-lock-send perf!: read-lock router on send via atomic selection stats

mickvandijke and others added 22 commits April 7, 2026 01:15

Merge pull request #43 from saorsa-labs/feat/hole-punch-coordinator-r…

2441896

…otation feat(nat): rotate hole-punch coordinators with capped per-attempt timeout

fix(relay): rotate through all relay candidates instead of failing on…

6f139c7

… first

fix(relay): return socket on session reuse instead of None

e44b25e

fix(nat): quality-aware coordinator selection with RTT weighting and …

44b4662

…load balancing

Merge pull request #56 from jacderida/round4-combined

4587a33

feat: NACK + NAT traversal fixes for large-scale testnets

Merge pull request #58 from saorsa-labs/feat-upload-timeouts

f2b30ad

feat: increase ACK timeout size budget from 1ms/KB to 4ms/KB

Merge pull request #59 from saorsa-labs/fix/handshake-channel-capacity

7986b01

fix: increase handshake channel capacity from 32 to 1024

Revert "Merge pull request #59 from saorsa-labs/fix/handshake-channel…

66a1c6f

…-capacity" This reverts commit 7986b01, reversing changes made to f2b30ad.

Merge pull request #60 from saorsa-labs/revert/handshake-channel-capa…

891efbc

…city revert: undo PR #59 handshake channel capacity increase

chore: bump version to 0.32.0

49a13c4

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

greptile-apps Bot reviewed Apr 14, 2026

View reviewed changes

fix: apply rustfmt formatting and fix test_default_config assertion

a6cc9fb

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

jacderida force-pushed the rc-2026.4.1 branch from 19fbdd1 to a6cc9fb Compare April 14, 2026 21:58

jacderida merged commit c34424e into main Apr 14, 2026
47 of 50 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

release: v0.32.0#61

release: v0.32.0#61
jacderida merged 53 commits into
mainfrom
rc-2026.4.1

jacderida commented Apr 14, 2026 •

edited by greptile-apps Bot

Loading

Uh oh!

greptile-apps Bot Apr 14, 2026

Uh oh!

greptile-apps Bot Apr 14, 2026

Uh oh!

greptile-apps Bot Apr 14, 2026

Uh oh!

greptile-apps Bot Apr 14, 2026

Uh oh!

github-actions Bot commented Apr 14, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

		assert_eq!(config.holepunch_timeout, Duration::from_secs(8));
		assert_eq!(config.relay_timeout, Duration::from_secs(10));

Conversation

jacderida commented Apr 14, 2026 • edited by greptile-apps Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Key changes

Test plan

Greptile Summary

Confidence Score: 4/5

Important Files Changed

Flowchart

Uh oh!

greptile-apps Bot Apr 14, 2026

Choose a reason for hiding this comment

Uh oh!

greptile-apps Bot Apr 14, 2026

Choose a reason for hiding this comment

Uh oh!

greptile-apps Bot Apr 14, 2026

Choose a reason for hiding this comment

Uh oh!

greptile-apps Bot Apr 14, 2026

Choose a reason for hiding this comment

Uh oh!

github-actions Bot commented Apr 14, 2026

Benchmark Results

Performance Comparison

Summary

Configuration

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

jacderida commented Apr 14, 2026 •

edited by greptile-apps Bot

Loading