Skip to content

perf: tighten network timeouts for faster connections#38

Merged
jacderida merged 5 commits into
rc-2026.4.1from
perf/tighten-network-timeouts
Apr 4, 2026
Merged

perf: tighten network timeouts for faster connections#38
jacderida merged 5 commits into
rc-2026.4.1from
perf/tighten-network-timeouts

Conversation

@mickvandijke
Copy link
Copy Markdown
Collaborator

@mickvandijke mickvandijke commented Apr 4, 2026

Summary

  • Reduce connection strategy timeouts based on 300ms worst-case RTT analysis: direct 5s→2s, hole-punch 15s→3s, relay 30s→10s, max rounds 3→2
  • Remove redundant hardcoded 15s deadline in try_hole_punch; the outer strategy.holepunch_timeout() is now the single source of truth
  • Reduce post-hole-punch direct retry from 3s→1s and default send ACK timeout from 1s→500ms
  • Worst-case connection path drops from 89s to ~21s

Timeout changes

Timeout Old New Rationale
ipv4_timeout 5s 2s 1-2 RTTs + margin
ipv6_timeout 5s 2s 1-2 RTTs + margin
holepunch_timeout 15s 3s 3 RTTs + margin, higher-level retries handle loss
relay_timeout 30s 10s 4 RTTs through relay + setup
max_holepunch_rounds 3 2 If 2 coordinators fail, NAT is incompatible
POST_HOLEPUNCH_DIRECT_RETRY_TIMEOUT 3s 1s Warm NAT pinhole, just 1-2 RTTs
DEFAULT_SEND_ACK_TIMEOUT 1s 500ms 1 RTT + processing

Test plan

  • cargo test --lib connection_strategy — 16 passed
  • cargo test --lib happy_eyeballs — 15 passed
  • cargo fmt --all -- --check — clean
  • cargo clippy --all-targets --all-features -- -D warnings — clean
  • E2E testnet validation with mixed NAT nodes across regions

🤖 Generated with Claude Code

Greptile Summary

This PR tightens all NAT traversal timeouts (direct: 5s→2s, hole-punch: 15s→3s, relay: 30s→10s, max rounds: 3→2) and removes the redundant hardcoded 15s deadline inside try_hole_punch, making strategy.holepunch_timeout() the single source of truth. The worst-case connection path drops from ~89s to ~21s.

Confidence Score: 5/5

Safe to merge — all findings are P2 style/cleanup and do not affect correctness.

The timeout reductions are well-reasoned with a clear RTT-based rationale, unit tests pass, and the removal of the duplicate internal deadline in try_hole_punch is a clean simplification. The only issue is FAST_SEND_ACK_TIMEOUT being a now-identical copy of DEFAULT_SEND_ACK_TIMEOUT, which is a minor dead-code concern with no runtime impact.

src/config/nat_timeouts.rsFAST_SEND_ACK_TIMEOUT should be differentiated from DEFAULT_SEND_ACK_TIMEOUT or removed.

Important Files Changed

Filename Overview
src/config/nat_timeouts.rs Lowered DEFAULT_SEND_ACK_TIMEOUT 1s→500ms; FAST_SEND_ACK_TIMEOUT is now an identical dead duplicate at 500ms.
src/connection_strategy.rs Default timeouts tightened (IPv4/6: 5s→2s, holepunch: 15s→3s, relay: 30s→10s, max_rounds: 3→2); unit tests updated and passing.
src/p2p_endpoint.rs Removed redundant 15s internal deadline from try_hole_punch; POST_HOLEPUNCH_DIRECT_RETRY_TIMEOUT reduced 3s→1s; outer strategy.holepunch_timeout() is now the single source of truth.

Sequence Diagram

sequenceDiagram
    participant C as Caller
    participant S as ConnectionStrategy
    participant E as P2pEndpoint
    participant Coord as Coordinator
    participant T as Target

    C->>S: new(StrategyConfig::default())
    note over S: DirectIPv4 stage

    C->>E: connect(target_ipv4)
    note over E: timeout = 2s (was 5s)
    E-->>C: Err (timeout)
    C->>S: transition_to_ipv6()

    C->>E: connect(target_ipv6)
    note over E: timeout = 2s (was 5s)
    E-->>C: Err (timeout)
    C->>S: transition_to_holepunch()

    loop max 2 rounds (was 3)
        C->>E: try_hole_punch(target, coordinator)
        note over E: outer timeout = 3s (was 15s)
        E->>Coord: initiate_nat_traversal / PUNCH_ME_NOW
        Coord->>T: relay PUNCH_ME_NOW
        T->>E: QUIC handshake (PQC ML-KEM-768)
        alt connection detected
            E-->>C: Ok(PeerConnection)
        else timeout or error
            C->>E: connect(target)
            note over E: POST_HOLEPUNCH_DIRECT_RETRY = 1s (was 3s)
            E-->>C: Ok or Err
            S->>S: increment_round / transition_to_relay
        end
    end

    C->>E: try_relay_connection(target, relay)
    note over E: relay_timeout = 10s (was 30s)
    E-->>C: Ok(PeerConnection) or Err

    note over C,T: Worst-case: 2+2+3+1+3+1+10 ≈ 22s (was ~89s)
Loading
Prompt To Fix All With AI
This is a comment left during a code review.
Path: src/config/nat_timeouts.rs
Line: 134-138

Comment:
**Duplicate constants — `FAST_SEND_ACK_TIMEOUT` is now a dead copy of `DEFAULT_SEND_ACK_TIMEOUT`**

After lowering `DEFAULT_SEND_ACK_TIMEOUT` to 500 ms, both constants hold the same value, so `TimeoutConfig::fast()` and `TimeoutConfig::default()` produce identical `send_ack_timeout`s. The "fast" preset no longer offers any improvement for this field. Either differentiate the two (e.g. lower `FAST_SEND_ACK_TIMEOUT` to 250 ms to match the `NatTraversalTimeouts::fast()` profile's halved intervals), or remove `FAST_SEND_ACK_TIMEOUT` and reuse `DEFAULT_SEND_ACK_TIMEOUT` directly in `TimeoutConfig::fast()`.

```suggestion
/// Default time to wait for the peer to acknowledge stream data after a send.
const DEFAULT_SEND_ACK_TIMEOUT: Duration = Duration::from_millis(500);

/// Fast-network send ACK timeout (halved relative to default).
const FAST_SEND_ACK_TIMEOUT: Duration = Duration::from_millis(250);
```

How can I resolve this? If you propose a fix, please make it concise.

Reviews (1): Last reviewed commit: "perf: tighten network timeouts for faste..." | Re-trigger Greptile

Greptile also left 1 inline comment on this PR.

mickvandijke and others added 4 commits April 4, 2026 17:14
…ck-methods

refactor: remove dead connect_dual_stack and try_connect_family methods
Tests were using ConnectionHealth enum and connection_health() method
that were removed from PeerConnection/Node. Rewrote health tests to use
is_connected() and removed stale health ping/pong fields from struct
literals.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…-health-refs

fix: remove references to deleted ConnectionHealth API in tests
Reduce connection strategy timeouts based on RTT analysis (300ms worst-case
cross-globe RTT). The previous values were overly conservative, causing up
to 89s worst-case connection time to unreachable peers.

Changes:
- Direct connection: 5s → 2s (1-2 RTTs + margin)
- Hole-punch round: 15s → 3s (3 RTTs + margin)
- Max hole-punch rounds: 3 → 2
- Relay establishment: 30s → 10s
- Post-hole-punch direct retry: 3s → 1s (warm NAT pinhole)
- Default send ACK timeout: 1s → 500ms
- Remove redundant hardcoded 15s deadline in try_hole_punch; outer
  strategy.holepunch_timeout() is now the single source of truth

Worst-case connection path drops from 89s to ~21s.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Comment thread src/config/nat_timeouts.rs Outdated
@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented Apr 4, 2026

Benchmark Results

Performance Comparison

Benchmark Baseline Current Change Status

Summary

Configuration

  • Regression threshold: >10% slower
  • Improvement threshold: >10% faster
  • Measurements: Mean execution time

After lowering DEFAULT_SEND_ACK_TIMEOUT to 500ms, both constants held
the same value, making TimeoutConfig::fast() identical to default for
send_ack_timeout. Lower the fast variant to 250ms to match the
halved-interval pattern used throughout the fast profile.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@jacderida jacderida changed the base branch from main to rc-2026.4.1 April 4, 2026 16:40
@jacderida jacderida merged commit 24bec69 into rc-2026.4.1 Apr 4, 2026
43 of 44 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants