Skip to content

perf: tighten network timeouts for faster connections#68

Merged
jacderida merged 5 commits into
rc-2026.4.1from
perf/tighten-network-timeouts
Apr 4, 2026
Merged

perf: tighten network timeouts for faster connections#68
jacderida merged 5 commits into
rc-2026.4.1from
perf/tighten-network-timeouts

Conversation

@mickvandijke
Copy link
Copy Markdown
Collaborator

@mickvandijke mickvandijke commented Apr 4, 2026

Summary

Timeout changes

Timeout Old New Rationale
DIAL_TIMEOUT 90s 25s Sum of stages: direct (2s) + 2×holepunch (3s+1s) + relay (10s) + margin
DEFAULT_CONNECTION_TIMEOUT_SECS 90s 25s Matches DIAL_TIMEOUT
SEND_TIMEOUT 15s 10s 3x margin over 4MB at 10Mbps
REQUEST_TIMEOUT 30s 10s Aligned with libp2p/BEP 5 practice
IDENTITY_EXCHANGE_TIMEOUT 10s 5s 1-2 RTTs + margin
BOOTSTRAP_IDENTITY_TIMEOUT_SECS 10s 5s Matches identity exchange

Test plan

  • cargo check — clean
  • cargo fmt --all -- --check — clean
  • cargo clippy --lib --all-features -- -D warnings — clean
  • E2E testnet validation with mixed NAT nodes across regions

🤖 Generated with Claude Code

Greptile Summary

This PR reduces six network timeouts across the transport adapter, DHT network manager, and node configuration — cutting dial/connection timeouts from 90s→25s, send timeout from 15s→10s, and request/identity timeouts from 30s→10s and 10s→5s. The rationale (sum-of-stages analysis: direct 2s + hole-punch 3+1s + relay 10s = ~20s, 25s with margin) is well-documented and the approach is sound for the majority of changes. However, SEND_TIMEOUT in send_to_peer_raw presents a concrete issue: it wraps both dial_addr and the data transfer in the same 10s budget, while the companion DIAL_TIMEOUT analysis documents that relay connections alone take ~10s, leaving no time for data writes in relay-only NAT scenarios.

  • DIAL_TIMEOUT 90s→25s: Well-justified from stage analysis; consistent with DEFAULT_CONNECTION_TIMEOUT_SECS.
  • SEND_TIMEOUT 15s→10s: ⚠️ The 10s budget covers both the dial_addr call (up to ~20s for relay) and the actual transfer — relay-only peers will almost certainly time out before any data is sent.
  • REQUEST_TIMEOUT 30s→10s: Reasonable tightening for DoS-protection on message handlers.
  • IDENTITY_EXCHANGE_TIMEOUT 10s→5s: Safe; the min(config.request_timeout, IDENTITY_EXCHANGE_TIMEOUT) guard still correctly applies the 5s cap.
  • DEFAULT_CONNECTION_TIMEOUT_SECS 90s→25s and BOOTSTRAP_IDENTITY_TIMEOUT_SECS 10s→5s: Well-reasoned, no concerns.
  • E2E testnet validation with mixed NAT nodes is still pending in the PR checklist."

Confidence Score: 3/5

Mostly safe to merge, but SEND_TIMEOUT in send_to_peer_raw is too tight for relay paths and will cause silent send failures to NAT'd peers requiring relay connections

Five of six timeout changes are well-reasoned and consistent with the documented stage-sum analysis. The SEND_TIMEOUT change introduces a concrete regression: by wrapping dial_addr (which can take ~20s in relay scenarios) inside a 10s budget, sends to relay-only peers will reliably time out. The E2E testnet validation across mixed NAT nodes is also still pending. These two factors lower confidence from what would otherwise be a 4-5.

src/transport/saorsa_transport_adapter.rs — specifically the SEND_TIMEOUT constant and send_to_peer_raw function; verify whether saorsa-transport's dial_addr reuses existing connections before applying the new timeout

Important Files Changed

Filename Overview
src/transport/saorsa_transport_adapter.rs DIAL_TIMEOUT 90s→25s is correct; SEND_TIMEOUT 15s→10s is too aggressive — the 10s budget encompasses the full dial_addr call which can itself take ~20s for relay paths, causing systematic timeouts for relay-only peers
src/dht_network_manager.rs REQUEST_TIMEOUT 30s→10s and IDENTITY_EXCHANGE_TIMEOUT 10s→5s are safe; min() guard logic is preserved; DEFAULT_REQUEST_TIMEOUT_SECS (30s) is unchanged and may warrant review for consistency
src/network.rs DEFAULT_CONNECTION_TIMEOUT_SECS 90s→25s and BOOTSTRAP_IDENTITY_TIMEOUT_SECS 10s→5s are well-justified, consistent with DIAL_TIMEOUT rationale, and correctly wired into NodeConfig defaults

Sequence Diagram

sequenceDiagram
    participant Caller
    participant send_to_peer_raw
    participant dial_addr
    participant Transport

    Caller->>send_to_peer_raw: send(addr, data)
    Note over send_to_peer_raw: SEND_TIMEOUT = 10s (entire block)
    send_to_peer_raw->>dial_addr: dial_addr(addr)
    dial_addr->>Transport: direct attempt (~2s)
    alt direct succeeds
        Transport-->>dial_addr: conn ✓
    else hole-punch
        dial_addr->>Transport: hole-punch rounds (~3s + 1s)
        alt hole-punch succeeds
            Transport-->>dial_addr: conn ✓
        else relay fallback
            dial_addr->>Transport: relay connection (~10s)
            Note over dial_addr,Transport: Relay alone ≈ 10s = SEND_TIMEOUT ⚠️
            Transport-->>dial_addr: conn (or timeout ❌)
        end
    end
    dial_addr-->>send_to_peer_raw: conn
    send_to_peer_raw->>Transport: open_uni_typed + write_all + finish
    Transport-->>send_to_peer_raw: Ok(())
    send_to_peer_raw-->>Caller: Ok(()) or Elapsed timeout error
Loading

Comments Outside Diff (2)

  1. src/transport/saorsa_transport_adapter.rs, line 520-558 (link)

    P1 SEND_TIMEOUT too tight to cover relay dial + data transfer

    send_to_peer_raw wraps both dial_addr and the subsequent stream write inside the same 10-second budget. According to the updated comment on DIAL_TIMEOUT in connect_to_peer (lines 440–443), the full NAT-traversal flow is:

    direct (2s) + 2 × hole-punch rounds (3s + 1s retry each) + relay (10s) ≈ 20s

    The relay fallback alone is documented as 10s, which exactly equals the new SEND_TIMEOUT. For a fresh relay connection to a NAT'd peer, dial_addr can consume the entire 10-second window before a single byte of data is written — meaning send_to_peer_raw will reliably time out in the worst-case relay scenario.

    connect_to_peer correctly budgets 25s for the same dial path via DIAL_TIMEOUT. send_to_peer_raw invokes the same transport but has only 10s total for dial + transfer. Consider splitting the timeout budget:

    Prompt To Fix With AI
    This is a comment left during a code review.
    Path: src/transport/saorsa_transport_adapter.rs
    Line: 520-558
    
    Comment:
    **`SEND_TIMEOUT` too tight to cover relay dial + data transfer**
    
    `send_to_peer_raw` wraps both `dial_addr` *and* the subsequent stream write inside the same 10-second budget. According to the updated comment on `DIAL_TIMEOUT` in `connect_to_peer` (lines 440–443), the full NAT-traversal flow is:
    
    > direct (2s) + 2 × hole-punch rounds (3s + 1s retry each) + relay (10s) ≈ 20s
    
    The relay fallback alone is documented as **10s**, which exactly equals the new `SEND_TIMEOUT`. For a fresh relay connection to a NAT'd peer, `dial_addr` can consume the entire 10-second window before a single byte of data is written — meaning `send_to_peer_raw` will reliably time out in the worst-case relay scenario.
    
    `connect_to_peer` correctly budgets 25s for the same dial path via `DIAL_TIMEOUT`. `send_to_peer_raw` invokes the same transport but has only 10s total for dial + transfer. Consider splitting the timeout budget:
    
    
    
    How can I resolve this? If you propose a fix, please make it concise.
  2. src/dht_network_manager.rs, line 2539-2541 (link)

    P2 DEFAULT_REQUEST_TIMEOUT_SECS not updated alongside REQUEST_TIMEOUT

    The module-level REQUEST_TIMEOUT (DoS-protection timeout on DHT message handlers) was reduced from 30s → 10s. However, DEFAULT_REQUEST_TIMEOUT_SECS still defaults to 30s and feeds DhtNetworkConfig::default().request_timeout, which drives:

    • response_timeout in wait_for_dht_response (30s)
    • The upper bound of dial_timeout = transport.connection_timeout().min(config.request_timeout)

    These two constants serve different roles so diverging values may be intentional, but given the PR's stated goal of a 10s request budget it is worth confirming DEFAULT_REQUEST_TIMEOUT_SECS should remain at 30s.

    Prompt To Fix With AI
    This is a comment left during a code review.
    Path: src/dht_network_manager.rs
    Line: 2539-2541
    
    Comment:
    **`DEFAULT_REQUEST_TIMEOUT_SECS` not updated alongside `REQUEST_TIMEOUT`**
    
    The module-level `REQUEST_TIMEOUT` (DoS-protection timeout on DHT message handlers) was reduced from 30s → 10s. However, `DEFAULT_REQUEST_TIMEOUT_SECS` still defaults to **30s** and feeds `DhtNetworkConfig::default().request_timeout`, which drives:
    - `response_timeout` in `wait_for_dht_response` (30s)
    - The upper bound of `dial_timeout = transport.connection_timeout().min(config.request_timeout)`
    
    These two constants serve different roles so diverging values may be intentional, but given the PR's stated goal of a 10s request budget it is worth confirming `DEFAULT_REQUEST_TIMEOUT_SECS` should remain at 30s.
    
    How can I resolve this? If you propose a fix, please make it concise.
Prompt To Fix All With AI
This is a comment left during a code review.
Path: src/transport/saorsa_transport_adapter.rs
Line: 520-558

Comment:
**`SEND_TIMEOUT` too tight to cover relay dial + data transfer**

`send_to_peer_raw` wraps both `dial_addr` *and* the subsequent stream write inside the same 10-second budget. According to the updated comment on `DIAL_TIMEOUT` in `connect_to_peer` (lines 440–443), the full NAT-traversal flow is:

> direct (2s) + 2 × hole-punch rounds (3s + 1s retry each) + relay (10s) ≈ 20s

The relay fallback alone is documented as **10s**, which exactly equals the new `SEND_TIMEOUT`. For a fresh relay connection to a NAT'd peer, `dial_addr` can consume the entire 10-second window before a single byte of data is written — meaning `send_to_peer_raw` will reliably time out in the worst-case relay scenario.

`connect_to_peer` correctly budgets 25s for the same dial path via `DIAL_TIMEOUT`. `send_to_peer_raw` invokes the same transport but has only 10s total for dial + transfer. Consider splitting the timeout budget:

```suggestion
        const DIAL_PHASE_TIMEOUT: Duration = Duration::from_secs(25);
        const SEND_TIMEOUT: Duration = Duration::from_secs(30);

        tokio::time::timeout(SEND_TIMEOUT, async {
            let conn = tokio::time::timeout(
                DIAL_PHASE_TIMEOUT,
                self.transport.dial_addr(*addr, SAORSA_DHT_PROTOCOL),
            )
            .await
            .map_err(|_| anyhow::anyhow!("Dial timed out after {DIAL_PHASE_TIMEOUT:?} to {addr}"))?
            .map_err(|e| anyhow::anyhow!("Dial by address failed: {}", e))?;
```

How can I resolve this? If you propose a fix, please make it concise.

---

This is a comment left during a code review.
Path: src/dht_network_manager.rs
Line: 2539-2541

Comment:
**`DEFAULT_REQUEST_TIMEOUT_SECS` not updated alongside `REQUEST_TIMEOUT`**

The module-level `REQUEST_TIMEOUT` (DoS-protection timeout on DHT message handlers) was reduced from 30s → 10s. However, `DEFAULT_REQUEST_TIMEOUT_SECS` still defaults to **30s** and feeds `DhtNetworkConfig::default().request_timeout`, which drives:
- `response_timeout` in `wait_for_dht_response` (30s)
- The upper bound of `dial_timeout = transport.connection_timeout().min(config.request_timeout)`

These two constants serve different roles so diverging values may be intentional, but given the PR's stated goal of a 10s request budget it is worth confirming `DEFAULT_REQUEST_TIMEOUT_SECS` should remain at 30s.

How can I resolve this? If you propose a fix, please make it concise.

Reviews (1): Last reviewed commit: "perf: tighten network timeouts for faste..." | Re-trigger Greptile

Reduce timeouts based on RTT analysis (300ms worst-case cross-globe RTT).
Previous values caused excessive delays when peers were unreachable.

Changes:
- DIAL_TIMEOUT: 90s → 25s (derived from strategy stage sum + margin)
- DEFAULT_CONNECTION_TIMEOUT_SECS: 90s → 25s (matches DIAL_TIMEOUT)
- SEND_TIMEOUT: 15s → 10s (3x margin over 4MB at 10Mbps)
- REQUEST_TIMEOUT: 30s → 10s (aligned with libp2p/BEP 5 practice)
- IDENTITY_EXCHANGE_TIMEOUT: 10s → 5s (1-2 RTTs + margin)
- BOOTSTRAP_IDENTITY_TIMEOUT_SECS: 10s → 5s (matches identity exchange)

Companion to saorsa-transport timeout tightening.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings April 4, 2026 15:58
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Reduces several network operation timeouts to shorten failure detection and improve overall connection/send responsiveness (especially when peers/addresses are stale).

Changes:

  • Reduced peer dial timeout in the transport adapter (90s → 25s).
  • Reduced raw send path timeout in the transport adapter (15s → 10s).
  • Reduced default node connection timeout and DHT handler / identity-exchange timeouts (e.g., 90s → 25s; 30s → 10s; 10s → 5s).

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 1 comment.

File Description
src/transport/saorsa_transport_adapter.rs Tightens dial and send operation timeouts used by the adapter APIs.
src/network.rs Lowers the default node connection timeout and bootstrap identity exchange timeout.
src/dht_network_manager.rs Lowers DHT handler request timeout and identity exchange timeout caps.

Comment thread src/network.rs
mickvandijke and others added 3 commits April 4, 2026 18:21
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
send_to_peer_raw calls the same dial_addr path as connect_to_peer but
had only 10s — less than the relay fallback alone. Bump to 25s to match.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Aligns with the tighter timeout profile while staying above the relay
stage (~10s) so dial_candidate's NAT traversal cascade is not truncated.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@mickvandijke mickvandijke changed the base branch from main to rc-2026.4.1 April 4, 2026 16:35
Point saorsa-transport dependency at the saorsa-labs Git repo on the
rc-2026.4.1 release-candidate branch instead of crates.io.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@jacderida jacderida merged commit dcfc79e into rc-2026.4.1 Apr 4, 2026
2 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants