perf: tighten network timeouts for faster connections by mickvandijke · Pull Request #68 · WithAutonomi/saorsa-core

mickvandijke · 2026-04-04T15:58:37Z

Summary

Reduce dial, send, DHT, and identity exchange timeouts based on 300ms worst-case RTT analysis
DIAL_TIMEOUT and DEFAULT_CONNECTION_TIMEOUT_SECS reduced from 90s→25s (derived from sum of connection strategy stages)
Companion PR: fix: remove references to deleted ConnectionHealth API in tests saorsa-transport#37 (transport-layer timeout changes)

Timeout changes

Timeout	Old	New	Rationale
`DIAL_TIMEOUT`	90s	25s	Sum of stages: direct (2s) + 2×holepunch (3s+1s) + relay (10s) + margin
`DEFAULT_CONNECTION_TIMEOUT_SECS`	90s	25s	Matches DIAL_TIMEOUT
`SEND_TIMEOUT`	15s	10s	3x margin over 4MB at 10Mbps
`REQUEST_TIMEOUT`	30s	10s	Aligned with libp2p/BEP 5 practice
`IDENTITY_EXCHANGE_TIMEOUT`	10s	5s	1-2 RTTs + margin
`BOOTSTRAP_IDENTITY_TIMEOUT_SECS`	10s	5s	Matches identity exchange

Test plan

cargo check — clean
cargo fmt --all -- --check — clean
cargo clippy --lib --all-features -- -D warnings — clean
E2E testnet validation with mixed NAT nodes across regions

🤖 Generated with Claude Code

Greptile Summary

This PR reduces six network timeouts across the transport adapter, DHT network manager, and node configuration — cutting dial/connection timeouts from 90s→25s, send timeout from 15s→10s, and request/identity timeouts from 30s→10s and 10s→5s. The rationale (sum-of-stages analysis: direct 2s + hole-punch 3+1s + relay 10s = ~20s, 25s with margin) is well-documented and the approach is sound for the majority of changes. However, SEND_TIMEOUT in send_to_peer_raw presents a concrete issue: it wraps both dial_addr and the data transfer in the same 10s budget, while the companion DIAL_TIMEOUT analysis documents that relay connections alone take ~10s, leaving no time for data writes in relay-only NAT scenarios.

DIAL_TIMEOUT 90s→25s: Well-justified from stage analysis; consistent with DEFAULT_CONNECTION_TIMEOUT_SECS.
SEND_TIMEOUT 15s→10s: ⚠️ The 10s budget covers both the dial_addr call (up to ~20s for relay) and the actual transfer — relay-only peers will almost certainly time out before any data is sent.
REQUEST_TIMEOUT 30s→10s: Reasonable tightening for DoS-protection on message handlers.
IDENTITY_EXCHANGE_TIMEOUT 10s→5s: Safe; the min(config.request_timeout, IDENTITY_EXCHANGE_TIMEOUT) guard still correctly applies the 5s cap.
DEFAULT_CONNECTION_TIMEOUT_SECS 90s→25s and BOOTSTRAP_IDENTITY_TIMEOUT_SECS 10s→5s: Well-reasoned, no concerns.
E2E testnet validation with mixed NAT nodes is still pending in the PR checklist."

Confidence Score: 3/5

Mostly safe to merge, but SEND_TIMEOUT in send_to_peer_raw is too tight for relay paths and will cause silent send failures to NAT'd peers requiring relay connections

Five of six timeout changes are well-reasoned and consistent with the documented stage-sum analysis. The SEND_TIMEOUT change introduces a concrete regression: by wrapping dial_addr (which can take ~20s in relay scenarios) inside a 10s budget, sends to relay-only peers will reliably time out. The E2E testnet validation across mixed NAT nodes is also still pending. These two factors lower confidence from what would otherwise be a 4-5.

src/transport/saorsa_transport_adapter.rs — specifically the SEND_TIMEOUT constant and send_to_peer_raw function; verify whether saorsa-transport's dial_addr reuses existing connections before applying the new timeout

Important Files Changed

Filename	Overview
src/transport/saorsa_transport_adapter.rs	DIAL_TIMEOUT 90s→25s is correct; SEND_TIMEOUT 15s→10s is too aggressive — the 10s budget encompasses the full dial_addr call which can itself take ~20s for relay paths, causing systematic timeouts for relay-only peers
src/dht_network_manager.rs	REQUEST_TIMEOUT 30s→10s and IDENTITY_EXCHANGE_TIMEOUT 10s→5s are safe; min() guard logic is preserved; DEFAULT_REQUEST_TIMEOUT_SECS (30s) is unchanged and may warrant review for consistency
src/network.rs	DEFAULT_CONNECTION_TIMEOUT_SECS 90s→25s and BOOTSTRAP_IDENTITY_TIMEOUT_SECS 10s→5s are well-justified, consistent with DIAL_TIMEOUT rationale, and correctly wired into NodeConfig defaults

Sequence Diagram

sequenceDiagram
    participant Caller
    participant send_to_peer_raw
    participant dial_addr
    participant Transport

    Caller->>send_to_peer_raw: send(addr, data)
    Note over send_to_peer_raw: SEND_TIMEOUT = 10s (entire block)
    send_to_peer_raw->>dial_addr: dial_addr(addr)
    dial_addr->>Transport: direct attempt (~2s)
    alt direct succeeds
        Transport-->>dial_addr: conn ✓
    else hole-punch
        dial_addr->>Transport: hole-punch rounds (~3s + 1s)
        alt hole-punch succeeds
            Transport-->>dial_addr: conn ✓
        else relay fallback
            dial_addr->>Transport: relay connection (~10s)
            Note over dial_addr,Transport: Relay alone ≈ 10s = SEND_TIMEOUT ⚠️
            Transport-->>dial_addr: conn (or timeout ❌)
        end
    end
    dial_addr-->>send_to_peer_raw: conn
    send_to_peer_raw->>Transport: open_uni_typed + write_all + finish
    Transport-->>send_to_peer_raw: Ok(())
    send_to_peer_raw-->>Caller: Ok(()) or Elapsed timeout error

Comments Outside Diff (2)

src/transport/saorsa_transport_adapter.rs, line 520-558 (link)

SEND_TIMEOUT too tight to cover relay dial + data transfer

send_to_peer_raw wraps both dial_addr and the subsequent stream write inside the same 10-second budget. According to the updated comment on DIAL_TIMEOUT in connect_to_peer (lines 440–443), the full NAT-traversal flow is:

direct (2s) + 2 × hole-punch rounds (3s + 1s retry each) + relay (10s) ≈ 20s

The relay fallback alone is documented as 10s, which exactly equals the new SEND_TIMEOUT. For a fresh relay connection to a NAT'd peer, dial_addr can consume the entire 10-second window before a single byte of data is written — meaning send_to_peer_raw will reliably time out in the worst-case relay scenario.

connect_to_peer correctly budgets 25s for the same dial path via DIAL_TIMEOUT. send_to_peer_raw invokes the same transport but has only 10s total for dial + transfer. Consider splitting the timeout budget:

Prompt To Fix With AI

This is a comment left during a code review.
Path: src/transport/saorsa_transport_adapter.rs
Line: 520-558

Comment:
**`SEND_TIMEOUT` too tight to cover relay dial + data transfer**

`send_to_peer_raw` wraps both `dial_addr` *and* the subsequent stream write inside the same 10-second budget. According to the updated comment on `DIAL_TIMEOUT` in `connect_to_peer` (lines 440–443), the full NAT-traversal flow is:

> direct (2s) + 2 × hole-punch rounds (3s + 1s retry each) + relay (10s) ≈ 20s

The relay fallback alone is documented as **10s**, which exactly equals the new `SEND_TIMEOUT`. For a fresh relay connection to a NAT'd peer, `dial_addr` can consume the entire 10-second window before a single byte of data is written — meaning `send_to_peer_raw` will reliably time out in the worst-case relay scenario.

`connect_to_peer` correctly budgets 25s for the same dial path via `DIAL_TIMEOUT`. `send_to_peer_raw` invokes the same transport but has only 10s total for dial + transfer. Consider splitting the timeout budget:



How can I resolve this? If you propose a fix, please make it concise.

src/dht_network_manager.rs, line 2539-2541 (link)

DEFAULT_REQUEST_TIMEOUT_SECS not updated alongside REQUEST_TIMEOUT

The module-level REQUEST_TIMEOUT (DoS-protection timeout on DHT message handlers) was reduced from 30s → 10s. However, DEFAULT_REQUEST_TIMEOUT_SECS still defaults to 30s and feeds DhtNetworkConfig::default().request_timeout, which drives:

response_timeout in wait_for_dht_response (30s)
The upper bound of dial_timeout = transport.connection_timeout().min(config.request_timeout)

These two constants serve different roles so diverging values may be intentional, but given the PR's stated goal of a 10s request budget it is worth confirming DEFAULT_REQUEST_TIMEOUT_SECS should remain at 30s.

Prompt To Fix With AI

This is a comment left during a code review.
Path: src/dht_network_manager.rs
Line: 2539-2541

Comment:
**`DEFAULT_REQUEST_TIMEOUT_SECS` not updated alongside `REQUEST_TIMEOUT`**

The module-level `REQUEST_TIMEOUT` (DoS-protection timeout on DHT message handlers) was reduced from 30s → 10s. However, `DEFAULT_REQUEST_TIMEOUT_SECS` still defaults to **30s** and feeds `DhtNetworkConfig::default().request_timeout`, which drives:
- `response_timeout` in `wait_for_dht_response` (30s)
- The upper bound of `dial_timeout = transport.connection_timeout().min(config.request_timeout)`

These two constants serve different roles so diverging values may be intentional, but given the PR's stated goal of a 10s request budget it is worth confirming `DEFAULT_REQUEST_TIMEOUT_SECS` should remain at 30s.

How can I resolve this? If you propose a fix, please make it concise.

Prompt To Fix All With AI

This is a comment left during a code review.
Path: src/transport/saorsa_transport_adapter.rs
Line: 520-558

Comment:
**`SEND_TIMEOUT` too tight to cover relay dial + data transfer**

`send_to_peer_raw` wraps both `dial_addr` *and* the subsequent stream write inside the same 10-second budget. According to the updated comment on `DIAL_TIMEOUT` in `connect_to_peer` (lines 440–443), the full NAT-traversal flow is:

> direct (2s) + 2 × hole-punch rounds (3s + 1s retry each) + relay (10s) ≈ 20s

The relay fallback alone is documented as **10s**, which exactly equals the new `SEND_TIMEOUT`. For a fresh relay connection to a NAT'd peer, `dial_addr` can consume the entire 10-second window before a single byte of data is written — meaning `send_to_peer_raw` will reliably time out in the worst-case relay scenario.

`connect_to_peer` correctly budgets 25s for the same dial path via `DIAL_TIMEOUT`. `send_to_peer_raw` invokes the same transport but has only 10s total for dial + transfer. Consider splitting the timeout budget:

```suggestion
        const DIAL_PHASE_TIMEOUT: Duration = Duration::from_secs(25);
        const SEND_TIMEOUT: Duration = Duration::from_secs(30);

        tokio::time::timeout(SEND_TIMEOUT, async {
            let conn = tokio::time::timeout(
                DIAL_PHASE_TIMEOUT,
                self.transport.dial_addr(*addr, SAORSA_DHT_PROTOCOL),
            )
            .await
            .map_err(|_| anyhow::anyhow!("Dial timed out after {DIAL_PHASE_TIMEOUT:?} to {addr}"))?
            .map_err(|e| anyhow::anyhow!("Dial by address failed: {}", e))?;
```

How can I resolve this? If you propose a fix, please make it concise.

---

This is a comment left during a code review.
Path: src/dht_network_manager.rs
Line: 2539-2541

Comment:
**`DEFAULT_REQUEST_TIMEOUT_SECS` not updated alongside `REQUEST_TIMEOUT`**

The module-level `REQUEST_TIMEOUT` (DoS-protection timeout on DHT message handlers) was reduced from 30s → 10s. However, `DEFAULT_REQUEST_TIMEOUT_SECS` still defaults to **30s** and feeds `DhtNetworkConfig::default().request_timeout`, which drives:
- `response_timeout` in `wait_for_dht_response` (30s)
- The upper bound of `dial_timeout = transport.connection_timeout().min(config.request_timeout)`

These two constants serve different roles so diverging values may be intentional, but given the PR's stated goal of a 10s request budget it is worth confirming `DEFAULT_REQUEST_TIMEOUT_SECS` should remain at 30s.

How can I resolve this? If you propose a fix, please make it concise.

_{Reviews (1): Last reviewed commit: "perf: tighten network timeouts for faste..." | Re-trigger Greptile}

Reduce timeouts based on RTT analysis (300ms worst-case cross-globe RTT). Previous values caused excessive delays when peers were unreachable. Changes: - DIAL_TIMEOUT: 90s → 25s (derived from strategy stage sum + margin) - DEFAULT_CONNECTION_TIMEOUT_SECS: 90s → 25s (matches DIAL_TIMEOUT) - SEND_TIMEOUT: 15s → 10s (3x margin over 4MB at 10Mbps) - REQUEST_TIMEOUT: 30s → 10s (aligned with libp2p/BEP 5 practice) - IDENTITY_EXCHANGE_TIMEOUT: 10s → 5s (1-2 RTTs + margin) - BOOTSTRAP_IDENTITY_TIMEOUT_SECS: 10s → 5s (matches identity exchange) Companion to saorsa-transport timeout tightening. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Copilot

Pull request overview

Reduces several network operation timeouts to shorten failure detection and improve overall connection/send responsiveness (especially when peers/addresses are stale).

Changes:

Reduced peer dial timeout in the transport adapter (90s → 25s).
Reduced raw send path timeout in the transport adapter (15s → 10s).
Reduced default node connection timeout and DHT handler / identity-exchange timeouts (e.g., 90s → 25s; 30s → 10s; 10s → 5s).

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 1 comment.

File	Description
src/transport/saorsa_transport_adapter.rs	Tightens dial and send operation timeouts used by the adapter APIs.
src/network.rs	Lowers the default node connection timeout and bootstrap identity exchange timeout.
src/dht_network_manager.rs	Lowers DHT handler request timeout and identity exchange timeout caps.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

send_to_peer_raw calls the same dial_addr path as connect_to_peer but had only 10s — less than the relay fallback alone. Bump to 25s to match. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Aligns with the tighter timeout profile while staying above the relay stage (~10s) so dial_candidate's NAT traversal cascade is not truncated. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Point saorsa-transport dependency at the saorsa-labs Git repo on the rc-2026.4.1 release-candidate branch instead of crates.io. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Copilot AI review requested due to automatic review settings April 4, 2026 15:58

Copilot started reviewing on behalf of mickvandijke April 4, 2026 15:59 View session

Copilot AI reviewed Apr 4, 2026

View reviewed changes

Comment thread src/network.rs

mickvandijke and others added 3 commits April 4, 2026 18:21

fix: update test assertion to match new 25s connection timeout default

0ba19fb

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

fix: align SEND_TIMEOUT with DIAL_TIMEOUT to cover full NAT traversal

d8bda2c

send_to_peer_raw calls the same dial_addr path as connect_to_peer but had only 10s — less than the relay fallback alone. Bump to 25s to match. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

perf: reduce DEFAULT_REQUEST_TIMEOUT_SECS from 30s to 15s

75baafd

Aligns with the tighter timeout profile while staying above the relay stage (~10s) so dial_candidate's NAT traversal cascade is not truncated. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

mickvandijke changed the base branch from main to rc-2026.4.1 April 4, 2026 16:35

build: switch saorsa-transport to rc-2026.4.1 branch

1837321

Point saorsa-transport dependency at the saorsa-labs Git repo on the rc-2026.4.1 release-candidate branch instead of crates.io. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

jacderida approved these changes Apr 4, 2026

View reviewed changes

jacderida merged commit dcfc79e into rc-2026.4.1 Apr 4, 2026
2 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

perf: tighten network timeouts for faster connections#68

perf: tighten network timeouts for faster connections#68
jacderida merged 5 commits into
rc-2026.4.1from
perf/tighten-network-timeouts

mickvandijke commented Apr 4, 2026 •

edited by greptile-apps Bot

Loading

Uh oh!

Copilot AI left a comment

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

mickvandijke commented Apr 4, 2026 • edited by greptile-apps Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Timeout changes

Test plan

Greptile Summary

Confidence Score: 3/5

Important Files Changed

Sequence Diagram

Comments Outside Diff (2)

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

mickvandijke commented Apr 4, 2026 •

edited by greptile-apps Bot

Loading