
fix: NAT traversal hardening and 1000-node scale fixes#70

Merged
jacderida merged 8 commits into rc-2026.4.1 from fix/dht-dialable-addresses-nat-hardening on Apr 6, 2026
Conversation


@mickvandijke mickvandijke commented Apr 6, 2026

Summary

Bundles NAT-traversal correctness fixes, a new observed-address cache, and two perf fixes targeting 1000-node routing contention, all against the rc-2026.4.1 release branch.

NAT traversal & dialable addresses

Root cause

DhtNetworkManager::local_dht_node() built its self-entry from the static NodeConfig::listen_addrs() derivation, which returns wildcard IPs (0.0.0.0 / ::) and the configured port (0 for ephemeral binds). Receivers' dialable_addresses() filter then rejected the entry, so DHT-based peer discovery returned zero contactable addresses for any node. Connections only succeeded via side channels (static bootstrap, in-memory cache, relay fallback) — exactly the sporadic-failure pattern reported by saorsa-node-lmdb.
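The receiver-side rejection can be pictured with a minimal stand-in for the crate's dialable_addresses() filter (this is an illustrative sketch, not the crate's actual implementation): a wildcard IP or port 0 fails the check, so a self-entry built from `0.0.0.0:0` yields zero contactable addresses.

```rust
use std::net::SocketAddr;

// Hypothetical stand-in for the receiver-side filter described above:
// an unspecified (wildcard) IP or a zero port is never dialable, so a
// self-entry built from `0.0.0.0:0` is rejected wholesale.
fn dialable_addresses(addrs: &[SocketAddr]) -> Vec<SocketAddr> {
    addrs
        .iter()
        .copied()
        .filter(|a| !a.ip().is_unspecified() && a.port() != 0)
        .collect()
}
```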

Fix

local_dht_node() is now an async fn that sources its self-entry addresses from TransportHandle::observed_external_address() — the OBSERVED_ADDRESS reflexive address reported by peers — with the new observed-address cache as a fallback for the window immediately after a connection drop. The earlier iteration added a UdpSocket::connect-then-getsockname helper (primary_local_ip) to substitute wildcard IPs, but that was dropped in a follow-up commit because it ran blocking syscalls inside an async context; the OBSERVED_ADDRESS-only flow supersedes it and needs no interface probing.
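The resulting source priority can be sketched with plain Options standing in for the real TransportHandle calls (names and the simplified single-address shape are illustrative only):

```rust
use std::net::SocketAddr;

// Minimal sketch of the new source priority (illustrative only): the
// live OBSERVED_ADDRESS reflexive address wins, the cache covers the
// post-disconnect window, and an empty Vec is published when neither
// source knows anything, rather than a wildcard that cannot be dialed.
fn self_entry_addresses(
    live_observed: Option<SocketAddr>,
    cached_observed: Option<SocketAddr>,
) -> Vec<SocketAddr> {
    live_observed.or(cached_observed).into_iter().collect()
}
```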

Other NAT fixes bundled

  • Bootstrap referrer threading — bootstrap_from_peers now records the queried bootstrap peer as the hole-punch coordinator referrer for every node it returned, completing the dial_candidate → set_hole_punch_preferred_coordinator chain the iterative-lookup path already used.
  • Event-driven relay/address updates — Relay-established and peer-address-update events drain via a tokio::select! over TransportHandle::recv_relay_established() / recv_peer_address_update() instead of the previous 1-second polling loop, eliminating the race window where outbound DHT queries advertised no relay address for up to a second after relay setup.
  • Identity-exchange timeout parity — IDENTITY_EXCHANGE_TIMEOUT and BOOTSTRAP_IDENTITY_TIMEOUT_SECS are both now 15 s (previously 5 s). Identity exchange (two RTTs + ML-DSA-65 signature, ~3.3 kB) blew through 5 s sporadically on congested cellular and cross-region links. Review noted the initial commit only bumped the bootstrap path; the non-bootstrap send_dht_request path is now on the same 15 s budget.
  • Multi-address dial helper — dial_addresses() loops through every dialable address until one succeeds. Applied at 5 call sites (bootstrap_from_peers, spawn_bucket_refresh_task, trigger_self_lookup, find_closest_nodes_network parallel-query block, send_dht_request fallback) so a stale NAT binding on entry #1 no longer kills the dial when entry #2 would have worked.
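The retry behaviour of the multi-address dial helper can be sketched like this (hypothetical signature: the real helper dials over the transport, not through a closure, and the error type here is a placeholder):

```rust
use std::net::SocketAddr;

// Sketch of the "try every dialable address until one succeeds" loop
// described above. `dial` is a stand-in for the real transport dial.
fn dial_addresses(
    addrs: &[SocketAddr],
    mut dial: impl FnMut(SocketAddr) -> Result<(), String>,
) -> Result<SocketAddr, String> {
    let mut last_err = "no addresses to dial".to_string();
    for &addr in addrs {
        match dial(addr) {
            Ok(()) => return Ok(addr), // first success wins
            Err(e) => last_err = e,    // keep trying the remaining entries
        }
    }
    Err(last_err)
}
```

A stale NAT binding on the first entry then costs one failed attempt instead of killing the whole dial.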

Observed-address cache (new)

src/transport/observed_address_cache.rs (new, 372 lines) records OBSERVED_ADDRESS reflexive addresses against the local socket they were reported on, with a TTL and recency-based eviction. The cache is consulted by TransportHandle::observed_external_address() when no live connection currently carries a fresh reflexive address, so a node whose upstream connection just dropped can still publish a dialable self-entry for the few seconds before a new connection re-reports the address. Without this fallback the fix above degrades to "works only while you already have at least one live peer with recent OBSERVED_ADDRESS traffic."
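The selection rule (frequency first, recency as tie-break, TTL as staleness cut-off) can be sketched as follows; field and function names are illustrative, not the module's actual API:

```rust
use std::net::SocketAddr;
use std::time::{Duration, Instant};

// Toy model of one cache entry: how many peers reported this address,
// and when it was last reported.
struct ObservedEntry {
    addr: SocketAddr,
    count: u64,
    last_seen: Instant,
}

// Among entries younger than the TTL, pick the most-observed address,
// breaking ties by recency; stale NAT bindings age out via the TTL.
fn best_observed(entries: &[ObservedEntry], ttl: Duration, now: Instant) -> Option<SocketAddr> {
    entries
        .iter()
        .filter(|e| now.duration_since(e.last_seen) < ttl)
        .max_by_key(|e| (e.count, e.last_seen))
        .map(|e| e.addr)
}
```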

1000-node scaling

Two independent contention points identified in 1000-node testnet traces where responses failed to arrive within the 25 s caller timeout ([STEP 6 FAILED] / Response channel closed (receiver timed out) cascade):

  • perf(dht): atomic last_seen under a read lock — routing-table touch was called on every inbound DHT message and always took a RwLock::write(), serialising every read behind each update. NodeInfo.last_seen is now an AtomicInstant and there is a fast path that atomically bumps the timestamp under a read lock when the address merge would be a no-op (address absent, already present, or skipped by the loopback-injection rule). The write-lock slow path is still used when a real address merge is needed.
  • perf(transport): shard the inbound message dispatcher — the previous single consumer task took three async write locks plus an awaited peer-id registration per message, so messages queued behind each other instead of being processed in parallel. The receive loop is now a light dispatcher that hashes the source IpAddr and hands off to one of 8 parallel shard consumers, each running the full parse/register/broadcast pipeline independently. Hashing by IpAddr (not full SocketAddr) keeps a peer pinned to the same shard across ephemeral-port reconnects so per-peer message ordering is preserved.
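The read-lock fast path in the first bullet can be illustrated with a toy routing table. std has no atomic Instant, so this sketch stores a monotonic tick in an AtomicU64; the crate's AtomicInstant may be implemented differently, and all names here are illustrative:

```rust
use std::collections::HashMap;
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::RwLock;

struct NodeInfo {
    last_seen_tick: AtomicU64, // bumped without a write lock
    addresses: Vec<String>,
}

// Returns true when the touch completed under the read lock; false means
// a real address merge is needed and the caller must take the write lock.
fn touch_fast_path(
    table: &RwLock<HashMap<u64, NodeInfo>>,
    id: u64,
    addr: &str,
    tick: u64,
) -> bool {
    let guard = table.read().unwrap(); // many touches proceed in parallel
    match guard.get(&id) {
        Some(info) if info.addresses.iter().any(|a| a.as_str() == addr) => {
            // Address merge would be a no-op: bump the timestamp atomically.
            info.last_seen_tick.store(tick, Ordering::Relaxed);
            true
        }
        _ => false, // new address or unknown node: escalate to write lock
    }
}
```

The point of the interior AtomicU64 is that the common case (known peer, known address) never serialises readers behind a writer.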

Regression tests

tests/dht_self_advertisement.rs (new, 649 lines) pins the published self-entry behaviour with three cases covering public-mode wildcard, port match, and local-mode ephemeral bind. All three FAILED on rc-2026.4.1 with concrete evidence (e.g. [Quic(0.0.0.0:0)] published as the only address; transport bound to port 55886 but DHT advertised port 0). All pass after this branch. They use only public APIs (DhtNetworkManager::find_closest_nodes_local_with_self) so no new test surface area was added to the library.

Out of scope

  • The dead hole-punch chain in saorsa-transport's nat_traversal_api.rs is fixed in a separate commit in that crate.
  • The --port 0 default in ant-node remains a deliberate operator-config decision.

Commits

  • 08febcc fix: advertise dialable addresses to DHT and harden NAT traversal
  • 4aeecf6 chore: drop needless struct update in IPDiversityConfig test
  • a528dec fix: bump IDENTITY_EXCHANGE_TIMEOUT to 15s for slow-link parity
  • 26eb128 fix: drop primary_local_ip and rely on OBSERVED_ADDRESS for self-entry
  • 7086ccb feat: cache observed external addresses as fallback for dropped connections
  • 0a854cd perf(dht): atomic last_seen to run touch under a read lock
  • cf1f32d perf(transport): shard inbound message dispatcher by source IP

Test plan

  • cargo test --lib
  • cargo test --test dht_self_advertisement — 3/3 passing (new regression suite)
  • cargo test --test two_node_messaging
  • cargo test --test node_lifecycle
  • cargo clippy --lib -- -D warnings — clean
  • Re-run cargo test --lib and clippy after the two perf: commits
  • Smoke test against saorsa-2.saorsalabs.com:10000 / saorsa-3.saorsalabs.com:10000 bootstrap nodes
  • Confirm DHT-published self-entries contain dialable addresses on the testnet
  • Validate 1000-node testnet run no longer hits the [STEP 6 FAILED] / Response channel closed cascade

Greptile Summary

This PR bundles a well-scoped set of fixes and performance improvements targeting DHT self-advertisement correctness, NAT traversal reliability, and routing-table write-lock contention at 1000-node scale.

Core changes:

  • local_dht_node() is now async and sources self-entry addresses exclusively from TransportHandle::observed_external_addresses() (live QUIC OBSERVED_ADDRESS frames) and the runtime-bound listen address set, discarding the prior NodeConfig::listen_addrs() call that returned un-dialable wildcards/port-0 values
  • New ObservedAddressCache (src/transport/observed_address_cache.rs) fills the "invisible window" between connection drop and reconnection with frequency+recency-aware address selection; multi-homed partitioned by (local_bind, observed) key
  • Bootstrap referrer threading: bootstrap_from_peers now passes the bootstrap peer's socket address as hole-punch coordinator referrer to every dial_addresses call
  • IDENTITY_EXCHANGE_TIMEOUT and BOOTSTRAP_IDENTITY_TIMEOUT_SECS unified at 15 s (was 5 s) to handle congested cellular / cross-region links
  • NodeInfo.last_seen promoted to AtomicInstant (parking_lot::Mutex<Instant>) so the hot touch path can run under a read lock on the routing table instead of taking a write lock on every inbound DHT message
  • Inbound message dispatcher sharded across 8 parallel consumer tasks keyed by source-IP hash, eliminating head-of-line blocking under 1000-node load
  • Relay/address update consumption moved from a 1-second polling loop to an event-driven tokio::select! over recv_relay_established / recv_peer_address_update
  • New regression suite tests/dht_self_advertisement.rs (649 lines) that confirmed the exact failure modes on the base branch and passes cleanly on this branch

Confidence Score: 5/5

Safe to merge — all remaining findings are P2 style suggestions with no production correctness impact

No P0 or P1 issues found. The two P2 comments flag a write-lock optimization opportunity in the shard consumer and the use of DefaultHasher (which is deterministic within a run and carries no correctness risk for a load-balancing shard key). The root-cause fixes are correct, the new ObservedAddressCache is well-designed and thoroughly tested, the regression suite concretely demonstrates the pre-fix failures, and the atomic touch path is a valid read-lock optimization with proper fallback. Prior concern about blocking I/O in local_dht_node() is fully resolved by the OBSERVED_ADDRESS-only approach.

src/transport_handle.rs — the shard consumer write-lock pattern is worth a follow-up optimization once the scalability impact is measured in production

Important Files Changed

Filename Overview
src/transport/observed_address_cache.rs New module: bounded (16-entry), frequency+recency-aware cache for QUIC OBSERVED_ADDRESS fallback; well-designed with per-local-bind partitioning for multi-homed safety; 11 deterministic unit tests cover all selection paths and eviction
src/dht_network_manager.rs Core fix: local_dht_node() is now async and sources self-entry addresses from observed_external_addresses() and runtime listen_addrs; adds dial_addresses() multi-address retry helper (applied at 5 call sites); bootstrap_referrer threading corrected
src/transport_handle.rs Single-consumer recv loop replaced by 8-shard parallel dispatcher keyed by source-IP hash; observed_address_cache field + accessor methods added; recv_relay_established / recv_peer_address_update event channels; write lock held across await in shard consumer (P2 style note)
src/transport/saorsa_transport_adapter.rs Adds ADDRESS_EVENT_CHANNEL_CAPACITY bounded channel (256), drop-logging helper with coalesced warnings, and supporting constants for the sharded dispatcher infrastructure
src/network.rs BOOTSTRAP_IDENTITY_TIMEOUT_SECS bumped to 15s matching IDENTITY_EXCHANGE_TIMEOUT; polling relay/address loop replaced with event-driven tokio::select! on recv_relay_established / recv_peer_address_update, eliminating 1-second race window
src/dht/core_engine.rs AtomicInstant (parking_lot::Mutex) replaces plain Instant in NodeInfo.last_seen; touch_node_typed fast-path atomically bumps timestamp under read lock with no bucket mutation; escalates to write lock only when address merge is needed
tests/dht_self_advertisement.rs New 649-line regression suite: 3 tests covering wildcard/public-mode, port-match, and local-mode ephemeral-bind self-advertisement contracts; all 3 failed on base branch (concrete evidence of the bug) and pass on this branch
src/dht/security_tests.rs Minor chore: struct update syntax in IPDiversityConfig tests; no logic changes
src/transport.rs Adds pub(crate) mod observed_address_cache module declaration; no other changes

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[Inbound QUIC message] --> B[upstream_rx channel]
    B --> C{shard_index_for_addr\nhash source IP}
    C -->|shard 0| D0[Shard Consumer 0]
    C -->|shard 1| D1[Shard Consumer 1]
    C -->|...| D2[...]
    C -->|shard 7| D7[Shard Consumer 7]
    D0 & D1 & D2 & D7 --> E[parse_protocol_message]
    E --> F{authenticated?}
    F -->|yes| G[peer_to_channel.write\nchannel_to_peers.write\nregister_connection_peer_id]
    F -->|no| H[broadcast P2PEvent]
    G --> I{is /rr/ response?}
    I -->|yes| J[active_requests.write\ndeliver to oneshot]
    I -->|no| H

    subgraph NAT [NAT Self-Entry Fix]
        K[local_dht_node async] --> L{observed_external\naddresses?}
        L -->|yes| M[publish observed addr\nper interface]
        L -->|no| N[ObservedAddressCache\nfallback]
        N --> O{specific IP\nlisten addr?}
        O -->|yes| P[publish listen addr]
        O -->|no| Q[publish nothing —\nhonest empty entry]
    end
This is a comment left during a code review.
Path: src/transport_handle.rs
Line: 1701-1723

Comment:
**Write lock held across two await points**

`peer_to_channel.write()` is acquired on line 1701 and held across both `channel_to_peers.write().await` (line 1706) and the async `dual_node_for_peer_reg.register_connection_peer_id(...).await` call (line 1717). All 8 shards contend on this single global write lock for every authenticated message — not just on first peer registration — partially re-introducing the serialization the sharding was meant to remove.

A read-first check avoids the write lock on the hot path (already-known peer + already-known channel):

```rust
// Fast path: read lock sufficient when peer+channel already known.
{
    let p2c = peer_to_channel.read().await;
    if p2c.get(app_id).is_some_and(|chs| chs.contains(&channel_id)) {
        // Nothing to register; fall through to event dispatch.
        drop(p2c);
    } else {
        drop(p2c);
        // Slow path: new peer or new channel — take write lock.
        let mut p2c = peer_to_channel.write().await;
        // ... existing write-lock body ...
    }
}
```

The correctness argument in the existing comment (consistency between the app map and the transport's internal registration) still holds on the slow path only; the fast path is safe because there is nothing to register.


---

This is a comment left during a code review.
Path: src/transport_handle.rs
Line: 1847-1851

Comment:
**`DefaultHasher` is not guaranteed stable across Rust versions**

`std::collections::hash_map::DefaultHasher::new()` uses SipHash-1-3 with a zeroed key (deterministic within a run), but the Rust reference explicitly states the algorithm is "subject to change at any point in the future." A Rust upgrade that changes the algorithm would silently redistribute all peers across shards, causing a brief message-ordering hiccup while in-flight messages drain from the old shard assignments.

For a load-balancing shard key, correctness is maintained even across a reshuffle (no persistent state is keyed by shard), but replacing with a stable, purpose-built hash (e.g. `rustc-hash::FxHasher` or a simple FNV) would make the choice explicit and immune to compiler updates:

```rust
fn shard_index_for_addr(addr: &SocketAddr) -> usize {
    use std::hash::{Hash, Hasher};
    // FNV-1a — fast, stable, no crate dependency beyond std.
    let mut h: u64 = 14695981039346656037;
    for byte in addr.ip().to_string().bytes() {
        h ^= u64::from(byte);
        h = h.wrapping_mul(1099511628211);
    }
    (h as usize) % MESSAGE_DISPATCH_SHARDS
}
```



Reviews (2): last reviewed commit: "fix: address deep-review findings on PR ..."

@mickvandijke mickvandijke marked this pull request as draft April 6, 2026 14:17
mickvandijke added a commit that referenced this pull request Apr 6, 2026
`send_dht_request` computes its identity-exchange timeout as
`min(request_timeout, IDENTITY_EXCHANGE_TIMEOUT)`. With the constant at
5s the effective budget was always 5s — too tight for cross-region or
congested cellular links where the two-RTT exchange plus ML-DSA-65
verification can blow past with retransmits.

This is the same root cause that motivated bumping
`BOOTSTRAP_IDENTITY_TIMEOUT_SECS` to 15s in 08febcc; that fix only
covered the initial bootstrap dial, leaving every subsequent peer dial
exposed to the original failure mode. Bumping this constant to 15s
matches the bootstrap budget so both code paths now absorb the same
class of slow-link delay.

Resolves the Greptile P1 finding on PR #70.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
mickvandijke added a commit that referenced this pull request Apr 6, 2026
The earlier wildcard substitution in `local_dht_node()` opened a UDP
socket and used the connect-then-getsockname trick to discover the
host's primary outbound interface IP, then published that as a fallback
when the listen address was a wildcard. This had three problems:

1. Blocking I/O inside an async fn: `std::net::UdpSocket::{bind,
   connect,local_addr}` are sync syscalls held across no `.await`, and
   `local_dht_node()` is on the hot DHT-lookup path. Wrong colour for
   tokio worker threads.

2. Wrong answer for home-NAT deployments: a node behind a residential
   NAT publishes an RFC1918 address (192.168.x.x) that internet peers
   cannot route to. The new `dial_addresses()` helper then burned
   connect attempts on guaranteed-failed dials.

3. Unnecessary: every node already has an authoritative source for its
   post-NAT reflexive address — the OBSERVED_ADDRESS frames sent by
   peers, plumbed through `TransportHandle::observed_external_address()`.
   The previous fix made it the *first* source but kept the substitution
   as a fallback "just in case." It is not needed.

This commit removes `primary_local_ip()`, `PRIMARY_IP_PROBE_V{4,6}`,
and the `UdpSocket` import. `local_dht_node()` now sources addresses
from exactly two places:

- `transport.observed_external_address()` — the OBSERVED_ADDRESS
  reflexive address.
- `transport.listen_addrs()`, but only entries with a *specific* IP
  (loopback or any other non-wildcard); wildcards and zero ports are
  skipped.

If neither source produces an address, the published self-entry has
empty `addresses`. This is the right answer at the publish layer:
better to admit "I don't know how to be reached yet" than to lie with
a bind-side wildcard or a guessed LAN IP. The empty window closes
naturally once the first peer connects to us and OBSERVED_ADDRESS flows.

### Tests

`tests/dht_self_advertisement.rs` rewritten as five tests across two
paths:

- **Loopback bind path** (single-node, `local: true`):
  - `local_mode_publishes_dialable_loopback_address`
  - `local_mode_published_self_entry_matches_runtime_listen_addrs`
  - `local_mode_never_publishes_port_zero`

- **Wildcard / OBSERVED_ADDRESS path** (`local: false`):
  - `wildcard_bind_with_no_peers_publishes_empty_self_entry` — pins the
    new "don't lie when you don't know" contract.
  - `wildcard_bind_publishes_observed_address_after_peer_connection` —
    primary regression test for the OBSERVED_ADDRESS flow. Two
    wildcard-bound nodes connect over loopback, identity exchanges,
    and the test polls `transport.observed_external_address()` until
    the frame arrives, then asserts the observed value appears in the
    published self-entry.

Resolves the Greptile P2 finding on PR #70.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@mickvandijke changed the title from "fix: advertise dialable addresses to DHT and harden NAT traversal" to "fix: NAT traversal hardening and 1000-node scale fixes" on Apr 6, 2026
mickvandijke added a commit that referenced this pull request Apr 6, 2026
Bundles seven review-identified fixes (H2, H3, H4, M1, M2, M3, M5) on
the NAT-traversal hardening and 1000-node scale work. Each addresses a
specific bug or correctness gap surfaced during deep review.

H2 — dispatcher fault isolation (src/transport_handle.rs):
A single shard-consumer panic caused the inbound dispatcher loop to
break, dropping the remaining seven shard senders and cascading down
all healthy shards. Switched to `try_send` with explicit handling of
both `Full` and `Closed` errors: drops are counted and warn-logged at
intervals, and the dispatcher only exits when its upstream channel
itself closes. The eight shards now provide real fault isolation.

H3 — bridge task split + bounded forwarder channels
(src/network.rs, src/transport/saorsa_transport_adapter.rs):
The DHT bridge ran `find_closest_nodes_network` inline inside the
select loop on the relay branch, blocking the peer-address-update
branch for the duration of the iterative lookup; the underlying
forwarder mpsc channels were `unbounded_channel` so the queue could
grow without limit during the stall. Detached the lookup + publish
into its own spawned task per relay event so the select loop keeps
polling, and replaced both forwarder channels with bounded
`channel(ADDRESS_EVENT_CHANNEL_CAPACITY)` plus a per-event drop
counter.

H4 — atomic touch fast path respects addr_type
(src/dht/core_engine.rs):
The read-lock fast path in `touch_last_seen_if_merge_noop` only
checked whether the address was already present, ignoring the
caller's `addr_type`. The relay-promotion call site in the DHT bridge
(`touch_node_typed(peer, addr, AddressType::Relay)`) on an address
previously learned as `Direct` would silently bump `last_seen` and
drop the relay re-classification, breaking the type-priority ordering
that the dialer relies on. Plumbed `addr_type` through
`try_touch_last_seen` and `touch_last_seen_if_merge_noop`; the fast
path now refuses to short-circuit when the existing classification
differs, forcing escalation to the slow path so
`merge_typed_address` can re-order.

M1 — observed-address cache keyed by local bind
(src/transport/observed_address_cache.rs,
 src/transport/saorsa_transport_adapter.rs,
 src/transport_handle.rs,
 src/dht_network_manager.rs):
On a multi-homed host, observations from peers reaching the node via
different local interfaces were mixed into one keyspace, so an
address learned on interface A could be served as the self-entry
when only interface B was usable. Cache key changed from `SocketAddr`
to `(local_bind, observed)`. Added `most_frequent_recent_per_local_bind`
plural API. The forwarder captures per-stack `local_bind` before
spawning. New `TransportHandle::observed_external_addresses()` plural
accessor combines all live + cached per-bind observations;
`local_dht_node()` now publishes all of them so peers reaching the
host on any interface can dial back. Adds four new tests covering
multi-bind partitioning.

M2 — try_send for shard handoff (src/transport_handle.rs):
The dispatcher's `send().await` to a per-shard mpsc would block on
the upstream from saorsa-transport when one shard's RwLock contention
spiked, starving the other seven shards through the same single point
of serialisation the sharding was supposed to eliminate. The H2 fix's
non-blocking `try_send` also addresses this — recorded as M2 for
explicitness.

M3 — shard ordering docstring (src/transport_handle.rs):
`shard_index_for_addr` claimed to preserve per-peer ordering across
reconnects. True only across ephemeral-port reconnects from the same
IP. NAT rebinding to a new public IP, mobile Wi-Fi↔cellular roaming,
or dual-stack failover routes the peer to a different shard, breaking
causal ordering at the application layer. Docstring corrected to
"per source IP" with the caveat called out.

M5 — hold lock across register_connection_peer_id
(src/transport_handle.rs):
The shard consumer dropped the `peer_to_channel` write lock before
calling `dual_node.register_connection_peer_id`, leaving a window in
which the app-level map said the peer was known but the transport's
internal addr→peer map did not — a hole-punch lookup during that
window could fail spuriously. Lock now held across the registration
call.

Verification:
- cargo build --lib: clean
- cargo fmt --check: clean
- cargo clippy --lib -- -D warnings -D clippy::unwrap_used \
                       -D clippy::expect_used: clean
- cargo clippy --all-targets -- -D warnings: clean
- cargo test --lib: 282 passed, 0 failed
- cargo test --test dht_self_advertisement: 6/6 passed
- cargo test --test two_node_messaging: 7/7 passed
- cargo test --test node_lifecycle: 17/17 passed

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@mickvandijke mickvandijke marked this pull request as ready for review April 6, 2026 18:12
mickvandijke and others added 8 commits April 6, 2026 20:43
Bundles six NAT-traversal fixes uncovered during a deep audit of the
sporadic-failure pattern reported by saorsa-node-lmdb.

Root cause:
  DhtNetworkManager::local_dht_node() built its self-entry from the
  static NodeConfig::listen_addrs() derivation, which returns wildcard
  IPs (0.0.0.0 / ::) and the *configured* port (0 for ephemeral binds).
  Receivers' dialable_addresses() filter then rejected the entry, so
  DHT-based peer discovery returned zero contactable addresses for any
  node. Connections only succeeded via side channels (static bootstrap,
  in-memory cache, relay fallback) — exactly the sporadic pattern.

  Rewritten as `async fn` that sources addresses, in priority order,
  from TransportHandle::observed_external_address() (the OBSERVED_ADDRESS
  reflexive address, which was defined on the adapter and never called
  outside its own file) and the runtime listen_addrs RwLock, with
  unspecified IPs substituted via a new primary_local_ip() helper that
  uses the standard UdpSocket::connect-then-getsockname trick. No new
  dependencies.

Other fixes bundled:
  - bootstrap_from_peers now records the bootstrap peer as the natural
    hole-punch coordinator referrer for every node it returned, threading
    through the existing dial_candidate -> set_hole_punch_preferred_coordinator
    plumbing that the iterative-lookup path already used.
  - Relay-established and peer-address-update events drained via a
    tokio::select! over new TransportHandle::recv_relay_established() /
    recv_peer_address_update() instead of the previous 1-second polling
    loop, eliminating the race window where outbound DHT queries
    advertised no relay address for up to a second after relay setup.
  - BOOTSTRAP_IDENTITY_TIMEOUT_SECS bumped 5 -> 15 with explanatory
    comment. Identity exchange (two RTTs + ML-DSA-65 signature, ~3.3kB)
    blew through 5s sporadically on congested cellular and cross-region
    links.
  - Added dial_addresses() helper that loops through every dialable
    address until one succeeds. Replaced 5 single-address dial sites
    (bootstrap_from_peers, spawn_bucket_refresh_task, trigger_self_lookup,
    find_closest_nodes_network parallel-query block, send_dht_request
    fallback) so a stale NAT binding on entry #1 no longer kills the
    dial when entry #2 would have worked.

Regression tests:
  tests/dht_self_advertisement.rs adds three #[tokio::test] cases that
  pin the published self-entry behaviour. All three FAILED on main with
  concrete evidence (e.g. `[Quic(0.0.0.0:0)]` published as the only
  address; transport bound to port 55886 but DHT advertised port 0).
  All PASS after this commit. They use only public APIs
  (DhtNetworkManager::find_closest_nodes_local_with_self) so no new
  test surface area was added to the library.

Verified:
  cargo test --lib                       269/269
  cargo test --test dht_self_advertisement 3/3
  cargo test --test two_node_messaging   7/7
  cargo test --test node_lifecycle       17/17
  cargo clippy --lib -- -D warnings      clean

The dead hole-punch chain in saorsa-transport's nat_traversal_api.rs
(Bug 4 from the audit) is fixed in a separate commit in that crate.
The --port 0 default in ant-node (Bug 5) is left as a deliberate
decision for the operator config layer.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Both fields of `IPDiversityConfig` are explicitly set, so the trailing
`..IPDiversityConfig::default()` triggers `clippy::needless_update` under
`cargo clippy --all-targets`. Test code only; no functional change.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…ctions

saorsa-transport exposes the externally-observed address (from QUIC
OBSERVED_ADDRESS frames) only via *active* connections. When every
connection drops, the live read returns None and the node has no way to
tell the DHT how to be reached. The result is a temporary "invisible
window" where the node vanishes from DHT queries between a connection
drop and the next reconnection.

This commit adds an in-memory cache that records every
`P2pEvent::ExternalAddressDiscovered` event the transport emits and
serves as a frequency- and recency-aware fallback when no live
connection has an observation.

### Design

`TransportHandle::observed_external_address()` now has two sources:

  1. Live: ask `dual_node.get_observed_external_address()` first. Unchanged;
     this is the freshest answer while at least one connection is up.
  2. Fallback: consult the new `ObservedAddressCache` when the live
     source is empty. The cache returns the most-frequently-observed
     address among recent entries, breaking ties by recency.

Different peers can legitimately observe a node at different addresses
(symmetric NAT, multi-homed hosts, dual-stack divergence). The cache
tracks how many distinct `(peer, address)` events have been received
and picks "the address most peers agree on" — saorsa-transport dedups
events per (peer, address) pair, so the count is effectively a peer
consensus signal, not raw event volume.

### NAT-rebinding handling

Pure frequency would let a long-lived stale address (count: 10000,
last seen 24h ago) win over a fresh new address (count: 5, last seen
now). That is the wrong answer after a NAT mapping rebinds.

Selection is therefore split into two passes:
  1. Among entries seen within `OBSERVATION_RECENCY_WINDOW` (10 min),
     return the highest-count one.
  2. If nothing is recent, fall back to the global highest-count entry.

Eviction is recency-based: when the cache is full
(`MAX_CACHED_OBSERVATIONS = 16`), the entry with the oldest `last_seen`
is removed. Stale high-count entries naturally get pushed out as fresh
observations arrive at new addresses.
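
The two-pass selection and recency-based eviction can be sketched in
std-only Rust roughly as follows — a simplification for illustration;
the real `ObservedAddressCache` has its own field names and (after
later review fixes) a different key shape:

```rust
use std::collections::HashMap;
use std::net::SocketAddr;
use std::time::{Duration, Instant};

const OBSERVATION_RECENCY_WINDOW: Duration = Duration::from_secs(600);
const MAX_CACHED_OBSERVATIONS: usize = 16;

struct Observation {
    count: u64,
    last_seen: Instant,
}

#[derive(Default)]
struct ObservedAddressCache {
    entries: HashMap<SocketAddr, Observation>,
}

impl ObservedAddressCache {
    fn record(&mut self, addr: SocketAddr, now: Instant) {
        // Recency-based eviction: drop the entry with the oldest
        // last_seen when the cache is full and the address is new.
        if !self.entries.contains_key(&addr)
            && self.entries.len() >= MAX_CACHED_OBSERVATIONS
        {
            if let Some(oldest) = self
                .entries
                .iter()
                .min_by_key(|(_, o)| o.last_seen)
                .map(|(a, _)| *a)
            {
                self.entries.remove(&oldest);
            }
        }
        let e = self
            .entries
            .entry(addr)
            .or_insert(Observation { count: 0, last_seen: now });
        e.count += 1;
        e.last_seen = now;
    }

    /// Pass 1: highest count among recent entries (ties broken by
    /// recency). Pass 2: global highest count if nothing is recent.
    fn most_frequent_recent(&self, now: Instant) -> Option<SocketAddr> {
        let recent = self
            .entries
            .iter()
            .filter(|(_, o)| {
                now.duration_since(o.last_seen) <= OBSERVATION_RECENCY_WINDOW
            })
            .max_by_key(|(_, o)| (o.count, o.last_seen))
            .map(|(a, _)| *a);
        recent.or_else(|| {
            self.entries
                .iter()
                .max_by_key(|(_, o)| (o.count, o.last_seen))
                .map(|(a, _)| *a)
        })
    }
}
```

With this shape, a stale address with count 10000 loses to a fresh
address with count 5 as soon as the stale one ages out of the
10-minute window — exactly the NAT-rebinding behaviour described
above.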

### Wiring

`DualStackNode::spawn_peer_address_update_forwarder` now takes an
`Arc<parking_lot::Mutex<ObservedAddressCache>>` and handles a third
event variant, `P2pEvent::ExternalAddressDiscovered`, recording each
event into the cache. The other two branches (PeerAddressUpdated,
RelayEstablished) are unchanged. The silent `Ok(_other)` arm is now
explicit with a comment listing the event variants we intentionally
skip.

`TransportHandle::new` creates the cache and passes a clone to the
forwarder; `new_for_tests` creates an empty cache so test handles have
the same semantics. Both constructors are in lockstep.

### Persistence

The cache is in-memory only. A process restart resets it. A freshly
started node re-discovers its current address from live connections
rather than trusting potentially-stale state from a previous run.

### Tests

- 9 unit tests on `ObservedAddressCache` covering: empty, single,
  repeated, frequency, recency tiebreak, NAT-rebinding fallback,
  long-quiet fallback, eviction, and re-observation-on-full.
- 1 integration test `observed_address_cache_serves_fallback_after_connection_drop`
  that connects two wildcard-bound nodes over loopback, waits for
  `ExternalAddressDiscovered` to reach the cache via the new
  `cached_observed_external_address` polling accessor, stops node_b to
  drop the live connection, waits for `connected_peers()` to drain,
  then asserts that both `observed_external_address()` and the DHT
  self-entry still return the cached value.

Total: 278 lib tests (269 + 9), 6 dht_self_advertisement tests (5 + 1).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
At 1000 nodes the routing table touch was called on every inbound DHT
message and always took a write lock, serialising every read behind
each update. Wrap `NodeInfo.last_seen` in an `AtomicInstant` and add a
fast path that atomically bumps the timestamp under a read lock when
the address merge would be a no-op (address absent, already present,
or skipped by the loopback-injection rule). The write-lock slow path
is still used when a real address merge is needed.
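
Std Rust has no `AtomicInstant`, so a sketch of the fast path can
store milliseconds since a process epoch in an `AtomicU64`; the
struct and method names here are illustrative, and the no-op check is
simplified to the already-present case only:

```rust
use std::net::SocketAddr;
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::RwLock;
use std::time::Instant;

struct NodeInfo {
    /// Millis since `RoutingTable::epoch`, bumpable under a
    /// shared (read) lock thanks to interior mutability.
    last_seen_ms: AtomicU64,
    addresses: Vec<SocketAddr>,
}

struct RoutingTable {
    epoch: Instant,
    nodes: RwLock<Vec<NodeInfo>>,
}

impl RoutingTable {
    /// Fast path: if the address merge would be a no-op, bump
    /// last_seen atomically under the read lock and return true.
    /// Returning false tells the caller to escalate to the
    /// write-lock slow path for a real merge.
    fn try_touch(&self, idx: usize, addr: SocketAddr) -> bool {
        let nodes = self.nodes.read().unwrap();
        let Some(node) = nodes.get(idx) else { return false };
        if node.addresses.contains(&addr) {
            let now_ms = self.epoch.elapsed().as_millis() as u64;
            node.last_seen_ms.store(now_ms, Ordering::Relaxed);
            return true; // no write lock taken
        }
        false
    }
}
```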

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
At 1000 peers the single consumer task draining every inbound message
became the dominant serialisation point — each message took three
async write locks plus an awaited peer-id registration before the next
could even be looked at, causing response delivery to miss the 25 s
caller timeout. Split the receive loop into a light dispatcher that
hashes the source IP and hands off to one of 8 parallel shard
consumers, each running the full parse/register/broadcast pipeline
independently. Hashing by `IpAddr` (not full `SocketAddr`) keeps a
peer pinned to the same shard across ephemeral-port reconnects so
per-peer message ordering is preserved.
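
The IP-only hashing can be sketched like this (hasher choice and
constant are illustrative, not necessarily what the crate uses):

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};
use std::net::SocketAddr;

const NUM_SHARDS: usize = 8;

/// Hash only the IP, not the full SocketAddr, so a peer that
/// reconnects from a new ephemeral port lands on the same shard
/// and per-source-IP message ordering is preserved.
fn shard_index_for_addr(addr: &SocketAddr) -> usize {
    let mut h = DefaultHasher::new();
    addr.ip().hash(&mut h);
    (h.finish() as usize) % NUM_SHARDS
}
```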

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Bundles seven review-identified fixes (H2, H3, H4, M1, M2, M3, M5) on
the NAT-traversal hardening and 1000-node scale work. Each addresses a
specific bug or correctness gap surfaced during deep review.

H2 — dispatcher fault isolation (src/transport_handle.rs):
A single shard-consumer panic caused the inbound dispatcher loop to
break, dropping the remaining seven shard senders and cascading down
all healthy shards. Switched to `try_send` with explicit handling of
both `Full` and `Closed` errors: drops are counted and warn-logged at
intervals, and the dispatcher only exits when its upstream channel
itself closes. The eight shards now provide real fault isolation.

H3 — bridge task split + bounded forwarder channels
(src/network.rs, src/transport/saorsa_transport_adapter.rs):
The DHT bridge ran `find_closest_nodes_network` inline inside the
select loop on the relay branch, blocking the peer-address-update
branch for the duration of the iterative lookup; the underlying
forwarder mpsc channels were `unbounded_channel` so the queue could
grow without limit during the stall. Detached the lookup + publish
into its own spawned task per relay event so the select loop keeps
polling, and replaced both forwarder channels with bounded
`channel(ADDRESS_EVENT_CHANNEL_CAPACITY)` plus a per-event drop
counter.

H4 — atomic touch fast path respects addr_type
(src/dht/core_engine.rs):
The read-lock fast path in `touch_last_seen_if_merge_noop` only
checked whether the address was already present, ignoring the
caller's `addr_type`. The relay-promotion call site in the DHT bridge
(`touch_node_typed(peer, addr, AddressType::Relay)`) on an address
previously learned as `Direct` would silently bump `last_seen` and
drop the relay re-classification, breaking the type-priority ordering
that the dialer relies on. Plumbed `addr_type` through
`try_touch_last_seen` and `touch_last_seen_if_merge_noop`; the fast
path now refuses to short-circuit when the existing classification
differs, forcing escalation to the slow path so
`merge_typed_address` can re-order.
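
The corrected guard amounts to checking the classification as well as
the address — a sketch with an illustrative flat representation (the
real routing table stores typed addresses differently):

```rust
use std::net::SocketAddr;

#[derive(Clone, Copy, PartialEq)]
enum AddressType {
    Direct,
    Relay,
}

/// The fast path may only short-circuit when both the address AND
/// its classification already match; a differing addr_type must
/// escalate to the slow path so type-priority ordering is redone.
fn merge_is_noop(
    existing: &[(SocketAddr, AddressType)],
    addr: SocketAddr,
    addr_type: AddressType,
) -> bool {
    existing.iter().any(|(a, t)| *a == addr && *t == addr_type)
}
```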

M1 — observed-address cache keyed by local bind
(src/transport/observed_address_cache.rs,
 src/transport/saorsa_transport_adapter.rs,
 src/transport_handle.rs,
 src/dht_network_manager.rs):
On a multi-homed host, observations from peers reaching the node via
different local interfaces were mixed into one keyspace, so an
address learned on interface A could be served as the self-entry
when only interface B was usable. Cache key changed from `SocketAddr`
to `(local_bind, observed)`. Added `most_frequent_recent_per_local_bind`
plural API. The forwarder captures per-stack `local_bind` before
spawning. New `TransportHandle::observed_external_addresses()` plural
accessor combines all live + cached per-bind observations;
`local_dht_node()` now publishes all of them so peers reaching the
host on any interface can dial back. Adds four new tests covering
multi-bind partitioning.

M2 — try_send for shard handoff (src/transport_handle.rs):
The dispatcher's `send().await` to a per-shard mpsc would block on
the upstream from saorsa-transport when one shard's RwLock contention
spiked, starving the other seven shards through the same single point
of serialisation the sharding was supposed to eliminate. The H2 fix's
non-blocking `try_send` also addresses this — recorded as M2 for
explicitness.

M3 — shard ordering docstring (src/transport_handle.rs):
`shard_index_for_addr` claimed to preserve per-peer ordering across
reconnects. That holds only for ephemeral-port reconnects from the
same IP: NAT rebinding to a new public IP, mobile Wi-Fi↔cellular
roaming, or dual-stack failover routes the peer to a different shard,
breaking causal ordering at the application layer. Docstring
corrected to "per source IP" with the caveat called out.

M5 — hold lock across register_connection_peer_id
(src/transport_handle.rs):
The shard consumer dropped the `peer_to_channel` write lock before
calling `dual_node.register_connection_peer_id`, leaving a window in
which the app-level map said the peer was known but the transport's
internal addr→peer map did not — a hole-punch lookup during that
window could fail spuriously. Lock now held across the registration
call.

Verification:
- cargo build --lib: clean
- cargo fmt --check: clean
- cargo clippy --lib -- -D warnings -D clippy::unwrap_used \
                       -D clippy::expect_used: clean
- cargo clippy --all-targets -- -D warnings: clean
- cargo test --lib: 282 passed, 0 failed
- cargo test --test dht_self_advertisement: 6/6 passed
- cargo test --test two_node_messaging: 7/7 passed
- cargo test --test node_lifecycle: 17/17 passed

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@mickvandijke mickvandijke force-pushed the fix/dht-dialable-addresses-nat-hardening branch from fb02b49 to 3b7131d on April 6, 2026 18:44
mickvandijke added a commit that referenced this pull request Apr 6, 2026
At 1000 nodes two hot paths dominated serialisation in the stack, both
showing up as response-delivery timeouts under load. This commit
combines those two fixes and the deep-review follow-ups that hardened
them against cascade failure and regression.

## perf(dht): atomic last_seen under a read lock

The routing-table touch was called on every inbound DHT message and
always took a write lock, serialising every read behind each update.
Wrap `NodeInfo.last_seen` in an `AtomicInstant` and add a fast path
that atomically bumps the timestamp under a read lock when the address
merge would be a no-op (address absent, already present with the same
type, or skipped by the loopback-injection rule). The write-lock slow
path still runs when a real address merge or re-classification is
needed. (a9a126f)

### Fast path respects addr_type (H4, from deep review)

Initially the read-lock fast path only checked whether the address
was already present, ignoring the caller's `addr_type`. The
relay-promotion call site in the DHT bridge would silently bump
`last_seen` and drop the relay re-classification, breaking the
type-priority ordering that the dialer relies on. Plumbed `addr_type`
through `try_touch_last_seen` and `touch_last_seen_if_merge_noop`;
the fast path now refuses to short-circuit when the existing
classification differs, forcing escalation to the slow path so
`merge_typed_address` can re-order.

## perf(transport): shard inbound message dispatcher by source IP

At 1000 peers the single consumer task draining every inbound message
became the dominant serialisation point — each message took three
async write locks plus an awaited peer-id registration before the next
could even be looked at, causing response delivery to miss the 25 s
caller timeout. Split the receive loop into a light dispatcher that
hashes the source IP and hands off to one of 8 parallel shard
consumers, each running the full parse/register/broadcast pipeline
independently. Hashing by `IpAddr` (not full `SocketAddr`) keeps a
peer pinned to the same shard across ephemeral-port reconnects so
per-peer message ordering is preserved for the common case. (ab3a5bf)

### Dispatcher fault isolation (H2/M2, from deep review)

A single shard-consumer panic caused the inbound dispatcher loop to
break, dropping the remaining seven shard senders and cascading down
all healthy shards. Switched to `try_send` with explicit handling of
both `Full` and `Closed` errors: drops are counted and warn-logged at
intervals (coalesced via `SHARD_DROP_LOG_INTERVAL`), and the
dispatcher only exits when its upstream channel itself closes. The
non-blocking `try_send` also addresses M2, where a blocking
`send().await` on one congested shard could starve the other seven
through the same serialisation point the sharding was meant to
eliminate. The eight shards now provide real fault isolation.

### Shard ordering docstring corrected (M3, from deep review)

`shard_index_for_addr` previously claimed to preserve per-peer
ordering across reconnects. This is true only across ephemeral-port
reconnects from the same source IP. NAT rebinding to a new public
IP, mobile Wi-Fi↔cellular roaming, or dual-stack failover routes the
peer to a different shard, breaking causal ordering at the
application layer. Docstring now says "per source IP" with the caveat
called out explicitly.

### Hold lock across register_connection_peer_id (M5, from deep review)

The shard consumer dropped the `peer_to_channel` write lock before
calling `dual_node.register_connection_peer_id`, leaving a window in
which the app-level map said the peer was known but the transport's
internal addr→peer map did not — a hole-punch lookup during that
window could fail spuriously. Lock is now held across the
registration call and dropped immediately after.

## chore: drop needless struct update in IPDiversityConfig test

Both fields of `IPDiversityConfig` are explicitly set, so the trailing
`..IPDiversityConfig::default()` triggers `clippy::needless_update`
under `cargo clippy --all-targets`. Test code only; no functional
change. (171af29)

## Verification

- `cargo build --lib`: clean
- `cargo fmt --check`: clean
- `cargo clippy --lib -- -D warnings -D clippy::unwrap_used -D clippy::expect_used`: clean
- `cargo clippy --all-targets -- -D warnings`: clean
- `cargo test --lib`: 269 passed, 0 failed

This is one of two PRs that split the original PR #70, which bundled
NAT-traversal hardening with these 1000-node scale fixes. The other
PR covers the NAT-traversal work.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
mickvandijke added a commit that referenced this pull request Apr 6, 2026
Splits the NAT-traversal and dialable-address work from the original
PR #70 bundle. Root cause and fix below; other correctness fixes
bundled.

## Root cause

`DhtNetworkManager::local_dht_node()` built its self-entry from the
static `NodeConfig::listen_addrs()` derivation, which returns wildcard
IPs (`0.0.0.0` / `::`) and the configured port (`0` for ephemeral
binds). Receivers' `dialable_addresses()` filter then rejected the
entry, so DHT-based peer discovery returned **zero contactable
addresses** for any node. Connections only succeeded via side channels
(static bootstrap, in-memory cache, relay fallback) — exactly the
sporadic-failure pattern reported by `saorsa-node-lmdb`.

## Fix

`local_dht_node()` is now an `async fn` that sources its self-entry
addresses from `TransportHandle::observed_external_addresses()` — the
post-NAT addresses observed via QUIC `OBSERVED_ADDRESS` frames, one
per local interface that has an observation on a multi-homed host.
An intermediate iteration added a `UdpSocket::connect`-then-
`getsockname` helper (`primary_local_ip`) to substitute wildcard IPs,
but it was dropped in a follow-up because it ran blocking syscalls
inside an async context and returned RFC1918 addresses behind home
NAT; the OBSERVED_ADDRESS-only flow supersedes it and needs no
interface probing.

If no source produces an address, the published self-entry has an
empty `addresses` vec. This is the right answer at the publish layer:
better to admit "I don't know how to be reached yet" than to lie
with a bind-side wildcard. The empty window closes naturally once the
first peer connects and OBSERVED_ADDRESS flows.

## Other NAT fixes bundled

- **Event-driven relay/address updates** — relay-established and
  peer-address-update events drain via `tokio::select!` over
  `TransportHandle::recv_relay_established()` /
  `recv_peer_address_update()` instead of the previous 1-second
  polling loop, eliminating the race window where outbound DHT queries
  advertised no relay address for up to a second after relay setup.

- **Identity exchange timeout bumped to 15 s** — `send_dht_request`
  computes its identity-exchange timeout as
  `min(request_timeout, IDENTITY_EXCHANGE_TIMEOUT)`; with the constant
  at 5 s the effective budget was always 5 s, too tight for
  cross-region or congested cellular links where the two-RTT exchange
  plus ML-DSA-65 verification can blow past with retransmits. Bumped
  to 15 s to match the bootstrap budget.

- **Observed-address cache fallback** — saorsa-transport exposes the
  externally-observed address only via active connections, so a node
  whose every connection has just dropped has no answer for
  `observed_external_address()`. This introduces
  `ObservedAddressCache`: an in-memory, frequency- and recency-aware
  cache populated by `P2pEvent::ExternalAddressDiscovered` events.
  When the live source is empty, the cache returns the
  most-frequently-observed address among recent entries, breaking
  ties by recency. NAT-rebind handling via a two-pass selection
  (recent window wins over stale counts). Keyed by
  `(local_bind, observed)` so multi-homed hosts don't cross-pollute
  observations from different interfaces. The cache is reset on
  process restart — a fresh node re-discovers its current address
  from live connections rather than trusting stale state.

- **Bridge task split + bounded forwarder channels** — the DHT bridge
  previously ran `find_closest_nodes_network` inline inside the select
  loop on the relay branch, blocking the peer-address-update branch
  for the duration of the iterative lookup; the underlying forwarder
  mpsc channels were `unbounded_channel` so the queue could grow
  without limit during the stall. Detached the lookup + publish into
  its own spawned task per relay event so the select loop keeps
  polling, and replaced both forwarder channels with bounded
  `channel(ADDRESS_EVENT_CHANNEL_CAPACITY)` plus a per-event drop
  counter.

- **`dial_addresses()` helper** — loops through every dialable
  address until one succeeds. Replaces single-address dial sites in
  `bootstrap_from_peers`, `spawn_bucket_refresh_task`,
  `trigger_self_lookup`, `find_closest_nodes_network` parallel-query
  block, and `send_dht_request` fallback so a stale NAT binding on
  entry #1 no longer kills the dial when entry #2 would have worked.
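
The try-each-until-success loop generalises to a sketch like this
(the signature is illustrative — the real helper is async and returns
a connection, not an index):

```rust
/// Try each dialable address in order; the first success wins, and
/// only after every candidate fails is the last error returned.
fn dial_addresses<A, E>(
    addrs: &[A],
    mut dial: impl FnMut(&A) -> Result<(), E>,
) -> Result<usize, Option<E>> {
    let mut last_err = None;
    for (i, a) in addrs.iter().enumerate() {
        match dial(a) {
            Ok(()) => return Ok(i), // stale entry #1 no longer fatal
            Err(e) => last_err = Some(e),
        }
    }
    Err(last_err) // None only when addrs was empty
}
```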

## Regression tests

`tests/dht_self_advertisement.rs` adds six `#[tokio::test]` cases that
pin the published self-entry behaviour:

- **Loopback bind path** (single-node, `local: true`):
  - `local_mode_publishes_dialable_loopback_address`
  - `local_mode_published_self_entry_matches_runtime_listen_addrs`
  - `local_mode_never_publishes_port_zero`

- **Wildcard / OBSERVED_ADDRESS path** (`local: false`):
  - `wildcard_bind_with_no_peers_publishes_empty_self_entry` — pins
    the "don't lie when you don't know" contract.
  - `wildcard_bind_publishes_observed_address_after_peer_connection` —
    primary regression test for the OBSERVED_ADDRESS flow.
  - `observed_address_cache_serves_fallback_after_connection_drop` —
    pins the cache fallback behaviour after all live connections drop.

All six FAIL on `rc-2026.4.1` with concrete evidence (e.g.
`[Quic(0.0.0.0:0)]` published as the only address) and PASS here.
`ObservedAddressCache` ships with 9 unit tests covering empty, single,
repeated, frequency, recency tiebreak, NAT-rebinding fallback,
long-quiet fallback, eviction, and re-observation-on-full.

## chore: drop needless struct update in IPDiversityConfig test

Both fields of `IPDiversityConfig` are explicitly set, so the trailing
`..IPDiversityConfig::default()` triggers `clippy::needless_update`
under `cargo clippy --all-targets`. Test code only; no functional
change.

## Verification

- `cargo build --lib`: clean
- `cargo fmt --check`: clean
- `cargo clippy --lib -- -D warnings -D clippy::unwrap_used -D clippy::expect_used`: clean
- `cargo clippy --all-targets -- -D warnings`: clean
- `cargo test --lib`: 282 passed, 0 failed
- `cargo test --test dht_self_advertisement`: 6/6 passed
- `cargo test --test two_node_messaging`: 7/7 passed
- `cargo test --test node_lifecycle`: 17/17 passed

This is one of two PRs that split the original PR #70, which bundled
this NAT-traversal hardening with two 1000-node scale fixes in the
DHT touch path and the inbound message dispatcher. The other PR
covers the 1000-node perf work.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@jacderida jacderida merged commit ada78f2 into rc-2026.4.1 Apr 6, 2026
1 of 2 checks passed
mickvandijke added a commit that referenced this pull request Apr 14, 2026
The classification session was sending raw listen_addrs (potentially
0.0.0.0) as candidate addresses in DialBackRequest messages. Probers
could not dial wildcard addresses, so all probes failed and every node
was classified as private regardless of actual reachability.

Now uses transport.observed_external_addresses() as the primary source
(real external IP from QUIC OBSERVED_ADDRESS frames). Falls back to
listen_addrs filtered to non-wildcard IPs only when no observations
are available yet (cold start before any peer has connected).

This mirrors the fix in local_dht_node() from PR #70.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>