Skip to content

feat: actual networking between peers#6

Merged
dirvine merged 1 commit intosaorsa-labs:mainfrom
grumbach:actual_networking_between_peers
Jan 29, 2026
Merged

feat: actual networking between peers#6
dirvine merged 1 commit intosaorsa-labs:mainfrom
grumbach:actual_networking_between_peers

Conversation

@grumbach
Copy link
Copy Markdown
Collaborator

@grumbach grumbach commented Jan 29, 2026

Fix P2P Network Communication Issues

Status: Peer-to-Peer Connectivity

Objective: Connect 2 peers and have them communicate reliably

Result: ✅ ACHIEVED - Core functionality is working.


What's Working Now

Feature Status Evidence
Peer connection test_simple_ping_pong passes
Bidirectional messaging test_bidirectional_communication passes
Large messages (64KB+) test_large_message_transfer passes
Multiple sequential messages test_multiple_sequential_messages passes
Connection persistence test_connection_stays_alive passes
Peer events (connect/disconnect) test_peer_events_sequence passes
Multi-peer communication test_many_peers_scaling passes

Test Results: 25 pass, 1 fail (pre-existing), 4 ignored


Fixes Applied

1. Dual-stack race condition

File: src/transport/ant_quic_adapter.rs:880-891

Problem: tokio::select! races IPv4 and IPv6 stacks. If connection is only on one stack, the empty stack returns "No connected peers" immediately, beating the actual data transfer.

Fix: When one stack returns "No connected peers", fall back to the other stack instead of failing.

2. API mismatch between send/receive

File: src/network.rs:1729, 2401

Problem: Send used open_uni() (LinkTransport API) but receive used endpoint().recv() (P2pEndpoint API). These are incompatible - data sent via one cannot be received by the other.

Fix: Changed send_message() and keepalive_task() to use send_to_peer_string_optimized() which uses the P2pEndpoint API matching the receive path.

3. Timestamp reset bug

File: src/network.rs:2488-2497

Problem: When marking a peer as disconnected, last_seen was reset to now, doubling the cleanup time (120s instead of 60s).

Fix: Removed the last_seen = now assignment when marking peers disconnected.


Remaining TODOs

Bug Severity Status Notes
Dual periodic tasks race HIGH Not fixed Both periodic_maintenance_task() and periodic_tasks() run concurrently - risk of duplicate events
Task handles not awaited HIGH Not fixed Potential resource leaks on shutdown
Empty messages dropped MEDIUM Not fixed Empty payloads silently ignored
Timeout mismatch MEDIUM Not fixed 30s hardcoded vs configurable threshold
test_reconnection_works LOW Pre-existing Event subscription timing issue

Summary

The core objective is met. Two peers can:

  • Connect over QUIC (IPv4/IPv6)
  • Exchange messages of any size up to ~1MB
  • Maintain connections with keepalive
  • Receive proper connect/disconnect events

The remaining bugs are cleanup/edge-case issues that don't block normal peer communication. They should be addressed for production robustness but aren't blocking the basic connectivity use case.


Verification

cargo test --test network_wiring_e2e_test  # 25 pass
cargo clippy -- -D warnings                 # No warnings
cargo fmt --check                           # Formatted

@dirvine
Copy link
Copy Markdown
Collaborator

dirvine commented Jan 29, 2026

@claude please review and confirm tests

@claude
Copy link
Copy Markdown

claude bot commented Jan 29, 2026

Claude encountered an error —— View job

Command failed: git fetch origin --depth=20 actual_networking_between_peers

I'll analyze this and get back to you.

dirvine added a commit that referenced this pull request Jan 29, 2026
Prevents workflow failure when triggered by PRs from forks.

## Problem
PR #6 from fork (grumbach/saorsa-core) failed with:
- Error: Could not fetch OIDC token
- Missing ACTIONS_ID_TOKEN_REQUEST_URL environment variable
- CLAUDE_CODE_OAUTH_TOKEN secret unavailable to forks

## Root Cause
GitHub Actions security: `pull_request` events from forks don't have
access to repository secrets or OIDC tokens (intentional security).

## Solution
Added condition to claude-code-review.yml:
```yaml
if: github.event.pull_request.head.repo.full_name == github.repository
```

This skips the workflow for forked PRs while allowing same-repo PRs.

## Behavior
- ✅ Same-repo PRs → Claude review runs normally
- ⏭️ Forked PRs → Claude review skipped (no error)
- ✅ @claude comments → Still work on forks (runs in base repo context)

## Changes
- .github/workflows/claude-code-review.yml - Add fork check
- .github/workflows/claude.yml - Add clarifying comment
- .planning/reviews/ - Review documentation

## Verification
Review Grade: A (unanimous approval)
- YAML syntax: valid
- Security: improved (fork isolation)
- Best practices: followed
- Breaking changes: none

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
@dirvine
Copy link
Copy Markdown
Collaborator

dirvine commented Jan 29, 2026

✅ Workflow Fix Applied

The Claude Code Review workflow failure you experienced has been fixed in commit 025a538.

What Was Happening

Your PR triggered this error:

Error: Could not fetch an OIDC token
Missing ACTIONS_ID_TOKEN_REQUEST_URL environment variable

Root Cause: GitHub Actions security prevents forked PRs from accessing repository secrets (CLAUDE_CODE_OAUTH_TOKEN) and OIDC tokens. This is intentional - prevents malicious forks from stealing secrets.

The Fix

Added fork protection to .github/workflows/claude-code-review.yml:

if: github.event.pull_request.head.repo.full_name == github.repository

New Behavior

  • Same-repo PRs → Claude Code Review runs normally
  • ⏭️ Forked PRs → Workflow skipped (no error)
  • @claude comments → Still work on forks (different trigger context)

So you can still get Claude review on your PR by commenting @claude review this PR - that uses the comment-based workflow which runs in the base repo context.


📊 Related: Message Encoding Efficiency Issue

While reviewing your PR, I noticed you're working on message passing (test_large_message_transfer, etc.). There's a significant optimization opportunity in the current codebase:

Triple JSON Encoding Problem

The current implementation encodes messages 3 times:

  1. RichMessage → JSON (service.rs:664): 8KB → ~10KB
  2. EncryptedMessage → JSON (transport.rs:246): 10KB → ~20KB
  3. Protocol wrapper → JSON (network.rs:1645-1669): 20KB → ~29KB

Result: 8KB data becomes 29KB (3.6x bloat!)

The Culprit (network.rs:1645-1669)

fn create_protocol_message(&self, protocol: &str, data: Vec<u8>) -> Result<Vec<u8>> {
    let message = json!({
        "protocol": protocol,
        "data": data,  // ❌ Vec<u8> as JSON array: [72,101,108,...]
        "from": self.peer_id,
        "timestamp": timestamp
    });
    serde_json::to_vec(&message)  // ❌ Encoding already-JSON data
}

When JSON encodes Vec<u8>:

  • As array: [72, 101, 108, 108, 111] → ~4x overhead
  • As base64: "SGVsbG8=" → ~1.33x overhead

Quick Fixes

Option 1: Use bincode (already in Cargo.toml!)

bincode::serialize(&message)?  // Instead of serde_json::to_vec

Expected: 60-70% size reduction (8KB → 9KB instead of 29KB)

Option 2: Binary framing (eliminate JSON entirely)

fn create_protocol_message(&self, protocol: &str, data: Vec<u8>) -> Result<Vec<u8>> {
    let mut frame = Vec::new();
    frame.extend_from_slice(&(protocol.len() as u32).to_le_bytes());
    frame.extend_from_slice(protocol.as_bytes());
    frame.extend_from_slice(&data);
    Ok(frame)
}

Not Blocking This PR

Your connectivity fixes are solid and important. The encoding efficiency is a separate optimization that could be a follow-up PR. Just wanted to flag it since you're deep in the messaging code.

Files to consider for optimization:

  • src/network.rs:1645-1669 - Protocol wrapper (highest impact)
  • src/messaging/transport.rs:246 - EncryptedMessage serialization
  • src/messaging/service.rs:664 - RichMessage serialization

Great work on the peer connectivity fixes! 🚀

dirvine added a commit that referenced this pull request Jan 29, 2026
Addresses 3.6x bloat from triple JSON encoding (8KB → 29KB).

## Problem

Current message pipeline encodes data 3 times:
1. RichMessage → JSON (8KB → 10KB)
2. EncryptedMessage → JSON (10KB → 20KB)
3. Protocol wrapper → JSON (20KB → 29KB)

Result: 3.6x overhead on all messages.

## Solution Approach

4 milestones with 12 phases:
1. **Analysis & Benchmarking** - Quantify problem, design solution
2. **Core Implementation** - Bincode migration with version negotiation
3. **Testing & Validation** - Comprehensive test suite
4. **Documentation & Migration** - Enable smooth adoption

## Success Criteria

- 60%+ size reduction (8KB → 10KB target vs 29KB current)
- Backward compatibility maintained (V1/V2 protocol versions)
- All tests passing
- Zero panics/unwraps in production code

## Files

- .planning/PROJECT.md - Problem statement and goals
- .planning/MILESTONES.md - 4 milestone breakdown
- .planning/STATE.json - GSD execution state

## Next Steps

Use `/gsd-plan-phase` to detail Milestone 1, Phase 1 tasks.

Related: PR #6, issue about message size limits

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
@dirvine dirvine merged commit 595725b into saorsa-labs:main Jan 29, 2026
16 of 18 checks passed
claude bot pushed a commit that referenced this pull request Feb 12, 2026
…ing)

Implements 11 hardening improvements from PR #21 review comments:

1. Request Leak/Cleanup:
   - Added RAII RequestCleanupGuard for automatic cleanup on drop/cancel
   - Atomic check+insert under single write lock to prevent races

2. Input Validation:
   - Added TransportError::ValidationError for input validation errors
   - validate_protocol_name() now uses ValidationError instead of StreamError

3. Timeout Handling:
   - Added MIN_REQUEST_TIMEOUT (100ms) to prevent Duration::ZERO immediate timeout
   - send_request() clamps timeout to [100ms, 5min] range

4. Response Routing:
   - Improved logging for failed pending.send() to clarify timeout scenario

5. Documentation:
   - Added documentation to ReplicationResult clarifying remote-only counts
   - Fixed brittle error assertion in tests to check error variant

6. Testing:
   - New tests/request_response_e2e_test.rs with 7 comprehensive tests:
     * Successful request/response routing
     * Timeout cleanup behavior
     * Invalid protocol rejection (empty, /, \, \0)
     * Protocol validation in send_response()
     * Minimum timeout enforcement
     * Trust reporting on failure

Items already correct (verified):
- #3: Protocol validation in send_response() already present
- #6: Response-origin mismatch uses get() before remove()
- #7: Unmatched /rr/ responses already suppressed
- #9: Trust reporting on send_message() failure already implemented
- #10: PeerStoreOutcome docs correct (no latency mention)

Closes #23

Co-authored-by: David Irvine <dirvine@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants