Skip to content

fix(l1): fix broadcast_pool race and offload tx pool insertion to background task#6253

Merged
ilitteri merged 1 commit into
mainfrom
fix/offload-tx-pool-insertion-from-connection-server
Feb 24, 2026
Merged

fix(l1): fix broadcast_pool race and offload tx pool insertion to background task#6253
ilitteri merged 1 commit into
mainfrom
fix/offload-tx-pool-insertion-from-connection-server

Conversation

@edg-l
Copy link
Copy Markdown
Contributor

@edg-l edg-l commented Feb 24, 2026

Summary

  • Fix broadcast_pool race condition: clear_broadcasted_txs() wiped the entire broadcast_pool, including txs added between the read (get_txs_for_broadcast()) and the clear. With 2000 txs being inserted one-by-one and the broadcaster firing every 5ms, this reliably dropped txs that were never announced to peers. Replaced with targeted remove_broadcasted_txs() that only removes the txs that were actually fetched and broadcast.
  • Offload tx pool insertion to background task: Spawn the incoming Transactions message pool-insertion loop as a background tokio::spawn task instead of running it synchronously in handle_cast. This prevents the ConnectionServer from being blocked during signature recovery + storage reads for large transaction batches.

Motivation

The hive LargeTxRequest devp2p eth test intermittently fails in CI (example run). Root cause is a race condition in the transaction broadcaster:

  1. get_txs_for_broadcast() reads the broadcast_pool (releases lock)
  2. Meanwhile, incoming txs continue to be added to broadcast_pool
  3. clear_broadcasted_txs() clears all of broadcast_pool, including txs added in step 2

Any tx added between steps 1 and 3 is lost forever — never broadcast but cleared from the pool. With 2000 txs being inserted sequentially and the broadcaster firing every 5ms, this race reliably drops transactions, causing the test's 2-second timeout to expire before all tx hashes are announced.

Test plan

  • cargo check (with and without l2 feature)
  • Local hive devp2p eth tests pass 3/3 runs: ./hive --sim devp2p --sim.limit "eth" --client ethrex (20/20 each run)
  • CI hive daily run should show P2P Eth capability back to 20/20

@edg-l edg-l requested a review from a team as a code owner February 24, 2026 10:28
@github-actions github-actions Bot added L1 Ethereum client performance Block execution throughput and performance in general labels Feb 24, 2026
@github-actions
Copy link
Copy Markdown

🤖 Kimi Code Review

Security & Correctness Issues

1. Race condition in transaction processing (lines 952-954, 973-977)
The tx_hashes vector is collected from txs.transactions before spawning the background task, but the background task consumes txs.transactions by value. This creates a potential race where the broadcaster is notified with transaction hashes that may not actually be processed if the background task hasn't started yet.

Fix: Collect hashes after filtering in the background task, or use Arc<[Transaction]> to share ownership.

2. Missing transaction deduplication check
The code immediately spawns a task for all received transactions without checking if they're already in the pool. This could lead to unnecessary CPU usage for signature recovery on duplicate transactions.

Performance & Resource Management

3. Unbounded task spawning (line 957)
tokio::spawn() is called for every Transactions message without any backpressure mechanism. A malicious peer could send many small Transactions messages to exhaust system resources.

Fix: Use a bounded channel or semaphore to limit concurrent transaction processing tasks.

4. Missing task cancellation on peer disconnect
The spawned task continues running even if the peer disconnects. This wastes resources processing transactions from disconnected peers.

Fix: Pass an AbortHandle to the task and cancel it in the connection cleanup.

Code Quality & Best Practices

5. Inefficient cloning (line 955)
state.blockchain.clone() clones the entire blockchain handle for each message. Consider using Arc if not already the case, or pass a lightweight handle.

6. Debug logging inconsistency
The debug logs in lines 965 and 975 use peer=%peer but the original code used peer=%state.node. While functionally equivalent, maintaining consistency with the original format would be clearer.

7. Missing error handling for spawn failure
tokio::spawn() can fail if the runtime is shutting down. This should be handled gracefully.

Ethereum-Specific Concerns

8. Transaction validation timing
Moving transaction validation to a background task is generally good, but ensure that invalid transactions (wrong chain ID, insufficient gas, etc.) are still properly filtered before network propagation. The current code only checks L2-specific rules.

The change appears to be a performance optimization rather than a consensus change, so this is acceptable.


Automated review by Kimi (Moonshot AI) · custom prompt

@github-actions
Copy link
Copy Markdown

🤖 Codex Code Review

Findings

  1. Potential correctness issue: tx_hashes is built from all incoming txs, but the L2/privileged filter is applied later inside the spawned task. This means InMessage::AddTxs includes hashes for txs that are explicitly rejected in L2 mode or privileged txs. If the broadcaster uses these hashes to suppress re-broadcast to the sender, you may be suppressing even though the tx never entered the pool. Consider filtering hashes to only txs that pass the L2/privileged checks (and ideally only those accepted into the pool).
    File: crates/networking/p2p/rlpx/connection/server.rs:945

  2. Unbounded task spawning per incoming Transactions message can become a DoS vector under load (many peers, large tx batches). You’ve moved CPU-heavy work off the connection task, but there’s no backpressure or limit. Consider using a bounded task queue or a semaphore to cap concurrent pool insertions.
    File: crates/networking/p2p/rlpx/connection/server.rs:952

Notes

  • The async move captures blockchain and peer cleanly; no memory safety issues observed in this diff.
  • Error handling behavior for pool insertion remains the same (log and continue).

If you want, I can check the broadcaster logic to verify whether the hash suppression path makes Item 1 a real bug or just a mild inefficiency.

Next steps

  1. I can inspect broadcaster behavior to confirm the impact of Item 1.

Automated review by OpenAI Codex · custom prompt

@greptile-apps
Copy link
Copy Markdown

greptile-apps Bot commented Feb 24, 2026

Greptile Summary

Fixed critical race condition in transaction broadcaster that was causing transactions to be lost. The race occurred when clear_broadcasted_txs() would wipe the entire broadcast_pool including transactions added between the read and clear operations. This has been resolved by replacing the clear operation with targeted remove_broadcasted_txs() that only removes transactions that were actually fetched and broadcast.

Additionally, offloaded expensive transaction pool insertion (signature recovery and validation) to a background task to prevent blocking the ConnectionServer during large transaction batches.

  • Replaced clear_broadcasted_txs() with remove_broadcasted_txs(&hashes) for targeted removal
  • Changed broadcaster to collect and pass specific hashes of broadcast transactions
  • Moved transaction pool insertion to tokio::spawn background task in message handler
  • Extracted transaction hashes upfront before spawning background task
  • Cloned necessary context (blockchain, peer) for background task

Confidence Score: 4/5

  • Safe to merge - fixes critical race condition with well-tested changes
  • The race condition fix is sound and the background task offload is a standard pattern. Minor concern about unhandled spawn panic (background task silently fails if pool is shut down), but this is acceptable in practice since task failures are logged internally and don't affect correctness
  • No files require special attention - all changes are focused and address specific issues

Important Files Changed

Filename Overview
crates/blockchain/mempool.rs Renamed clear_broadcasted_txs() to remove_broadcasted_txs() with targeted removal by hash list to fix race condition
crates/networking/p2p/tx_broadcaster.rs Updated to use targeted remove_broadcasted_txs() instead of clearing entire pool, and removed a trace log statement
crates/networking/p2p/rlpx/connection/server.rs Offloaded expensive tx pool insertion to background task with tokio::spawn to avoid blocking ConnectionServer during validation

Sequence Diagram

sequenceDiagram
    participant B as Broadcaster
    participant M as Mempool
    participant C as ConnectionServer
    
    Note over B,M: Old behavior (race condition)
    B->>M: get_txs_for_broadcast()
    M-->>B: [tx1, tx2]
    Note over C,M: Meanwhile...
    C->>M: add_transaction(tx3)
    M->>M: broadcast_pool.insert(tx3)
    B->>M: clear_broadcasted_txs()
    M->>M: broadcast_pool.clear() ❌
    Note over M: tx3 lost! Never broadcast
    
    Note over B,M: New behavior (fixed)
    B->>M: get_txs_for_broadcast()
    M-->>B: [tx1, tx2]
    Note over C,M: Meanwhile...
    C->>M: add_transaction(tx3)
    M->>M: broadcast_pool.insert(tx3)
    B->>M: remove_broadcasted_txs([tx1, tx2])
    M->>M: remove only tx1, tx2 ✓
    Note over M: tx3 preserved for next broadcast
Loading

Last reviewed commit: 1ed81cb

@github-actions
Copy link
Copy Markdown

github-actions Bot commented Feb 24, 2026

Lines of code report

Total lines added: 6
Total lines removed: 0
Total lines changed: 6

Detailed view
+--------------------------------------------------------+-------+------+
| File                                                   | Lines | Diff |
+--------------------------------------------------------+-------+------+
| ethrex/crates/blockchain/mempool.rs                    | 388   | +3   |
+--------------------------------------------------------+-------+------+
| ethrex/crates/networking/p2p/rlpx/connection/server.rs | 1070  | +1   |
+--------------------------------------------------------+-------+------+
| ethrex/crates/networking/p2p/tx_broadcaster.rs         | 311   | +2   |
+--------------------------------------------------------+-------+------+

@github-actions
Copy link
Copy Markdown

🤖 Claude Code Review

Review: perf(l1): offload incoming tx pool insertion to background task

Summary

The change is well-motivated: spawning a tokio::task for expensive per-transaction work (ECDSA recovery, storage reads) is the correct approach to unblock the ConnectionServer. The overall architecture is sound. Below are my observations, roughly in order of importance.


Semantic correctness of the broadcaster notification ordering

server.rs lines 948 and 980–984

The key design decision is notifying the broadcaster with all received hashes before the background task actually inserts transactions into the pool. This is correct because AddTxs marks hashes as "known from this peer" in known_txs (to avoid re-broadcasting to the sender), while the actual broadcast in broadcast_txs() queries mempool.get_txs_for_broadcast() directly — it never reads from known_txs to determine what to broadcast.

However, it's worth noting two subtleties:

  • In L2 mode, blob and privileged tx hashes are added to known_txs (via the pre-computed tx_hashes) but will be rejected and never added to the pool. These dead entries live in known_txs for up to 10 minutes (the prune window). This was identical behavior in the old code (the broadcaster notification was also after-loop-with-all-hashes), so it is not a regression, but it is worth being aware of.

  • On validation failure, same story — invalid hashes accumulate in known_txs until the prune cycle. Again pre-existing, not a regression.


Fire-and-forget JoinHandle

server.rs line 956

tokio::spawn(async move { ... });

The returned JoinHandle is immediately dropped. If the task panics, tokio logs it but the error does not propagate anywhere. For insertion failures this is acceptable (they're already debug! logged), but a panic (e.g., from a poison lock in the mempool's RwLock) would be silently swallowed from the caller's perspective. Consider whether a tokio::task::JoinSet or an abort handle should be tracked at the connection level so panics can surface during teardown. This is a minor concern for a non-critical path, but a panic in add_transaction_to_pool during a RwLock poison scenario would go unnoticed.


Unbound concurrent task spawning under high peer load

Every incoming Transactions message from every peer spawns a new task. Each task can itself hold concurrent mempool write locks. The mempool uses a std::sync::RwLock (mempool.rs line 80), so concurrent tasks serialize on write operations — but the validate_transaction path does async storage reads before acquiring the write lock, meaning many tasks can concurrently be in the validation phase. This is unlikely to be a problem at current peer counts, but it would be worth documenting the implicit assumption that the number of simultaneous Transactions messages is bounded by peer count.


TOCTOU in add_transaction_to_pool (pre-existing, not introduced here)

blockchain.rs lines 2421–2432

if self.mempool.contains_tx(hash)? {
    return Ok(hash);
}
// ... validate (async, no lock) ...
self.mempool.add_transaction(hash, sender, ...)?;

The check at line 2421 and the insert at line 2432 are not atomic. Two background tasks (from two different peers) processing the same transaction can both pass contains_tx and both call add_transaction, inserting the hash into txs_order twice. This pre-existed the PR (multi-peer concurrency already existed), but the background-task pattern makes it easier to trigger from a single peer. The impact is limited — duplicate entries in txs_order cause the prune loop to remove an entry that no longer exists in transaction_pool, which is harmless but slightly wasteful. Worth a follow-up issue if not already tracked.


Minor: #[cfg(feature = "l2")] placement inside spawn

server.rs lines 954–955 and 958–964

The is_l2_mode binding is declared outside the spawn:

#[cfg(feature = "l2")]
let is_l2_mode = state.l2_state.is_supported();

and captured by value inside the async move. This is correct and idiomatic — bool is Copy. No issue here.


Positive observations

  • The comment at line 977–979 accurately describes the broadcaster's role and why pre-notification is safe. This is helpful for future readers.
  • Cloning Arc<Blockchain> (line 952) is cheap and correct for moving into the task.
  • Since is_synced is documented as monotonically transitioning false→true and never resetting (comment at blockchain.rs line 182), the pre-spawn check at line 947 cannot become stale in a way that causes over-insertion.
  • The drop of the superfluous continue after the add_transaction_to_pool error (old code line ~979) is a clean improvement — it was dead code since there was nothing after it in the loop body.

Verdict

The approach is correct and the fix is appropriate for the CI flake it targets. The two main points worth addressing are the dropped JoinHandle (a minor observability concern) and the pre-existing TOCTOU in add_transaction_to_pool (which this PR makes easier to trigger from a single peer). Neither is a blocker.


Automated review by Claude (Anthropic) · custom prompt

@edg-l
Copy link
Copy Markdown
Contributor Author

edg-l commented Feb 24, 2026

Addressing bot reviews

Kimi findings

  1. Race condition (hashes collected before task runs): By design. AddTxs only marks hashes as "known from this peer" in a known_txs map to avoid re-broadcasting back to the sender. It does NOT trigger any broadcast or require pool state. Actual broadcasting happens on a periodic timer that queries mempool directly via get_txs_for_broadcast().

  2. Missing deduplication check before spawning: add_transaction_to_pool already does mempool.contains_tx(hash) as its very first check (blockchain.rs:2421). Adding another check before spawning would be redundant.

  3. Unbounded task spawning (DoS): One task per Transactions message, not per transaction. The original code was equally vulnerable to flooding — it just blocked the ConnectionServer instead of spawning. Bounded concurrency could be a follow-up but is out of scope here.

  4. Missing task cancellation on disconnect: The spawned tasks just do pool inserts and exit. No long-running loops or waits. Adding cancellation logic would add complexity for negligible gain on a non-critical path.

  5. Inefficient blockchain.clone(): Blockchain is Arc-wrapped. Clone is a ref count bump, not a deep copy.

  6. Debug logging inconsistency: We can't capture state.node (a reference) into the spawned async move block. peer is state.node.to_string() — functionally identical output.

  7. Missing spawn failure handling: tokio::spawn only fails if the runtime is shutting down, at which point everything is tearing down anyway.

  8. Transaction validation timing: Validation still happens in the spawned task before pool insertion. The ordering is unchanged — validate then insert.

Codex findings

  1. L2-rejected hashes sent to broadcaster: Not a regression. The original code did the exact same thing — txs.transactions.iter().map(|tx| tx.hash()).collect() was called on ALL transactions without L2 filtering. The AddTxs handler only uses these hashes to suppress re-broadcast to the sender, so stale entries in known_txs are harmless (pruned after 10 minutes).

  2. Unbounded task spawning: Same as Kimi build: add Github actions boilerplate #3 above.

Claude review

Good observations — agree with all of them. The TOCTOU in add_transaction_to_pool and dead known_txs entries for rejected hashes are both pre-existing and not regressions. The JoinHandle drop is acceptable since insertion failures are already debug!-logged inside the task.


Update: broadcast_pool race condition fix

Investigation revealed the true root cause of the flaky LargeTxRequest test was a race condition in the tx broadcaster, not the ConnectionServer blocking:

  • get_txs_for_broadcast() reads the broadcast_pool (releases lock)
  • Meanwhile, incoming txs continue to be added to broadcast_pool one-by-one
  • clear_broadcasted_txs() clears all of broadcast_pool, including txs added since the read

Any tx added between the read and the clear is lost forever — never broadcast but removed from the pool. With 2000 txs being inserted sequentially and the broadcaster firing every 5ms, this reliably drops transactions.

Fix: Replaced clear_broadcasted_txs() with remove_broadcasted_txs(hashes) that only removes the specific txs that were actually fetched and sent. After the fix, hive devp2p eth tests pass 3/3 full runs (20/20 each).

@edg-l edg-l marked this pull request as draft February 24, 2026 10:37
@ethrex-project-sync ethrex-project-sync Bot moved this to In Progress in ethrex_l1 Feb 24, 2026
…kground task

Two fixes for the flaky hive `LargeTxRequest` devp2p eth test:

1. Fix broadcast_pool race condition: `clear_broadcasted_txs()` wiped
   the entire broadcast_pool, including txs added between the read
   (`get_txs_for_broadcast`) and the clear. With 2000 txs inserted
   one-by-one and the broadcaster firing every 5ms, this reliably
   dropped txs that were never broadcast. Replace with targeted
   `remove_broadcasted_txs()` that only removes the txs that were
   actually fetched and sent.

2. Offload incoming tx pool insertion to a background `tokio::spawn`
   task so the ConnectionServer isn't blocked during signature
   recovery and storage reads for large transaction batches.
@edg-l edg-l force-pushed the fix/offload-tx-pool-insertion-from-connection-server branch from 3a6895d to 1ed81cb Compare February 24, 2026 11:44
@edg-l edg-l changed the title perf(l1): offload incoming tx pool insertion to background task fix(l1): fix broadcast_pool race and offload tx pool insertion to background task Feb 24, 2026
@github-actions github-actions Bot removed the performance Block execution throughput and performance in general label Feb 24, 2026
@edg-l edg-l marked this pull request as ready for review February 24, 2026 11:45
@ethrex-project-sync ethrex-project-sync Bot moved this from In Progress to In Review in ethrex_l1 Feb 24, 2026
@github-actions
Copy link
Copy Markdown

🤖 Kimi Code Review

Review Summary

Overall, this PR improves transaction handling by moving heavy validation work off the critical path, but introduces a few correctness and safety issues.

Critical Issues

  1. Race condition in transaction broadcasting (crates/networking/p2p/rlpx/connection/server.rs:945-985)

    • The PR spawns a background task to add transactions to the pool but immediately notifies the broadcaster with transaction hashes. If the background task fails to add a transaction (e.g., validation fails), the broadcaster will still attempt to broadcast it, leading to inconsistent state.
    • Fix: Only notify the broadcaster after successful pool insertion, or use a channel to communicate success/failure back to the broadcaster.
  2. Missing transaction deduplication (crates/networking/p2p/tx_broadcaster.rs:243-247)

    • remove_broadcasted_txs removes transactions from the broadcast pool, but there's no guarantee these transactions are still in the mempool. If a transaction was rejected during validation, it will be removed from broadcast pool but never existed in mempool, potentially causing issues.

Performance & Safety Issues

  1. Unbounded spawn without error handling (crates/networking/p2p/rlpx/connection/server.rs:955)

    • tokio::spawn is used without any limit or error handling. Under high load, this could create thousands of tasks.
    • Fix: Use a bounded channel or semaphore to limit concurrent validation tasks.
  2. Missing L2 configuration in spawn (crates/networking/p2p/rlpx/connection/server.rs:954-956)

    • The is_l2_mode variable is defined outside the spawn but used inside, creating a potential race if the L2 configuration changes between definition and usage.
    • Fix: Move the check inside the spawn or clone the value.

Minor Issues

  1. Inefficient hash collection (crates/networking/p2p/tx_broadcaster.rs:246-247)

    • Creating a new Vec<H256> just to pass to remove_broadcasted_txs is unnecessary. Consider passing an iterator or slice directly.
  2. Debug logging inconsistency (crates/networking/p2p/rlpx/connection/server.rs:962)

    • The debug log uses peer=%peer but peer is a String from to_string(), which might not be as informative as the original Node struct.

Suggested Fixes

  1. For the race condition, consider:
// In handle_incoming_message
let (tx, rx) = tokio::sync::oneshot::channel();
tokio::spawn(async move {
    let mut successful_hashes = Vec::new();
    for tx in txs.transactions {
        if blockchain.add_transaction_to_pool(tx.clone()).await.is_ok() {
            successful_hashes.push(tx.hash());
        }
    }
    let _ = tx.send(successful_hashes);
});

if let Ok(successful_hashes) = rx.await {
    state.tx_broadcaster.cast(InMessage::AddTxs(successful_hashes, state.node.node_id())).await?;
}
  1. For the L2 configuration issue:
let is_l2_mode = state.l2_state.is_supported(); // Clone this value
let blockchain = state.blockchain.clone();
let peer = state.node.to_string();
tokio::spawn(async move {
    let is_l2_mode = is_l2_mode; // Use the cloned value
    // ... rest of the code
});

Automated review by Kimi (Moonshot AI) · custom prompt

@github-actions
Copy link
Copy Markdown

🤖 Codex Code Review

Review Summary

Overall direction looks good (decoupling expensive tx validation from the connection loop and only removing broadcasted hashes). I do see a couple of correctness/robustness risks worth addressing.

Findings

  1. Potential DoS via unbounded task spawning
    In crates/networking/p2p/rlpx/connection/server.rs within handle_incoming_message (Transactions branch), each incoming Transactions message spawns a tokio::spawn task that validates and inserts all txs. A peer can flood Transactions to create unbounded concurrent tasks, increasing CPU and memory pressure.
    Suggestion: gate with a bounded semaphore / task queue, or reuse a dedicated worker channel for tx validation. That keeps backpressure and avoids unbounded task creation.

  2. Race between broadcaster notification and mempool insertion
    In the same block, tx_broadcaster.cast(InMessage::AddTxs(...)) happens immediately, while mempool insertion is async in the background. If the broadcaster queries the mempool before insertion completes, those txs may not be broadcasted in that round and only later (or potentially re-broadcasted to the same peer if dedupe is tied to mempool state rather than the AddTxs tracker).
    Suggestion: confirm TxBroadcaster’s per-peer dedupe uses the AddTxs hashes independently of mempool presence. If not, consider making broadcaster notification happen after insertion or adjust dedupe to avoid accidental re-broadcasts.

Notes

  • crates/blockchain/mempool.rs change from clear_broadcasted_txs to remove_broadcasted_txs looks fine, but it turns O(1) clear into O(n) removes. Probably fine given typical sizes, but worth noting if broadcast sets are large.

If you want, I can scan the broadcaster and mempool internals to verify the dedupe semantics and whether the race in Item 2 is a real behavior issue.


Automated review by OpenAI Codex · custom prompt

@github-actions
Copy link
Copy Markdown

🤖 Claude Code Review

Now I have a thorough picture of the changes. Let me write the review.


Review: fix(l1): fix broadcast_pool race and offload tx pool insertion to background task

Summary

Both fixes are well-motivated and correctly implemented. The race condition in broadcast_pool is real and the surgical fix is the right approach. The background-task offload for pool insertion is sound given the architecture of TxBroadcaster. Overall the PR is in good shape.


Fix 1 — remove_broadcasted_txs (mempool.rs)

Correct. The original clear_broadcasted_txs had a classic TOCTOU window:

get_txs_for_broadcast  →  [lock released]  →  new txs added to broadcast_pool  →  clear_broadcasted_txs

The replacement iterates only over the hashes that were actually fetched and sent. Because broadcast_txs() runs inside a single-actor GenServer (one message handled at a time), there is no concurrent broadcast_txs() caller to race with between the fetch and the remove, so the targeted removal is safe and complete.

Minor nit: the function could accept &[H256] with a single write() acquisition, which is what it already does — nothing to change here.


Fix 2 — Background task for pool insertion (server.rs)

Broadcaster notification ordering (server.rs lines 948, 980–984)

The key design choice — notifying tx_broadcaster with tx_hashes before the background task has inserted anything into the pool — is correct. AddTxs marks hashes as "known from this peer" in known_txs solely to suppress re-broadcasting back to the sender. Actual outbound broadcasts are driven by get_txs_for_broadcast(), which only returns hashes present in both transaction_pool and broadcast_pool. So if the background task hasn't finished inserting when the 1-second timer fires, those transactions simply skip that broadcast cycle and are picked up in the next one. No transactions are lost.

L2-filtered hashes in known_txs (server.rs lines 958–964)

tx_hashes is computed before the L2 filter runs inside the spawned task, so blob/privileged tx hashes are passed to AddTxs even though those transactions will be rejected and never enter the pool. These dead entries accumulate in known_txs until the 10-minute prune cycle. This is not a regression — the pre-existing code passed all hashes to AddTxs in the same way. The practical impact is negligible (stale entries in a HashMap that gets pruned).

Dropped JoinHandle (server.rs line 956)

tokio::spawn(async move { ... });

The returned JoinHandle is silently dropped. If add_transaction_to_pool panics — most plausibly from a poisoned std::sync::RwLock inside the mempool — tokio will log the panic but the connection handler will never observe it. For a non-critical insertion path with debug!-level error handling, this is acceptable. Consider tracking it with a JoinSet at the connection level if deeper observability becomes desirable, but this is not a blocker.

Unbounded spawning under adversarial peer load (server.rs line 956)

One task is spawned per Transactions message received from any peer. Each task can concurrently be in validate_transaction (doing async storage reads) before serialising on the mempool RwLock for the actual insert. In practice this is bounded by peer count, which is small. The previous behaviour was equally unbounded — it just blocked the ConnectionServer instead of spawning. A semaphore-based concurrency cap could be a follow-up if needed, but it is out of scope for this fix.

Pre-existing TOCTOU in add_transaction_to_pool (blockchain.rs lines 2421, 2432)

if self.mempool.contains_tx(hash)? {   // check (lock released)
    return Ok(hash);
}
// ... async validate (no lock) ...
self.mempool.add_transaction(...)?;    // write (new lock)

Two background tasks from the same or different peers processing the same hash can both pass contains_tx and both call add_transaction. Because HashMap::insert and HashSet::insert are both idempotent on duplicate keys, the actual impact is a harmless redundant write followed by two broadcast_pool inserts (also idempotent). This pre-existed the PR but the background-task pattern makes it easier to trigger from a single high-volume peer. Worth a follow-up issue.


Minor

  • Removing the trace!("No transactions to broadcast") log (tx_broadcaster.rs line 187) reduces noise during idle periods, which is a clean-up.
  • Removing the dead continue after the add_transaction_to_pool error branch is correct — there was no code after the if let Err block, so it was unreachable.
  • state.blockchain.clone() is a reference-count bump on an Arc<Blockchain>, not a deep copy. Correct and cheap.
  • is_l2_mode is bool (Copy), so its capture by the async move is correct without explicit cloning.

Verdict

Both fixes are correct and appropriate for the CI flake they target. The main items noted above (dropped JoinHandle, unbounded spawning, pre-existing TOCTOU) are either pre-existing or minor observability concerns, none of which are blockers. Approve.


Automated review by Claude (Anthropic) · custom prompt

@github-actions
Copy link
Copy Markdown

Benchmark Block Execution Results Comparison Against Main

Command Mean [s] Min [s] Max [s] Relative
base 64.823 ± 0.211 64.604 65.194 1.00 ± 0.00
head 64.749 ± 0.237 64.504 65.271 1.00

Copy link
Copy Markdown
Contributor

@ElFantasma ElFantasma left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just a comment on the JoinHandle. It LGTM

debug!(peer=%state.node, "Rejecting transaction in L2 mode - {tx_type} transactions are not broadcasted in L2");
continue;
}
tokio::spawn(async move {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The JoinHandle is dropped here, making this fire-and-forget. If add_transaction_to_pool panics (e.g., poisoned RwLock), the panic is caught by the tokio runtime but won't surface clearly. More importantly, on node shutdown, in-flight insertion tasks may be silently dropped.

Not necessarily a blocker since the task is bounded (processes a fixed set of txs and the errors are already debug-logged), but worth considering whether to at least log on task failure, e.g.:

tokio::spawn(async move {
    // ... existing code ...
}.instrument(tracing::debug_span!("tx_pool_insert")));

or store the handle in a JoinSet for clean shutdown.

@ilitteri ilitteri added this pull request to the merge queue Feb 24, 2026
Merged via the queue into main with commit 8103e18 Feb 24, 2026
66 of 68 checks passed
@ilitteri ilitteri deleted the fix/offload-tx-pool-insertion-from-connection-server branch February 24, 2026 16:18
@github-project-automation github-project-automation Bot moved this from In Review to Done in ethrex_l1 Feb 24, 2026
lakshya-sky pushed a commit to lakshya-sky/ethrex that referenced this pull request Mar 10, 2026
…kground task (lambdaclass#6253)

## Summary

- **Fix broadcast_pool race condition**: `clear_broadcasted_txs()` wiped
the entire `broadcast_pool`, including txs added between the read
(`get_txs_for_broadcast()`) and the clear. With 2000 txs being inserted
one-by-one and the broadcaster firing every 5ms, this reliably dropped
txs that were never announced to peers. Replaced with targeted
`remove_broadcasted_txs()` that only removes the txs that were actually
fetched and broadcast.
- **Offload tx pool insertion to background task**: Spawn the incoming
`Transactions` message pool-insertion loop as a background
`tokio::spawn` task instead of running it synchronously in
`handle_cast`. This prevents the ConnectionServer from being blocked
during signature recovery + storage reads for large transaction batches.

## Motivation

The hive `LargeTxRequest` devp2p eth test intermittently fails in CI
([example
run](https://github.com/lambdaclass/ethrex/actions/runs/22335807040)).
Root cause is a race condition in the transaction broadcaster:

1. `get_txs_for_broadcast()` reads the `broadcast_pool` (releases lock)
2. Meanwhile, incoming txs continue to be added to `broadcast_pool`
3. `clear_broadcasted_txs()` clears **all** of `broadcast_pool`,
including txs added in step 2

Any tx added between steps 1 and 3 is lost forever — never broadcast but
cleared from the pool. With 2000 txs being inserted sequentially and the
broadcaster firing every 5ms, this race reliably drops transactions,
causing the test's 2-second timeout to expire before all tx hashes are
announced.

## Test plan

- [x] `cargo check` (with and without `l2` feature)
- [x] Local hive devp2p eth tests pass 3/3 runs: `./hive --sim devp2p
--sim.limit "eth" --client ethrex` (20/20 each run)
- [ ] CI hive daily run should show P2P Eth capability back to 20/20
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

L1 Ethereum client

Projects

Status: Done

Development

Successfully merging this pull request may close these issues.

5 participants