
chore: standalone forge broadcast wrapper with retry, timeout, and anvil detection#19824

Merged
ludamad merged 12 commits into merge-train/spartan from ad/chore/forge-upgrade-slow-batching
Feb 11, 2026

Conversation


@ludamad ludamad commented Jan 21, 2026

(I confirm I have edited and reviewed this AI summary)

Summary

Replaces ad-hoc --batch-size / --timeout flags across shell scripts and TypeScript with a single standalone wrapper script (l1-contracts/scripts/forge_broadcast.ts) that handles all forge broadcast reliability concerns in one place:

  • Anvil workaround — the wrapper enforces the timeout itself, since foundry's own handling of broadcast timeouts is unreliable. Against anvil, retries clear broadcast artifacts and start from scratch (anvil's automine can strand batched transactions in the mempool); on real chains, retries use --resume.
  • Separated verification — the wrapper strips --verify from broadcast attempts so the timeout only covers transaction landing. After the broadcast succeeds, it runs forge script --resume --verify --broadcast with no timeout. A verification failure is logged but doesn't affect the exit code (the transactions already landed). This prevents two problems: (1) the timeout killing forge mid-verification even though all transactions mined, and (2) Etherscan failures triggering broadcast retries.
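The verify-stripping step above can be sketched as follows (a minimal sketch with a hypothetical helper name, not the PR's actual code):

```typescript
// Hypothetical helper illustrating the split: remove --verify from the
// broadcast attempt and remember that a separate verification pass is needed.
function splitVerifyArgs(args: string[]): {
  broadcastArgs: string[];
  wantsVerify: boolean;
} {
  const wantsVerify = args.includes("--verify");
  // The timed broadcast runs without --verify; verification happens later
  // via `forge script --resume --verify --broadcast` with no timeout.
  const broadcastArgs = args.filter((a) => a !== "--verify");
  return { broadcastArgs, wantsVerify };
}
```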

Files changed

| File | Change |
| --- | --- |
| l1-contracts/scripts/forge_broadcast.ts | New standalone wrapper script |
| l1-contracts/package.json | Add "type": "module" for ESM |
| l1-contracts/scripts/run_rollup_upgrade.sh | Use forge_broadcast.ts |
| l1-contracts/scripts/test_rollup_upgrade.sh | Use forge_broadcast.ts |
| yarn-project/ethereum/src/deploy_aztec_l1_contracts.ts | Spawn forge_broadcast.ts instead of forge directly; remove inline retry/timeout logic |
| yarn-project/l1-artifacts/scripts/copy-foundry-artifacts.sh | Compile forge_broadcast.ts to JS with swc (Node refuses .ts from node_modules) |

Stress test results

50,000 deploy+upgrade cycles against local anvil instances (20 parallel workers, half instant-mining, half --block-time 1):

| Metric | Value |
| --- | --- |
| Total runs | 46,484 |
| Pass | 46,484 (100%) |
| Fail | 0 |
| Retries needed | 13 (0.03%) |
| Retry recovery | 13/13 (100%) |

All 13 retries occurred on instant-mining anvil workers (the automine race condition). Zero retries on --block-time workers. All recovered successfully. Previous attempts could not recover on those rare hangs.

Test plan

  • Existing CI tests cover L1 deployment paths (deploy + upgrade)
  • l1-contracts/scripts/test_rollup_upgrade.sh exercises the full deploy+upgrade flow end-to-end
  • Stress tested with 50,000 runs across 20 parallel anvil instances

@ludamad ludamad force-pushed the ad/chore/forge-upgrade-slow-batching branch from 739397d to 7c00e82 on January 21, 2026 23:33
@ludamad ludamad marked this pull request as ready for review January 21, 2026 23:34
@ludamad ludamad requested a review from charlielye as a code owner January 21, 2026 23:34
@ludamad ludamad force-pushed the ad/chore/forge-upgrade-slow-batching branch 6 times, most recently from 7b36542 to dde561c on January 23, 2026 21:04
@ludamad ludamad force-pushed the ad/chore/forge-upgrade-slow-batching branch 4 times, most recently from f52c815 to cbd553d on January 27, 2026 19:54
@ludamad ludamad force-pushed the ad/chore/forge-upgrade-slow-batching branch from dc4eb3b to 0e33d76 on February 4, 2026 14:33
@ludamad ludamad changed the title from "chore: upgrade foundry to 1.5.1 and use --slow flag in deployments" to "chore: fix forge broadcast hang with --batch-size and --timeout" on Feb 4, 2026
@ludamad ludamad changed the base branch from next to merge-train/spartan February 4, 2026 14:33
@ludamad ludamad force-pushed the ad/chore/forge-upgrade-slow-batching branch from 0e33d76 to edca85c on February 4, 2026 17:31
@ludamad ludamad force-pushed the ad/chore/forge-upgrade-slow-batching branch 3 times, most recently from 0eaa392 to 80ba385 on February 5, 2026 18:02
…port

Replace inline retry logic and direct forge calls with a standalone
TypeScript wrapper script (l1-contracts/scripts/forge_broadcast.ts).

Key behaviors:
- --batch-size 8 to prevent forge broadcast hangs
- External timeout (forge's --timeout is unreliable for broadcast hangs)
- On anvil: detect via web3_clientVersion, retry from scratch on failure
  (anvil's auto-miner race condition strands txs in mempool)
- On real chains: retry with --resume to pick up unmined transactions
- Buffers stdout per attempt, only emits the successful attempt's output
- Waits for stdout drain before exit to avoid pipe truncation

References:
- foundry-rs/foundry#6796 (batch size hang)
- foundry-rs/foundry#8919 (anvil auto-miner race)
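The anvil detection described here can be sketched like this (the JSON-RPC call is elided; only the string check on the web3_clientVersion result is shown, and the function name is hypothetical):

```typescript
// Hypothetical sketch: anvil reports a client version string such as
// "anvil/v1.x.y", so a case-insensitive substring check is sufficient.
function isAnvilClientVersion(clientVersion: string): boolean {
  return clientVersion.toLowerCase().includes("anvil");
}
```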
… exit

emitAndExit uses process.stdout.write with a callback to call process.exit,
which is async. Without awaiting, execution falls through to subsequent log
lines and retry loops before process.exit fires. Change return type to
Promise<never> and await all call sites.
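The fix can be sketched as a small promise wrapper (a minimal sketch; the name writeAndFlush is hypothetical):

```typescript
// Wrap process.stdout.write's completion callback in a promise so callers
// can `await` the flush before calling process.exit.
function writeAndFlush(text: string): Promise<void> {
  return new Promise((resolve) => {
    process.stdout.write(text, () => resolve());
  });
}
```

An emitAndExit built on this would await the flush and only then call process.exit, so control never falls through to later log lines or retry loops.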
Move chain-specific timeout logic into forge_broadcast.ts itself —
queries eth_chainId at startup and selects 300s for mainnet/sepolia,
50s for everything else. FORGE_BROADCAST_TIMEOUT env var still works
as an override. Remove getForgeBroadcastTimeout from callers.

Also fixes issues from code review:
- rpcCall now rejects on JSON-RPC error responses (was silently
  resolving undefined, which could cause false positives in
  verifyBroadcastOnChain)
- rpcCall has a 10s timeout to prevent wrapper's own RPC calls from
  hanging indefinitely
- Remove incorrect foundry issue #8919 link (was a code refactoring
  PR, not about anvil)
- Remove unnecessary try/catch around rmSync with force:true
- Replace magic exit code 124 with named EXIT_TIMEOUT constant
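The chain-based timeout selection described above can be sketched like so (a minimal sketch assuming chain IDs 1 and 11155111 for mainnet and sepolia; the function name is hypothetical):

```typescript
// 300s for mainnet (1) and sepolia (11155111), 50s otherwise, with an
// optional FORGE_BROADCAST_TIMEOUT-style override taking precedence.
function selectBroadcastTimeoutSecs(
  chainId: number,
  envOverride?: string,
): number {
  if (envOverride !== undefined) {
    const n = parseInt(envOverride, 10);
    if (Number.isSafeInteger(n) && n > 0) return n;
  }
  return chainId === 1 || chainId === 11155111 ? 300 : 50;
}
```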
@ludamad ludamad force-pushed the ad/chore/forge-upgrade-slow-batching branch from 80ba385 to 102ee4f on February 9, 2026 12:28
@ludamad ludamad changed the title from "chore: fix forge broadcast hang with --batch-size and --timeout" to "chore: standalone forge broadcast wrapper with retry, timeout, and anvil detection" on Feb 9, 2026

AztecBot commented Feb 9, 2026

Flakey Tests

🤖 says: This CI run detected 1 tests that failed, but were tolerated due to a .test_patterns.yml entry.

FLAKED (http://ci.aztec-labs.com/a3c2164037e5135d): yarn-project/end-to-end/scripts/run_test.sh simple src/e2e_epochs/epochs_mbps.parallel.test.ts "builds multiple blocks per slot with transactions anchored to proposed blocks" (220s) (code: 1) group:e2e-p2p-epoch-flakes

The on-chain nonce verification logic was never triggered as a save
in 46,484 stress test runs. On anvil (where all retries occurred),
it's useless because broadcast artifacts are cleared before retry.
The nonce heuristic is also unreliable — a higher nonce doesn't prove
those specific transactions were mined. Remove to simplify.
The script has a shebang that handles --experimental-strip-types, so
spawn it directly instead of via node. It also doesn't need to be
copied to the temp dir — it just spawns forge which inherits cwd.
Strip --verify from broadcast attempts and run it as a separate step
after transactions land. Forge runs verification AFTER all receipts
are collected (crates/script/src/lib.rs:333-338) and exits non-zero
if any verification fails (crates/script/src/verify.rs), even when
all transactions succeeded. This caused two problems:

1. The broadcast timeout could kill forge mid-verification, wasting
   the attempt even though all transactions were already mined.
2. A verification failure (e.g. Etherscan rate limit) triggered
   broadcast retries, resubmitting already-mined transactions.

After a successful broadcast, we now run `forge script --resume
--verify --broadcast` without a timeout. The --resume path re-compiles
and re-links using libraries from the broadcast artifacts
(crates/script/src/build.rs CompiledState::resume) — it doesn't need
simulation data. The broadcast step is a no-op since all receipts
already exist. Verification failure is logged but doesn't affect the
exit code, since the critical work (transaction landing) already
succeeded.
@ludamad ludamad requested a review from charlielye February 9, 2026 14:25
#!/usr/bin/env -S node --experimental-strip-types

// forge_broadcast.ts - Reliable forge script broadcast with retry and timeout.
//
I confirm that I have reviewed this carefully and have high confidence

Node v22 treats .ts files as CJS by default. The forge_broadcast.ts
script uses ESM imports, which requires "type": "module" in the
nearest package.json. Without copying l1-contracts/package.json, Node
finds l1-artifacts/package.json instead (no "type": "module") and
fails to parse the imports.
Node.js refuses to load .ts files from node_modules even with
--experimental-strip-types. Compile with swc during copy-foundry-artifacts
and invoke via process.execPath from deploy code.
Move swc compilation from copy-foundry-artifacts.sh into l1-contracts
bootstrap so the .js is built at the source. Add @swc/cli and @swc/core
as devDependencies. copy-foundry-artifacts.sh now just copies the
pre-built .js file.

socket-security bot commented Feb 9, 2026

Review the following changes in direct dependencies. Learn more about Socket for GitHub.

Diff: Updated npm/ox@0.8.3 ⏵ 0.8.9

ludamad and others added 2 commits February 10, 2026 11:52
Node.js refuses to load .ts files from node_modules, so make the script
plain JavaScript instead of compiling with swc. Removes @swc/cli and
@swc/core devDependencies from l1-contracts.
@ludamad ludamad enabled auto-merge (squash) February 10, 2026 13:54

@alexghr alexghr left a comment


looks good!

return new Promise((resolve, reject) => {
const url = new URL(rpcUrl);
const body = JSON.stringify({ jsonrpc: "2.0", id: 1, method, params });
const reqFn = url.protocol === "https:" ? httpsRequest : httpRequest;

fetch is available in node. I think it would simplify the code greatly (built-in stream handling, timeouts, promise support etc).
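A fetch-based version along the lines of this suggestion might look like the following (a sketch, not the merged code; the 10s budget mirrors the wrapper's own RPC timeout):

```typescript
// Reject on JSON-RPC error responses instead of silently resolving undefined.
function parseRpcResponse(
  json: { result?: unknown; error?: { message: string } },
  method: string,
): unknown {
  if (json.error) throw new Error(`${method} failed: ${json.error.message}`);
  return json.result;
}

// Built-in fetch with AbortSignal.timeout replaces the manual http/https
// request plumbing and keeps the wrapper's own RPC calls from hanging.
async function rpcCall(
  rpcUrl: string,
  method: string,
  params: unknown[] = [],
): Promise<unknown> {
  const res = await fetch(rpcUrl, {
    method: "POST",
    headers: { "content-type": "application/json" },
    body: JSON.stringify({ jsonrpc: "2.0", id: 1, method, params }),
    signal: AbortSignal.timeout(10_000),
  });
  return parseRpcResponse(await res.json(), method);
}
```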

Comment on lines +51 to +54
const MAX_RETRIES = parseInt(
process.env.FORGE_BROADCAST_MAX_RETRIES ?? "3",
10,
);

this can still return NaN. Might be worth just adding a quick check

Suggested change
const MAX_RETRIES = parseInt(
process.env.FORGE_BROADCAST_MAX_RETRIES ?? "3",
10,
);
const MAX_RETRIES = parseInt(
process.env.FORGE_BROADCAST_MAX_RETRIES ?? "3",
10,
);
if (!Number.isSafeInteger(MAX_RETRIES)) {
process.stderr.write(`MAX_RETRIES is not a valid integer.\n`);
process.exit(1);
}


I guess it's not too bad because the way it's being used (attempt <= MAX_RETRIES) is safe, ie. it will never run an attempt. I was worried this could cause it to retry indefinitely

Comment on lines +172 to +176
const timer = setTimeout(() => {
timedOut = true;
proc.kill("SIGTERM");
killTimer = setTimeout(() => proc.kill("SIGKILL"), KILL_GRACE);
}, timeoutSecs * 1000);

FWIW spawn takes signal: AbortSignal, timeout: number, killSignal: number to automatically kill the process. It might be enough to just SIGKILL the process if it hangs.
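A sketch of that simplification, using the synchronous variant for brevity (illustrative values; the merged code keeps the manual timers):

```typescript
import { spawnSync } from "node:child_process";

// timeout/killSignal make Node kill the child itself, replacing the manual
// setTimeout + SIGTERM/SIGKILL escalation. Here a child that would sleep
// for 60s is killed after 500ms.
const result = spawnSync(
  process.execPath,
  ["-e", "setTimeout(() => {}, 60_000)"],
  { timeout: 500, killSignal: "SIGKILL" },
);
// On POSIX systems, result.signal reports the signal that killed the child.
```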

// - Forge computes new nonces from on-chain state
// - New transactions replace any stuck ones with the same nonce
// - The race condition is intermittent (~0.04%), so retries almost always succeed
rmSync("broadcast", { recursive: true, force: true });

just preempting flakes here

Suggested change
rmSync("broadcast", { recursive: true, force: true });
rmSync("broadcast", { recursive: true, force: true, maxRetries: 3, retryDelay: 100 });

@ludamad ludamad merged commit 005cf8b into merge-train/spartan Feb 11, 2026
11 checks passed
@ludamad ludamad deleted the ad/chore/forge-upgrade-slow-batching branch February 11, 2026 10:19
github-merge-queue bot pushed a commit that referenced this pull request Feb 12, 2026
BEGIN_COMMIT_OVERRIDE
chore: Should fix proving benchmarks by reducing committee lag (#20381)
chore: standalone forge broadcast wrapper with retry, timeout, and anvil
detection (#19824)
chore: benchmark tx val (#20227)
fix: cloudflare terraform API (#20387)
fix: HA e2e test order & retries (#20383)
fix: incorporate forge broadcast review feedback (#20390)
refactor(p2p): rewrite FileStoreTxCollection with retry, backoff, and
shared worker pool (#20317)
chore(ci): run ci job on draft PRs (#20395)
feat(bot): allow anchoring txs to proposed chain (#20392)
feat(p2p): slot-based soft deletion for TxPoolV2 (#20388)
chore: add setup-container script (#20309)
feat: build aztec-prover-agent with baked-in CRS (#20391)
fix!: change protocol contracts deployer to be the contract address
(#20396)
feat(p2p): enforce minimum tx pool age for block building (#20384)
refactor(sentinel): update validator statuses to checkpoint-based naming
(#20372)
chore: log tx hash (#20413)
chore: update l1 fee analysis to measure blob count in L1 blocks
(#20414)
chore: set up tx file store in next-net (#20418)
chore(ci): gate draft PRs from CI, allow override with ci-draft label
(#20426)
fix: stabilize writing_an_account_contract.test.ts (#20420)
END_COMMIT_OVERRIDE
