chore: standalone forge broadcast wrapper with retry, timeout, and anvil detection#19824
…port

Replace inline retry logic and direct forge calls with a standalone TypeScript wrapper script (`l1-contracts/scripts/forge_broadcast.ts`).

Key behaviors:
- `--batch-size 8` to prevent forge broadcast hangs
- External timeout (forge's `--timeout` is unreliable for broadcast hangs)
- On anvil: detect via `web3_clientVersion`, retry from scratch on failure (anvil's auto-miner race condition strands txs in the mempool)
- On real chains: retry with `--resume` to pick up unmined transactions
- Buffers stdout per attempt, only emits the successful attempt's output
- Waits for stdout drain before exit to avoid pipe truncation

References:
- foundry-rs/foundry#6796 (batch size hang)
- foundry-rs/foundry#8919 (anvil auto-miner race)
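The retry policy in the list above can be sketched roughly as follows. This is a minimal illustration, not the wrapper's actual code; `runForge` and `isAnvil` are hypothetical stand-ins for spawning forge and for the `web3_clientVersion` detection.

```typescript
// Sketch of the retry policy (hypothetical helper names, not the real wrapper):
// on anvil, retry from scratch so forge recomputes nonces from on-chain state;
// on real chains, pass --resume so unmined transactions are picked up.
type AttemptResult = { ok: boolean; output: string };

async function broadcastWithRetry(
  runForge: (extraArgs: string[]) => Promise<AttemptResult>,
  isAnvil: boolean,
  maxRetries: number,
): Promise<string> {
  for (let attempt = 1; attempt <= maxRetries; attempt++) {
    const extraArgs = !isAnvil && attempt > 1 ? ["--resume"] : [];
    const result = await runForge(extraArgs);
    // Output is buffered per attempt; only the successful attempt's is returned.
    if (result.ok) return result.output;
  }
  throw new Error(`forge broadcast failed after ${maxRetries} attempts`);
}
```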
… exit

emitAndExit uses `process.stdout.write` with a callback that calls `process.exit`, which is asynchronous. Without awaiting, execution falls through to subsequent log lines and retry loops before `process.exit` fires. Change the return type to `Promise<never>` and await all call sites.
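A minimal sketch of the fix described here (an assumed shape, not the exact wrapper code): returning a never-resolving Promise makes `await emitAndExit(...)` halt the caller until `process.exit` fires from the write callback.

```typescript
// emitAndExit never resolves: process.exit is invoked from the stdout write
// callback, so buffered output is flushed before the process dies, and
// awaiting callers cannot fall through to later log lines or retry loops.
function emitAndExit(output: string, code: number): Promise<never> {
  return new Promise<never>(() => {
    process.stdout.write(output, () => process.exit(code));
  });
}
```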
Move chain-specific timeout logic into forge_broadcast.ts itself: it queries `eth_chainId` at startup and selects 300s for mainnet/sepolia, 50s for everything else. The `FORGE_BROADCAST_TIMEOUT` env var still works as an override. Remove `getForgeBroadcastTimeout` from callers.

Also fixes issues from code review:
- rpcCall now rejects on JSON-RPC error responses (it was silently resolving undefined, which could cause false positives in verifyBroadcastOnChain)
- rpcCall has a 10s timeout so the wrapper's own RPC calls cannot hang indefinitely
- Remove the incorrect foundry issue #8919 link (it was a code refactoring PR, not about anvil)
- Remove the unnecessary try/catch around rmSync with `force: true`
- Replace the magic exit code 124 with a named `EXIT_TIMEOUT` constant
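The timeout selection can be sketched like this (a simplified standalone version; `selectTimeoutSecs` is a hypothetical name, and only the 300s/50s split and the env override are taken from the description above):

```typescript
// Pick the broadcast timeout from the chain id (1 = mainnet, 11155111 = sepolia),
// letting a FORGE_BROADCAST_TIMEOUT-style override win when it is a valid number.
function selectTimeoutSecs(chainId: number, envOverride?: string): number {
  if (envOverride !== undefined) {
    const parsed = parseInt(envOverride, 10);
    if (Number.isSafeInteger(parsed) && parsed > 0) return parsed;
  }
  const slowChains = new Set([1, 11155111]); // mainnet, sepolia
  return slowChains.has(chainId) ? 300 : 50;
}
```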
Flakey Tests 🤖 says: This CI run detected 1 test that failed but was tolerated due to a .test_patterns.yml entry.
The on-chain nonce verification logic never salvaged an attempt in 46,484 stress test runs. On anvil (where all retries occurred) it is useless, because broadcast artifacts are cleared before retry. The nonce heuristic is also unreliable: a higher nonce doesn't prove those specific transactions were mined. Remove it to simplify.
The script has a shebang that handles --experimental-strip-types, so spawn it directly instead of via node. It also doesn't need to be copied to the temp dir — it just spawns forge which inherits cwd.
Strip `--verify` from broadcast attempts and run it as a separate step after transactions land. Forge runs verification AFTER all receipts are collected (crates/script/src/lib.rs:333-338) and exits non-zero if any verification fails (crates/script/src/verify.rs), even when all transactions succeeded. This caused two problems:

1. The broadcast timeout could kill forge mid-verification, wasting the attempt even though all transactions were already mined.
2. A verification failure (e.g. an Etherscan rate limit) triggered broadcast retries, resubmitting already-mined transactions.

After a successful broadcast, we now run `forge script --resume --verify --broadcast` without a timeout. The `--resume` path re-compiles and re-links using libraries from the broadcast artifacts (crates/script/src/build.rs, CompiledState::resume), so it doesn't need simulation data. The broadcast step is a no-op since all receipts already exist. Verification failure is logged but doesn't affect the exit code, since the critical work (transactions landing) already succeeded.
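The two-phase flow can be sketched as below; `runStep` is a hypothetical stand-in for spawning forge, and the argument lists are abbreviated.

```typescript
// Phase 1 broadcasts under a timeout; phase 2 verifies with --resume and no
// timeout. A verification failure is logged but never changes the result,
// since the transactions have already landed.
async function broadcastThenVerify(
  runStep: (args: string[], timeoutSecs?: number) => Promise<boolean>,
  timeoutSecs: number,
): Promise<boolean> {
  const landed = await runStep(["--broadcast"], timeoutSecs);
  if (!landed) return false;
  // --resume re-links from broadcast artifacts; the broadcast itself is a
  // no-op because all receipts already exist.
  const verified = await runStep(["--resume", "--verify", "--broadcast"]);
  if (!verified) {
    console.error("verification failed, but all transactions already landed");
  }
  return true;
}
```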
```typescript
#!/usr/bin/env -S node --experimental-strip-types

// forge_broadcast.ts - Reliable forge script broadcast with retry and timeout.
//
```
I confirm that I have reviewed this carefully and have high confidence
Node v22 treats .ts files as CJS by default. The forge_broadcast.ts script uses ESM imports, which requires "type": "module" in the nearest package.json. Without copying l1-contracts/package.json, Node finds l1-artifacts/package.json instead (no "type": "module") and fails to parse the imports.
Node.js refuses to load .ts files from node_modules even with --experimental-strip-types. Compile with swc during copy-foundry-artifacts and invoke via process.execPath from deploy code.
Move swc compilation from copy-foundry-artifacts.sh into l1-contracts bootstrap so the .js is built at the source. Add @swc/cli and @swc/core as devDependencies. copy-foundry-artifacts.sh now just copies the pre-built .js file.
Node.js refuses to load .ts files from node_modules, so make the script plain JavaScript instead of compiling with swc. Removes @swc/cli and @swc/core devDependencies from l1-contracts.
```typescript
return new Promise((resolve, reject) => {
  const url = new URL(rpcUrl);
  const body = JSON.stringify({ jsonrpc: "2.0", id: 1, method, params });
  const reqFn = url.protocol === "https:" ? httpsRequest : httpRequest;
```
fetch is available in node. I think it would simplify the code greatly (built-in stream handling, timeouts, promise support etc).
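A fetch-based version of the RPC helper along the lines of this suggestion might look like the sketch below (assuming Node 18+ for global `fetch` and `AbortSignal.timeout`; the 10s budget mirrors the wrapper's, but the function is illustrative, not the PR's code):

```typescript
// JSON-RPC call via the built-in fetch: AbortSignal.timeout replaces a
// hand-rolled timer, and a JSON-RPC error response rejects rather than
// silently resolving undefined.
async function rpcCall(
  rpcUrl: string,
  method: string,
  params: unknown[] = [],
): Promise<unknown> {
  const res = await fetch(rpcUrl, {
    method: "POST",
    headers: { "content-type": "application/json" },
    body: JSON.stringify({ jsonrpc: "2.0", id: 1, method, params }),
    signal: AbortSignal.timeout(10_000),
  });
  if (!res.ok) throw new Error(`${method}: HTTP ${res.status}`);
  const json = (await res.json()) as {
    result?: unknown;
    error?: { code: number; message: string };
  };
  if (json.error) throw new Error(`${method}: ${json.error.message}`);
  return json.result;
}
```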
```typescript
const MAX_RETRIES = parseInt(
  process.env.FORGE_BROADCAST_MAX_RETRIES ?? "3",
  10,
);
```
this can still return NaN. Might be worth just adding a quick check
Suggested change:

```diff
 const MAX_RETRIES = parseInt(
   process.env.FORGE_BROADCAST_MAX_RETRIES ?? "3",
   10,
 );
+if (!Number.isSafeInteger(MAX_RETRIES)) {
+  process.stderr.write(`MAX_RETRIES is not a valid integer.\n`);
+  process.exit(1);
+}
```
I guess it's not too bad, because the way it's being used (attempt <= MAX_RETRIES) is safe, i.e. with NaN it will never run an attempt. I was worried this could cause it to retry indefinitely.
```typescript
const timer = setTimeout(() => {
  timedOut = true;
  proc.kill("SIGTERM");
  killTimer = setTimeout(() => proc.kill("SIGKILL"), KILL_GRACE);
}, timeoutSecs * 1000);
```
FWIW spawn takes `signal: AbortSignal`, `timeout: number`, and `killSignal: string | number` to automatically kill the process. It might be enough to just SIGKILL the process if it hangs.
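A sketch of this suggestion (not the PR's code): letting spawn's own `timeout`/`killSignal` options kill a hung child instead of stacked setTimeout timers.

```typescript
import { spawn } from "node:child_process";

// Run a command, letting Node kill it with SIGKILL if it outlives the
// deadline; resolves with the exit code and (if killed) the signal.
function runWithTimeout(
  cmd: string,
  args: string[],
  timeoutMs: number,
): Promise<{ code: number | null; signal: NodeJS.Signals | null }> {
  return new Promise((resolve, reject) => {
    const proc = spawn(cmd, args, {
      stdio: "ignore",
      timeout: timeoutMs,
      killSignal: "SIGKILL",
    });
    proc.on("error", reject);
    proc.on("exit", (code, signal) => resolve({ code, signal }));
  });
}
```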
```typescript
// - Forge computes new nonces from on-chain state
// - New transactions replace any stuck ones with the same nonce
// - The race condition is intermittent (~0.04%), so retries almost always succeed
rmSync("broadcast", { recursive: true, force: true });
```
just preempting flakes here
Suggested change:

```diff
-rmSync("broadcast", { recursive: true, force: true });
+rmSync("broadcast", { recursive: true, force: true, maxRetries: 3, retryDelay: 100 });
```
BEGIN_COMMIT_OVERRIDE
chore: Should fix proving benchmarks by reducing committee lag (#20381)
chore: standalone forge broadcast wrapper with retry, timeout, and anvil detection (#19824)
chore: benchmark tx val (#20227)
fix: cloudflare terraform API (#20387)
fix: HA e2e test order & retries (#20383)
fix: incorporate forge broadcast review feedback (#20390)
refactor(p2p): rewrite FileStoreTxCollection with retry, backoff, and shared worker pool (#20317)
chore(ci): run ci job on draft PRs (#20395)
feat(bot): allow anchoring txs to proposed chain (#20392)
feat(p2p): slot-based soft deletion for TxPoolV2 (#20388)
chore: add setup-container script (#20309)
feat: build aztec-prover-agent with baked-in CRS (#20391)
fix!: change protocol contracts deployer to be the contract address (#20396)
feat(p2p): enforce minimum tx pool age for block building (#20384)
refactor(sentinel): update validator statuses to checkpoint-based naming (#20372)
chore: log tx hash (#20413)
chore: update l1 fee analysis to measure blob count in L1 blocks (#20414)
chore: set up tx file store in next-net (#20418)
chore(ci): gate draft PRs from CI, allow override with ci-draft label (#20426)
fix: stabilize writing_an_account_contract.test.ts (#20420)
END_COMMIT_OVERRIDE
(I confirm I have edited and reviewed this AI summary)
Summary
Replaces ad-hoc `--batch-size`/`--timeout` flags across shell scripts and TypeScript with a single standalone wrapper script (`l1-contracts/scripts/forge_broadcast.ts`) that handles all forge broadcast reliability concerns in one place:

- … `--resume`.
- Strips `--verify` from broadcast attempts so the timeout only covers transaction landing. After broadcast succeeds, runs `forge script --resume --verify --broadcast` with no timeout. Verification failure is logged but doesn't affect the exit code (transactions already landed). This prevents two problems: (1) the timeout killing forge mid-verification even though all transactions mined, and (2) Etherscan failures triggering broadcast retries.

Files changed

- `l1-contracts/scripts/forge_broadcast.ts`
- `l1-contracts/package.json` (`"type": "module"` for ESM)
- `l1-contracts/scripts/run_rollup_upgrade.sh`
- `l1-contracts/scripts/test_rollup_upgrade.sh`
- `yarn-project/ethereum/src/deploy_aztec_l1_contracts.ts`
- `yarn-project/l1-artifacts/scripts/copy-foundry-artifacts.sh`

Stress test results

50,000 deploy+upgrade cycles against local anvil instances (20 parallel workers, half instant-mining, half `--block-time 1`): all 13 retries occurred on instant-mining anvil workers (the automine race condition), and zero retries occurred on `--block-time` workers. All recovered successfully. Previous attempts could not recover from those rare hangs.

Test plan

`l1-contracts/scripts/test_rollup_upgrade.sh` exercises the full deploy+upgrade flow end-to-end.