… setups Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Wire `peerFailedBanTimeMs` as a new env var and lower the tx collector test ban time from 5 minutes to 5 seconds. The test would flake on timeout: aggregating peers took a full minute per subtest without ever obtaining all peers. Peer dialing is serialized and limited to 5 for this test, so peers may dial repeatedly without success, get banned for 5 minutes, and never manage to reconnect within the 1-minute wait. With the shorter ban, all peers can connect in time, letting us lower the 1-minute timeout and reducing flaky timeouts overall.
#21605)

## Motivation

When `VALIDATOR_MAX_TX_PER_BLOCK` is not set but `VALIDATOR_MAX_TX_PER_CHECKPOINT` is, the gossip-level proposal validator enforces no per-block transaction limit at all. A single block can't have more transactions than the entire checkpoint allows, so the checkpoint limit is a valid upper bound for per-block validation.

## Approach

Use `validateMaxTxsPerCheckpoint` as a fallback when `validateMaxTxsPerBlock` is not set in the proposal validator construction. This applies at both construction sites: the P2P libp2p service (gossip validation) and the validator-client factory (block proposal handler).

## Changes

- **p2p**: Added `validateMaxTxsPerCheckpoint` to the `P2PConfig` interface and config mappings (reads from the `VALIDATOR_MAX_TX_PER_CHECKPOINT` env var)
- **p2p (libp2p_service)**: Use `validateMaxTxsPerBlock ?? validateMaxTxsPerCheckpoint` when constructing proposal validators
- **validator-client (factory)**: Same fallback when constructing the `BlockProposalValidator`

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
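The fallback is a one-liner with nullish coalescing. A minimal sketch, assuming an illustrative config shape (the interface and helper names here are hypothetical, not the actual aztec-packages API):

```typescript
// Hypothetical config shape for illustration only.
interface ProposalValidatorConfig {
  validateMaxTxsPerBlock?: number;
  validateMaxTxsPerCheckpoint?: number;
}

// A block can never contain more txs than its whole checkpoint allows,
// so the checkpoint limit is a sound upper bound when the block limit is unset.
function effectiveMaxTxsPerBlock(cfg: ProposalValidatorConfig): number | undefined {
  return cfg.validateMaxTxsPerBlock ?? cfg.validateMaxTxsPerCheckpoint;
}
```

Note that `??` (rather than `||`) matters here: an explicit limit of `0` would be preserved instead of being treated as unset.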
# Fix: ARM64 Mac (M3) Devcontainer Build Failures

## Problem

Building inside a devcontainer on a Mac with an Apple M3 chip fails in multiple ways:

1. **SIGILL crashes** — The `bb-sol` build step crashes when running `honk_solidity_key_gen`, and E2E tests fail with `Illegal instruction` errors.
2. **Rust compilation failures** — The `noir` build fails with `can't find crate for serde` and similar errors when noir and avm-transpiler build in parallel, racing on the shared `CARGO_HOME`.

## Root Cause

### SVE instructions from zig `-target native`

1. CI runs on **AWS Graviton** (ARM64 with SVE vector extensions)
2. The zig compiler wrapper uses `-target native-linux-gnu.2.35`, which on Graviton enables **SVE instructions**
3. The Mac M3 devcontainer (ARM64 **without SVE**) downloads the same cached binaries
4. The binaries contain SVE opcodes (e.g. `0x04be4000`) that Apple Silicon can't execute → **SIGILL**

Cache keys already include architecture via `cache_content_hash` (which appends `$OSTYPE-$(uname -m)`), so amd64 vs arm64 caches never collide. The problem is specifically that two ARM64 machines (Graviton with SVE vs Apple Silicon without SVE) share the same architecture tag but have different CPU feature sets. The fix is to stop emitting CPU-specific instructions in the first place.

### Parallel Rust build race condition

The top-level bootstrap runs the `noir` and `avm-transpiler` builds in parallel. Both invoke `cargo build`, and both share the same `CARGO_HOME` (`~/.cargo`), which contains the crate registry and download cache. When both cargo processes run concurrently, they race on shared registry state, causing downstream crates (e.g. `serde-big-array`, `ecdsa`) to fail with `can't find crate` errors during compilation. This does not happen on CI where builds are cached, only on local fresh builds (e.g. `NO_CACHE=1`).

## Fixes

### 1. Zig compiler wrappers: explicit ARM64 target

**Files:** `barretenberg/cpp/scripts/zig-cc.sh`, `barretenberg/cpp/scripts/zig-c++.sh`

Changed `-target native-linux-gnu.2.35` to the explicit `aarch64-linux-gnu.2.35` on ARM64 Linux. This produces generic ARM64 code without CPU-specific extensions (SVE, etc.), ensuring binaries work on all ARM64 machines — Graviton, Apple Silicon, Ampere, etc. x86_64 behavior is unchanged (still uses `native`).

### 2. Extract native_cache_key variable in barretenberg bootstrap

**File:** `barretenberg/cpp/bootstrap.sh`

Extracted the repeated cache key pattern `barretenberg-$native_preset-$hash` into a single `native_cache_key` variable, used by `build_native_objects`, `build_native`, and related functions. Pure refactor, no change in cache key values.

### 3. Better error handling in init_honk.sh

**File:** `barretenberg/sol/scripts/init_honk.sh`

Added `set -eu` so the script fails immediately on error instead of silently continuing after SIGILL. Added an existence check for the `honk_solidity_key_gen` binary with a clear error message.

### 4. Serialize parallel cargo builds with flock

**Files:** `noir/bootstrap.sh`, `avm-transpiler/bootstrap.sh`

Both scripts wrap their `cargo build` invocations with `flock -x 200` on a shared lock file (`/tmp/rustup.lock`):

```bash
(
  flock -x 200
  cd noir-repo && cargo build --locked --release --target-dir target
) 200>/tmp/rustup.lock
```

This acquires an exclusive file lock before running cargo, so if both the `noir` and `avm-transpiler` builds run in parallel, one waits for the other to finish. The lock is automatically released when the subshell exits. This eliminates the `CARGO_HOME` race condition without requiring changes to the top-level parallelism.

## Notes

### E2E Tests

The E2E test failures (SIGKILL from invalid instructions) have the same root cause as the SIGILL crashes — the `bb` binary used by tests came from the SVE-contaminated cache. After rebuilding with these fixes, E2E tests work.
---------

Co-authored-by: Aztec Bot <49558828+AztecBot@users.noreply.github.com>
Co-authored-by: ludamad <adam.domurad@gmail.com>
🤖 Auto-merge enabled after 4 hours of inactivity. This PR will be merged automatically once all checks pass.
PR #21597 increased the finalized block lookback from epochDuration*2 to epochDuration*2*4. This caused the finalized block number to jump backwards past blocks that had already been pruned from world-state, causing advance_finalized_block to fail with 'Failed to read block data'. Two fixes: 1. TypeScript: clamp blockNumber to oldestHistoricalBlock before calling setFinalized, so we never request a pruned block. 2. C++: reorder checks in advance_finalized_block to check the no-op condition (already finalized past this block) before attempting to read block data. This makes the native layer resilient to receiving a stale finalized block number.
…uned blocks Tests that handleBlockStreamEvent with chain-finalized for a block older than the oldest available block does not throw, validating the clamping fix in handleChainFinalized.
Calling `Array.from({length})` eagerly allocates an array of that length.
We were calling this method in the context of deserialization with
untrusted input.

This PR changes it so we use `new Array(size)` for untrusted input. A
bit less efficient, but more secure.
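A minimal sketch of the safer pattern, under stated assumptions: `readItems` and `MAX_DESERIALIZED_LENGTH` are hypothetical names for illustration, not the real deserialization API or its actual bound.

```typescript
const MAX_DESERIALIZED_LENGTH = 1 << 20; // illustrative bound for untrusted sizes

// Reads `size` items from untrusted input, refusing absurd lengths up front.
function readItems<T>(size: number, readItem: (i: number) => T): T[] {
  if (!Number.isInteger(size) || size < 0 || size > MAX_DESERIALIZED_LENGTH) {
    throw new Error(`refusing to allocate array of length ${size}`);
  }
  // `new Array(size)` only sets the length; unlike `Array.from({ length: size })`,
  // it does not eagerly visit every index before a single element is read.
  const out = new Array<T>(size);
  for (let i = 0; i < size; i++) {
    out[i] = readItem(i);
  }
  return out;
}
```

The bounds check is the real defense: whichever allocation form is used, an attacker-controlled `size` should be validated before any allocation proportional to it.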
## Summary

PR #21597 increased the finalized block lookback from `epochDuration*2` to `epochDuration*2*4`, which caused the finalized block number to jump backwards past blocks already pruned from world-state. The native `advance_finalized_block` then failed trying to read pruned block data, crashing the block stream with:

```
Error: Unable to advance finalized block: 15370. Failed to read block data. Tree name: NullifierTree
```

Two fixes:

- **TypeScript** (`server_world_state_synchronizer.ts`): Clamp the finalized block number to `oldestHistoricalBlock` before calling `setFinalized`, so we never request a pruned block.
- **C++** (`cached_content_addressed_tree_store.hpp`): Reorder checks in `advance_finalized_block` to check the no-op condition (`finalizedBlockHeight >= blockNumber`) before attempting `read_block_data`. This makes the native layer resilient to stale finalized block numbers.

Full analysis: https://gist.github.com/AztecBot/6221fb074ed7bbd8a753ec3602133b42
ClaudeBox log: https://claudebox.work/s/8e97449f22ba9343?run=1
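The TypeScript side of the fix reduces to a clamp. A sketch under stated assumptions: the function name and standalone shape are hypothetical, not the real `server_world_state_synchronizer.ts` code.

```typescript
// Never ask world-state to finalize a block it has already pruned: if the
// lookback jumped behind the oldest available block, bump the request up.
function clampFinalizedBlock(requested: number, oldestHistoricalBlock: number): number {
  return Math.max(requested, oldestHistoricalBlock);
}
```

With the numbers from the error above, a request for block 15370 when the oldest historical block is 15812 would be clamped to 15812 rather than hitting the native `read_block_data` failure.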
Correlate script by trace ID.
When the finalized block jumps backwards past pruned state, return early instead of clamping and continuing into the pruning logic. The previous clamping fix avoided the setFinalized error but then removeHistoricalBlocks would fail trying to prune to a block that is already the oldest. Also guard removeHistoricalBlocks against being called with a block number that is not newer than the current oldest available block.
… setups (#21603)

## Motivation

In an HA setup, two nodes (A and B) share the same validator keys. When node A proposes a block, node B receives it via gossipsub but ignores it because `validateBlockProposal` detects the proposer address matches its own validator keys and returns early. This means node B never re-executes the block, never pushes it to its archiver, and falls behind the proposed chain.

Additionally, both HA peers independently try to build and propose blocks for the same slot. If the losing peer commits its block to the archiver before signing fails, it ends up with a stale block that prevents it from accepting the winning peer's proposal.

## Approach

Three changes work together to fix HA proposed chain sync:

1. **Remove self-filtering**: Remove the early return in `validateBlockProposal` for self-proposals, letting them flow through the normal re-execution path so the HA peer pushes the winning block to its archiver.
2. **Sign before syncing to archiver**: Reorder the checkpoint proposal job so that non-last blocks are signed via `createBlockProposal` *before* being synced to the archiver. If the shared slashing protection DB rejects signing (because the HA peer already signed), the block is never added to the archiver, keeping it clean to accept the winning peer's block via gossipsub.
3. **Shared slashing protection for testing**: Add `createSharedSlashingProtectionDb` (backed by a shared LMDB store) and `createSignerFromSharedDb` factories, and thread an optional `slashingProtectionDb` through the validator creation chain. This allows e2e tests to simulate HA signing coordination without PostgreSQL.

## Changes

- **validator-client**: Remove self-proposal filtering in `validateBlockProposal`. Add an optional `slashingProtectionDb` parameter to `ValidatorClient.new` and the `createValidatorClient` factory for injecting a shared signing protection DB.
- **validator-client (tests)**: Add a unit test verifying block proposals signed with the validator's own key are processed and forwarded to `handleBlockProposal`.
- **sequencer-client**: Reorder `checkpoint_proposal_job` so non-last blocks call `createBlockProposal` before `syncProposedBlockToArchiver`. If signing fails (HA signer rejects), the block is never added to the archiver.
- **validator-ha-signer**: Add `createSharedSlashingProtectionDb` and `createSignerFromSharedDb` factory functions for testing HA setups with a shared in-memory LMDB store.
- **aztec-node**: Thread `slashingProtectionDb` through `AztecNodeService.createAndSync` deps.
- **end-to-end**: Add an `epochs_ha_sync` e2e test with 4 nodes in 2 HA pairs (each pair sharing validator keys and a slashing protection DB), different coinbase addresses per node, MBPS enabled, checkpoint publishing disabled. Asserts all 4 nodes converge on the same proposed block hash before any checkpoint is published.

Fixes A-675
…21656)

## Summary

Follow-up to #21643. The clamping fix avoided the `setFinalized` error, but the method continued into the pruning logic, where `removeHistoricalBlocks` failed with:

```
Unable to remove historical blocks to block number 15812, blocks not found. Current oldest block: 15812
```

Two changes:

- When the finalized block is older than `oldestHistoricalBlock`, return early instead of clamping and continuing. There's nothing useful to do — world-state is already finalized past this point.
- Guard `removeHistoricalBlocks` against being called with a block `<= oldestHistoricalBlock`, which the C++ layer rejects.

The C++ reorder fix from #21643 is preserved.

ClaudeBox log: https://claudebox.work/s/8e97449f22ba9343?run=4
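The two guards can be sketched together. This is a minimal illustration under stated assumptions — the function shape, parameter names, and callbacks are hypothetical, not the real synchronizer code:

```typescript
// Returns true if any work was done, false if the tick was skipped.
function handleChainFinalized(
  finalized: number,
  oldestHistoricalBlock: number,
  setFinalized: (n: number) => void,
  removeHistoricalBlocks: (toBlock: number) => void,
): boolean {
  if (finalized < oldestHistoricalBlock) {
    // World-state is already finalized past this point; clamping and
    // continuing would make the pruning step fail, so skip entirely.
    return false;
  }
  setFinalized(finalized);
  const pruneTo = finalized; // illustrative pruning target
  if (pruneTo > oldestHistoricalBlock) {
    // Guard: the C++ layer rejects pruning to a block <= current oldest.
    removeHistoricalBlocks(pruneTo);
  }
  return true;
}
```

The design point is that the stale case is a no-op rather than something to "repair": once the finalized number is behind the oldest available block, any further work on that tick can only fail.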
## Summary

Demotes the "Finalized block X is older than oldest available block Y. Skipping." log from `warn` to `trace`. This message fires on every block stream tick while the finalized block is behind the oldest available, filling up operator logs on deployed networks.

ClaudeBox log: https://claudebox.work/s/8e97449f22ba9343?run=6
## Summary

Fixes the CI failure on merge-train/spartan caused by `-march=skylake` being injected into aarch64 cross-compilation builds (arm64-android, arm64-ios, arm64-macos).

**Root cause:** The `arch.cmake` auto-detection added in #21611 defaults `TARGET_ARCH` to `skylake` when `ARM` is not detected. Cross-compile presets (ios, android) don't set `CMAKE_SYSTEM_PROCESSOR`, so ARM detection fails and `-march=skylake` gets passed to aarch64 Zig builds — which errors with `unknown CPU: 'skylake'`. For arm64-macos, `-march=generic` overrides Zig's `-mcpu=apple_a14`, breaking libdeflate.

**Fix:** Gate auto-detection on `NOT CMAKE_CROSSCOMPILING`. Cross-compile toolchains handle architecture targeting via their own flags (e.g. Zig `-mcpu`). Presets that explicitly set `TARGET_ARCH` (amd64-linux, arm64-linux) are unaffected.

Also restores the `native_build_dir` variable dropped in the build infrastructure refactor.

## Test plan

- Verified all cross-compile presets (arm64-android, arm64-ios, arm64-ios-sim, arm64-macos, x86_64-android) configure with zero `-march` flags
- Verified native presets (default, amd64-linux, arm64-linux) still get correct `-march` values
We were reporting txs not available as an unknown error.
…ionResult (#21676)

## Summary

- Modifies the bot factory to estimate gas for all transactions during setup (deploy, mint, add liquidity, etc.) instead of using default gas settings.
- Makes `BatchCall.simulate()` always return `SimulationResult` (consistent with `ContractFunctionInteraction` and `DeployMethod`), instead of returning different shapes depending on whether gas estimation was requested.

## Test plan

- [x] `yarn build` passes (no new type errors)
- [x] `yarn workspace @aztec/aztec.js test src/contract/batch_call.test.ts` — all 7 tests pass
- [ ] Spartan network deployment with bot enabled

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…e proposal test (#21673)

## Summary

When PR #21603 changed the validator to process (not ignore) block proposals from HA peers (same validator key), the `duplicate_proposal_slash` test broke. The second malicious node now processes the first node's proposal, adds the block to its archiver via `blockSource.addBlock()`, and the sequencer sees "slot was taken" — preventing it from ever building its own conflicting proposal.

**Root cause**: `validateBlockProposal` no longer returns `false` for self-proposals (changed to process them for HA support). The block_proposal_handler re-executes the proposal and pushes it to the archiver. The sequencer then skips the slot.

**Fix**: Set `skipPushProposedBlocksToArchiver=true` on the malicious nodes. This allows:

1. Node 1 builds and broadcasts its proposal
2. Node 2 receives it and re-executes (as HA peer), but does NOT add it to the archiver
3. Node 2's sequencer doesn't see "slot taken" → builds its own block with a different coinbase
4. Node 2 broadcasts (allowed by `broadcastEquivocatedProposals=true`)
5. Honest nodes see both proposals → detect the duplicate → offense recorded

## Test plan

- The `duplicate_proposal_slash` e2e test should now pass consistently
- Other slashing tests should be unaffected (only the malicious nodes in this test are changed)

ClaudeBox log: https://claudebox.work/s/ced449aa0eabbcb4?run=1
Errors in readMessage (invalid status bytes, oversized snappy responses, corrupt data) were caught and silently converted to UNKNOWN status returns. Since sendRequestToPeer only calls handleResponseError in its own catch block, none of these errors resulted in peer penalties. The request was simply retried with another peer, allowing a malicious peer to waste bandwidth indefinitely. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
## Motivation
Errors during `readMessage` (oversized snappy responses, corrupt data,
etc.) were caught and silently converted to `{ status: UNKNOWN }` return
values instead of re-throwing. Since `sendRequestToPeer` only calls
`handleResponseError` in its own catch block, none of these errors
resulted in peer penalties. The request was simply retried with another
peer, allowing a malicious peer to waste bandwidth indefinitely.
## Approach
Re-throw non-protocol errors from `readMessage` so they propagate to
`sendRequestToPeer`'s catch block where `handleResponseError` applies
peer penalties. Additionally, introduce a dedicated
`OversizedSnappyResponseError` class so oversized responses get a
harsher `LowToleranceError` penalty (score -50, banned after 2 offenses)
instead of falling through to the generic `HighToleranceError`
catch-all.
## Changes
- **p2p (reqresp)**: Changed `readMessage` catch block to only return
status for `ReqRespStatusError` and re-throw all other errors, so they
reach `handleResponseError` for penalization
- **p2p (encoding)**: Added `OversizedSnappyResponseError` class for
explicit categorization
- **p2p (reqresp)**: Added `OversizedSnappyResponseError` handling in
`categorizeResponseError` with `LowToleranceError` severity
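The changes above can be sketched as follows. The class and penalty names mirror the PR description, but everything else (constructor shape, the representation of penalties, the surrounding API) is an assumption for illustration:

```typescript
// Dedicated error class so oversized responses are explicitly categorizable,
// instead of falling through to the generic catch-all.
class OversizedSnappyResponseError extends Error {
  constructor(actualBytes: number, maxBytes: number) {
    super(`snappy response of ${actualBytes} bytes exceeds limit of ${maxBytes}`);
    this.name = 'OversizedSnappyResponseError';
  }
}

type PeerPenalty = 'LowToleranceError' | 'HighToleranceError';

function categorizeResponseError(err: unknown): PeerPenalty {
  // Oversized responses get the harsher low-tolerance penalty
  // (per the PR: score -50, banned after 2 offenses); anything else
  // falls through to the generic high-tolerance catch-all.
  if (err instanceof OversizedSnappyResponseError) {
    return 'LowToleranceError';
  }
  return 'HighToleranceError';
}
```

The key behavioral change is that these errors now *reach* this categorizer at all: `readMessage` re-throws them instead of swallowing them into an `UNKNOWN` status, so `sendRequestToPeer`'s catch block can apply the penalty.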
Baseline work for build-ahead: adds an enable flag and some metrics that will be required.
# fix: ARM64 devcontainer builds — skip `-march` on ARM and use explicit zig aarch64 target

## Summary

Fixes SIGILL (Illegal Instruction) crashes and build failures on ARM64 Mac (M3/Apple Silicon) devcontainers caused by incorrect `-march` handling introduced in #21611.

## Problem

PR #21611 originally fixed ARM64 devcontainer builds by using explicit `aarch64-linux-gnu.2.35` zig targets. During the merge, that approach was replaced with cmake-based auto-detection that sets `TARGET_ARCH=generic` on ARM and passes `-march=generic` to the compiler. This caused two distinct failures:

### 1. SIGILL crashes (`Illegal instruction`)

The zig compiler wrappers still used `-target native-linux-gnu.2.35`, which auto-detects the host CPU. On CI (AWS Graviton with SVE extensions), this produces binaries containing SVE instructions. These cached binaries are then downloaded on Apple Silicon devcontainers (ARM64 without SVE), causing SIGILL when executed — e.g. `honk_solidity_key_gen` crashing during the `barretenberg/sol` bootstrap.

The `-march=generic` flag was supposed to override this, but `-march=generic` is **not a valid value on aarch64** — it's an x86 concept. LLVM/zig silently ignored it, so the native CPU detection still produced SVE instructions.

### 2. Build failures (`unknown CPU: 'armv8'`)

Even attempting `-march=armv8-a` (a valid GCC/Clang aarch64 value) fails, because zig uses its own CPU naming scheme (e.g. `generic`, `cortex_a72`, `apple_m3`), not GCC-style architecture strings. Zig interprets `-march=armv8-a` as CPU name `armv8`, which doesn't exist → `error: unknown CPU: 'armv8'`.

**Bottom line:** The `-march` cmake approach fundamentally doesn't work with zig on ARM. Zig has its own architecture targeting via `-target`, which is the correct mechanism.

## What this PR changes

### 1. `arch.cmake` — Skip `-march` auto-detection on ARM

Removed the ARM branch from the auto-detection. On x86_64, we still auto-detect `TARGET_ARCH=skylake`. On ARM, we don't set `TARGET_ARCH` at all, so no `-march` flag is passed — the zig wrappers handle architecture targeting instead.

### 2. `zig-cc.sh` / `zig-c++.sh` — Explicit aarch64 target on ARM Linux

Restored the original fix from #21611 that was dropped during merge. On ARM64 Linux, the wrappers now use `-target aarch64-linux-gnu.2.35` instead of `-target native-linux-gnu.2.35`. This produces generic ARM64 code without CPU-specific extensions (SVE, etc.), ensuring cached binaries work on all ARM64 machines — Graviton, Apple Silicon, Ampere, etc. x86_64 behavior is unchanged (still uses `-target native`).

## Context: what happened after #21611

After #21611 merged with the cmake auto-detection approach, it triggered a cascade of follow-up PRs trying to fix the fallout:

| PR | Status | Issue |
|----|--------|-------|
| #21621 | Merged | Introduced the auto-detect approach (replaced zig wrapper fix with cmake `-march`) |
| #21356 | Merged | Added `NOT CMAKE_CROSSCOMPILING` guard for cross-compile failures |
| #21637 | Open | Attempting to fix cross-compiles + restore `native_build_dir` |
| #21660 | Open | Attempting to fix cross-compile targets |
| #21632 | Open | Attempting to fix cross-compile targets |
| #21662 | Open | Adding `CMAKE_SYSTEM_PROCESSOR` to ARM64 cross-compile presets |
| #21653 | Open | Attempting to skip auto-detection when cross-compiling |
| #21655 | Open | Attempting to skip auto-detection for cross-compilation targets |

This PR supersedes the still-open PRs above by addressing the root cause: `-march` via cmake doesn't work with zig on ARM. The zig `-target` mechanism is the correct approach.
Change the ordering for the `lastBlock` case when creating a checkpoint proposal, so that we first sign the last block and then the checkpoint.
…ons (A-683) (#21686) Remove early return in for...of loop that caused only the first contract class's functions to be stored when multiple classes had broadcasts in the same block. Fixes https://linear.app/aztec-labs/issue/A-683
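The bug class is worth a minimal illustration: a `return` inside a `for...of` exits after the first item, where `continue` lets the remaining items be processed. The function and field names below are hypothetical, not the real `storeBroadcastedIndividualFunctions` code:

```typescript
interface BroadcastedClass {
  id: string;
  fns: string[]; // functions broadcast for this contract class
}

// Collects the broadcast functions of every class in the block.
function collectBroadcastFunctions(classes: BroadcastedClass[]): string[] {
  const stored: string[] = [];
  for (const cls of classes) {
    if (cls.fns.length === 0) {
      continue; // skip this class, but keep processing the rest (was: return)
    }
    stored.push(...cls.fns);
  }
  return stored;
}
```

With the early `return`, any class after the first skip-worthy one would be silently dropped — exactly the symptom described in A-683 when multiple classes had broadcasts in the same block.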
Flakey Tests🤖 says: This CI run detected 1 test that failed, but was tolerated due to a .test_patterns.yml entry.
BEGIN_COMMIT_OVERRIDE
fix(p2p): fall back to maxTxsPerCheckpoint for per-block tx validation (#21605)
chore: fixing M3 devcontainer builds (#21611)
fix: clamp finalized block to oldest available in world-state (#21643)
chore: fix proving logs script (#21335)
fix: (A-649) tx collector bench test (#21619)
fix(validator): process block proposals from own validator keys in HA setups (#21603)
fix: add bounds when allocating arrays in deserialization (#21622)
fix: skip handleChainFinalized when block is behind oldest available (#21656)
chore: demote finalized block skip log to trace (#21661)
fix: skip -march auto-detection for cross-compilation presets (#21356)
chore: revert "add bounds when allocating arrays in deserialization" (#21622) (#21666)
fix: capture txs not available error reason in proposal handler (#21670)
fix: estimate gas in bot and make BatchCall.simulate() return SimulationResult (#21676)
fix: prevent HA peer proposals from blocking equivocation in duplicate proposal test (#21673)
fix(p2p): penalize peers for errors during response reading (#21680)
feat(sequencer): add build-ahead config and metrics (#20779)
chore: fixing build on mac (#21685)
fix: HA deadlock for last block edge case (#21690)
fix: process all contract classes in storeBroadcastedIndividualFunctions (A-683) (#21686)
END_COMMIT_OVERRIDE