
chore: Backport of fixes into v2 #17206

Merged
PhilWindle merged 19 commits into v2 from pw/bp-friday-fixes
Sep 29, 2025

alexghr and others added 4 commits September 22, 2025 17:56
When deciding whether to slash committee members for an epoch, the epoch
prune watcher tries reexecuting all blocks in the pruned epoch. It also
uses that to decide whether to slash for data withholding, if not all
data is available.

However, the blocks being reexecuted (and the txs being gathered) were
from ALL pruned epochs, which could be more than one if the proof
submission window was long enough. So we ended up slashing committee
members from the first epoch for data withholding offenses perpetrated
by members in future epochs.
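
A minimal sketch of the filtering idea described above, assuming a fixed epoch length in slots. `PrunedBlock`, `epochOfSlot` and `blocksForEpoch` are hypothetical names for illustration, not the watcher's actual code.

```typescript
// Hypothetical shape: only the fields needed for this illustration.
interface PrunedBlock {
  number: number;
  slot: number;
}

// Assumed helper: maps a slot to its epoch given a fixed epoch length in slots.
function epochOfSlot(slot: number, epochDuration: number): number {
  return Math.floor(slot / epochDuration);
}

// Keep only the blocks that belong to the epoch being evaluated, so that
// blocks from later pruned epochs do not influence the slashing decision.
function blocksForEpoch(pruned: PrunedBlock[], epoch: number, epochDuration: number): PrunedBlock[] {
  return pruned.filter(block => epochOfSlot(block.slot, epochDuration) === epoch);
}

// Three pruned blocks spanning two epochs of 16 slots each.
const prunedBlocks: PrunedBlock[] = [
  { number: 10, slot: 32 },
  { number: 11, slot: 47 },
  { number: 12, slot: 48 },
];
const firstEpochBlocks = blocksForEpoch(prunedBlocks, 2, 16);
```

Only the blocks in `firstEpochBlocks` would then be reexecuted when judging the committee of epoch 2.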
This PR rolls back uncommitted state upon a failed world state block sync.
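
The rollback-on-failure pattern can be sketched as follows. The `WorldStateFork` interface and the in-memory implementation are assumptions made for the example and do not mirror the real world state API.

```typescript
// Hypothetical minimal interface; not the actual world state API.
interface WorldStateFork {
  applyBlock(blockNumber: number): void; // may throw after partial writes
  commit(): void;
  rollback(): void;
}

// Apply a block and commit it; on any failure, discard uncommitted writes.
function syncBlock(ws: WorldStateFork, blockNumber: number): boolean {
  try {
    ws.applyBlock(blockNumber);
    ws.commit();
    return true;
  } catch (_err) {
    ws.rollback();
    return false;
  }
}

// Toy implementation: negative block numbers fail after writing pending data.
class InMemoryState implements WorldStateFork {
  committed: number[] = [];
  private pending: number[] = [];

  applyBlock(blockNumber: number): void {
    this.pending.push(blockNumber);
    if (blockNumber < 0) throw new Error(`invalid block ${blockNumber}`);
  }
  commit(): void {
    this.committed.push(...this.pending);
    this.pending = [];
  }
  rollback(): void {
    this.pending = [];
  }
}

const toyState = new InMemoryState();
const firstSync = syncBlock(toyState, 1);
const failedSync = syncBlock(toyState, -1);
const thirdSync = syncBlock(toyState, 2);
```

Without the `rollback()` call, the pending write from the failed block would be committed together with block 2.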
Handles malformed proposals where the archive root is not properly
derived from the other tree roots, aka _the @alexghr Friday attack_.
Also checks that the block proposal is not for a block that already
exists, since the proposer had complete freedom to set any past block
number for the reexecution (though this would later fail in L1).

Fixes
https://linear.app/aztec-labs/issue/A-84/validate-full-block-proposal-before-attesting
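
A rough sketch of the two checks described above. All names are hypothetical, and the derivation function is injected because the real archive root computation is not shown in this PR description.

```typescript
// Hypothetical proposal shape for illustration only.
interface BlockProposalLike {
  blockNumber: number;
  archiveRoot: string;
  treeRoots: string[];
}

type ProposalCheck = 'ok' | 'block_already_exists' | 'archive_root_mismatch';

function checkProposal(
  proposal: BlockProposalLike,
  latestBlockNumber: number,
  deriveArchiveRoot: (treeRoots: string[]) => string,
): ProposalCheck {
  // Reject proposals that target a block number that already exists.
  if (proposal.blockNumber <= latestBlockNumber) return 'block_already_exists';
  // Reject proposals whose archive root does not follow from the other roots.
  if (deriveArchiveRoot(proposal.treeRoots) !== proposal.archiveRoot) return 'archive_root_mismatch';
  return 'ok';
}

// Stand-in derivation for the demo only; it is not how the archive root is actually computed.
const toyDerive = (roots: string[]): string => roots.join('|');

const validProposal: BlockProposalLike = { blockNumber: 5, archiveRoot: 'a|b', treeRoots: ['a', 'b'] };
const validResult = checkProposal(validProposal, 4, toyDerive);
```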
@fcarreiro fcarreiro removed their request for review September 23, 2025 08:34
spalladino and others added 5 commits September 25, 2025 11:55
The sequencer publisher enforces a max size for a block, depending on
the size it takes up in blobs. If the block exceeds that size, it's
rejected by the publisher.

https://github.com/AztecProtocol/aztec-packages/blob/bb87ea4a58a63771e61d551d105d8b52ba2014e6/yarn-project/stdlib/src/block/body.ts#L56-L70

This PR adds a check during block building to ensure that we don't go
past that limit.
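
As an illustration of enforcing the limit while building, here is a small sketch. `SizedTx`, `sizeInBlobFields` and the stop-at-first-overflow policy are assumptions for the example, not the builder's actual behaviour.

```typescript
// Hypothetical tx shape: only the size contribution matters here.
interface SizedTx {
  hash: string;
  sizeInBlobFields: number;
}

// Add txs in order until the next one would push the block past the limit.
function selectTxsWithinLimit(txs: SizedTx[], maxBlobFields: number): SizedTx[] {
  const selected: SizedTx[] = [];
  let used = 0;
  for (const tx of txs) {
    if (used + tx.sizeInBlobFields > maxBlobFields) break; // assumed policy: stop building here
    selected.push(tx);
    used += tx.sizeInBlobFields;
  }
  return selected;
}

const candidateTxs: SizedTx[] = [
  { hash: '0x01', sizeInBlobFields: 40 },
  { hash: '0x02', sizeInBlobFields: 50 },
  { hash: '0x03', sizeInBlobFields: 30 },
];
const includedTxs = selectTxsWithinLimit(candidateTxs, 100);
```

With a limit of 100, the first two txs fit (90 in total) and the third is left for a later block, so the publisher never sees an oversized block.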
Do not acknowledge an L1 to L2 message as synced until the rollup
pending block number has caught up with the message block. The inbox
block number may drift way ahead of the rollup block number in the event
of a reorg or if there are too many l1 to l2 messages being inserted.

Note that the existing approach used throughout the codebase of waiting
for two blocks is flawed, since if there was an earlier reorg on the
chain, then the inbox will have drifted and the message will require
more blocks to become available.

This PR does NOT remove the existing isL1ToL2MessageSynced call, since
it's used all over the place, but rather flags it as deprecated.

Instead, the node and pxe now expose a function that returns the block
in which the message is to be available, and aztecjs provides a helper
to wait until the block is reached.

The bot factory is updated to use this new approach.
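
The wait-until-block approach could look roughly like this. `isMessageReady` and `waitForBlock` are hypothetical helpers written for this example; they are not the actual node, pxe or aztecjs functions.

```typescript
// A message is usable once the rollup's pending block has caught up with
// the block in which the message is to become available.
function isMessageReady(messageBlock: number, rollupPendingBlock: number): boolean {
  return rollupPendingBlock >= messageBlock;
}

// Poll a block-number source until the target block is reached or we time out.
async function waitForBlock(
  getBlockNumber: () => Promise<number>,
  targetBlock: number,
  opts: { intervalMs: number; timeoutMs: number },
): Promise<number> {
  const deadline = Date.now() + opts.timeoutMs;
  for (;;) {
    const current = await getBlockNumber();
    if (isMessageReady(targetBlock, current)) return current;
    if (Date.now() >= deadline) throw new Error(`Timed out waiting for block ${targetBlock}`);
    await new Promise<void>(resolve => setTimeout(resolve, opts.intervalMs));
  }
}
```

Unlike a fixed "wait two blocks" rule, the target here comes from the message itself, so an inbox that has drifted ahead simply results in a longer wait.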
Second attempt at
#17207

The issue was that the `start` method on p2p client would overwrite all
previously registered subprotocol handlers, hence the errors we were
seeing on CI:
```
22:00:04 [22:00:04.697] WARN: p2p:4506:libp2p_service:4506:libp2p_service:4506:reqresp:4506 Unknown stream error while handling the stream, aborting {"protocol":"/aztec/req/auth/1.0.0"}
22:00:04     err: {
22:00:04       "type": "TypeError",
22:00:04       "message": "handler is not a function",
22:00:04       "stack":
22:00:04           TypeError: handler is not a function
22:00:04               at /home/aztec-dev/aztec-packages/yarn-project/p2p/dest/services/reqresp/reqresp.js:452:40
22:00:04               at processTicksAndRejections (node:internal/process/task_queues:105:5)
22:00:04               at duplex.sink (/home/aztec-dev/aztec-packages/yarn-project/node_modules/it-byte-stream/src/index.ts:86:22)
22:00:04     }
```

The test was also failing locally (it had passed before I submitted due
to a build issue I missed), and since it was tagged as flake it was
greenlighted by CI on the PR.
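
To illustrate the failure mode, here is one possible way to keep `start` from clobbering earlier registrations. `SubProtocolRegistry` is a hypothetical class for this sketch; the actual fix in the p2p client may differ.

```typescript
type Handler = (msg: string) => string;

class SubProtocolRegistry {
  private handlers: Record<string, Handler> = {};

  register(protocol: string, handler: Handler): void {
    this.handlers[protocol] = handler;
  }

  // Merge defaults in, keeping any handler registered before start().
  // Assigning `this.handlers = defaults` here would reproduce the bug.
  start(defaults: Record<string, Handler>): void {
    Object.keys(defaults).forEach(protocol => {
      if (!(protocol in this.handlers)) this.handlers[protocol] = defaults[protocol];
    });
  }

  handle(protocol: string, msg: string): string {
    const handler = this.handlers[protocol];
    if (typeof handler !== 'function') throw new Error(`No handler for ${protocol}`);
    return handler(msg);
  }
}

const registry = new SubProtocolRegistry();
registry.register('/aztec/req/auth/1.0.0', msg => `auth:${msg}`);
registry.start({
  '/aztec/req/auth/1.0.0': () => 'default',
  '/ping': () => 'pong',
});
const authReply = registry.handle('/aztec/req/auth/1.0.0', 'hello');
```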
While we keep slashing rounds for a long time (lifetime in rounds is
currently defined as 100), we only attempt to execute them during the
first round they become executable.

If for whatever reason they don't get executed, they are just forgotten
and never actually triggered.

This PR adds a new config `slashExecuteRoundsLookBack` (defaults to 4)
that sets how many execution rounds to look back from the latest executable
round to find any round still pending execution. Each round is
checked in sequence, so setting this value too high can introduce
performance issues. Setting this value to zero keeps the same behaviour
as we have today.
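
A small sketch of how the look-back window could be enumerated. The function name and the oldest-first ordering are assumptions; only the config semantics (look back N rounds, zero means latest only) come from the description above.

```typescript
// Returns the rounds to check, from the oldest in the window up to the latest
// executable round. A lookBack of 0 checks only the latest round.
function roundsToCheck(latestExecutableRound: number, lookBack: number): number[] {
  const rounds: number[] = [];
  const oldest = Math.max(0, latestExecutableRound - lookBack);
  for (let round = oldest; round <= latestExecutableRound; round++) {
    rounds.push(round);
  }
  return rounds;
}

const defaultWindow = roundsToCheck(10, 4); // rounds 6 through 10
const legacyWindow = roundsToCheck(10, 0); // round 10 only
```

Since every round in the window costs a check, the window grows linearly with the configured value.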

This PR also fixes another issue: we were re-checking if a round was
executable based on the isReadyToExecute flag returned from the
contract. However, that flag was computed as of the time of the call,
and not as of the time at which the tx would land.
This meant that we always failed to execute the slash payload on the
first slot of a round.
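
The timing issue can be shown with a toy model. The executability rule below (a round becomes executable a fixed number of rounds later) and all names are assumptions for illustration, not the contract's actual logic.

```typescript
// Toy rule: `round` is executable once the round containing `slot`
// is at least `executionDelayRounds` rounds past it.
function isRoundExecutableAt(
  slot: number,
  round: number,
  roundSize: number,
  executionDelayRounds: number,
): boolean {
  const roundAtSlot = Math.floor(slot / roundSize);
  return roundAtSlot >= round + executionDelayRounds;
}

const roundSize = 4;
const currentSlot = 7; // last slot of round 1, when the check is made
const landingSlot = 8; // first slot of round 2, when the tx would land

// Evaluating at the time of the call says "not ready"...
const readyAtCallTime = isRoundExecutableAt(currentSlot, 1, roundSize, 1);
// ...while evaluating at the landing slot says "ready".
const readyAtLandingTime = isRoundExecutableAt(landingSlot, 1, roundSize, 1);
```

Under this model, a proposer that trusts the call-time answer skips execution in exactly the first slot of the round, which matches the behaviour described above.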

These two issues combined were triggering the flakes in the
`inactivity-slash` tests. Since we had small rounds (each round was 4
slots), and the first slot was consistently missed, it was a matter of
being unlucky enough such that the inactive validator was picked 3 slots
in a row as a sequencer. Given we have 6 validators, this happened
roughly once every 216 runs. See [this
run](http://ci.aztec-labs.com/3bff0b862dd4156f) for an example.
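
For reference, the 1-in-216 figure follows if the sequencer for each slot is drawn uniformly and independently from the 6 validators (an assumption; the description only states the result):

```latex
P(\text{same validator picked in 3 given consecutive slots})
  = \left(\frac{1}{6}\right)^{3}
  = \frac{1}{216}
  \approx 0.46\%
```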
Adds a flag to always reexecute block proposals. If set, a validator
node will always reexecute, even if not part of the committee, though
they will not attest. If the node is not a validator, they will just log
the result of the execution.

Note that this does NOT affect p2p propagation, since the reexecution is
done after the attestation is propagated, as it happens on a separate
handler and not in a p2p-registered validator.

To handle reexecutions in a non-validator node, reexecution was moved to
a block-proposal-handler class, which is instantiated instead of a
validator client in non-validators.

This PR also causes validators to reexecute a proposal even when they are
not in the committee, provided there is a slash penalty defined for broadcasting
invalid block proposals. Since this feature is not yet properly tested,
I've disabled the default slash for these offenses for the time being
(they were not working at the moment). See A-57 for more info.

Fixes A-54
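
The resulting decision table can be sketched as below. Field names such as `alwaysReexecute` and `slashInvalidProposals` are hypothetical, and the sketch assumes committee members always reexecute before attesting.

```typescript
interface ReexecutionContext {
  isValidator: boolean;
  inCommittee: boolean;
  alwaysReexecute: boolean; // hypothetical name for the new flag
  slashInvalidProposals: boolean; // a penalty is defined for broadcasting invalid proposals
}

function decideProposalHandling(ctx: ReexecutionContext): { reexecute: boolean; attest: boolean } {
  // Only validators that are part of the committee attest.
  const attest = ctx.isValidator && ctx.inCommittee;
  // Reexecute when attesting, when the flag forces it, or when a validator
  // may need evidence to slash an invalid proposal.
  const reexecute = attest || ctx.alwaysReexecute || (ctx.isValidator && ctx.slashInvalidProposals);
  return { reexecute, attest };
}

const observerNode = decideProposalHandling({
  isValidator: false,
  inCommittee: false,
  alwaysReexecute: true,
  slashInvalidProposals: false,
});
```

Here `observerNode` reexecutes and logs the result but never attests.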
@spalladino spalladino added ci-no-squash ci-no-fail-fast labels Sep 25, 2025
As part of #17273 I had added a cleanup to the gossip network test to
delete data dirs for the prover. However, the `stop` method on the
prover failed to await all operations, so when the test finished
successfully, it would still try to use the db (in particular, it seems
to be for the proving broker database `getEpochDatabase`) and abort with
a core dump.

This reverts the folder cleanup.
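
For context, a `stop` that does wait for in-flight work might look like the sketch below. `TrackedService` is a hypothetical class, not the prover's implementation, and this PR does not make such a change (it only reverts the cleanup).

```typescript
class TrackedService {
  private pending: Promise<void>[] = [];
  public completed = 0;

  // Start a background operation and remember its promise.
  run(delayMs: number): void {
    const op = new Promise<void>(resolve => setTimeout(resolve, delayMs)).then(() => {
      this.completed++;
    });
    this.pending.push(op);
  }

  // Resolve only after every tracked operation has finished, so callers
  // can safely delete data directories afterwards.
  async stop(): Promise<void> {
    await Promise.all(this.pending);
    this.pending = [];
  }
}

const service = new TrackedService();
service.run(5);
service.run(10);
const stopped = service.stop();
```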
(cherry picked from commit 2edc51a)
Fix A-43

(cherry picked from commit 61ad020)
@alexghr alexghr requested a review from charlielye as a code owner September 26, 2025 13:16
@alexghr alexghr enabled auto-merge (squash) September 29, 2025 15:58
@alexghr alexghr disabled auto-merge September 29, 2025 15:59
@PhilWindle PhilWindle enabled auto-merge (squash) September 29, 2025 15:59
@PhilWindle PhilWindle disabled auto-merge September 29, 2025 16:00
@PhilWindle PhilWindle merged commit 5f821fd into v2 Sep 29, 2025
6 checks passed
@PhilWindle PhilWindle deleted the pw/bp-friday-fixes branch September 29, 2025 16:01
alexghr pushed a commit that referenced this pull request Nov 5, 2025
# v2.0.3..v2.1.0-rc.1 Notes

## Significant L1 Changes


### 1.  **Rollup Contract Interface Changes**

- **`propose()`  function signature changed**: Now requires an
additional  `_attestationsAndSignersSignature`  parameter
- **`validateHeaderWithAttestations()`  function signature changed**:
Also requires the new signature parameter
- This affects any code that directly calls these functions on the
rollup contract

### 2.  **New Required Configuration Parameters**

Several new configuration parameters are now required for deployment:

- `localEjectionThreshold`: Stricter ejection threshold local to
specific rollup (default: 196,000 tokens)
- `slashingDisableDuration`: How long slashing can be disabled in
seconds (default: 5 days)

### 3.  **GSE Contract Changes**

- **New function**: `setProofOfPossessionGasLimit()` allows
governance to adjust gas limits for BLS proof validation
- **Gas-limited proof validation**: Proof of possession validation now
has configurable gas limits (default: 200,000 gas)

### 4.  **Validator Queue Management Changes**

- **`flushEntryQueue()`  behavior changed**: Now has an overload
accepting a  `_toAdd`  parameter to limit validator additions
- **New validator flush accounting**: System now tracks available
validator flushes per epoch

## Significant Non-Breaking Changes

### 1.  **Enhanced Slashing Controls**

- **Temporary slashing disable**: Vetoers can now temporarily disable
slashing for the configured duration
- **New function**:  `setSlashingEnabled(bool)`  for controlling
slashing state

### 2.  **Improved Validator Selection**

- **Configurable lag period**: Validator sampling now uses configurable
epoch lag instead of fixed 2-epoch delay
- **Better bootstrapping**: Enhanced validator set bootstrapping with
improved flush size calculations

### 3.  **Updated Default Values**

- **Coin issuer rate**: Updated to  `25,000,000,000 tokens / year` 
(approximately 793 tokens per second)
- **Local ejection threshold**: Set to 196,000 tokens (stricter than
global 50,000 threshold)
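
As a sanity check on the per-second figure, assuming a 365-day year (the notes do not state which year length is used):

```latex
\frac{25{,}000{,}000{,}000\ \text{tokens}}{365 \times 86{,}400\ \text{s}}
  = \frac{25{,}000{,}000{,}000}{31{,}536{,}000}\ \text{tokens/s}
  \approx 792.7\ \text{tokens/s}
```

This rounds to the 793 tokens per second quoted above.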

## Significant Node Changes

### Fixes

- Rollback world state on failed block sync – Prevents bad state
persistence by rolling back uncommitted data if block sync fails.
[(#17158)](github.com//pull/17158)
- Early rejection of duplicate nullifiers – Detects and rejects
transactions with duplicate nullifiers before inclusion.
[(#17157)](github.com//pull/17157)
- Watcher pruning fix – Watcher now re-executes only blocks from the
relevant pruned epoch, avoiding cross-epoch slashing issues.
[(#17145)](github.com//pull/17145)
- Improved proposal validation – Fully validates proposal headers
(including archive root derivation) and blocks attempts to reuse
existing block numbers.
[(#17144)](github.com//pull/17144)
- L1 to L2 message sync reliability – Waits for rollup to reach the
inbox block before marking L1→L2 messages as synced; adds helpers to
track message readiness.
[(#17132)](github.com//pull/17132)
- Slashing round recovery – Executes pending slashing rounds skipped
during the first executable round; adds slashExecuteRoundsLookBack to
control re-check depth.
[(#17125)](github.com//pull/17125)
- Broker restart on rollup change – Ensures broker restarts when rollup
chain changes to stay synchronized.
[(#17120)](github.com//pull/17120)
- Remote signer readiness check – Verifies that a remote signer is
available before use.
[(#17119)](github.com//pull/17119)
- Orchestrator and agent retry improvements – Makes connections to the
broker more robust under transient failures.
[(#17117)](github.com//pull/17117)
- Telemetry cleanup – Fixes incorrect or spammy telemetry warnings.
[(#17155)](github.com//pull/17155)

### Features

- Network configuration support – Introduces centralized configuration
for network parameters.
[(#17113)](github.com//pull/17113)


## Full Changelog

You can generate this yourself with `./scripts/commits
v2.0.3..v2.1.0-rc.1 1000 -m -g`.

#### Fixes

- fix: use archiveAt(0) instead of getBlock to get genesis archive tree
- backport v2
([#17447](#17447)) —
spypsy, 5 days ago
- fix: add keystoreDirectory option to sequencer
([#17265](#17265)) —
spypsy, 13 days ago
- fix: testnet archival node - v2
([#17142](#17142)) —
Aztec Bot, 3 weeks ago

#### Chores

- chore: bump minor version — Mitch, 4 days ago —
[dbc243f](dbc243f)
- chore: backport dependabot deps
([#17463](#17463)) —
Aztec Bot, 5 days ago
- chore: Backport slack alerts
([#17460](#17460)) —
PhilWindle, 5 days ago
- chore(backport-to-v2): chore: New salt for staging-ignition (#17453)
([#17453](#17453)) —
Aztec Bot, 5 days ago
- chore(backport-to-v2): fix: improve libp2p connection limits for
network discovery (#17425)
([#17425](#17425)) —
Aztec Bot, 5 days ago
- chore(backport-to-v2): feat: add flushing rewarder (#17335)
([#17335](#17335)) —
Aztec Bot, 6 days ago
- chore(backport-to-v2): feat: add date gated relayer (#17323)
([#17323](#17323)) —
Aztec Bot, 6 days ago
- chore(backport-to-v2): feat: support using existing ERC20 token for
fee and staking (#17413)
([#17413](#17413)) —
Aztec Bot, 6 days ago
- chore: Delete contract addresses from chain l2 config
([#17430](#17430)) —
PhilWindle, 6 days ago
- chore: More updated staging public config
([#17364](#17364)) —
PhilWindle, 7 days ago
- chore(backport-to-V2): L1 backports
([#17365](#17365)) —
Lasse Herskind, 7 days ago
- chore: Ensure DB map sizes are configured for networks
([#17383](#17383)) —
PhilWindle, 7 days ago
- chore: Backport of fixes into v2
([#17206](#17206)) —
PhilWindle, 8 days ago
- chore: update zkpassport version
([#17339](#17339)) —
saleel, 8 days ago
- chore: Backport of workflow fix
([#17333](#17333)) —
PhilWindle, 11 days ago
- chore: Streamline staging deployments
([#17328](#17328)) —
PhilWindle, 11 days ago
- chore(backport-to-v2): fix: avm gracefully handles shifts (shl) with
huge bit sizes (#17171)
([#17171](#17171)) —
Aztec Bot, 12 days ago
- chore(backport-to-v2): chore: remove unconstrained generics from trait
impls (#17075)
([#17075](#17075)) —
Aztec Bot, 12 days ago
- chore: Backport deployment refactor
([#17280](#17280)) —
PhilWindle, 12 days ago
- chore(backport-to-v2): fix(docs): Update Counter contract tutorial
imports and remove unnecessary sections (#17241)
([#17241](#17241)) —
Aztec Bot, 13 days ago
- chore: remove ACCEPT_DISABLED_AVM_VK_TREE_ROOT
([#17238](#17238)) —
Alex Gherghisan, 13 days ago
- chore: remove bad rollup-version default
([#17223](#17223)) —
Alex Gherghisan, 2 weeks ago
- chore(docs): node docs to v2
([#17205](#17205)) —
esau, 2 weeks ago
- chore(backport-to-v2): chore(avm)!: Fix a misleading log in recursive
verifier related to public input (#17184)
([#17184](#17184)) —
Aztec Bot, 2 weeks ago
- chore: Backport of ignition fix attempt 2
([#17201](#17201)) —
PhilWindle, 2 weeks ago
- chore: turn on testnet compat test
([#17195](#17195)) —
Alex Gherghisan, 2 weeks ago
- chore: Backport fix to staging-ignition to v2
([#17159](#17159)) —
PhilWindle, 3 weeks ago
- chore: kubectl
([#17140](#17140)) —
Alex Gherghisan, 3 weeks ago

#### Other

- backport dependabots p2
([#17488](#17488)) —
mralj, 4 days ago

---------

Co-authored-by: AztecBot <tech@aztecprotocol.com>

Labels

ci-no-fail-fast (sets NO_FAIL_FAST in the CI so the run is not aborted on the first failure), ci-no-squash

3 participants