docs: rollup-boost HA design by 0xKitsune · Pull Request #185 · flashbots/rollup-boost

0xKitsune · 2025-04-21T17:53:18Z

Opening this PR to migrate the design doc discussion from #181 so reviewers can leave comments directly on specific sections and suggest edits inline.

0x00101010 · 2025-04-21T22:49:46Z

docs/rollup-boost-ha.md

+- Maintain compatibility with `op-conductor` and its sequencing assumptions.
+
+## Non Goals
+- Define how Flashblocks are handled, consumed or streamed to the network.


This is actually a very important part that should be included in this design.

Consider one situation: in the #2 load balancing solution below, what if the get_payload call returns a payload from a different builder than the one streaming to the public? This would cause flashblock <-> actual block inconsistencies, which we aim to avoid as much as possible.

Should we update this now? I think we need to take this into consideration when designing HA solutions for Flashblocks.

For clarity, on the design review call last week we aligned on focusing this doc on an HA design that is forward compatible with Flashblocks without defining the full Flashblocks HA specifics here.

The plan is to fast follow with a separate document that extends the HA design detailing Flashblocks HA behavior including op-conductor integration and consistency guarantees. Happy to update and clarify to make this more explicit. Let me know if you have any additional thoughts on this.

I do understand that this spec does not cover FBS, but for the future:
There is a small contradiction in the failover section because of Flashblocks.
In https://flashbots.notion.site/Draft_Flashblocks_Eth_OPstack-md-1d66b4a0d876808493c9dcc4428ccad7#1d66b4a0d87680c8b550d356ad113318 it says that we prioritize preconfirmations, so if the leader becomes partially healthy (builder is down) we will still use the block built by the builder. In https://github.com/flashbots/rollup-boost/pull/185/files#diff-df3270fd8794e55e7c8f29d84e2605dc61fecaf32696ac4788661c0038e19a6fR242 it says that we will choose the block from the EL.

Thanks for flagging this. Agreed, we should update this. Ideally we can specify this behavior in the follow-on update that specifies Flashblocks HA behavior.

Given that Flashblocks is a "feature" that may or may not be enabled in a Rollup-Boost-based setup, it makes sense imho to have the HA architecture be orthogonal to Flashblocks and then perhaps have an extension for how to handle Flashblocks specifically.

0x00101010 · 2025-04-21T23:18:18Z

docs/rollup-boost-ha.md

+
+## 1:1 Rollup Boost to Builder Deployments
+
+In this design, each `rollup-boost` instance is configured with a single external builder and default execution client. When `op-node` sends an FCU containing payload attributes, `rollup-boost` forwards the request to both the default execution client and its paired builder. Upon receiving a `get_payload` request from `op-node`, `rollup-boost` queries both the execution client and the builder. If the builder returns a payload, it is validated via a `new_payload` request sent to the default execution client. If the builder payload is invalid or unavailable, `rollup-boost` falls back to the execution client's payload.
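
The selection logic described in the paragraph above can be sketched roughly as follows. This is a minimal Python illustration, not rollup-boost's actual implementation: `Payload`, `select_payload`, and the `validate` callback (standing in for the `new_payload` request sent to the default execution client) are hypothetical names introduced only for this example.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class Payload:
    """Hypothetical stand-in for an execution payload."""
    block_hash: str
    source: str  # "builder" or "execution_client"


def select_payload(
    builder_payload: Optional[Payload],
    el_payload: Optional[Payload],
    validate: Callable[[Payload], bool],
) -> Optional[Payload]:
    """Prefer the builder payload if it validates, else fall back to the EL.

    `validate` models the `new_payload` request sent to the default
    execution client and returns True when the payload is valid.
    """
    if builder_payload is not None and validate(builder_payload):
        return builder_payload
    # Builder payload is unavailable or invalid: use the execution client's.
    return el_payload


builder = Payload("0xaaa", "builder")
local = Payload("0xbbb", "execution_client")

chosen_valid = select_payload(builder, local, validate=lambda p: True)
chosen_invalid = select_payload(builder, local, validate=lambda p: False)
chosen_missing = select_payload(None, local, validate=lambda p: True)
```

The three calls cover the cases named in the text: a valid builder payload, an invalid one, and an unavailable one.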


In addition to the current setup, another feature will need to be added here.

We'll need to allow conductor to proxy Flashblocks WebSocket connections / results to the public, depending on whether the current node is the leader. This is a common pattern in conductor: proxy any interaction with the system that strictly requires talking to the leader.

One thing to note is that in the current design a Flashblocks reorg could occur during leadership transfer, but it is much more controllable and only happens during leadership transfer.

If we want to prevent reorgs entirely, we'll need to consider committing Flashblocks to the raft consensus for consistency during leadership transfer. That is the second phase of our planned work.
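
To make the leader-gating idea concrete, here is a small Python sketch. It is only an illustration under assumptions: `LeaderGatedProxy`, `on_flashblock`, and the `is_leader` callback are hypothetical and do not correspond to any existing op-conductor API, and a plain list stands in for the public WebSocket feed.

```python
from typing import Callable, List


class LeaderGatedProxy:
    """Hypothetical gate that forwards Flashblocks only while this node leads."""

    def __init__(self, is_leader: Callable[[], bool]) -> None:
        self._is_leader = is_leader
        self.forwarded: List[str] = []  # stands in for the public feed

    def on_flashblock(self, message: str) -> bool:
        """Forward `message` only if this node is currently the leader."""
        if not self._is_leader():
            return False  # non-leaders do not publish their stream
        self.forwarded.append(message)
        return True


leader_state = {"leader": True}
proxy = LeaderGatedProxy(lambda: leader_state["leader"])

proxy.on_flashblock("flashblock-0")
leader_state["leader"] = False  # leadership transferred to another node
proxy.on_flashblock("flashblock-1")
```

After the simulated leadership transfer, only the first message has been forwarded. The sketch does not address the reorg window during the transfer itself.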

0x00101010 · 2025-05-06T17:22:04Z

docs/rollup-boost-ha.md

+## Health Checks
+
+In high availability deployments, `op-conductor` must assess the full health of the block production path. Rollup Boost will expose a composite `/healthz` endpoint to report on both builder synchronization and payload production status. These checks allow `op-conductor` to detect degraded block building conditions and make informed leadership decisions.
+
+Rollup Boost will continuously monitor two independent conditions to inform the health of the builder and the default execution client:
+
+- **Builder Synchronization**:  
+  A background task periodically queries the builder’s latest unsafe block via `engine_getBlockByNumber`. The task compares the timestamp of the returned block to the local system time. If the difference exceeds a configured maximum unsafe interval (`max_unsafe_interval`), the builder is considered out of sync. Failure to fetch a block from the builder or detection of an outdated block timestamp results in the health status being downgraded to Partial. If the builder is responsive and the block timestamp is within the acceptable interval, the builder is considered synchronized and healthy. Alternatively, instead of periodic polling, builder synchronization can be inferred if the builder returns a `VALID` response to a `newPayload` call forwarded from Rollup Boost.
+
+- **Payload Production**:  
+  During each `get_payload` request, Rollup Boost will verify payload availability from both the builder and the execution client. If the builder fails to deliver a payload, Rollup Boost will report partial health. If the execution client fails to deliver a payload, Rollup Boost will report unhealthy.
+
+`op-conductor` should also be configurable in how it interprets health status for failover decisions. This allows chain operators to define thresholds based on their risk tolerance and operational goals. For example, operators may choose to maintain leadership with a sequencer reporting `206 Partial Content` to avoid unnecessary failovers, or they may configure `op-conductor` to immediately fail over when any degradation is detected. This flexibility allows the chain operator to configure a failover policy that aligns with network performance expectations and builder reliability.
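
The two checks and the configurable failover policy from this section can be sketched as below. This is a hedged Python illustration rather than the real implementation: the `Health` enum, the function names, and the `fail_over_on_partial` flag are hypothetical, and treating an unhealthy status as always triggering failover is an assumption made for the example.

```python
from enum import Enum
from typing import Optional


class Health(Enum):
    HEALTHY = "healthy"
    PARTIAL = "partial"
    UNHEALTHY = "unhealthy"


def builder_synced(block_timestamp: Optional[int], now: int,
                   max_unsafe_interval: int) -> bool:
    """Builder is in sync if its latest unsafe block is recent enough.

    `block_timestamp` is None when the block could not be fetched.
    """
    if block_timestamp is None:
        return False
    return now - block_timestamp <= max_unsafe_interval


def composite_health(synced: bool, builder_payload_ok: bool,
                     el_payload_ok: bool) -> Health:
    """Combine both checks; an execution client failure takes priority."""
    if not el_payload_ok:
        return Health.UNHEALTHY
    if not synced or not builder_payload_ok:
        return Health.PARTIAL
    return Health.HEALTHY


def should_fail_over(health: Health, fail_over_on_partial: bool) -> bool:
    """Hypothetical operator policy for reacting to a reported status."""
    if health is Health.UNHEALTHY:
        return True  # assumption: unhealthy always triggers failover
    return health is Health.PARTIAL and fail_over_on_partial


# Builder's latest block is 10s old but only 5s of lag is tolerated.
synced = builder_synced(block_timestamp=100, now=110, max_unsafe_interval=5)
status = composite_health(synced, builder_payload_ok=True, el_payload_ok=True)
tolerant = should_fail_over(status, fail_over_on_partial=False)
strict = should_fail_over(status, fail_over_on_partial=True)
```

With a stale builder, the status is partial, so a tolerant operator keeps leadership while a strict one fails over.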


Thinking about this a little bit more, curious is there any specific reason (other than convenience that rollup-boost already connects to both builder and default EL) that we do the health check in rollup-boost, not conductor?

The potential drawback is that, due to differing configurations or health check cadences, there may be a delay between the builder being reported unhealthy and conductor learning of it and starting the leadership transfer.

It's not a big deal if we can tolerate a small delay there, but in general it feels like conductor (with a new flashblocks / rbuilder health check strategy) is the ideal home for this.

> Thinking about this a little bit more, curious is there any specific reason (other than convenience that rollup-boost already connects to both builder and default EL) that we do the health check in rollup-boost, not conductor?

Our current thinking is that rollup-boost health checks should evaluate both whether the builder is synced and whether it is producing valid payloads.

By placing the builder health check in rollup-boost, we can assess payload health during get_payload calls. Each builder payload is validated via a newPayload call to the local execution client, confirming that the builder is producing valid payloads before marking the builder as healthy.

While op-conductor could perform basic sync checks, this setup allows the health check to ensure block production/validity, giving us a stronger signal of health than sync checks alone. Let me know if you have any thoughts on this.

Cool, I think this makes sense. One suggestion is to mention (maybe not exactly here in the design doc) that, to minimize health check delay, operators should consider configuring matching health check intervals for both conductor and rollup-boost.
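
As a rough back-of-the-envelope illustration of why the intervals matter, assume a simplified model in which rollup-boost refreshes its health status every `rollup_boost_interval_s` seconds and conductor polls that status every `conductor_interval_s` seconds, ignoring request latency and any consecutive-failure thresholds. Both parameter names and the model itself are assumptions for this sketch.

```python
def worst_case_detection_delay(rollup_boost_interval_s: float,
                               conductor_interval_s: float) -> float:
    """Upper bound on fault-to-detection time under the simplified model.

    A fault occurring just after a rollup-boost check is noticed up to one
    rollup-boost interval later, and conductor then sees the updated status
    up to one conductor poll interval after that.
    """
    return rollup_boost_interval_s + conductor_interval_s


matched = worst_case_detection_delay(1.0, 1.0)
mismatched = worst_case_detection_delay(1.0, 10.0)
```

Under this model the slower of the two intervals dominates the delay, so a fast check on one side buys little if the other side polls slowly.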

0x00101010

LGTM!

jelias2

lgtm

* feat: rollup boost ha design doc
* fix: add assets
* fix: path to assets
* fix: update op stack ha diagram
* fix: op-stack-ha diagram
* fix: wording
* fix: update builder load balancer diagram
* chore: remove n builders, load balancer design options
* chore: add failover details to overview
* feat: add health check section
* docs: specify op conductor failover behavior
* docs: add execution mode, debug api, todos
* chore: update exectuion mode section
* fix: fix execution mode wording
* docs: update debug api
* fix: typos, wording
* docs: add note about bypassing builder when unhealthy
* chore: update failure scenarios
* fmt: formatting
* fmt: update spacing
* fix: update debug api request/response blocks
* fmt: spacing
* fix: update debug api anchor
* fix: clarify sequencer failure scenario
* fix: clarify sequencer failure scenario
* chore: update goals/non goals
* chore: add alternative builder sync check
* chore: note about configurable failover policy
* chore: update failure scenarios to include conditional failover if partial health
* chore: add note about flashblocks forward compatibility
* fix: update partial health description

  Co-authored-by: Francis Li <lihuiyi0406@gmail.com>
* fix: clarify health status prioritization
* chore: clarify language regarding inactive sequencer health checks
* fmt: update ordering
* fix: clarify failure modes
* fix: update failure modes to include default execution client failure
* docs: fix typo

  Co-authored-by: odysseas.eth <odyslam@gmail.com>

---------

Co-authored-by: Francis Li <lihuiyi0406@gmail.com>
Co-authored-by: odysseas.eth <odyslam@gmail.com>

0xKitsune and others added 7 commits April 17, 2025 14:41

feat: rollup boost ha design doc

077439c

fix: add assets

689f1ea

fix: path to assets

c261651

fix: update op stack ha diagram

6fab865

fix: op-stack-ha diagram

14f36dd

fix: wording

8607a12

fix: update builder load balancer diagram

6f05caa

0xKitsune mentioned this pull request Apr 21, 2025

RFC: Rollup Boost HA Design #181

Closed

0x00101010 reviewed Apr 21, 2025

0xKitsune and others added 12 commits April 25, 2025 13:59

chore: remove n builders, load balancer design options

62d3a51

chore: add failover details to overview

91a25a0

feat: add health check section

c420689

docs: specify op conductor failover behavior

d7dec2f

docs: add execution mode, debug api, todos

7701270

chore: update exectuion mode section

df8e04f

fix: fix execution mode wording

de6c60f

docs: update debug api

fd3403d

fix: typos, wording

53e3350

docs: add note about bypassing builder when unhealthy

47d49dc

chore: update failure scenarios

a0290d6

fmt: formatting

1b5441d

0xKitsune added 4 commits May 6, 2025 00:38

chore: add alternative builder sync check

5f8e9ba

chore: note about configurable failover policy

b9dfe8e

chore: update failure scenarios to include conditional failover if partial health

b5e7393

chore: add note about flashblocks forward compatibility

eb6cc52

0x00101010 reviewed May 6, 2025

fix: update partial health description

a808eb1

Co-authored-by: Francis Li <lihuiyi0406@gmail.com>

0xKitsune added 5 commits May 6, 2025 15:54

fix: clarify health status prioritization

0cb7042

chore: clarify language regarding inactive sequencer health checks

eb85744

fmt: update ordering

0c50eb7

fix: clarify failure modes

1fb88b3

fix: update failure modes to include default execution client failure

da48180

0x00101010 approved these changes May 7, 2025

jelias2 approved these changes May 7, 2025

0xForerunner approved these changes May 8, 2025

0xOsiris approved these changes May 8, 2025

karankurbur approved these changes May 8, 2025

odyslam reviewed May 16, 2025

docs: fix typo

8afb0a5

Co-authored-by: odysseas.eth <odyslam@gmail.com>

0xKitsune changed the title ~~RFC: Rollup Boost HA Design~~ docs: rollup-boost HA design May 16, 2025

0xKitsune merged commit 4d4ee74 into flashbots:main May 16, 2025
6 of 7 checks passed

0xKitsune deleted the rollup-boost-ha branch May 16, 2025 15:46


8 participants
