docs: rollup-boost HA design #185
- Maintain compatibility with `op-conductor` and its sequencing assumptions.

## Non Goals
- Define how Flashblocks are handled, consumed or streamed to the network.
There was a problem hiding this comment.
This is actually a very important part that should be included in this design.
Consider one situation: in the #2 load balancing solution below, what if the `get_payload` call is served by a different builder than the one streaming Flashblocks to the public? This would cause inconsistencies between the Flashblocks and the actual block, which we aim to avoid as much as possible.
There was a problem hiding this comment.
Should we update this now? I think we need to take this into consideration when designing HA solutions for Flashblocks.
There was a problem hiding this comment.
For clarity, on the design review call last week we aligned on focusing this doc on an HA design that is forward compatible with Flashblocks without defining the full Flashblocks HA specifics here.
The plan is to fast follow with a separate document that extends the HA design detailing Flashblocks HA behavior including op-conductor integration and consistency guarantees. Happy to update and clarify to make this more explicit. Let me know if you have any additional thoughts on this.
There was a problem hiding this comment.
I do understand that this spec does not cover Flashblocks (FBS), but for the future:
There is a small contradiction in the failover section because of Flashblocks.
In https://flashbots.notion.site/Draft_Flashblocks_Eth_OPstack-md-1d66b4a0d876808493c9dcc4428ccad7#1d66b4a0d87680c8b550d356ad113318 it says that we prioritize preconfirmations, so if the leader becomes partially healthy (builder is down) we will still use the block built by the builder. In https://github.com/flashbots/rollup-boost/pull/185/files#diff-df3270fd8794e55e7c8f29d84e2605dc61fecaf32696ac4788661c0038e19a6fR242 it says that we will choose the block from the EL.
There was a problem hiding this comment.
Thanks for flagging this. Agreed, we should update this. Ideally we can specify this behavior in the follow-on update that specifies Flashblocks HA behavior.
There was a problem hiding this comment.
Given that Flashblocks is a "feature" that may or may not be enabled in a Rollup-Boost-based setup, it makes sense imho to have the HA architecture be orthogonal to Flashblocks and then perhaps have an extension for how to handle Flashblocks specifically.
docs/rollup-boost-ha.md
Outdated

## 1:1 Rollup Boost to Builder Deployments

In this design, each `rollup-boost` instance is configured with a single external builder and default execution client. When `op-node` sends an FCU containing payload attributes, `rollup-boost` forwards the request to both the default execution client and its paired builder. Upon receiving a `get_payload` request from `op-node`, `rollup-boost` queries both the execution client and the builder. If the builder returns a payload, it is validated via a `new_payload` request sent to the default execution client. If the builder payload is invalid or unavailable, `rollup-boost` falls back to the execution client's payload.
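
The selection rule described above can be sketched as follows. This is a minimal illustration, not the `rollup-boost` implementation: `Payload`, `PayloadSource`, and `select_payload` are hypothetical names, and the `validate` closure stands in for the `new_payload` request sent to the default execution client.

```rust
/// Hypothetical stand-in for an execution payload; a real payload carries full block data.
#[derive(Debug, Clone, PartialEq)]
pub struct Payload {
    pub block_hash: String,
}

/// Which component produced the payload that was selected.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum PayloadSource {
    Builder,
    ExecutionClient,
}

/// Pick the payload to return for a `get_payload` request.
///
/// `validate` is an assumption: it models the `new_payload` call to the
/// default execution client and returns `true` when the payload is valid.
pub fn select_payload<F>(
    builder: Option<Payload>,
    execution_client: Option<Payload>,
    validate: F,
) -> Option<(Payload, PayloadSource)>
where
    F: Fn(&Payload) -> bool,
{
    // Prefer the builder payload, but only if it passes validation.
    if let Some(payload) = builder {
        if validate(&payload) {
            return Some((payload, PayloadSource::Builder));
        }
    }
    // Builder payload unavailable or invalid: fall back to the execution client.
    execution_client.map(|payload| (payload, PayloadSource::ExecutionClient))
}
```

With both payloads present and a passing validation, the builder payload is returned; in every other case the result is the execution client's payload, or `None` if that is missing too.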
There was a problem hiding this comment.
In addition to the current setup, another feature will need to be added here.
We'll need to allow conductor to proxy Flashblocks WebSocket connections / results to the public, depending on whether the current node is the leader. This is a common pattern in conductor: it proxies any interaction with the system that strictly requires the leader.
There was a problem hiding this comment.
One thing to note here is that in the current design there could be a Flashblocks reorg during leadership transfer, but it is much more controllable and only happens during leadership transfer.
If we want to prevent reorgs, we'll need to consider committing Flashblocks to the raft consensus for consistency during leadership transfer; that is the second phase of our planned work.
Co-authored-by: Francis Li <lihuiyi0406@gmail.com>
## Health Checks

In high availability deployments, `op-conductor` must assess the full health of the block production path. Rollup Boost will expose a composite `/healthz` endpoint to report on both builder synchronization and payload production status. These checks allow `op-conductor` to detect degraded block building conditions and make informed leadership decisions.

Rollup Boost will continuously monitor two independent conditions to inform the health of the builder and the default execution client:

- **Builder Synchronization**:
  A background task periodically queries the builder’s latest unsafe block via `engine_getBlockByNumber`. The task compares the timestamp of the returned block to the local system time. If the difference exceeds a configured maximum unsafe interval (`max_unsafe_interval`), the builder is considered out of sync. Failure to fetch a block from the builder or detection of an outdated block timestamp results in the health status being downgraded to Partial. If the builder is responsive and the block timestamp is within the acceptable interval, the builder is considered synchronized and healthy. Alternatively, instead of periodic polling, builder synchronization can be inferred if the builder returns a `VALID` response to a `newPayload` call forwarded from Rollup Boost.

- **Payload Production**:
  During each `get_payload` request, Rollup Boost will verify payload availability from both the builder and the execution client. If the builder fails to deliver a payload, Rollup Boost will report partial health. If the execution client fails to deliver a payload, Rollup Boost will report unhealthy.
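
The two conditions above could be combined roughly as in the sketch below. The names `Health`, `builder_synced`, and `composite_health` are hypothetical, timestamps are assumed to be Unix seconds, and letting an execution client failure take priority over a builder failure is an assumption about how the statuses are ranked.

```rust
/// Hypothetical three-level health status mirroring the text above.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Health {
    Healthy,
    Partial,
    Unhealthy,
}

/// Builder synchronization check.
///
/// `latest_block_timestamp` is `None` when the block could not be fetched.
/// All values are assumed to be Unix timestamps in seconds.
pub fn builder_synced(
    latest_block_timestamp: Option<u64>,
    now: u64,
    max_unsafe_interval: u64,
) -> bool {
    match latest_block_timestamp {
        // `saturating_sub` treats a block timestamp ahead of the local clock
        // as zero lag (an assumption; clock skew handling is not specified).
        Some(timestamp) => now.saturating_sub(timestamp) <= max_unsafe_interval,
        // A failed fetch counts as out of sync.
        None => false,
    }
}

/// Combine builder sync and payload production into one status.
pub fn composite_health(
    builder_in_sync: bool,
    builder_payload_ok: bool,
    execution_client_payload_ok: bool,
) -> Health {
    if !execution_client_payload_ok {
        // No fallback payload is available: report unhealthy.
        Health::Unhealthy
    } else if !builder_in_sync || !builder_payload_ok {
        // The builder is degraded but the execution client can still produce blocks.
        Health::Partial
    } else {
        Health::Healthy
    }
}
```

For example, a builder whose latest block is 10 seconds old with `max_unsafe_interval = 5` is out of sync, and the composite status is `Partial` as long as the execution client still delivers payloads.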

`op-conductor` should also be configurable in how it interprets health status for failover decisions. This allows chain operators to define thresholds based on their risk tolerance and operational goals. For example, operators may choose to maintain leadership with a sequencer reporting `206 Partial Content` to avoid unnecessary failovers, or they may configure `op-conductor` to immediately fail over when any degradation is detected. This flexibility allows the chain operator to configure a failover policy that aligns with network performance expectations and builder reliability.
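
A configurable failover policy of this kind might look like the sketch below. This is not `op-conductor` code (which is written in Go); `FailoverPolicy` and `should_fail_over` are hypothetical names. Only `206 Partial Content` comes from the text; treating `200` as fully healthy and every other status as unhealthy is an assumption.

```rust
/// Hypothetical operator-selected policy for interpreting health status.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum FailoverPolicy {
    /// Keep leadership while the sequencer reports partial health.
    ToleratePartial,
    /// Fail over as soon as any degradation is reported.
    Strict,
}

/// Decide whether to transfer leadership given the health endpoint's HTTP status.
pub fn should_fail_over(status_code: u16, policy: FailoverPolicy) -> bool {
    match status_code {
        // Assumed: 200 means fully healthy, so never fail over.
        200 => false,
        // 206 Partial Content: the decision depends on the configured policy.
        206 => policy == FailoverPolicy::Strict,
        // Assumed: any other status means unhealthy, so always fail over.
        _ => true,
    }
}
```

Under `ToleratePartial`, a `206` response keeps the current leader in place; under `Strict`, the same response triggers a leadership transfer.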
There was a problem hiding this comment.
Thinking about this a little bit more, I'm curious whether there is any specific reason (other than the convenience that rollup-boost already connects to both the builder and the default EL) that we do the health check in rollup-boost, not conductor?
The potential drawback here is that, due to potentially different configurations or actual health check cadence, there might be a delay between when the builder is reported unhealthy and when conductor learns of it and starts the leadership transfer.
It's not a big deal if we can tolerate a small delay there, but it feels like, in general, conductor (with a new Flashblocks / rbuilder health check strategy) is the ideal home for this.
There was a problem hiding this comment.
> Thinking about this a little bit more, curious is there any specific reason (other than convenience that rollup-boost already connects to both builder and default EL) that we do the health check in rollup-boost, not conductor?

Our current thinking is that rollup-boost health checks should evaluate both whether the builder is synced and whether it is producing valid payloads.
By placing the builder health check in rollup-boost, we can assess payload health during `get_payload` calls. Each builder payload is validated via a `newPayload` call to the local execution client, confirming that the builder is producing valid payloads before marking the builder as healthy.
While op-conductor could perform basic sync checks, this setup allows the health check to ensure block production/validity, giving us a stronger signal of health than sync checks alone. Let me know if you have any thoughts on this.
There was a problem hiding this comment.
Cool, I think this makes sense. One potential suggestion is to mention (maybe not exactly here in the design doc) that, to minimize health check delay, operators should consider configuring matching health check intervals for both conductor and rollup-boost.
Co-authored-by: odysseas.eth <odyslam@gmail.com>
* feat: rollup boost ha design doc
* fix: add assets
* fix: path to assets
* fix: update op stack ha diagram
* fix: op-stack-ha diagram
* fix: wording
* fix: update builder load balancer diagram
* chore: remove n builders, load balancer design options
* chore: add failover details to overview
* feat: add health check section
* docs: specify op conductor failover behavior
* docs: add execution mode, debug api, todos
* chore: update exectuion mode section
* fix: fix execution mode wording
* docs: update debug api
* fix: typos, wording
* docs: add note about bypassing builder when unhealthy
* chore: update failure scenarios
* fmt: formatting
* fmt: update spacing
* fix: update debug api request/response blocks
* fmt: spacing
* fix: update debug api anchor
* fix: clarify sequencer failure scenario
* fix: clarify sequencer failure scenario
* chore: update goals/non goals
* chore: add alternative builder sync check
* chore: note about configurable failover policy
* chore: update failure scenarios to include conditional failover if partial health
* chore: add note about flashblocks forward compatibility
* fix: update partial health description
  Co-authored-by: Francis Li <lihuiyi0406@gmail.com>
* fix: clarify health status prioritization
* chore: clarify language regarding inactive sequencer health checks
* fmt: update ordering
* fix: clarify failure modes
* fix: update failure modes to include default execution client failure
* docs: fix typo
  Co-authored-by: odysseas.eth <odyslam@gmail.com>

---------

Co-authored-by: Francis Li <lihuiyi0406@gmail.com>
Co-authored-by: odysseas.eth <odyslam@gmail.com>
Opening this PR to migrate the design doc discussion from #181 so reviewers can leave comments directly on specific sections and suggest edits inline.