Skip to content

Kubernetes Probes#142

Merged
0xOsiris merged 47 commits intoflashbots:mainfrom
worldcoin:forerunner/probes
Apr 29, 2025
Merged

Kubernetes Probes#142
0xOsiris merged 47 commits intoflashbots:mainfrom
worldcoin:forerunner/probes

Conversation

@0xForerunner
Copy link
Collaborator

This PR is dependent on #141
You can view the diff from that PR here

Health, readiness, liveness checks layer added.

  • healthz determines wether everything is 100% up and running. If the builder fails to produce paylaods the healthz endpoint will return an error. op-conductor should eventually be able to use this signal to switch to a different sequencer in an HA sequencer setup.
  • livez currently just exists to make sure we're serving any requests at all. If this ever returns an error at any point, kubernetes can use this signal to restart the pod.
  • readyz returns true as long as we're effectively building payloads from the l2 client. This means that we still build blocks with this instance of rollup-boost.

@vercel
Copy link

vercel bot commented Mar 19, 2025

The latest updates on your projects. Learn more about Vercel for Git ↗︎

1 Skipped Deployment
Name Status Preview Comments Updated (UTC)
rollup-boost ⬜️ Ignored (Inspect) Visit Preview Apr 26, 2025 0:26am

@avalonche
Copy link
Collaborator

Can you add docs on how this works with op-conductor? from the description it seems like each sequencer in the HA setup has 1 builder. how would op-conductor behave if every builder was down and not healthy?

@0xForerunner
Copy link
Collaborator Author

from the description it seems like each sequencer in the HA setup has 1 builder

Yeah this seems to be the typical setup. A 1 to many builder-sequencer relationship is undefined behaviour according to OP. So you need 1:1 sequencer builder.

from the description it seems like each sequencer in the HA setup has 1 builder

Currently op conductor doesn't have any logic for this. Once we have the appropriate probes here we can try to get op-conductor support.

how would op-conductor behave if every builder was down and not healthy

If all 3 are not healthy then we're in a pretty bad state. I suppose a random instance would be chosen to be the primary sequencer in that case.

@0xForerunner
Copy link
Collaborator Author

@avalonche @0xOsiris I've added some docs.

I'm second guessing my use of the /readyz probe here. I'm sort of imparting some extra meaning to it that isn't typically expected. I think perhaps a better way to accomplish this would be to change the response status codes from /healthz.

/healthz responses:

  • 200 OK - healthy
  • 206 Partial Content - l2 creating blocks, builder is down
  • 503 Service Unavailable - We're not building any blocks

What do you think?

Copy link
Collaborator

@avalonche avalonche left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm, only thing is that I would add in the docs /healthz returns 206 and 503 after the builder / l2 fails to produce a block only once and that endpoint will still return 200 if the builder is up but the local l2 is not

@teddyknox
Copy link
Collaborator

FYI the op-conductor change ethereum-optimism/optimism#15316 depends on this PR now.

@teddyknox
Copy link
Collaborator

lgtm, only thing is that I would add in the docs /healthz returns 206 and 503 after the builder / l2 fails to produce a block only once and that endpoint will still return 200 if the builder is up but the local l2 is not

This should probably be debounced locally if this is expected to happen frequently.

@zhwrd
Copy link

zhwrd commented Apr 24, 2025

Have been integration testing this with our conductor rollup-boost monitoring PR

It seems the /healthz response is sticky which I dont think will work great for conductor, for ex:

  1. sequencer -> rollup-boost -> rbuilder are all healthy, actively sequencing
  2. stop rbuilder (simulating deployment or failure)
  3. rollup-boost reports partial health, triggers conductor healthcheck failure and leadership transfer
  4. start rbuilder (expecting rollup-boost would pick this up and start reporting 200)
  5. rollup-boost still reports partial health, conductor stuck reporting the sequencer as unhealthy

Similarly, you can kill r-builder on a non-active sequencer and conductor will report is as healthy.

Is there any way we can move the health probe to background so we get async rbuilder health updates?

@0xOsiris
Copy link
Collaborator

Have been integration testing this with our conductor rollup-boost monitoring PR

It seems the /healthz response is sticky which I dont think will work great for conductor, for ex:

  1. sequencer -> rollup-boost -> rbuilder are all healthy, actively sequencing
  2. stop rbuilder (simulating deployment or failure)
  3. rollup-boost reports partial health, triggers conductor healthcheck failure and leadership transfer
  4. start rbuilder (expecting rollup-boost would pick this up and start reporting 200)
  5. rollup-boost still reports partial health, conductor stuck reporting the sequencer as unhealthy

Similarly, you can kill r-builder on a non-active sequencer and conductor will report is as healthy.

Is there any way we can move the health probe to background so we get async rbuilder health updates?

Yeah this makes sense - this is because the health probe is only updated during a get_payload call which would only be instantiated when op-node is in a sequencing state. So - you are right after a failover happens partial health will never turn back to healthy. Will see how we can fix this!

@0xOsiris
Copy link
Collaborator

0xOsiris commented Apr 27, 2025

@zhwrd @teddyknox Thanks for the callout on the sticky health status. I've added an additional background health check to the rollup-boost server that continuously monitors unsafe head progression of the builder which should functionally work the same as the health check op-conductor is performing to ensure the unsafe head is progressing within CONDUCTOR_HEALTH_CHECK_UNSAFE_INTERVAL.

This should resolve the issue of the health status not being updated on non-sequencing EL's. In the sequencing case we have now have 2 health checks running in parallel.

  • The builder is returning valid payloads
  • The builder is synced

@0xOsiris 0xOsiris merged commit 8f29ecf into flashbots:main Apr 29, 2025
5 checks passed
@0xOsiris 0xOsiris deleted the forerunner/probes branch April 29, 2025 16:21
theochap pushed a commit to ethereum-optimism/optimism that referenced this pull request Feb 11, 2026
* wip

* wip

* wip

* wip

* clean things up

* fix for cloned service

* cleanup process_response

* eyre bail

* remove unnecessary deps

* Add kubernetes probe layer

* implement health/ready check logic

* modify ready logic

* fix comment/feature

* delete old file

* working

* Update src/client/http.rs

Co-authored-by: shana <avalonche@protonmail.com>

* parse response cod

* clippy fix

* Probe docs

* Switch to returning health status only from /healthz using http status codes

* Update docs to describe health status codes

* remove stray comments

* cleanup, add tests

* remove stray comment

* chore: rm mocks

* chore: fmt

* fix: default to healthy status

* chore: fix dockerignore

* feat: add background process to query block height as a health check for non-sequencing el's

* fix: signatures

* chore: update comments

* chore: clippy

* test: add tests

* fix: stress tests

* fix: change health check to check unsafe head progression on builder

* chore: update doc comments

* fix: loop

* chore: update comments

* merge main

---------

Co-authored-by: shana <avalonche@protonmail.com>
Co-authored-by: 0xOsiris <djosiris@proton.me>
theochap pushed a commit to ethereum-optimism/optimism that referenced this pull request Feb 13, 2026
* wip

* wip

* wip

* wip

* clean things up

* fix for cloned service

* cleanup process_response

* eyre bail

* remove unnecessary deps

* Add kubernetes probe layer

* implement health/ready check logic

* modify ready logic

* fix comment/feature

* delete old file

* working

* Update src/client/http.rs

Co-authored-by: shana <avalonche@protonmail.com>

* parse response cod

* clippy fix

* Probe docs

* Switch to returning health status only from /healthz using http status codes

* Update docs to describe health status codes

* remove stray comments

* cleanup, add tests

* remove stray comment

* chore: rm mocks

* chore: fmt

* fix: default to healthy status

* chore: fix dockerignore

* feat: add background process to query block height as a health check for non-sequencing el's

* fix: signatures

* chore: update comments

* chore: clippy

* test: add tests

* fix: stress tests

* fix: change health check to check unsafe head progression on builder

* chore: update doc comments

* fix: loop

* chore: update comments

* merge main

---------

Co-authored-by: shana <avalonche@protonmail.com>
Co-authored-by: 0xOsiris <djosiris@proton.me>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

7 participants