
feat: health check in worker #1006

Merged

Conversation

@SantiagoPittella (Collaborator) commented Dec 6, 2024

This PR adds:

  • Health check endpoints in the remote prover, using the tonic-health crate, which provides the required endpoints without needing to define the methods in a .proto file (a minimal sketch of the worker-side wiring is shown below).
  • A background service in the proxy that periodically checks the health of each worker and removes the ones that are not working.
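
For reference, a minimal sketch of the worker-side wiring (assuming tonic-health's health_reporter/set_service_status API; the address and server setup are illustrative, not this PR's actual code):

use tonic::transport::Server;
use tonic_health::ServingStatus;

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    // tonic-health ships the standard grpc.health.v1.Health service, so no
    // extra .proto definitions are needed on the worker side.
    let (mut health_reporter, health_service) = tonic_health::server::health_reporter();

    // Mark the overall server (empty service name) as SERVING.
    health_reporter
        .set_service_status("", ServingStatus::Serving)
        .await;

    // The real worker would also add its prover service here.
    Server::builder()
        .add_service(health_service)
        .serve("0.0.0.0:50051".parse()?)
        .await?;

    Ok(())
}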

@SantiagoPittella SantiagoPittella force-pushed the santiagopittella-add-healthcheck-endpoint-in-worker branch from f1e7b89 to 077902b on December 6, 2024 18:27
@SantiagoPittella SantiagoPittella changed the title from "feat: Health check in workers + update on proxy" to "feat: health check in worker" on Dec 6, 2024
@SantiagoPittella SantiagoPittella marked this pull request as ready for review December 6, 2024 18:36
@igamigo (Collaborator) left a comment

Nice! Leaving some comments for now, will take a deeper look (and maybe test it as well) later

@bobbinth (Contributor) left a comment

Looks good! Thank you. I left some comments inline.

Another thing I was thinking about (not for this PR): do we actually want to maintain the list of workers in the config file? It seems like it requires quite a bit of synchronization. Maybe we remove it from there and pass the list of initial workers on proxy startup as CLI parameters?
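
For illustration, passing the initial workers at proxy startup could look roughly like this (hypothetical CLI shape using clap's derive API; not part of this PR):

use clap::Parser;

/// Hypothetical proxy start command: take the initial worker list from the CLI
/// instead of keeping it in the config file.
#[derive(Parser, Debug)]
struct StartProxy {
    /// Initial workers as host:port pairs, repeatable:
    /// --workers 0.0.0.0:8085 --workers 0.0.0.0:8086
    #[arg(long = "workers")]
    workers: Vec<String>,
}

fn main() {
    let args = StartProxy::parse();
    println!("starting proxy with initial workers: {:?}", args.workers);
}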

@SantiagoPittella (Collaborator, Author)

@bobbinth @igamigo I addressed the comments and opened #1008 and #1009: the first is about removing the workers from the configuration file, and the second about defining and implementing a retry policy for the health check.

@tomyrd (Collaborator) left a comment

Looks good, left a couple of questions. I want to test it locally before approving it.

Comment on lines +54 to +56
tokio::task::spawn_blocking(|| server.run_forever())
.await
.map_err(|err| err.to_string())?;
Collaborator:

Why do we need to run run_forever in a separate thread if we instantly wait for it to end?

Collaborator (Author):

We do so because Pingora creates a new runtime in .run_forever() without the possibility of passing an existing one. The tokio::task::spawn_blocking call was introduced to avoid a panic each time we instantiate a proxy.
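
To illustrate the constraint (illustrative code only, not Pingora's internals): a function that builds and blocks on its own Tokio runtime cannot block from inside an async task, so it is moved onto the blocking thread pool.

// Stand-in for a call like Pingora's run_forever(): it builds its own runtime
// and blocks the calling thread.
fn blocking_server_loop() {
    let rt = tokio::runtime::Runtime::new().expect("failed to build runtime");
    rt.block_on(async {
        // ... serve requests here; left empty so the sketch terminates ...
    });
}

#[tokio::main]
async fn main() -> Result<(), String> {
    // Calling block_on directly from this async context would panic
    // ("Cannot start a runtime from within a runtime"); spawn_blocking runs
    // the loop on a dedicated blocking thread instead.
    tokio::task::spawn_blocking(blocking_server_loop)
        .await
        .map_err(|err| err.to_string())?;
    Ok(())
}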

configuration
.save_to_config_file()
.map_err(|err| format!("Failed to save configuration: {}", err))?;
ProxyConfig::set_workers(new_list_of_workers)?;
Collaborator:

Do we need to lock the file when writing the workers? What would happen if both the update_workers function and the health check background service want to write to the config file at the same time? This won't be an issue once we remove the list from the file (#1008), so maybe it's OK to leave it for now.

Collaborator (Author):

I think this is safe because both calls happen under the workers' write lock. Though, as you mentioned, this is going to be removed.

@bobbinth (Contributor) left a comment

Looks good! Thank you! I left a couple of comments inline. After these comments are addressed (as well as comments from @tomyrd and @igamigo), we can probably merge.

Though, I would follow this up with a PR to address #1008 as this should simplify the structure considerably.

@tomyrd (Collaborator) left a comment

Tested it on my machine and it works correctly

@bobbinth (Contributor) left a comment

Looks good! Thank you! I left a few small comments inline - but we are basically good to merge.

@igamigo (Collaborator) left a comment

LGTM! Left some minor comments, but no need to tackle them in this PR


Ok(())
}

/// Updates the workers in the configuration with the new list.
pub(crate) fn set_workers(workers: Vec<WorkerConfig>) -> Result<(), String> {
Collaborator:

Maybe this could be done as part of the incoming follow-up work, but we should probably check (unless it's being done somewhere already) that there are no duplicate workers at any point (both in the worker list and the persisted config file) to avoid problems if the user accidentally adds the same address/port twice.

Collaborator (Author):

It is not being checked. However, the worker config is planned to be removed from the configuration file, so I think we can dismiss this for now and use that issue to fix it.

let mut healthy_workers = Vec::new();

for worker in workers {
if worker.is_healthy().await {
Collaborator:

Not for this PR, but I think it would be nice to parallelize these checks somehow. Since they are sequential, a couple of failing workers can each stall all the checks by the defined timeout amount.

Collaborator (Author):

It might be good to add this to issue #1009 (the one about retries) and broaden that issue's scope to a more general "improve health check" on the proxy side. I'm updating the issue; let me know your opinion.
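
A possible shape for the parallelized sweep (a sketch with a hypothetical Worker stand-in; the real type and its is_healthy() live in the proxy code):

use futures::future::join_all;

// Hypothetical stand-in for the proxy's worker handle; the real is_healthy()
// calls the worker's gRPC health endpoint with a timeout.
struct Worker {
    address: String,
}

impl Worker {
    async fn is_healthy(&self) -> bool {
        // Placeholder so the sketch compiles on its own.
        !self.address.is_empty()
    }
}

// Run all health checks concurrently so a single timing-out worker does not
// stall the whole sweep, then keep only the workers that reported healthy.
async fn retain_healthy_workers(workers: Vec<Worker>) -> Vec<Worker> {
    let checks = workers.into_iter().map(|worker| async move {
        let healthy = worker.is_healthy().await;
        (worker, healthy)
    });

    join_all(checks)
        .await
        .into_iter()
        .filter_map(|(worker, healthy)| healthy.then_some(worker))
        .collect()
}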

/// This wrapper is used to implement the ProxyHttp trait for Arc<LoadBalancer>.
/// This is necessary because we want to share the load balancer between the proxy server and the
/// health check background service.
pub struct LoadBalancerWrapper(pub Arc<LoadBalancer>);
@igamigo (Collaborator) commented Dec 11, 2024:

nit: Because ProxyHttp is now implemented for the wrapper struct, I think we could rename LoadBalancerWrapper to LoadBalancer, and the current LoadBalancer could become something like LoadBalancerState or LoadBalancerConfig, since the idea of adding the wrapper was to be able to easily share the state anyway.

Collaborator (Author):

I like LoadBalancerState; it is pretty straightforward about what it does. I'm renaming it.
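
For context, the rename amounts to a newtype over shared state, roughly like this (hypothetical field names; the ProxyHttp impl on the newtype is omitted):

use std::sync::Arc;
use tokio::sync::RwLock;

// Shared state consulted by both the proxy service and the health-check
// background service (hypothetical shape).
pub struct LoadBalancerState {
    pub workers: RwLock<Vec<String>>,
}

// Thin newtype over Arc<LoadBalancerState>: foreign traits such as Pingora's
// ProxyHttp can be implemented on this local type, while clones of the Arc
// are handed to the background service.
pub struct LoadBalancer(pub Arc<LoadBalancerState>);

impl LoadBalancer {
    pub fn new(state: Arc<LoadBalancerState>) -> Self {
        Self(state)
    }
}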

@bobbinth (Contributor) left a comment

All looks good to me! Thank you!

@SantiagoPittella or @igamigo - feel free to merge w/e you think is appropriate.

…from proxy

feat: add health check server in worker

feat: add gRPC healthcheck methods

feat: add health check in proxy

feat: add ProxyConfig::update_workers method

docs: add entry to changelog

docs: update README

docs: remove old doc from field

chore: remove unused fields in Worker Config, improve Worker::execute documentation

docs: add documentation to LBWrapper, add documentation to BackgroundService, remove unwraps

chore: split BackgroundService implementation in different methods
@SantiagoPittella SantiagoPittella force-pushed the santiagopittella-add-healthcheck-endpoint-in-worker branch from 91d82b2 to b3c7f2f on December 11, 2024 17:28
@SantiagoPittella SantiagoPittella merged commit c0449ac into next Dec 11, 2024
9 checks passed
@SantiagoPittella SantiagoPittella deleted the santiagopittella-add-healthcheck-endpoint-in-worker branch December 11, 2024 18:23