
feat: health check in worker #1006

Merged

Conversation

@SantiagoPittella (Collaborator) commented Dec 6, 2024

This PR adds:

  • Health check endpoints in the remote prover, using the tonic-health crate, which provides the required endpoints without needing to define the methods in a .proto file (a minimal sketch of the worker-side wiring is shown below).
  • A background service in the proxy that periodically checks the health of each worker and removes the ones that are not working.
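
For reference, a minimal sketch of the worker-side wiring (assuming tonic-health's health_reporter/set_service_status API; the address and server setup are illustrative, not this PR's actual code):

use tonic::transport::Server;
use tonic_health::ServingStatus;

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    // tonic-health ships the standard grpc.health.v1.Health service, so no
    // extra .proto definitions are needed on the worker side.
    let (mut health_reporter, health_service) = tonic_health::server::health_reporter();

    // Mark the overall server (empty service name) as SERVING.
    health_reporter
        .set_service_status("", ServingStatus::Serving)
        .await;

    // The real worker would also add its prover service here.
    Server::builder()
        .add_service(health_service)
        .serve("0.0.0.0:50051".parse()?)
        .await?;

    Ok(())
}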

@SantiagoPittella SantiagoPittella force-pushed the santiagopittella-add-healthcheck-endpoint-in-worker branch from f1e7b89 to 077902b on December 6, 2024 18:27
@SantiagoPittella SantiagoPittella changed the title from "feat: Health check in workers + update on proxy" to "feat: health check in worker" on Dec 6, 2024
@SantiagoPittella SantiagoPittella marked this pull request as ready for review December 6, 2024 18:36
@igamigo (Collaborator) left a comment

Nice! Leaving some comments for now, will take a deeper look (and maybe test it as well) later

@bobbinth (Contributor) left a comment

Looks good! Thank you. I left some comments inline.

Another thing I was thinking about (not for this PR): do we actually want to maintain the list of workers in the config file? It seems like it requires quite a bit of synchronization. Maybe we remove it from there and pass the list of initial workers on proxy startup as CLI parameters?
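
For illustration, passing the initial workers at proxy startup could look roughly like this (hypothetical CLI shape using clap's derive API; not part of this PR):

use clap::Parser;

/// Hypothetical proxy start command: take the initial worker list from the CLI
/// instead of keeping it in the config file.
#[derive(Parser, Debug)]
struct StartProxy {
    /// Initial workers as host:port pairs, repeatable:
    /// --workers 0.0.0.0:8085 --workers 0.0.0.0:8086
    #[arg(long = "workers")]
    workers: Vec<String>,
}

fn main() {
    let args = StartProxy::parse();
    println!("starting proxy with initial workers: {:?}", args.workers);
}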

@SantiagoPittella (Collaborator, Author)

@bobbinth @igamigo I addressed the comments and opened #1008 and #1009: the first is about removing the workers from the configuration file, and the second about defining and implementing a retry policy for the health check.

@tomyrd (Collaborator) left a comment

Looks good, left a couple of questions. I want to test it locally before approving it.

Comment on lines +54 to +56
tokio::task::spawn_blocking(|| server.run_forever())
.await
.map_err(|err| err.to_string())?;
Collaborator:

Why do we need to run run_forever in a separate thread if we instantly wait for it to end?

Collaborator (Author):

We do so because Pingora creates a new runtime in .run_forever() without the possibility of passing an existing one. The tokio::task::spawn_blocking call was introduced to avoid a panic each time we instantiate a proxy.
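
To illustrate the constraint (illustrative code only, not Pingora's internals): a function that builds and blocks on its own Tokio runtime cannot block from inside an async task, so it is moved onto the blocking thread pool.

// Stand-in for a call like Pingora's run_forever(): it builds its own runtime
// and blocks the calling thread.
fn blocking_server_loop() {
    let rt = tokio::runtime::Runtime::new().expect("failed to build runtime");
    rt.block_on(async {
        // ... serve requests here; left empty so the sketch terminates ...
    });
}

#[tokio::main]
async fn main() -> Result<(), String> {
    // Calling block_on directly from this async context would panic
    // ("Cannot start a runtime from within a runtime"); spawn_blocking runs
    // the loop on a dedicated blocking thread instead.
    tokio::task::spawn_blocking(blocking_server_loop)
        .await
        .map_err(|err| err.to_string())?;
    Ok(())
}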

configuration
.save_to_config_file()
.map_err(|err| format!("Failed to save configuration: {}", err))?;
ProxyConfig::set_workers(new_list_of_workers)?;
Collaborator:

Do we need to lock the file when writing the workers? What would happen if both the update_workers function and the health check background service want to write to the config file at the same time? This won't be an issue once we remove the list from the file (#1008), so maybe it's OK to leave it for now.

Collaborator (Author):

I think this is safe because both calls happen under the workers' write lock. Though, as you mentioned, this is going to be removed.

@bobbinth (Contributor) left a comment

Looks good! Thank you! I left a couple of comments inline. After these comments are addressed (as well as comments from @tomyrd and @igamigo), we can probably merge.

Though, I would follow this up with a PR to address #1008 as this should simplify the structure considerably.

@tomyrd (Collaborator) left a comment

Tested it on my machine and it works correctly

@bobbinth (Contributor) left a comment

Looks good! Thank you! I left a few small comments inline - but we are basically good to merge.

@igamigo (Collaborator) left a comment

LGTM! Left some minor comments, but no need to tackle them in this PR


Ok(())
}

/// Updates the workers in the configuration with the new list.
pub(crate) fn set_workers(workers: Vec<WorkerConfig>) -> Result<(), String> {
Collaborator:

Maybe this could be done as part of the incoming follow-up work, but we should probably check (unless it's being done somewhere already) that there are no duplicate workers at any point (both in the worker list and the persisted config file) to avoid problems if the user accidentally adds the same address/port twice.

Collaborator (Author):

It is not being checked. However, the worker config is planned to be removed from the configuration file, so I think we can dismiss this for now and use that issue to fix it.

let mut healthy_workers = Vec::new();

for worker in workers {
if worker.is_healthy().await {
Collaborator:

Not for this PR, but I think it would be nice to parallelize these checks somehow. Since they are sequential, a couple of failing workers can each stall all the checks by the defined timeout amount.

Collaborator (Author):

It might be good to add this to issue #1009 (the one about retries) and broaden that issue's scope to a more general "improve health check" on the proxy side. I'm updating the issue; let me know your opinion.
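
A possible shape for the parallelized sweep (a sketch with a hypothetical Worker stand-in; the real type and its is_healthy() live in the proxy code):

use futures::future::join_all;

// Hypothetical stand-in for the proxy's worker handle; the real is_healthy()
// calls the worker's gRPC health endpoint with a timeout.
struct Worker {
    address: String,
}

impl Worker {
    async fn is_healthy(&self) -> bool {
        // Placeholder so the sketch compiles on its own.
        !self.address.is_empty()
    }
}

// Run all health checks concurrently so a single timing-out worker does not
// stall the whole sweep, then keep only the workers that reported healthy.
async fn retain_healthy_workers(workers: Vec<Worker>) -> Vec<Worker> {
    let checks = workers.into_iter().map(|worker| async move {
        let healthy = worker.is_healthy().await;
        (worker, healthy)
    });

    join_all(checks)
        .await
        .into_iter()
        .filter_map(|(worker, healthy)| healthy.then_some(worker))
        .collect()
}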

/// This wrapper is used to implement the ProxyHttp trait for Arc<LoadBalancer>.
/// This is necessary because we want to share the load balancer between the proxy server and the
/// health check background service.
pub struct LoadBalancerWrapper(pub Arc<LoadBalancer>);
@igamigo (Collaborator) commented Dec 11, 2024:

nit: Because ProxyHttp is now implemented for the wrapper struct, I think we could rename LoadBalancerWrapper to LoadBalancer, and the current LoadBalancer could become something like LoadBalancerState or LoadBalancerConfig, since the idea of adding the wrapper was to be able to easily share the state anyway.

Collaborator (Author):

I like LoadBalancerState; it is pretty straightforward about what it does. I'm renaming it.
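
For context, the rename amounts to a newtype over shared state, roughly like this (hypothetical field names; the ProxyHttp impl on the newtype is omitted):

use std::sync::Arc;
use tokio::sync::RwLock;

// Shared state consulted by both the proxy service and the health-check
// background service (hypothetical shape).
pub struct LoadBalancerState {
    pub workers: RwLock<Vec<String>>,
}

// Thin newtype over Arc<LoadBalancerState>: foreign traits such as Pingora's
// ProxyHttp can be implemented on this local type, while clones of the Arc
// are handed to the background service.
pub struct LoadBalancer(pub Arc<LoadBalancerState>);

impl LoadBalancer {
    pub fn new(state: Arc<LoadBalancerState>) -> Self {
        Self(state)
    }
}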

@bobbinth (Contributor) left a comment

All looks good to me! Thank you!

@SantiagoPittella or @igamigo - feel free to merge w/e you think is appropriate.

…from proxy

feat: add health check server in worker

feat: add gRPC healthcheck methods

feat: add health check in proxy

feat: add ProxyConfig::update_workers method

docs: add entry to changelog

docs: update README

docs: remove old doc from field

chore: remove unused fields in Worker Config, improve Worker::execute documentation

docs: add documentation to LBWrapper, add documentation to BackgroundService, remove unwraps

chore: split BackgroundService implementation in different methods
@SantiagoPittella SantiagoPittella force-pushed the santiagopittella-add-healthcheck-endpoint-in-worker branch from 91d82b2 to b3c7f2f on December 11, 2024 17:28
@SantiagoPittella SantiagoPittella merged commit c0449ac into next Dec 11, 2024
9 checks passed
@SantiagoPittella SantiagoPittella deleted the santiagopittella-add-healthcheck-endpoint-in-worker branch December 11, 2024 18:23