feat: health check in worker #1006
Conversation
f1e7b89 to 077902b
Nice! Leaving some comments for now, will take a deeper look (and maybe test it as well) later
Looks good! Thank you. I left some comments inline.
Another thing I was thinking about (not for this PR): do we actually want to maintain the list of workers in the config file? It seems like it requires quite a bit of synchronization. Maybe we remove it from there and pass the list of initial workers on proxy startup as CLI parameters?
Looks good, left a couple of questions. I want to test it locally before approving it.
tokio::task::spawn_blocking(|| server.run_forever())
    .await
    .map_err(|err| err.to_string())?;
Why do we need to run `run_forever` in a separate thread if we instantly wait for it to end?
We do so because Pingora creates a new runtime in `.run_forever()` without the possibility of passing an existing one. The `tokio::task::spawn_blocking` call was introduced in order to avoid a panic each time we instantiate a proxy.
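For context, a self-contained sketch of the pattern in the snippet above (the `run_proxy` name is illustrative, and `pingora::server::Server` is assumed from Pingora's public API rather than copied from this repo): `run_forever` blocks on its own runtime, so it is moved onto a blocking thread instead of being called directly from async code.

```rust
// Sketch only: Pingora's `Server::run_forever` builds and blocks on its own
// tokio runtime, so calling it directly inside an async context panics
// (tokio forbids blocking on a nested runtime). Running it via
// `spawn_blocking` moves it to a dedicated blocking thread.
async fn run_proxy(server: pingora::server::Server) -> Result<(), String> {
    tokio::task::spawn_blocking(move || server.run_forever())
        .await
        .map_err(|err| err.to_string())?;
    Ok(())
}
```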
configuration
    .save_to_config_file()
    .map_err(|err| format!("Failed to save configuration: {}", err))?;
ProxyConfig::set_workers(new_list_of_workers)?;
Do we need to lock the file when writing the workers? What would happen if both the `update_workers` function and the health check background service want to write to the config file at the same time? This won't be an issue once we remove the list from the file (#1008), so maybe it's ok to leave it for now.
I think this is safe because both calls happen under the workers' write lock. Though as you mentioned, this is going to be removed.
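For illustration, a minimal sketch of the synchronization being described (`WorkerConfig` and `ProxyConfig::set_workers` appear in this PR; the `replace_workers` function, the `RwLock` wrapping, and the `Clone` bound are assumptions for the sketch): both the update path and the health-check service would hold the same write guard while persisting, so the two config-file writes cannot interleave.

```rust
use std::sync::Arc;
use tokio::sync::RwLock;

// Sketch: the shared worker list lives behind an async RwLock; the write
// guard is held for the whole update, including persisting to the config
// file, which serializes it with any other writer taking the same lock.
async fn replace_workers(
    workers: Arc<RwLock<Vec<WorkerConfig>>>,
    new_list: Vec<WorkerConfig>,
) -> Result<(), String> {
    let mut guard = workers.write().await;
    *guard = new_list.clone();
    ProxyConfig::set_workers(new_list)
}
```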
Tested it on my machine and it works correctly
Looks good! Thank you! I left a few small comments inline - but we are basically good to merge.
LGTM! Left some minor comments, but no need to tackle them in this PR
    Ok(())
}

/// Updates the workers in the configuration with the new list.
pub(crate) fn set_workers(workers: Vec<WorkerConfig>) -> Result<(), String> {
Maybe this could be done as part of the incoming follow-up work, but we should probably check (unless it's being done somewhere already) that there are no duplicate workers at any point (both in the worker list and the persisted config file) to avoid problems if the user accidentally adds the same address/port twice.
It is not being checked. However, the worker config is planned to be removed from the configuration file, so I think we can dismiss this for now and use that issue to fix it.
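For reference, a minimal sketch of the kind of duplicate check being discussed (illustrative only: `address()` is a hypothetical accessor returning the worker's host:port, not necessarily a real field in this repo):

```rust
use std::collections::HashSet;

// Sketch: reject an update if the same worker address appears twice in the
// incoming list, before it ever reaches the config file.
fn ensure_no_duplicate_workers(workers: &[WorkerConfig]) -> Result<(), String> {
    let mut seen = HashSet::new();
    for worker in workers {
        if !seen.insert(worker.address()) {
            return Err(format!("duplicate worker address: {}", worker.address()));
        }
    }
    Ok(())
}
```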
let mut healthy_workers = Vec::new();

for worker in workers {
    if worker.is_healthy().await {
Not for this PR, but I think it would be nice to parallelize these checks somehow. Since they are sequential, if a couple of them are failing, they can each stall all the remaining checks by the defined timeout amount.
Maybe it would be good to add this to issue #1009 (the one about retries) and broaden its scope into a more general "improve health checks" on the proxy side. I'm updating the issue, let me know your opinion.
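As a possible direction (not part of this PR), a minimal sketch of running the per-worker checks concurrently; it assumes `WorkerConfig::is_healthy` is async as in the snippet above and uses the `futures` crate, and the `filter_healthy` name is illustrative:

```rust
// Sketch: fire all health checks at once, so a slow or unreachable worker
// costs at most one timeout instead of delaying every subsequent check.
async fn filter_healthy(workers: Vec<WorkerConfig>) -> Vec<WorkerConfig> {
    let checks = workers.into_iter().map(|worker| async move {
        let healthy = worker.is_healthy().await;
        (worker, healthy)
    });
    futures::future::join_all(checks)
        .await
        .into_iter()
        .filter_map(|(worker, healthy)| healthy.then_some(worker))
        .collect()
}
```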
bin/tx-prover/src/proxy/mod.rs (outdated)
/// This wrapper is used to implement the ProxyHttp trait for Arc<LoadBalancer>.
/// This is necessary because we want to share the load balancer between the proxy server and the
/// health check background service.
pub struct LoadBalancerWrapper(pub Arc<LoadBalancer>);
nit: Because `ProxyHttp` is now being implemented for the wrapper struct, I think we could rename `LoadBalancerWrapper` to `LoadBalancer`, and `LoadBalancer` could become something like `LoadBalancerState` or `LoadBalancerConfig`, since the idea of adding the wrapper was to be able to easily share the state anyway.
I like `LoadBalancerState`; it's pretty straightforward as to what it does. I'm renaming it.
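For clarity, a sketch of the shape being agreed on here (inferred from the discussion, not copied from the final code): the shared data moves into `LoadBalancerState`, and the thin wrapper that implements `ProxyHttp` keeps the `LoadBalancer` name.

```rust
use std::sync::Arc;

/// Shared state (worker list, health status, etc.) that is also used by the
/// health check background service.
pub struct LoadBalancerState {
    // fields elided in this sketch
}

/// Thin wrapper over the shared state; `ProxyHttp` would be implemented for
/// this type so the state itself can be cheaply shared via `Arc`.
pub struct LoadBalancer(pub Arc<LoadBalancerState>);
```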
All looks good to me! Thank you!
@SantiagoPittella or @igamigo - feel free to merge w/e you think is appropriate.
…from proxy
feat: add health check server in worker
feat: add gRPC healthcheck methods
feat: add health check in proxy
feat: add ProxyConfig::update_workers method
docs: add entry to changelog
docs: update README
docs: remove old doc from field
chore: remove unused fields in Worker Config, improve Worker::execute documentation
docs: add documentation to LBWrapper, add documentation to BackgroundService, remove unwraps
chore: split BackgroundService implementation in different methods
91d82b2 to b3c7f2f
This PR adds: the `tonic-health` crate, which adds the required endpoints without the need to define the methods in a `.proto` file.
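For illustration, a minimal sketch of wiring `tonic-health` into a worker's gRPC server, following the crate's documented usage (`ProverServer` and `ProverService` are illustrative names, not necessarily the ones used in this repo):

```rust
use tonic::transport::Server;

async fn serve(addr: std::net::SocketAddr) -> Result<(), Box<dyn std::error::Error>> {
    // tonic-health provides the gRPC health service without any custom .proto.
    let (mut health_reporter, health_service) = tonic_health::server::health_reporter();
    // Mark the prover service as SERVING so the proxy's health checks succeed.
    health_reporter
        .set_serving::<ProverServer<ProverService>>()
        .await;

    Server::builder()
        .add_service(health_service)
        .add_service(ProverServer::new(ProverService::default()))
        .serve(addr)
        .await?;
    Ok(())
}
```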