-
Notifications
You must be signed in to change notification settings - Fork 122
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Introduce component based health #636
Introduce component based health #636
Conversation
The latest updates on your projects. Learn more about Vercel for Git ↗︎
|
nativelink-util/src/health_utils.rs
Outdated
} | ||
|
||
#[async_trait] | ||
pub trait HealthStatusIndicator<'a>: Sync + Send + Unpin { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
'a
seems to be not-needed.
355ce2f
to
c3b19d1
Compare
c3b19d1
to
0e1bae0
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: 0 of 1 LGTMs obtained, and pending CI: Analyze (javascript-typescript), Analyze (python), Bazel Dev / ubuntu-22.04, Cargo Dev / ubuntu-22.04, Local / ubuntu-22.04, Remote / large-ubuntu-22.04, asan / ubuntu-22.04, docker-compose-compiles-nativelink (20.04), docker-compose-compiles-nativelink (22.04), integration-tests (20.04), integration-tests (22.04), pre-commit-checks, publish-image, ubuntu-20.04, ubuntu-20.04 / stable, ubuntu-22.04, ubuntu-22.04 / stable, windows-2022 / stable
nativelink-util/src/health_utils.rs
line 30 at r1 (raw file):
Previously, allada (Nathan (Blaise) Bruer) wrote…
'a
seems to be not-needed.
Done. ... When trying to use this trait outside of a test context, ran into a bunch of lifetime problems which probably got infected by pingponging error message suggestions
0e1bae0
to
f0f12e2
Compare
f0f12e2
to
1e4cf06
Compare
1e4cf06
to
6154e28
Compare
6154e28
to
c963b67
Compare
c963b67
to
01188ae
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+@blakehatch Can you first-pass this?
Reviewed 2 of 6 files at r1, 5 of 11 files at r2, 5 of 8 files at r3, all commit messages.
Reviewable status: 0 of 1 LGTMs obtained (waiting on @blakehatch)
nativelink-util/src/store_trait.rs
line 185 at r3 (raw file):
fn register_metrics(self: Arc<Self>, _registry: &mut Registry) {} fn register_health(self: Arc<Self>, _registry: &mut HealthRegistry) {}
nit: This is a public/stable api, so it needs comments.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 1 of 11 files at r2, 3 of 8 files at r3, all commit messages.
Reviewable status: 0 of 1 LGTMs obtained
nativelink-util/src/health_utils.rs
line 89 at r3 (raw file):
parent_component: &HealthComponent, component: &HealthComponent, indicators: &Vec<Arc<dyn HealthStatusIndicator>>,
If indicator and registry are turned into slices of Arc<dyn HealthStatusIndicator>
you should be able to avoid a deep copy. Allowing you to just pass around references to flatten and check health directly while iterating over them.
However, if this is not a large performance bottleneck at the scale this will run I would not recommend refactoring and having to fight the restructured lifetimes.
nativelink-util/src/health_utils.rs
line 94 at r3 (raw file):
let component_name: Cow<'static, str> = Cow::Owned(format!("{parent_component}/{component}")); for indicator in indicators { let result = indicator.clone().check_health().await;
Concerned that the deep copy here could be a performance bottleneck.
nativelink-util/src/health_utils.rs
line 112 at r3 (raw file):
for registry in registries { let _ = self .clone()
Ditto
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes 👍
Reviewable status: 0 of 1 LGTMs obtained (waiting on @blakehatch)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: 0 of 1 LGTMs obtained (waiting on @blakehatch)
nativelink-util/src/health_utils.rs
line 89 at r3 (raw file):
Previously, blakehatch (Blake Hatch) wrote…
If indicator and registry are turned into slices of
Arc<dyn HealthStatusIndicator>
you should be able to avoid a deep copy. Allowing you to just pass around references to flatten and check health directly while iterating over them.However, if this is not a large performance bottleneck at the scale this will run I would not recommend refactoring and having to fight the restructured lifetimes.
Are you suggesting that indicators: &Vec<Arc<dyn HealthStatusIndicator>>
-> indicators: &Vec<Arc<&dyn HealthStatusIndicator>>
?
nativelink-util/src/health_utils.rs
line 94 at r3 (raw file):
Previously, blakehatch (Blake Hatch) wrote…
Concerned that the deep copy here could be a performance bottleneck.
@blakehatch Do you mean deep copy by calling clone()
here?
Also yes I’m referring to making a deep copy by calling clone(). From my understanding cloning a |
01188ae
to
64bef39
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: 0 of 1 LGTMs obtained, and pending CI: Analyze (python), Bazel Dev / ubuntu-22.04, Cargo Dev / macos-13, Cargo Dev / ubuntu-22.04, Local / ubuntu-22.04, Remote / large-ubuntu-22.04, asan / ubuntu-22.04, docker-compose-compiles-nativelink (20.04), docker-compose-compiles-nativelink (22.04), integration-tests (20.04), integration-tests (22.04), macos-13, pre-commit-checks, publish-image, ubuntu-20.04 / stable, ubuntu-22.04, ubuntu-22.04 / stable, windows-2022 / stable, zig-cc ubuntu-20.04, zig-cc ubuntu-22.04 (waiting on @blakehatch)
nativelink-util/src/health_utils.rs
line 94 at r3 (raw file):
Previously, blakehatch (Blake Hatch) wrote…
indicators: &Vec<Arc<dyn HealthStatusIndicator>>
->indicators: &[Arc<dyn HealthStatusIndicator>]
.Also yes I’m referring to making a deep copy by calling clone(). From my understanding cloning a
Vec
in Rust creates a deep copy because it duplicates the Vec's elements and underlying memory of said elements.
Done.
nativelink-util/src/health_utils.rs
line 112 at r3 (raw file):
Previously, blakehatch (Blake Hatch) wrote…
Ditto
Done.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 6 of 6 files at r4, all commit messages.
Reviewable status: 1 of 1 LGTMs obtained, and pending CI: Bazel Dev / ubuntu-22.04, Cargo Dev / macos-13, Cargo Dev / ubuntu-22.04, Remote / large-ubuntu-22.04, asan / ubuntu-22.04, docker-compose-compiles-nativelink (20.04), docker-compose-compiles-nativelink (22.04), integration-tests (20.04), integration-tests (22.04), macos-13, publish-image, ubuntu-20.04 / stable, ubuntu-22.04, ubuntu-22.04 / stable, windows-2022 / stable, zig-cc ubuntu-20.04, zig-cc ubuntu-22.04
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed all commit messages.
Reviewable status: 1 of 1 LGTMs obtained
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: 1 of 1 LGTMs obtained, and pending CI: Analyze (javascript-typescript), Analyze (python), Bazel Dev / ubuntu-22.04, Cargo Dev / macos-13, Cargo Dev / ubuntu-22.04, Local / ubuntu-22.04, Remote / large-ubuntu-22.04, asan / ubuntu-22.04, docker-compose-compiles-nativelink (20.04), docker-compose-compiles-nativelink (22.04), integration-tests (20.04), integration-tests (22.04), macos-13, pre-commit-checks, publish-image, ubuntu-20.04 / stable, ubuntu-22.04, ubuntu-22.04 / stable, windows-2022 / stable, zig-cc ubuntu-20.04, zig-cc ubuntu-22.04
nativelink-util/src/health_utils.rs
line 139 at r6 (raw file):
Previously, allada (Nathan (Blaise) Bruer) wrote…
Don't hate me for this, but I have an idea.
What if we rewrite all this with this contract:
pub fn flatten(&self) -> impl Iterator<Item = (HealthComponentName, Arc<dyn HealthStatusIndicator>)> { // ... } trait HealthStatusReporter<S: Stream<Type = HealthStatusDescription>> { fn health_status_report(&self) -> S; } impl <T, S> HealthStatusReporter<S> for T where T: Iterator<Item = (HealthComponentName, Arc<dyn HealthStatusIndicator>)>, S: Stream<Type = HealthStatusDescription> { fn health_status_report(&self) -> S { // ... } }The idea here is that you can
flatten()
this into a collection, then with just the collection generate the report, without having to build the collection every time (if you are not going to add more items to the registry any more, which we don't).I know this is more similar to what you had before.
Thoughts?
Done.
nativelink-util/src/health_utils.rs
line 139 at r6 (raw file):
Previously, allada (Nathan (Blaise) Bruer) wrote…
nit:
mut
doesn't appear to be needed.
Done.
52ebbe0
to
24a98c2
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: 1 of 1 LGTMs obtained, and pending CI: Bazel Dev / ubuntu-22.04, Vercel, asan / ubuntu-22.04, pre-commit-checks, publish-image, ubuntu-20.04 / stable
a discussion (no related file):
Previously, allada (Nathan (Blaise) Bruer) wrote…
Finished my ultra-fine grained review.
A lot of these are more suggestions. The reason I'm being a bit more careful here is because I think we may want to publish this API to a rust crate, so we should get the API solid before doing so.
@allada ptal.
I also forgot, which stores we don't want to provide generic health check verifications for?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 2 of 4 files at r7, 23 of 25 files at r11, 2 of 2 files at r12, all commit messages.
Reviewable status: 1 of 1 LGTMs obtained
-- commits
line 2 at r12:
nit: This is [Breaking]
, since it changes the behavior of /status
and some users may rely on /status
hard codded to return Ok
in a happy path. I also suggest mentioning /status
in the title.
nativelink-util/src/health_utils.rs
line 40 at r12 (raw file):
message: Cow<'static, Message>, }, /// Initializing status, used to signal or indicate component is initializing.
nit: all these comments seems silly and just describing the definition of the word. I'd suggest removing them, as I don't see these comments adding any value except maybe the non-fatal
part.
nativelink-util/src/health_utils.rs
line 64 at r12 (raw file):
impl HealthStatus { /// Make a healthy status. pub fn new_ok(component: &(impl HealthStatusIndicator + ?Sized), message: Cow<'static, str>) -> Self {
nit: Does this need impl
and Sized
? Can't it just be &HealthStatusIndicator
? (same with below). I'd prefer not to let this pattern be copy-pasta'ed if possible.
nativelink-util/src/health_utils.rs
line 115 at r12 (raw file):
/// Returns the name of the struct implementing the trait. fn struct_name(&self) -> &'static str {
nit: const fn
? Not sure if this is possible in trait
but worth a try.
nativelink-util/src/health_utils.rs
line 196 at r12 (raw file):
/// Generally used for components that don't need custom implementations /// of the `check_health` function. #[macro_export]
nit: Lets try not exporting with this macro and then import with nativelink_util::health_utils::default_health_status_indicator
.
nativelink-util/src/store_trait.rs
line 176 at r12 (raw file):
// in situations where the default implementation is not sufficient. async fn check_health(self: Pin<&Self>, _namespace: Cow<'static, str>) -> HealthStatus { let bytes = _namespace.as_bytes();
nit: Don't prefix with _
if it's being used.
nativelink-util/src/store_trait.rs
line 177 at r12 (raw file):
async fn check_health(self: Pin<&Self>, _namespace: Cow<'static, str>) -> HealthStatus { let bytes = _namespace.as_bytes(); let hash = blake3::hash(bytes).into();
nit: does namespace
have the component name itself too or just the path up to the component? We should probably pipe in the store's name if it's not part of the namespace.
nativelink-util/src/store_trait.rs
line 180 at r12 (raw file):
let len = bytes.len(); let digest_info = DigestInfo::new(hash, len as i64); let bytes_copy = bytes::Bytes::copy_from_slice(bytes);
nit: Lets make this a global config that contains the size of the digest to upload, filled with statically random generated (ie: use a constant random seed) with some of the bytes substituted with the namespace.
I'm worried that memory & filesystem store checks will have these default payloads far too small and never trigger the real problems.
nativelink-util/src/store_trait.rs
line 182 at r12 (raw file):
let bytes_copy = bytes::Bytes::copy_from_slice(bytes); match self.update_oneshot(digest_info, bytes_copy.clone()).await {
nit, might be cleaner using
if let Err(e) = self.update_oneshot(digest_info, bytes_copy.clone()).await {
return HealthStatus::new_failed(
self.get_ref(),
format!("Store.update_oneshot() failed: {}", e).into(),
);
}
nativelink-util/src/store_trait.rs
line 206 at r12 (raw file):
} match self.has(digest_info).await {
super nit: I suggest doing this before get_part
as knowing it didn't exist is going to be more useful than .get_part_unchunked failed()
error.
nativelink-util/tests/health_utils_test.rs
line 22 at r12 (raw file):
use futures::StreamExt; use nativelink_error::Error; use nativelink_util::health_utils::*;
nit: Can we spell out what we use? I really don't like using *
as it makes code search less useful, since it hides away where the stuff is really coming from.
src/bin/nativelink.rs
line 387 at r12 (raw file):
match serde_json5::to_string(&health_status_descriptions) { Ok(health_status_descriptions) => Response::builder() .status(StatusCode::OK)
Hmmm, I think we want a 503
in the case where any of the reports came back error (and maybe warning?). Failing to do this would make using this in a cloud health checker unusable.
24a98c2
to
6537abc
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: 1 of 1 LGTMs obtained
Previously, allada (Nathan (Blaise) Bruer) wrote…
nit: This is
[Breaking]
, since it changes the behavior of/status
and some users may rely on/status
hard codded to returnOk
in a happy path. I also suggest mentioning/status
in the title.
Done.
nativelink-util/src/health_utils.rs
line 40 at r12 (raw file):
Previously, allada (Nathan (Blaise) Bruer) wrote…
nit: all these comments seems silly and just describing the definition of the word. I'd suggest removing them, as I don't see these comments adding any value except maybe the
non-fatal
part.
Done.
nativelink-util/src/health_utils.rs
line 64 at r12 (raw file):
Previously, allada (Nathan (Blaise) Bruer) wrote…
nit: Does this need
impl
andSized
? Can't it just be&HealthStatusIndicator
? (same with below). I'd prefer not to let this pattern be copy-pasta'ed if possible.
https://gist.github.com/adam-singer/49faf251ca251ff7fe40f832105cd34b#file-gistfile0-txt-L2 if we remove ?Sized
, if we try &HealthStatusIndicator
https://gist.github.com/adam-singer/c75e1d177dd105307b68230bd2bafc30 and then adding dyn
just leads back to https://gist.github.com/adam-singer/ff64c6acd44234e001a77a1685a8c4a6.
I like that the helper methods depend on HealthStatusIndicator
generally speaking, but using ?Sized
is bit overly flexible, so I'd rather just remove the helper methods then trying to fighting this.
nativelink-util/src/health_utils.rs
line 115 at r12 (raw file):
Previously, allada (Nathan (Blaise) Bruer) wrote…
nit:
const fn
? Not sure if this is possible intrait
but worth a try.
not allowed https://gist.github.com/adam-singer/219603d86abba8f66cbf10f6e9b6e97a
Done.
nativelink-util/src/health_utils.rs
line 196 at r12 (raw file):
Previously, allada (Nathan (Blaise) Bruer) wrote…
nit: Lets try not exporting with this macro and then import with
nativelink_util::health_utils::default_health_status_indicator
.
Done.
Tried reading up on different ways to cross the crate boundaries with macros, it doesn't seem you can do so without macro_export
. You can "re-scope" by doing pub use create::<macro_name>
within a mod
, that will make the macro "scoped" in some use for use
statements.
https://users.rust-lang.org/t/re-exporting-macro-export-ed-macros-by-module-where-they-are-defined/84130
https://veykril.github.io/tlborm/decl-macros/minutiae/scoping.html
https://veykril.github.io/tlborm/decl-macros/minutiae/scoping.html#path-based-scope
https://doc.rust-lang.org/reference/macros-by-example.html
At least I didn't find another pattern that makes that possible from the above.
nativelink-util/src/store_trait.rs
line 176 at r12 (raw file):
Previously, allada (Nathan (Blaise) Bruer) wrote…
nit: Don't prefix with
_
if it's being used.
Done.
nativelink-util/src/store_trait.rs
line 177 at r12 (raw file):
Previously, allada (Nathan (Blaise) Bruer) wrote…
nit: does
namespace
have the component name itself too or just the path up to the component? We should probably pipe in the store's name if it's not part of the namespace.
it does not, but we should have a reference to the store name, self.get_name()
nativelink-util/src/store_trait.rs
line 182 at r12 (raw file):
Previously, allada (Nathan (Blaise) Bruer) wrote…
nit, might be cleaner using
if let Err(e) = self.update_oneshot(digest_info, bytes_copy.clone()).await { return HealthStatus::new_failed( self.get_ref(), format!("Store.update_oneshot() failed: {}", e).into(), ); }
Nice!
Done.
nativelink-util/tests/health_utils_test.rs
line 22 at r12 (raw file):
Previously, allada (Nathan (Blaise) Bruer) wrote…
nit: Can we spell out what we use? I really don't like using
*
as it makes code search less useful, since it hides away where the stuff is really coming from.
Done.
src/bin/nativelink.rs
line 387 at r12 (raw file):
Previously, allada (Nathan (Blaise) Bruer) wrote…
Hmmm, I think we want a
503
in the case where any of the reports came back error (and maybe warning?). Failing to do this would make using this in a cloud health checker unusable.
sgtm, the k8s stuff is very status code dependent and probably won't inspect body much.
Done.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 16 of 17 files at r13, all commit messages.
Reviewable status: 1 of 1 LGTMs obtained, and pending CI: asan / ubuntu-22.04, docker-compose-compiles-nativelink (20.04), docker-compose-compiles-nativelink (22.04), macos-13
nativelink-util/src/store_trait.rs
line 177 at r12 (raw file):
Previously, adam-singer (Adam Singer) wrote…
it does not, but we should have a reference to the store name,
self.get_name()
In that case we probably want to add self.get_name()
salt to the hash or else we might get collisions.
6537abc
to
36172a0
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: 1 of 1 LGTMs obtained, and pending CI: Vercel, pre-commit-checks
nativelink-util/src/store_trait.rs
line 177 at r12 (raw file):
Previously, allada (Nathan (Blaise) Bruer) wrote…
In that case we probably want to add
self.get_name()
salt to the hash or else we might get collisions.
Done.
nativelink-util/src/store_trait.rs
line 180 at r12 (raw file):
Previously, allada (Nathan (Blaise) Bruer) wrote…
nit: Lets make this a global config that contains the size of the digest to upload, filled with statically random generated (ie: use a constant random seed) with some of the bytes substituted with the namespace.
I'm worried that memory & filesystem store checks will have these default payloads far too small and never trigger the real problems.
Done.
36172a0
to
f243c53
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: 1 of 1 LGTMs obtained, and pending CI: Vercel, pre-commit-checks
nativelink-util/src/store_trait.rs
line 206 at r12 (raw file):
Previously, allada (Nathan (Blaise) Bruer) wrote…
super nit: I suggest doing this before
get_part
as knowing it didn't exist is going to be more useful than.get_part_unchunked failed()
error.
Done.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 7 of 8 files at r14, 1 of 1 files at r15, all commit messages.
Reviewable status: 1 of 1 LGTMs obtained
nativelink-config/src/cas_server.rs
line 634 at r15 (raw file):
/// Default digest size to use for health check when running /// diagnostics checks.
super nit: Lets be a bit more descriptive on what this is doing.
nativelink-util/src/health_utils.rs
line 189 at r15 (raw file):
&self, namespace: std::borrow::Cow<'static, str>, ) -> nativelink_util::health_utils::HealthStatus {
nit: Just import the name.
nativelink-util/src/store_trait.rs
line 39 at r15 (raw file):
pub fn default_digest_size_health_check() -> usize { *DEFAULT_DIGEST_SIZE_HEALTH_CHECK.get_or_init(|| 1024 * 1024)
nit: Make this number a const and make sure it's documented that if it changes it must be updated in the config file.
nativelink-util/src/store_trait.rs
line 204 at r15 (raw file):
// Fill data with all random bytes. HEALTH_CHECK_RNG.lock().fill_bytes(&mut digest_data);
nit: We want this number to be the same on every call, but different for every component.
The way I'd do this is use the namespace and component name to generate a random int32, then feed that number into the random number generator seed, then fill this.
This ensures we get the same data every call, but different for every component. (otherwise we pollute our store with a lot of useless data.
There's other ways to do this too, but the way I outlined above seems simplest albeit slower.
nativelink-util/src/store_trait.rs
line 222 at r15 (raw file):
} let digest_data_hash = blake3::hash(&digest_data).into();
nit: Don't use blake directly, use DigestHasher::from(default_digest_hasher_func()).update(...);
f243c53
to
9caec5f
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: complete! 1 of 1 LGTMs obtained
nativelink-config/src/cas_server.rs
line 634 at r15 (raw file):
Previously, allada (Nathan (Blaise) Bruer) wrote…
super nit: Lets be a bit more descriptive on what this is doing.
Done.
nativelink-util/src/health_utils.rs
line 189 at r15 (raw file):
Previously, allada (Nathan (Blaise) Bruer) wrote…
nit: Just import the name.
Done.
When macros are expanded some imports will conflict, in this case std::borrow::Cow
and nativelink\_util::health\_utils::HealthStatus
could already be imported in some scopes and not others. Tried suggested of just importing in this scope and in the macro scope, became a ping/pong of who imported what.
nativelink-util/src/store_trait.rs
line 39 at r15 (raw file):
Previously, allada (Nathan (Blaise) Bruer) wrote…
nit: Make this number a const and make sure it's documented that if it changes it must be updated in the config file.
Done.
nativelink-util/src/store_trait.rs
line 204 at r15 (raw file):
Previously, allada (Nathan (Blaise) Bruer) wrote…
nit: We want this number to be the same on every call, but different for every component.
The way I'd do this is use the namespace and component name to generate a random int32, then feed that number into the random number generator seed, then fill this.
This ensures we get the same data every call, but different for every component. (otherwise we pollute our store with a lot of useless data.
There's other ways to do this too, but the way I outlined above seems simplest albeit slower.
Done.
nativelink-util/src/store_trait.rs
line 222 at r15 (raw file):
Previously, allada (Nathan (Blaise) Bruer) wrote…
nit: Don't use blake directly, use
DigestHasher::from(default_digest_hasher_func()).update(...);
Done.
… based health Introduce a component based health check system. Each type of component should be able to opt into registering handlers implement some mechanical checks of health. Health in this context is functionality expected to work but runtime wise are semi no-op in terms of influencing the underlying storage / rpc systems. Opting in to the system requires for component to define a `HealthStatusIndicator` and implement the `check_health()` function. Registration should automatically be done by the existence of implementation and calling the `Store.register_health()` function in component implementation. At the moment only `Store` based components can register and a single health check is defined for `FilesystemStore`. A global parameter `default_digest_size_health_check` has been introduced for configuring the static seeded random bytes to fill data payload. The default value is 1MB. The `/status` endpoint has been updated to return the resulting string serialized instances of `HealthStatus`, smart serialization of such objects is not implemented at this time.
9caec5f
to
8eb0ee1
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 6 of 6 files at r16, all commit messages.
Reviewable status: 1 of 1 LGTMs obtained, and pending CI: Analyze (javascript-typescript), Analyze (python), Bazel Dev / ubuntu-22.04, Cargo Dev / ubuntu-22.04, Local / ubuntu-22.04, Remote / large-ubuntu-22.04, Vercel, asan / ubuntu-22.04, docker-compose-compiles-nativelink (20.04), docker-compose-compiles-nativelink (22.04), integration-tests (20.04), pre-commit-checks, publish-image, ubuntu-20.04 / stable, ubuntu-22.04, ubuntu-22.04 / stable, vale, windows-2022 / stable, zig-cc ubuntu-20.04, zig-cc ubuntu-22.04
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: complete! 1 of 1 LGTMs obtained
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: complete! 1 of 1 LGTMs obtained
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: complete! 1 of 1 LGTMs obtained
Description
Introduce a component based health check system. Each type of component should be able to opt into registering handlers implement some mechanical checks of health. Health in this context is functionality expected to work but runtime wise are semi no-op in terms of influencing the underlying storage / rpc systems.
Opting in to the system requires for component to define a
HealthStatusIndicator
and implement thecheck_health()
function. Registration should automatically be done by the existence of implementation and calling theStore.register_health()
function in component implementation. All stores can opt into a default health check behavior by usingdefault_health_status_indicator!
, this macro gives a simple definition that depends on theStore.check_health()
function definition of a health check.The
/status
endpoint has been updated to return the resulting json serializedHealthStatusDescription
objects.Fixes # (issue)
Type of change
Please delete options that are not relevant.
not work as expected)
How Has This Been Tested?
Example of calling
/status
endpoint to return back status indicator.Please also list any relevant details for your test configuration
Checklist
bazel test //...
passes locallygit amend
see some docsThis change is