Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Introduce component based health #636

Merged

Conversation

adam-singer
Copy link
Contributor

@adam-singer adam-singer commented Jan 25, 2024

Description

Introduce a component based health check system. Each type of component should be able to opt into registering handlers implement some mechanical checks of health. Health in this context is functionality expected to work but runtime wise are semi no-op in terms of influencing the underlying storage / rpc systems.

Opting in to the system requires for component to define a HealthStatusIndicator and implement the check_health() function. Registration should automatically be done by the existence of implementation and calling the Store.register_health() function in component implementation. All stores can opt into a default health check behavior by using default_health_status_indicator!, this macro gives a simple definition that depends on the Store.check_health() function definition of a health check.

The /status endpoint has been updated to return the resulting json serialized HealthStatusDescription objects.

Fixes # (issue)

Type of change

Please delete options that are not relevant.

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to
    not work as expected)
  • This change requires a documentation update

How Has This Been Tested?

Example of calling /status endpoint to return back status indicator.

10:33:01 adam@Adams-MBP nativelink-2 ±|adams/health-check-components|→ curl -sS 0.0.0.0:50051/status|jq .
[
  {
    "namespace": "/nativelink/stores/AC_MAIN_STORE/FilesystemStore",
    "status": {
      "Ok": {
        "struct_name": "nativelink_store::filesystem_store::FilesystemStore",
        "message": "ok"
      }
    }
  }
]

Please also list any relevant details for your test configuration

Checklist

  • Updated documentation if needed
  • Tests added/amended
  • bazel test //... passes locally
  • PR is contained in a single commit, using git amend see some docs

This change is Reviewable

Copy link

vercel bot commented Jan 25, 2024

The latest updates on your projects. Learn more about Vercel for Git ↗︎

Name Status Preview Comments Updated (UTC)
nativelink-docs ✅ Ready (Inspect) Visit Preview 💬 Add feedback Feb 15, 2024 1:16am

}

#[async_trait]
pub trait HealthStatusIndicator<'a>: Sync + Send + Unpin {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

'a seems to be not-needed.

@adam-singer adam-singer force-pushed the adams/health-check-components branch from 355ce2f to c3b19d1 Compare January 26, 2024 07:29
@adam-singer adam-singer force-pushed the adams/health-check-components branch from c3b19d1 to 0e1bae0 Compare January 26, 2024 08:04
Copy link
Contributor Author

@adam-singer adam-singer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 0 of 1 LGTMs obtained, and pending CI: Analyze (javascript-typescript), Analyze (python), Bazel Dev / ubuntu-22.04, Cargo Dev / ubuntu-22.04, Local / ubuntu-22.04, Remote / large-ubuntu-22.04, asan / ubuntu-22.04, docker-compose-compiles-nativelink (20.04), docker-compose-compiles-nativelink (22.04), integration-tests (20.04), integration-tests (22.04), pre-commit-checks, publish-image, ubuntu-20.04, ubuntu-20.04 / stable, ubuntu-22.04, ubuntu-22.04 / stable, windows-2022 / stable


nativelink-util/src/health_utils.rs line 30 at r1 (raw file):

Previously, allada (Nathan (Blaise) Bruer) wrote…

'a seems to be not-needed.

Done. ... When trying to use this trait outside of a test context, ran into a bunch of lifetime problems which probably got infected by pingponging error message suggestions

@adam-singer adam-singer force-pushed the adams/health-check-components branch from 0e1bae0 to f0f12e2 Compare January 27, 2024 00:25
@adam-singer adam-singer force-pushed the adams/health-check-components branch from f0f12e2 to 1e4cf06 Compare January 27, 2024 01:28
@adam-singer adam-singer force-pushed the adams/health-check-components branch from 1e4cf06 to 6154e28 Compare January 27, 2024 01:54
@adam-singer adam-singer marked this pull request as ready for review January 27, 2024 01:59
@adam-singer adam-singer force-pushed the adams/health-check-components branch from 6154e28 to c963b67 Compare January 27, 2024 02:00
@adam-singer adam-singer force-pushed the adams/health-check-components branch from c963b67 to 01188ae Compare January 27, 2024 02:02
Copy link
Member

@allada allada left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+@blakehatch Can you first-pass this?

Reviewed 2 of 6 files at r1, 5 of 11 files at r2, 5 of 8 files at r3, all commit messages.
Reviewable status: 0 of 1 LGTMs obtained (waiting on @blakehatch)


nativelink-util/src/store_trait.rs line 185 at r3 (raw file):

    fn register_metrics(self: Arc<Self>, _registry: &mut Registry) {}

    fn register_health(self: Arc<Self>, _registry: &mut HealthRegistry) {}

nit: This is a public/stable api, so it needs comments.

Copy link
Member

@blakehatch blakehatch left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 1 of 11 files at r2, 3 of 8 files at r3, all commit messages.
Reviewable status: 0 of 1 LGTMs obtained


nativelink-util/src/health_utils.rs line 89 at r3 (raw file):

        parent_component: &HealthComponent,
        component: &HealthComponent,
        indicators: &Vec<Arc<dyn HealthStatusIndicator>>,

If indicator and registry are turned into slices of Arc<dyn HealthStatusIndicator> you should be able to avoid a deep copy. Allowing you to just pass around references to flatten and check health directly while iterating over them.

However, if this is not a large performance bottleneck at the scale this will run I would not recommend refactoring and having to fight the restructured lifetimes.


nativelink-util/src/health_utils.rs line 94 at r3 (raw file):

        let component_name: Cow<'static, str> = Cow::Owned(format!("{parent_component}/{component}"));
        for indicator in indicators {
            let result = indicator.clone().check_health().await;

Concerned that the deep copy here could be a performance bottleneck.


nativelink-util/src/health_utils.rs line 112 at r3 (raw file):

        for registry in registries {
            let _ = self
                .clone()

Ditto

Copy link
Member

@blakehatch blakehatch left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes 👍

Reviewable status: 0 of 1 LGTMs obtained (waiting on @blakehatch)

Copy link
Contributor Author

@adam-singer adam-singer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 0 of 1 LGTMs obtained (waiting on @blakehatch)


nativelink-util/src/health_utils.rs line 89 at r3 (raw file):

Previously, blakehatch (Blake Hatch) wrote…

If indicator and registry are turned into slices of Arc<dyn HealthStatusIndicator> you should be able to avoid a deep copy. Allowing you to just pass around references to flatten and check health directly while iterating over them.

However, if this is not a large performance bottleneck at the scale this will run I would not recommend refactoring and having to fight the restructured lifetimes.

Are you suggesting that indicators: &Vec<Arc<dyn HealthStatusIndicator>> -> indicators: &Vec<Arc<&dyn HealthStatusIndicator>>?


nativelink-util/src/health_utils.rs line 94 at r3 (raw file):

Previously, blakehatch (Blake Hatch) wrote…

Concerned that the deep copy here could be a performance bottleneck.

@blakehatch Do you mean deep copy by calling clone() here?

@blakehatch
Copy link
Member

Reviewable status: 0 of 1 LGTMs obtained (waiting on @blakehatch)

nativelink-util/src/health_utils.rs line 89 at r3 (raw file):

Previously, blakehatch (Blake Hatch) wrote…
Are you suggesting that indicators: &Vec<Arc<dyn HealthStatusIndicator>> -> indicators: &Vec<Arc<&dyn HealthStatusIndicator>>?

nativelink-util/src/health_utils.rs line 94 at r3 (raw file):

Previously, blakehatch (Blake Hatch) wrote…
@blakehatch Do you mean deep copy by calling clone() here?

indicators: &Vec<Arc<dyn HealthStatusIndicator>> -> indicators: &[Arc<dyn HealthStatusIndicator>] .

Also yes I’m referring to making a deep copy by calling clone(). From my understanding cloning a Vec in Rust creates a deep copy because it duplicates the Vec's elements and underlying memory of said elements.

@adam-singer adam-singer force-pushed the adams/health-check-components branch from 01188ae to 64bef39 Compare February 1, 2024 22:00
Copy link
Contributor Author

@adam-singer adam-singer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 0 of 1 LGTMs obtained, and pending CI: Analyze (python), Bazel Dev / ubuntu-22.04, Cargo Dev / macos-13, Cargo Dev / ubuntu-22.04, Local / ubuntu-22.04, Remote / large-ubuntu-22.04, asan / ubuntu-22.04, docker-compose-compiles-nativelink (20.04), docker-compose-compiles-nativelink (22.04), integration-tests (20.04), integration-tests (22.04), macos-13, pre-commit-checks, publish-image, ubuntu-20.04 / stable, ubuntu-22.04, ubuntu-22.04 / stable, windows-2022 / stable, zig-cc ubuntu-20.04, zig-cc ubuntu-22.04 (waiting on @blakehatch)


nativelink-util/src/health_utils.rs line 94 at r3 (raw file):

Previously, blakehatch (Blake Hatch) wrote…

indicators: &Vec<Arc<dyn HealthStatusIndicator>> -> indicators: &[Arc<dyn HealthStatusIndicator>] .

Also yes I’m referring to making a deep copy by calling clone(). From my understanding cloning a Vec in Rust creates a deep copy because it duplicates the Vec's elements and underlying memory of said elements.

Done.


nativelink-util/src/health_utils.rs line 112 at r3 (raw file):

Previously, blakehatch (Blake Hatch) wrote…

Ditto

Done.

Copy link
Member

@blakehatch blakehatch left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm:

Reviewed 6 of 6 files at r4, all commit messages.
Reviewable status: 1 of 1 LGTMs obtained, and pending CI: Bazel Dev / ubuntu-22.04, Cargo Dev / macos-13, Cargo Dev / ubuntu-22.04, Remote / large-ubuntu-22.04, asan / ubuntu-22.04, docker-compose-compiles-nativelink (20.04), docker-compose-compiles-nativelink (22.04), integration-tests (20.04), integration-tests (22.04), macos-13, publish-image, ubuntu-20.04 / stable, ubuntu-22.04, ubuntu-22.04 / stable, windows-2022 / stable, zig-cc ubuntu-20.04, zig-cc ubuntu-22.04

Copy link
Member

@allada allada left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed all commit messages.
Reviewable status: 1 of 1 LGTMs obtained

Copy link
Contributor Author

@adam-singer adam-singer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 1 of 1 LGTMs obtained, and pending CI: Analyze (javascript-typescript), Analyze (python), Bazel Dev / ubuntu-22.04, Cargo Dev / macos-13, Cargo Dev / ubuntu-22.04, Local / ubuntu-22.04, Remote / large-ubuntu-22.04, asan / ubuntu-22.04, docker-compose-compiles-nativelink (20.04), docker-compose-compiles-nativelink (22.04), integration-tests (20.04), integration-tests (22.04), macos-13, pre-commit-checks, publish-image, ubuntu-20.04 / stable, ubuntu-22.04, ubuntu-22.04 / stable, windows-2022 / stable, zig-cc ubuntu-20.04, zig-cc ubuntu-22.04


nativelink-util/src/health_utils.rs line 139 at r6 (raw file):

Previously, allada (Nathan (Blaise) Bruer) wrote…

Don't hate me for this, but I have an idea.

What if we rewrite all this with this contract:

pub fn flatten(&self) -> impl Iterator<Item = (HealthComponentName, Arc<dyn HealthStatusIndicator>)> {
  // ...
}

trait HealthStatusReporter<S: Stream<Type = HealthStatusDescription>> {
  fn health_status_report(&self) -> S;
}

impl <T, S> HealthStatusReporter<S> for T
where
    T: Iterator<Item = (HealthComponentName, Arc<dyn HealthStatusIndicator>)>,
    S: Stream<Type = HealthStatusDescription>
{
  fn health_status_report(&self) -> S {
    // ...
  }
}

The idea here is that you can flatten() this into a collection, then with just the collection generate the report, without having to build the collection every time (if you are not going to add more items to the registry any more, which we don't).

I know this is more similar to what you had before.

Thoughts?

Done.


nativelink-util/src/health_utils.rs line 139 at r6 (raw file):

Previously, allada (Nathan (Blaise) Bruer) wrote…

nit: mut doesn't appear to be needed.

Done.

@adam-singer adam-singer force-pushed the adams/health-check-components branch from 52ebbe0 to 24a98c2 Compare February 12, 2024 18:31
Copy link
Contributor Author

@adam-singer adam-singer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 1 of 1 LGTMs obtained, and pending CI: Bazel Dev / ubuntu-22.04, Vercel, asan / ubuntu-22.04, pre-commit-checks, publish-image, ubuntu-20.04 / stable

a discussion (no related file):

Previously, allada (Nathan (Blaise) Bruer) wrote…

Finished my ultra-fine grained review.

A lot of these are more suggestions. The reason I'm being a bit more careful here is because I think we may want to publish this API to a rust crate, so we should get the API solid before doing so.

@allada ptal.

I also forgot, which stores we don't want to provide generic health check verifications for?


Copy link

Copy link
Member

@allada allada left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm:

Reviewed 2 of 4 files at r7, 23 of 25 files at r11, 2 of 2 files at r12, all commit messages.
Reviewable status: 1 of 1 LGTMs obtained


-- commits line 2 at r12:
nit: This is [Breaking], since it changes the behavior of /status and some users may rely on /status hard codded to return Ok in a happy path. I also suggest mentioning /status in the title.


nativelink-util/src/health_utils.rs line 40 at r12 (raw file):

        message: Cow<'static, Message>,
    },
    /// Initializing status, used to signal or indicate component is initializing.

nit: all these comments seems silly and just describing the definition of the word. I'd suggest removing them, as I don't see these comments adding any value except maybe the non-fatal part.


nativelink-util/src/health_utils.rs line 64 at r12 (raw file):

impl HealthStatus {
    /// Make a healthy status.
    pub fn new_ok(component: &(impl HealthStatusIndicator + ?Sized), message: Cow<'static, str>) -> Self {

nit: Does this need impl and Sized? Can't it just be &HealthStatusIndicator ? (same with below). I'd prefer not to let this pattern be copy-pasta'ed if possible.


nativelink-util/src/health_utils.rs line 115 at r12 (raw file):

    /// Returns the name of the struct implementing the trait.
    fn struct_name(&self) -> &'static str {

nit: const fn? Not sure if this is possible in trait but worth a try.


nativelink-util/src/health_utils.rs line 196 at r12 (raw file):

/// Generally used for components that don't need custom implementations
/// of the `check_health` function.
#[macro_export]

nit: Lets try not exporting with this macro and then import with nativelink_util::health_utils::default_health_status_indicator.


nativelink-util/src/store_trait.rs line 176 at r12 (raw file):

    // in situations where the default implementation is not sufficient.
    async fn check_health(self: Pin<&Self>, _namespace: Cow<'static, str>) -> HealthStatus {
        let bytes = _namespace.as_bytes();

nit: Don't prefix with _ if it's being used.


nativelink-util/src/store_trait.rs line 177 at r12 (raw file):

    async fn check_health(self: Pin<&Self>, _namespace: Cow<'static, str>) -> HealthStatus {
        let bytes = _namespace.as_bytes();
        let hash = blake3::hash(bytes).into();

nit: does namespace have the component name itself too or just the path up to the component? We should probably pipe in the store's name if it's not part of the namespace.


nativelink-util/src/store_trait.rs line 180 at r12 (raw file):

        let len = bytes.len();
        let digest_info = DigestInfo::new(hash, len as i64);
        let bytes_copy = bytes::Bytes::copy_from_slice(bytes);

nit: Lets make this a global config that contains the size of the digest to upload, filled with statically random generated (ie: use a constant random seed) with some of the bytes substituted with the namespace.

I'm worried that memory & filesystem store checks will have these default payloads far too small and never trigger the real problems.


nativelink-util/src/store_trait.rs line 182 at r12 (raw file):

        let bytes_copy = bytes::Bytes::copy_from_slice(bytes);

        match self.update_oneshot(digest_info, bytes_copy.clone()).await {

nit, might be cleaner using

if let Err(e) = self.update_oneshot(digest_info, bytes_copy.clone()).await {
  return HealthStatus::new_failed(
    self.get_ref(),
    format!("Store.update_oneshot() failed: {}", e).into(),
  );
}

nativelink-util/src/store_trait.rs line 206 at r12 (raw file):

        }

        match self.has(digest_info).await {

super nit: I suggest doing this before get_part as knowing it didn't exist is going to be more useful than .get_part_unchunked failed() error.


nativelink-util/tests/health_utils_test.rs line 22 at r12 (raw file):

use futures::StreamExt;
use nativelink_error::Error;
use nativelink_util::health_utils::*;

nit: Can we spell out what we use? I really don't like using * as it makes code search less useful, since it hides away where the stuff is really coming from.


src/bin/nativelink.rs line 387 at r12 (raw file):

                            match serde_json5::to_string(&health_status_descriptions) {
                                Ok(health_status_descriptions) => Response::builder()
                                    .status(StatusCode::OK)

Hmmm, I think we want a 503 in the case where any of the reports came back error (and maybe warning?). Failing to do this would make using this in a cloud health checker unusable.

@adam-singer adam-singer force-pushed the adams/health-check-components branch from 24a98c2 to 6537abc Compare February 14, 2024 01:39
Copy link
Contributor Author

@adam-singer adam-singer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 1 of 1 LGTMs obtained


-- commits line 2 at r12:

Previously, allada (Nathan (Blaise) Bruer) wrote…

nit: This is [Breaking], since it changes the behavior of /status and some users may rely on /status hard codded to return Ok in a happy path. I also suggest mentioning /status in the title.

Done.


nativelink-util/src/health_utils.rs line 40 at r12 (raw file):

Previously, allada (Nathan (Blaise) Bruer) wrote…

nit: all these comments seems silly and just describing the definition of the word. I'd suggest removing them, as I don't see these comments adding any value except maybe the non-fatal part.

Done.


nativelink-util/src/health_utils.rs line 64 at r12 (raw file):

Previously, allada (Nathan (Blaise) Bruer) wrote…

nit: Does this need impl and Sized? Can't it just be &HealthStatusIndicator ? (same with below). I'd prefer not to let this pattern be copy-pasta'ed if possible.

https://gist.github.com/adam-singer/49faf251ca251ff7fe40f832105cd34b#file-gistfile0-txt-L2 if we remove ?Sized, if we try &HealthStatusIndicator https://gist.github.com/adam-singer/c75e1d177dd105307b68230bd2bafc30 and then adding dyn just leads back to https://gist.github.com/adam-singer/ff64c6acd44234e001a77a1685a8c4a6.

I like that the helper methods depend on HealthStatusIndicator generally speaking, but using ?Sized is bit overly flexible, so I'd rather just remove the helper methods then trying to fighting this.


nativelink-util/src/health_utils.rs line 115 at r12 (raw file):

Previously, allada (Nathan (Blaise) Bruer) wrote…

nit: const fn? Not sure if this is possible in trait but worth a try.

not allowed https://gist.github.com/adam-singer/219603d86abba8f66cbf10f6e9b6e97a

Done.


nativelink-util/src/health_utils.rs line 196 at r12 (raw file):

Previously, allada (Nathan (Blaise) Bruer) wrote…

nit: Lets try not exporting with this macro and then import with nativelink_util::health_utils::default_health_status_indicator.

Done.

Tried reading up on different ways to cross the crate boundaries with macros, it doesn't seem you can do so without macro_export. You can "re-scope" by doing pub use create::<macro_name> within a mod, that will make the macro "scoped" in some use for use statements.

https://users.rust-lang.org/t/re-exporting-macro-export-ed-macros-by-module-where-they-are-defined/84130
https://veykril.github.io/tlborm/decl-macros/minutiae/scoping.html
https://veykril.github.io/tlborm/decl-macros/minutiae/scoping.html#path-based-scope
https://doc.rust-lang.org/reference/macros-by-example.html

At least I didn't find another pattern that makes that possible from the above.


nativelink-util/src/store_trait.rs line 176 at r12 (raw file):

Previously, allada (Nathan (Blaise) Bruer) wrote…

nit: Don't prefix with _ if it's being used.

Done.


nativelink-util/src/store_trait.rs line 177 at r12 (raw file):

Previously, allada (Nathan (Blaise) Bruer) wrote…

nit: does namespace have the component name itself too or just the path up to the component? We should probably pipe in the store's name if it's not part of the namespace.

it does not, but we should have a reference to the store name, self.get_name()


nativelink-util/src/store_trait.rs line 182 at r12 (raw file):

Previously, allada (Nathan (Blaise) Bruer) wrote…

nit, might be cleaner using

if let Err(e) = self.update_oneshot(digest_info, bytes_copy.clone()).await {
  return HealthStatus::new_failed(
    self.get_ref(),
    format!("Store.update_oneshot() failed: {}", e).into(),
  );
}

Nice!

Done.


nativelink-util/tests/health_utils_test.rs line 22 at r12 (raw file):

Previously, allada (Nathan (Blaise) Bruer) wrote…

nit: Can we spell out what we use? I really don't like using * as it makes code search less useful, since it hides away where the stuff is really coming from.

Done.


src/bin/nativelink.rs line 387 at r12 (raw file):

Previously, allada (Nathan (Blaise) Bruer) wrote…

Hmmm, I think we want a 503 in the case where any of the reports came back error (and maybe warning?). Failing to do this would make using this in a cloud health checker unusable.

sgtm, the k8s stuff is very status code dependent and probably won't inspect body much.

Done.

Copy link
Member

@allada allada left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 16 of 17 files at r13, all commit messages.
Reviewable status: 1 of 1 LGTMs obtained, and pending CI: asan / ubuntu-22.04, docker-compose-compiles-nativelink (20.04), docker-compose-compiles-nativelink (22.04), macos-13


nativelink-util/src/store_trait.rs line 177 at r12 (raw file):

Previously, adam-singer (Adam Singer) wrote…

it does not, but we should have a reference to the store name, self.get_name()

In that case we probably want to add self.get_name() salt to the hash or else we might get collisions.

@adam-singer adam-singer force-pushed the adams/health-check-components branch from 6537abc to 36172a0 Compare February 14, 2024 09:22
Copy link
Contributor Author

@adam-singer adam-singer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 1 of 1 LGTMs obtained, and pending CI: Vercel, pre-commit-checks


nativelink-util/src/store_trait.rs line 177 at r12 (raw file):

Previously, allada (Nathan (Blaise) Bruer) wrote…

In that case we probably want to add self.get_name() salt to the hash or else we might get collisions.

Done.


nativelink-util/src/store_trait.rs line 180 at r12 (raw file):

Previously, allada (Nathan (Blaise) Bruer) wrote…

nit: Lets make this a global config that contains the size of the digest to upload, filled with statically random generated (ie: use a constant random seed) with some of the bytes substituted with the namespace.

I'm worried that memory & filesystem store checks will have these default payloads far too small and never trigger the real problems.

Done.

@adam-singer adam-singer force-pushed the adams/health-check-components branch from 36172a0 to f243c53 Compare February 14, 2024 09:25
Copy link
Contributor Author

@adam-singer adam-singer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 1 of 1 LGTMs obtained, and pending CI: Vercel, pre-commit-checks


nativelink-util/src/store_trait.rs line 206 at r12 (raw file):

Previously, allada (Nathan (Blaise) Bruer) wrote…

super nit: I suggest doing this before get_part as knowing it didn't exist is going to be more useful than .get_part_unchunked failed() error.

Done.

Copy link
Member

@allada allada left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 7 of 8 files at r14, 1 of 1 files at r15, all commit messages.
Reviewable status: 1 of 1 LGTMs obtained


nativelink-config/src/cas_server.rs line 634 at r15 (raw file):

    /// Default digest size to use for health check when running
    /// diagnostics checks.

super nit: Lets be a bit more descriptive on what this is doing.


nativelink-util/src/health_utils.rs line 189 at r15 (raw file):

                &self,
                namespace: std::borrow::Cow<'static, str>,
            ) -> nativelink_util::health_utils::HealthStatus {

nit: Just import the name.


nativelink-util/src/store_trait.rs line 39 at r15 (raw file):

pub fn default_digest_size_health_check() -> usize {
    *DEFAULT_DIGEST_SIZE_HEALTH_CHECK.get_or_init(|| 1024 * 1024)

nit: Make this number a const and make sure it's documented that if it changes it must be updated in the config file.


nativelink-util/src/store_trait.rs line 204 at r15 (raw file):

        // Fill data with all random bytes.
        HEALTH_CHECK_RNG.lock().fill_bytes(&mut digest_data);

nit: We want this number to be the same on every call, but different for every component.

The way I'd do this is use the namespace and component name to generate a random int32, then feed that number into the random number generator seed, then fill this.

This ensures we get the same data every call, but different for every component. (otherwise we pollute our store with a lot of useless data.

There's other ways to do this too, but the way I outlined above seems simplest albeit slower.


nativelink-util/src/store_trait.rs line 222 at r15 (raw file):

        }

        let digest_data_hash = blake3::hash(&digest_data).into();

nit: Don't use blake directly, use DigestHasher::from(default_digest_hasher_func()).update(...);

@adam-singer adam-singer force-pushed the adams/health-check-components branch from f243c53 to 9caec5f Compare February 15, 2024 01:14
Copy link
Contributor Author

@adam-singer adam-singer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 1 of 1 LGTMs obtained


nativelink-config/src/cas_server.rs line 634 at r15 (raw file):

Previously, allada (Nathan (Blaise) Bruer) wrote…

super nit: Lets be a bit more descriptive on what this is doing.

Done.


nativelink-util/src/health_utils.rs line 189 at r15 (raw file):

Previously, allada (Nathan (Blaise) Bruer) wrote…

nit: Just import the name.

Done.

When macros are expanded some imports will conflict, in this case std::borrow::Cow and nativelink\_util::health\_utils::HealthStatus could already be imported in some scopes and not others. Tried suggested of just importing in this scope and in the macro scope, became a ping/pong of who imported what.


nativelink-util/src/store_trait.rs line 39 at r15 (raw file):

Previously, allada (Nathan (Blaise) Bruer) wrote…

nit: Make this number a const and make sure it's documented that if it changes it must be updated in the config file.

Done.


nativelink-util/src/store_trait.rs line 204 at r15 (raw file):

Previously, allada (Nathan (Blaise) Bruer) wrote…

nit: We want this number to be the same on every call, but different for every component.

The way I'd do this is use the namespace and component name to generate a random int32, then feed that number into the random number generator seed, then fill this.

This ensures we get the same data every call, but different for every component. (otherwise we pollute our store with a lot of useless data.

There's other ways to do this too, but the way I outlined above seems simplest albeit slower.

Done.


nativelink-util/src/store_trait.rs line 222 at r15 (raw file):

Previously, allada (Nathan (Blaise) Bruer) wrote…

nit: Don't use blake directly, use DigestHasher::from(default_digest_hasher_func()).update(...);

Done.

… based health

Introduce a component based health check system. Each type of component
should be able to opt into registering handlers implement some mechanical
checks of health. Health in this context is functionality expected to
work but runtime wise are semi no-op in terms of influencing the underlying
storage / rpc systems.

Opting in to the system requires for component to define a `HealthStatusIndicator`
and implement the `check_health()` function. Registration should automatically
be done by the existence of implementation and calling the `Store.register_health()`
function in component implementation. At the moment only `Store` based components
can register and a single health check is defined for `FilesystemStore`.

A global parameter `default_digest_size_health_check` has been introduced for
configuring the static seeded random bytes to fill data payload. The default
value is 1MB.

The `/status` endpoint has been updated to return the resulting string serialized
instances of `HealthStatus`, smart serialization of such objects is not implemented
at this time.
@adam-singer adam-singer force-pushed the adams/health-check-components branch from 9caec5f to 8eb0ee1 Compare February 15, 2024 01:16
Copy link
Member

@allada allada left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 6 of 6 files at r16, all commit messages.
Reviewable status: 1 of 1 LGTMs obtained, and pending CI: Analyze (javascript-typescript), Analyze (python), Bazel Dev / ubuntu-22.04, Cargo Dev / ubuntu-22.04, Local / ubuntu-22.04, Remote / large-ubuntu-22.04, Vercel, asan / ubuntu-22.04, docker-compose-compiles-nativelink (20.04), docker-compose-compiles-nativelink (22.04), integration-tests (20.04), pre-commit-checks, publish-image, ubuntu-20.04 / stable, ubuntu-22.04, ubuntu-22.04 / stable, vale, windows-2022 / stable, zig-cc ubuntu-20.04, zig-cc ubuntu-22.04

Copy link
Member

@allada allada left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 1 of 1 LGTMs obtained

Copy link
Member

@allada allada left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm:

Reviewable status: :shipit: complete! 1 of 1 LGTMs obtained

Copy link
Contributor Author

@adam-singer adam-singer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm:

Reviewable status: :shipit: complete! 1 of 1 LGTMs obtained

@adam-singer adam-singer merged commit 48cadc7 into TraceMachina:main Feb 15, 2024
26 checks passed
@adam-singer adam-singer deleted the adams/health-check-components branch February 15, 2024 01:56
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants