Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Factor out health status checks to it's own service file #823

Merged

Conversation

blizzardc0der
Copy link
Contributor

@blizzardc0der blizzardc0der commented Apr 1, 2024

Description

Health status checks should be in it's own service called Health Service.
I made a new service for checking health status.

Fixes #643

Type of change

  • New feature (non-breaking change which adds functionality)

Checklist

  • bazel test //... passes locally
  • PR is contained in a single commit, using git amend see some docs

This change is Reviewable

@blizzardc0der
Copy link
Contributor Author

I hope you to review this. @allada , @MarcusSorealheis .

@blizzardc0der blizzardc0der force-pushed the factor-out-health-status-checks branch from b0f030e to c4b136b Compare April 1, 2024 19:12
Copy link
Contributor

@adam-singer adam-singer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 5 of 6 files at r1, all commit messages.
Reviewable status: 0 of 1 LGTMs obtained, and 1 discussions need to be resolved


nativelink-service/src/health_server.rs line 47 at r1 (raw file):

    }

    pub async fn check_health_status(&self, json_content_type: &'static str) -> Response<String> {

I don't think there is value yet in passing json_content_type since currently we are serializing only with serde_json5. We could introduce a renderer/formatter/writer pattern here at some point, to keep things incremental I'd suggest not doing that yet and don't pass a parameter.

Copy link
Member

@allada allada left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 0 of 1 LGTMs obtained, and 3 discussions need to be resolved


nativelink-service/src/health_server.rs line 1 at r1 (raw file):

// Copyright 2023 The NativeLink Authors. All rights reserved.

nit: 2024.


src/bin/nativelink.rs line 412 at r1 (raw file):

                    spawn_blocking(move || {
                        futures::executor::block_on(
                            health_server_cloned.check_health_status(JSON_CONTENT_TYPE),

We want to make this similar to the other services and load them in in the same manner (config).

see the area in this file around .add_optional_service(

We want this health checker to implement that service api and it should be enabled based on the config.

@blizzardc0der blizzardc0der force-pushed the factor-out-health-status-checks branch from c4b136b to ef125c4 Compare April 4, 2024 14:06
Copy link
Contributor Author

@blizzardc0der blizzardc0der left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 0 of 1 LGTMs obtained, and pending CI: Bazel Dev / ubuntu-22.04, Cargo Dev / macos-13, Cargo Dev / ubuntu-22.04, Local / ubuntu-22.04, Publish image, Publish nativelink-worker-lre-cc, Remote / large-ubuntu-22.04, asan / ubuntu-22.04, docker-compose-compiles-nativelink (20.04), docker-compose-compiles-nativelink (22.04), integration-tests (20.04), integration-tests (22.04), macos-13, ubuntu-20.04 / stable, ubuntu-22.04, ubuntu-22.04 / stable, windows-2022 / stable, zig-cc ubuntu-20.04, zig-cc ubuntu-22.04, and 2 discussions need to be resolved


nativelink-service/src/health_server.rs line 47 at r1 (raw file):

Previously, adam-singer (Adam Singer) wrote…

I don't think there is value yet in passing json_content_type since currently we are serializing only with serde_json5. We could introduce a renderer/formatter/writer pattern here at some point, to keep things incremental I'd suggest not doing that yet and don't pass a parameter.

Done.


src/bin/nativelink.rs line 412 at r1 (raw file):

Previously, allada (Nathan (Blaise) Bruer) wrote…

We want to make this similar to the other services and load them in in the same manner (config).

see the area in this file around .add_optional_service(

We want this health checker to implement that service api and it should be enabled based on the config.

Done.

Copy link
Contributor

@adam-singer adam-singer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm:

Reviewed 4 of 4 files at r2, all commit messages.
Reviewable status: 1 of 1 LGTMs obtained, and pending CI: integration-tests (20.04), integration-tests (22.04), and 1 discussions need to be resolved

@blizzardc0der blizzardc0der force-pushed the factor-out-health-status-checks branch from ef125c4 to 450bf36 Compare April 5, 2024 16:30
Copy link
Member

@allada allada left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 4 of 6 files at r1, 2 of 4 files at r2, all commit messages.
Reviewable status: 1 of 1 LGTMs obtained, and pending CI: Analyze (javascript-typescript), Analyze (python), Bazel Dev / ubuntu-22.04, Cargo Dev / macos-13, Cargo Dev / ubuntu-22.04, Local / ubuntu-22.04, Publish image, Remote / large-ubuntu-22.04, asan / ubuntu-22.04, docker-compose-compiles-nativelink (20.04), docker-compose-compiles-nativelink (22.04), integration-tests (20.04), integration-tests (22.04), macos-13, ubuntu-20.04 / stable, ubuntu-22.04, ubuntu-22.04 / stable, windows-2022 / stable, zig-cc ubuntu-20.04, zig-cc ubuntu-22.04, and 2 discussions need to be resolved


-- commits line 2 at r2:
Please update this with:

[Breaking] Factor out health status checks to its own service

Health status checks are now a proper service and no longer enabled by
default.

Add `"health": {},` under your services config to enable.

nativelink-service/src/health_server.rs line 30 at r2 (raw file):

#[derive(Clone)]
pub struct HealthServer {
    health_registry_builder: Arc<AsyncMutex<HealthRegistryBuilder>>,

nit: This should not be part of this class. You probably want to bring in HealthRegistry during construction. The builder should be kept outside of this class.


src/bin/nativelink.rs line 412 at r1 (raw file):

Previously, blizzardc0der (Blizzardc0der) wrote…

Done.

You want to implement tower_service::Service for HealthServer then use .add_service or .add_optional_service. This will offload nearly all the complexity to that module.

@blizzardc0der blizzardc0der force-pushed the factor-out-health-status-checks branch 4 times, most recently from d93f6c3 to 4a8f519 Compare April 9, 2024 09:39
Copy link
Member

@allada allada left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 2 of 2 files at r3.
Reviewable status: 1 of 1 LGTMs obtained, and pending CI: integration-tests (20.04), integration-tests (22.04), and 2 discussions need to be resolved

@steedmicro steedmicro force-pushed the factor-out-health-status-checks branch from 4a8f519 to 52dffb2 Compare April 12, 2024 01:41
@blizzardc0der blizzardc0der force-pushed the factor-out-health-status-checks branch 2 times, most recently from 0dc7efe to caf1ed7 Compare April 13, 2024 18:02
Copy link
Contributor Author

@blizzardc0der blizzardc0der left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 1 of 1 LGTMs obtained, and pending CI: Analyze (javascript-typescript), Analyze (python), Bazel Dev / ubuntu-22.04, Cargo Dev / macos-13, Cargo Dev / ubuntu-22.04, Local / ubuntu-22.04, Publish image, Publish nativelink-worker-lre-cc, Remote / large-ubuntu-22.04, asan / ubuntu-22.04, docker-compose-compiles-nativelink (20.04), docker-compose-compiles-nativelink (22.04), integration-tests (20.04), integration-tests (22.04), macos-13, pre-commit-checks, ubuntu-20.04 / stable, ubuntu-22.04, ubuntu-22.04 / stable, vale, windows-2022 / stable, zig-cc ubuntu-20.04, zig-cc ubuntu-22.04, and 1 discussions need to be resolved


-- commits line 2 at r2:

Previously, allada (Nathan (Blaise) Bruer) wrote…

Please update this with:

[Breaking] Factor out health status checks to its own service

Health status checks are now a proper service and no longer enabled by
default.

Add `"health": {},` under your services config to enable.

Done.


src/bin/nativelink.rs line 412 at r1 (raw file):

Previously, allada (Nathan (Blaise) Bruer) wrote…

You want to implement tower_service::Service for HealthServer then use .add_service or .add_optional_service. This will offload nearly all the complexity to that module.

It's currently derived from tower_service::Service, although it can't be added by .add_service or .add_optional_service because they are functions of tonic::Service which is GRPC server not Http Server.
So instead, I added that service by route_service function.

Copy link
Member

@allada allada left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm:

Reviewed 6 of 6 files at r5, all commit messages.
Reviewable status: 2 of 1 LGTMs obtained, and pending CI: Bazel Dev / ubuntu-22.04, asan / ubuntu-22.04, docker-compose-compiles-nativelink (20.04), docker-compose-compiles-nativelink (22.04), integration-tests (20.04), integration-tests (22.04), macos-13, ubuntu-22.04, zig-cc ubuntu-20.04, zig-cc ubuntu-22.04, and 8 discussions need to be resolved


nativelink-service/src/health_server.rs line 14 at r5 (raw file):

// See the License for the specific language governing permissions and
// limitations under the License.
use futures::StreamExt;

nit: for consistency new line above.


nativelink-service/src/health_server.rs line 30 at r5 (raw file):

#[derive(Clone)]
pub struct HealthServer {
    pub health_registry_status: HealthRegistry,

nit: This should not be public. If it's needed make an access function. Making it public allows anyone to update it if they have a it mutable.


nativelink-service/src/health_server.rs line 34 at r5 (raw file):

impl HealthServer {
    pub async fn new(health_registry_status: HealthRegistry) -> Result<Self, Error> {

nit: just call all these health_registry. It's confusing to call it _status.


nativelink-service/src/health_server.rs line 52 at r5 (raw file):

    fn call(&mut self, _req: Request<Body>) -> Self::Future {
        let health_registry_status = self.health_registry_status.clone();
        let fut = async move {

nit:

Pin::box(async move {
  //...
})

nativelink-service/src/health_server.rs line 53 at r5 (raw file):

        let health_registry_status = self.health_registry_status.clone();
        let fut = async move {
            let health_status_descriptions: Vec<HealthStatusDescription> = health_registry_status

nit: Move this above (outside this future, then we won't need to clone it (above).


nativelink-service/src/health_server.rs line 83 at r5 (raw file):

                    .status(StatusCode::INTERNAL_SERVER_ERROR)
                    .header(
                        hyper::header::CONTENT_TYPE,

nit: Please import these and use them instead of listing out the namespaces.


nativelink-util/src/health_utils.rs line 167 at r5 (raw file):

    fn health_status_report(
        &self,
    ) -> Pin<Box<dyn Stream<Item = HealthStatusDescription> + '_ + Send>>;

nit: '_ should always be at end.


nativelink-util/src/health_utils.rs line 175 at r5 (raw file):

    fn health_status_report(
        &self,
    ) -> Pin<Box<dyn Stream<Item = HealthStatusDescription> + '_ + Send>> {

nit: ditto.

@blizzardc0der blizzardc0der force-pushed the factor-out-health-status-checks branch from caf1ed7 to 0638fa2 Compare April 15, 2024 12:49
Copy link
Contributor Author

@blizzardc0der blizzardc0der left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 2 of 1 LGTMs obtained, and pending CI: Analyze (javascript-typescript), Analyze (python), Bazel Dev / ubuntu-22.04, Cargo Dev / macos-13, Cargo Dev / ubuntu-22.04, Local / ubuntu-22.04, Publish image, Publish nativelink-worker-lre-cc, Remote / large-ubuntu-22.04, asan / ubuntu-22.04, docker-compose-compiles-nativelink (20.04), docker-compose-compiles-nativelink (22.04), integration-tests (20.04), integration-tests (22.04), macos-13, pre-commit-checks, ubuntu-20.04 / stable, ubuntu-22.04, ubuntu-22.04 / stable, vale, windows-2022 / stable, zig-cc ubuntu-20.04, zig-cc ubuntu-22.04, and 1 discussions need to be resolved


nativelink-service/src/health_server.rs line 53 at r5 (raw file):

Previously, allada (Nathan (Blaise) Bruer) wrote…

nit: Move this above (outside this future, then we won't need to clone it (above).

Just fyi, I've tried your solution but it didn't work.
tower::Service trait's call function is defined as synchronous function.
Although, the health_status_description variable needs to determined in async function.
So I added this line above using block_on function to call async function inside call function.

let health_status_descriptions: Vec<HealthStatusDescription> = futures::executor::block_on(async move { self.health_registry.health_status_report().collect().await });

Although, when I tested it with Postman about /status endpoint, it didn't work and hang.
So I didn't apply this changes.

@blizzardc0der blizzardc0der requested a review from allada April 15, 2024 12:53
Copy link
Member

@allada allada left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 4 of 4 files at r6, all commit messages.
Reviewable status: 2 of 1 LGTMs obtained, and pending CI: Bazel Dev / ubuntu-22.04, asan / ubuntu-22.04, docker-compose-compiles-nativelink (20.04), docker-compose-compiles-nativelink (22.04), integration-tests (20.04), integration-tests (22.04), macos-13, ubuntu-22.04, zig-cc ubuntu-20.04, zig-cc ubuntu-22.04, and 3 discussions need to be resolved


nativelink-service/Cargo.toml line 14 at r6 (raw file):

nativelink-scheduler = { path = "../nativelink-scheduler" }

async-lock = "3.3.0"

nit: unused.


nativelink-service/src/health_server.rs line 53 at r5 (raw file):

Previously, blizzardc0der (Blizzardc0der) wrote…

Just fyi, I've tried your solution but it didn't work.
tower::Service trait's call function is defined as synchronous function.
Although, the health_status_description variable needs to determined in async function.
So I added this line above using block_on function to call async function inside call function.

let health_status_descriptions: Vec<HealthStatusDescription> = futures::executor::block_on(async move { self.health_registry.health_status_report().collect().await });

Although, when I tested it with Postman about /status endpoint, it didn't work and hang.
So I didn't apply this changes.

ok, it's fine this way, but the reason you are probably fighting it is because of the future lifetime. You could bind it to the same lifetime as self and it would probably be happy.


nativelink-service/src/health_server.rs line 21 at r6 (raw file):

    HealthRegistry, HealthStatus, HealthStatusDescription, HealthStatusReporter,
};
use std::future::Future;

nit: std's go first followed by a new line (see other files for example).


nativelink-service/src/health_server.rs line 43 at r6 (raw file):

    type Response = Response<hyper::Body>;
    type Error = std::convert::Infallible;
    type Future = Pin<Box<dyn Future<Output = Result<Self::Response, Self::Error>> + Send>>;

nit: It looks like there's some new syntax in rust you can use here to let the compiler know about the exact future:

impl Service<Request<hyper::Body>> for HealthServer {
    type Response = Response<hyper::Body>;
    type Error = std::convert::Infallible;
    type Future = impl Future<Output = Result<Self::Response, Self::Error>> + Send>;

    fn call(&mut self, _req: Request<Body>) -> impl Self::Future {
      async move {
        //...
      }
    }
}

see: https://rust-lang.github.io/async-fundamentals-initiative/evaluation/challenges/dyn_traits.html

@steedmicro steedmicro force-pushed the factor-out-health-status-checks branch from 0638fa2 to eeecbf4 Compare April 16, 2024 13:34
@blizzardc0der blizzardc0der force-pushed the factor-out-health-status-checks branch from eeecbf4 to 486da3e Compare April 16, 2024 13:45
Copy link
Contributor Author

@blizzardc0der blizzardc0der left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've made CI passed.

Reviewable status: 2 of 1 LGTMs obtained, and pending CI: Bazel Dev / ubuntu-22.04, Cargo Dev / macos-13, Cargo Dev / ubuntu-22.04, Remote / large-ubuntu-22.04, docker-compose-compiles-nativelink (20.04), docker-compose-compiles-nativelink (22.04), integration-tests (20.04), integration-tests (22.04), macos-13, ubuntu-20.04 / stable, ubuntu-22.04 / stable, windows-2022 / stable, zig-cc ubuntu-20.04, zig-cc ubuntu-22.04, and 1 discussions need to be resolved


nativelink-service/src/health_server.rs line 53 at r5 (raw file):

Previously, allada (Nathan (Blaise) Bruer) wrote…

ok, it's fine this way, but the reason you are probably fighting it is because of the future lifetime. You could bind it to the same lifetime as self and it would probably be happy.

I've retried with the solution but couldn't make it work because of tower::Serivce's declared Traits were strict and wasn't flexible.
To prevent from being obsessed with this, I'd like to leave it as for now and work on more urgent issues.


nativelink-service/src/health_server.rs line 43 at r6 (raw file):

Previously, allada (Nathan (Blaise) Bruer) wrote…

nit: It looks like there's some new syntax in rust you can use here to let the compiler know about the exact future:

impl Service<Request<hyper::Body>> for HealthServer {
    type Response = Response<hyper::Body>;
    type Error = std::convert::Infallible;
    type Future = impl Future<Output = Result<Self::Response, Self::Error>> + Send>;

    fn call(&mut self, _req: Request<Body>) -> impl Self::Future {
      async move {
        //...
      }
    }
}

see: https://rust-lang.github.io/async-fundamentals-initiative/evaluation/challenges/dyn_traits.html

I've tried that solution but it didn't work and failed with compile errors.
Seems like tower::Service trait's call trait doesn't support that option to declare in that syntax.

@blizzardc0der blizzardc0der requested a review from allada April 16, 2024 13:54
Health status checks are now a proper service and no longer enabled by
default.

Add `"health": {},` under your services config to enable.
@blizzardc0der blizzardc0der force-pushed the factor-out-health-status-checks branch from 4e87cc8 to 2c901b4 Compare April 16, 2024 14:46
Copy link
Contributor

@adam-singer adam-singer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 1 of 4 files at r6, 4 of 4 files at r7.
Reviewable status: 2 of 1 LGTMs obtained, and pending CI: Analyze (javascript-typescript), Local / ubuntu-22.04, Publish image, Vercel, asan / ubuntu-22.04, integration-tests (20.04), pre-commit-checks, ubuntu-20.04 / stable, ubuntu-22.04, vale, and 1 discussions need to be resolved

Copy link
Contributor

@adam-singer adam-singer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 1 of 4 files at r6.
Reviewable status: 2 of 1 LGTMs obtained, and pending CI: Analyze (javascript-typescript), Local / ubuntu-22.04, Publish image, Vercel, asan / ubuntu-22.04, integration-tests (20.04), pre-commit-checks, ubuntu-20.04 / stable, ubuntu-22.04, vale, and 1 discussions need to be resolved

Copy link
Member

@allada allada left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 13 of 13 files at r8, all commit messages.
Reviewable status: :shipit: complete! 2 of 1 LGTMs obtained

Copy link
Member

@allada allada left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 4 of 4 files at r7.
Reviewable status: :shipit: complete! 2 of 1 LGTMs obtained

@allada allada merged commit ea50856 into TraceMachina:main Apr 16, 2024
26 checks passed
chinchaun pushed a commit to chinchaun/nativelink that referenced this pull request May 6, 2024
…achina#823)

Health status checks are now a proper service and no longer enabled by
default.

Add `"health": {},` under your services config to enable.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Factor out health status checks to it's own service file
3 participants