KFP API: Pebble service fails to start #220

Closed

phoevos opened this issue Jun 14, 2023 · 2 comments
phoevos commented Jun 14, 2023

We're bumping into this issue intermittently on a clean install with the 2.0/edge (rev 413) version of the charm, when deploying the Kubeflow bundle (1.7/edge) either on MicroK8s or Charmed Kubernetes.

On startup, the KFP API Server tries to connect to MinIO. If MinIO is not yet available, the connection fails with the following error, causing the service to crash less than a second after it started.

Failed to check if Minio bucket exists. Error: Get "http://minio.kubeflow:9000/mlpipeline/?location=": dial tcp 10.152.183.92:9000: connect: connection refused

Since the service fails fast (in under 1 second), Pebble considers it inactive:

$ pebble services       
Service                 Startup  Current   Since
ml-pipeline-api-server  enabled  inactive  -

This is a design decision on the Pebble side, as explained in the Pebble documentation:

When starting a service, Pebble executes the service's command, and waits 1 second to ensure the command doesn't exit too quickly. Assuming the command doesn't exit within that time window, the start is considered successful, otherwise pebble start will exit with an error.

Because the service was never active, Pebble never attempts to restart it, despite the failing health checks.
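
In charm terms, this stuck state can be detected and cleared with a replan. A minimal sketch using the ops library (the helper name and its wiring into the charm are assumptions, not the actual charm code):

    from ops.pebble import ServiceStatus

    def _replan_if_inactive(container, service_name="ml-pipeline-api-server"):
        """Replan the workload if Pebble gave up on it after a fast exit."""
        services = container.get_services(service_name)
        if not services:
            return  # no Pebble layer defines the service yet
        if services[service_name].current != ServiceStatus.ACTIVE:
            # Pebble will not retry a service that exited within 1 second,
            # so the charm has to trigger the restart itself. This is the
            # programmatic equivalent of the manual workaround below.
            container.replan()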

Workaround

Since this issue is exposed by a race (i.e. MinIO not yet being available), it won't come up every time. If it does occur during deployment (after the rest of the bundle has been installed successfully), however, we need to start the API Server service manually to unblock it:

juju ssh kfp-api/0 "PEBBLE_SOCKET=/charm/containers/ml-pipeline-api-server/pebble.socket /charm/bin/pebble replan"

Mitigation

We need to come up with a plan to avoid bumping into this issue in the future. There are a couple of things that could be done on our side (both options are sketched after the list below):

  • Sleep for 1 second before starting any Pebble service
    • This is not ideal, since it defeats the purpose of this Pebble feature
  • Catch the error and manually retry starting the service on the charm side
    • This is not ideal, since it shifts the responsibility of service management from the service manager itself to the high-level charm code
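
For concreteness, here is a rough sketch of both options (the layer contents, check name, and retry helper are illustrative assumptions, not the actual charm code):

    import time

    from ops.pebble import ChangeError

    # Option 1: pad the service command so the process always outlives
    # Pebble's 1-second window. The start is then considered successful,
    # and Pebble's own restart logic (e.g. on-check-failure) takes over.
    # The command and check name below are placeholders.
    PADDED_LAYER = {
        "services": {
            "ml-pipeline-api-server": {
                "override": "replace",
                "command": "bash -c 'sleep 1.1 && /bin/apiserver'",
                "startup": "enabled",
                "on-check-failure": {"kfp-api-up": "restart"},
            }
        }
    }

    # Option 2: catch the fast-exit error and retry from the charm.
    def start_with_retries(container, service_name, attempts=3, delay=2.0):
        for attempt in range(1, attempts + 1):
            try:
                container.start(service_name)
                return
            except ChangeError:
                # The service exited within Pebble's 1-second window
                # (e.g. MinIO still unreachable); back off and try again.
                time.sleep(delay * attempt)
        raise RuntimeError(f"{service_name} did not start after {attempts} attempts")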
@phoevos phoevos added the bug Something isn't working label Jun 14, 2023
@phoevos phoevos self-assigned this Jun 14, 2023
phoevos added a commit that referenced this issue Jun 20, 2023
Sleep for 1 second before attempting to start the KFP API server, in
order to avoid issues with Pebble when the corresponding service fails
fast. This is a temporary measure until a fix for canonical/pebble#240
is provided.

Refs #220

Signed-off-by: Phoevos Kalemkeris <[email protected]>
DnPlas commented Jun 20, 2023

I also hit this issue today.

My environment

  • Ubuntu 20.04
  • juju controller 2.9.34
  • kfp-api 2.0/stable rev413
  • minio ckf-1.7/stable rev186

Steps to reproduce

  1. Deploy kfp-api and wait for it to go active and idle.
  2. kfp-api goes into BlockedStatus, requesting the object-storage relation.
  3. Deploy minio.
  4. Relate minio and kfp-api.

NOTE: the workaround did work for me, but we have to wait for the next update-status event to be triggered to see the change.

Status of the units:

App          Version                Status   Scale  Charm                    Channel         Rev  Address         Exposed  Message
kfp-api                             waiting      1  kfp-api                  2.0/stable      413  10.152.183.244  no       installing agent
kfp-db       mariadb/server:10.3    active       1  charmed-osm-mariadb-k8s  stable           35  10.152.183.242  no       ready
kfp-schedwf  res:oci-image@90ddd63  active       1  kfp-schedwf              2.0/stable      424                  no       
kfp-viz      res:oci-image@3de6f3c  active       1  kfp-viz                  2.0/stable      394  10.152.183.102  no       
minio        res:oci-image@1755999  active       1  minio                    ckf-1.7/stable  186  10.152.183.239  no       

Unit            Workload     Agent  Address     Ports              Message
kfp-api/0*      maintenance  idle   10.1.15.13                     Workload failed health check
kfp-db/0*       active       idle   10.1.15.18  3306/TCP           ready
kfp-schedwf/0*  active       idle   10.1.15.12                     
kfp-viz/0*      active       idle   10.1.15.23  8888/TCP           
minio/0*        active       idle   10.1.15.21  9000/TCP,9001/TCP 

Observations

  1. The message Workload failed health check comes from L294 inside the _check_status() method, which is called by on_update_status() on every UpdateStatus event.
  2. The service ml-pipeline-api-server never starts if minio and its relation are missing, because it fails with 2023-06-20T12:56:36.924Z [ml-pipeline-api-server] F0620 12:56:36.924163 19 client_manager.go:412] Failed to check if Minio bucket exists. Error: Get "http://minio.kubeflow:9000/mlpipeline/?location=": dial tcp 10.152.183.239:9000: connect: connection refused
  3. We get a failed health check because 2023-06-20T13:01:36.474Z [pebble] Check "kfp-api-up" failure 1 (threshold 3): Get "http://localhost:8888/apis/v1beta1/healthz": dial tcp [::1]:8888: connect: connection refused, which makes sense because the service is not up.
  4. The service ml-pipeline-api-server is not replanned or restarted by the charm code at any point after the initial sequence. We only replan the service if there is a change in the Pebble layer (sketched right after this list).
  5. Pebble will never restart a service that exited too quickly, because of a design choice (see the description of this bug).
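
Observation 4 refers to the usual replan-only-on-change pattern, roughly like this (a simplified sketch assuming new_layer is an ops.pebble.Layer; not the charm's actual update_layer helper):

    def update_layer(container_name, container, new_layer, logger):
        # Only replan when the desired layer differs from the current plan,
        # so a service that never came up but whose layer is unchanged is
        # left alone.
        current = container.get_plan().to_dict().get("services", {})
        if current == new_layer.to_dict().get("services", {}):
            logger.info("Pebble plan for %s unchanged, not replanning", container_name)
            return
        container.add_layer(container_name, new_layer, combine=True)
        container.replan()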

Possible solution

We can mitigate this error by not attempting to start (or replan) the service unless we are sure that minio is active and the object-storage relation exists. This means we always have to check for the relation before calling update_layer, deferring the event (or returning early) until the relation is present.

I can think of something like:

    def _on_event(self, event, force_conflicts: bool = False) -> None:
        # Set up all relations/fetch required data
        try:
            self._check_leader()
            interfaces = self._get_interfaces()
            config_json = self._generate_config(interfaces)
            # Raises ErrorWithStatus if object-storage is missing, so we
            # bail out before update_layer is ever called.
            self._get_object_storage(interfaces)
            self._upload_files_to_container(config_json)
            self._apply_k8s_resources(force_conflicts=force_conflicts)
            update_layer(self._container_name, self._container, self._kfp_api_layer, self.logger)
            self._send_info(interfaces)
        except ErrorWithStatus as err:
            self.model.unit.status = err.status
            self.logger.error(f"Failed to handle {event} with error: {err}")
            return

        self.model.unit.status = ActiveStatus()
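
With this ordering, _get_object_storage raises ErrorWithStatus while the relation is missing, the handler records that status and returns, and update_layer (and therefore the Pebble start) only runs once the object-storage data is actually available.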

phoevos added a commit that referenced this issue Jun 29, 2023
Sleep for 1.1 seconds before attempting to start the KFP API server, in
order to avoid issues with Pebble when the corresponding service fails
fast. This is a temporary measure until a fix for canonical/pebble#240
is provided.

Refs #220

Signed-off-by: Phoevos Kalemkeris <[email protected]>
phoevos commented Aug 2, 2023

Closing this issue, since 06303e4 got merged and is now part of track/2.0. We will revisit this if there's any progress with canonical/pebble#240, or if we decide to restructure the code to integrate Daniela's proposed solution:

We can mitigate this error by not attempting to start (or replan) the service unless we are sure that minio is active and the object-storage relation exists. This means we always have to check for the relation before calling update_layer, deferring the event (or returning early) until the relation is present.

@phoevos phoevos closed this as completed Aug 2, 2023
DnPlas pushed a commit that referenced this issue Aug 4, 2023