KFP API: Pebble service fails to start #220

Closed

phoevos opened this issue Jun 14, 2023 · 2 comments
phoevos commented Jun 14, 2023

We're bumping into this issue intermittently on a clean install with the 2.0/edge (rev 413) version of the charm, when deploying the Kubeflow bundle (1.7/edge) either on MicroK8s or Charmed Kubernetes.

On startup, the KFP API Server tries to connect to MinIO. If MinIO is not yet available, the connection fails with the following error, causing the service to crash less than a second after it started.

Failed to check if Minio bucket exists. Error: Get "http://minio.kubeflow:9000/mlpipeline/?location=": dial tcp 10.152.183.92:9000: connect: connection refused

Since the service fails fast (in under 1 second), Pebble considers it inactive:

$ pebble services       
Service                 Startup  Current   Since
ml-pipeline-api-server  enabled  inactive  -

This is a design decision on the Pebble side, as explained in the Pebble documentation:

When starting a service, Pebble executes the service's command, and waits 1 second to ensure the command doesn't exit too quickly. Assuming the command doesn't exit within that time window, the start is considered successful, otherwise pebble start will exit with an error.

Because the service was never active, Pebble never attempts to restart it, despite the failing health checks.
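
In charm terms, this stuck state can be detected and cleared with a replan. A minimal sketch using the ops library (the helper name and its wiring into the charm are assumptions, not the actual charm code):

    from ops.pebble import ServiceStatus

    def _replan_if_inactive(container, service_name="ml-pipeline-api-server"):
        """Replan the workload if Pebble gave up on it after a fast exit."""
        services = container.get_services(service_name)
        if not services:
            return  # no Pebble layer defines the service yet
        if services[service_name].current != ServiceStatus.ACTIVE:
            # Pebble will not retry a service that exited within 1 second,
            # so the charm has to trigger the restart itself. This is the
            # programmatic equivalent of the manual workaround below.
            container.replan()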

Workaround

Since this issue is exposed by a race (i.e. MinIO not yet being available), it won't come up every time. If it does occur during deployment (after the rest of the bundle has been installed successfully), however, we need to start the API Server service manually to unblock it:

juju ssh kfp-api/0 "PEBBLE_SOCKET=/charm/containers/ml-pipeline-api-server/pebble.socket /charm/bin/pebble replan"

Mitigation

We need to come up with a plan to avoid bumping into this issue in the future. There are a couple of things that could be done on our side (both options are sketched after the list below):

  • Sleep for 1 second before starting any Pebble service
    • This is not ideal, since it defeats the purpose of this Pebble feature
  • Catch the error and manually retry starting the service on the charm side
    • This is not ideal, since it shifts the responsibility of service management from the service manager itself to the high-level charm code
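
For concreteness, here is a rough sketch of both options (the layer contents, check name, and retry helper are illustrative assumptions, not the actual charm code):

    import time

    from ops.pebble import ChangeError

    # Option 1: pad the service command so the process always outlives
    # Pebble's 1-second window. The start is then considered successful,
    # and Pebble's own restart logic (e.g. on-check-failure) takes over.
    # The command and check name below are placeholders.
    PADDED_LAYER = {
        "services": {
            "ml-pipeline-api-server": {
                "override": "replace",
                "command": "bash -c 'sleep 1.1 && /bin/apiserver'",
                "startup": "enabled",
                "on-check-failure": {"kfp-api-up": "restart"},
            }
        }
    }

    # Option 2: catch the fast-exit error and retry from the charm.
    def start_with_retries(container, service_name, attempts=3, delay=2.0):
        for attempt in range(1, attempts + 1):
            try:
                container.start(service_name)
                return
            except ChangeError:
                # The service exited within Pebble's 1-second window
                # (e.g. MinIO still unreachable); back off and try again.
                time.sleep(delay * attempt)
        raise RuntimeError(f"{service_name} did not start after {attempts} attempts")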
@phoevos phoevos added the bug Something isn't working label Jun 14, 2023
@phoevos phoevos self-assigned this Jun 14, 2023
phoevos added a commit that referenced this issue Jun 20, 2023
Sleep for 1 second before attempting to start the KFP API server, in
order to avoid issues with Pebble when the corresponding service fails
fast. This is a temporary measure until a fix for canonical/pebble#240
is provided.

Refs #220

Signed-off-by: Phoevos Kalemkeris <[email protected]>
DnPlas commented Jun 20, 2023

I also hit this issue today.

My environment

  • Ubuntu 20.04
  • juju controller 2.9.34
  • kfp-api 2.0/stable rev413
  • minio ckf-1.7/stable rev186

Steps to reproduce

  1. Deploy kfp-api and wait for it to go active and idle.
  2. kfp-api goes into BlockedStatus, requesting the object-storage relation.
  3. Deploy minio.
  4. Relate minio and kfp-api.

NOTE: the workaround did work for me, but we have to wait for the next update-status event to be triggered to see the change.

Status of the units:

App          Version                Status   Scale  Charm                    Channel         Rev  Address         Exposed  Message
kfp-api                             waiting      1  kfp-api                  2.0/stable      413  10.152.183.244  no       installing agent
kfp-db       mariadb/server:10.3    active       1  charmed-osm-mariadb-k8s  stable           35  10.152.183.242  no       ready
kfp-schedwf  res:oci-image@90ddd63  active       1  kfp-schedwf              2.0/stable      424                  no       
kfp-viz      res:oci-image@3de6f3c  active       1  kfp-viz                  2.0/stable      394  10.152.183.102  no       
minio        res:oci-image@1755999  active       1  minio                    ckf-1.7/stable  186  10.152.183.239  no       

Unit            Workload     Agent  Address     Ports              Message
kfp-api/0*      maintenance  idle   10.1.15.13                     Workload failed health check
kfp-db/0*       active       idle   10.1.15.18  3306/TCP           ready
kfp-schedwf/0*  active       idle   10.1.15.12                     
kfp-viz/0*      active       idle   10.1.15.23  8888/TCP           
minio/0*        active       idle   10.1.15.21  9000/TCP,9001/TCP 

Observations

  1. The message Workload failed health check comes from L294 inside the _check_status() method, which is called by on_update_status() on every UpdateStatus event.
  2. The service ml-pipeline-api-server never starts if minio and its relation are missing, because it fails with 2023-06-20T12:56:36.924Z [ml-pipeline-api-server] F0620 12:56:36.924163 19 client_manager.go:412] Failed to check if Minio bucket exists. Error: Get "http://minio.kubeflow:9000/mlpipeline/?location=": dial tcp 10.152.183.239:9000: connect: connection refused
  3. We get a failed health check because 2023-06-20T13:01:36.474Z [pebble] Check "kfp-api-up" failure 1 (threshold 3): Get "http://localhost:8888/apis/v1beta1/healthz": dial tcp [::1]:8888: connect: connection refused, which makes sense because the service is not up.
  4. The service ml-pipeline-api-server is not replanned or restarted by the charm code at any point after the initial sequence. We only replan the service if there is a change in the Pebble layer (sketched right after this list).
  5. Pebble will never restart a service that exited too quickly, because of a design choice (see the description of this bug).
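
Observation 4 refers to the usual replan-only-on-change pattern, roughly like this (a simplified sketch assuming new_layer is an ops.pebble.Layer; not the charm's actual update_layer helper):

    def update_layer(container_name, container, new_layer, logger):
        # Only replan when the desired layer differs from the current plan,
        # so a service that never came up but whose layer is unchanged is
        # left alone.
        current = container.get_plan().to_dict().get("services", {})
        if current == new_layer.to_dict().get("services", {}):
            logger.info("Pebble plan for %s unchanged, not replanning", container_name)
            return
        container.add_layer(container_name, new_layer, combine=True)
        container.replan()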

Possible solution

We can mitigate this error by not attempting to start (or replan) the service unless we are sure that minio is active and the object-storage relation exists. This means we always have to check for the relation before calling update_layer, deferring the event (or returning early) until the relation is present.

I can think of something like:

    def _on_event(self, event, force_conflicts: bool = False) -> None:
        # Set up all relations/fetch required data
        try:
            self._check_leader()
            interfaces = self._get_interfaces()
            config_json = self._generate_config(interfaces)
            # Raises ErrorWithStatus if object-storage is missing, so we
            # bail out before update_layer is ever called.
            self._get_object_storage(interfaces)
            self._upload_files_to_container(config_json)
            self._apply_k8s_resources(force_conflicts=force_conflicts)
            update_layer(self._container_name, self._container, self._kfp_api_layer, self.logger)
            self._send_info(interfaces)
        except ErrorWithStatus as err:
            self.model.unit.status = err.status
            self.logger.error(f"Failed to handle {event} with error: {err}")
            return

        self.model.unit.status = ActiveStatus()
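
With this ordering, _get_object_storage raises ErrorWithStatus while the relation is missing, the handler records that status and returns, and update_layer (and therefore the Pebble start) only runs once the object-storage data is actually available.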

phoevos added a commit that referenced this issue Jun 29, 2023
Sleep for 1.1 seconds before attempting to start the KFP API server, in
order to avoid issues with Pebble when the corresponding service fails
fast. This is a temporary measure until a fix for canonical/pebble#240
is provided.

Refs #220

Signed-off-by: Phoevos Kalemkeris <[email protected]>
phoevos commented Aug 2, 2023

Closing this issue, since 06303e4 got merged and is now part of track/2.0. We will revisit this if there's any progress with canonical/pebble#240, or if we decide to restructure the code to integrate Daniela's proposed solution:

We can mitigate this error by not attempting to start (or replan) the service unless we are sure that minio is active and the object-storage relation exists. This means we always have to check for the relation before calling update_layer, deferring the event (or returning early) until the relation is present.

@phoevos phoevos closed this as completed Aug 2, 2023
DnPlas pushed a commit that referenced this issue Aug 4, 2023