KFP API: Pebble service fails to start #220
Comments
Sleep for 1 second before attempting to start the KFP API server, in order to avoid issues with Pebble when the corresponding service fails fast. This is a temporary measure until a fix for canonical/pebble#240 is provided. Refs #220 Signed-off-by: Phoevos Kalemkeris <[email protected]>
I also hit this issue today.
My environment
Steps to reproduce
Deploy kfp-api -> wait for it to go active and idle -> kfp-api goes into BlockedStatus requesting to add the object-storage relation -> deploy minio -> relate minio and kfp-api
NOTE: the workaround did work for me, but we have to wait for the next update-status to be triggered to see the change.
Status of the units:
Observations
Possible solution
We can mitigate this error if we don't attempt to start the service (or replan it) unless we are sure that minio is active and the
I can think of something like:
    def _on_event(self, event, force_conflicts: bool = False) -> None:
        # Set up all relations/fetch required data
        try:
            self._check_leader()
            interfaces = self._get_interfaces()
            config_json = self._generate_config(interfaces)
            self._get_object_storage(interfaces)  # <--- This also raises ErrorWithStatus
            self._upload_files_to_container(config_json)
            self._apply_k8s_resources(force_conflicts=force_conflicts)
            update_layer(self._container_name, self._container, self._kfp_api_layer, self.logger)
            self._send_info(interfaces)
        except ErrorWithStatus as err:
            self.model.unit.status = err.status
            self.logger.error(f"Failed to handle {event} with error: {err}")
            return
        self.model.unit.status = ActiveStatus()
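As a rough illustration of "don't start the service unless object storage is reachable", here is a minimal standalone sketch. It is not the charm's code: `is_endpoint_reachable` and `start_if_storage_ready` are hypothetical helpers, and a plain TCP connect is only an assumed stand-in for whatever readiness signal MinIO or the relation data would really provide.

```python
import socket
from typing import Callable


def is_endpoint_reachable(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port can be opened in time."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name resolution failures.
        return False


def start_if_storage_ready(
    host: str, port: int, start_service: Callable[[], None]
) -> bool:
    """Call start_service only when the storage endpoint accepts connections.

    Returns True if the start was attempted, False if it was skipped.
    """
    if not is_endpoint_reachable(host, port):
        return False
    start_service()
    return True
```

In a charm, the skipped case would presumably map to a waiting or blocked status so that a later event retries the start, rather than letting the service crash on its first connection attempt.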
Sleep for 1.1 seconds before attempting to start the KFP API server, in order to avoid issues with Pebble when the corresponding service fails fast. This is a temporary measure until a fix for canonical/pebble#240 is provided. Refs #220 Signed-off-by: Phoevos Kalemkeris <[email protected]>
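For readers who want to see what such a delay could look like, below is a minimal sketch of a Pebble layer whose command sleeps before launching the real process. This is not the merged change itself: `with_startup_delay` and `build_layer` are hypothetical helpers, the service name and command passed in are placeholders, and the sketch assumes `bash` is available in the workload container.

```python
import shlex


def with_startup_delay(command: str, delay_seconds: float = 1.1) -> str:
    """Wrap a command so the process outlives Pebble's fail-fast window.

    Pebble does not run commands through a shell, so the sleep and the
    real command are chained inside an explicit `bash -c` invocation.
    """
    if delay_seconds <= 0:
        raise ValueError("delay_seconds must be positive")
    inner = f"sleep {delay_seconds} && exec {command}"
    return f"bash -c {shlex.quote(inner)}"


def build_layer(service_name: str, command: str, delay_seconds: float = 1.1) -> dict:
    """Build a Pebble layer dict with a delayed start for one service."""
    return {
        "summary": f"{service_name} layer",
        "description": f"Pebble layer for {service_name} with a delayed start",
        "services": {
            service_name: {
                "override": "replace",
                "summary": service_name,
                "command": with_startup_delay(command, delay_seconds),
                "startup": "enabled",
            }
        },
    }
```

The point of the delay is that a crash after the sleep is no longer a crash within the first second, so Pebble should treat it as a failure of a running service rather than a failed start.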
Closing this issue, since 06303e4 got merged and is now part of
We're bumping into this issue intermittently on a clean install with the 2.0/edge (rev 413) version of the charm, when deploying the Kubeflow bundle (1.7/edge) either on MicroK8s or Charmed Kubernetes.
On startup, the KFP API Server tries to connect to MinIO. If MinIO is not yet available, the connection fails with the following error, causing the service to crash less than 1 second after it started.
Due to the service failing fast (<1 sec), Pebble considers it to be inactive:
This is a design decision on the Pebble side as explained here:
Due to the fact that the service was never active, Pebble never attempts to restart it, despite the failing health checks.
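The behaviour described above can be summarised with a toy model. This is not Pebble's implementation: the 1-second window is taken from the "<1 sec" figure in this issue, and the function name and return labels are made up for illustration.

```python
# Assumed window during which an exit counts as a failed start.
START_OKAY_WINDOW_SECONDS = 1.0


def outcome_after_exit(
    uptime_seconds: float, window: float = START_OKAY_WINDOW_SECONDS
) -> str:
    """Model what happens when a service process exits after uptime_seconds."""
    if uptime_seconds < 0:
        raise ValueError("uptime_seconds must be non-negative")
    if uptime_seconds < window:
        # The service never counted as active, so nothing restarts it.
        return "inactive: start failed, no automatic restart"
    # The service had become active, so the restart policy can kick in.
    return "restart: service was active, restart policy applies"
```

Under this model, an API server that dies 0.3 seconds after launch stays down until someone starts it again, while one that dies after 1.1 seconds would be restarted.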
Workaround
Since this issue is exposed due to a race (i.e. MinIO not yet available), it won't come up every time. If it does occur during deployment (after the rest of the bundle has been installed successfully), however, we need to start the API Server service manually to unblock:
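A programmatic analogue of the manual start could look like the sketch below, for example run from a periodic hook. It assumes a container object shaped like the `ops` library's `Container` (`get_service(name).is_running()` and `start(name)`); `ensure_service_running` is a hypothetical helper, and the protocols exist only to keep the example self-contained.

```python
from typing import Protocol


class ServiceInfoLike(Protocol):
    def is_running(self) -> bool: ...


class ContainerLike(Protocol):
    def get_service(self, service_name: str) -> ServiceInfoLike: ...

    def start(self, *service_names: str) -> None: ...


def ensure_service_running(container: ContainerLike, service_name: str) -> bool:
    """Start the service if it is not running.

    Returns True if a start was issued, False if it was already running.
    """
    if container.get_service(service_name).is_running():
        return False
    container.start(service_name)
    return True
```

This only papers over the race: if MinIO is still unavailable when the start is issued, the service would fail fast again and need another attempt.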
Mitigation
We need to come up with a plan to avoid bumping into this issue in the future. There are a couple of things that could be done on our side: