Pebble does not try to restart service if exited too quickly for the first time #240
There are a bunch of controls you have over the restarting of a service:

```yaml
services:
  <service name>:
    # (Optional) Defines what happens when the service exits with a nonzero
    # exit code. Possible values are: "restart" (default) which restarts
    # the service after the backoff delay, "shutdown" which shuts down and
    # exits the Pebble server, and "ignore" which does nothing further.
    on-failure: restart | shutdown | ignore

    # (Optional) Defines what happens when each of the named health checks
    # fail. Possible values are: "restart" (default) which restarts
    # the service once, "shutdown" which shuts down and exits the Pebble
    # server, and "ignore" which does nothing further.
    on-check-failure:
      <check name>: restart | shutdown | ignore

    # (Optional) Initial backoff delay for the "restart" exit action.
    # Default is half a second ("500ms").
    backoff-delay: <duration>

    # (Optional) After each backoff, the backoff delay is multiplied by
    # this factor to get the next backoff delay. Must be greater than or
    # equal to one. Default is 2.0.
    backoff-factor: <factor>

    # (Optional) Limit for the backoff delay: when multiplying by
    # backoff-factor to get the next backoff delay, if the result is
    # greater than this value, it is capped to this value. Default is
    # half a minute ("30s").
    backoff-limit: <duration>

    # (Optional) The amount of time afforded to this service to handle
    # SIGTERM and exit gracefully before SIGKILL terminates it forcefully.
    # Default is 5 seconds ("5s").
    kill-delay: <duration>
```

Have you tried any of those? :) |
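As an aside (not part of the original comment), the restart delays implied by the defaults above (backoff-delay 500 ms, backoff-factor 2.0, backoff-limit 30 s) can be sketched like this:

```python
# Sketch of the backoff schedule described in the settings above:
# each restart waits `delay`, then the delay is multiplied by the
# factor and capped at the limit.
def backoff_delays(delay=0.5, factor=2.0, limit=30.0, restarts=8):
    delays = []
    for _ in range(restarts):
        delays.append(delay)
        delay = min(delay * factor, limit)
    return delays

print(backoff_delays())  # [0.5, 1.0, 2.0, 4.0, 8.0, 16.0, 30.0, 30.0]
```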
I believe this is related to the fact that Pebble does not support one-shot services (and maybe also a bug since, IMO, regardless of the service's nature, Pebble should catch its errors). Something similar was initially reported and implemented before. Although this could be worth a discussion, I think the most immediate workaround for that issue is to force a short `sleep` before starting the service. |
@jnsgruk, I did try the available options:
|
Sleep for 1 second before attempting to start the KFP API server, in order to avoid issues with Pebble when the corresponding service fails fast. This is a temporary measure until a fix for canonical/pebble#240 is provided. Refs #220 Signed-off-by: Phoevos Kalemkeris <[email protected]>
Yeah @jnsgruk, as @phoevos mentioned, those controls are applied only to services Pebble has previously started. See this section of the Pebble docs, specifically:
iirc, a service that exits with an error like this goes to a different state than something that has crashed and needs restarting. Pebble does not try to restart from this state, even if there are health checks. From reading the docs and the discussion in this thread, it sounds like this is the intended behaviour rather than a bug. But to many, this intended behaviour is not what they expect, and it may be worth revisiting. |
imo this is different from one-shot commands. one-shot commands are intended to be single execution. The problem here is that a long-running service may legitimately error out quickly and users will expect pebble to restart it. This is typical in kubernetes controllers. Commonly you create a controller and some configmaps that the controller needs at the same time. If the controller wins the race and starts before the configmaps, it errors out and k8s restarts it again soon. |
Yes, @ca-scribner is correct: this is not related to exec (one-shot commands), as @beliaev-maksim is describing a long-running service, just one that exits quickly if it can't connect to a dependency. This is actually happening by design: when you start a service manually (with a start or replan), Pebble waits briefly and treats the start as failed if the service exits within that window. Similarly, your charm is catching the resulting error:

```python
except ErrorWithStatus as err:
    self.model.unit.status = err.status
    self.logger.error(f"Failed to handle {event} with error: {err}")
    return
```

This means Juju won't retry the hook, because you've exited the Python code successfully. I think the simplest thing to do would be to change the charm's error handling so it doesn't mask this. There are two other options I can think of:
I'd lean towards the last option, because then you get Pebble's auto-restart behaviour for free. You could either do the wait in your service's code if you have control over that, or add a simple wrapper script that waits. For example, I just tested it with such a wrapper, and the service goes into the auto-restart / backoff loop as we want here. |
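The example layer itself was not preserved in this excerpt; a sketch of the kind of wrapper described here might look like the following (the service and binary names are hypothetical):

```yaml
services:
  mysvc:
    override: replace
    # Wait slightly longer than the ~1-second window Pebble uses to judge
    # whether a start succeeded, then exec the real binary. A fast failure
    # is then seen as a service exit (triggering the auto-restart/backoff
    # loop) rather than a failed start.
    command: sh -c 'sleep 1.1; exec /usr/bin/mysvc'
    on-failure: restart
```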
It's possible we should revisit this design. But the current idea is that either the start (or replan) fails and the error is reported that way, or if the service exits later Pebble's auto-restart behaviour kicks in. The problem with your charm's current approach is that it's catching the exception and masking the error. |
Hi @benhoyt,
I strongly believe this is against charming best practices. An error in the charm code means we do not know what to do in this scenario, and that is why we failed. Also, do not forget that the user may theoretically set a model setting to not retry failed applications.
yes, I also support this idea more. However, I cannot say you get it really for free. Won't we get an overhead of 1s on each service start then?
I do not think this is the root problem. Again, see the note about unhandled charm exceptions above. If we want to make the code fully backwards compatible, we can add a new Pebble option that ignores all the errors and always tries to restart according to the settings. |
I'm still digesting the comments about how to make it work in a charm, so I'll come back to those. Regarding the design choice of handling <1s errors differently from >1s errors: was there a constraint that drove the design that way? The only thing I can think of would be avoiding thrashing a process that'll never succeed by restarting it over and over. Was that the issue, or was there something else? What caught me off guard here, I think, is that I don't know of another process manager that uses a similar pattern. For example, with Kubernetes, a Pod may die instantly (image cannot pull, process starts and dies, etc.) and it is handled the same as one that dies after hours. This method could lead to thrashing the restarts, but they avoid that with an exponential backoff. That doesn't mean restarting after a quick death is the best thing to do, but to me it feels like the intuitive thing to do unless something really gets in the way. |
@beliaev-maksim Yeah, that's fair push-back. Though I did say "simplest" thing, not "best". :-) But you're right, and charms shouldn't go into error state when they know what's wrong and they're really just waiting on something. I guess my point is that you should either handle the error properly or not catch it at all, for example:

```python
class MyCharm(ops.CharmBase):
    def _on_config_changed(self, event):
        if not self.ensure_started():  # or perhaps an @ensure_started decorator?
            return
        ...

    def _on_other_hook(self, event):
        if not self.ensure_started():
            return
        ...

    def ensure_started(self) -> bool:
        """Ensure service is started and return True, or return False if it can't be started."""
        if self.container.get_service("mysvc").is_running():
            return True
        try:
            self.container.replan()
            return True
        except pebble.ChangeError:
            self.unit.status = ops.WaitingStatus("waiting for dependency to start")
            return False
```

The drawback is you don't know when the next event is going to come in (and right now we have no way of telling Juju, "please wake me up in 10 seconds, will ya", though we might have something equivalent when we get workload events / Pebble Notices). But I think the other approach of the wrapper script with a `sleep` is better:
Well, kind of, but only when the service can't start and you need to wait for the dependency anyway. When the service starts properly, it's not an issue because it won't get to the sleep. So that's my recommendation for now, with the current system. As you point out, this might not be the best design, and we could definitely consider a different approach where the service start always succeeds (or maybe there's a config option, to make it backwards-compatible). I think it's a little tricky, and requires adding config, so it probably needs a small spec. It's not on my roadmap, so realistically I probably won't get to it soon, but if one of you wants to start with a small OPnnn spec, we could discuss further from there? Still, that's not going to be discussed and implemented overnight, so you would likely want to use the `sleep` workaround in the meantime. |
@benhoyt yes, for now in 2.9 we go with `sleep 1`. In 3.1, as a workaround, I proposed to use Juju secrets: set an expiration time and retry in X minutes on the secret-expire hook. Yes, we can start with a spec proposal, thanks! |
To me that sounds much more hacky (and an abuse of Juju Secrets) than a simple `sleep`. |
Sounds good to me! |
Sleep for 1.1 seconds before attempting to start the KFP API server, in order to avoid issues with Pebble when the corresponding service fails fast. This is a temporary measure until a fix for canonical/pebble#240 is provided. Refs #220 Signed-off-by: Phoevos Kalemkeris <[email protected]>
Following up on this since it's been a while... The Kubeflow team would appreciate it if this was delivered in 24.10, ideally implementing the same restart procedure whether something dies in <1s or >1s. We're happy to contribute to a spec or guide however you want to elaborate the problem. For some justification of why this would be helpful: the Kubeflow team has been burned by this a few times recently. It's not that we can't work around it, but that we usually spend hours debugging the wrong thing before we realize this is the root cause. The trouble is that when someone creates a Pebble service with checks and restart settings, it is not intuitive that these only apply some of the time (an example of this is @jnsgruk's interpretation here: probably everyone reading those settings expects restarts to occur even on quick failures). I doubt Kubeflow is the only team facing this; others might just not realize what is happening and don't know this is the cause. |
Hello,
we face the following issue:
our charm's service depends on another service (a second charm) being alive. When we deploy a bundle, we cannot guarantee that the second charm will come up before the main charm.
What happens:
We tried to define health checks, but it looks like if the service was not started successfully the first time, then health checks are ignored.
Can we control the retry interval from Pebble and force it to try to start the service again?
We do not want to do service management on the charm side, and would prefer to rely on Pebble in this scenario.
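For illustration (not from the original report), a layer for the setup described here might look something like this, with hypothetical service and check names; per this issue, the `on-check-failure: restart` setting does not help when the service never started successfully in the first place:

```yaml
services:
  main-svc:
    override: replace
    command: /usr/bin/main-svc
    on-failure: restart
    # Restart the service when the named check goes down.
    on-check-failure:
      dependency-up: restart
checks:
  dependency-up:
    override: replace
    # Probe the second charm's endpoint (port is a placeholder).
    tcp:
      port: 8080
```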
please see related issue in the repo: canonical/kfp-operators#220