
Pebble does not try to restart service if exited too quickly for the first time #240

Open
beliaev-maksim opened this issue Jun 20, 2023 · 12 comments
Labels
24.10 (Planned for the 24.10 cycle) · Feature (A feature request) · Needs Design (Needs a design or spec)

Comments

@beliaev-maksim
Member

Hello,
we are facing the following issue:
our charm's service depends on another service (a second charm) being alive. When we deploy the bundle, we cannot guarantee that the second charm will come up before the main charm.

What happens:

  1. Charm 1 installs first
  2. Charm 1 tries to start its Pebble service
  3. Charm 2 is not yet alive
  4. The service exits with an error since it tries to connect to Charm 2, which is not alive ("exited too quickly")
  5. Charm 2 goes UP
  6. Charm 1 never recovers, since Pebble tries to start the service only once

We tried to define health checks, but it looks like if the service did not start successfully the first time, then the health checks are ignored.

Can we control the retry interval from Pebble and force it to try to start the service again?
We do not want to do service management on the charm side and would prefer to rely on Pebble in this scenario.

Please see the related issue in the repo: canonical/kfp-operators#220

@jnsgruk
Member

jnsgruk commented Jun 20, 2023

Can we control the retry interval from Pebble and force it to try to start the service again?

There are a bunch of controls you have over the restarting of a service:

services:
    # (Optional) Defines what happens when the service exits with a nonzero
    # exit code. Possible values are: "restart" (default) which restarts
    # the service after the backoff delay, "shutdown" which shuts down and
    # exits the Pebble server, and "ignore" which does nothing further.
    on-failure: restart | shutdown | ignore

    # (Optional) Defines what happens when each of the named health checks
    # fail. Possible values are: "restart" (default) which restarts
    # the service once, "shutdown" which shuts down and exits the Pebble
    # server, and "ignore" which does nothing further.
    on-check-failure:
        <check name>: restart | shutdown | ignore

    # (Optional) Initial backoff delay for the "restart" exit action.
    # Default is half a second ("500ms").
    backoff-delay: <duration>

    # (Optional) After each backoff, the backoff delay is multiplied by
    # this factor to get the next backoff delay. Must be greater than or
    # equal to one. Default is 2.0.
    backoff-factor: <factor>

    # (Optional) Limit for the backoff delay: when multiplying by
    # backoff-factor to get the next backoff delay, if the result is
    # greater than this value, it is capped to this value. Default is
    # half a minute ("30s").
    backoff-limit: <duration>

    # (Optional) The amount of time afforded to this service to handle
    # SIGTERM and exit gracefully before SIGKILL terminates it forcefully.
    # Default is 5 seconds ("5s").
    kill-delay: <duration>

Have you tried any of those? :)

@cjdcordeiro
Collaborator

I believe this is related to the fact that Pebble does not support one-shot services (and maybe it's also a bug, as IMO, regardless of the service's nature, Pebble should catch its errors). Something similar was initially reported here and implemented via pebble exec, without ever altering the nature of the supported commands for Pebble services.

Although this could be worth a discussion, I think the most immediate workaround for that issue is to force a sleep.

@phoevos

phoevos commented Jun 20, 2023

@jnsgruk, I did try the available options:

  • on-failure seemed the most promising one. I've tried it (both as the default and by setting it explicitly), but it looks like this is never triggered, since the service is considered inactive (and not failed). I believe that this is implied in the description:

    when the service exits with a nonzero exit code given that the failure did not occur within less than 1 second

  • on-check-failure is the one we've been trying to use along with our health checks. Again, this doesn't work for inactive services.

  • the backoff options all control the behaviour between restarts. They won't make a difference if, from Pebble's perspective, the service hasn't even started.

phoevos added a commit to canonical/kfp-operators that referenced this issue Jun 20, 2023
Sleep for 1 second before attempting to start the KFP API server, in
order to avoid issues with Pebble when the corresponding service fails
fast. This is a temporary measure until a fix for canonical/pebble#240
is provided.

Refs #220

Signed-off-by: Phoevos Kalemkeris <[email protected]>
@ca-scribner

Yeah @jnsgruk, as @phoevos mentioned, those controls apply only to services that Pebble has previously started. See this section of the Pebble docs, specifically:

When starting a service, Pebble executes the service's command, and waits 1 second to ensure the command doesn't exit too quickly. Assuming the command doesn't exit within that time window, the start is considered successful, otherwise pebble start will exit with an error

iirc, a service that exits with an error like this goes to a different state than something that has crashed and needs restarting. Pebble does not try to restart from this state, even if there are health checks.

From reading the docs and the discussion in this thread, it sounds like this is the intended behaviour rather than a bug. But to many, this intended behaviour is not expected, and it may be worth revisiting.

@ca-scribner

imo this is different from one-shot commands; one-shot commands are intended to be a single execution. The problem here is that a long-running service may legitimately error out quickly, and users will expect Pebble to restart it.

This is typical for Kubernetes controllers. Commonly you create a controller and some ConfigMaps that the controller needs at the same time. If the controller wins the race and starts before the ConfigMaps, it errors out and Kubernetes restarts it again soon.

@benhoyt
Contributor

benhoyt commented Jun 20, 2023

Yes, @ca-scribner is correct: this is not related to exec (one-shot commands), as @beliaev-maksim is describing a long-running service, just one that exits quickly if it can't connect to a dependency.

This is actually happening by design: when you start a service manually, with pebble start or ops.Container.start(), it waits up to 1 second for the service to be started. If the service exits quickly (within 1s), the start operation returns an error / raises an exception, and the idea is that you'd handle it there. So if you were calling container.start('mysvc') manually in the charm, the charm hook would raise an error and presumably Juju would try it again, so you'd get retry behaviour that way.
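
For illustration only, a minimal hedged sketch of handling that error at the call site (assuming import ops / from ops import pebble, a container obtained from the charm, and a placeholder service name "mysvc"):

def try_start(container: ops.Container) -> bool:
    """Try to start the placeholder "mysvc" service; return False if it exited too quickly."""
    try:
        container.start("mysvc")
        return True
    except pebble.ChangeError:
        # The service exited within Pebble's 1-second window; the caller can
        # retry, defer the event, or set a waiting status.
        return False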

Similar with replan, which is what it looks like you're doing in the charm you linked. You're calling update_layer to add a layer and call replan. This will raise a pebble.ChangeError in the case of a service which exits too quickly, which update_layer turns into an ErrorWithStatus exception. However, this is then caught by the charm and the hook sets an error status but returns successfully:

        except ErrorWithStatus as err:
            self.model.unit.status = err.status
            self.logger.error(f"Failed to handle {event} with error: {err}")
            return

This means Juju won't retry the hook, because you've exited the Python code successfully. I think the simplest thing to do would be to change the return above to raise, to re-raise the error, forcing Juju to retry the hook a bit later, which would do the add-layer/replan thing again, and hopefully this time the other charm is working. I forget off-hand how quickly and how many times Juju retries running a hook.
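
As a rough sketch, that change is just swapping the return in the snippet above for a raise (same handler, otherwise unchanged):

        except ErrorWithStatus as err:
            self.model.unit.status = err.status
            self.logger.error(f"Failed to handle {event} with error: {err}")
            raise  # re-raise so the hook errors out and Juju retries it later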

There are two other options I can think of:

  • Defer the event. Again, this would mean whenever the next event came through, it would re-do the add-layer/replan. But it's uncertain when the next Juju event is going to come through (the problem with defer). A rough sketch follows this list.
  • Make the service wait at least 1 second before exiting, even in case of error. This would mean the service would be "successfully started" from Pebble's point of view, which would then kick in Pebble's auto-restart behaviour.
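
A minimal hedged sketch of the defer option (self.container and self._layer are placeholders for the charm's own container and layer definition, assuming import ops / from ops import pebble):

def _on_config_changed(self, event):
    self.container.add_layer("mysvc", self._layer, combine=True)
    try:
        self.container.replan()
    except pebble.ChangeError:
        # The service exited too quickly; wait for a later event and retry then.
        self.unit.status = ops.WaitingStatus("waiting for dependency to start")
        event.defer()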

I'd lean towards the last option, because then you get Pebble's auto-restart behaviour for free. You could either do the wait in your service's code if you have control over that, or add a simple wrapper script that waits. For example, I just tested it with this, and the service goes into the auto-restart / backoff loop as we want here:

services:
  srv1:
    override: replace
    command: /bin/sh -c "myservice; sleep 1.1"
    startup: enabled

It's possible we should revisit this design. But the current idea is that either the start (or replan) fails and the error is reported that way, or if the service exits later Pebble's auto-restart behaviour kicks in. The problem with your charm's current approach is that it's catching the exception and masking the error.

@beliaev-maksim
Member Author

Hi @benhoyt,

I think the simplest thing to do would be to change the return above to raise

I strongly believe this is against charming best practices. An error in the charm code means we do not know what to do in this scenario, and that is why we failed. Also, do not forget that a user may theoretically set the model setting to not retry failed applications.

I'd lean towards the last option, because then you get Pebble's auto-restart behaviour for free.

Yes, I also support this idea more. However, I cannot say you really get it for free. Won't we get an overhead of 1s on each service start then?

The problem with your charm's current approach is that it's catching the exception and masking the error.

I do not think this is the root problem. Again, see the note about unhandled charm exceptions above.
I think it would be better if Pebble tried to start any service regardless of the error raised. It should not lead to any system overload, thanks to backoff-factor.

If we want to make the code fully backwards compatible, we can add a new Pebble option that ignores all errors and always tries to restart according to the settings.

@ca-scribner

I'm still digesting the comments about how to make it work in a charm, so I'll come back to those.

Regarding the design choice of handling <1s errors differently from >1s errors: was there a constraint that drove the design that way? The only thing I can think of would be avoiding thrashing a process that'll never succeed by restarting it over and over. Was that the issue, or was there something else?

What caught me off guard here, I think, is that I don't know of another process manager that uses a similar pattern. For example, with Kubernetes, a Pod may die instantly (image cannot be pulled, process starts and dies, etc.) and it is handled the same as one that dies after hours. This could lead to restart thrashing, but they avoid that with an exponential backoff.

Doesn't mean restarting after a quick death is the best thing to do, but to me it feels like the intuitive thing to do unless something really gets in the way.

@benhoyt
Contributor

benhoyt commented Jun 22, 2023

I strongly believe this is against charming best practices. Error in the charm code means we do not know what to do in this scenario and that is why we failed.

@beliaev-maksim Yeah, that's fair push-back. Though I did say "simplest" thing, not "best". :-) But you're right, and charms shouldn't go into error state when they know what's wrong and they're really just waiting on something.

I guess my point is that you should handle the start/replan error (either by having the charm retry, or by deferring the event). You could do something like this:

import ops
from ops import pebble


class MyCharm(ops.CharmBase):
    def _on_config_changed(self, event):
        if not self.ensure_started():  # or perhaps an @ensure_started decorator?
            return
        ...

    def _on_other_hook(self, event):
        if not self.ensure_started():
            return
        ...

    def ensure_started(self) -> bool:
        """Ensure service is started and return True, or return False if it can't be started."""
        if self.container.get_service("mysvc").is_running():
            return True
        try:
            self.container.replan()
            return True
        except pebble.ChangeError:
            self.unit.status = ops.WaitingStatus("waiting for dependency to start")
            return False

The drawback is you don't know when the next event is going to come in (and right now we have no way of telling Juju, "please wake me up in 10 seconds, will ya", though we might have something equivalent when we get workload events / Pebble Notices).

But I think the other approach of the wrapper script with a sleep 1 is probably a cleaner workaround for your use case.
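
For reference, a hedged sketch of the same workaround driven from the charm side, mirroring the YAML above (the service name and command are placeholders; the layer dict follows Pebble's documented service fields):

layer = {
    "services": {
        "mysvc": {
            "override": "replace",
            # Sleep past Pebble's 1-second window so a quick failure still counts
            # as a started service and the auto-restart/backoff behaviour applies.
            "command": '/bin/sh -c "myservice; sleep 1.1"',
            "startup": "enabled",
            "on-failure": "restart",
        }
    }
}
container.add_layer("mysvc", layer, combine=True)
container.replan()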

Won't we get an overhead of 1s on each service start then?

Well, kind of, but only when the service can't start and you need to wait for the dependency anyway. When the service starts properly, it's not an issue because it won't get to the sleep.


So that's my recommendation for now, with the current system. As you point out, this might not be the best design, and we could definitely consider a different approach where the service start always succeeds (or maybe there's a config option, to make it backwards-compatible). I think it's a little tricky, and requires adding config, so it probably needs a small spec. It's not on my roadmap so realistically I probably won't get to it soon, but if one of you wants to start with a small OPnnn spec, we could discuss further from there?

Still, that's not going to be discussed and implemented overnight, so you would likely want to use the sleep 1 workaround for now.

@beliaev-maksim
Member Author

@benhoyt yes, for now in 2.9 we are going with sleep 1.

In 3.1, as a workaround, I proposed to use Juju secrets: set an expiration time and retry in X minutes on the secret-expired hook.
WDYT?

Yes, we can start with a spec proposal, thanks!

@benhoyt
Contributor

benhoyt commented Jun 25, 2023

In 3.1, as a workaround, I proposed to use Juju secrets: set an expiration time and retry in X minutes on the secret-expired hook.
WDYT?

To me that sounds much more hacky (and an abuse of Juju Secrets) than a simple sleep 1 in your command.

Yes, we can start with a spec proposal, thanks!

Sounds good to me!

phoevos added a commit to canonical/kfp-operators that referenced this issue Jun 28, 2023
Sleep for 1 second before attempting to start the KFP API server, in
order to avoid issues with Pebble when the corresponding service fails
fast. This is a temporary measure until a fix for canonical/pebble#240
is provided.

Refs #220

Signed-off-by: Phoevos Kalemkeris <[email protected]>
phoevos added a commit to canonical/kfp-operators that referenced this issue Jun 29, 2023
Sleep for 1 second before attempting to start the KFP API server, in
order to avoid issues with Pebble when the corresponding service fails
fast. This is a temporary measure until a fix for canonical/pebble#240
is provided.

Refs #220

Signed-off-by: Phoevos Kalemkeris <[email protected]>
phoevos added a commit to canonical/kfp-operators that referenced this issue Jun 29, 2023
Sleep for 1.1 seconds before attempting to start the KFP API server, in
order to avoid issues with Pebble when the corresponding service fails
fast. This is a temporary measure until a fix for canonical/pebble#240
is provided.

Refs #220

Signed-off-by: Phoevos Kalemkeris <[email protected]>
DnPlas pushed a commit to canonical/kfp-operators that referenced this issue Aug 4, 2023
Sleep for 1.1 seconds before attempting to start the KFP API server, in
order to avoid issues with Pebble when the corresponding service fails
fast. This is a temporary measure until a fix for canonical/pebble#240
is provided.

Refs #220

Signed-off-by: Phoevos Kalemkeris <[email protected]>
benhoyt added the 24.10 (Planned for the 24.10 cycle), Feature (A feature request), and Needs Design (Needs a design or spec) labels on Mar 13, 2024
@ca-scribner

Following up on this since it's been a while...

The Kubeflow team would appreciate it if this was delivered in 24.10, ideally implementing the same restart procedure whether something dies in <1s or >1s. We're happy to contribute to a spec, or to help however you want to elaborate the problem.

For some justification of why this would be helpful: the Kubeflow team has been burned by this a few times recently. It's not that we can't work around it, but that we usually spend hours debugging the wrong thing before we realize this is the root cause. The trouble is that when someone creates a Pebble service with checks and restart settings, it is not intuitive that these only apply some of the time (an example of this is @jnsgruk's interpretation above: probably everyone reading those settings expects restarts to occur even on quick failures). I doubt Kubeflow is the only team facing this; others might just not realize what is happening and not know this is the cause.

benhoyt changed the title from "Pebble does not try to restart if exited too quickly for the first time" to "Pebble does not try to restart service if exited too quickly for the first time" on Apr 18, 2024