
Pebble does not try to restart service if exited too quickly for the first time #240

Open
beliaev-maksim opened this issue Jun 20, 2023 · 12 comments
Labels
24.10 (Planned for the 24.10 cycle) · Feature (A feature request) · Needs Design (Needs a design or spec)

Comments

@beliaev-maksim
Member

Hello,
we are facing the following issue:
our charm's service depends on another service (a second charm) being alive. When we deploy the bundle, we cannot guarantee that the second charm will come up before the main charm.

What happens:

  1. Charm 1 installs first
  2. Charm 1 tries to start its Pebble service
  3. Charm 2 is not yet alive
  4. The service exits with an error since it tries to connect to Charm 2, which is not alive ("exited too quickly")
  5. Charm 2 goes UP
  6. Charm 1 never recovers, since Pebble tries to start the service only once

We tried to define health checks, but it looks like if the service did not start successfully the first time, then the health checks are ignored.

Can we control the retry interval from Pebble and force it to try to start the service again?
We do not want to do service management on the charm side and would prefer to rely on Pebble in this scenario.

Please see the related issue in the repo: canonical/kfp-operators#220

@jnsgruk
Member

jnsgruk commented Jun 20, 2023

Can we control the retry interval from Pebble and force it to try to start the service again?

There are a bunch of controls you have over the restarting of a service:

services:
    # (Optional) Defines what happens when the service exits with a nonzero
    # exit code. Possible values are: "restart" (default) which restarts
    # the service after the backoff delay, "shutdown" which shuts down and
    # exits the Pebble server, and "ignore" which does nothing further.
    on-failure: restart | shutdown | ignore

    # (Optional) Defines what happens when each of the named health checks
    # fail. Possible values are: "restart" (default) which restarts
    # the service once, "shutdown" which shuts down and exits the Pebble
    # server, and "ignore" which does nothing further.
    on-check-failure:
        <check name>: restart | shutdown | ignore

    # (Optional) Initial backoff delay for the "restart" exit action.
    # Default is half a second ("500ms").
    backoff-delay: <duration>

    # (Optional) After each backoff, the backoff delay is multiplied by
    # this factor to get the next backoff delay. Must be greater than or
    # equal to one. Default is 2.0.
    backoff-factor: <factor>

    # (Optional) Limit for the backoff delay: when multiplying by
    # backoff-factor to get the next backoff delay, if the result is
    # greater than this value, it is capped to this value. Default is
    # half a minute ("30s").
    backoff-limit: <duration>

    # (Optional) The amount of time afforded to this service to handle
    # SIGTERM and exit gracefully before SIGKILL terminates it forcefully.
    # Default is 5 seconds ("5s").
    kill-delay: <duration>

Have you tried any of those? :)

@cjdcordeiro
Collaborator

I believe this is related to the fact that Pebble does not support one-shot services (and maybe it's also a bug, as IMO, regardless of the service's nature, Pebble should catch its errors). Something similar was initially reported here and implemented via pebble exec, without ever altering the nature of the supported commands for Pebble services.

Although this could be worth a discussion, I think the most immediate workaround for that issue is to force a sleep.

@phoevos

phoevos commented Jun 20, 2023

@jnsgruk, I did try the available options:

  • on-failure seemed the most promising one. I've tried it (both as the default and by setting it explicitly), but it looks like this is never triggered, since the service is considered inactive (and not failed). I believe that this is implied in the description:

    when the service exits with a nonzero exit code given that the failure did not occur within less than 1 second

  • on-check-failure is the one we've been trying to use along with our health checks. Again, this doesn't work for inactive services.

  • the backoff options all control the behaviour between restarts. They won't make a difference if, from Pebble's perspective, the service hasn't even started.

phoevos added a commit to canonical/kfp-operators that referenced this issue Jun 20, 2023
Sleep for 1 second before attempting to start the KFP API server, in
order to avoid issues with Pebble when the corresponding service fails
fast. This is a temporary measure until a fix for canonical/pebble#240
is provided.

Refs #220

Signed-off-by: Phoevos Kalemkeris <[email protected]>
@ca-scribner

Yeah @jnsgruk, as @phoevos mentioned, those controls apply only to services that Pebble has previously started. See this section of the Pebble docs, specifically:

When starting a service, Pebble executes the service's command, and waits 1 second to ensure the command doesn't exit too quickly. Assuming the command doesn't exit within that time window, the start is considered successful, otherwise pebble start will exit with an error

iirc, a service that exits with an error like this goes to a different state than something that has crashed and needs restarting. Pebble does not try to restart from this state, even if there are health checks.

From reading the docs and the discussion in this thread, it sounds like this is the intended behaviour rather than a bug. But to many, this intended behaviour is not expected, and it may be worth revisiting.

@ca-scribner

imo this is different from one-shot commands; one-shot commands are intended to be a single execution. The problem here is that a long-running service may legitimately error out quickly, and users will expect Pebble to restart it.

This is typical for Kubernetes controllers. Commonly you create a controller and some ConfigMaps that the controller needs at the same time. If the controller wins the race and starts before the ConfigMaps, it errors out and Kubernetes restarts it again soon.

@benhoyt
Contributor

benhoyt commented Jun 20, 2023

Yes, @ca-scribner is correct: this is not related to exec (one-shot commands), as @beliaev-maksim is describing a long-running service, just one that exits quickly if it can't connect to a dependency.

This is actually happening by design: when you start a service manually, with pebble start or ops.Container.start(), it waits up to 1 second for the service to be started. If the service exits quickly (within 1s), the start operation returns an error / raises an exception, and the idea is that you'd handle it there. So if you were calling container.start('mysvc') manually in the charm, the charm hook would raise an error and presumably Juju would try it again, so you'd get retry behaviour that way.
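
For illustration only, a minimal hedged sketch of handling that error at the call site (assuming import ops / from ops import pebble, a container obtained from the charm, and a placeholder service name "mysvc"):

def try_start(container: ops.Container) -> bool:
    """Try to start the placeholder "mysvc" service; return False if it exited too quickly."""
    try:
        container.start("mysvc")
        return True
    except pebble.ChangeError:
        # The service exited within Pebble's 1-second window; the caller can
        # retry, defer the event, or set a waiting status.
        return False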

Similar with replan, which is what it looks like you're doing in the charm you linked. You're calling update_layer to add a layer and call replan. This will raise a pebble.ChangeError in the case of a service which exits too quickly, which update_layer turns into an ErrorWithStatus exception. However, this is then caught by the charm and the hook sets an error status but returns successfully:

        except ErrorWithStatus as err:
            self.model.unit.status = err.status
            self.logger.error(f"Failed to handle {event} with error: {err}")
            return

This means Juju won't retry the hook, because you've exited the Python code successfully. I think the simplest thing to do would be to change the return above to raise, to re-raise the error, forcing Juju to retry the hook a bit later, which would do the add-layer/replan thing again, and hopefully this time the other charm is working. I forget off-hand how quickly and how many times Juju retries running a hook.
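
As a rough sketch, that change is just swapping the return in the snippet above for a raise (same handler, otherwise unchanged):

        except ErrorWithStatus as err:
            self.model.unit.status = err.status
            self.logger.error(f"Failed to handle {event} with error: {err}")
            raise  # re-raise so the hook errors out and Juju retries it later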

There are two other options I can think of:

  • Defer the event. Again, this would mean whenever the next event came through, it would re-do the add-layer/replan. But it's uncertain when the next Juju event is going to come through (the problem with defer). A rough sketch follows this list.
  • Make the service wait at least 1 second before exiting, even in case of error. This would mean the service would be "successfully started" from Pebble's point of view, which would then kick in Pebble's auto-restart behaviour.
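
A minimal hedged sketch of the defer option (self.container and self._layer are placeholders for the charm's own container and layer definition, assuming import ops / from ops import pebble):

def _on_config_changed(self, event):
    self.container.add_layer("mysvc", self._layer, combine=True)
    try:
        self.container.replan()
    except pebble.ChangeError:
        # The service exited too quickly; wait for a later event and retry then.
        self.unit.status = ops.WaitingStatus("waiting for dependency to start")
        event.defer()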

I'd lean towards the last option, because then you get Pebble's auto-restart behaviour for free. You could either do the wait in your service's code if you have control over that, or add a simple wrapper script that waits. For example, I just tested it with this, and the service goes into the auto-restart / backoff loop as we want here:

services:
  srv1:
    override: replace
    command: /bin/sh -c "myservice; sleep 1.1"
    startup: enabled

It's possible we should revisit this design. But the current idea is that either the start (or replan) fails and the error is reported that way, or if the service exits later Pebble's auto-restart behaviour kicks in. The problem with your charm's current approach is that it's catching the exception and masking the error.

@beliaev-maksim
Member Author

Hi @benhoyt,

I think the simplest thing to do would be to change the return above to raise

I strongly believe this is against charming best practices. An error in the charm code means we do not know what to do in this scenario, and that is why we failed. Also, do not forget that a user may theoretically set the model setting to not retry failed applications.

I'd lean towards the last option, because then you get Pebble's auto-restart behaviour for free.

Yes, I also support this idea more. However, I cannot say you really get it for free. Won't we get an overhead of 1s on each service start then?

The problem with your charm's current approach is that it's catching the exception and masking the error.

I do not think this is the root problem. Again, see the note about unhandled charm exceptions above.
I think it would be better if Pebble tried to start any service regardless of the error raised. It should not lead to any system overload, thanks to backoff-factor.

If we want to make the code fully backwards compatible, we can add a new Pebble option that ignores all errors and always tries to restart according to the settings.

@ca-scribner

I'm still digesting the comments about how to make it work in a charm, so I'll come back to those.

Regarding the design choice of handling <1s errors differently from >1s errors: was there a constraint that drove the design that way? The only thing I can think of would be avoiding thrashing a process that'll never succeed by restarting it over and over. Was that the issue, or was there something else?

What caught me off guard here, I think, is that I don't know of another process manager that uses a similar pattern. For example, with Kubernetes, a Pod may die instantly (image cannot be pulled, process starts and dies, etc.) and it is handled the same as one that dies after hours. This could lead to restart thrashing, but they avoid that with an exponential backoff.

Doesn't mean restarting after a quick death is the best thing to do, but to me it feels like the intuitive thing to do unless something really gets in the way.

@benhoyt
Contributor

benhoyt commented Jun 22, 2023

I strongly believe this is against charming best practices. Error in the charm code means we do not know what to do in this scenario and that is why we failed.

@beliaev-maksim Yeah, that's fair push-back. Though I did say "simplest" thing, not "best". :-) But you're right, and charms shouldn't go into error state when they know what's wrong and they're really just waiting on something.

I guess my point is that you should handle the start/replan error (either by having the charm retry, or by deferring the event). You could do something like this:

import ops
from ops import pebble


class MyCharm(ops.CharmBase):
    def _on_config_changed(self, event):
        if not self.ensure_started():  # or perhaps an @ensure_started decorator?
            return
        ...

    def _on_other_hook(self, event):
        if not self.ensure_started():
            return
        ...

    def ensure_started(self) -> bool:
        """Ensure service is started and return True, or return False if it can't be started."""
        if self.container.get_service("mysvc").is_running():
            return True
        try:
            self.container.replan()
            return True
        except pebble.ChangeError:
            self.unit.status = ops.WaitingStatus("waiting for dependency to start")
            return False

The drawback is you don't know when the next event is going to come in (and right now we have no way of telling Juju, "please wake me up in 10 seconds, will ya", though we might have something equivalent when we get workload events / Pebble Notices).

But I think the other approach of the wrapper script with a sleep 1 is probably a cleaner workaround for your use case.
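
For reference, a hedged sketch of the same workaround driven from the charm side, mirroring the YAML above (the service name and command are placeholders; the layer dict follows Pebble's documented service fields):

layer = {
    "services": {
        "mysvc": {
            "override": "replace",
            # Sleep past Pebble's 1-second window so a quick failure still counts
            # as a started service and the auto-restart/backoff behaviour applies.
            "command": '/bin/sh -c "myservice; sleep 1.1"',
            "startup": "enabled",
            "on-failure": "restart",
        }
    }
}
container.add_layer("mysvc", layer, combine=True)
container.replan()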

Won't we get an overhead of 1s on each service start then?

Well, kind of, but only when the service can't start and you need to wait for the dependency anyway. When the service starts properly, it's not an issue because it won't get to the sleep.


So that's my recommendation for now, with the current system. As you point out, this might not be the best design, and we could definitely consider a different approach where the service start always succeeds (or maybe there's a config option, to make it backwards-compatible). I think it's a little tricky, and requires adding config, so it probably needs a small spec. It's not on my roadmap so realistically I probably won't get to it soon, but if one of you wants to start with a small OPnnn spec, we could discuss further from there?

Still, that's not going to be discussed and implemented overnight, so you would likely want to use the sleep 1 workaround for now.

@beliaev-maksim
Member Author

@benhoyt yes, for now in 2.9 we are going with sleep 1.

In 3.1, as a workaround, I proposed to use Juju secrets: set an expiration time and retry in X minutes on the secret-expired hook.
WDYT?

Yes, we can start with a spec proposal, thanks!

@benhoyt
Contributor

benhoyt commented Jun 25, 2023

In 3.1, as a workaround, I proposed to use Juju secrets: set an expiration time and retry in X minutes on the secret-expired hook.
WDYT?

To me that sounds much more hacky (and an abuse of Juju Secrets) than a simple sleep 1 in your command.

Yes, we can start with a spec proposal, thanks!

Sounds good to me!

phoevos added a commit to canonical/kfp-operators that referenced this issue Jun 28, 2023
Sleep for 1 second before attempting to start the KFP API server, in
order to avoid issues with Pebble when the corresponding service fails
fast. This is a temporary measure until a fix for canonical/pebble#240
is provided.

Refs #220

Signed-off-by: Phoevos Kalemkeris <[email protected]>
phoevos added a commit to canonical/kfp-operators that referenced this issue Jun 29, 2023
Sleep for 1 second before attempting to start the KFP API server, in
order to avoid issues with Pebble when the corresponding service fails
fast. This is a temporary measure until a fix for canonical/pebble#240
is provided.

Refs #220

Signed-off-by: Phoevos Kalemkeris <[email protected]>
phoevos added a commit to canonical/kfp-operators that referenced this issue Jun 29, 2023
Sleep for 1.1 seconds before attempting to start the KFP API server, in
order to avoid issues with Pebble when the corresponding service fails
fast. This is a temporary measure until a fix for canonical/pebble#240
is provided.

Refs #220

Signed-off-by: Phoevos Kalemkeris <[email protected]>
DnPlas pushed a commit to canonical/kfp-operators that referenced this issue Aug 4, 2023
Sleep for 1.1 seconds before attempting to start the KFP API server, in
order to avoid issues with Pebble when the corresponding service fails
fast. This is a temporary measure until a fix for canonical/pebble#240
is provided.

Refs #220

Signed-off-by: Phoevos Kalemkeris <[email protected]>
benhoyt added the 24.10 (Planned for the 24.10 cycle), Feature (A feature request), and Needs Design (Needs a design or spec) labels on Mar 13, 2024
@ca-scribner

Following up on this since it's been a while...

The Kubeflow team would appreciate it if this was delivered in 24.10, ideally implementing the same restart procedure whether something dies in <1s or >1s. We're happy to contribute to a spec, or to help however you want to elaborate the problem.

For some justification of why this would be helpful: the Kubeflow team has been burned by this a few times recently. It's not that we can't work around it, but that we usually spend hours debugging the wrong thing before we realize this is the root cause. The trouble is that when someone creates a Pebble service with checks and restart settings, it is not intuitive that these only apply some of the time (an example of this is @jnsgruk's interpretation above: probably everyone reading those settings expects restarts to occur even on quick failures). I doubt Kubeflow is the only team facing this; others might just not realize what is happening and not know this is the cause.

benhoyt changed the title from "Pebble does not try to restart if exited too quickly for the first time" to "Pebble does not try to restart service if exited too quickly for the first time" on Apr 18, 2024