Recovering from Missed Events #698

Open
jeff-minard-ck opened this issue Feb 24, 2021 · 9 comments
Labels
question Further information is requested

Comments

@jeff-minard-ck

jeff-minard-ck commented Feb 24, 2021

Question

We've created an operator with kopf and it's been great -- it subscribes to our CRD creation/update/delete events, and processing works most of the time, but only most of the time.

We routinely find ourselves having instances of our CRDs which have not been processed by the operator. If we go and make a change to the object (like adding an annotation), a new event is generated and kopf will then "see" the object and process it.

What it boils down to is that kopf is missing events -- but I don't think that's a kopf issue, in and of itself. Reading around suggests that events on watch streams are not inherently reliable. Several other operators I deal with handle this with a two-part approach:

  1. watch for events and process them as needed
  2. loop a list operation and reconcile items which have been missed

Even the official docs on watch talk about needing to do option 2:

> Kubernetes [..] will only preserve a historical list of changes for a limited time. Clusters using etcd3 preserve changes in the last 5 minutes by default. When the requested watch operations fail because the historical version of that resource is not available, clients must handle the case by recognizing the status code 410 Gone, clearing their local cache, performing a list operation, and starting the watch from the resourceVersion returned by that new list operation.
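To make option 2 concrete, here is a rough sketch of the list-then-watch pattern using the official `kubernetes` Python client; the CRD coordinates and the `reconcile()` helper are placeholders, not our actual operator code:

```python
# Rough sketch of the list + watch reconcile loop (placeholders, not production code).
from kubernetes import client, config, watch
from kubernetes.client.rest import ApiException

GROUP, VERSION, PLURAL = "our.crd.tld", "v1", "ourcrds"  # placeholder CRD coordinates

def reconcile(obj: dict) -> None:
    ...  # placeholder: ensure the real-world state matches this object's spec

def run() -> None:
    config.load_incluster_config()
    api = client.CustomObjectsApi()
    while True:
        # 1. Full list: reconciles anything missed while the watch was down.
        listing = api.list_cluster_custom_object(GROUP, VERSION, PLURAL)
        for obj in listing["items"]:
            reconcile(obj)
        resource_version = listing["metadata"]["resourceVersion"]
        # 2. Watch from that resourceVersion until it expires.
        try:
            for event in watch.Watch().stream(
                api.list_cluster_custom_object, GROUP, VERSION, PLURAL,
                resource_version=resource_version,
            ):
                reconcile(event["object"])
        except ApiException as exc:
            if exc.status == 410:  # "Gone": resourceVersion too old; re-list and restart.
                continue
            raise
```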

kopf does have something like this too, sort of: when the operator starts up, it does an initial load of all the objects and can set up all the handlers, etc.

Would it make sense to make that initial 'load' process something that could be invoked regularly?


Keywords

missed events, missing events

@nolar
Owner

nolar commented Feb 24, 2021

Thanks for asking. It is a good question.

First, please clarify if those are indeed events missed or changes not detected — on.event vs. on.create/update/delete handlers.

There is a known issue that changes made to an object while it is being processed are not always consistently handled: the earlier creation/update/deletion handlers might miss the most recent changes made within the same handling cycle (of multiple handlers) — #641 & co.


If that is about on.event low-level handling, then it misses the events from the watch-streams intentionally if they arrive too fast. Only the latest state is processed. This is done because K8s, in general, is not an event-driven but a state-driven system.

It can be configured via settings.batching.batch_window (default is 0.1 seconds). And, as I have just noticed, it is not fully documented (but should be); the source code can help for now.
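For illustration, a minimal sketch of tuning that window at operator startup; this is based on the setting named above, not an excerpt from kopf's source:

```python
import kopf

@kopf.on.startup()
def configure(settings: kopf.OperatorSettings, **_):
    # Events arriving within this window are collapsed; only the latest state is handled.
    settings.batching.batch_window = 0.1  # seconds; the default mentioned above
```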


Regarding the re-listing — this, I believe, is done by setting the client or server timeout on the watch-streams:

https://kopf.readthedocs.io/en/stable/configuration/#api-timeouts

On every such reconnection, the resource kind is fully re-listed. The same happens on "410 Gone" errors:

# "410 Gone" is for the "resource version too old" error, we must restart watching.
# The resource versions are lost by k8s after few minutes (5, as per the official doc).
# The error occurs when there is nothing happening for few minutes. This is normal.
if raw_type == 'ERROR' and cast(bodies.RawError, raw_object)['code'] == 410:
where = f'in {namespace!r}' if namespace is not None else 'cluster-wide'
logger.debug(f"Restarting the watch-stream for {resource} {where}.")
return # out of the regular stream, to the infinite stream.

The default behaviour is to trust the server and not set any timeouts in the client (operator), partly because there is no single "good value" for all possible environments and use cases.


I hope this helps. If I misunderstood the question or left something unanswered, please feel free to clarify.

@jeff-minard-ck
Author

We're using create/update/delete:

```python
@kopf.on.create('our.crd.tld', 'v1', 'ourcrds')
@kopf.on.update('our.crd.tld', 'v1', 'ourcrds')
@kopf.on.delete('our.crd.tld', 'v1', 'ourcrds')
def handle_fn(name, namespace, spec, reason, **_):
    ...  # handler body omitted
```

> changes made to an object while it is being processed are not always consistently handled

Re: #641: I don't think we're using subhandlers. Regardless, the operator never shows the object in question being handled at all, which I think is different from missing one of the event types because they are overlapping.


> Only the latest state is processed. This is done because K8s, in general, is not an event-driven but a state-driven system.

This is a very odd statement for an operator framework that seems to rely on events to create handlers.

Let's say I've got a MySQL CRD where, when you create an instance, the operator should go spin up a cloud MySQL or something. If the operator skips out on events, then I'm almost guaranteed to have instances of the CRD which have no MySQL created. I feel like I've missed something here about why events would be dropped.


I'll have a go at the timeouts; that may be a good enough solution, but it would be nice if there were a way to trigger that behavior more directly.

@nolar
Owner

nolar commented Feb 24, 2021

Thanks for this information. I've updated the #641 title to also refer to regular handlers, not only sub-handlers. The issue affects them all.

Just to clarify: Kopf does not skip events "at all" or randomly. It skips all the individual low-level events in a quick short batch (0.1s-long), except the last one — and that last one should be processed in full. That last event (the final state) should never be skipped/lost, of course. This concept of the state-driven systems is described here: https://hackernoon.com/level-triggering-and-reconciliation-in-kubernetes-1f17fe30333d

> the operator never shows the object in question being handled at all

This, however, is very strange. It should never happen. Every individual resource should be handled at least once, at least attempted, and get that "kopf.zalando.org/last-handled-configuration" or other "kopf.zalando.org/…" annotations.
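For a quick check, a hedged sketch along these lines could list the CRs and print any that lack that annotation (the CRD coordinates are placeholders):

```python
# Sketch: flag custom resources that kopf has apparently never handled.
from kubernetes import client, config

config.load_kube_config()  # or load_incluster_config() when run inside the cluster
api = client.CustomObjectsApi()
objs = api.list_cluster_custom_object("our.crd.tld", "v1", "ourcrds")["items"]
for obj in objs:
    annotations = obj["metadata"].get("annotations", {})
    if "kopf.zalando.org/last-handled-configuration" not in annotations:
        meta = obj["metadata"]
        print(f"never handled: {meta.get('namespace', '-')}/{meta['name']}")
```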

Can you please hint: which version of Kopf is used? What is the scale (number of CRs: ones-tens-hundreds-thousands)? What's the frequency of changes of each? Which K8s version (1.xx.y) and which K8s environment is used (GCP/EKS/self-managed/K3s/etc)? — I'll try to reproduce it locally.

Can you also please add an @kopf.on.event() handler and see if it is called in the logs? Some logs with --verbose would also be helpful, as would the operator's skeleton (decorators + function defs, without the functions' content) and the CRD, if that does not expose your commercial secrets, of course. Or a standalone anonymised reproducible snippet.
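For example, a minimal sketch of such a debugging handler (the CRD coordinates are the placeholders from the snippet above):

```python
import kopf

@kopf.on.event('our.crd.tld', 'v1', 'ourcrds')
def debug_event(event, name, namespace, logger, **_):
    # Log every raw watch-event so that missing ones become visible with --verbose.
    logger.info(f"watch-event {event.get('type')} for {namespace}/{name}")
```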

PS: A thought: if a resource receives changes continuously within the 0.1s window and this stream never ends/stops/pauses for longer than 0.1s, such a resource will never be handled; it will always be waiting until the "batch" is finally over (where "over" means "no changes for >=0.1s"). That would be a bug, and an easy one to fix. But I am not sure how realistic such a scenario is.


Regarding the request: I think it is possible to add a controllable trigger for full re-listing. But that adds complexity to an already complex framework. I would like to clearly understand what the root cause is before doing this, and why the current implementation is not as stable as intended.

@nolar
Owner

nolar commented Feb 24, 2021

A little follow-up: I've just remembered that over the past couple of years there were one or two reports of cases where the K8s API watch-calls stopped reporting changes and went idle forever, even though the TCP connections remained alive and the cluster was making changes. It was (and is) some issue in K8s itself, where it stops sending updates to the open watch-connections, for reasons unknown.

One is #232; I cannot find another one now (local connectivity issues).

The solution was to use the client-side and/or server-side timeouts, as mentioned above, and reconnect from time to time. The frequency of reconnections is operator- and cluster-specific, though: more frequent reconnections mean faster recovery from this stuck state, but cost more CPU and log noise (a full re-listing happens each time).

But in that case, the updates stop for the whole resource kind, not for individual resource objects.

@jeff-minard-ck
Author

Oh that sounds exactly like what we see. I hadn't mentioned it because I wanted to keep things focused, but we have also had times when the operator was still up and running, but completely stopped processing objects.

We're going to add the timeout and, based on the reports back in that ticket, I suspect we'll see the same level of success.

(Our back-pocket hack was going to be to add a "watcher" process that would look for unprocessed objects, i.e. those missing the status field(s), and kill the operator if they got "too old". A little more efficient than the timeout, since it would only kick in when this bug manifests instead of at every N-second interval.)
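Roughly, such a watcher might look like the sketch below; this is an assumption of the shape, not our actual code, and the threshold and CRD coordinates are placeholders:

```python
# Sketch of a "too old and still unprocessed" check; exit non-zero to trigger a restart.
import sys
from datetime import datetime, timedelta, timezone
from kubernetes import client, config

MAX_AGE = timedelta(minutes=15)  # placeholder threshold

def has_stale_unprocessed() -> bool:
    config.load_incluster_config()
    api = client.CustomObjectsApi()
    items = api.list_cluster_custom_object("our.crd.tld", "v1", "ourcrds")["items"]
    now = datetime.now(timezone.utc)
    for obj in items:
        created = datetime.fromisoformat(
            obj["metadata"]["creationTimestamp"].replace("Z", "+00:00"))
        if "status" not in obj and now - created > MAX_AGE:
            return True
    return False

if __name__ == "__main__":
    sys.exit(1 if has_stale_unprocessed() else 0)
```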

@nolar
Owner

nolar commented Feb 24, 2021

UPD: Here is the other issue, the main one, with details and some health checking scripts (finally found it!): #204

@jeff-minard-ck
Author

Good to have that liveness check available, will look into that.

For completeness, we're on GKE @ K8S 1.15. Our requirements.txt:

```
kopf==0.28
pytest==6.1.1
pytest-mock==3.3.1
flask==1.1.2
flake8==3.8.4
gunicorn==20.0.4
```

(We also have a webhook validation running from this codebase, hence flask, etc)

And it's running in a container based on FROM python:3.9-slim.

As for a code snippet: it's very basic. We've only got the one handle_fn subscribed to the create/update/delete events for our CRD.

@jeff-minard-ck
Author

We added:

```python
import kopf

@kopf.on.startup()
def configure(settings: kopf.OperatorSettings, **_):
    settings.watching.server_timeout = 120
    settings.watching.client_timeout = 150
```

And have not missed a single event in the last 48 hours. I would hazard that the default of "never time out because we trust the server to maintain this connection forever" is probably too generous.

@nolar
Owner

nolar commented Feb 27, 2021

And I would agree, since this issue appears quite regularly. I will add some default there (not sure which one though).
