Recovering from Missed Events #698
Comments
Thanks for asking; it is a good question. First, please clarify whether these are events missed or changes not detected, i.e. on.event vs. on.create/update/delete handlers.

There is a known issue where changes made to an object while it is being processed are not always handled consistently: the earlier creation/update/deletion handlers might miss the most recent changes made within the same handling cycle (of multiple handlers) — #641 & co.

If this is about on.event low-level handling, then it intentionally skips events from the watch-streams if they arrive too fast; only the latest state is processed. This is done because K8s, in general, is not an event-driven but a state-driven system. This behaviour can be configured via the operator's batching settings.
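For reference, a minimal sketch of that configuration, assuming a Kopf version where `settings.batching` is available (0.1 seconds is believed to be the default window):

```python
import kopf

@kopf.on.startup()
def configure_batching(settings: kopf.OperatorSettings, **_):
    # Low-level watch-events arriving within this window (in seconds) are
    # collapsed, and only the latest object state is passed to the handlers.
    settings.batching.batch_window = 0.1
```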
Regarding the re-listing: this, I believe, is done by setting the client or server timeout on the watch-streams: https://kopf.readthedocs.io/en/stable/configuration/#api-timeouts. On every such reconnection, the resource kind is fully re-listed. The same happens on "410 Gone" errors (see lines 174 to 180 in eaff63c).
The default behaviour is to trust the server by not having any timeouts on the client (operator) side, and also because there is no single "good value" for all possible environments and use cases. I hope this helps. If I misunderstood the question or left something unanswered, please feel free to clarify.
We're using create/update/delete:

```python
@kopf.on.create('our.crd.tld', 'v1', 'ourcrds')
@kopf.on.update('our.crd.tld', 'v1', 'ourcrds')
@kopf.on.delete('our.crd.tld', 'v1', 'ourcrds')
def handle_fn(name, namespace, spec, reason, **_):
    ...
```
Re: #641 — I don't think we're using subhandlers. Regardless, the operator never shows the object in question being handled at all, which I think is different from missing one of the event types because they overlap.
This is a very odd statement for an operator framework that seems to rely on events to trigger handlers. Let's say I've got a … I'll have a go at the timeouts; that may be a good enough solution, but it would be nice if there was a way to trigger that behaviour more directly.
Thanks for this information. I've updated the #641 title to also refer to regular handlers, not only sub-handlers; the issue affects them all. Just to clarify: Kopf does not skip events "at all" or randomly. It skips all the individual low-level events within a quick, short batch (0.1s long), except the last one, and that last one is processed in full. That last event (the final state) should never be skipped or lost, of course. This concept of state-driven systems is described here: https://hackernoon.com/level-triggering-and-reconciliation-in-kubernetes-1f17fe30333d
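To make the batching semantics concrete, here is an illustrative debouncing loop (not Kopf's actual code, just a sketch of the idea): events arriving in quick succession are collapsed, and only the last state of the burst is handled:

```python
import asyncio

async def process(event: dict) -> None:
    # Placeholder for the real handling logic (e.g. invoking the handlers).
    print("handling the latest state:", event)

async def debounced_worker(queue: asyncio.Queue, batch_window: float = 0.1) -> None:
    while True:
        event = await queue.get()  # the first event opens a new batch
        while True:
            try:
                # As long as new events arrive within the window, keep only the latest one.
                event = await asyncio.wait_for(queue.get(), timeout=batch_window)
            except asyncio.TimeoutError:
                break  # no events for >= batch_window seconds: the batch is over
        await process(event)  # only the final state of the batch is handled
```

This sketch also makes the corner case from the next comment visible: if events keep arriving faster than the window, the inner loop never exits and the resource is never handled.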
This, however, is very strange. It should never happen. Every individual resource should be handled at least once, or at least attempted, and get the "kopf.zalando.org/last-handled-configuration" or other "kopf.zalando.org/…" annotations. Can you please hint:

- Which version of Kopf is used?
- What is the scale (number of CRs: ones, tens, hundreds, thousands)?
- What is the frequency of changes of each?
- Which K8s version (1.xx.y) and which K8s environment (GCP/EKS/self-managed/K3s/etc.)?

I'll try to reproduce it locally. Can you also please add …

PS: I have just thought: if a resource receives changes continuously within the 0.1s range and this stream never ends/stops/pauses for longer than 0.1s, such a resource will never be handled — it will always be waiting until the "batch" is finally over (where "over" means "no changes for >=0.1s"). That would be a bug, and an easy one to fix. But I am not sure how realistic such a scenario is.

Regarding the request: I think it is possible to add a controllable trigger for a full re-listing, but that adds complexity to an already complex framework. I would like to clearly understand the root cause, and why the current implementation is not as stable as intended, before doing this.
A little follow-up. I've just remembered: over the past couple of years there were one or two reports of cases where the K8s API watch-calls stopped reporting changes and went idle forever, even though the TCP connections remained alive and the cluster kept making changes remotely. It was (and is) some issue in K8s itself, where it stops sending updates to the open watch-connections, for reasons unknown. One is #232; I cannot find the other one now (local connectivity issues). The solution was to use the client-side and/or server-side timeouts, as mentioned above, and reconnect from time to time. The frequency of reconnections is operator- and cluster-specific though: more frequent reconnections are good for faster recovery from this stuck state, but bad for CPU and logs (a full re-listing happens each time). In that case, however, the updates stop for the whole resource kind, not for individual resource objects.
Oh, that sounds exactly like what we see. I hadn't mentioned it because I wanted to keep things focused, but we have also had times when the operator was still up and running but completely stopped processing objects. We're going to add the timeout and report back; based on that ticket's reports, I suspect we'll see the same level of success. (Our back-pocket hack was going to be to add a "watcher" process that would look for unprocessed objects, i.e. those missing the status field(s), and, if they got "too old", just kill the operator. A little more efficient than the timeout, since it would only happen when this bug manifests instead of at every interval.)
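A rough sketch of such a "watcher" check, using Kopf's last-handled annotation as the marker of processed objects; the resource coordinates, the age threshold, and the function name are assumptions for illustration only:

```python
from datetime import datetime, timedelta, timezone

import kubernetes

# Kopf's default annotation for the last-handled state; threshold is illustrative.
ANNOTATION = "kopf.zalando.org/last-handled-configuration"
MAX_AGE = timedelta(minutes=10)

def find_unprocessed(group: str = "our.crd.tld", version: str = "v1", plural: str = "ourcrds"):
    kubernetes.config.load_kube_config()  # or load_incluster_config() inside a pod
    api = kubernetes.client.CustomObjectsApi()
    stale = []
    for obj in api.list_cluster_custom_object(group, version, plural)["items"]:
        meta = obj["metadata"]
        created = datetime.fromisoformat(meta["creationTimestamp"].replace("Z", "+00:00"))
        handled = ANNOTATION in (meta.get("annotations") or {})
        too_old = datetime.now(timezone.utc) - created > MAX_AGE
        if not handled and too_old:
            stale.append(f"{meta.get('namespace', 'cluster')}/{meta['name']}")
    return stale  # if non-empty, a liveness check could fail and restart the operator
```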
UPD: Here is the other issue, the main one, with details and some health-checking scripts (finally found it!): #204
Good to have that liveness check available; will look into that. For completeness, we're on GKE @ K8s 1.15. Our …

(We also have webhook validation running from this codebase, hence Flask, etc.) And running under a … As for the code snippet: it's very basic. We've only got the one …
We added:

```python
@kopf.on.startup()
def configure(settings: kopf.OperatorSettings, **_):
    settings.watching.server_timeout = 120
    settings.watching.client_timeout = 150
```

And have not missed a single event in the last 48 hours. I would hazard that the default of "never time out, because we trust the server to maintain this connection forever" is probably too generous.
And I would agree, since this issue comes up quite regularly. I will add some default there (though I'm not sure which value yet).
Question
We've created an operator with kopf and it's been great -- we subscribe to our CRD creation/update/delete events, and processing works most of the time, but only most of the time.
We routinely find ourselves with instances of our CRDs which have not been processed by the operator. If we go and make a change to the object (like adding an annotation), a new event is generated and kopf will then "see" the object and process it.
What it boils down to is that kopf is missing events -- but I don't think that's a kopf issue in and of itself. Reading around suggests that events on watch streams are not inherently reliable. Several other operators I deal with handle this with a two-part approach:

1. `watch` for events and process them as needed
2. run a `list` operation and reconcile items which have been missed

Even the official docs on `watch` talk about needing to do option 2.

kopf does have something like this too, sorta: when the operator starts up, it does an initial load of all the objects and can set up all the handlers, etc. Would it make sense to make that initial "load" process something that could be invoked regularly?
Keywords
missed events, missing events