
Operator crashing after some time #188

Closed
SaikiranDaripelli opened this issue Sep 16, 2020 · 16 comments · Fixed by #235
Labels
kind/bug Categorizes issue or PR as related to a bug.

Comments

@SaikiranDaripelli

SaikiranDaripelli commented Sep 16, 2020

Hi,
We have an operator written using this SDK, and the operator pod is restarting every few hours with the exception below:

2020-09-16 07:51:39,953 i.f.k.c.d.i.WatchConnectionManager [DEBUG] Current reconnect backoff is 1000 milliseconds (T0)
2020-09-16 07:51:40,953 i.f.k.c.d.i.WatchConnectionManager [DEBUG] Connecting websocket ... io.fabric8.kubernetes.client.dsl.internal.WatchConnectionManager@71e4b308
2020-09-16 07:51:41,003 i.f.k.c.d.i.WatchConnectionManager [DEBUG] WebSocket successfully opened
2020-09-16 07:51:41,018 c.g.c.o.p.EventScheduler       [ERROR] Error:
io.fabric8.kubernetes.client.KubernetesClientException: too old resource version: 22472056 (22832853)
	at io.fabric8.kubernetes.client.dsl.internal.WatchConnectionManager$1.onMessage(WatchConnectionManager.java:257)
	at okhttp3.internal.ws.RealWebSocket.onReadMessage(RealWebSocket.java:323)
	at okhttp3.internal.ws.WebSocketReader.readMessageFrame(WebSocketReader.java:219)
	at okhttp3.internal.ws.WebSocketReader.processNextFrame(WebSocketReader.java:105)
	at okhttp3.internal.ws.RealWebSocket.loopReader(RealWebSocket.java:274)
	at okhttp3.internal.ws.RealWebSocket$2.onResponse(RealWebSocket.java:214)
	at okhttp3.RealCall$AsyncCall.execute(RealCall.java:203)
	at okhttp3.internal.NamedRunnable.run(NamedRunnable.java:32)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
	at java.base/java.lang.Thread.run(Thread.java:834)

The code I am using is:

    KubernetesClient client = new DefaultKubernetesClient();
    Operator operator = new Operator(client);
    operator.registerController(new KafkaTopicController(client));

Am I using it wrong?

@adam-sandor
Collaborator

Hi Saikiran, I don't think you're doing anything wrong. The error is happening at the fabric8 level. Let us get back to you ASAP with some ideas.

@csviri
Collaborator

csviri commented Sep 16, 2020

Hi @SaikiranDaripelli,
Thank you for the issue; you are using it correctly. Unfortunately, this is a known issue, not in our code but in the Kubernetes client we are using: fabric8io/kubernetes-client#1800. It is not handled there yet; see also
spring-cloud/spring-cloud-kubernetes#557
fabric8io/kubernetes-client#1318

In this case, the restart is a simple workaround on our side; see:

https://github.com/ContainerSolutions/java-operator-sdk/blob/39107a309514a75f1c9fed745f7aa1de1bf4301c/operator-framework/src/main/java/com/github/containersolutions/operator/processing/EventScheduler.java#L142-L148

We will try to take a look at this soon.
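
For illustration, the workaround boils down to something like this (a sketch only, not the actual EventScheduler code; the class name and logging are made up):

    import io.fabric8.kubernetes.client.KubernetesClientException;

    // Sketch: when the watch is closed with an unrecoverable error such as
    // "too old resource version" (HTTP 410), exit the process so Kubernetes
    // restarts the operator pod with a clean state.
    public class CrashOnWatchError {

        public void onWatchClosed(KubernetesClientException cause) {
            if (cause != null) {
                System.err.println("Unrecoverable watch error (HTTP " + cause.getCode() + "): " + cause.getMessage());
                System.exit(1);
            }
        }
    }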

@csviri
Collaborator

csviri commented Sep 16, 2020

@adam-sandor we could try to reconnect automatically from our side, but that should be done after the changes currently in progress.
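
For illustration, a minimal sketch of what an automatic reconnect could look like with the fabric8 4.x Watcher API (Pods are used here only as an example resource; this is one possible approach, not the SDK's implementation):

    import io.fabric8.kubernetes.api.model.Pod;
    import io.fabric8.kubernetes.client.DefaultKubernetesClient;
    import io.fabric8.kubernetes.client.KubernetesClient;
    import io.fabric8.kubernetes.client.KubernetesClientException;
    import io.fabric8.kubernetes.client.Watcher;

    public class ReconnectingWatchExample {

        private final KubernetesClient client = new DefaultKubernetesClient();

        public void startWatch() {
            client.pods().inAnyNamespace().watch(new Watcher<Pod>() {
                @Override
                public void eventReceived(Action action, Pod resource) {
                    // dispatch the event to the controller as usual
                }

                @Override
                public void onClose(KubernetesClientException cause) {
                    // HTTP 410 means the cached resourceVersion is too old to resume;
                    // start a fresh watch instead of crashing the operator.
                    if (cause != null && cause.getCode() == 410) {
                        startWatch();
                    }
                }
            });
        }

        public static void main(String[] args) {
            new ReconnectingWatchExample().startWatch();
        }
    }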

@adam-sandor
Collaborator

Yeah, it would be great if we could do something about this. I guess many users of the KubernetesClient don't have this problem because they don't watch resources for a long time, but an operator does that by definition.

@SaikiranDaripelli
Author

Thanks for answering my query. I went through the fabric8 issue, and they seem to suggest handling it on the client end.
Retries would definitely help. With restarts, I am seeing that every controller's createOrUpdate is called after a restart even though there is no change to the resource. Is that expected, and will it happen even with retries?
I am using the status sub-resource and storing the last successfully processed version in the status to avoid reprocessing. Is that the suggested way to work around duplicate events?
Will this improvement address it? https://github.com/ContainerSolutions/java-operator-sdk/issues/38

@csviri
Collaborator

csviri commented Sep 16, 2020

@SaikiranDaripelli In short, no, because by default we check whether the generation has increased, and in this case it won't have. (This check can be turned off, in which case the resource will be reprocessed, because we cannot know whether the change happened during an execution of the controller or not.) We maintain this state (the last processed generation) in memory.

The issue https://github.com/ContainerSolutions/java-operator-sdk/issues/38
is a tougher one: we cannot keep that state in memory, since the process gets restarted. It could be stored somewhere else, like a ConfigMap or some other data store. We don't plan to implement this in the short term, although we are open to any suggestions and/or contributions.
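
To illustrate, the in-memory check amounts to something like this (a simplified sketch, not the SDK's actual code):

    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;

    public class GenerationCache {

        // last processed generation per resource; lost whenever the operator pod restarts
        private final Map<String, Long> lastProcessedGeneration = new ConcurrentHashMap<>();

        public boolean shouldProcess(String resourceUid, long generation) {
            Long last = lastProcessedGeneration.get(resourceUid);
            return last == null || generation > last;
        }

        public void markProcessed(String resourceUid, long generation) {
            lastProcessedGeneration.put(resourceUid, generation);
        }
    }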

@SaikiranDaripelli
Author

Thanks, then retries without restarts will solve my current issue, with occasional reprocessing only on restart, which is fine for my use case.

Regarding storing state, couldn't the SDK itself do what I am doing right now as a workaround, i.e. store the last successfully processed generation in the status sub-resource upon successful controller execution, and discard an event if the current generation matches the one in the status sub-resource?
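
For reference, this is roughly what that workaround looks like in my controller (a sketch only; KafkaTopicStatus and observedGeneration are illustrative names, imports and the delete handling are omitted, and the exact ResourceController/UpdateControl API depends on the SDK version):

    // Skip events whose generation was already processed; record the processed
    // generation in the status sub-resource after a successful run.
    public class KafkaTopicController implements ResourceController<KafkaTopic> {

        @Override
        public UpdateControl<KafkaTopic> createOrUpdateResource(KafkaTopic topic, Context<KafkaTopic> context) {
            Long current = topic.getMetadata().getGeneration();
            Long observed = topic.getStatus() == null ? null : topic.getStatus().getObservedGeneration();
            if (observed != null && observed.equals(current)) {
                return UpdateControl.noUpdate(); // already handled this generation
            }

            // ... reconcile the topic here ...

            KafkaTopicStatus status = new KafkaTopicStatus();
            status.setObservedGeneration(current);
            topic.setStatus(status);
            return UpdateControl.updateStatusSubResource(topic);
        }

        // deleteResource(...) omitted for brevity
    }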

@csviri
Collaborator

csviri commented Sep 16, 2020

@SaikiranDaripelli This could be done, though it would be nicer if we could do it transparently. In the case you are suggesting, we should probably provide some interface for reading the last processed generation from the resource (the name of the field can differ between users). So this is definitely one of the ways to go.

We will take a look after the changes we are currently working on.

@adam-sandor
Collaborator

How about putting that into an annotation?
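
For illustration, a sketch of the annotation idea (the annotation key is hypothetical and this is not an existing SDK feature):

    import java.util.HashMap;
    import java.util.Map;

    import io.fabric8.kubernetes.api.model.ObjectMeta;

    public class ObservedGenerationAnnotation {

        static final String KEY = "operator.example.com/observed-generation";

        // true if the resource's current generation was already recorded
        public static boolean alreadyProcessed(ObjectMeta meta) {
            Map<String, String> annotations = meta.getAnnotations();
            String observed = annotations == null ? null : annotations.get(KEY);
            return observed != null && Long.valueOf(observed).equals(meta.getGeneration());
        }

        // record the processed generation on the resource itself
        public static void markProcessed(ObjectMeta meta) {
            Map<String, String> annotations =
                    meta.getAnnotations() == null ? new HashMap<>() : meta.getAnnotations();
            annotations.put(KEY, String.valueOf(meta.getGeneration()));
            meta.setAnnotations(annotations);
        }
    }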

@PookiPok

Hi, I am encountering the same issue with the resource version not matching. @SaikiranDaripelli, can you please share how you solved this on your end? Is there any other solution for this?

@SaikiranDaripelli
Author

@PookiPok Right now there is no way to stop the operator controller from restarting.

charlottemach added the kind/bug label Sep 24, 2020
@PookiPok

@SaikiranDaripelli - So is there any workaround for this for now?

@csviri
Collaborator

csviri commented Sep 29, 2020

@PookiPok @SaikiranDaripelli Restarting the controller is basically the workaround (it restarts, but at least the system does not stop working) :(

We can try to improve on this in the current version, but we are working on a big change now, where it will be easier to fix.

@PookiPok

Thank you, waiting for this fix in the next release.
Gil

@ankeetj

ankeetj commented Nov 10, 2020

@csviri

I'm facing a similar issue with my operator. Because of the restarts, the pod is ending up in a crash-loop status. Is there any update on the fix, or any workaround we can use?

@csviri
Collaborator

csviri commented Nov 10, 2020

@adam-sandor @charlottemach @kirek007 We should consider fixing this in the current version (before the event sources are released, since that might take a long time).
@ankeetj Not at this moment; we will discuss it and might provide a patch sooner than planned.

csviri linked a pull request Nov 25, 2020 that will close this issue