
Kubernetes loses the leader due to a timeout and doesn't elect a new one #6761

Open
myroch opened this issue Nov 8, 2024 · 4 comments
Labels: area/kubernetes, bug

Comments

myroch commented Nov 8, 2024

Bug description

From time to time we have a problem with our leader (master) election via Kubernetes. We are currently on Camel Quarkus 3.8.3 and Quarkus 3.8.6 LTS. There is no special configuration of the leader election, just the defaults:

quarkus.camel.cluster.kubernetes.enabled=true

Initially the application works like a charm, but later it loses the leadership and no pod holds it anymore. At this point we see the following messages in the log:

pod1 (v26rk):
2024-11-04 13:07:12,199 INFO  [org.apa.cam.com.kub.clu.loc.TimedLeaderNotifier] (Camel (camel-1) thread #8 - CamelKubernetesLeaderNotifier) The cluster has a new leader: Optional.empty
2024-11-04 13:07:12,200 INFO  [org.apa.cam.com.qua.QuartzEndpoint] (Camel (camel-1) thread #8 - CamelKubernetesLeaderNotifier) Pausing trigger ...
2024-11-04 13:07:12,200 INFO  [org.apa.cam.com.qua.QuartzEndpoint] (Camel (camel-1) thread #8 - CamelKubernetesLeaderNotifier) Deleting job ...

There is nothing else related to Kubernetes in the pod1 log. The Camel routes are down from this moment on.

pod2 (t5bt5):
2024-11-04 13:12:13,190 WARN  [org.apa.cam.com.kub.clu.loc.KubernetesLeadershipController] (Camel (camel-1) thread #2 - CamelKubernetesLeadershipController) Pod[pod2-985884674-t5bt5] Unable to retrieve the current lease resource my-lease for group my-service from Kubernetes
2024-11-04 13:52:15,345 INFO  [org.apa.cam.com.kub.clu.loc.TimedLeaderNotifier] (Camel (camel-1) thread #8 - CamelKubernetesLeaderNotifier) The cluster has a new leader: Optional.empty
2024-11-04 13:52:15,355 INFO  [org.apa.cam.com.kub.clu.loc.TimedLeaderNotifier] (Camel (camel-1) thread #8 - CamelKubernetesLeaderNotifier) The cluster has a new leader: Optional[pod1-985884674-v26rk]
2024-11-04 14:07:12,343 WARN [org.apa.cam.com.kub.clu.loc.KubernetesLeadershipController] (Camel (camel-1) thread #2 - CamelKubernetesLeadershipController) Pod[srv-mdn-patientdelivery-dev-985884674-t5bt5] Unable to retrieve the current lease resource my-lease for group my-service from Kubernetes
2024-11-04 14:07:15,130 INFO  [org.apa.cam.com.kub.clu.loc.TimedLeaderNotifier] (Camel (camel-1) thread #8 - CamelKubernetesLeaderNotifier) The cluster has a new leader: Optional.empty
...

After a new deployment everything works again. I have absolutely no idea where the bug occurs, which is why I'm reporting it here. Any ideas? I would really appreciate any help.

Thanks a lot
Miro

myroch added the bug label on Nov 8, 2024
@jamesnetherton
Contributor

Are you able to give any of the later Camel Quarkus releases a try? Like the 3.15 LTS?

@jamesnetherton
Contributor

Something else you could try to get some more debugging info would be to turn up the logging on the Kubernetes component.

This configuration should reveal the exception behind "Unable to retrieve the current lease":

quarkus.log.category."org.apache.camel.component.kubernetes.cluster.lock".level=DEBUG

Or to log all debug messages from the Kubernetes component:

quarkus.log.category."org.apache.camel.component.kubernetes".level=DEBUG

@myroch
Author

myroch commented Nov 22, 2024

Hello James, the problem exists in LTS 3.15 as well. I've enabled DEBUG logs and I can see the following exceptions:

Error while closing watcher: io.fabric8.kubernetes.client.WatcherException: The resourceVersion for the provided watch is too old.
	at io.fabric8.kubernetes.client.dsl.internal.AbstractWatchManager.onStatus(AbstractWatchManager.java:401)
	at io.fabric8.kubernetes.client.dsl.internal.AbstractWatchManager.onMessage(AbstractWatchManager.java:369)

or

Error received during lease resource lock replace: io.fabric8.kubernetes.client.KubernetesClientException: Failure executing: PUT at: https://100.68.0.1:443/apis/coordination.k8s.io/v1/namespaces/lab-mdc-leaderelection-dev/leases/lab-mdc-leaderelection-dev-mylease. Message: Operation cannot be fulfilled on leases.coordination.k8s.io "lab-mdc-leaderelection-dev-mylease": the object has been modified; please apply your changes to the latest version and try again. Received status: Status(apiVersion=v1, code=409, details=StatusDetails(causes=[], group=coordination.k8s.io, kind=leases, name=lab-mdc-leaderelection-dev-mylease, retryAfterSeconds=null, uid=null, additionalProperties={}), kind=Status, message=Operation cannot be fulfilled on leases.coordination.k8s.io "lab-mdc-leaderelection-dev-mylease": the object has been modified; please apply your changes to the latest version and try again, metadata=ListMeta(_continue=null, remainingItemCount=null, resourceVersion=null, selfLink=null, additionalProperties={}), reason=Conflict, status=Failure, additionalProperties={}).
	at io.fabric8.kubernetes.client.KubernetesClientException.copyAsCause(KubernetesClientException.java:238)
	at io.fabric8.kubernetes.client.dsl.internal.OperationSupport.waitForResult(OperationSupport.java:507)
	at io.fabric8.kubernetes.client.dsl.internal.OperationSupport.handleResponse(OperationSupport.java:524)

or

Exception thrown during lease resource lookup: io.fabric8.kubernetes.client.KubernetesClientException: The timeout period of 10000ms has been exceeded while executing GET /apis/coordination.k8s.io/v1/namespaces/lab-mdc-leaderelection-dev/leases/lab-mdc-leaderelection-dev-mylease for server null
	at io.fabric8.kubernetes.client.dsl.internal.OperationSupport.waitForResult(OperationSupport.java:509)
	at io.fabric8.kubernetes.client.dsl.internal.OperationSupport.handleResponse(OperationSupport.java:524)
	at io.fabric8.kubernetes.client.dsl.internal.OperationSupport.handleGet(OperationSupport.java:467)

Do you have any idea what I should change?
Thanks a lot for helping!
m.

@jamesnetherton
Contributor

Do you have any idea what I should change?

I'm not a super expert in this area, but there are some config options that could perhaps help if you adjust them:

quarkus.camel.cluster.kubernetes.connection-timeout-millis
quarkus.camel.cluster.kubernetes.lease-duration-millis
quarkus.camel.cluster.kubernetes.renew-deadline-millis
quarkus.camel.cluster.kubernetes.retry-period-millis

https://camel.apache.org/camel-quarkus/3.15.x/reference/extensions/kubernetes-cluster-service.html#extensions-kubernetes-cluster-service-additional-camel-quarkus-configuration
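
For example, something like the following in application.properties (the values are just an illustrative sketch, not tested recommendations; the 10000ms in the timeout exception above looks like the default connection timeout being hit):

quarkus.camel.cluster.kubernetes.connection-timeout-millis=30000
quarkus.camel.cluster.kubernetes.lease-duration-millis=60000
quarkus.camel.cluster.kubernetes.renew-deadline-millis=45000
quarkus.camel.cluster.kubernetes.retry-period-millis=10000

As a rule of thumb, keep lease-duration > renew-deadline > retry-period so the leader has enough time to renew the lease before it expires.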
