
Kubernetes loses the leader due to a timeout and doesn't elect a new one #6761

Open
myroch opened this issue Nov 8, 2024 · 4 comments
Labels: area/kubernetes, bug

Comments

myroch commented Nov 8, 2024

Bug description

From time to time we have a problem with our leader (master) election via Kubernetes. We are currently on Camel Quarkus 3.8.3 and Quarkus 3.8.6 LTS. There is no special configuration of the leader election, just the defaults:

quarkus.camel.cluster.kubernetes.enabled=true

Initially the application works like a charm, but later it loses the leadership and no pod holds it anymore. At this point we see the following messages in the log:

pod1 (v26rk):
2024-11-04 13:07:12,199 INFO  [org.apa.cam.com.kub.clu.loc.TimedLeaderNotifier] (Camel (camel-1) thread #8 - CamelKubernetesLeaderNotifier) The cluster has a new leader: Optional.empty
2024-11-04 13:07:12,200 INFO  [org.apa.cam.com.qua.QuartzEndpoint] (Camel (camel-1) thread #8 - CamelKubernetesLeaderNotifier) Pausing trigger ...
2024-11-04 13:07:12,200 INFO  [org.apa.cam.com.qua.QuartzEndpoint] (Camel (camel-1) thread #8 - CamelKubernetesLeaderNotifier) Deleting job ...

There is nothing else related to Kubernetes in the pod1 log. The Camel routes are down from this moment on.

pod2 (t5bt5):
2024-11-04 13:12:13,190 WARN  [org.apa.cam.com.kub.clu.loc.KubernetesLeadershipController] (Camel (camel-1) thread #2 - CamelKubernetesLeadershipController) Pod[pod2-985884674-t5bt5] Unable to retrieve the current lease resource my-lease for group my-service from Kubernetes
2024-11-04 13:52:15,345 INFO  [org.apa.cam.com.kub.clu.loc.TimedLeaderNotifier] (Camel (camel-1) thread #8 - CamelKubernetesLeaderNotifier) The cluster has a new leader: Optional.empty
2024-11-04 13:52:15,355 INFO  [org.apa.cam.com.kub.clu.loc.TimedLeaderNotifier] (Camel (camel-1) thread #8 - CamelKubernetesLeaderNotifier) The cluster has a new leader: Optional[pod1-985884674-v26rk]
2024-11-04 14:07:12,343 WARN [org.apa.cam.com.kub.clu.loc.KubernetesLeadershipController] (Camel (camel-1) thread #2 - CamelKubernetesLeadershipController) Pod[srv-mdn-patientdelivery-dev-985884674-t5bt5] Unable to retrieve the current lease resource my-lease for group my-service from Kubernetes
2024-11-04 14:07:15,130 INFO  [org.apa.cam.com.kub.clu.loc.TimedLeaderNotifier] (Camel (camel-1) thread #8 - CamelKubernetesLeaderNotifier) The cluster has a new leader: Optional.empty
...

After a new deployment everything works again. I have absolutely no idea where the bug occurs, which is why I'm reporting it here. Any ideas? I would really appreciate any help.

Thanks a lot
Miro

myroch added the bug label on Nov 8, 2024
@jamesnetherton
Contributor

Are you able to give any of the later Camel Quarkus releases a try? Like the 3.15 LTS?

@jamesnetherton
Contributor

Something else you could try to get some more debugging info would be to turn up the logging on the Kubernetes component.

This configuration should reveal the exception behind "Unable to retrieve the current lease":

quarkus.log.category."org.apache.camel.component.kubernetes.cluster.lock".level=DEBUG

Or to log all debug messages from the Kubernetes component:

quarkus.log.category."org.apache.camel.component.kubernetes".level=DEBUG

@myroch
Author

myroch commented Nov 22, 2024

Hello James, the problem exists in LTS 3.15 as well. I've enabled DEBUG logs and I can see the following exceptions:

Error while closing watcher: io.fabric8.kubernetes.client.WatcherException: The resourceVersion for the provided watch is too old.
	at io.fabric8.kubernetes.client.dsl.internal.AbstractWatchManager.onStatus(AbstractWatchManager.java:401)
	at io.fabric8.kubernetes.client.dsl.internal.AbstractWatchManager.onMessage(AbstractWatchManager.java:369)

or

Error received during lease resource lock replace: io.fabric8.kubernetes.client.KubernetesClientException: Failure executing: PUT at: https://100.68.0.1:443/apis/coordination.k8s.io/v1/namespaces/lab-mdc-leaderelection-dev/leases/lab-mdc-leaderelection-dev-mylease. Message: Operation cannot be fulfilled on leases.coordination.k8s.io "lab-mdc-leaderelection-dev-mylease": the object has been modified; please apply your changes to the latest version and try again. Received status: Status(apiVersion=v1, code=409, details=StatusDetails(causes=[], group=coordination.k8s.io, kind=leases, name=lab-mdc-leaderelection-dev-mylease, retryAfterSeconds=null, uid=null, additionalProperties={}), kind=Status, message=Operation cannot be fulfilled on leases.coordination.k8s.io "lab-mdc-leaderelection-dev-mylease": the object has been modified; please apply your changes to the latest version and try again, metadata=ListMeta(_continue=null, remainingItemCount=null, resourceVersion=null, selfLink=null, additionalProperties={}), reason=Conflict, status=Failure, additionalProperties={}).
	at io.fabric8.kubernetes.client.KubernetesClientException.copyAsCause(KubernetesClientException.java:238)
	at io.fabric8.kubernetes.client.dsl.internal.OperationSupport.waitForResult(OperationSupport.java:507)
	at io.fabric8.kubernetes.client.dsl.internal.OperationSupport.handleResponse(OperationSupport.java:524)

or

Exception thrown during lease resource lookup: io.fabric8.kubernetes.client.KubernetesClientException: The timeout period of 10000ms has been exceeded while executing GET /apis/coordination.k8s.io/v1/namespaces/lab-mdc-leaderelection-dev/leases/lab-mdc-leaderelection-dev-mylease for server null
	at io.fabric8.kubernetes.client.dsl.internal.OperationSupport.waitForResult(OperationSupport.java:509)
	at io.fabric8.kubernetes.client.dsl.internal.OperationSupport.handleResponse(OperationSupport.java:524)
	at io.fabric8.kubernetes.client.dsl.internal.OperationSupport.handleGet(OperationSupport.java:467)

Do you have any idea what I should change?
Thanks a lot for helping!
m.

@jamesnetherton
Contributor

Do you have any idea what I should change?

I'm not a super expert in this area, but there are some config options that could perhaps help if you adjust them:

quarkus.camel.cluster.kubernetes.connection-timeout-millis
quarkus.camel.cluster.kubernetes.lease-duration-millis
quarkus.camel.cluster.kubernetes.renew-deadline-millis
quarkus.camel.cluster.kubernetes.retry-period-millis

https://camel.apache.org/camel-quarkus/3.15.x/reference/extensions/kubernetes-cluster-service.html#extensions-kubernetes-cluster-service-additional-camel-quarkus-configuration
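
For example, something like the following in application.properties (the values are just an illustrative sketch, not tested recommendations; the 10000ms in the timeout exception above looks like the default connection timeout being hit):

quarkus.camel.cluster.kubernetes.connection-timeout-millis=30000
quarkus.camel.cluster.kubernetes.lease-duration-millis=60000
quarkus.camel.cluster.kubernetes.renew-deadline-millis=45000
quarkus.camel.cluster.kubernetes.retry-period-millis=10000

As a rule of thumb, keep lease-duration > renew-deadline > retry-period so the leader has enough time to renew the lease before it expires.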
