-
Notifications
You must be signed in to change notification settings - Fork 693
PubSubHealthIndicator reports health as DOWN in case of deadline exceeded error #2628
Comments
A deadline exceeded error could certainly indicate a valid connection issue. 25 seconds is a very long time to wait for a Pub/Sub response! |
Messages are getting consumed and processed successfully when container is in running state, I don't see any warnings/errors in application logs. |
Did these actuator-driven restarts begin at a particular point in time, or have they always been happening? |
I have been monitoring service pods since last 2 days. From the observation, the restarts are frequent when the incoming message volume is medium to high. |
This helped! What I think happens is: the business logic done in message processing takes time, and takes up the available Pub/Sub subscriber threads. The healthcheck pull call then ends up never getting scheduled, leading to the Try setting |
I am experiencing this issue as well under exactly the same conditions. GKE and when the backlog of messages is large, Kubernetes checks against We have not configured anything special in our application, i.e. vanilla sping-cloud-gcp-pubsub. I recommend changing the implementation of the health check to either check for some internal state (nr of messages successfully pulled last X seconds?) and if that signals that nothing has happened, try to pull a message. Having a default implementation that triggers this side effect seems... not good for lack of better words. |
As a quick workaround, I wonder if giving the healthcheck a dedicated PubSubTemplate/executor pool would fix all these issues. The root problem is that under high load it takes too long for a thread to get scheduled to pull. |
Hi all I had the same problem with the health endpoint and the deadline exception after some changes in the business logic. The business logic was much slower than before and there were a lot of messages in the subscriptions. I read this article: https://medium.com/google-cloud/things-i-wish-i-knew-about-google-cloud-pub-sub-part-2-b037f1f08318 and set the max outstandig element count property to 10: It looks like it's working now. And I think it could make sense since all the executor threads where busy handling the 1000 (I think thats the default value) messages in one period/stream and no one got free for doing the actuator request. All the best |
@Sslimon That's a good workaround for slow message processing! |
@elefeint The confusing part for me was that the service had no problems before the update. The changes had nothing to do with pubsub and after the deployment Kubernetes restarted the services und they didn't came up. It was hard for me to find the error and the reason why the health endpoint wasn't responding. But now I'm happy that it is running and I can make the code faster :) |
Thanks for the update! |
Revisiting this because we had another similarly behaving healthcheck contributed, and we'd like to make sure no problems are introduced. @irenavy @Sslimon @patpe Does your application explicitly list the Pub/Sub healthcheck in the |
For one of our services deployed on GKE, we have been noticing frequent container restarts because of liveliness and readiness probes failures.
With management.endpoint.health.show-details=always, we found that the status for pubsub was down with deadline exceeded error.
pubSub":{"status":"DOWN","details":{"error":"com.google.api.gax.rpc.DeadlineExceededException: io.grpc.StatusRuntimeException: DEADLINE_EXCEEDED: deadline exceeded after 24.999356599s. [buffered_nanos=25000339743, waiting_for_connection]"}
As per the implementation of PubSubHealthIndicator, it determines health by checking for a random subscription. Is it correct to consider pubsub as not ready based on deadline exceeded error?
Also, what can be causing this error?
The text was updated successfully, but these errors were encountered: