This repository has been archived by the owner on Jan 19, 2022. It is now read-only.

PubSubHealthIndicator reports health as DOWN in case of deadline exceeded error #2628

Open
irenavy opened this issue Jan 22, 2021 · 13 comments
Labels
P3 pubsub GCP PubSub

Comments

@irenavy

irenavy commented Jan 22, 2021

For one of our services deployed on GKE, we have been noticing frequent container restarts because of liveness and readiness probe failures.
With management.endpoint.health.show-details=always, we found that the status for pubsub was DOWN with a deadline exceeded error.

pubSub":{"status":"DOWN","details":{"error":"com.google.api.gax.rpc.DeadlineExceededException: io.grpc.StatusRuntimeException: DEADLINE_EXCEEDED: deadline exceeded after 24.999356599s. [buffered_nanos=25000339743, waiting_for_connection]"}

As per the implementation of PubSubHealthIndicator, it determines health by checking for a random subscription. Is it correct to consider Pub/Sub as not ready based on a deadline exceeded error?

Also, what can be causing this error?

@elefeint elefeint added the pubsub GCP PubSub label Jan 22, 2021
@elefeint
Contributor

A deadline exceeded error could certainly indicate a valid connection issue. 25 seconds is a very long time to wait for a Pub/Sub response!
Do your other Pub/Sub operations or the GCP console show issues?

@irenavy
Author

irenavy commented Jan 23, 2021

Messages are consumed and processed successfully while the container is in the running state, and I don't see any warnings or errors in the application logs.
We have observed that the /actuator/health endpoint takes more than a minute to respond.
I tried enabling debug-level logs for org.springframework.cloud.gcp, but found nothing related to this issue.
Is there anywhere else I can look to debug this issue?

@elefeint
Contributor

Did these actuator-driven restarts begin at a particular point in time, or have they always been happening?

@irenavy
Author

irenavy commented Jan 26, 2021

I have been monitoring the service pods for the last two days. From what I have observed, the restarts are frequent when the incoming message volume is medium to high.

@elefeint
Contributor

This helped! What I think happens is: the business logic done in message processing takes time and ties up the available Pub/Sub subscriber threads. The healthcheck pull call then never gets scheduled, leading to the waiting_for_connection deadline exceeded in the original message.

Try setting spring.cloud.gcp.pubsub.subscriber.executor-threads to a higher number (the default the client library ships with is 4).
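For reference, a minimal application.properties sketch of this tuning. The property name is the one from the comment above; the value 8 is only an illustrative choice, not a recommendation:

```properties
# Increase the number of threads available to the Pub/Sub subscriber executor
# (the client library default is 4). 8 is only an illustrative value.
spring.cloud.gcp.pubsub.subscriber.executor-threads=8
```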

@patpe

patpe commented Feb 19, 2021

I am experiencing this issue as well under exactly the same conditions: GKE, and when the backlog of messages is large, the Kubernetes checks against /actuator/health time out and we run into recurring restarts (which cause other issues in message processing, since our JPA entity manager is forcefully closed, etc.).

We have not configured anything special in our application, i.e. vanilla spring-cloud-gcp-pubsub. I recommend changing the implementation of the health check to first check some internal state (number of messages successfully pulled in the last X seconds?) and, only if that signals that nothing has happened, try to pull a message. Having a default implementation that triggers this side effect seems... not good, for lack of a better word.

@meltsufin
Contributor

@patpe Thanks for the feedback. We're continuing this discussion here. We've discussed ideas similar to what you're proposing but haven't gotten around to working on them. We definitely welcome contributions!

@elefeint
Contributor

elefeint commented Mar 4, 2021

As a quick workaround, I wonder if giving the healthcheck a dedicated PubSubTemplate/executor pool would fix all these issues. The root problem is that under high load it takes too long for a thread to get scheduled to pull.
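Not the dedicated PubSubTemplate/executor pool described above, but a rough sketch of a related workaround: run the health pull on its own single-thread executor and bound it with a short timeout, so /actuator/health itself cannot hang for 25+ seconds under load. This assumes the org.springframework.cloud.gcp.pubsub.core.PubSubTemplate#pull(String, Integer, Boolean) method from this project; the class name, the 5-second timeout, and the error classification are illustrative choices, not the library's implementation:

```java
import java.util.UUID;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

import com.google.api.gax.rpc.ApiException;
import com.google.api.gax.rpc.StatusCode;

import org.springframework.boot.actuate.health.Health;
import org.springframework.boot.actuate.health.HealthIndicator;
import org.springframework.cloud.gcp.pubsub.core.PubSubTemplate;

public class TimeoutPubSubHealthIndicator implements HealthIndicator {

	// Dedicated single thread for the health pull, separate from message processing.
	private final ExecutorService healthExecutor = Executors.newSingleThreadExecutor();

	private final PubSubTemplate pubSubTemplate;

	public TimeoutPubSubHealthIndicator(PubSubTemplate pubSubTemplate) {
		this.pubSubTemplate = pubSubTemplate;
	}

	@Override
	public Health health() {
		CompletableFuture<Health> check = CompletableFuture.supplyAsync(this::doCheck, this.healthExecutor);
		try {
			// Bound the check so /actuator/health responds quickly even when subscriber threads are busy.
			return check.get(5, TimeUnit.SECONDS);
		}
		catch (TimeoutException ex) {
			check.cancel(true);
			return Health.unknown().withDetail("reason", "health check timed out").build();
		}
		catch (Exception ex) {
			return Health.down(ex).build();
		}
	}

	private Health doCheck() {
		try {
			// Pull from a subscription that should not exist; a NOT_FOUND reply means Pub/Sub answered.
			this.pubSubTemplate.pull("subscription_" + UUID.randomUUID(), 1, true);
			return Health.up().build();
		}
		catch (ApiException ex) {
			StatusCode.Code code = ex.getStatusCode().getCode();
			if (code == StatusCode.Code.NOT_FOUND || code == StatusCode.Code.PERMISSION_DENIED) {
				return Health.up().build();
			}
			return Health.down(ex).build();
		}
	}
}
```

You would still need to register this as a @Bean and decide how it should coexist with (or replace) the auto-configured indicator, and the pull itself continues to share the subscriber stub, so this only keeps the endpoint responsive rather than fixing the underlying thread starvation.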

@Sslimon

Sslimon commented Aug 30, 2021

Hi all

I had the same problem with the health endpoint and the deadline exception after some changes in the business logic. The business logic was much slower than before and there were a lot of messages in the subscriptions.

I read this article: https://medium.com/google-cloud/things-i-wish-i-knew-about-google-cloud-pub-sub-part-2-b037f1f08318 and set the max outstanding element count property to 10:
spring.cloud.gcp.pubsub.[subscriber,publisher.batching].flow-control.max-outstanding-element-count=10
Description from doc: Maximum number of outstanding elements to keep in memory before enforcing flow control.

It looks like it's working now. And I think it makes sense, since all the executor threads were busy handling the 1000 messages (I think that's the default value) in one period/stream and none of them got free to serve the actuator request.
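For anyone copying this: the `[subscriber,publisher.batching]` part of the property above is documentation shorthand for two separate keys. A sketch of how that could look in application.properties, with 10 being Simon's value rather than a general recommendation:

```properties
# Limit how many pulled messages the subscriber keeps outstanding in memory at once,
# so executor threads are not all tied up with a large batch.
spring.cloud.gcp.pubsub.subscriber.flow-control.max-outstanding-element-count=10
# The publisher-side equivalent covered by the same shorthand, if batched publishing is used.
spring.cloud.gcp.pubsub.publisher.batching.flow-control.max-outstanding-element-count=10
```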

All the best
Simon

@elefeint
Contributor

@Sslimon That's a good workaround for slow message processing!

@Sslimon

Sslimon commented Aug 30, 2021

@elefeint The confusing part for me was that the service had no problems before the update. The changes had nothing to do with Pub/Sub, and after the deployment Kubernetes restarted the services and they didn't come up. It was hard for me to find the error and the reason why the health endpoint wasn't responding. But now I'm happy that it is running and I can make the code faster :)

@meltsufin
Contributor

Thanks for the update!

@elefeint
Contributor

elefeint commented Nov 9, 2021

Revisiting this because we had another similarly behaving healthcheck contributed, and we'd like to make sure no problems are introduced.

@irenavy @Sslimon @patpe Does your application explicitly list the Pub/Sub healthcheck in the management.endpoint.health.group.liveness.include property? The reason I ask is that, by default, no health indicators are included in the liveness/readiness probes, and in fact Spring Actuator cautions against liveness probes depending on external system healthchecks for this exact reason -- that Kubernetes will trigger a pod restart upon a liveness failure.
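For illustration, a hedged application.properties sketch of the probe-group configuration being asked about. The group keys and the livenessState/readinessState values are Spring Boot's standard probe groups, and pubSub is the indicator key shown in the original report; by default neither group includes it:

```properties
# Expose the Kubernetes liveness/readiness probe groups.
management.endpoint.health.probes.enabled=true
# Keep liveness limited to the application's own state (the Actuator recommendation).
management.endpoint.health.group.liveness.include=livenessState
# If the Pub/Sub check belongs in a probe at all, readiness is the safer group for it.
management.endpoint.health.group.readiness.include=readinessState,pubSub
```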
