This repository has been archived by the owner on Jan 19, 2022. It is now read-only.

PubSubHealthIndicator reports health as DOWN in case of deadline exceeded error #2628

Open
irenavy opened this issue Jan 22, 2021 · 13 comments
Labels
P3 pubsub GCP PubSub

Comments

@irenavy

irenavy commented Jan 22, 2021

For one of our services deployed on GKE, we have been noticing frequent container restarts because of liveness and readiness probe failures.
With management.endpoint.health.show-details=always, we found that the status for pubsub was DOWN with a deadline exceeded error.

pubSub":{"status":"DOWN","details":{"error":"com.google.api.gax.rpc.DeadlineExceededException: io.grpc.StatusRuntimeException: DEADLINE_EXCEEDED: deadline exceeded after 24.999356599s. [buffered_nanos=25000339743, waiting_for_connection]"}

As per the implementation of PubSubHealthIndicator, it determines health by checking for a random subscription. Is it correct to consider Pub/Sub as not ready based on a deadline exceeded error?

Also, what can be causing this error?

@elefeint elefeint added the pubsub GCP PubSub label Jan 22, 2021
@elefeint
Contributor

A deadline exceeded error could certainly indicate a valid connection issue. 25 seconds is a very long time to wait for a Pub/Sub response!
Do your other Pub/Sub operations or the GCP console show issues?

@irenavy
Author

irenavy commented Jan 23, 2021

Messages are consumed and processed successfully while the container is in the running state, and I don't see any warnings or errors in the application logs.
We have observed that the /actuator/health endpoint takes more than a minute to respond.
I tried enabling debug-level logs for org.springframework.cloud.gcp, but found nothing related to this issue.
Is there anywhere else I can look to debug this issue?

@elefeint
Contributor

Did these actuator-driven restarts begin at a particular point in time, or have they always been happening?

@irenavy
Author

irenavy commented Jan 26, 2021

I have been monitoring the service pods for the last two days. From what I have observed, the restarts are frequent when the incoming message volume is medium to high.

@elefeint
Contributor

This helped! What I think happens is: the business logic done in message processing takes time and ties up the available Pub/Sub subscriber threads. The healthcheck pull call then never gets scheduled, leading to the waiting_for_connection deadline exceeded in the original message.

Try setting spring.cloud.gcp.pubsub.subscriber.executor-threads to a higher number (the default the client library ships with is 4).
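For reference, a minimal application.properties sketch of this tuning. The property name is the one from the comment above; the value 8 is only an illustrative choice, not a recommendation:

```properties
# Increase the number of threads available to the Pub/Sub subscriber executor
# (the client library default is 4). 8 is only an illustrative value.
spring.cloud.gcp.pubsub.subscriber.executor-threads=8
```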

@patpe

patpe commented Feb 19, 2021

I am experiencing this issue as well under exactly the same conditions: GKE, and when the backlog of messages is large, the Kubernetes checks against /actuator/health time out and we run into recurring restarts (which cause other issues in message processing, since our JPA entity manager is forcefully closed, etc.).

We have not configured anything special in our application, i.e. vanilla spring-cloud-gcp-pubsub. I recommend changing the implementation of the health check to first check some internal state (number of messages successfully pulled in the last X seconds?) and, only if that signals that nothing has happened, try to pull a message. Having a default implementation that triggers this side effect seems... not good, for lack of a better word.

@meltsufin
Contributor

@patpe Thanks for the feedback. We're continuing this discussion here. We've discussed ideas similar to what you're proposing but haven't gotten around to working on them. We definitely welcome contributions!

@elefeint
Contributor

elefeint commented Mar 4, 2021

As a quick workaround, I wonder if giving the healthcheck a dedicated PubSubTemplate/executor pool would fix all these issues. The root problem is that under high load it takes too long for a thread to get scheduled to pull.
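Not the dedicated PubSubTemplate/executor pool described above, but a rough sketch of a related workaround: run the health pull on its own single-thread executor and bound it with a short timeout, so /actuator/health itself cannot hang for 25+ seconds under load. This assumes the org.springframework.cloud.gcp.pubsub.core.PubSubTemplate#pull(String, Integer, Boolean) method from this project; the class name, the 5-second timeout, and the error classification are illustrative choices, not the library's implementation:

```java
import java.util.UUID;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

import com.google.api.gax.rpc.ApiException;
import com.google.api.gax.rpc.StatusCode;

import org.springframework.boot.actuate.health.Health;
import org.springframework.boot.actuate.health.HealthIndicator;
import org.springframework.cloud.gcp.pubsub.core.PubSubTemplate;

public class TimeoutPubSubHealthIndicator implements HealthIndicator {

	// Dedicated single thread for the health pull, separate from message processing.
	private final ExecutorService healthExecutor = Executors.newSingleThreadExecutor();

	private final PubSubTemplate pubSubTemplate;

	public TimeoutPubSubHealthIndicator(PubSubTemplate pubSubTemplate) {
		this.pubSubTemplate = pubSubTemplate;
	}

	@Override
	public Health health() {
		CompletableFuture<Health> check = CompletableFuture.supplyAsync(this::doCheck, this.healthExecutor);
		try {
			// Bound the check so /actuator/health responds quickly even when subscriber threads are busy.
			return check.get(5, TimeUnit.SECONDS);
		}
		catch (TimeoutException ex) {
			check.cancel(true);
			return Health.unknown().withDetail("reason", "health check timed out").build();
		}
		catch (Exception ex) {
			return Health.down(ex).build();
		}
	}

	private Health doCheck() {
		try {
			// Pull from a subscription that should not exist; a NOT_FOUND reply means Pub/Sub answered.
			this.pubSubTemplate.pull("subscription_" + UUID.randomUUID(), 1, true);
			return Health.up().build();
		}
		catch (ApiException ex) {
			StatusCode.Code code = ex.getStatusCode().getCode();
			if (code == StatusCode.Code.NOT_FOUND || code == StatusCode.Code.PERMISSION_DENIED) {
				return Health.up().build();
			}
			return Health.down(ex).build();
		}
	}
}
```

You would still need to register this as a @Bean and decide how it should coexist with (or replace) the auto-configured indicator, and the pull itself continues to share the subscriber stub, so this only keeps the endpoint responsive rather than fixing the underlying thread starvation.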

@Sslimon

Sslimon commented Aug 30, 2021

Hi all

I had the same problem with the health endpoint and the deadline exception after some changes in the business logic. The business logic was much slower than before and there were a lot of messages in the subscriptions.

I read this article: https://medium.com/google-cloud/things-i-wish-i-knew-about-google-cloud-pub-sub-part-2-b037f1f08318 and set the max outstanding element count property to 10:
spring.cloud.gcp.pubsub.[subscriber,publisher.batching].flow-control.max-outstanding-element-count=10
Description from doc: Maximum number of outstanding elements to keep in memory before enforcing flow control.

It looks like it's working now. And I think it makes sense, since all the executor threads were busy handling the 1000 messages (I think that's the default value) in one period/stream and none of them got free to serve the actuator request.
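For anyone copying this: the `[subscriber,publisher.batching]` part of the property above is documentation shorthand for two separate keys. A sketch of how that could look in application.properties, with 10 being Simon's value rather than a general recommendation:

```properties
# Limit how many pulled messages the subscriber keeps outstanding in memory at once,
# so executor threads are not all tied up with a large batch.
spring.cloud.gcp.pubsub.subscriber.flow-control.max-outstanding-element-count=10
# The publisher-side equivalent covered by the same shorthand, if batched publishing is used.
spring.cloud.gcp.pubsub.publisher.batching.flow-control.max-outstanding-element-count=10
```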

All the best
Simon

@elefeint
Contributor

@Sslimon That's a good workaround for slow message processing!

@Sslimon

Sslimon commented Aug 30, 2021

@elefeint The confusing part for me was that the service had no problems before the update. The changes had nothing to do with Pub/Sub, and after the deployment Kubernetes restarted the services and they didn't come up. It was hard for me to find the error and the reason why the health endpoint wasn't responding. But now I'm happy that it is running and I can make the code faster :)

@meltsufin
Contributor

Thanks for the update!

@elefeint
Contributor

elefeint commented Nov 9, 2021

Revisiting this because we had another similarly behaving healthcheck contributed, and we'd like to make sure no problems are introduced.

@irenavy @Sslimon @patpe Does your application explicitly list the Pub/Sub healthcheck in the management.endpoint.health.group.liveness.include property? The reason I ask is that, by default, no health indicators are included in the liveness/readiness probes, and in fact Spring Actuator cautions against liveness probes depending on external system healthchecks for this exact reason -- that Kubernetes will trigger a pod restart upon a liveness failure.
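For illustration, a hedged application.properties sketch of the probe-group configuration being asked about. The group keys and the livenessState/readinessState values are Spring Boot's standard probe groups, and pubSub is the indicator key shown in the original report; by default neither group includes it:

```properties
# Expose the Kubernetes liveness/readiness probe groups.
management.endpoint.health.probes.enabled=true
# Keep liveness limited to the application's own state (the Actuator recommendation).
management.endpoint.health.group.liveness.include=livenessState
# If the Pub/Sub check belongs in a probe at all, readiness is the safer group for it.
management.endpoint.health.group.readiness.include=readinessState,pubSub
```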
