Description
Summary
In our largest foundation, the active BBS instance intermittently goes to 100% CPU and stops accepting requests. From the user side, the symptoms are stager errors like "Runner is unavailable:" when pushing, restarting, or staging apps. In the bbs log, errors like the following start to fill the logs:
2021/06/23 15:32:42 http: TLS handshake error from 10.10.17.188:43720: EOF
Restarting bbs (which moves all traffic to the other diego-api instance) fixes the problem. I suspect this is a resource exhaustion issue, because the error does not come up if we restart bbs about twice a week.
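To narrow down when the handshake errors begin and which clients they come from, the log could be summarized with a small script. The following is only a sketch: it assumes every error line has the same shape as the one quoted above (Go's standard log timestamp followed by "http: TLS handshake error from IP:PORT: REASON"), and the function name summarize_handshake_errors is made up for illustration.

```python
import re
from collections import Counter

# Assumed line shape, taken from the single example in this report:
# 2021/06/23 15:32:42 http: TLS handshake error from 10.10.17.188:43720: EOF
HANDSHAKE_RE = re.compile(
    r"^(?P<minute>\d{4}/\d{2}/\d{2} \d{2}:\d{2}):\d{2} "
    r"http: TLS handshake error from (?P<ip>[\d.]+):\d+: (?P<reason>.*)$"
)


def summarize_handshake_errors(lines):
    """Count TLS handshake errors per minute and per client IP.

    Hypothetical helper: lines that do not match the assumed shape are skipped.
    """
    per_minute = Counter()
    per_client = Counter()
    for line in lines:
        match = HANDSHAKE_RE.match(line.strip())
        if match is None:
            continue
        per_minute[match["minute"]] += 1
        per_client[match["ip"]] += 1
    return per_minute, per_client


if __name__ == "__main__":
    # Sample input; in practice, pass the lines of the bbs log file instead.
    sample = [
        "2021/06/23 15:32:42 http: TLS handshake error from 10.10.17.188:43720: EOF",
        "some unrelated log line",
    ]
    per_minute, per_client = summarize_handshake_errors(sample)
    for minute, count in sorted(per_minute.items()):
        print(minute, count)
    for ip, count in per_client.most_common(10):
        print(ip, count)
```

A sudden jump in the per-minute count would mark the start of an incident, and the per-client counts would show whether the failed handshakes come from a few hosts or from everywhere.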
We're running diego-release v2.49.0 (with a planned upgrade at the end of the month).
Steps to Reproduce
We don't have a way to reproduce this. However, since this may be a resource exhaustion issue, it's worth mentioning that bbs.stdout.log records around 39,000 events per minute on a normal day. We have about 10,000 apps in this foundation.
Diego repo
https://github.com/cloudfoundry/bbs
Environment Details
Versions in use:
cf-deployment v16.14.0
diego-release v2.49.0
on a bionic stemcell v1.1
Possible Causes or Fixes (optional)
I suspect a resource exhaustion issue, but I have no further insight into what might be happening.
Additional Text Output, Screenshots, contextual information (optional)
I realize this is a vague report. I am happy to collect more info if someone can guide me on what data is needed and how to get it.