bbs goes to 100% CPU after a period of time and won't accept requests #597

@pusherofbrooms

Summary

In our largest foundation, the active BBS instance intermittently goes to 100% CPU and stops accepting requests. From the user side, the symptoms are stager errors like "Runner is unavailable:" when pushing, restarting, or staging apps. In the bbs log, errors like "2021/06/23 15:32:42 http: TLS handshake error from 10.10.17.188:43720: EOF" start to fill the logs.

Restarting bbs (which moves all traffic to the other diego-api instance) fixes the problem. I suspect a resource exhaustion issue, because the error doesn't come up if we restart bbs about twice a week.

We're running diego-release v2.49.0 (with a planned upgrade at the end of the month).

Steps to Reproduce

We don't have a way to reproduce this. However, since this may be a resource exhaustion issue, it's worth mentioning that bbs.stdout.log records around 39,000 events per minute on a normal day. We have about 10,000 apps in this foundation.

Diego repo

https://github.com/cloudfoundry/bbs

Environment Details

Versions in use:
cf-deployment v16.14.0
diego-release v2.49.0
bionic stemcell v1.1

Possible Causes or Fixes (optional)

I suspect a resource exhaustion issue, but I have no further insight into what might be happening.

Additional Text Output, Screenshots, contextual information (optional)

I realize this is a vague report. I am happy to collect more info if someone can guide me on what data is needed and how to get it.
