Make BBS more resilient to API port being unavailable

## Summary

When another process has already bound BBS's `listen_addr`, a deployment can leave no BBS instances available. The deployment does not fail fast as we would expect, because BBS does not attempt to listen on the port until it becomes the active node.

## Steps to Reproduce

Deploy the otel-collector job using [this operations file](https://github.com/cloudfoundry/cf-deployment/blob/main/operations/experimental/add-otel-collector.yml) and this metric exporter config:
```yaml
prometheus:
  endpoint: 127.0.0.1:8889
  namespace: default
```
See the deploy fail after all of the BBS instances have rolled, at which point BBS becomes completely unavailable.


## Diego repo

bbs

## Environment Details 

diego-release 2.81.0 and loggregator-agent-release 7.6.0

## Possible Causes or Fixes (optional)

Causes: 
The BBS node only listens on the API port once it has claimed the lock to become the active BBS node. This means that the job can roll without ever listening on the `listen_addr`, and it won't know another process is using that address until it tries to become the active node.

Possible Fixes: 
- If the BBS failed fast when another process had taken the port, BOSH would not roll the other BBS nodes, and so the BBS would remain available.
- The BBS could try to listen on the API port on startup. We think a complication is that BBS clients use the connection refused as a way to determine if they are talking to the active BBS node. If all BBS instances started listening on the API port on startup we'd need to ensure that clients would still end up talking to the active node.
- The BBS could try to dial the port for a period and error if the connection succeeds.
- Alternatively the BBS could ask the operating system whether another process is listening on 8889 (without actually trying to listen on/dial the port). This code would probably be linux-specific.
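
As a rough illustration of the fail-fast idea, here is a minimal Go sketch of a startup probe that binds the address briefly and releases it straight away, so a standby node would still refuse connections afterwards. The names `probeListenAddr` and `holdPort` are hypothetical and not part of BBS; `holdPort` only stands in for the conflicting process. This is a sketch under those assumptions, not a proposed implementation.

```go
package main

import (
	"fmt"
	"net"
	"os"
)

// probeListenAddr is a hypothetical startup check (not existing BBS code).
// It tries to bind addr and closes the listener immediately, returning an
// error if the address cannot be bound (for example, already in use).
func probeListenAddr(addr string) error {
	ln, err := net.Listen("tcp", addr)
	if err != nil {
		return fmt.Errorf("listen address %s is unavailable: %w", addr, err)
	}
	return ln.Close()
}

// holdPort simulates another process holding a port. It binds an
// OS-assigned loopback port and keeps the listener open.
func holdPort() (net.Listener, error) {
	return net.Listen("tcp", "127.0.0.1:0")
}

func main() {
	squatter, err := holdPort()
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	defer squatter.Close()

	// The probe should report the conflict, so startup could abort here.
	if err := probeListenAddr(squatter.Addr().String()); err != nil {
		fmt.Println("startup would abort:", err)
		return
	}
	fmt.Println("listen address is free")
}
```

A probe like this only narrows the problem: another process could still take the port between the probe and the moment the node becomes active, and a client that connects during the brief bind would not see a refused connection.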

@acrmp 

