Fix unexpected route registration/unregistration messages during cf redeployment

## Summary

On our live environments, when `diego-api` is updated during a `bosh deploy ...`, some apps return `503 Service Unavailable` for a brief period.

**Chronology of events:**
1. Run `bosh deploy ...` with the regular sprint update. (All is fine so far.)
1. The `diego-api (0)` instance, which hosts the active `bbs`, shuts down. (All is fine so far.)
1. `diego-api (0)` starts again and `diego-api (1)` is shut down by `bosh`. (All is fine so far.)
1. `bbs` on `diego-api (0)` becomes the active `bbs` instance. (This is where everything catches fire 🔥.)
1. Some apps return `503 Service Unavailable` for 1-7 minutes.
1. The problem resolves itself. ✅

Errors in `gorouter`:

```json
"error":"x509: certificate is valid for <app-guid-1>, not <app-guid-2>"
```

This suggests that `route-emitter` is involved, which the metrics below support (step 4 happened around 2:30):

<img width="250" alt="Screenshot 2021-08-04 at 15 03 07" src="https://user-images.githubusercontent.com/35896427/128177497-de6f721e-bd3b-4080-a32b-cb9bc4810b9f.png">
<img width="250" alt="Screenshot 2021-08-04 at 15 03 22" src="https://user-images.githubusercontent.com/35896427/128177513-08d143da-cf21-425e-95ff-2224f23f9f1b.png">

`route-emitter` emits such registration/unregistration messages only when the `bbs` event hub sends it events describing ActualLRP changes ([code](https://github.com/cloudfoundry/route-emitter/blob/1d351b34f5c7550e16263da2368b72fcc9401a8f/routehandlers/handler.go#L244-L283)).
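
To make the suspected mechanism concrete, here is a toy model of how a single ActualLRP change could turn into register/unregister messages. This is not `route-emitter`'s actual code: the `Endpoint` type, the `messages_for_change` helper, and the rule "only `RUNNING` instances are routable" are simplifying assumptions made for illustration.

```python
from typing import List, NamedTuple, Optional, Tuple


class Endpoint(NamedTuple):
    """Hypothetical, simplified view of one ActualLRP instance."""
    instance_guid: str
    address: str  # "host:port" of the container
    state: str    # e.g. "CLAIMED", "RUNNING"


def messages_for_change(
    before: Optional[Endpoint], after: Optional[Endpoint]
) -> Tuple[List[str], List[str]]:
    """Return (addresses to register, addresses to unregister) for one change.

    Assumption: an instance is routable only while it is RUNNING.
    """
    register: List[str] = []
    unregister: List[str] = []
    was_routable = before is not None and before.state == "RUNNING"
    is_routable = after is not None and after.state == "RUNNING"

    # The old address goes away if the instance stopped being routable or moved.
    if was_routable and (not is_routable or before.address != after.address):
        unregister.append(before.address)
    # The new address is announced if the instance became routable or moved.
    if is_routable and (not was_routable or before.address != after.address):
        register.append(after.address)
    return register, unregister


running = Endpoint("instance-1", "10.0.0.1:61001", "RUNNING")
claimed = Endpoint("instance-1", "10.0.0.1:61001", "CLAIMED")

# A RUNNING -> CLAIMED -> RUNNING flip yields one unregister and one register.
print(messages_for_change(running, claimed))
print(messages_for_change(claimed, running))
```

Under this model, if the new active `bbs` re-sends state transitions for many instances at once, every cell's `route-emitter` would produce a matching burst of both message types, which is the pattern in the graphs above.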

On some environments we see the following errors within `bbs` whenever the issue occurs:

```json
"message": "bbs.request.start-actual-lrp.db-start-actual-lrp.failed-to-transition-actual-lrp-to-started",
```

On others we see:

```json
{
    "timestamp": "2021-07-29T02:33:22.404279874Z",
    "level": "error",
    "source": "bbs",
    "message": "bbs.got-error-sending-event",
    "data": {
        "error": "slow consumer"
    }
}

{
    "timestamp": "2021-07-29T02:30:55.825459239Z",
    "level": "error",
    "source": "bbs",
    "message": "bbs.request.subscribe-r0.failed-to-get-next-event",
    "data": {
        "error": "read from closed source",
        "session": "1827.1"
    }
}

# Also this one is seen but it is due to encryption key rotation and I guess it is nothing to worry about.
{
    "timestamp": "2021-07-29T02:35:37.111491890Z",
    "level": "error",
    "source": "bbs",
    "message": "bbs.encryptor.failed-to-scan-blob",
    "data": {
        "blob_columns": [
            "run_info",
            "volume_placement",
            "routes"
        ],
        "desired-key-label": "key1_2021T07a_0",
        "error": "sql: no rows in result set",
        "existing-key-label": "key1_2021T06b_0",
        "primary_key": "process_guid",
        "session": "7",
        "table_name": "desired_lrps"
    }
}
```
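
To compare environments, it can help to count which `bbs` errors show up around the failover. The sketch below tallies error-level entries by their `message` field. It assumes the raw log has one JSON object per line (the entries above are pretty-printed for readability), and `tally_errors` is a hypothetical helper name.

```python
import json
from collections import Counter


def tally_errors(lines):
    """Count error-level bbs log entries by their `message` field."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip anything that is not a JSON log line
        if not isinstance(entry, dict):
            continue
        if entry.get("source") == "bbs" and entry.get("level") == "error":
            counts[entry.get("message", "<no message>")] += 1
    return counts


# Two of the entries quoted above, collapsed to one line each.
sample = [
    '{"timestamp":"2021-07-29T02:33:22.404279874Z","level":"error",'
    '"source":"bbs","message":"bbs.got-error-sending-event",'
    '"data":{"error":"slow consumer"}}',
    '{"timestamp":"2021-07-29T02:30:55.825459239Z","level":"error",'
    '"source":"bbs","message":"bbs.request.subscribe-r0.failed-to-get-next-event",'
    '"data":{"error":"read from closed source","session":"1827.1"}}',
]

for message, count in tally_errors(sample).most_common():
    print(count, message)
```

Running this over the logs of both `diego-api` instances for the outage window would show whether `slow consumer` errors dominate on the environments where the transition errors are absent.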

**Expectation:**

Such "bursting" route registration/unregistration behaviour should be observed only during `diego-cell` updates, and even then it shouldn't lead to app downtime. In our case `diego-api` is updated separately from the `diego-cells`, so no cell is being changed when the burst happens, which is why I consider this behaviour unexpected.

## Steps to Reproduce

Not clear yet but a good starting point is:

**GIVEN** a foundation with at least 10 `diego-cells` each with ~50 containers,
**WHEN** you redeploy `cf` with `bosh` and `diego-api` gets updated,
**THEN** `route_emitter.RoutesRegistered` and `route_emitter.RoutesUnregistered` metrics rapidly increase their values for a short period of time on almost all `diego-cells` resulting in `x509: certificate is valid for <app-guid-1>, not <app-guid-2>` errors in `gorouter`.

💡 I'll work on finding a more reliable way to reproduce it.
💡 I'm also not sure whether it happens on every `bbs` failover.
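
While a reliable reproduction is missing, a burst can at least be detected from the metrics. The following is a minimal sketch that flags intervals where a metric grows much faster than usual. It assumes `route_emitter.RoutesRegistered` and `route_emitter.RoutesUnregistered` are available as cumulative counter samples; the `find_bursts` helper, the `factor` threshold, and the sample numbers are made up for illustration.

```python
from statistics import median


def find_bursts(samples, factor=5.0):
    """Flag intervals whose rate exceeds `factor` times the median rate.

    samples: list of (unix_seconds, counter_value), sorted by time.
    Returns a list of (start, end, rate_per_second) tuples.
    """
    rates = []
    for (t0, v0), (t1, v1) in zip(samples, samples[1:]):
        if t1 <= t0:
            continue  # ignore duplicate or out-of-order samples
        delta = v1 - v0
        if delta < 0:
            delta = v1  # counter was reset, e.g. the process restarted
        rates.append((t0, t1, delta / (t1 - t0)))
    if not rates:
        return []
    threshold = factor * median(rate for _, _, rate in rates)
    return [r for r in rates if r[2] > threshold]


# Illustrative data only: steady growth with one spike between t=180 and t=240.
samples = [(0, 0), (60, 10), (120, 20), (180, 30), (240, 530), (300, 540)]
for start, end, rate in find_bursts(samples):
    print(f"burst between t={start}s and t={end}s: {rate:.2f} routes/s")
```

Comparing the flagged intervals with the time at which the new `bbs` instance becomes active would help confirm whether every failover triggers the burst.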

## Diego repo

I suspect the issue comes from:
- [bbs](https://github.com/cloudfoundry/bbs)
- [route-emitter](https://github.com/cloudfoundry/route-emitter)

## Environment Details 

The issue can be seen across environments ranging from 2k to 60k app instances.
We're running `diego-release v2.50.0` and `routing-release v0.216.0`.
We've been observing the issue for around 1-2 months already.
Last observed with this `cf-deployment` [version](https://github.tools.sap/cloudfoundry/cf-deployment/tree/bffe11bd7081cd3d317fe3a7840e0c4b421780a7).

## Possible Causes or Fixes (optional)

I first thought it was due to `bbs` encryption key rotation, because `bbs.EncryptionDuration` correlated well with the duration of the outage. However, I have also seen it happen without a key rotation, so that is unlikely to be the cause.

I have also only seen it happen during regular updates and not when diego-api fails over.

My gut tells me that the new active `bbs` instance rolls back or replays events because its state was outdated. I also think `locket` might be involved, since I see a lot of errors related to it.


Fix unexpected route registration/unregistration messages during cf redeployment #582