Description
Fix unexpected route registration/unregistration messages during cf redeployment
Summary
On our live environments when diego-api is updated during a bosh deploy ... some apps experience 503 Service Unavailable for a brief time period.
Chronology of events:
1. Run bosh deploy ... with the regular sprint update. (all is fine for now)
2. The diego-api (0) with the active bbs instance shuts down. (all is fine for now)
3. diego-api (0) starts and diego-api (1) is shut down by bosh. (all is fine for now)
4. bbs from diego-api (0) becomes the active bbs instance. (and all turns on fire 🔥)
5. Some apps experience 503 Service Unavailable for 1-7 minutes.
6. All gets fixed by itself. ✅
Errors in gorouter:
"error":"x509: certificate is valid for <app-guid-1>, not <app-guid-2>"
Hence route-emitter should be involved, which could actually be observed from the metrics below (step 4 was performed around 2:30):
Such events in route-emitter occur only when the bbs hub module sends events to route-emitter with ActualLRP changes (code).
On some environments we see the following errors within bbs whenever the issue occurs:
"message": "bbs.request.start-actual-lrp.db-start-actual-lrp.failed-to-transition-actual-lrp-to-started",
On others we see:
{
"timestamp": "2021-07-29T02:33:22.404279874Z",
"level": "error",
"source": "bbs",
"message": "bbs.got-error-sending-event",
"data": {
"error": "slow consumer"
}
}
{
"timestamp": "2021-07-29T02:30:55.825459239Z",
"level": "error",
"source": "bbs",
"message": "bbs.request.subscribe-r0.failed-to-get-next-event",
"data": {
"error": "read from closed source",
"session": "1827.1"
}
}
# Also this one is seen but it is due to encryption key rotation and I guess it is nothing to worry about.
{
"timestamp": "2021-07-29T02:35:37.111491890Z",
"level": "error",
"source": "bbs",
"message": "bbs.encryptor.failed-to-scan-blob",
"data": {
"blob_columns": [
"run_info",
"volume_placement",
"routes"
],
"desired-key-label": "key1_2021T07a_0",
"error": "sql: no rows in result set",
"existing-key-label": "key1_2021T06b_0",
"primary_key": "process_guid",
"session": "7",
"table_name": "desired_lrps"
}
}
Expectation:
Such "bursting" route registration/unregistration behaviour should be observed only during diego-cell updates and shouldn't lead to app downtime. In our case only diego-api is updated separately from the diego-cells which is why I would classify it as a strange behaviour.
Steps to Reproduce
Not clear yet but a good starting point is:
GIVEN a foundation with at least 10 diego-cells each with ~50 containers,
WHEN you redeploy cf with bosh and diego-api gets updated,
THEN route_emitter.RoutesRegistered and route_emitter.RoutesUnregistered metrics rapidly increase their values for a short period of time on almost all diego-cells resulting in x509: certificate is valid for <app-guid-1>, not <app-guid-2> errors in gorouter.
💡 I'll work on finding a better way to reproduce it.
💡 I'm also not sure whether it happens during each failover of bbs.
Diego repo
My suspicions are that the issue is coming from the:
Environment Details
The issue can be seen amongst environments ranging from 2k to 60k app instances.
We're running diego-release v2.50.0 and routing-release v0.216.0.
We've been observing the issue for around 1-2 months already.
Last observed with this cf-deployment version.
Possible Causes or Fixes (optional)
I initially thought it was due to bbs encryption key rotation, because bbs.EncryptionDuration correlated well with the duration of the outage. However, I have also seen it happen without a key rotation, so that should not be the cause.
I have also only seen it happen during regular updates and not when diego-api fails over.
My gut tells me the new active bbs instance performs a rollback of events because it had been outdated. I also think locket might be involved, since I see a lot of errors related to it.

