Fix unexpected route registration/unregistration messages during cf redeployment #582

@IvanHristov98

Summary

On our live environments, when diego-api is updated during a bosh deploy ..., some apps return 503 Service Unavailable for a brief period.

Chronology of events:

  1. Run bosh deploy ... with the regular sprint update. (all is fine for now)
  2. The diego-api (0) with the active bbs instance shuts down. (all is fine for now)
  3. diego-api (0) starts and diego-api (1) is shut down by bosh. (all is fine for now)
  4. bbs from diego-api (0) becomes the active bbs instance. (and all turns on fire 🔥)
  5. Some apps experience 503 Service Unavailable for 1-7 minutes.
  6. All gets fixed by itself. ✅

Errors in gorouter:

"error":"x509: certificate is valid for <app-guid-1>, not <app-guid-2>"
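
One way to read this error is as a symptom of a stale route. The toy model below is only a sketch under that assumption: the router remembers which instance it expects behind a backend address, while the container now running there presents a certificate naming a different instance. The addresses, the dictionaries, and the `check_backend` helper are hypothetical and are not gorouter code.

```python
# Toy model (not gorouter code): the router keeps the identity it expects
# behind each backend address; each backend presents a certificate naming
# the instance that is actually running there.

def check_backend(route_table, backends, address):
    """Return None if identities match, else an error string in the observed format."""
    expected = route_table[address]   # what the router believes lives at this address
    presented = backends[address]     # what the backend's certificate is valid for
    if presented == expected:
        return None
    return f"x509: certificate is valid for {presented}, not {expected}"


# Hypothetical address. The router still has app-guid-2 registered here...
route_table = {"10.0.1.5:61001": "app-guid-2"}
# ...but the address is now served by a container of app-guid-1, and the
# router has not yet processed the matching unregister/register messages.
backends = {"10.0.1.5:61001": "app-guid-1"}

print(check_backend(route_table, backends, "10.0.1.5:61001"))
```

In this model the mismatch disappears as soon as the route table catches up, which would be consistent with the outage resolving by itself.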

Hence route-emitter is likely involved, which is indeed visible in the metrics below (step 4 happened around 2:30):

Screenshot 2021-08-04 at 15 03 07

Screenshot 2021-08-04 at 15 03 22

Such events in route-emitter occur only when the bbs hub module sends events to route-emitter with ActualLRP changes (code).

On some environments we see the following errors within bbs whenever the issue occurs:

"message": "bbs.request.start-actual-lrp.db-start-actual-lrp.failed-to-transition-actual-lrp-to-started",

On others we see:

{
    "timestamp": "2021-07-29T02:33:22.404279874Z",
    "level": "error",
    "source": "bbs",
    "message": "bbs.got-error-sending-event",
    "data": {
        "error": "slow consumer"
    }
}

{
    "timestamp": "2021-07-29T02:30:55.825459239Z",
    "level": "error",
    "source": "bbs",
    "message": "bbs.request.subscribe-r0.failed-to-get-next-event",
    "data": {
        "error": "read from closed source",
        "session": "1827.1"
    }
}

# Also this one is seen but it is due to encryption key rotation and I guess it is nothing to worry about.
{
    "timestamp": "2021-07-29T02:35:37.111491890Z",
    "level": "error",
    "source": "bbs",
    "message": "bbs.encryptor.failed-to-scan-blob",
    "data": {
        "blob_columns": [
            "run_info",
            "volume_placement",
            "routes"
        ],
        "desired-key-label": "key1_2021T07a_0",
        "error": "sql: no rows in result set",
        "existing-key-label": "key1_2021T06b_0",
        "primary_key": "process_guid",
        "session": "7",
        "table_name": "desired_lrps"
    }
}
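
To line these errors up with the failover timeline, it may help to count bbs error messages per minute. The following is a minimal sketch, assuming the logs are available as one JSON object per line with the `timestamp`, `level`, `source` and `message` fields shown above; the `errors_per_minute` helper is hypothetical.

```python
import json
from collections import Counter


def errors_per_minute(lines, source="bbs"):
    """Count error-level entries per (minute, message) for one log source."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip lines that are not JSON
        if not isinstance(entry, dict):
            continue
        if entry.get("level") != "error" or entry.get("source") != source:
            continue
        # Truncate the RFC 3339 timestamp to minute precision, e.g. "2021-07-29T02:33".
        minute = str(entry.get("timestamp", ""))[:16]
        counts[(minute, entry.get("message", ""))] += 1
    return counts


# Two of the entries from this issue, collapsed to one line each.
sample = [
    '{"timestamp":"2021-07-29T02:33:22.404279874Z","level":"error","source":"bbs",'
    '"message":"bbs.got-error-sending-event","data":{"error":"slow consumer"}}',
    '{"timestamp":"2021-07-29T02:30:55.825459239Z","level":"error","source":"bbs",'
    '"message":"bbs.request.subscribe-r0.failed-to-get-next-event",'
    '"data":{"error":"read from closed source","session":"1827.1"}}',
]

for (minute, message), n in sorted(errors_per_minute(sample).items()):
    print(minute, message, n)
```

Sorting by minute makes it easier to see whether the "slow consumer" and subscribe errors cluster around the moment the new bbs instance becomes active.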

Expectation:

Such bursts of route registration/unregistration should be observed only during diego-cell updates and shouldn't lead to app downtime. In our case diego-api is updated separately from the diego-cells, which is why I would classify this as strange behaviour.

Steps to Reproduce

Not clear yet but a good starting point is:

GIVEN a foundation with at least 10 diego-cells each with ~50 containers,
WHEN you redeploy cf with bosh and diego-api gets updated,
THEN route_emitter.RoutesRegistered and route_emitter.RoutesUnregistered metrics rapidly increase their values for a short period of time on almost all diego-cells resulting in x509: certificate is valid for <app-guid-1>, not <app-guid-2> errors in gorouter.

💡 I'll work on finding a better way to reproduce it.
💡 I'm also not sure whether it happens during each failover of bbs.
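
Until there is a reliable reproduction, the THEN condition above could be checked mechanically against exported metric samples. This is a rough sketch, assuming the metric is sampled as a cumulative counter per diego-cell; the `find_bursts` helper, the threshold, and the sample values are all made up for illustration.

```python
def find_bursts(samples, threshold):
    """Return (timestamp, delta) pairs where a cumulative counter grew by
    more than `threshold` between two consecutive samples."""
    bursts = []
    for (_, prev), (ts, cur) in zip(samples, samples[1:]):
        delta = cur - prev
        if delta < 0:
            continue  # counter went backwards (e.g. process restart); ignore
        if delta > threshold:
            bursts.append((ts, delta))
    return bursts


# Made-up samples of a counter such as route_emitter.RoutesRegistered on one cell.
samples = [
    ("02:28", 1000),
    ("02:29", 1003),
    ("02:30", 1950),
    ("02:31", 2900),
    ("02:32", 2904),
]

print(find_bursts(samples, threshold=100))
# prints [('02:30', 947), ('02:31', 950)]
```

Running the same check per cell and comparing the flagged timestamps with the bbs failover time would show whether the bursts really start at step 4.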

Diego repo

My suspicions are that the issue is coming from the:

Environment Details

The issue can be seen across environments ranging from 2k to 60k app instances.
We're running diego-release v2.50.0 and routing-release v0.216.0.
We've been observing the issue for around 1-2 months already.
Last observed with this cf-deployment version.

Possible Causes or Fixes (optional)

I initially thought it was due to bbs encryption key rotation because bbs.EncryptionDuration correlated well with the duration of the outage. However, I have also seen it happen without a key rotation, so that shouldn't be the cause.

I have also only seen it happen during regular updates and not when diego-api fails over.

My gut tells me the new active bbs instance performs a rollback of events because it was outdated. I also think locket might be involved in all of this, since I see a lot of errors related to it.
