Bug fix: deadlocked worker pool threads during a controlplane connection failure. #3487
…hreads which calls the control plane gets blocked
// If API is not found (404), then there is no point in setting the control plane status as unhealthy.
if data.ErrorCode != 404 {
	health.SetControlPlaneRestAPIStatus(false)
Do we set this back to healthy somewhere once the failure has occurred and been resolved?
The idea here is that if the status code is 401 or 403, we don't retry; instead we let the adapter be killed, since the error is unrecoverable. That was the initial thought. But now I'm wondering whether we should do such a thing in the first place. Maybe a log alert is the correct way to move forward.
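The options discussed in this thread can be sketched as a small decision function. This is only an illustration and not the adapter's actual code: the `decision` type and `decide` function are hypothetical names, and the choice to raise a log alert for 401/403 without marking the control plane unhealthy is an assumption based on the alternative suggested above.

```go
package main

import "net/http"

// decision is a hypothetical summary of how to react to a control plane response.
type decision struct {
	Retry         bool // try the request again later
	MarkUnhealthy bool // flip the control plane REST API health status to false
	Alert         bool // emit a log alert for operators
}

// decide maps an HTTP status code to a reaction (assumed policy, for illustration).
func decide(statusCode int) decision {
	switch statusCode {
	case http.StatusNotFound:
		// API not found: nothing is wrong with the control plane itself.
		return decision{}
	case http.StatusUnauthorized, http.StatusForbidden:
		// Unrecoverable by retrying: alert instead of killing the adapter.
		return decision{Alert: true}
	default:
		// Treated as a transient failure: retry and report unhealthy meanwhile.
		return decision{Retry: true, MarkUnhealthy: true}
	}
}
```

With this shape, the health status would only need to be set back to healthy after failures that went through the default branch.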
…tinue to run in its remaining state. If the adapter restarts in the middle, then there is a possibility that all the gateways could be down
c235d22 to 1a0c256
[succeeded] Dataplane(NorthEU) cluster : dev-deployment-v2 : 20240129.9
[succeeded] Dataplane(EastUS) cluster : dev-deployment-v2 : 20240129.9
[succeeded] Controlplane cluster : dev-deployment-v2 : 20240129.9
[succeeded] Dataplane(NorthEU) cluster : stage-deployment-v2 : 20240201.1
[succeeded] Dataplane(EastUS) cluster : stage-deployment-v2 : 20240201.1
[succeeded] Controlplane cluster : stage-deployment-v2 : 20240201.1
[succeeded] Controlplane cluster : prod-deployment-v2 : 20240201.3
[succeeded] Dataplane(EastUS) cluster : prod-deployment-v2 : 20240201.3
[succeeded] Dataplane(NorthEU) cluster : prod-deployment-v2 : 20240201.3
Purpose
Bug fix: when the retry count reaches its maximum limit, the worker threads that call the control plane get blocked.
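As a rough illustration of the intended behavior (not the code in this PR), the sketch below shows a worker pool in which a call that exhausts its retries returns an error, so the worker is released to pick up the next job. All names (`callWithRetry`, `runPool`, `errControlPlaneDown`) are hypothetical.

```go
package main

import (
	"errors"
	"sync"
	"time"
)

// errControlPlaneDown stands in for a control plane connection failure.
var errControlPlaneDown = errors.New("control plane unreachable")

// callWithRetry invokes call up to maxAttempts times and then returns the last
// error instead of waiting indefinitely, so the calling worker is always released.
func callWithRetry(call func() error, maxAttempts int, delay time.Duration) error {
	if maxAttempts < 1 {
		maxAttempts = 1
	}
	var err error
	for attempt := 1; attempt <= maxAttempts; attempt++ {
		if err = call(); err == nil {
			return nil
		}
		if attempt < maxAttempts {
			time.Sleep(delay)
		}
	}
	return err
}

// runPool processes jobs with a fixed number of workers and returns how many
// jobs failed. A failed job never stops a worker from taking the next one.
func runPool(jobs []func() error, workers, maxAttempts int) int {
	queue := make(chan func() error)
	var wg sync.WaitGroup
	var mu sync.Mutex
	failed := 0
	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for job := range queue {
				if err := callWithRetry(job, maxAttempts, time.Millisecond); err != nil {
					mu.Lock()
					failed++
					mu.Unlock()
				}
			}
		}()
	}
	for _, job := range jobs {
		queue <- job
	}
	close(queue)
	wg.Wait()
	return failed
}
```

If a worker instead blocked once the retry limit was reached, the send on `queue` would eventually block as well, because no worker would be left to receive from it.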
Issues
Fixes #
Automation tests
Tested environments
Tested locally as follows.
Started choreo-product-apim locally.
Then started choreo-connect locally, pointing it at the local choreo-product-apim and an ASB connection string (choreo-connect ran in PDP mode).
Stopped choreo-product-apim to replicate a network-unreachable type of failure.
Sent deploy events via ASB.
Monitored the behavior with and without the fix. With the fix, the worker threads did not block.
Maintainers: Check before merge