Bug fix: deadlocked worker pool threads during a controlplane connection failure. #3487

Merged
merged 2 commits into from
Jan 29, 2024

@VirajSalaka VirajSalaka commented Jan 26, 2024

Purpose

Bug fix: when the retry count reaches its maximum limit, the worker threads that call the control plane get blocked.

Issues

Fixes #

Automation tests

  • Unit tests added: No
  • Integration tests added: No

Tested environments

Tested locally as follows:

  • Started choreo-product-apim locally.
  • Started choreo-connect locally, pointing it to the local choreo-product-apim and an ASB connection string (choreo-connect ran in PDP mode).
  • Stopped choreo-product-apim to replicate a network-unreachable type of failure.
  • Sent deploy events via ASB.

Monitored the behavior with and without the fix. With the fix, the worker threads did not block.


Maintainers: Check before merge

  • Assigned 'Type' label
  • Assigned the project
  • Validated respective github issues
  • Assigned milestone to the github issue(s)

…hreads which calls the control plane gets blocked

// If API is not found (404), then there is no point in setting the control plane status as unhealthy.
if data.ErrorCode != 404 {
	health.SetControlPlaneRestAPIStatus(false)

@renuka-fernando renuka-fernando Jan 26, 2024

Do we set this back to healthy somewhere once this has happened and been resolved?
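
To make the question concrete, here is a hedged sketch of one way a health flag could recover: every successful call sets it back to healthy. It assumes a local `controlPlaneHealthy` flag and a hypothetical `recordResult` helper; the real `health` package is not shown in this PR, so this is not a description of how it behaves.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// controlPlaneHealthy is a local stand-in for the state behind
// health.SetControlPlaneRestAPIStatus; the real package is not shown in the PR.
var controlPlaneHealthy atomic.Bool

// recordResult is a hypothetical helper that updates the flag after every
// control plane call, based on the HTTP status code of the response.
func recordResult(statusCode int) {
	switch {
	case statusCode >= 200 && statusCode < 300:
		// Recovery path: any successful call marks the control plane healthy again.
		controlPlaneHealthy.Store(true)
	case statusCode == 404:
		// API not found: this says nothing about control plane health, so leave the flag alone.
	default:
		controlPlaneHealthy.Store(false)
	}
}

func main() {
	for _, code := range []int{200, 503, 404, 200} {
		recordResult(code)
		fmt.Printf("status %d -> healthy=%v\n", code, controlPlaneHealthy.Load())
	}
}
```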

Contributor Author

The idea here is that if the status code is 401 or 403, we don't retry; instead we let the adapter be killed, since the error is unrecoverable. That was the initial thought. But now I'm wondering whether we should do such a thing in the first place. Maybe a log alert is the correct way to move forward.
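
A rough sketch of the "log alert instead of killing the adapter" option, assuming the decision is made purely from the HTTP status code of a failed control plane call. The `action` type and `classify` function are hypothetical names for illustration and are not taken from the adapter code.

```go
package main

import (
	"fmt"
	"log"
	"net/http"
)

// action is a hypothetical enum describing what to do after a failed control plane call.
type action int

const (
	actionRetry action = iota // transient failure: retry up to the configured limit
	actionSkip                // nothing to retry, and not a health problem
	actionAlert               // unrecoverable: raise a log alert, keep the adapter running
)

// classify maps the status code of a failed (non-2xx) response to an action.
func classify(statusCode int) action {
	switch statusCode {
	case http.StatusUnauthorized, http.StatusForbidden:
		return actionAlert
	case http.StatusNotFound:
		return actionSkip
	default:
		return actionRetry
	}
}

func main() {
	for _, code := range []int{401, 403, 404, 503} {
		switch classify(code) {
		case actionAlert:
			log.Printf("ALERT: control plane returned %d; not retrying", code)
		case actionSkip:
			fmt.Printf("%d: nothing to retry, health status unchanged\n", code)
		case actionRetry:
			fmt.Printf("%d: retry (up to the configured limit)\n", code)
		}
	}
}
```

Keeping the decision in one function makes it easy to change the 401/403 policy later without touching the retry loop.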

…tinue to run in its remaining state. If the adapter restarts in the middle, then there is a possibility that all the gateways could be down
@VirajSalaka VirajSalaka merged commit 895732a into wso2:choreo Jan 29, 2024
2 of 3 checks passed
@choreo-cicd

[succeeded] Dataplane(NorthEU) cluster : dev-deployment-v2 : 20240129.9

@choreo-cicd

[succeeded] Dataplane(EastUS) cluster : dev-deployment-v2 : 20240129.9

@choreo-cicd

[succeeded] Controlplane cluster : dev-deployment-v2 : 20240129.9

@choreo-cicd

[succeeded] Dataplane(NorthEU) cluster : stage-deployment-v2 : 20240201.1

@choreo-cicd

[succeeded] Dataplane(EastUS) cluster : stage-deployment-v2 : 20240201.1

@choreo-cicd

[succeeded] Controlplane cluster : stage-deployment-v2 : 20240201.1

@choreo-cicd

[succeeded] Controlplane cluster : prod-deployment-v2 : 20240201.3

@choreo-cicd

[succeeded] Dataplane(EastUS) cluster : prod-deployment-v2 : 20240201.3

@choreo-cicd

[succeeded] Dataplane(NorthEU) cluster : prod-deployment-v2 : 20240201.3


3 participants