
Handle existing workspace directories better #552

Merged (4 commits) Mar 12, 2024
Conversation

@JDTX0 (Contributor) commented Feb 16, 2024

Proposed changes

I've changed the stack reconciliation code to clean up existing workspace directories when it finds them, instead of treating them as a lock and failing forever. We've seen numerous cases where the operator leaves a workspace directory behind for an unknown reason, which causes the stack to fail reconciliation indefinitely. The only way to resolve the issue is to remove the directory manually or restart the entire operator pod.

The operator shouldn't treat directories as locks, in my opinion. Given that stacks are processed by one thread at a time and Pulumi has its own state lock files (for both SaaS and self-hosted backends), an existing directory shouldn't block reconciliation and cause a failure that never resolves itself.

Also, I fixed a few typos. Let me know if I've missed anything or if there are any concerns with this approach.


PR is now waiting for a maintainer to run the acceptance tests. This PR will only perform build and linting.
Note for the maintainer: To run the acceptance tests, please comment /run-acceptance-tests on the PR


@rquitales (Member) commented Mar 11, 2024

@JDTX0, thank you for your contribution, and I apologize for the oversight in reviewing this PR earlier. Upon further examination of the codebase, I agree with your assessment that it is guaranteed that only 1 thread/process can process a stack at a time when using a single replica for our operator deployment. Furthermore, even if a user were to adjust the replica count of their operator deployment manifests, they would likely encounter other issues due to the lack of native support for horizontal scaling. As such, it seems reasonable to remove the rudimentary lock check based on the assumption of the existence of the work directory.

However, I'm concerned with why these directories are not being cleaned up initially. If you have any insights into why this occurs in your deployments, that would be a helpful starting point for my investigations.

@rquitales (Member)

/run-acceptance-tests

@rquitales rquitales self-assigned this Mar 11, 2024
@rquitales rquitales self-requested a review March 11, 2024 23:48
@JDTX0 (Contributor, Author) commented Mar 12, 2024

@rquitales Thanks for the reply and review! We found that the operator was losing leader election because its lease renewal timed out; the timeouts turned out to be caused by etcd performance degradation.

The election failure caused the container to restart and leave directories behind, which led to this issue. So it's not directly a problem in the operator code.

Regardless, I'd still very much like this change merged as it makes the operator more resilient to a myriad of possible problems, lease-related or otherwise.

@rquitales rquitales merged commit 6ea80df into pulumi:master Mar 12, 2024
5 checks passed