-
Notifications
You must be signed in to change notification settings - Fork 56
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Handle existing workspace directories better #552
Conversation
PR is now waiting for a maintainer to run the acceptance tests. This PR will only perform build and linting. |
PR is now waiting for a maintainer to run the acceptance tests. This PR will only perform build and linting. |
PR is now waiting for a maintainer to run the acceptance tests. This PR will only perform build and linting. |
@JDTX0, thank you for your contribution, and I apologize for the oversight in reviewing this PR earlier. Upon further examination of the codebase, I agree with your assessment that it is guaranteed that only 1 thread/process can process a stack at a time when using a single replica for our operator deployment. Furthermore, even if a user were to adjust the replica count of their operator deployment manifests, they would likely encounter other issues due to the lack of native support for horizontal scaling. As such, it seems reasonable to remove the rudimentary lock check based on the assumption of the existence of the work directory. However, I'm concerned with why these directories are not being cleaned up initially. If you have any insights into why this occurs in your deployments, that would be a helpful starting point for my investigations. |
/run-acceptance-tests |
@rquitales Thanks for the reply and review! We found that the operator was losing the leader election as the lease renewal timed out. The lease renewal turned out to be a problem with etcd having performance degradation. The election failure caused the container to restart and leave behind directories which leads to this issue. So, not directly a problem in the operator code. Regardless, I'd still very much like this change merged as it makes the operator more resilient to a myriad of possible problems, lease-related or otherwise. |
Proposed changes
I've changed the stack reconciliation code to clean up existing workspace directories when it finds them instead of treating them as a lock and failing forever. We've seen numerous times where the operator leaves a workspace directory behind for some unknown reason and it causes the stack to fail to reconcile forever. The only way to resolve the issue is to remove the directory manually or restart the entire operator pod.
The operator shouldn't treat directories as locks in my opinion. Given that they're processed by 1 thread at a time and Pulumi has its own state lock files (for SaaS & self-hosted backends), an existing directory shouldn't block the reconciliation and cause a failure that never resolves itself.
Also, I fixed a few typos. Let me know if there's anything I've missed for this or if there's any concerns with this approach.