
Handle existing workspace directories better #552

Merged (4 commits) Mar 12, 2024
Conversation

@JDTX0 (Contributor) commented Feb 16, 2024

Proposed changes

I've changed the stack reconciliation code to clean up existing workspace directories when it finds them, instead of treating them as a lock and failing forever. We've seen numerous cases where the operator leaves a workspace directory behind for an unknown reason, which causes the stack to fail reconciliation indefinitely. The only way to resolve the issue is to remove the directory manually or restart the entire operator pod.

The operator shouldn't treat directories as locks, in my opinion. Given that stacks are processed by one thread at a time and Pulumi has its own state lock files (for both SaaS and self-hosted backends), an existing directory shouldn't block reconciliation and cause a failure that never resolves itself.

Also, I fixed a few typos. Let me know if I've missed anything or if there are any concerns with this approach.


PR is now waiting for a maintainer to run the acceptance tests. This PR will only perform build and linting.
Note for the maintainer: To run the acceptance tests, please comment /run-acceptance-tests on the PR


@rquitales (Member) commented Mar 11, 2024

@JDTX0, thank you for your contribution, and I apologize for the oversight in reviewing this PR earlier. Upon further examination of the codebase, I agree with your assessment that it is guaranteed that only 1 thread/process can process a stack at a time when using a single replica for our operator deployment. Furthermore, even if a user were to adjust the replica count of their operator deployment manifests, they would likely encounter other issues due to the lack of native support for horizontal scaling. As such, it seems reasonable to remove the rudimentary lock check based on the assumption of the existence of the work directory.

However, I'm concerned with why these directories are not being cleaned up initially. If you have any insights into why this occurs in your deployments, that would be a helpful starting point for my investigations.

@rquitales (Member)

/run-acceptance-tests

@rquitales rquitales self-assigned this Mar 11, 2024
@rquitales rquitales self-requested a review March 11, 2024 23:48
@JDTX0 (Contributor, Author) commented Mar 12, 2024

@rquitales Thanks for the reply and review! We found that the operator was losing leader election because its lease renewal timed out; the timeouts turned out to be caused by etcd performance degradation.

The election failure caused the container to restart and leave directories behind, which led to this issue. So it's not directly a problem in the operator code.

Regardless, I'd still very much like this change merged as it makes the operator more resilient to a myriad of possible problems, lease-related or otherwise.

@rquitales rquitales merged commit 6ea80df into pulumi:master Mar 12, 2024
5 checks passed