Add more StackController loop hardening #34
Merged
While building out CI on GitHub Actions (#32), we noticed that some runners often failed because excess reconcile loops caused resource starvation, which in turn crashed the Pulumi CLI. In contrast, local client runs on a bigger VM all succeeded.
This PR helps cut down on the errors seen in the current setup, and helps produce a passing CI in GHA.
fix(stack-cntlr): use NoGenerationPredicate to update objs w/ changed spec
We fetch the latest object before every operation that needs it, but
updates can still fail when parallel loops are running.
This predicate helps cut down on errors during object updates.
See:
- https://github.com/operator-framework/operator-lib/blob/main/predicate/nogeneration.go#L29-L34
- Modeled after the Ansible controller in operator-sdk
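As a rough illustration of the filtering idea, here is a minimal, standard-library-only sketch. It assumes the linked predicate admits an update only when neither the old nor the new object carries a generation, and that it is combined with a generation-changed check. The `object`, `updateEvent`, and `shouldReconcile` names are hypothetical stand-ins, not the controller-runtime or operator-lib types.

```go
package main

import "fmt"

// object is a hypothetical stand-in for the metadata of a Kubernetes object.
type object struct {
	Name       string
	Generation int64
}

// updateEvent models an update notification carrying the old and new object.
type updateEvent struct {
	Old, New object
}

// generationChanged reports whether the generation differs between versions,
// which for most resources means the spec was modified.
func generationChanged(e updateEvent) bool {
	return e.Old.Generation != e.New.Generation
}

// noGeneration reports whether neither version carries a generation value,
// so a generation comparison alone could never detect a change.
func noGeneration(e updateEvent) bool {
	return e.Old.Generation == 0 && e.New.Generation == 0
}

// shouldReconcile admits an update if either check passes (assumed combination).
func shouldReconcile(e updateEvent) bool {
	return generationChanged(e) || noGeneration(e)
}

func main() {
	events := []updateEvent{
		// Status-only update: generation unchanged, so the event is dropped.
		{Old: object{"stack-a", 3}, New: object{"stack-a", 3}},
		// Spec update: generation bumped, so the event is reconciled.
		{Old: object{"stack-a", 3}, New: object{"stack-a", 4}},
		// Object without generation tracking: always reconciled.
		{Old: object{"pod-b", 0}, New: object{"pod-b", 0}},
	}
	for _, e := range events {
		fmt.Printf("%s gen %d -> %d reconcile=%v\n",
			e.New.Name, e.Old.Generation, e.New.Generation, shouldReconcile(e))
	}
}
```

In this model, status-only writes made by the controller itself no longer trigger another loop, which is the kind of excess reconcile the commit aims to reduce.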
fix(stack-cntlr): do not requeue after addingFinalizer
NoGenerationPredicate helps prevent the errors we were seeing during
addFinalizer, so there is no need to requeue here.
tests: don't run Ginkgo tests in parallel
Ginkgo can run in parallel, but doing so spins up separate `go test`
processes and an operator for each worker / CPU core.
This creates competing operators in the ephemeral GKE test cluster,
and they process the same Stack CRs. This ultimately causes concurrency issues that
lead to nondeterministic update states.
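One way such contention can surface is through optimistic-concurrency conflicts: two writers read the same version of an object, and whichever writes second is rejected. The sketch below is a toy in-memory model under that assumption. The `stack`, `store`, and `errConflict` names are hypothetical, and it does not reproduce the operator's actual update path.

```go
package main

import (
	"errors"
	"fmt"
)

// errConflict is a hypothetical error returned when a write is based on a stale read.
var errConflict = errors.New("conflict: object was modified")

// stack is a simplified stand-in for a Stack CR with a version counter.
type stack struct {
	ResourceVersion int
	State           string
}

// store is a toy in-memory object store holding a single stack.
type store struct{ current stack }

// get returns a copy of the stored object.
func (s *store) get() stack { return s.current }

// update accepts the write only if it is based on the latest version.
func (s *store) update(obj stack) error {
	if obj.ResourceVersion != s.current.ResourceVersion {
		return errConflict
	}
	obj.ResourceVersion++
	s.current = obj
	return nil
}

func main() {
	s := &store{current: stack{ResourceVersion: 1, State: "pending"}}

	// Two competing operators read the same version of the object.
	a := s.get()
	b := s.get()
	a.State = "succeeded"
	b.State = "failed"

	// The first write wins; the second is based on a stale version.
	fmt.Println("operator A:", s.update(a))
	fmt.Println("operator B:", s.update(b))
	fmt.Println("final state:", s.get().State)
}
```

Which operator wins depends only on timing, so the final state can differ from run to run when several operators share a cluster.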
Spawning a single operator in Ginkgo to share amongst a set of tests
would be ideal, but Ginkgo does not support running shared services in a
global context for the entirety of the test suite.
Ultimately, the operator will need to be configured with leader election
to settle contention between multiple operator instances. Once available,
this should allow Ginkgo to run in parallel again.
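To make the leader-election idea concrete, the following is a toy in-process sketch in which several instances race for a single lease and only the holder does work. It is an illustration under stated assumptions, not how the operator or any Kubernetes library implements it. The `lease` type and `runInstances` helper are hypothetical, and a real lease would also need expiry and renewal.

```go
package main

import (
	"fmt"
	"sync"
)

// lease is a hypothetical in-memory lock that records a single holder.
type lease struct {
	mu     sync.Mutex
	holder string
}

// tryAcquire claims the lease if it is free and reports whether id holds it.
func (l *lease) tryAcquire(id string) bool {
	l.mu.Lock()
	defer l.mu.Unlock()
	if l.holder == "" {
		l.holder = id
	}
	return l.holder == id
}

// runInstances starts one goroutine per instance and returns the IDs of the
// instances that won the lease and would therefore process Stack CRs.
func runInstances(ids []string) []string {
	l := &lease{}
	var (
		wg      sync.WaitGroup
		mu      sync.Mutex
		leaders []string
	)
	for _, id := range ids {
		wg.Add(1)
		go func(id string) {
			defer wg.Done()
			if l.tryAcquire(id) {
				mu.Lock()
				leaders = append(leaders, id)
				mu.Unlock()
			}
		}(id)
	}
	wg.Wait()
	return leaders
}

func main() {
	leaders := runInstances([]string{"operator-0", "operator-1", "operator-2"})
	fmt.Println("instances processing Stack CRs:", len(leaders))
}
```

Whichever instance acquires the lease first becomes the only one that reconciles, so parallel test workers could each start an operator without competing over the same CRs.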
See:
Related: #29