Add more StackController loop hardening #34

metral · 2020-07-29T23:18:02Z

While building out CI on GitHub Actions (#32), we noticed that some runners often failed because excess reconcile loops caused resource starvation, which in turn crashed the Pulumi CLI. By contrast, local client runs on a bigger VM all succeeded.

This PR cuts down on the errors seen in the current setup and helps produce a passing CI run in GHA.

fix(stack-cntlr): use NoGenerationPredicate to update objs w/ changed spec


We fetch the latest object before every operation that needs it, but
updates can still fail when parallel loops are running.

This predicate helps to cut down on the errors during updates to the objects.

This allows a controller to update objects that may have had their spec changed but, because the object does
not use a generation, will inform on that change in some other manner.

This predicate can be useful by itself, but is intended to be used in conjunction with
sigs.k8s.io/controller-runtime/pkg/predicate.GenerationChangedPredicate to allow update events on all potentially
changed objects, those that respect Generation semantics or those that do not:

See:
- https://github.com/operator-framework/operator-lib/blob/main/predicate/nogeneration.go#L29-L34
- Modeled after the Ansible controller in operator-sdk
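The way the two predicates combine can be sketched as follows. This is a minimal, stdlib-only model and not the operator-lib or controller-runtime code: `updateEvent`, `noGeneration`, `generationChanged`, and `shouldReconcile` are hypothetical names, and the generation checks reflect my reading of the quoted description (an object that "does not use a generation" is assumed to report generation 0).

```go
package main

import "fmt"

// updateEvent is a hypothetical stand-in for an update event, reduced to
// the generation values of the old and new object.
type updateEvent struct {
	oldGeneration int64
	newGeneration int64
}

// noGeneration admits updates for objects that do not use a generation
// (assumed here to mean both values are zero), because a spec change on
// such objects cannot be detected by comparing generations.
func noGeneration(e updateEvent) bool {
	return e.oldGeneration == 0 && e.newGeneration == 0
}

// generationChanged admits updates whose generation differs, which signals
// a spec change for objects that respect generation semantics.
func generationChanged(e updateEvent) bool {
	return e.oldGeneration != e.newGeneration
}

// shouldReconcile ORs the two checks so that update events pass for all
// potentially changed objects, whether or not they use a generation.
func shouldReconcile(e updateEvent) bool {
	return noGeneration(e) || generationChanged(e)
}

func main() {
	fmt.Println(shouldReconcile(updateEvent{0, 0})) // true: object has no generation
	fmt.Println(shouldReconcile(updateEvent{3, 4})) // true: generation was bumped
	fmt.Println(shouldReconcile(updateEvent{4, 4})) // false: generation unchanged
}
```

The third case is the one that trims loop churn: an update that leaves the generation untouched on a generation-aware object is filtered out instead of triggering another reconcile.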

fix(stack-cntlr): do not requeue after addingFinalizer

NoGenerationPredicate helps prevent the errors we were seeing during
addFinalizer, so there is no need to requeue here.
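As a rough illustration of the behavior this commit describes, the sketch below adds a finalizer and returns without asking for another pass. It is a stdlib-only model under stated assumptions: `stack`, `result`, `finalizerName`, and `reconcile` are hypothetical and do not reproduce the controller's actual code or its control flow after the finalizer is added.

```go
package main

import "fmt"

// finalizerName is a hypothetical finalizer identifier used for illustration.
const finalizerName = "example.com/stack-finalizer"

// stack is a hypothetical, simplified Stack object holding only finalizers.
type stack struct {
	finalizers []string
}

// result is a hypothetical reconcile result; requeue asks for another pass.
type result struct {
	requeue bool
}

// hasFinalizer reports whether the stack already carries the finalizer.
func hasFinalizer(s *stack) bool {
	for _, f := range s.finalizers {
		if f == finalizerName {
			return true
		}
	}
	return false
}

// reconcile adds the finalizer when it is missing and does not request a
// requeue afterwards; the rest of the reconcile work is elided.
func reconcile(s *stack) result {
	if !hasFinalizer(s) {
		s.finalizers = append(s.finalizers, finalizerName)
	}
	return result{requeue: false}
}

func main() {
	s := &stack{}
	res := reconcile(s)
	fmt.Println(res.requeue, len(s.finalizers))
}
```

Calling `reconcile` a second time on the same object leaves the finalizer list unchanged, so repeated passes stay idempotent.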

tests: don't run Ginkgo tests in parallel

Ginkgo can run in parallel, but doing so spins up separate go test
processes and an operator for each worker / CPU core.

This creates competing operators in the ephemeral GKE test cluster,
and they all process the same Stack CRs. This ultimately causes concurrency
issues that lead to nondeterministic update states.

Spawning a single operator in Ginkgo to share amongst a set of tests
would be ideal, but Ginkgo does not support running shared services in a
global context for the entirety of the test suite.

Ultimately, the operator will need to be configured with leader election
to settle contention between multiple operator instances. Once that is
available, Ginkgo should be able to run in parallel again.
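To make the contention argument concrete, here is a toy, stdlib-only sketch of how a single lease keeps all but one operator idle. It is not how the operator or controller-runtime implements leader election: `lease`, `tryAcquire`, and `runOperators` are hypothetical names, and the in-memory atomic stands in for a lock object that a real deployment would keep in the cluster.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// lease is a hypothetical in-memory stand-in for a leader-election lock.
type lease struct {
	holder int32 // 0 means unheld; otherwise the ID of the current leader
}

// tryAcquire atomically claims the lease for id if nobody holds it yet.
func (l *lease) tryAcquire(id int32) bool {
	return atomic.CompareAndSwapInt32(&l.holder, 0, id)
}

// runOperators starts n competing operators against one lease and returns,
// per operator ID, how many stacks that operator processed.
func runOperators(n int, stacks []string) map[int32]int {
	var (
		l         lease
		mu        sync.Mutex
		wg        sync.WaitGroup
		processed = make(map[int32]int)
	)
	for id := int32(1); id <= int32(n); id++ {
		wg.Add(1)
		go func(id int32) {
			defer wg.Done()
			if !l.tryAcquire(id) {
				return // not the leader: stay idle instead of touching the stacks
			}
			mu.Lock()
			defer mu.Unlock()
			processed[id] = len(stacks) // only the leader handles the Stack CRs
		}(id)
	}
	wg.Wait()
	return processed
}

func main() {
	processed := runOperators(4, []string{"stack-a", "stack-b"})
	fmt.Println(len(processed)) // 1: exactly one operator did any work
}
```

Which operator wins the lease varies between runs, but only one ever processes the stacks, which is the property that would let parallel Ginkgo workers coexist.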

See:
- #33
- operator-framework/operator-sdk#3585
- https://onsi.github.io/ginkgo/#parallel-specs
- https://docs.openshift.com/container-platform/4.5/operators/operator_sdk/osdk-leader-election.html

Related: #29

metral requested review from lblackstone and lukehoban July 29, 2020 23:18

metral mentioned this pull request Jul 29, 2020

Add CI to build, test, and package the operator #32

Merged

metral self-assigned this Jul 29, 2020

lukehoban approved these changes Jul 30, 2020

pkg/controller/stack/stack_controller.go Outdated

metral added 3 commits July 30, 2020 18:54

fix(stack-cntlr): do not requeue after addingFinalizer

5754446

NoGenerationPredicate helps prevent the errors we were seeing during addFinalizer, so there is no need to requeue here.

metral force-pushed the metral/addr-parallel-issues branch from 47e938e to bb21b25 Compare July 30, 2020 18:54

metral merged commit 05e5ade into master Jul 30, 2020

pulumi-bot deleted the metral/addr-parallel-issues branch July 30, 2020 18:58

leezen modified the milestones: current, 0.41 Aug 11, 2020
