Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add more StackController loop hardening #34

Merged
merged 3 commits into from
Jul 30, 2020
Merged

Conversation

metral
Copy link
Contributor

@metral metral commented Jul 29, 2020

In the efforts to build out CI on GithubActions (#32), we noticed that some runners were often hitting failures due to excess loops causing resource starvation, and in turn crashing the Pulumi CLI. Contrast to local client runs on a bigger VM were all successful.

This PR helps cut down on the errors seen in the current setup, and helps produce a passing CI in GHA.


Related: #29

pkg/controller/stack/stack_controller.go Outdated Show resolved Hide resolved
metral added 3 commits July 30, 2020 18:54
… spec

We fetch the latest object before all necessary operations, but it can
still cause errors on updates when parallel loops are running.

This predicate helps to cut down on the errors during updates to the objects.

> This allows a controller to update objects that may have had their spec changed but, because the object does
> not use a generation, will inform on that change in some other manner.
>
> This predicate can be useful by itself, but is intended to be used in conjunction with
> sigs.k8s.io/controller-runtime/pkg/predicate.GenerationChangedPredicate to allow update events on all potentially
> changed objects, those that respect Generation semantics or those that do not:

See: https://github.com/operator-framework/operator-lib/blob/main/predicate/nogeneration.go#L29-L34
NoGenerationPredicate helps prevent the errors we were seeing during
addFinalizer, so there is no need to requeue here.
Ginkgo can run in parallel, but doing so spins up separate `go test`
processes and an operator for *each* worker / CPU core.

This creates competing operators that fight to process the same Stack CRs, and
causes concurrency issues that ultimately lead to indeterministic update states.

Spawning a single operator in Ginkgo to share amongst a set of tests
would be ideal, but Ginkgo does not support running shared services in a
global context for the entirety of the test suite.

Ultimately, the operator will need to be configured with leader election
to settle contention between multiple Operator instances. Once available,
this should allow ginkgo to run in parallel again.

See:
 - #33
 - operator-framework/operator-sdk#3585
 - https://onsi.github.io/ginkgo/#parallel-specs
 - https://docs.openshift.com/container-platform/4.5/operators/operator_sdk/osdk-leader-election.html
@metral metral force-pushed the metral/addr-parallel-issues branch from 47e938e to bb21b25 Compare July 30, 2020 18:54
@metral metral merged commit 05e5ade into master Jul 30, 2020
@pulumi-bot pulumi-bot deleted the metral/addr-parallel-issues branch July 30, 2020 18:58
@leezen leezen modified the milestones: current, 0.41 Aug 11, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants