
Merge spot tests to a single test#193

Merged
openshift-merge-robot merged 2 commits into openshift:master from JoelSpeed:single-spot-test
Oct 31, 2020

Conversation

@JoelSpeed
Contributor

This means that the test suite only needs to spin up a single MachineSet for spot instances.

This is an easy alternative to having a BeforeAll, which will hopefully come in Ginkgo v2; once it does, we can revert this and change the BeforeEach to a BeforeAll.

My hope is that, by reducing the number of spot instances we need to spin up, this will improve the reliability of this test on Azure, which seems to have some issues bringing up spot instances.

I'd suggest reviewing this with whitespace changes ignored; it should make a lot more sense that way 🤞

CC @kwoodson
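
For illustration, here is a rough Ginkgo sketch of the shape this change takes, assuming hypothetical helpers (createSpotMachineSet, verifySpotLabelsAndTaints, verifyTerminationHandler, deleteMachineSet) that stand in for the real utilities in this repo:

package spot_test

import (
	"testing"

	. "github.com/onsi/ginkgo"
	. "github.com/onsi/gomega"
)

func TestSpot(t *testing.T) {
	RegisterFailHandler(Fail)
	RunSpecs(t, "Spot Suite")
}

// Hypothetical helpers standing in for the real utilities in this repo.
func createSpotMachineSet() (string, error)       { return "spot-machineset", nil }
func deleteMachineSet(name string) error          { return nil }
func verifySpotLabelsAndTaints(name string) error { return nil }
func verifyTerminationHandler(name string) error  { return nil }

var _ = Describe("Running on Spot", func() {
	var machineSetName string

	// Ginkgo v1 runs BeforeEach before every It, so creating the MachineSet
	// only once means folding all the spot assertions into a single It.
	// Ginkgo v2's BeforeAll (inside an Ordered container) would let the specs
	// be split back out without recreating the MachineSet per spec.
	BeforeEach(func() {
		var err error
		machineSetName, err = createSpotMachineSet()
		Expect(err).ToNot(HaveOccurred())
	})

	AfterEach(func() {
		Expect(deleteMachineSet(machineSetName)).To(Succeed())
	})

	It("handles spot instances correctly", func() {
		By("verifying spot labels and taints are applied to the node")
		Expect(verifySpotLabelsAndTaints(machineSetName)).To(Succeed())

		By("verifying the termination handler is running")
		Expect(verifyTerminationHandler(machineSetName)).To(Succeed())
	})
})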

@elmiko elmiko left a comment
Contributor

this seems reasonable to me, do we have any data yet that would demonstrate the stability?

@JoelSpeed
Contributor Author

/test e2e-azure-operator
/test e2e-aws-operator
/test e2e-gcp-operator

Retesting to check stability

this seems reasonable to me, do we have any data yet that would demonstrate the stability?

I was hoping that the metric added in openshift/machine-api-operator#640 would provide this, but it's not merged yet.

This has come from repeated observation that the spot test is failing, and from some analysis by @kwoodson, who suspects that bringing up spot instances is causing much of the Azure CI flakiness.

@JoelSpeed
Contributor Author

Different tests failed for Azure and GCP

/test e2e-gcp-operator
/test e2e-aws-operator

@openshift-ci-robot

@JoelSpeed: The following tests failed, say /retest to rerun all failed tests:

Test name Commit Details Rerun command
ci/prow/e2e-gcp-operator a073157 link /test e2e-gcp-operator
ci/prow/e2e-aws-operator a073157 link /test e2e-aws-operator
ci/prow/e2e-azure-operator a073157 link /test e2e-azure-operator

Full PR test history. Your PR dashboard.


Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@JoelSpeed
Contributor Author

Last Azure test failed for CPU limit reasons; this has now been fixed!
/retest

@JoelSpeed
Contributor Author

/retest

@JoelSpeed
Contributor Author

/test e2e-gcp-operator
/test e2e-aws-operator
/test e2e-azure-operator

1 similar comment

@JoelSpeed
Contributor Author

/retest

@kwoodson

@JoelSpeed Looks like the test failed when the autoscaler was scaling back down from 6.

• Failure [1416.664 seconds]
[Feature:Machines] Autoscaler should
/go/src/github.com/openshift/cluster-api-actuator-pkg/pkg/autoscaler/autoscaler.go:214
  use a ClusterAutoscaler that has 12 maximum total nodes count and balance similar nodes enabled
  /go/src/github.com/openshift/cluster-api-actuator-pkg/pkg/autoscaler/autoscaler.go:490
    scales up and down while respecting MaxNodesTotal [Slow][Serial] [It]
    /go/src/github.com/openshift/cluster-api-actuator-pkg/pkg/autoscaler/autoscaler.go:521

    Timed out after 360.000s.
    Error: Unexpected non-nil/non-zero extra argument at index 1:
    	<*fmt.wrapError>: &fmt.wrapError{msg:"error getting node from machine \"ci-op-9i19wc9q-d57e2-kxndq4vmr6-556fl\": nodes \"ci-op-9i19wc9q-d57e2-kxndq4vmr6-556fl\" not found", err:(*errors.StatusError)(0xc000603040)}

    /go/src/github.com/openshift/cluster-api-actuator-pkg/pkg/autoscaler/autoscaler.go:589

@JoelSpeed
Contributor Author

/retest

@kwoodson Hoping to resolve that flakiness with #195.

@JoelSpeed
Contributor Author

/retest

@JoelSpeed
Contributor Author

@elmiko In the last 24 hours this test suite has run 4 times for this PR, and none of the failures were related to spot. On the MAO repo it ran 11 times yesterday and saw 2 failures from spot on Azure. That is possibly not statistically significant, but on the other hand I can't really see how this change would make things worse: it reduces the number of VMs we are spinning up, so it should make the suite faster and less prone to slow VM creation when we create many VMs at once.

@JoelSpeed
Contributor Author

/retest

That said, it did just fail on spot, but for a different reason, which I believe is unrelated to the issue we are trying to fix here.

@elmiko
Contributor

elmiko commented Oct 20, 2020

makes sense to me @JoelSpeed, i'm good with this change
/approve

@openshift-ci-robot

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: elmiko

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci-robot added the approved label (Indicates a PR has been approved by an approver from all required OWNERS files.) on Oct 20, 2020
@JoelSpeed
Contributor Author

/retest

2 similar comments

@kwoodson

@JoelSpeed I'm unfamiliar with this test failure. Looks like a timeout but I'm not sure why.

Unexpected error:
    <*errors.errorString | 0xc00053b990>: {
        s: "error getting ValidatingWebhookConfiguration \"machine-api\": timed out waiting for the condition",
    }
    error getting ValidatingWebhookConfiguration "machine-api": timed out waiting for the condition
occurred
/go/src/github.com/openshift/cluster-api-actuator-pkg/pkg/operators/machine-api-operator.go:173

@JoelSpeed
Contributor Author

/retest

@kwoodson This test failure is really odd, but I think it's because we are running multiple disruptive tests in parallel.

There are a couple of tests that affect the VWC: one deletes it and one modifies it, and both check that the controller puts it back. These shouldn't run in parallel; they will clash with one another. I'll take a look at a quick fix by merging the disruptive tests into a single test.

https://github.com/openshift/cluster-api-actuator-pkg/blob/master/pkg/operators/machine-api-operator.go
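
As a rough illustration of what that quick fix could look like, here is a sketch of a single serialized spec covering both disruptive checks; the helper names (deleteWebhookConfiguration, mutateWebhookConfiguration, webhookConfigurationIsRestored) are hypothetical stand-ins, not the actual code in machine-api-operator.go:

package operators_test

import (
	"testing"
	"time"

	. "github.com/onsi/ginkgo"
	. "github.com/onsi/gomega"
)

func TestOperators(t *testing.T) {
	RegisterFailHandler(Fail)
	RunSpecs(t, "Operators Suite")
}

// Hypothetical helpers standing in for the real client calls.
func deleteWebhookConfiguration(name string) error     { return nil }
func mutateWebhookConfiguration(name string) error     { return nil }
func webhookConfigurationIsRestored(name string) error { return nil }

var _ = Describe("[Serial][Disruptive] machine-api webhook configuration", func() {
	It("is recreated and reverted by the operator", func() {
		// Run the two disruptive operations sequentially in one spec so they
		// can never race each other on the same ValidatingWebhookConfiguration.
		By("deleting the webhook configuration and waiting for it to be recreated")
		Expect(deleteWebhookConfiguration("machine-api")).To(Succeed())
		Eventually(func() error {
			return webhookConfigurationIsRestored("machine-api")
		}, 5*time.Minute, 10*time.Second).Should(Succeed())

		By("modifying the webhook configuration and waiting for it to be reverted")
		Expect(mutateWebhookConfiguration("machine-api")).To(Succeed())
		Eventually(func() error {
			return webhookConfigurationIsRestored("machine-api")
		}, 5*time.Minute, 10*time.Second).Should(Succeed())
	})
})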

@JoelSpeed
Contributor Author

/retest

@Danil-Grigorev

/lgtm

@openshift-ci-robot added the lgtm label (Indicates that a PR is ready to be merged.) on Oct 30, 2020
@openshift-bot

/retest

Please review the full test history for this PR and help us cut down flakes.

12 similar comments

@openshift-merge-robot merged commit 45ec974 into openshift:master on Oct 31, 2020