
Merge spot tests to a single test#193

Merged
openshift-merge-robot merged 2 commits into openshift:master from JoelSpeed:single-spot-test
Oct 31, 2020

Conversation

@JoelSpeed
Contributor

This means that the test suite only needs to spin up a single MachineSet for spot instances.

This is an easy alternative to having a BeforeAll, which will hopefully come in Ginkgo v2; once it does, we can revert this and change the BeforeEach to a BeforeAll.

My hope is that, by reducing the number of spot instances we need to spin up, this will improve the reliability of this test on Azure, which seems to have some issues bringing up spot instances.

I'd suggest reviewing this with whitespace changes ignored; it should make a lot more sense that way 🤞

CC @kwoodson
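
For illustration, here is a rough Ginkgo sketch of the shape this change takes, assuming hypothetical helpers (createSpotMachineSet, verifySpotLabelsAndTaints, verifyTerminationHandler, deleteMachineSet) that stand in for the real utilities in this repo:

package spot_test

import (
	"testing"

	. "github.com/onsi/ginkgo"
	. "github.com/onsi/gomega"
)

func TestSpot(t *testing.T) {
	RegisterFailHandler(Fail)
	RunSpecs(t, "Spot Suite")
}

// Hypothetical helpers standing in for the real utilities in this repo.
func createSpotMachineSet() (string, error)       { return "spot-machineset", nil }
func deleteMachineSet(name string) error          { return nil }
func verifySpotLabelsAndTaints(name string) error { return nil }
func verifyTerminationHandler(name string) error  { return nil }

var _ = Describe("Running on Spot", func() {
	var machineSetName string

	// Ginkgo v1 runs BeforeEach before every It, so creating the MachineSet
	// only once means folding all the spot assertions into a single It.
	// Ginkgo v2's BeforeAll (inside an Ordered container) would let the specs
	// be split back out without recreating the MachineSet per spec.
	BeforeEach(func() {
		var err error
		machineSetName, err = createSpotMachineSet()
		Expect(err).ToNot(HaveOccurred())
	})

	AfterEach(func() {
		Expect(deleteMachineSet(machineSetName)).To(Succeed())
	})

	It("handles spot instances correctly", func() {
		By("verifying spot labels and taints are applied to the node")
		Expect(verifySpotLabelsAndTaints(machineSetName)).To(Succeed())

		By("verifying the termination handler is running")
		Expect(verifyTerminationHandler(machineSetName)).To(Succeed())
	})
})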

@elmiko elmiko left a comment
Contributor

this seems reasonable to me, do we have any data yet that would demonstrate the stability?

@JoelSpeed
Contributor Author

/test e2e-azure-operator
/test e2e-aws-operator
/test e2e-gcp-operator

Retesting to check stability

this seems reasonable to me, do we have any data yet that would demonstrate the stability?

I was hoping that the metric added in openshift/machine-api-operator#640 would provide this, but it's not merged yet.

This has come from repeated observation that the spot test is failing, and from some analysis by @kwoodson, who suspects that bringing up spot instances is causing much of the Azure CI flakiness.

@JoelSpeed
Contributor Author

Different tests failed for Azure and GCP

/test e2e-gcp-operator
/test e2e-aws-operator

@openshift-ci-robot

@JoelSpeed: The following tests failed, say /retest to rerun all failed tests:

Test name Commit Details Rerun command
ci/prow/e2e-gcp-operator a073157 link /test e2e-gcp-operator
ci/prow/e2e-aws-operator a073157 link /test e2e-aws-operator
ci/prow/e2e-azure-operator a073157 link /test e2e-azure-operator

Full PR test history. Your PR dashboard.


Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@JoelSpeed
Contributor Author

Last Azure test failed for CPU limit reasons; this has now been fixed!
/retest

@JoelSpeed
Contributor Author

/retest

@JoelSpeed
Contributor Author

/test e2e-gcp-operator
/test e2e-aws-operator
/test e2e-azure-operator

1 similar comment

@JoelSpeed
Contributor Author

/retest

@kwoodson

@JoelSpeed Looks like the test failed when the autoscaler was scaling back down from 6.

• Failure [1416.664 seconds]
[Feature:Machines] Autoscaler should
/go/src/github.com/openshift/cluster-api-actuator-pkg/pkg/autoscaler/autoscaler.go:214
  use a ClusterAutoscaler that has 12 maximum total nodes count and balance similar nodes enabled
  /go/src/github.com/openshift/cluster-api-actuator-pkg/pkg/autoscaler/autoscaler.go:490
    scales up and down while respecting MaxNodesTotal [Slow][Serial] [It]
    /go/src/github.com/openshift/cluster-api-actuator-pkg/pkg/autoscaler/autoscaler.go:521

    Timed out after 360.000s.
    Error: Unexpected non-nil/non-zero extra argument at index 1:
    	<*fmt.wrapError>: &fmt.wrapError{msg:"error getting node from machine \"ci-op-9i19wc9q-d57e2-kxndq4vmr6-556fl\": nodes \"ci-op-9i19wc9q-d57e2-kxndq4vmr6-556fl\" not found", err:(*errors.StatusError)(0xc000603040)}

    /go/src/github.com/openshift/cluster-api-actuator-pkg/pkg/autoscaler/autoscaler.go:589

@JoelSpeed
Contributor Author

/retest

@kwoodson Hoping to resolve that flakiness with #195.

@JoelSpeed
Contributor Author

/retest

@JoelSpeed
Contributor Author

@elmiko In the last 24 hours this test suite has run 4 times for this PR, and none of the failures were related to spot. On the MAO repo it ran 11 times yesterday and saw 2 failures from spot on Azure. That is possibly not statistically significant, but on the other hand I can't really see how this change would make things worse: it reduces the number of VMs we are spinning up, so it should make the suite faster and less prone to slow VM creation when we create many VMs at once.

@JoelSpeed
Contributor Author

/retest

That said, it did just fail on spot, but for a different reason, which I believe is unrelated to the issue we are trying to fix here.

@elmiko
Contributor

elmiko commented Oct 20, 2020

makes sense to me @JoelSpeed, i'm good with this change
/approve

@openshift-ci-robot

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: elmiko

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci-robot added the approved label (Indicates a PR has been approved by an approver from all required OWNERS files.) on Oct 20, 2020
@JoelSpeed
Contributor Author

/retest

2 similar comments

@kwoodson

@JoelSpeed I'm unfamiliar with this test failure. Looks like a timeout but I'm not sure why.

Unexpected error:
    <*errors.errorString | 0xc00053b990>: {
        s: "error getting ValidatingWebhookConfiguration \"machine-api\": timed out waiting for the condition",
    }
    error getting ValidatingWebhookConfiguration "machine-api": timed out waiting for the condition
occurred
/go/src/github.com/openshift/cluster-api-actuator-pkg/pkg/operators/machine-api-operator.go:173

@JoelSpeed
Contributor Author

/retest

@kwoodson This test failure is really odd, but I think it's because we are running multiple disruptive tests in parallel.

There are a couple of tests that affect the VWC: one deletes it and one modifies it, and both check that the controller puts it back. These shouldn't run in parallel; they will clash with one another. I'll take a look at a quick fix by merging the disruptive tests into a single test.

https://github.com/openshift/cluster-api-actuator-pkg/blob/master/pkg/operators/machine-api-operator.go
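
As a rough illustration of what that quick fix could look like, here is a sketch of a single serialized spec covering both disruptive checks; the helper names (deleteWebhookConfiguration, mutateWebhookConfiguration, webhookConfigurationIsRestored) are hypothetical stand-ins, not the actual code in machine-api-operator.go:

package operators_test

import (
	"testing"
	"time"

	. "github.com/onsi/ginkgo"
	. "github.com/onsi/gomega"
)

func TestOperators(t *testing.T) {
	RegisterFailHandler(Fail)
	RunSpecs(t, "Operators Suite")
}

// Hypothetical helpers standing in for the real client calls.
func deleteWebhookConfiguration(name string) error     { return nil }
func mutateWebhookConfiguration(name string) error     { return nil }
func webhookConfigurationIsRestored(name string) error { return nil }

var _ = Describe("[Serial][Disruptive] machine-api webhook configuration", func() {
	It("is recreated and reverted by the operator", func() {
		// Run the two disruptive operations sequentially in one spec so they
		// can never race each other on the same ValidatingWebhookConfiguration.
		By("deleting the webhook configuration and waiting for it to be recreated")
		Expect(deleteWebhookConfiguration("machine-api")).To(Succeed())
		Eventually(func() error {
			return webhookConfigurationIsRestored("machine-api")
		}, 5*time.Minute, 10*time.Second).Should(Succeed())

		By("modifying the webhook configuration and waiting for it to be reverted")
		Expect(mutateWebhookConfiguration("machine-api")).To(Succeed())
		Eventually(func() error {
			return webhookConfigurationIsRestored("machine-api")
		}, 5*time.Minute, 10*time.Second).Should(Succeed())
	})
})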

@JoelSpeed
Contributor Author

/retest

@Danil-Grigorev

/lgtm

@openshift-ci-robot added the lgtm label (Indicates that a PR is ready to be merged.) on Oct 30, 2020
@openshift-bot

/retest

Please review the full test history for this PR and help us cut down flakes.

12 similar comments

@openshift-merge-robot merged commit 45ec974 into openshift:master on Oct 31, 2020