Fail fast if machines enter Failed phase by Danil-Grigorev · Pull Request #182 · openshift/cluster-api-actuator-pkg

Danil-Grigorev · 2020-08-02T07:19:11Z

Per MAO specifics, a machine that enters the Failed phase is no longer provisioned or processed. This commit makes e2e tests that require machines to reach the Running phase, or that wait for a full MachineSet rollout, fail faster and produce more readable failure output instead of a generic timeout on the condition.

This will help improve CI readability.

openshift-ci-robot · 2020-08-02T07:19:30Z

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
To complete the pull request process, please assign danil-grigorev
You can assign the PR to them by writing /assign @danil-grigorev in a comment when ready.

Needs approval from an approver in each of these files:

OWNERS

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

Danil-Grigorev · 2020-08-03T07:55:59Z

/retest

enxebre · 2020-08-03T09:24:14Z

pkg/framework/machines.go

 			}
+			// Fail fast if machine entered Failed phase
+			if machine.Status.Phase != nil {
+				Expect(*machine.Status.Phase).NotTo(Equal(MachinePhaseFailed),


Would this break the Eventually loop? We still want to WaitForMachinesDeleted.

A failed machine will never be deleted, as it is no longer reconciled. A failed Expect should break the loop and report the error, unlike returning an error inside the block.

Danil-Grigorev · 2020-08-05T09:16:09Z

/retest

JoelSpeed · 2020-08-05T11:51:03Z

A failed machine still gets reconciled by the deletion logic, does it not? The check for the Failed phase comes after the deletion timestamp check. As far as I can tell, a failed machine can still be deleted; it will likely just be quicker.

https://github.com/openshift/machine-api-operator/blob/487298d74beeadfe564c5a7db6b9e37296b641b9/pkg/controller/machine/controller.go#L265-L268
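The ordering argument in this comment can be illustrated with a simplified sketch. This is not the controller's actual code: the `machine` type, the `Deleting` flag (standing in for a set deletion timestamp), and the returned action strings are all hypothetical. It shows only why a check that runs after the deletion branch does not prevent deletion.

```go
package main

import "fmt"

// machine is a hypothetical stand-in; Deleting represents a set deletion timestamp.
type machine struct {
	Name     string
	Phase    string
	Deleting bool
}

// reconcile returns a label for the action a controller with this ordering
// would take for the given machine.
func reconcile(m machine) string {
	// The deletion branch runs first, so it applies in every phase.
	if m.Deleting {
		return "delete"
	}
	// The Failed phase is only considered once deletion has been ruled out.
	if m.Phase == "Failed" {
		return "skip"
	}
	return "reconcile"
}

func main() {
	fmt.Println(reconcile(machine{Name: "worker-a", Phase: "Failed", Deleting: true}))
	fmt.Println(reconcile(machine{Name: "worker-b", Phase: "Failed"}))
}
```

Under this ordering, a test that waits for deletion can still succeed for a machine in the Failed phase, which is why failing fast there may be unwanted.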

JoelSpeed · 2020-08-05T11:51:23Z

pkg/framework/machines.go

+		if m.Status.Phase != nil {
+			switch *m.Status.Phase {
+			case MachinePhaseRunning:
+				result = append(result, machines[i])
+			case MachinePhaseFailed:
+				return nil, fmt.Errorf("Machine entered the Failed phase: %q, reason: %v", m.GetName(), m.Status.ErrorMessage)
+			}


One question that comes to mind is whether any of the tests rely on a Machine going into the Failed phase. We need to make sure we aren't breaking those by doing this.

We are not testing that at the moment, so it is not a breaking change. Why wait for machines running if you are expecting them not to?

JoelSpeed

I'm not entirely sure that calling Expect inside an Eventually is good practice. Have you considered changing the Eventually blocks to the poll style that we use in our other wait-for functions? Then there are three states you can return: success, not yet, and terminal failure, which is what we want.

JoelSpeed · 2020-08-05T13:15:21Z

Good question 😅 I just wondered if we were using this to check that a machine didn't become running. I think it's ok

Danil-Grigorev · 2020-08-06T12:30:19Z

/retest

JoelSpeed · 2020-08-06T16:01:56Z

/retest

Danil-Grigorev · 2020-08-11T08:15:45Z

/retest

openshift-ci-robot · 2020-08-11T11:31:00Z

@Danil-Grigorev: The following test failed, say /retest to rerun all failed tests:

| Test name | Commit | Details | Rerun command |
| --- | --- | --- | --- |
| ci/prow/e2e-azure-operator | `fe88e5d` | link | `/test e2e-azure-operator` |

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

openshift-bot · 2020-11-09T14:09:00Z

Issues go stale after 90d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle stale

openshift-bot · 2020-12-09T16:05:21Z

Stale issues rot after 30d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle rotten
/remove-lifecycle stale

openshift-merge-robot · 2020-12-15T12:54:52Z

@Danil-Grigorev: The following test failed, say /retest to rerun all failed tests:

| Test name | Commit | Details | Rerun command |
| --- | --- | --- | --- |
| ci/prow/e2e-vsphere-operator | `fe88e5d` | link | `/test e2e-vsphere-operator` |

openshift-bot · 2021-01-14T18:16:49Z

Rotten issues close after 30d of inactivity.

Reopen the issue by commenting /reopen.
Mark the issue as fresh by commenting /remove-lifecycle rotten.
Exclude this issue from closing again by commenting /lifecycle frozen.

/close

openshift-ci-robot · 2021-01-14T18:17:08Z

@openshift-bot: Closed this PR.

openshift-ci-robot requested review from enxebre and michaelgugino August 2, 2020 07:19

enxebre reviewed Aug 3, 2020

Danil-Grigorev requested a review from enxebre August 5, 2020 09:16

JoelSpeed reviewed Aug 5, 2020

Danil-Grigorev force-pushed the e2e-check-machine-phase branch from 8a2a8fa to fe88e5d Compare August 5, 2020 13:05

Danil-Grigorev requested a review from JoelSpeed August 5, 2020 13:08

JoelSpeed reviewed Aug 5, 2020

Danil-Grigorev mentioned this pull request Aug 6, 2020

Set 10 minute timeout on webhook and deployment operations #183

Closed

openshift-ci-robot added the lifecycle/stale label (denotes an issue or PR that has remained open with no activity and has become stale) Nov 9, 2020

openshift-ci-robot added the lifecycle/rotten label (denotes an issue or PR that has aged beyond stale and will be auto-closed) and removed the lifecycle/stale label Dec 9, 2020

openshift-ci-robot closed this Jan 14, 2021

6 participants