Fail fast if machines enter Failed phase #182

Closed
Danil-Grigorev wants to merge 1 commit into openshift:master from Danil-Grigorev:e2e-check-machine-phase

Conversation

@Danil-Grigorev

According to MAO specifics, machines entering Failed phase will no longer be provisioned or processed. This commit ensures that e2e tests requiring machines to enter Running phase, or waiting for a full MachineSet rollout, would fail faster and give more readable failure output, instead of a generic timeout on condition.

This will help improve CI readability.

@openshift-ci-robot

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
To complete the pull request process, please assign danil-grigorev
You can assign the PR to them by writing /assign @danil-grigorev in a comment when ready.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@Danil-Grigorev
Author

/retest

	}
	// Fail fast if machine entered Failed phase
	if machine.Status.Phase != nil {
		Expect(*machine.Status.Phase).NotTo(Equal(MachinePhaseFailed),
Member

Would this break the Eventually loop? We still want to WaitForMachinesDeleted.

Author

The failed machine will never be deleted, as it is not going to reconcile. It should break the loop and report an error, unlike returning an error inside the block.

Contributor

A failed machine still gets reconciled by the deletion logic, does it not? The machine failed check comes after the deletion timestamp check. As far as I can tell, a failed machine can still be deleted; it will likely just be quicker.

https://github.com/openshift/machine-api-operator/blob/487298d74beeadfe564c5a7db6b9e37296b641b9/pkg/controller/machine/controller.go#L265-L268

@Danil-Grigorev
Author

/retest

@Danil-Grigorev Danil-Grigorev requested a review from enxebre August 5, 2020 09:16

Comment on lines +29 to +35
	if m.Status.Phase != nil {
		switch *m.Status.Phase {
		case MachinePhaseRunning:
			result = append(result, machines[i])
		case MachinePhaseFailed:
			return nil, fmt.Errorf("Machine entered the Failed phase: %q, reason: %v", m.GetName(), m.Status.ErrorMessage)
		}
Contributor

One question that comes to mind is whether any of the tests rely on a Machine going into the Failed phase. We need to make sure we aren't breaking those by doing this.

Author

We are not testing that at the moment, so it is not a breaking change. Why wait for machines running if you are expecting them not to?

Contributor

Good question 😅 I just wondered if we were using this to check that a machine didn't become running. I think it's ok
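
For readers without the full diff, the hunk under discussion can be sketched as a standalone filter that collects Running machines and returns an error as soon as one is Failed. This is an illustration under assumptions, not the PR's code: the `machine` struct, `filterRunning`, and the phase string values are hypothetical. The sketch also assumes the error message is an optional pointer field, so it dereferences it instead of formatting the pointer with `%v`.

```go
package main

import "fmt"

// machine is a hypothetical, trimmed-down stand-in for the Machine type.
type machine struct {
	Name         string
	Phase        *string
	ErrorMessage *string // assumed optional, hence a pointer
}

// Phase names follow the PR; the string values are assumptions.
const (
	phaseRunning = "Running"
	phaseFailed  = "Failed"
)

// filterRunning returns the machines in the Running phase. It fails fast
// with a descriptive error if any machine has entered the Failed phase.
func filterRunning(machines []machine) ([]machine, error) {
	var result []machine
	for i := range machines {
		m := machines[i]
		if m.Phase == nil {
			continue // phase not reported yet
		}
		switch *m.Phase {
		case phaseRunning:
			result = append(result, m)
		case phaseFailed:
			reason := "<none>"
			if m.ErrorMessage != nil {
				reason = *m.ErrorMessage
			}
			return nil, fmt.Errorf("machine %q entered the Failed phase, reason: %s", m.Name, reason)
		}
	}
	return result, nil
}

func strPtr(s string) *string { return &s }

func main() {
	machines := []machine{
		{Name: "worker-0", Phase: strPtr(phaseRunning)},
		{Name: "worker-1", Phase: strPtr(phaseFailed), ErrorMessage: strPtr("quota exceeded")},
	}
	running, err := filterRunning(machines)
	fmt.Println(len(running), err)
}
```

A caller that legitimately expects a Failed machine would need a different helper, which is the concern raised in this thread.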

@Danil-Grigorev Danil-Grigorev force-pushed the e2e-check-machine-phase branch from 8a2a8fa to fe88e5d Compare August 5, 2020 13:05
Contributor

@JoelSpeed JoelSpeed left a comment

I'm not entirely sure whether calling Expect inside an Eventually is good practice. Have you considered changing the Eventually blocks to the poll style that we have used in other wait-for functions? Then you have three states you can return: success, not yet, and a terminal failure, which is what we want.

@Danil-Grigorev
Author

/retest

@JoelSpeed
Contributor

/retest

@Danil-Grigorev
Author

/retest

@openshift-ci-robot

@Danil-Grigorev: The following test failed, say /retest to rerun all failed tests:

| Test name | Commit | Details | Rerun command |
| --- | --- | --- | --- |
| ci/prow/e2e-azure-operator | fe88e5d | link | /test e2e-azure-operator |

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@openshift-bot

Issues go stale after 90d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle stale

@openshift-ci-robot openshift-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Nov 9, 2020
@openshift-bot

Stale issues rot after 30d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle rotten
/remove-lifecycle stale

@openshift-ci-robot openshift-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Dec 9, 2020
@openshift-merge-robot
Contributor

@Danil-Grigorev: The following test failed, say /retest to rerun all failed tests:

| Test name | Commit | Details | Rerun command |
| --- | --- | --- | --- |
| ci/prow/e2e-vsphere-operator | fe88e5d | link | /test e2e-vsphere-operator |

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@openshift-bot

Rotten issues close after 30d of inactivity.

Reopen the issue by commenting /reopen.
Mark the issue as fresh by commenting /remove-lifecycle rotten.
Exclude this issue from closing again by commenting /lifecycle frozen.

/close

@openshift-ci-robot

@openshift-bot: Closed this PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

Labels

lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed.

6 participants