Skip to content

Conversation

@pierreprinetti
Copy link
Member

@pierreprinetti pierreprinetti commented Jul 17, 2021

InstanceCreate is a method that creates a server. If the server is set
to boot from volume, and an image ID is passed, the method creates the
root volume prior to creating the server. The root volume is set to be
destroyed when the associated server is destroyed.

However, if the server fails to create (for example, because of quota
issues), the volume is never associated to a server and the automatic
deletion is never triggered. At every round of retry, a new volume will
be created, possibly until volume quota is reached (or server creation
is successful). This results in a leakage of unused volumes.

With this patch, a newly created root volume is explicitly deleted as
soon as the server creation call fails.

Note that this patch leaves unmodified the lifespan of a volume
associated to a server, regardless if the server ever reaches an ACTIVE
state.

Co-Authored-By: Matthew Booth [email protected]

@openshift-ci openshift-ci bot added do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. bugzilla/severity-high Referenced Bugzilla bug's severity is high for the branch this PR is targeting. labels Jul 17, 2021
@openshift-ci
Copy link

openshift-ci bot commented Jul 17, 2021

@pierreprinetti: This pull request references Bugzilla bug 1943378, which is valid. The bug has been moved to the POST state. The bug has been updated to refer to the pull request using the external bug tracker.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target release (4.9.0) matches configured target release for branch (4.9.0)
  • bug is in the state NEW, which is one of the valid states (NEW, ASSIGNED, ON_DEV, POST, POST)

Requesting review from QA contact:
/cc @eurijon

Details

In response to this:

Bug 1943378: Eliminate instanceCreate volume leak

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-ci openshift-ci bot added the bugzilla/valid-bug Indicates that a referenced Bugzilla bug is valid for the branch this PR is targeting. label Jul 17, 2021
@openshift-ci
Copy link

openshift-ci bot commented Jul 17, 2021

@openshift-ci[bot]: GitHub didn't allow me to request PR reviews from the following users: eurijon.

Note that only openshift members and repo collaborators can review this PR, and authors cannot review their own PRs.

Details

In response to this:

@pierreprinetti: This pull request references Bugzilla bug 1943378, which is valid. The bug has been moved to the POST state. The bug has been updated to refer to the pull request using the external bug tracker.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target release (4.9.0) matches configured target release for branch (4.9.0)
  • bug is in the state NEW, which is one of the valid states (NEW, ASSIGNED, ON_DEV, POST, POST)

Requesting review from QA contact:
/cc @eurijon

In response to this:

Bug 1943378: Eliminate instanceCreate volume leak

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-ci openshift-ci bot requested review from EmilienM and mdbooth July 17, 2021 14:17
Copy link
Member

@mandre mandre left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems like an easy enough fix. Have you checked if k8s/capo also exhibits the problem and thus needs a similar fix?

@openshift-ci
Copy link

openshift-ci bot commented Jul 19, 2021

@pierreprinetti: This pull request references Bugzilla bug 1943378, which is valid.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target release (4.9.0) matches configured target release for branch (4.9.0)
  • bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, ON_DEV, POST, POST)

Requesting review from QA contact:
/cc @eurijon

Details

In response to this:

Bug 1943378: Eliminate instanceCreate volume leak

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-ci
Copy link

openshift-ci bot commented Jul 19, 2021

@openshift-ci[bot]: GitHub didn't allow me to request PR reviews from the following users: eurijon.

Note that only openshift members and repo collaborators can review this PR, and authors cannot review their own PRs.

Details

In response to this:

@pierreprinetti: This pull request references Bugzilla bug 1943378, which is valid.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target release (4.9.0) matches configured target release for branch (4.9.0)
  • bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, ON_DEV, POST, POST)

Requesting review from QA contact:
/cc @eurijon

In response to this:

Bug 1943378: Eliminate instanceCreate volume leak

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

InstanceCreate is a method that creates a server. If the server is set
to boot from volume, and an image ID is passed, the method creates the
root volume prior to creating the server. The root volume is set to be
destroyed when the associated server is destroyed.

However, if the server fails to create (for example, because of quota
issues), the volume is never associated to a server and the automatic
deletion is never triggered. At every round of retry, a new volume will
be created, possibly until volume quota is reached (or server creation
is successful). This results in a leakage of unused volumes.

With this patch, a newly created root volume is explicitly deleted as
soon as the server creation call fails.

Note that this patch leaves unmodified the lifespan of a volume
associated to a server, regardless if the server ever reaches an ACTIVE
state.

Co-Authored-By: Matthew Booth <[email protected]>
@openshift-ci
Copy link

openshift-ci bot commented Jul 19, 2021

@pierreprinetti: This pull request references Bugzilla bug 1943378, which is valid.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target release (4.9.0) matches configured target release for branch (4.9.0)
  • bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, ON_DEV, POST, POST)

Requesting review from QA contact:
/cc @eurijon

Details

In response to this:

Bug 1943378: Eliminate instanceCreate volume leak

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-ci
Copy link

openshift-ci bot commented Jul 19, 2021

@openshift-ci[bot]: GitHub didn't allow me to request PR reviews from the following users: eurijon.

Note that only openshift members and repo collaborators can review this PR, and authors cannot review their own PRs.

Details

In response to this:

@pierreprinetti: This pull request references Bugzilla bug 1943378, which is valid.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target release (4.9.0) matches configured target release for branch (4.9.0)
  • bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, ON_DEV, POST, POST)

Requesting review from QA contact:
/cc @eurijon

In response to this:

Bug 1943378: Eliminate instanceCreate volume leak

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@pierreprinetti
Copy link
Member Author

@mdbooth

I figured that the "quota exceeded" case fails synchronously, and not asynchronously (i.e. InstanceCreate returns an error to Create, thus the wait-for-ACTIVE loop is never hit). Which makes sense: conversely, a machine that is successfully created likely gets its volume associated, and destroyed upon deletion, even when it never reaches the ACTIVE state.

I tested this patch with local-capo.sh and using MOC's huge flavor. It works: the volume is correctly cleaned up when machine creation fails.

@mdbooth @mandre

I am aware that the InstanceCreate code does not comply with the idempotency criterion for CAPO operations. The volume-deletion request is only kept in memory and won't survive a restart: if CAPO is rebooted while creating a server, the volume might leak.
The reason why I am not tackling that issue, is that I believe it is not a regression. The same would happen before this patch. I have filed a bug for this; we can have a conversation in BZ on whether it's worth fixing.

@pierreprinetti
Copy link
Member Author

/hold cancel

Unit tests are now passing; this patch is ready for a review.

@openshift-ci openshift-ci bot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jul 19, 2021
@pierreprinetti
Copy link
Member Author

@mandre

Have you checked if k8s/capo also exhibits the problem

It does; @mdbooth has done that investigation. Since the bug is high-severity, we fix here then there.

@mandre
Copy link
Member

mandre commented Jul 19, 2021

Alternatively, instead of keeping track of volume that was created and eventually delete it on failure, we may adopt a different strategy where we check if a volume exists before creating a new one, if it's possible to identify such volume.

However, if the leaked volume gets deleted on cluster destroy, I believe it's not such a big deal: MAO restarting should be an exceptional event.

Would like to get @mdbooth's opinion on the PR.

@pierreprinetti
Copy link
Member Author

pierreprinetti commented Jul 19, 2021

@mandre

check if a volume exists before creating a new one

This is exactly what occurred to me while thinking about a solution for the idempotency bug 😄

For catching all cases, it might even be interesting to have both solutions in place: the immediate cleanup would be useful for cases where the machineset is scaled down between one retry and the next.

I am thinking about merging this patch as-is, then adding the delete-before-create logic in a different patch to address that case specifically.

@EmilienM
Copy link
Member

/lgtm

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Jul 20, 2021
@EmilienM
Copy link
Member

/approve

@openshift-ci
Copy link

openshift-ci bot commented Jul 21, 2021

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: EmilienM

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jul 21, 2021
@openshift-merge-robot openshift-merge-robot merged commit efc4eed into openshift:master Jul 21, 2021
@openshift-ci
Copy link

openshift-ci bot commented Jul 21, 2021

@pierreprinetti: All pull requests linked via external trackers have merged:

Bugzilla bug 1943378 has been moved to the MODIFIED state.

Details

In response to this:

Bug 1943378: Eliminate instanceCreate volume leak

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

}

cleanupOperationsInCaseOfServerCreationFailure = append(cleanupOperationsInCaseOfServerCreationFailure, func() error {
return volumes.Delete(is.volumeClient, volume.ID, nil).ExtractErr()
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This differs from the delete line below:

err = volumes.Delete(is.volumeClient, volumeID, volumes.DeleteOpts{}).ExtractErr()

Have you managed to test it manually?

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's also at least 1 other error condition on line 777 which, if it failed, would result in the same leak.

There might be a slightly more robust solution. How about something like:

defer func() error {
    if server == nil {
        err = volumes.Delete(is.volumeClient, volumeID, volumes.DeleteOpts{}).ExtractErr()
        ... log any error
    }
}

Copy link
Member Author

@pierreprinetti pierreprinetti Jul 21, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tested manually and it works in the case in point (quota error).

However the defer solution is more powerful (cover other cases as well) and obviously better. I'm back to the keyboard.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

By the way, as far as I can tell, passing an empty volumes.DeleteOpts{} is a fancy noop with lots of reflection in the way.

@mdbooth
Copy link

mdbooth commented Jul 21, 2021

Ah, crap, it's already merged :( We still need to fix it.

@pierreprinetti pierreprinetti deleted the bz1943378 branch July 21, 2021 13:05
pierreprinetti pushed a commit to shiftstack/cluster-api-provider-openstack that referenced this pull request Apr 22, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. bugzilla/severity-high Referenced Bugzilla bug's severity is high for the branch this PR is targeting. bugzilla/valid-bug Indicates that a referenced Bugzilla bug is valid for the branch this PR is targeting. lgtm Indicates that a PR is ready to be merged.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants