Bug 1943378: Eliminate instanceCreate volume leak #188

pierreprinetti · 2021-07-17T14:17:01Z

InstanceCreate is a method that creates a server. If the server is set
to boot from volume, and an image ID is passed, the method creates the
root volume prior to creating the server. The root volume is set to be
destroyed when the associated server is destroyed.

However, if the server fails to create (for example, because of quota
issues), the volume is never associated to a server and the automatic
deletion is never triggered. At every round of retry, a new volume will
be created, possibly until volume quota is reached (or server creation
is successful). This results in a leakage of unused volumes.

With this patch, a newly created root volume is explicitly deleted as
soon as the server creation call fails.

Note that this patch leaves unmodified the lifespan of a volume
associated to a server, regardless if the server ever reaches an ACTIVE
state.

Co-Authored-By: Matthew Booth [email protected]

openshift-ci · 2021-07-17T14:17:08Z

@pierreprinetti: This pull request references Bugzilla bug 1943378, which is valid. The bug has been moved to the POST state. The bug has been updated to refer to the pull request using the external bug tracker.

3 validation(s) were run on this bug

bug is open, matching expected state (open)
bug target release (4.9.0) matches configured target release for branch (4.9.0)
bug is in the state NEW, which is one of the valid states (NEW, ASSIGNED, ON_DEV, POST, POST)

Requesting review from QA contact:
/cc @eurijon

Details

In response to this:

Bug 1943378: Eliminate instanceCreate volume leak

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

openshift-ci · 2021-07-17T14:17:09Z

@openshift-ci[bot]: GitHub didn't allow me to request PR reviews from the following users: eurijon.

Note that only openshift members and repo collaborators can review this PR, and authors cannot review their own PRs.

Details

In response to this:

@pierreprinetti: This pull request references Bugzilla bug 1943378, which is valid. The bug has been moved to the POST state. The bug has been updated to refer to the pull request using the external bug tracker.

3 validation(s) were run on this bug

bug is open, matching expected state (open)

bug target release (4.9.0) matches configured target release for branch (4.9.0)

bug is in the state NEW, which is one of the valid states (NEW, ASSIGNED, ON_DEV, POST, POST)

Requesting review from QA contact:
/cc @eurijon

In response to this:

Bug 1943378: Eliminate instanceCreate volume leak

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

mandre

Seems like an easy enough fix. Have you checked if k8s/capo also exhibits the problem and thus needs a similar fix?

openshift-ci · 2021-07-19T07:58:07Z

@pierreprinetti: This pull request references Bugzilla bug 1943378, which is valid.

3 validation(s) were run on this bug

bug is open, matching expected state (open)
bug target release (4.9.0) matches configured target release for branch (4.9.0)
bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, ON_DEV, POST, POST)

Requesting review from QA contact:
/cc @eurijon

Details

In response to this:

Bug 1943378: Eliminate instanceCreate volume leak

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

openshift-ci · 2021-07-19T07:58:08Z

@openshift-ci[bot]: GitHub didn't allow me to request PR reviews from the following users: eurijon.

Note that only openshift members and repo collaborators can review this PR, and authors cannot review their own PRs.

Details

In response to this:

@pierreprinetti: This pull request references Bugzilla bug 1943378, which is valid.

3 validation(s) were run on this bug

bug is open, matching expected state (open)

bug target release (4.9.0) matches configured target release for branch (4.9.0)

bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, ON_DEV, POST, POST)

Requesting review from QA contact:
/cc @eurijon

In response to this:

Bug 1943378: Eliminate instanceCreate volume leak

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

InstanceCreate is a method that creates a server. If the server is set to boot from volume, and an image ID is passed, the method creates the root volume prior to creating the server. The root volume is set to be destroyed when the associated server is destroyed. However, if the server fails to create (for example, because of quota issues), the volume is never associated to a server and the automatic deletion is never triggered. At every round of retry, a new volume will be created, possibly until volume quota is reached (or server creation is successful). This results in a leakage of unused volumes. With this patch, a newly created root volume is explicitly deleted as soon as the server creation call fails. Note that this patch leaves unmodified the lifespan of a volume associated to a server, regardless if the server ever reaches an ACTIVE state. Co-Authored-By: Matthew Booth <[email protected]>

openshift-ci · 2021-07-19T08:06:16Z

@pierreprinetti: This pull request references Bugzilla bug 1943378, which is valid.

3 validation(s) were run on this bug

bug is open, matching expected state (open)
bug target release (4.9.0) matches configured target release for branch (4.9.0)
bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, ON_DEV, POST, POST)

Requesting review from QA contact:
/cc @eurijon

Details

In response to this:

Bug 1943378: Eliminate instanceCreate volume leak

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

openshift-ci · 2021-07-19T08:06:18Z

@openshift-ci[bot]: GitHub didn't allow me to request PR reviews from the following users: eurijon.

Note that only openshift members and repo collaborators can review this PR, and authors cannot review their own PRs.

Details

In response to this:

@pierreprinetti: This pull request references Bugzilla bug 1943378, which is valid.

3 validation(s) were run on this bug

bug is open, matching expected state (open)

bug target release (4.9.0) matches configured target release for branch (4.9.0)

bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, ON_DEV, POST, POST)

Requesting review from QA contact:
/cc @eurijon

In response to this:

Bug 1943378: Eliminate instanceCreate volume leak

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

pierreprinetti · 2021-07-19T08:35:20Z

@mdbooth

I figured that the "quota exceeded" case fails synchronously, and not asynchronously (i.e. InstanceCreate returns an error to Create, thus the wait-for-ACTIVE loop is never hit). Which makes sense: conversely, a machine that is successfully created likely gets its volume associated, and destroyed upon deletion, even when it never reaches the ACTIVE state.

I tested this patch with local-capo.sh and using MOC's huge flavor. It works: the volume is correctly cleaned up when machine creation fails.

@mdbooth @mandre

I am aware that the InstanceCreate code does not comply with the idempotency criterion for CAPO operations. The volume-deletion request is only kept in memory and won't survive a restart: if CAPO is rebooted while creating a server, the volume might leak.
The reason why I am not tackling that issue, is that I believe it is not a regression. The same would happen before this patch. I have filed a bug for this; we can have a conversation in BZ on whether it's worth fixing.

pierreprinetti · 2021-07-19T08:35:46Z

/hold cancel

Unit tests are now passing; this patch is ready for a review.

pierreprinetti · 2021-07-19T08:36:41Z

@mandre

Have you checked if k8s/capo also exhibits the problem

It does; @mdbooth has done that investigation. Since the bug is high-severity, we fix here then there.

mandre · 2021-07-19T09:05:58Z

Alternatively, instead of keeping track of volume that was created and eventually delete it on failure, we may adopt a different strategy where we check if a volume exists before creating a new one, if it's possible to identify such volume.

However, if the leaked volume gets deleted on cluster destroy, I believe it's not such a big deal: MAO restarting should be an exceptional event.

Would like to get @mdbooth's opinion on the PR.

pierreprinetti · 2021-07-19T09:08:52Z

@mandre

check if a volume exists before creating a new one

This is exactly what occurred to me while thinking about a solution for the idempotency bug 😄

For catching all cases, it might even be interesting to have both solutions in place: the immediate cleanup would be useful for cases where the machineset is scaled down between one retry and the next.

I am thinking about merging this patch as-is, then adding the delete-before-create logic in a different patch to address that case specifically.

EmilienM · 2021-07-20T12:23:39Z

/lgtm

EmilienM · 2021-07-21T12:05:07Z

/approve

openshift-ci · 2021-07-21T12:05:14Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: EmilienM

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details

Needs approval from an approver in each of these files:

~~OWNERS~~ [EmilienM]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

openshift-ci · 2021-07-21T12:06:20Z

@pierreprinetti: All pull requests linked via external trackers have merged:

openshift/cluster-api-provider-openstack#188

Bugzilla bug 1943378 has been moved to the MODIFIED state.

Details

In response to this:

Bug 1943378: Eliminate instanceCreate volume leak

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

mdbooth · 2021-07-21T12:13:09Z

pkg/cloud/openstack/clients/machineservice.go

 			}

+			cleanupOperationsInCaseOfServerCreationFailure = append(cleanupOperationsInCaseOfServerCreationFailure, func() error {
+				return volumes.Delete(is.volumeClient, volume.ID, nil).ExtractErr()


This differs from the delete line below:

err = volumes.Delete(is.volumeClient, volumeID, volumes.DeleteOpts{}).ExtractErr()

Have you managed to test it manually?

There's also at least 1 other error condition on line 777 which, if it failed, would result in the same leak.

There might be a slightly more robust solution. How about something like:

defer func() error { if server == nil { err = volumes.Delete(is.volumeClient, volumeID, volumes.DeleteOpts{}).ExtractErr() ... log any error } }

I tested manually and it works in the case in point (quota error).

However the defer solution is more powerful (cover other cases as well) and obviously better. I'm back to the keyboard.

By the way, as far as I can tell, passing an empty volumes.DeleteOpts{} is a fancy noop with lots of reflection in the way.

mdbooth · 2021-07-21T12:19:16Z

Ah, crap, it's already merged :( We still need to fix it.

Signed-off-by: Alex Yang <[email protected]>

openshift-ci bot added do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. bugzilla/severity-high Referenced Bugzilla bug's severity is high for the branch this PR is targeting. labels Jul 17, 2021

openshift-ci bot added the bugzilla/valid-bug Indicates that a referenced Bugzilla bug is valid for the branch this PR is targeting. label Jul 17, 2021

openshift-ci bot requested review from EmilienM and mdbooth July 17, 2021 14:17

mandre reviewed Jul 19, 2021

View reviewed changes

pierreprinetti force-pushed the bz1943378 branch from 9a9669a to 6c44ba7 Compare July 19, 2021 08:05

openshift-ci bot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jul 19, 2021

pierreprinetti mentioned this pull request Jul 20, 2021

Bug 1983612: Make InstanceCreate idempotent #189

Merged

openshift-ci bot assigned EmilienM Jul 20, 2021

openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Jul 20, 2021

openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jul 21, 2021

openshift-merge-robot merged commit efc4eed into openshift:master Jul 21, 2021

mdbooth suggested changes Jul 21, 2021

View reviewed changes

pierreprinetti deleted the bz1943378 branch July 21, 2021 13:05

openshift-ci bot mentioned this pull request Jul 22, 2021

Bug 1943378: Fix InstanceCreate volume cleanup #191

Merged

EmilienM mentioned this pull request Jan 31, 2023

Manual rebase on upstream #252

Closed

pierreprinetti pushed a commit to shiftstack/cluster-api-provider-openstack that referenced this pull request Apr 22, 2024

add UserDataSecret into deepcopy file (openshift#188)

ddd1eb0

Signed-off-by: Alex Yang <[email protected]>

Bug 1943378: Eliminate instanceCreate volume leak #188

Bug 1943378: Eliminate instanceCreate volume leak #188

Uh oh!

Conversation

pierreprinetti commented Jul 17, 2021 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

openshift-ci bot commented Jul 17, 2021

Uh oh!

openshift-ci bot commented Jul 17, 2021

Uh oh!

mandre left a comment

Choose a reason for hiding this comment

Uh oh!

openshift-ci bot commented Jul 19, 2021

Uh oh!

openshift-ci bot commented Jul 19, 2021

Uh oh!

openshift-ci bot commented Jul 19, 2021

Uh oh!

openshift-ci bot commented Jul 19, 2021

Uh oh!

pierreprinetti commented Jul 19, 2021

Uh oh!

pierreprinetti commented Jul 19, 2021

Uh oh!

pierreprinetti commented Jul 19, 2021

Uh oh!

mandre commented Jul 19, 2021

Uh oh!

pierreprinetti commented Jul 19, 2021 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

EmilienM commented Jul 20, 2021

Uh oh!

EmilienM commented Jul 21, 2021

Uh oh!

openshift-ci bot commented Jul 21, 2021

Uh oh!

openshift-ci bot commented Jul 21, 2021

Uh oh!

mdbooth Jul 21, 2021

Choose a reason for hiding this comment

Uh oh!

mdbooth Jul 21, 2021

Choose a reason for hiding this comment

Uh oh!

pierreprinetti Jul 21, 2021 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

pierreprinetti Jul 21, 2021

Choose a reason for hiding this comment

Uh oh!

mdbooth commented Jul 21, 2021

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants

pierreprinetti commented Jul 17, 2021 •

edited

Loading

pierreprinetti commented Jul 19, 2021 •

edited

Loading

pierreprinetti Jul 21, 2021 •

edited

Loading