Bug 1943378: Eliminate instanceCreate volume leak #188
Conversation
@pierreprinetti: This pull request references Bugzilla bug 1943378, which is valid. The bug has been moved to the POST state and updated to refer to the pull request via the external bug tracker. 3 validations were run on this bug. Requesting review from QA contact.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
@openshift-ci[bot]: GitHub didn't allow me to request PR reviews from the following users: eurijon. Note that only openshift members and repo collaborators can review this PR, and authors cannot review their own PRs.
mandre
left a comment
Seems like an easy enough fix. Have you checked if k8s/capo also exhibits the problem and thus needs a similar fix?
InstanceCreate is a method that creates a server. If the server is set to boot from volume, and an image ID is passed, the method creates the root volume prior to creating the server. The root volume is set to be destroyed when the associated server is destroyed.

However, if the server fails to create (for example, because of quota issues), the volume is never associated with a server and the automatic deletion is never triggered. On every retry, a new volume is created, possibly until the volume quota is reached (or server creation succeeds). This results in a leak of unused volumes.

With this patch, a newly created root volume is explicitly deleted as soon as the server creation call fails.

Note that this patch leaves unmodified the lifespan of a volume associated with a server, regardless of whether the server ever reaches an ACTIVE state.

Co-Authored-By: Matthew Booth <[email protected]>
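The flow described above can be sketched with a minimal, self-contained Go example. Everything here (`fakeCloud`, `instanceCreate`, and all identifiers) is a hypothetical stand-in for illustration, not the actual machine-api code:

```go
package main

import (
	"errors"
	"fmt"
)

// fakeCloud is a hypothetical in-memory stand-in for the volume and
// server clients used by the real provider.
type fakeCloud struct {
	volumes     map[string]bool
	serverFails bool
}

func (c *fakeCloud) createVolume(id string) { c.volumes[id] = true }
func (c *fakeCloud) deleteVolume(id string) { delete(c.volumes, id) }
func (c *fakeCloud) createServer() error {
	if c.serverFails {
		return errors.New("quota exceeded")
	}
	return nil
}

// instanceCreate sketches the fixed flow: if the server creation call
// fails after the root volume was created, delete that volume
// immediately so retries do not leak one volume per attempt.
func instanceCreate(c *fakeCloud, bootFromVolume bool) error {
	volumeID := ""
	if bootFromVolume {
		volumeID = "root-volume"
		c.createVolume(volumeID)
	}
	if err := c.createServer(); err != nil {
		if volumeID != "" {
			c.deleteVolume(volumeID) // explicit cleanup on failure
		}
		return fmt.Errorf("server creation failed: %w", err)
	}
	return nil
}

func main() {
	cloud := &fakeCloud{volumes: map[string]bool{}, serverFails: true}
	_ = instanceCreate(cloud, true)
	// After a failed creation, no volumes remain behind.
	fmt.Println("volumes remaining:", len(cloud.volumes))
}
```

On success the volume is intentionally left in place: it is attached to the server and deleted together with it, which matches the unchanged lifespan noted in the commit message.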
Force-pushed from 9a9669a to 6c44ba7.
I figured that the "quota exceeded" case fails synchronously, not asynchronously. I tested this patch with local-capo.sh using MOC's huge flavor, and it works: the volume is correctly cleaned up when machine creation fails. I am aware that the
/hold cancel

Unit tests are now passing; this patch is ready for review.
Alternatively, instead of keeping track of the volume that was created and eventually deleting it on failure, we could adopt a different strategy: check whether a volume already exists before creating a new one, if it's possible to identify such a volume. However, if the leaked volume gets deleted on cluster destroy, I believe it's not such a big deal: MAO restarting should be an exceptional event. I would like to get @mdbooth's opinion on the PR.
This is exactly what occurred to me while thinking about a solution for the idempotency bug 😄 To catch all cases, it might even be interesting to have both solutions in place: the immediate cleanup would be useful for cases where the machineset is scaled down between one retry and the next. I am thinking about merging this patch as-is, then adding the delete-before-create logic in a separate patch to address that case specifically.
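The delete-before-create idea under discussion could look roughly like the following sketch. It assumes a leftover volume can be identified by a deterministic per-machine name; `ensureRootVolume` and all other identifiers are hypothetical, not part of the actual patch:

```go
package main

import "fmt"

// ensureRootVolume sketches the check-before-create strategy: before
// creating a root volume, look for one left over from a previous failed
// attempt (keyed here by a deterministic per-machine name) and reuse it
// instead of creating a duplicate. Returns the volume ID and whether a
// new volume was created.
func ensureRootVolume(existing map[string]string, machineName string) (string, bool) {
	name := machineName + "-root"
	if id, ok := existing[name]; ok {
		return id, false // reuse the leftover volume from an earlier retry
	}
	id := "vol-" + name
	existing[name] = id
	return id, true // first attempt: create a new volume
}

func main() {
	vols := map[string]string{}
	id1, created1 := ensureRootVolume(vols, "worker-0")
	// Simulate a retry after a failed server creation.
	id2, created2 := ensureRootVolume(vols, "worker-0")
	fmt.Println(id1 == id2, created1, created2)
}
```

Combined with the immediate cleanup from this patch, the reuse path would only trigger in the rarer cases the cleanup misses, such as the process dying between volume creation and deletion.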
/lgtm |
/approve |
[APPROVALNOTIFIER] This PR is APPROVED. This pull request has been approved by: EmilienM. The full list of commands accepted by this bot can be found here. The pull request process is described here.

Needs approval from an approver in each of these files. Approvers can indicate their approval by writing
@pierreprinetti: All pull requests linked via external trackers have merged: Bugzilla bug 1943378 has been moved to the MODIFIED state.
```go
cleanupOperationsInCaseOfServerCreationFailure = append(cleanupOperationsInCaseOfServerCreationFailure, func() error {
	return volumes.Delete(is.volumeClient, volume.ID, nil).ExtractErr()
})
```
This differs from the delete line below:

```go
err = volumes.Delete(is.volumeClient, volumeID, volumes.DeleteOpts{}).ExtractErr()
```

Have you managed to test it manually?
There's also at least 1 other error condition on line 777 which, if it failed, would result in the same leak.
There might be a slightly more robust solution. How about something like:
```go
defer func() {
	if server == nil {
		if err := volumes.Delete(is.volumeClient, volumeID, volumes.DeleteOpts{}).ExtractErr(); err != nil {
			// ... log any error
		}
	}
}()
```
I tested manually and it works in the case in point (the quota error).
However, the defer solution is more powerful (it covers other cases as well) and obviously better. I'm back to the keyboard.
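The defer-based cleanup being discussed can be illustrated with a standalone sketch. The in-memory `store` and all names here are hypothetical; the real code would call `volumes.Delete` on the gophercloud volume client. The key property is that every return path on which the server was never created triggers the cleanup, not only the CreateServer call itself:

```go
package main

import (
	"errors"
	"fmt"
)

// store is a hypothetical in-memory stand-in for the volume client.
type store struct{ volumes map[string]bool }

func (s *store) deleteVolume(id string) { delete(s.volumes, id) }

// createInstance sketches the defer-based variant: the deferred closure
// reclaims the root volume whenever the function returns with no server.
func createInstance(s *store, failAt string) error {
	volumeID := "root-volume"
	s.volumes[volumeID] = true

	var server *struct{} // stays nil until the server is created
	defer func() {
		if server == nil {
			// Server creation never completed; reclaim the root volume.
			s.deleteVolume(volumeID)
		}
	}()

	if failAt == "pre-create" {
		return errors.New("failed before calling CreateServer")
	}
	if failAt == "create" {
		return errors.New("quota exceeded")
	}
	server = &struct{}{} // success: the volume now belongs to the server
	return nil
}

func main() {
	for _, failAt := range []string{"pre-create", "create", ""} {
		s := &store{volumes: map[string]bool{}}
		err := createInstance(s, failAt)
		fmt.Printf("failAt=%q failed=%v volumesRemaining=%d\n", failAt, err != nil, len(s.volumes))
	}
}
```

Because the deferred closure captures the `server` variable (not its value at defer time), assigning it on the success path disarms the cleanup, and the volume then lives and dies with the server as before.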
By the way, as far as I can tell, passing an empty volumes.DeleteOpts{} is a fancy no-op with lots of reflection in the way.
Ah, crap, it's already merged :( We still need to fix it.
Signed-off-by: Alex Yang <[email protected]>