Handle case of recovery from resize failures#187
Handle case of recovery from resize failures#187k8s-ci-robot merged 5 commits intokubernetes-csi:masterfrom
Conversation
|
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: gnufied The full list of commands accepted by this bot can be found here. The pull request process is described here DetailsNeeds approval from an approver in each of these files:
Approvers can indicate their approval by writing |
|
/assign @jsafrane |
|
/assign @jingxu97 @xing-yang |
2aa9ed0 to
4cdc2b7
Compare
fix bug with switch..case statement
4cdc2b7 to
9f08c69
Compare
|
I tested it with AWS EBS (it allows one expansion of a volume in 6 hours, so it can throw nice errors). |
| // checks if pv can be expanded | ||
| func (ctrl *resizeController) pvCanBeExpanded(pv *v1.PersistentVolume, pvc *v1.PersistentVolumeClaim) bool { | ||
| if !ctrl.resizer.CanSupport(pv, pvc) { | ||
| klog.V(4).Infof("Resizer %q doesn't support PV %q", ctrl.name, pv.Name) |
There was a problem hiding this comment.
maybe it can be a warning because you are calling resize, but resizer does not support it?
There was a problem hiding this comment.
it can get called on all sort of PVs. such as intree pv,pvc etc, which are not resizable by external-resizer. I think Info is suitable since most people will be running both intree expand_controller and this controller side-by-side for now.
| _, _, err, _ := ctrl.expandAndRecover(pvc, pv) | ||
| return err | ||
| } else { | ||
| if !ctrl.pvNeedResize(pvc, pv) { |
There was a problem hiding this comment.
seem pvNeedResize can be called before checking feature is enabled?
There was a problem hiding this comment.
is pvNeedResize more like pv cannot be resized due to condition is not met? Does it consider an error?
There was a problem hiding this comment.
this is basically old flow - which checks if PV is resizable (such as, whether it is bound and bound to PVC which is being resized and whether it was already controller expanded and node expansion is pending). It is not a hard error if PV can not expanded at this moment. But anyways - this is old code and has kinda always worked okay.
There was a problem hiding this comment.
oh I confused about pvCanBeExpanded and pvNeedResize, thinking they are the same function.
There was a problem hiding this comment.
yeah I am sorry about similar names. I am hoping to delete pvNeedResize when this feature goes beta.
|
|
||
| return ctrl.resizePVC(pvc, pv) | ||
| if utilfeature.DefaultFeatureGate.Enabled(features.RecoverVolumeExpansionFailure) { | ||
| _, _, err, _ := ctrl.expandAndRecover(pvc, pv) |
There was a problem hiding this comment.
why return pv, pvc, but without using use objects?
There was a problem hiding this comment.
for now, mainly for tests.
pkg/controller/controller.go
Outdated
| return nil, err | ||
| return updatedPVC, err | ||
| } | ||
| return updatedPVC, nil |
There was a problem hiding this comment.
Since updatedPVC is always returned now, we can remove the "if err != nil" code block and just have "return updatedPVC, err" as this will cover the case when err is nil as well.
|
|
||
| func (ctrl *resizeController) patchClaim(oldPVC, newPVC *v1.PersistentVolumeClaim) (*v1.PersistentVolumeClaim, error) { | ||
| patchBytes, err := util.GetPVCPatchData(oldPVC, newPVC) | ||
| func (ctrl *resizeController) patchClaim(oldPVC, newPVC *v1.PersistentVolumeClaim, addResourceVersionCheck bool) (*v1.PersistentVolumeClaim, error) { |
There was a problem hiding this comment.
could you add some comments about this option addResourceVersionCheck?
15fcebc to
7ca824f
Compare
|
@jingxu97 @xing-yang addressed your comments. can you PTAL? |
|
/lgtm |
| newPVC := pvc.DeepCopy() | ||
| newPVC.Status.ResizeStatus = &expansionFailedOnController | ||
|
|
||
| updatedPVC, err := ctrl.patchClaim(pvc, newPVC, false /* addResourceVersionCheck */) |
There was a problem hiding this comment.
the reason no need to check resource version here?
There was a problem hiding this comment.
The general idea I had was if expansion failed we record the error even if we get a potential conflict. Since we are only patching resize status we should be okay.
There was a problem hiding this comment.
what is the result of conflict? It might overwrite some fields which are updated by other places?
how about we also check addResourceVersionCheck here? Any potential issue?
There was a problem hiding this comment.
The conflict when addResourceVersionCheck is present has nothing to do with content of the patch, but is about whether patch we are applying is with latest version of object or not. i.e - if two updates were performed on PVC concurrently and they updated different fields, then if addResourceVersionCheck is present with second patch - you will still get error (even if second update changes entirely different field of PVC).
how about we also check addResourceVersionCheck here? Any potential issue?
We can, but if somehow our version of PVC was older then, the entire resizing operation has to be restarted before ExpansionFailedOnController can be set for resizeStatus. As such it does not affect the design but skipping addResourceVersionCheck is an optimization - so as we set this field if expansion fails on controller and even if PVC our version of PVC was slightly older somehow.
There was a problem hiding this comment.
Added as a comment in the code.
7ca824f to
8c552c3
Compare
|
/lgtm |
|
/hold cancel |
Handle the case of recovery from resize failure.
xref kubernetes/enhancements#1790
This is mostly same implementation from intree controller code - https://github.com/kubernetes/kubernetes/blob/master/pkg/volume/util/operationexecutor/operation_generator.go#L1782 with minor changes because of node-only expansion handling as described in KEP.
Manually tested following scenarios (with hostpath driver):