
Proposal for online resizing of persistent volumes #1535

Closed
wants to merge 1 commit

Conversation

mlmhl

@mlmhl mlmhl commented Dec 20, 2017

This is a supplement to this proposal.

@k8s-ci-robot k8s-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Dec 20, 2017
@mlmhl
Author

mlmhl commented Dec 20, 2017

/sig storage

@k8s-ci-robot k8s-ci-robot added the sig/storage Categorizes an issue or PR as relevant to SIG Storage. label Dec 20, 2017
@k8s-ci-robot
Contributor

@mlmhl: Reiterating the mentions to trigger a notification:
@kubernetes/sig-storage-pr-reviews

In response to this:

@kubernetes/sig-storage-pr-reviews @gnufied

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@mlmhl
Author

mlmhl commented Dec 20, 2017

/assign @gnufied


#### Fetch All Pods Using a PVC

We need to know which pods are using a volume with a specified name, but `ExpandController` doesn't cache this relationship. To achieve this goal, we can share the `DesiredStateOfWorld` with `AttachDetachController`. In detail, reflector the `pkg/controller/volume/attachdetach/cache` package to `pkg/controller/volume/cache`, and add a `GetPodsInVolume` method to `DesiredStateOfWorld`. The expand controller uses this method to fetch the pods using the volume with the specified name.
Member

Typo in the word reflector? I am still unclear, though - how will this cache become available to ExpandController, since it will be an internal property of AttachDetachController?

Author

@mlmhl mlmhl Dec 21, 2017

I've updated the proposal to add a more detailed description of this reflector operation.
A preliminary implementation can be found in these commits: reflector,
remove dependence

The `GetPodsInVolume` method will look like this:

```go
func (dsw *desiredStateOfWorld) GetPodsInVolume(volumeName v1.UniqueVolumeName) []*v1.Pod {
    // ...
}
```
Member

We should really use ActualStateOfWorld rather than DesiredStateOfWorld here.

Author

Yeah, you are right. I think we should get the pod-volume relationship (which pod uses which volume) from DesiredStateOfWorld, as ActualStateOfWorld doesn't cache this relationship, and get the volume-node relationship (which volume is mounted to which node) from ActualStateOfWorld. The current proposal misses the latter, so I will update the proposal later.

Author

@mlmhl mlmhl Dec 22, 2017

I added a detailed explanation of this subject to the proposal.

@mlmhl mlmhl force-pushed the pvc-online-resize branch 6 times, most recently from 6756f2a to bca098f on December 22, 2017 06:37
### Prerequisite

* `sig-api-machinery` needs to allow pod annotation updates from kubelet.
* `sig-api-machinery` needs to allow pod gets and annotation updates from expand-controller.
Member

* This modification is possible because `DesiredStateOfWorld` and `ActualStateOfWorld` only use `volumePluginMgr` to fetch the volume name, inside the `DesiredStateOfWorld.AddPod` and `ActualStateOfWorld.AddVolumeNode` methods; we can get the volume name outside and pass it to them as a method parameter.

* Then we can create a common `DesiredStateOfWorld` and `ActualStateOfWorld`, implemented as fields of the `ControllerContext` object, instead of creating them inside `NewAttachDetachController`. After this, `AttachDetachController` and `ExpandController` can both access them through `ControllerContext`.

Member

Is there any example of this design being used previously? I am just curious.

Member

> Then we can create a common DesiredStateOfWorld and ActualStateOfWorld, implemented as fields of the ControllerContext object

No... ControllerContext is not intended to be a shared mutable struct.

Author

Yeah, this refactoring is really a bit tricky. Do we have any better ideas for sharing the ASW/DSW?

Member

Let's list out all the ideas we discussed in the GitHub issue, so that we know all the available options. The other ways of implementing this are:

  1. We discussed that the volumemanager on the kubelet side could watch for PVCs and perform FS resize when a certain condition is met. The downside of this approach is that in a large enough cluster (think 5000 nodes), it may not be a good idea to create such a large number of watches. Having said that, the volumemanager need not watch all PVCs - it can watch only the PVCs that are in its actual_state_of_world. I am not sure about the performance penalty of watching all PVCs vs. watching only a small subset of PVCs.
  2. Another option is that ExpandController can build its own object graph of pvc/pod/node mappings. This would enable us to determine the pod/node where a PVC is being used without sharing a cache with the Attach/Detach controller.
  3. We have kicked around the idea of adding the node name to the PVC, after attach is done, for a while now. This would require an API change, but it would enable us to determine the pvc->node relationship rather quickly.


To achieve this goal, we need to:

* reflector the `pkg/controller/volume/attachdetach/cache` package to `pkg/controller/volume/cache`, and remove the `volumePluginMgr` parameter of the `NewDesiredStateOfWorld` and `NewActualStateOfWorld` functions, so that we can create a new `DesiredStateOfWorld` and `ActualStateOfWorld` without any dependencies.
Member

Do you mean reflector, or do you mean refactor?

Author

Sorry, a typo; I've fixed it.


Kubelet needs to periodically loop through the list of active pods to detect which volumes require file system resizing. We use a `VolumeToResize` object to represent a volume resizing request, and introduce a new interface, `VolumeFsResizeMap`, rather than `DesiredStateOfWorld`, to cache such `VolumeToResize` objects, since we treat it as a request queue, not a state.

Besides, we add a `VolumeFsResizeMapPopulator` interface as the producer for `VolumeFsResizeMap`; it runs an asynchronous periodic loop to populate the volumes that need file system resizing. In each loop, `VolumeFsResizeMapPopulator` traverses each pod's annotations and generates a `VolumeToResize` object for each `volumeFSResizingAnnotation`.
Member

Will this be a new package or part of volume-manager's interface?

Author

I'm not sure I understand what you mean. We can put the VolumeFsResizeMap interface under the package pkg/kubelet/volumemanager/cache, just like the ASW and DSW, and put the VolumeFsResizeMapPopulator interface under the package pkg/kubelet/volumemanager/populator, just like the DesiredStateOfWorldPopulator. Then pkg/kubelet/volumemanager/reconciler.Reconciler can work as a consumer of VolumeFsResizeMap. These two interfaces will look like this:

```go
type VolumeFsResizeMap interface {
    AddVolumeForPod(volumeName string, pod *v1.Pod)
    GetVolumesToResize() []VolumeToResize
    MarkAsFileSystemResized(pod *v1.Pod, volumeName string) error
}

type VolumeFsResizeMapPopulator interface {
    Run(stopCh <-chan struct{})
}
```

@jsafrane
Member

I probably miss something very obvious, but why do we need new annotations? Right now, kubelet already knows that a PV needs a filesystem resize, because it does offline resize. And it can also update the PVC and "finish" the resize so the user knows it's done. So why a new annotation and a new controller?


#### Volume Resize Request Detection

Kubelet needs to periodically loop through the list of active pods to detect which volumes require file system resizing. We use a `VolumeToResize` object to represent a volume resizing request, and introduce a new interface, `VolumeFsResizeMap`, rather than `DesiredStateOfWorld`, to cache such `VolumeToResize` objects, since we treat it as a request queue, not a state. This interface will look like this:
Member

Kubelet already loops through all mounts in VolumeManager. It can see there that a volume needs an fs resize. And based on whether the mount exists in its ASW, it can see whether it's going to perform an online or offline resize.

In addition, fs resize must be coordinated with VolumeManager anyway, since VolumeManager should not touch the volume (e.g. unmount it) during a resize.

Author

> Kubelet already loops through all mounts in VolumeManager. It can see that a volume needs fs resize there.

The problem is that the volume objects cached in the ASW are only updated when the pod starts, and are not updated again afterwards. This means that if we resize the volume after the pod has already started, VolumeManager can't see this change in the ASW.

> In addition, fs resize must be coordinated with VolumeManager anyway, since VolumeManager should not touch the volume (e.g. unmount it) during resize.

We append the resize operation to OperationExecutor so that VolumeManager can't touch the volume before the resize operation finishes.

Author

The original discussion in this issue: kubernetes/kubernetes#57185

Member

I think what we can do is this: when we iterate through the pods in the volumemanager's desired_state_of_world for mount and call the PodExistsInVolume function, that function can make an additional API call to check whether the underlying volume needs resizing. Currently, PodExistsInVolume already raises a remountRequired error when some change requires us to remount the volume. So we can introduce a new resizeRequired error, which will be raised by the function and in turn will trigger a file system resize on the kubelet.

This will require no API change or annotation, and everything will work online as expected. Thanks @jsafrane for pointing it out.

Member

One potential problem with the ^ approach is that the reconciler loop runs every 100ms, and if a node has, say, 20 pods with volumes, then we will be making 20 GET requests every 100ms, because there is no informer inside volume_manager. Again, I am not sure how badly it will affect us.

Author

Maybe we can add a lastUpdateTimestamp for each volume and use a loop interval independent of the reconciler's. This would reduce the request count, but I am not sure how much it would help.

@mlmhl
Author

mlmhl commented Jan 26, 2018

@jsafrane

> I probably miss something very obvious, but why do we need new annotations?

Kubelet only watches pods, and only fetches the PVC object from the apiserver when the pod starts; it won't get the latest PVC object again after the pod has already started. So if we resize the volume after the pod has started, kubelet can't see this request.

We add an annotation to the pod to inform kubelet that some volumes need resizing, but as @gnufied mentioned, this is just one of several available approaches to inform kubelet.

@cblecker
Member

/ok-to-test

@mlmhl
Author

mlmhl commented Apr 28, 2018

Hi guys @jsafrane, @yujuhong, we changed this proposal again, as we found that Kubelet's VolumeManager is already fetching the PV/PVC during every pod resync loop; see here and here. So we decided to reuse this code, using the fetched PV/PVC objects to check whether a volume requires online resizing, instead of adding a PVC cache.

Actually, the current PV/PVC reprocessing mechanism has a hidden performance problem, so we opened a separate issue to track it: #63274

cc @gnufied

@gnufied
Member

gnufied commented Apr 28, 2018

In light of @mlmhl's finding that we are already fetching the pvc/pv from the api-server on every pod sync, I agree that we do not need a PVC cache similar to the one for configmaps etc. This proposal can definitely work without introducing new api calls.

I propose we tackle the bug/issue with the PVC/PV getting fetched all the time in the separate GitHub issue that has been opened - kubernetes/kubernetes#63274. cc @saad-ali

@gnufied
Member

gnufied commented May 19, 2018

/lgtm

cc @saad-ali @msau42 @jsafrane - let's try to get this merged for 1.11. I do not think there are any remaining blockers here.

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label May 19, 2018
Member

@jsafrane jsafrane left a comment

lgtm-ish; however, you should note somewhere what happens when a volume is being resized and, at the same time, a pod is deleted and the same volume is being unmounted. IMO the unmount must wait (should not be started) until the resize finishes. The Reconciler(?) should check that.

In each loop, for each volume in the DSW, `Reconciler` checks whether this volume exists in the ASW. If the volume exists but is marked as requiring file system resizing, the ASW returns a `fsResizeRequiredError`, and `Reconciler` then triggers a file system resizing operation for this volume. The code snippet looks like this:

```go
if cache.IsFSResizeRequiredError(err) {
    // ...
}
```
Member

I don't like passing random return values via errors; errors should be reserved for real errors/exceptions. But we can solve that in the implementation phase, not in this proposal.

Author

Currently we use a remountRequiredError to indicate that a volume needs a remount, so I added a resizeRequiredError for consistency. As you said, we can review this again in the implementation phase.

@gnufied
Member

gnufied commented May 21, 2018

@jsafrane yeah, I think there is a note in the original resize proposal that if a PVC has any resize-in-progress conditions, then we will wait for them to clear before unmounting the volume.

k8s-github-robot pushed a commit to kubernetes/kubernetes that referenced this pull request May 31, 2018
Automatic merge from submit-queue (batch tested with PRs 62460, 64480, 63774, 64540, 64337). If you want to cherry-pick this change to another branch, please follow the instructions [here](https://github.com/kubernetes/community/blob/master/contributors/devel/cherry-picks.md).

Implement kubelet side online file system resize for volume

**What this PR does / why we need it**:

Implement kubelet side online file system resize.

xref - [kubernetes/feature#531](kubernetes/enhancements#531)
proposal - [kubernetes/community#1535](kubernetes/community#1535)

**Release note**:

```release-note
Implement kubelet side online file system resizing
```
@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Aug 30, 2018
@fejta-bot

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Sep 29, 2018
@gnufied
Member

gnufied commented Nov 7, 2018

/remove-lifecycle rotten

@k8s-ci-robot k8s-ci-robot removed the lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. label Nov 7, 2018
@fejta-bot

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Feb 5, 2019
@mlmhl
Author

mlmhl commented Feb 6, 2019

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Feb 6, 2019
@fejta-bot

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label May 7, 2019
@agolomoodysaada

/remove-lifecycle stale
Still relevant, right?

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label May 7, 2019
@msau42
Member

msau42 commented May 7, 2019

@k8s-ci-robot
Contributor

@msau42: Closed this PR.

In response to this:

This has been replaced by https://github.com/kubernetes/enhancements/blob/master/keps/sig-storage/20190125-online-growing-persistent-volume-size.md

/close


Labels: cncf-cla: yes · kind/design · lgtm · sig/node · sig/storage · size/L