[Proposal] Improve Local Storage Management #306

Merged · 19 commits · May 7, 2017

380 changes: 380 additions & 0 deletions contributors/design-proposals/local-storage-overview.md

# Local Storage Management
Authors: vishh@, msau42@

This document presents a strawman for managing local storage in Kubernetes. We expect to provide a UX and high-level design overview for managing most user workflows. More detailed design and implementation will be added once the community agrees with the high-level design presented here.

# Goals
* Enable ephemeral & durable access to local storage
* Support storage requirements for all workloads supported by Kubernetes
* Provide flexibility for users/vendors to utilize various types of storage devices
* Define a standard partitioning scheme for storage drives for all Kubernetes nodes
* Provide storage usage isolation for shared partitions
* Support random access storage devices only

Can you elaborate on what support for "random access storage devices only" means? Does this mean using RAM as storage?

Member

Good question. I took this to mean DASD (i.e. this will not work for Tape drives or other sequential access storage media). Is this not the case?

Member

Yes, it means not supporting tape.


# Non Goals
* Provide isolation for all partitions. Isolation will not be of concern for most partitions since they are not expected to be shared.
Member

This could be written more concisely as "Provide usage isolation for non-shared partitions" which would also make it more parallel with the Goal "Provide storage usage isolation for shared partitions"

* Support all storage devices natively in upstream Kubernetes. Non standard storage devices are expected to be managed using extension mechanisms.

# Use Cases

## Ephemeral Local Storage
Today, ephemeral local storage is exposed to pods via the container’s writable layer, logs directory, and EmptyDir volumes. Pods use ephemeral local storage for scratch space, caching and logs. There are many issues related to the lack of local storage accounting and isolation, including:

nitpick: how is the logs directory exposed today? Do you mean via emptyDir, hostPath, or PV?

Member

When a container writes to stdout, it gets saved to /var/log on the node, which is backed by the primary root partition.


* Pods do not know how much local storage is available to them.
* Pods cannot request “guaranteed” local storage.
* Local storage is a “best-effort” resource
* Pods can get evicted due to other pods filling up the local storage, after which no new pods will be admitted until sufficient storage has been reclaimed
Member

s/during which/after which/


## Persistent Local Storage
Distributed filesystems and databases are the primary use cases for persistent local storage due to the following factors:

* Performance: On cloud providers, local SSDs give better performance than remote disks.

Will performance include a QoS IOPS requirement for distributed storage systems?

Member

The PVs will have to be created by the admin/addon that utilizes the entire disk to guarantee IOPs for performance use cases.

Not sure I understand correctly, but why do PVs have to be full-disk? Why not a properly aligned partition?

Member

We're not requiring that the PV has to use the whole disk (the volume is created on a partition), but if you need IOPS guarantees, then it should be a dedicated disk. Especially for rotational disks, the IO will still end up being on a shared path at the device layer.

SSDs may offer high enough IOPS that you can share them.

Contributor Author

As @msau42 mentioned, the API that kubernetes would consume is a logical partition. It can map to any storage configuration (RAID, JBOD, etc.). We recommend not sharing spinning disks unless either the storage configuration or the IOPS requirements permit sharing them.

* Cost: On bare metal, in addition to performance, local storage is typically cheaper, and using it is a necessity for provisioning distributed filesystems.

Distributed systems often use replication to provide fault tolerance, and can therefore tolerate node failures. However, data gravity is preferred for reducing replication traffic and cold startup latencies.

# Design Overview

A node’s local storage can be broken into primary and secondary partitions.
Member

Were other options considered? Especially LVM would allow us to add / remove devices on hosts and RAID 0/1/5 per volume with very little overhead. With partitions, you must do all this manually.

Member

In the case of persistent local storage, most of the use cases we have heard about prioritize performance and being able to use dedicated disks.

In addition, LVM is only available on Linux, so it could be difficult to use as a generic solution.


I'm assuming a Primary and Secondary partitions are logical objects which can be implemented a multiple of ways. Do you mind elaborating on possible implementations?

Member

Is this the kind of information you were looking for?

  • Using an entire disk (this is the primary use case for persistent local storage)
  • Adding multiple disks into a RAID volume
  • Using LVM to carve out multiple logical partitions (if you don't need IOPs guarantees)


## Primary Partitions
Primary partitions are shared partitions that can provide ephemeral local storage. The two supported primary partitions are:

### Root
This partition holds the kubelet’s root directory (`/var/lib/kubelet` by default) and `/var/log` directory. This partition may be shared between user pods, OS and Kubernetes system daemons. This partition can be consumed by pods via EmptyDir volumes, container logs, image layers and container writable layers. Kubelet will manage shared access and isolation of this partition. This partition is “ephemeral” and applications cannot expect any performance SLAs (Disk IO for example) from this partition.
Member

s/IO/IOPs/


Contributor

If it's been shared via user pods, isn't there a security concern? How does kubelet take care of that or ensure isolation?

Member

The "sharing" described here refers to the pods writeable, logs and emptydir directories all being backed by the same partition, not that they're sharing the contents.

So while pods cannot access other pod's contents, today it is possible to impact other pods by using up all the space on the partition. So this proposal is trying to address that issue and allow pods to specify capacity boundaries and have kubelet enforce them.

### Runtime
This is an optional partition which runtimes can use for overlay filesystems. Kubelet will attempt to identify and provide shared access along with isolation to this partition.

actually, which specific partition is this one? can you provide more information? can a pod request this today, and how?

Member

This detail is hidden from the pod's point of view. It's configured at the kubelet level, and it's an optional partition where the container images can be stored.

Member

Above it said "Root" was for container writable layers, so this is for image-based layers only? Clarify?

Member

@thockin -- this would include both image-based and container writable layers.

Member

How can writeables be lumped under both Root and runtime?

Member

writables are not in root if runtime is enabled, we need to clarify that in this document. this is consistent with eviction behavior for imagefs and nodefs described here: https://kubernetes.io/docs/concepts/cluster-administration/out-of-resource/#eviction-signals

Contributor

so these filesystems (partitions, rather) are not disjoint in most cloud images I've seen. most cloud instances are a single (block) partition with a single filesystem: the root filesystem. /var/lib/kubelet, /var/lib/docker, and /var/log are all on the same filesystem. the premise that these filesystems are disjoint and can be allocated from independently is not going to be true in many cases.

Contributor Author

@thockin I clarified the subtlety a bit more. PTAL


## Secondary Partitions
All other partitions are exposed as persistent volumes. The PV interface allows for varying storage configurations to be supported, while hiding specific configuration details to the pod. Applications can continue to use their existing PVC specifications with minimal changes to request local storage.
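For illustration, a claim against such a secondary-partition PV might look like the sketch below; the `volume-type` field and the medium label mirror the strawman examples later in this document and are not existing API.

```yaml
kind: PersistentVolumeClaim
apiVersion: v1
metadata:
  name: myclaim
  labels:
    storage.kubernetes.io/medium: ssd   # proposed label identifying the storage medium
spec:
  volume-type: local      # proposed field from this document, not existing API
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: 100Gi
```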

When they are exposed as PVs, are they created and available as a pool? Do you mind elaborating a little more on Secondary Partitions? Who creates them, how are they managed, what sizes, etc.

Member

Yes, the workflow mentions this a little bit. I will also add it here.

There will be an addon daemonset that can discover all the secondary partitions on the local node and create PV objects for them. The capacity of the PV will be the size of the entire partition. We can provide a default daemonset that will look for the partitions under a known directory, and it is also possible to write your own addons for your own environment.

So the PVs will be statically created and not dynamically provisioned, but the addon can be long running and create new PVs as disks are added to the node.


@krmayankk Feb 11, 2017

why can't the kubelet manage the secondary partitions as well and create the corresponding PVs rather than having an addon DaemonSet do this? @msau42

Member

The primary reason to use an addon is to give the admin flexibility in terms of how they want to configure their local storage. They can mount the disks wherever they want, and then configure their labels and storageclasses accordingly. If they want to make any changes to how it is managed, then the daemonset can be updated instead of having to upgrade kubelet.

Do you see any downsides to the addon approach?

thanks @msau42, i agree on the flexibility part, i guess i don't see any downsides. We should have an option to say which partitions are available for pre-creation of PVs vs which should be left untouched.

Member

If the admin provides their own DaemonSet, then which partitions to use for PVs is completely under their control. For the default DaemonSet, we can define a specific directory (for example, '/var/kubelet/local-partitions'), where all the partitions can be mounted and we'll automatically detect them and add them. Any partitions that are not in this directory will be ignored.

# User Workflows
Contributor

Does 'local PV' support upcoming features like snapshot or replication?

Member

If I understand correctly, currently the snapshot and replication features are focused on the orchestration, which means the underlying storage system needs to support it. We don't have plans to implement those features for local PVs, which is designed to be a simple abstraction of a logical partition. I think higher level applications can be built on top to handle those features though.

Contributor Author

Use a distributed filesystem if you need snapshotting and/or replication. This is the kubernetes recommended approach. Very few apps should use local storage directly.

Contributor

@msau42 @vishh thanks.. yes, completely agree on, very few apps may request local storage directly.


### Alice manages a deployment and requires “Guaranteed” ephemeral storage

1. Kubelet running across all nodes will identify the primary partition and expose capacity and allocatable for the primary “root” partition. The runtime partition is an implementation detail and is not exposed outside the node.
Member

is there an assumption that all nodes in the cluster run the same storage driver? for example, we have had requests for clusters that have some nodes running devicemapper (where posix compliance is a concern), and others run overlay, etc. for systems where devicemapper is chosen, we are limited by dm.basesize for the storageOverlay value unless the state of the art has changed. as a result, i think the node will need some mechanism to expose if the storageOverlay request is actually feasible? for example, if you requested 20Gi but the dm.basesize=10Gi, a devicemapper node may not even be able to satisfy the workload.

so i guess the capacity and allocatable are only for emptyDir and container writable layers, right? and allocatable would have already taken into account the space needed for k8s system daemons?


```yaml
apiVersion: v1
kind: Node
metadata:
  name: foo
status:
  Capacity:
    Storage: 100Gi
  Allocatable:
    Storage: 90Gi
```

2. Alice adds a “Storage” requirement to her pod as follows

```yaml
apiVersion: v1
kind: pod
metadata:
  name: foo
spec:
  containers:
  - name: fooc
Member

I think there's a missing - here.

    resources:
      limits:
        storage-logs: 500Mi
Member

Is there a reason for doing it this way rather than:

limits:
  storage:
    logs: 500Mi
    overlay: 1Gi

?

Member

It's a limitation in the LimitRange design. It doesn't support nesting of limits.

        storage-overlay: 1Gi
  volumes:
  - name: myEmptyDir
Member

Missing -.

    emptyDir:
      capacity: 20Gi
```

3. Alice’s pod “foo” is Guaranteed a total of “21.5Gi” of local storage. The container “fooc” in her pod cannot consume more than 1Gi for writable layer and 500Mi for logs, and “myEmptyDir” volume cannot consume more than 20Gi.

How are the guarantees provided by the system? The FS?, logical volume?

Member

For primary partitions, the node's local storage capacity will be exposed so that the scheduler can take into account a pod's storage limits and what nodes can satisfy that limit.

Then kubelet will monitor the storage usage of the emptydir volumes and containers so that they stay within their limits. If quota is supported, then it will use that.

nitpick: myEmptydir is set to 1Gi, not 20Gi

4. Alice’s pod is not provided any IO guarantees
5. Kubelet will rotate logs to keep logs usage of “fooc” under 500Mi
Member

How does this interact with kubectl logs? Right now we are aggregating and rolling stdout and stderr? Are you proposing that we use local storage instead of, or in addition to, the current K8s logging infra?

Member

It should have no impact to kubectl logs. It's only changing the log rotation mechanism to be on a per container basis instead of on a node basis.

6. Kubelet will attempt to hard limit the local storage consumed by pod “foo” if the Linux project quota feature is available and the runtime supports storage isolation.
Member

s/foo/fooc/

Member

foo is correct because it's referring to the pod, and not the container.


The quota feature assumes an appropriate supporting file system is being used. A large part of the distributed storage systems require raw (no file system) storage. How would that be managed? Would a raw partition be created by a logical manager?

Member

We don't plan to support raw partitions as a primary partition. Secondary partitions can have block level support though.

7. With hard limits, containers will receive an ENOSPC error if they consume all reserved storage. Without hard limits, the pod will be evicted by kubelet.

Assuming the FS supports such a feature.

Member

Yes, I will mention otherwise kubelet can only enforce soft limits.

8. Health is monitored by an external entity like the “Node Problem Detector” which is expected to place appropriate taints.
9. If a primary partition becomes unhealthy, the node is tainted and all pods running in it will be evicted by default, unless they tolerate that taint. Kubelet’s behavior on a node with unhealthy primary partition is undefined. Cluster administrators are expected to fix unhealthy primary partitions on nodes.
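As a hedged sketch of what such a taint could look like (the taint key and value below are hypothetical and not part of this proposal), an external health monitor might mark the node as follows:

```yaml
apiVersion: v1
kind: Node
metadata:
  name: node-1
spec:
  taints:
  - key: storage.kubernetes.io/primary-partition-unhealthy   # hypothetical taint key set by the monitor
    value: root
    effect: NoExecute   # evicts pods that do not tolerate the taint
```

Pods that must keep running on such a node would declare a matching toleration.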

I know number 8 showed there will be a Health monitor, but how would it detect that the primary partition is unhealthy on number 9? What does it mean to be unhealthy?

Member

Now that I think about it more, health monitoring is dependent on the environment and configuration, so an external monitor may be needed for both primary and secondary.

It can monitor at various layers depending on how the partitions are configured:
disk layer: look at SMART data
raid layer: look for complete raid failure (non-recoverable)


How to do disk health monitoring if the Node is a VM and disk is a virtual disk? The smartctl or raid tools may not return correct data.

Member

That's a good point. Because the partition configuration is very dependent on the environment, I think we cannot do any monitoring ourselves. Instead, we can define a method for external monitors to report errors, and also define how kubernetes will react to those errors.

Does our proposal/design require this health monitor? Let's say, in the default configuration when there is no external health monitor, what is the behavior?

Member

The health monitor is not required. In that case, it will behave the same way that it does today, which is undefined.


### Bob runs batch workloads and is unsure of “storage” requirements

1. Bob can create pods without any “storage” resource requirements.

```yaml
apiVersion: v1
kind: pod
metadata:
  name: foo
  namespace: myns
spec:
  containers:
  - name: fooc
  volumes:
  - name: myEmptyDir
    emptyDir:
```

2. His cluster administrator, being aware of the issues with disk reclamation latencies, has decided not to allow overcommitting primary partitions. The cluster administrator has installed a LimitRange in the “myns” namespace that will set a default storage size. Note: A cluster administrator can also specify burst ranges and a host of other features supported by LimitRange for local storage.
Member

Add a link to the LimitRange user guide that explains this.
(BTW did you mean "burst" rather than "bust"?)


This part is a little bit confusing "His cluster administrator being aware....". Does that mean that this solution would require the administrator to take action or things may be incorrectly allocated?

Member

In order to solve today's local storage isolation problem, pods should specify limits for their local storage usage. In the absence of that, the administrator has the option to specify defaults for the namespace. If neither of those two occur, then you just have the same issue today.


```yaml
apiVersion: v1
kind: LimitRange
metadata:
  name: mylimits
spec:
- default:
    storage-logs: 200Mi
    storage-overlay: 200Mi
  type: Container
- default:
    storage: 1Gi
Member

to clarify, each empty dir backed volume will pick up this default capacity? so if a user had multiple empty dirs for some reason, each would get 1Gi?

Member

Yes, it's the limit per emptydir. You bring up a good point though, that the "type: Pod" implies that it's for the whole pod. We can change it to "type: EmptyDir"

  type: Pod
```

3. The limit range will update the pod specification as follows:

```yaml
apiVersion: v1
kind: pod
metadata:
  name: foo
spec:
  containers:
  - name: fooc
    resources:
      limits:
        storage-logs: 200Mi
        storage-overlay: 200Mi
  volumes:
  - name: myEmptyDir
    emptyDir:
      capacity: 1Gi
Member

do you worry users will get confused with this field as only being meaningful when the medium is disk and not memory?

Member

It can be used for memory-backed emptydir too.

```

4. Bob’s “foo” pod can use up to “200Mi” for its containers’ logs and writable layers each, and “1Gi” for its “myEmptyDir” volume.
5. If Bob’s pod “foo” exceeds the “default” storage limits and gets evicted, then Bob can set a minimum storage requirement for his containers and a higher “capacity” for his EmptyDir volumes.

The concern I have here is that it requires a lot of interaction with an administrator and the user. If I am "Bob", I'm just going to keep asking for more storage (1, then 2, then .. ). That would move the Pod from node to node satisfying the storage size request. I'm guessing... How different is this from the current model?

Member

Yes there is a little bit of a trial and error going on here for Bob. But as an application developer, you will have to do this in order to size your apps appropriately. One goal that we're trying to achieve here is provide pods better isolation from other pods running on that node through storage isolation.

Member

s/capacity/size/


```yaml
apiVersion: v1
kind: pod
metadata:
  name: foo
spec:
  containers:
  - name: fooc
    resources:
      requests:
        storage-logs: 500Mi
        storage-overlay: 500Mi
  volumes:
  - name: myEmptyDir
    emptyDir:
      capacity: 2Gi

is emptyDir capacity a new proposal ?

Member

Yes that's new for the proposal. Let me see if the formatting will let me bold all the new fields to make that more clear.

thanks that will help

Member

Unfortunately, it doesn't look like there's a way in markdown to bold specific lines in a code block. I'll make the preceding sections explicitly say which fields are new.

```

6. Note that it is not possible to specify a minimum storage requirement for EmptyDir volumes because we intend to limit overcommitment of local storage due to the high latency in reclaiming it from terminated pods.
Member

s/intent/intend/
BTW I'm not clear about the connection between prohibiting a minimum storage requirement in LimitRange and overcommit. Won't the scheduler prohibit overcommit regardless of what storage requirement you give for the EmptyDir (regardless of whether it's set manually or via LimitRange)?

Member

i assume this means we will just not allow a request, only a limit, for volumes.

Member

intent=intend


### Alice manages a Database which needs access to “durable” and fast scratch space

1. Cluster administrator provisions machines with local SSDs and brings up the cluster
2. When a new node instance starts up, an addon DaemonSet discovers local “secondary” partitions which are mounted at a well-known location and creates HostPath PVs for them if they don’t already exist. The PVs will include a path to the secondary device mount points, and include additional labels and annotations that help tie them to a specific node and identify the storage medium.
Member

So there is always a 1:1 correspondence between PV and partition on a secondary device?

Member

Yes, this will let the cluster administrator decide how they want to provision local storage. They can have one partition per disk for IOPS isolation, or if sharing is ok, then create multiple partitions on a device.


I'm guessing this is based on the technology of the underlying filesystem. If not, then I think this depends a lot on some type of logical volume manager. If not, only two things can happen: 1. a secondary partition is the entire disk, or 2. a lot of disk fragmentation. I think more information on how number 1 is done may shed more light on this model.

Member

Yes, it is up to the administrator to partition and create the filesystem first. And how that is done will depend on the partitioning tools (parted, mdadm, lvm, etc) available and which filesystems the administrator decides to use. From Kubernetes point of view, we will not handle creating partitions or filesystems.

Member

it is up to the administrator to partition and create the filesystem first

That's very inconvenient for an admin. Also, when such PV gets Released, who/how removes the data there and puts it back to Available? We'd like to deprecate recycler as soon as possible.

IMO, some sort of simple dynamic provisioning would be very helpful and it's super simple with LVM. It should be pluggable though to work on all other platforms.

Member

The current thought for the PV deletion workflow is to set the PV to the Released phase, delete all the files (similar to how EmptyDir cleans up), delete the PV object, and then the addon daemonset will detect the partition and create the PV for it again.

So from an admin's point of view, the partitioning and fs setup is just a one time step whenever new disks are added. And for the use case that we are targeting, which requires whole disks for IOPs guarantees, the setup is simple: one partition across the whole disk, and create the filesystem on that partition.

As for LVM, I agree it is a simpler user model, but we cannot get IOPs guarantees from it, which is what customers we've talked to want the most. I don't think this design will prevent supporting an LVM-based model in the future though. I can imagine there can be a "storageType = lvm" option as part of the PV spec, and a dynamic provisioner can be written to use that to carve out LVs from a single VG. The scheduling changes that we have to make to support local PVs can still apply to a lvm-based volume. We're just not prioritizing it right now based on user requirements.

i agree with @jsafrane that we should have some default out-of-the-box local disk PV provisioner, and for default cases we don't have to do an addon or some such thing. 90% of use cases might be just simple use of local disks

Member

Based on feedback we have gotten from customers and workloads team, it's the opposite. Most of the use cases require dedicated disks. We have not seen many requests for dynamic provisioning of shared disks. If you see some real use cases where an app wants to use persistent local storage (and all its semantics), but doesn't need performance guarantees, then I would be interested in hearing about them as well.

I do want to make sure that nothing in this proposal would prevent LVM and dynamic provisioning from being supported in the future. And that it will be able to take advantage of the scheduling and failure handling features we will be adding.

In terms of admin work, my hope is that the default addon will require a similar amount of admin intervention as the LVM model (configure the disk once in the beginning, the system takes care of the rest).


```yaml
kind: PersistentVolume

Just to clear my confusion, are these all created by hand?

Member

They could be created by hand, or if you put the partitions in a known directory, then the addon daemonset can discover the partitions and automatically create the PVs.

apiVersion: v1
metadata:
  name: local-pv-1
  annotations:
    storage.kubernetes.io/node: node-1
  labels:
    storage.kubernetes.io/medium: ssd
spec:
  volume-type: local
  storage-type: filesystem
  capacity:
    storage: 100Gi
  hostpath:
    path: /var/lib/kubelet/storage-partitions/local-pv-1

why is the path in the spec? i am assuming the pv gets populated with the path of where the storage is allocated. Putting it in the spec suggests it's an input, when it's probably just an output, right @msau42?

Member

It is an input for the PV layer, to allow for situations where admins decide not to follow the default partitioning discovery scheme (where the addon auto-discovers the partitions at a known location and creates the PVs).

Contributor Author

I think we want the API to be generic. The path can be whatever. From an API standpoint, if a user needs to attach a local PV to a pod then they should have all the information they need from the API.
Imagine a user implementing their own kubelet.
I'd much rather prefer not having a VFS based API and instead make path explicit.

Is there a solution to make sure that the same partition is re-mounted to the same path after the node restarts?

Member

No, it is a requirement on the storage admin to set up the local mounts such that they will be mounted to the same location across reboots.

  accessModes:
  - ReadWriteOnce
  persistentVolumeReclaimPolicy: Delete

Why the reclaim policy is Delete for the local disk PV, is Recycle better?

Member

The recycle policy is going to be deprecated and eventually removed. We are thinking about getting similar recycling behavior by having the 'delete' operation clean up the volume, and having the addon daemonset create the PV again after it's been deleted.

```
```
$ kubectl get pv
NAME CAPACITY ACCESSMODES RECLAIMPOLICY STATUS CLAIM … NODE
local-pv-1 100Gi RWO Delete Available node-1
Contributor

What does delete do with local PV? Does the node delete the PVs and no longer make them available?

Member

The daemonset will operate like an external provisioner and handle the cleanup and reclamation. It will delete the contents of the PV, and then create a new PV for it.

Member

We operate on the assumption that a PV can only be bound to one PVC.

local-pv-2 10Gi RWO Delete Available node-1
local-pv-1 100Gi RWO Delete Available node-2
local-pv-2 10Gi RWO Delete Available node-2
local-pv-1 100Gi RWO Delete Available node-3
local-pv-2 10Gi RWO Delete Available node-3
```
3. The addon will monitor the health of secondary partitions and taint PVs whenever the backing local storage device becomes unhealthy.
Member

There is currently no notion of tainting PVs, only nodes. Can you say more about what semantics you are expecting for tainting a PV?

Member

We would like to evict pods that are using tainted PVs, unbind the PVC, and reschedule the pod so that it can bind to a different PV. I think everything after the eviction could be handled by a separate controller.

4. Alice creates a StatefulSet that uses local PVCs

```yaml
apiVersion: apps/v1beta1
kind: StatefulSet
metadata:
  name: web
spec:
  serviceName: "nginx"
  replicas: 3
  template:
    metadata:
      labels:
        app: nginx
    spec:
      terminationGracePeriodSeconds: 10

i think it's more of a statefulset question, but are we considering any parameters for how the statefulset behaves when a claim bound to a local pv somehow becomes unhealthy or the node goes down? Will the statefulset time out and bring up that pod on a new node? how will we deal with partitions?

Member

Yes, handling those failures will be part of a "forgiveness" feature. I will update this proposal to include how to specify it, and which scenarios will be handled.

Basically, if the node or pv dies, then after some specified forgiveness period, then we can unbind the PVC from the PV, evict the pod, and cause it to be rescheduled and obtain a new PV.

Can you clarify what is the partition issue you are asking about?

thanks @msau42, ignore the partition issue, which is more of a statefulset thing. Basically, how do we ensure in case of partitions that there is exactly one pod with a given identity? Looking forward to the forgiveness update

      containers:
      - name: nginx
        image: gcr.io/google_containers/nginx-slim:0.8
        ports:
        - containerPort: 80
          name: web
        volumeMounts:
        - name: www
          mountPath: /usr/share/nginx/html
        - name: log
          mountPath: /var/log/nginx
  volumeClaimTemplates:
  - metadata:
      name: www
      labels:
        storage.kubernetes.io/medium: ssd
    spec:
      volume-type: local
Member

Wouldn't the convention be volumeType?

      accessModes: [ "ReadWriteOnce" ]
      resources:
        requests:
          storage: 100Gi
  - metadata:
      name: log
      labels:
        storage.kubernetes.io/medium: hdd
    spec:
      volume-type: local
      accessModes: [ "ReadWriteOnce" ]
      resources:
        requests:
          storage: 1Gi
```

5. The scheduler identifies nodes for each pod that can satisfy its cpu, memory, and storage requirements and that also contain free local PVs to satisfy the pod’s PVC claims. It then binds the pod’s PVCs to specific PVs on the node and then binds the pod to the node.
```
$ kubectl get pvc
NAME STATUS VOLUME CAPACITY ACCESSMODES … NODE
www-local-pvc-1 Bound local-pv-1 100Gi RWO node-1
www-local-pvc-2 Bound local-pv-1 100Gi RWO node-2
www-local-pvc-3 Bound local-pv-1 100Gi RWO node-3
log-local-pvc-1 Bound local-pv-2 10Gi RWO node-1
log-local-pvc-2 Bound local-pv-2 10Gi RWO node-2
log-local-pvc-3 Bound local-pv-2 10Gi RWO node-3
```
```
$ kubectl get pv
NAME CAPACITY … STATUS CLAIM NODE
local-pv-1 100Gi Bound www-local-pvc-1 node-1
local-pv-2 10Gi Bound log-local-pvc-1 node-1
local-pv-1 100Gi Bound www-local-pvc-2 node-2
local-pv-2 10Gi Bound log-local-pvc-2 node-2
local-pv-1 100Gi Bound www-local-pvc-3 node-3
local-pv-2 10Gi Bound log-local-pvc-3 node-3
```

6. If a pod dies and is replaced by a new one that reuses existing PVCs, the pod will be placed on the same node where the corresponding PVs exist. Stateful pods are expected to have a high enough priority that they will preempt other lower-priority pods if necessary to run on a specific node.
Member
@kow3ns Jan 30, 2017

So we are depending on priority and preemption to be implemented prior to this?

Member
@davidopp Jan 31, 2017

It seems like what's described in this section also relies on some variant of #7562 / #30044 being implemented, as today there is no notion of a local PV (beyond the experimental HostPath volume type which doesn't do what's needed here).

Member

Yes, this proposal is also covering this new local PV type.

Member

@kow3ns regarding priority and preemption, it is not a strict requirement for this feature, but will make the workflow smoother. There are also plans to implement this soon.

7. If a new pod fails to get scheduled while attempting to reuse an old PVC, the StatefulSet controller is expected to give up on the old PVC (delete & recycle) and instead create a new PVC based on some policy. This is to guarantee scheduling of stateful pods.
Member
@kow3ns Jan 31, 2017

This should occur only when the PV backing the PVC is permanently unavailable. If a controller creates a new PVC and relaunches the Pod with that PVC, it will never be able to reuse the data on the old PV anyway. To simplify this for controller developers, when some policy is applied to indicate that K8s should "give up" on recovering a PV, can we just delete the PV, and set the status of the PVC to pending? This would reduce the complexity of the interaction with DaemonSet, StatefulSet, and any other controllers and local persistent storage.

Member

This situation could also occur if the node has failed or can no longer fulfill other requested resources, for example, if other pods got scheduled and took up the cpu or memory needed.

The main concern with deleting the PV and keeping the PVC, is that it may not follow the retention policy. The user may want to recover data from the PV, but won't have the pod->PVC->PV binding anymore. As another alternative, we could remove the PVC->PV binding, and if the PV policy is retain, also add an annotation with the old pod, PVC information so the user can figure out which PV had their data.

Member

I like the idea of keeping the PVC and just removing the PVC->PV binding. If we expect the StatefulSet controller to modify the Pod to use a new PVC, that essentially means only the StatefulSet controller can perform the task of unblocking its unschedulable Pods. That in turn means that every controller needs to separately implement this behavior. For example, what if I have "stateless" Deployment Pods that want this behavior for their large caches on local PV?

If unblocking can be done without modifying the Pods to use a different PVC, then it leaves the door open to write a generic "local PV unbinding" controller that implements this behavior once for everyone who requests it via some annotation or field.

Member

The generic PVC unbinding controller can monitor for this error condition, unbind the PVC, clean up the PV according to the reclaim policy, and then evict and reschedule the pods to force them to obtain a new PV.

8. If a PV gets tainted as unhealthy, the StatefulSet is expected to delete pods if they cannot tolerate PV failures. Unhealthy PVs once released will not be added back to the cluster until the corresponding local storage device is healthy.
Member
@kow3ns Jan 31, 2017

If you are targeting DBs and DFSs, and if a "taint" is really pertaining to a problem with the underlying storage media, I don't think anything in your target set will tolerate a taint. @davidopp shouldn't this be expressed by the controller in terms of declarative tolerations against node taints in any case. That is, don't I have to explicitly declare the taint as tolerated?

Member

Instead of having every controller/operator watch for the appearance of taints on a node and delete Pods should we consider the following approach?

  1. DBs and DFSs should include a health check that causes the Pod to fail when the contained application can't write to storage media (most, if not all, storage applications will fail on errors returned from fsync/sync).
  2. When the application monitoring the storage device decides that the mounted PVs are unrecoverable, it should delete the PVs and mark the Bound PVCs as pending. The policy deciding when to do this can be applied here. Note that this is no scarier than having the controller make the decision to delete the PVC. In either case, once the Pod is disassociated from its local volume and launched with another, it can never be safely re-associated with the prior volume. Both cases also need a good story around snapshots and backups. I think that, as the device monitoring application is a node-local agent, it can make a better decision about when to "give up" trying to bind a Pod to a local mount.
  3. As the volumes are deleted, we need not be concerned with the PVCs being fulfilled by this node unless it has volumes mounted on another, functional device.
  4. When controllers/operators recreate the Pods, their existing PVCs must be Bound to volumes provided by another node.

If we take an approach that is closer to this we don't have to duplicate the watch logic in every controller/operator.

Member

If I understand correctly, are you suggesting to leave it up to the application to handle local storage failures, since each application may have its own unique requirements and policies?

Member

Sorry if I was not clear. I am saying the opposite. The "application monitoring the storage device" referred to above is based on the design statement that " Health is monitored by an external entity like the “Node Problem Detector” which is expected to place appropriate taints."
Rather than have every controller/operator attempt to heuristically guess when it should delete a PVC, might it not be better to have the "Node Problem Detector", kubelet, or another local agent make the decision that the volume is no longer usable due to device failure, and to set the associated PVC back to pending? Perhaps using your suggestion above to retain the volume for data recovery purposes. I can't think of a distributed storage application that will want to re-balance or re-replicate its data due to a temporary network partition or intermittent node failure. The only time, IMO, that we'd want a controller/operator to move a Pod with local PVs to a new node is if the storage device failed, or if the MTTR is so high that it might as well have. In the former case it might be best if a node-local agent made the decision that the storage device has failed. In the latter case, we should at least consider having a global controller with a policy that can unbind local PVs from PVCs, rather than having every controller/operator have to implement its own policy.

Member

I don't think the controller (e.g. StatefulSet) should be responsible for deleting Pods. I think @kow3ns is also saying that if I'm reading him correctly.

My understanding is that regular Node taints are noticed and enforced by kubelet, which may evict the Pod if it doesn't tolerate the taint. Wouldn't it make sense for kubelet to also evict the Pod if it does not tolerate a taint on one of its local PVs?

If recreated with the same PVC, the Pod would remain unschedulable due to the taint on the PV. At this point, the problem is reduced to being the same as (7) above. In this way, both (7) and (8) can be handled without necessarily requiring any changes to StatefulSet or other controllers (if a generic controller can be implemented as suggested above).

Member

Currently taints are only at the node level, but I think it could be worth looking into expanding them, as they already have a flexible interface for specifying, per pod, the tolerations and forgiveness for each taint. This workflow could also work for the case when the node fails or becomes unavailable. @davidopp

Then, when the pod gets evicted due to the taint, it reduces the problem to (7), as mentioned above.

Member

It is also possible to implement this without taints, and instead add an error state to the PV, and have a controller monitor for the error state and evict pods that way. But using taints may be nice as a future enhancement to unify the API.

9. Once Alice decides to delete the database, the PVCs are expected to get deleted by the StatefulSet. PVs will then get recycled and deleted, and the addon adds them back to the cluster.
Member

Will this be global for all PVCs for StatefulSet going forward? Also, we will be depending on reasonable collection timeouts to ensure that users have time to collect data from Volumes after deletion (assuming they have a need to do so)?

Member

By default, the PVC will need to be deleted by the user to retain similar behavior as today. We are looking into an "inline" PVC feature that can automatically delete the PVCs when the StatefulSet gets destroyed. I'll update this to clarify that.

Regarding the retention policy, the PV can be changed to use the "Retain" policy if users need to collect data after deletion.


### Bob manages a distributed filesystem which needs access to all available storage on each node

1. The cluster that Bob is using is provisioned with nodes that contain one or more secondary partitions
2. The cluster administrator runs a DaemonSet addon that discovers secondary partitions across all nodes and creates corresponding PVs for them.
3. The addon will monitor the health of secondary partitions and taint PVs whenever the backing local storage devices become unhealthy.
4. Bob creates a specialized controller (Operator) for his distributed filesystem and deploys it.
5. The operator will identify all the nodes that it can schedule pods onto and discovers the PVs available on each of those nodes. The operator has a label selector that identifies the specific PVs that it can use (this helps preserve fast PVs for Databases for example).
6. The operator will then create PVCs and manually bind them to individual local PVs across all of its nodes (a sketch of such a PVC appears after this list).
7. It will then create pods, manually place them on specific nodes (similar to a DaemonSet) with high enough priority and have them use all the PVCs create by the Operator on those nodes.
Copy link
Member


*created

8. If a pod dies, it will get replaced with a new pod that uses the same set of PVCs that the old pod had used.
9. If a PV gets tainted as unhealthy, the Operator is expected to delete pods if they cannot tolerate device failures.
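For illustration, here is a minimal sketch of one PVC the operator might create in steps 5–6, assuming the discovery addon applies the `storage.kubernetes.io/medium` label used elsewhere in this proposal; `volumeName` pre-binding and label selectors are existing PVC fields, while `volume-type` follows this proposal's examples:

```yaml
kind: PersistentVolumeClaim
apiVersion: v1
metadata:
  name: dfs-node1-disk0               # example name chosen by the operator
spec:
  volume-type: local
  volumeName: local-pv-node1-disk0    # manual bind to a specific discovered PV
  selector:
    matchLabels:
      storage.kubernetes.io/medium: hdd   # keeps fast SSD PVs reserved for databases
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: 1Ti
```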

### Bob manages a specialized application that needs access to Block level storage

1. The cluster that Bob uses has nodes that contain raw block devices that have not been formatted yet.
2. The cluster admin creates a DaemonSet addon that discovers all the raw block devices on a node that are available within a specific directory and creates corresponding PVs for them with a ‘StorageType’ of ‘Block’
Copy link


Will this DaemonSet do the formatting for the raw block devices? And what will the filesystem type be?

Copy link
Member


No, the assumption is that if an application wants a raw block device, then there is no filesystem created. We won't do any formatting at all and just expose the device.

On deletion, we may want to consider some zeroing option in order to cleanup the data, but that could add considerable time to the operation.

Do you know of any use cases that need raw devices? Right now this is low on the priority list, so we haven't been focusing much on designing this feature.

Copy link


So in the example below, /var/lib/kubelet/storage-raw-devices/foo is a device file, but can it be bind-mounted into pods the way a hostPath volume is?
We have an in-house application that needs to use raw devices, but it is not a high priority right now.

Copy link
Member


That is what we are thinking, but we haven't looked into the technical aspects of this yet. If anyone is aware of any limitations/challenges, please let us know.

Copy link


(should also be storageLevel instead of storageType - according to the example below)

Copy link

@fabiand fabiand Feb 16, 2017


Just another note: Instead of using a boolean or type flag to define the access to a volume, another approach could be to have a separate property to enumerate the "raw volumes", i.e.:

kind: Pod
apiVersion: v1
metadata:
  name: mypod
spec:
  containers:
    - name: myfrontend
      image: dockerfile/nginx
      volumeMounts:
      - mountPath: "/var/www/html"
        name: mypd
#    alternatively
      blockVolumes:
      - deviceNode: /dev/sda
        name: mypd
  volumes:
    - name: mypd
      persistentVolumeClaim:
        claimName: myclaim

On the node or cluster level it could be evaluated whether the referenced blockVolume would be suitable as a block volume (e.g. iSCSI and Ceph RBD) or not (e.g. NFS).

This would prevent us from blowing up the volume spec.

Copy link
Member


Does it reduce the total amount of changes to a spec? To me, it looks like it just moves the attribute from the PVC/PV spec to the container spec.

The advantages I see for keeping the attribute in the PVC/PV spec:

  • you don't have to worry about user misconfiguration (ie specifying blockVolume on a filesystem volume)
  • dynamic provisioning is better supported. When you create a PVC, you may not have a pod associated with it, so you wouldn't know if you need to provision a file or block volume if it's specified at the pod spec.

Copy link


@msau42 I do agree, it has benefits to explicitly make the storage access type a property of the PV. This information helps in several places, provisioning is one of them.

Copy link
Member


What do people think of the names of these optional new fields:
volumeAccessType = file | block (default file)
volumeAttachType = direct | remote (default remote)


Another use case: Ceph OSDs need raw devices as well.


```yaml
kind: PersistentVolume
apiVersion: v1
metadata:
  name: foo
  annotations:
    storage.kubernetes.io/node: k8s-node
  labels:
    storage.kubernetes.io/medium: ssd
spec:
  volume-type: local
  storage-type: block
  capacity:
    storage: 100Gi
  hostpath:
    path: /var/lib/kubelet/storage-raw-devices/foo
  accessModes:
  - ReadWriteOnce
  persistentVolumeReclaimPolicy: Delete
```

3. Bob creates a pod with a PVC that requests block level access and, similar to the StatefulSet scenario, the scheduler will identify nodes that can satisfy the pod's request.

```yaml
kind: PersistentVolumeClaim
apiVersion: v1
metadata:
  name: myclaim
  labels:
    storage.kubernetes.io/medium: ssd
spec:
  volume-type: local
  storage-type: block
Copy link
Member


Should this be storageType? Also, having both volumeType and storageType seems confusing. Not sure what else these could be called though.

Copy link
Member


Would storageLevel be better?

  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: 80Gi
```

*The lifecycle of the block level PV will be similar to that of the scenarios explained earlier.*
Copy link
Member


How will Bob's pod receive the block device? Will kubelet bind-mount the whole /dev from the host into the pod? That might be insecure. On the other hand, the pod needs to be privileged to access a raw device anyway, so probably nobody cares. Still, how will the pod find the right device(s) in /dev? There can be many of them.

Copy link


IIUIC whitelisting access to individual block devices can be achieved by the devices cgroup subsystem.
But I am not sure if the pod needs to be privileged or not, or if it can be solved by granting a specific capability.
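For reference, here is roughly what per-device whitelisting looks like with the v1 devices cgroup, outside of any Kubernetes integration; the device numbers and cgroup path are only examples:

```sh
# Allow block device 8:16 (e.g. /dev/sdb) for one container's cgroup: read, write, mknod.
echo "b 8:16 rwm" > /sys/fs/cgroup/devices/<container-cgroup>/devices.allow

# Revoke access to the same device later.
echo "b 8:16 rwm" > /sys/fs/cgroup/devices/<container-cgroup>/devices.deny
```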

Copy link
Member


This definitely needs some more investigation into the details. I did a quick search and saw that docker does support a device option that doesn't require privileged containers. And you can set read, write and mknod permissions on each device. But we'll have to see what CRI supports.
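For context, the Docker option referred to here is `--device`, which exposes a single host device to a container with per-device permissions and does not require `--privileged`; the image name below is a placeholder:

```sh
# Expose only /dev/sdb inside the container, with read/write/mknod permissions.
docker run --device=/dev/sdb:/dev/sdb:rwm example/block-consumer
```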


# Open Questions & Discussion points
* Single vs split “limit” for storage across writable layer and logs
Copy link
Member


I prefer split. For device mapper, we are limited by the `--storage-opt dm.basesize=` flag, which is global to the daemon last I checked. As a result, I don't actually know if we would have a realistic mechanism to support less than a fixed size for the copy-on-write layer when using that driver.

Copy link
Contributor Author


Ack.

* Local Persistent Volume bindings happening in the scheduler vs in PV controller
* Should the PV controller fold into the scheduler
Copy link

@sky-big sky-big Jun 27, 2017


At present, local PVs have the above problem: the PV and PVC are bound before scheduling. Once they are bound, the scheduler selects the node matching the PV's node affinity, but that node may not have enough CPU, memory, and so on, so the pod fails to schedule indefinitely. Is there a plan to solve this?

Copy link
Member


Yes, that is the limitation in the first phase. We hope to solve it in the next release, but no concrete ideas yet, we're just prototyping at this point. At a high level, the PVC binding needs to be delayed until a pod is scheduled, so that it can take into account all the other scheduling requirements of the pod.
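One possible shape for such delayed binding, sketched here purely for illustration (the `volumeBindingMode` field shown is not part of this proposal), is a per-StorageClass switch telling the PV controller to wait until the scheduler has picked a node:

```yaml
# Illustrative sketch of a StorageClass that defers PVC binding until pod scheduling.
kind: StorageClass
apiVersion: storage.k8s.io/v1
metadata:
  name: local-fast
provisioner: kubernetes.io/no-provisioner   # no dynamic provisioning; PVs are pre-created
volumeBindingMode: WaitForFirstConsumer     # hypothetical here: bind only once a pod is scheduled
```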

Copy link


It would be great to solve delayed PV/PVC binding for local volumes. My project team is worried about this problem right now, so we don't dare use the local volume plugin because pod scheduling can easily keep failing.

Copy link
Member


Yes, it will not work well right now in general purpose situations. But if you use the critical pod feature, or you run your workload that needs local storage first, then that may work better. Still, the PVs may not get spread very well because the PV controller doesn't know that all these PVCs are replicas in the same workload. You may be able to work around the issue by labeling the PVs per workload.

Copy link


OK, thanks. I will pay close attention to v1.8 (Scheduler predicate for prebound local PVCs #43640).

* Supporting dedicated partitions for logs and volumes in Kubelet in addition to runtime overlay filesystem
* This complicates the kubelet. Not sure what value it adds to end users.
Copy link
Member


Logs should usually not accumulate as they should be collected to a central location.
--> no need for separation

Overlay FS data can be used, but for heavy use or increased storage needs we do recommend and provide emptyDirs and the new local PVs.
--> no need for separation

As emptyDirs might be used for caches and heavy IO, it might make sense to keep these separated from the planned root PV.

Complicating the Kubelet for logs and overlay doesn't seem to make sense. We should definitely think about the usage pattern of emptyDirs after local PVs are available.
Would we recommend local PV usage for heavy IO caches instead of emptyDirs? If yes, then we might leave emptyDirs inside the root PV and let the user know that for anything serious he might need to migrate away from emptyDirs.

Definitely needs to be clearly documented what use-cases each one solves.

Copy link
Member


Yes, since we don't plan to provide IOPS isolation for emptyDir, local PVs should be used instead for those use cases. One question we have is: are there use cases that need ephemeral IOPS guarantees that cannot be adapted to use local PVs? Do we need to look into an "inline" local PV feature where the PV gets created and destroyed with the pod?

Copy link
Member


Depends on the automation of creating and using local PVs.
EmptyDirs work great without having to involve the cluster admin.
Local PVs most likely need cluster admin intervention. Maybe not always, but it's not 100% automated.

The path I see as reasonable would be:

  • Leave emptyDir as "best-effort" scratch pad.
  • Recommend local PVs for guaranteed IOPS.
  • First iteration having to use manual cluster admin action
  • Iterate on automating local PVs to bring them closer to emptyDir and PDs aka provide local PVs via dynamic provisioning

This would lead to no huge complexity additions in the kubelet as root, emptyDir, log and overlay FS are kept on the primary partition in the first iteration.

As an additional note:
Persistent Volume as a name seems confusing, especially when we recommend it as an IOPS-guaranteed scratch pad. (Maybe: Local Disk?)

Copy link
Member


Good plan! LocalDisk as the actual volume plugin name sounds good.

Copy link
Member


For clarification. We would have:

  • PersistentDisk (networked "unfailable" disk)
  • emptyDir (shared temporary volume without guarantees)
  • LocalDisk (local volume with guarantees, which might have some persistence)
  • hostPath (local volume for testing)
  • all the provider specific stuff, flexVolume, gitRepo and k8s API backed volumes.

Copy link
Member


We are planning to recommend using LocalDisk only through the PV/PVC interfaces for the following reasons:

  • In failure scenarios, like the node failing, you may want to give up on the local disk and find a new one to use. You can do that by unbinding the PVC from the PV, instead of having to change the volume in the pod spec
  • If you use local disk directly, it would be very similar to HostPath volumes, and have all its problems: you have to specify the path, understand the storage layout of the node, and know that that particular volume can satisfy the pod's capacity needs. The PV interface hides those details.
  • The PV interface gives a way to pool all the local volumes across the entire cluster, easily query for them, and find ones that will fit a pod's requirements (see the example below).
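For example, because the discovered local volumes surface as ordinary PV objects carrying labels, they can be queried cluster-wide with the usual tooling (label key taken from the examples in this proposal):

```sh
# List every discovered local SSD volume in the cluster, with capacity and bound/available status.
kubectl get pv -l storage.kubernetes.io/medium=ssd
```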

Copy link
Member


Thanks for the clarification. Always great to have that documented.

So recommendation:
PD + LD using PVC
emptyDir + hostPath used directly

Small addition:
The notion projected by using PV/PVCs, that LocalDisk is persistent, could create some confusion.

Copy link
Member


I will update this doc to clarify that, thanks! I agree the PV name could be misleading since the local disk can only offer semi-persistence and has different semantics than normal PVs. I can add a section about the different semantics. Also, because of the different behavior and its very targeted use cases, I want to make sure that in the API layer the user explicitly selects this option and cannot use a local PV "accidentally".

* Providing IO isolation for ephemeral storage
* IO isolation is difficult. Use local PVs for performance
Copy link
Member


I worry this is not cost effective for many of our users and is not a viable long term state.

I think we need to provide some reasonable IO isolation for ephemeral storage.

Copy link
Contributor


Is this proposal saying that we should never handle IO isolation, or that we don't plan to tackle it as part of this set of changes? I was under the impression that long term we would eventually provide disk IO and network isolation and scheduling as well. But I think it makes sense to tackle disk first.

Copy link
Member


This proposal is not going to tackle IO isolation for the primary partition, but that doesn't mean it can't be handled in the future. This proposal will provide IO isolation for secondary partitions through PVs. So for ephemeral IO isolation use cases, perhaps we can have an "inline" PV feature where the PVC gets created and destroyed with the pod.
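A very rough sketch of what such an "inline" PV could look like; the `pvcTemplate` field below is invented for illustration and does not exist in any API:

```yaml
# Hypothetical inline-PVC sketch: the claim would be created and destroyed with the pod.
kind: Pod
apiVersion: v1
metadata:
  name: scratch-heavy-job
spec:
  containers:
  - name: worker
    image: example/worker        # placeholder image
    volumeMounts:
    - mountPath: /scratch
      name: scratch
  volumes:
  - name: scratch
    pvcTemplate:                 # does not exist today; illustration only
      spec:
        accessModes: ["ReadWriteOnce"]
        resources:
          requests:
            storage: 50Gi
```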

* Support for encrypted partitions
Copy link
Member


Seems more a concern of the underlying infrastructure to provide the partition on top of full disk encryption. Not sure k8s should take care of this by default. We don't do this currently for any storage. If it's planned for the root disk and current PVs, please let me know.

Copy link
Member


We were thinking encrypted volumes could offer a more secure way of wiping data after the volume gets destroyed. You just need to delete and change the key. It's currently not planned, but it could be considered as a future feature.
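Outside of Kubernetes, this is essentially the crypto-erase pattern: if the partition is set up with dm-crypt/LUKS, discarding the key material makes the old data unrecoverable without rewriting the whole device. A rough sketch of what a recycler could do (device path and mapping name are examples):

```sh
# Provision: set up LUKS on the partition and create a filesystem on the mapped device.
cryptsetup luksFormat /dev/sdb1
cryptsetup open /dev/sdb1 local-pv-foo
mkfs.ext4 /dev/mapper/local-pv-foo

# "Wipe" on PV deletion: close the mapping and destroy the key slots.
cryptsetup close local-pv-foo
cryptsetup luksErase /dev/sdb1
```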

Copy link
Member


That use case I definitely agree with. Using encryption to basically reduce wiping/reuse latency seems like a good optional feature especially for pods with a high turnaround.

For security purposes itself, I think the cluster admin should just use full disk encryption. As already said, we are not offering encryption on networked volumes or the root disk.

If encrypted volumes would be used for a faster wiping/reuse cycle, that should be done on other persistent disks too. (Makes more sense for manually created, rare PVs, but would still be useful.)

* Do applications need access to performant local storage for ephemeral use cases? Ex. A pod requesting local SSDs for use as ephemeral scratch space.
* Typically referred to as “inline PVs” in kube land
* Should block level storage devices be auto formatted to be used as file level storage?
Copy link


No. Block level storage should be left untouched so that pods can consume it as-is.
This is beneficial, e.g., for backing virtual machine disks or MySQL databases on raw storage.

Copy link
Member


Agreed. For raw device access, no formatting should be done.

For the normal use case of file level access, we could consider auto formatting to simplify the administrator's role. With the current proposal, the administrator has to preconfigure all the filesystem partitions. As an alternative design, we could instead take a list of raw devices to form a pool, and then format them when the PV is created (a sketch of such a check follows the list of downsides below). Then that's one task the admin does not have to do, and it could work better with dynamic provisioning. But some downsides are:

  • does not offer flexibility in partitioning
  • it can be more complex to implement, as kubelet would need to manage and maintain a new resource object
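As a sketch of the format-on-demand alternative, the addon (or kubelet) could check whether a pooled device already carries a filesystem and create one only when the PV is first handed out; the device path and filesystem type are just examples:

```sh
# Format the device only if no existing filesystem signature is found.
if ! blkid /dev/disk/by-id/example-disk >/dev/null 2>&1; then
  mkfs.ext4 /dev/disk/by-id/example-disk
fi
```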

* Flexibility vs complexity
* Do we really need this?
* Repair/replace scenarios.
* What are the implications of removing a disk and replacing it with a new one?
* We may not do anything in the system, but may need a special workflow

# Recommended Storage best practices

* Have the primary partition on a reliable storage device
* Consider using RAID and SSDs (for performance)
* Partition the rest of the storage devices based on the application needs
* SSDs can be statically partitioned and they might still meet IO requirements of apps.
* TODO: Identify common durable storage requirements for most databases
* Avoid having multiple logical partitions on hard drives to avoid IO isolation issues
Copy link
Member


I want to challenge this more. This assumption avoids complexity by just increasing operator cost. I think IO isolation will still be required for the primary partition.

Copy link
Member


As it stands today, this solution is not changing primary partition IO isolation semantics, and doesn't increase operator cost for managing the primary partition.

For local persistent use cases, the only way to get IO isolation today is to use HostPath volumes, and data gravity with node affinity, all of which already has a high operator cost. I believe this solution is an improvement on this existing method and will decrease operator cost for both the cluster admin and the application developer. But if you see any ways that operator cost is increasing compared to the current methods, please let me know.

  1. Dev and admin need to coordinate less about exact nodes and storage paths (but still has some coordination about number of disks and capacity requirements). Operator costs greatly reduced because you don't have to manually schedule your pods on nodes using node affinity.
  2. Pods are more flexible about where they can run because they don't need to specify node affinity. Node failure scenarios can automatically be handled via forgiveness, instead of requiring manual intervention.
  3. In order to use a pod template with HostPath, your nodes have to be provisioned the same way. This proposal allows for heterogeneous node storage configs, which can make upgrading storage capacity easier for the admin.
  4. With HostPath volumes, the admin has to manually cleanup the volume to be able to reuse it. This proposal includes a workflow for automatically cleaning up and recycling the PVs. This will greatly benefit multi-tenant clusters in the future.

Copy link
Contributor Author


@derekwaynecarr Take a look at the updated proposal. I think I covered most of the IO isolation challenges.

* Run a reliable cluster level logging service to drain logs from the nodes before they get rotated or deleted
* The runtime partition for overlayfs is optional. You do not **need** one.
* Alert on primary partition failures and act on it immediately. Primary partition failures will render your node unusable.
* Use EmptyDir for all scratch space requirements of your apps.
Copy link
Member

@stp-ip stp-ip Feb 3, 2017


  • when IO isolation isn't relevant to the app

Copy link
Contributor

@smarterclayton smarterclayton Feb 15, 2017


EmptyDir is effectively best effort local storage. Best effort is cheap, easy, and a known quantity. Local volumes are Burstable / Guaranteed storage. Best effort is "fair", but not predictable.

Like CPU and memory, best effort storage shouldn't be able to impact guaranteed / burstable storage.

* Make the container’s writable layer `readonly` if possible.
* Another option is to keep the writable layer on tmpfs. Such a setup will allow you to eventually migrate away from using local storage for anything but super fast caching purposes or distributed databases, leading to higher reliability & uptime for nodes (a minimal sketch appears at the end of this discussion).
Copy link
Contributor


Doesn't seem to be a huge benefit in writable container layers in general, other than general laziness / adapting existing workloads. I find it hard to imagine we care enough about the writable layer to implement something like this - I suspect making volume mounts easier and more predictable is a better investment.

The vast majority of workloads need 0..3 volumes (the third is just for weird cases), and the writable layer seems like it's only a win when you are lazy. Most people with weird workloads aren't lazy (it takes work to get into that state).
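Tying together the earlier recommendation about read-only writable layers: a minimal sketch of a pod that keeps its root filesystem read-only and sends all scratch writes to a tmpfs-backed EmptyDir; the image and mount path are placeholders:

```yaml
kind: Pod
apiVersion: v1
metadata:
  name: readonly-root-example
spec:
  containers:
  - name: app
    image: example/app               # placeholder image
    securityContext:
      readOnlyRootFilesystem: true   # the container's writable layer is never written
    volumeMounts:
    - mountPath: /var/cache/app      # example writable path
      name: scratch
  volumes:
  - name: scratch
    emptyDir:
      medium: Memory                 # tmpfs-backed scratch space
```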