[Proposal] Improve Local Storage Management #306
# Local Storage Management

Authors: vishh@, msau42@

This document presents a strawman for managing local storage in Kubernetes. We expect to provide a UX and high-level design overview covering the most common user workflows. More detailed design and implementation will be added once the community agrees with the high-level design presented here.
# Goals

* Enable ephemeral & durable access to local storage
* Support storage requirements for all workloads supported by Kubernetes
* Provide flexibility for users/vendors to utilize various types of storage devices
* Define a standard partitioning scheme for storage drives on all Kubernetes nodes
* Provide storage usage isolation for shared partitions
* Support random access storage devices only
# Non Goals

* Provide usage isolation for non-shared partitions. Isolation is not a concern for most partitions since they are not expected to be shared.
* Support all storage devices natively in upstream Kubernetes. Non-standard storage devices are expected to be managed using extension mechanisms.
# Use Cases
## Ephemeral Local Storage

Today, ephemeral local storage is exposed to pods via the container’s writable layer, logs directory (container output written to stdout is stored under `/var/log` on the node), and EmptyDir volumes. Pods use ephemeral local storage for scratch space, caching, and logs. There are many issues related to the lack of local storage accounting and isolation, including:

* Pods do not know how much local storage is available to them.
* Pods cannot request “guaranteed” local storage.
* Local storage is a “best-effort” resource.
* Pods can get evicted due to other pods filling up the local storage, after which no new pods will be admitted until sufficient storage has been reclaimed.
## Persistent Local Storage

Distributed filesystems and databases are the primary use cases for persistent local storage due to the following factors:

* Performance: On cloud providers, local SSDs give better performance than remote disks.
* Cost: On bare metal, in addition to performance, local storage is typically cheaper, and using it is a necessity for provisioning distributed filesystems.

Distributed systems often use replication to provide fault tolerance, and can therefore tolerate node failures. However, data gravity is preferred for reducing replication traffic and cold startup latencies.
# Design Overview

A node’s local storage can be broken into primary and secondary partitions. A partition here is a logical construct: it can map to a whole disk, a disk partition, or a RAID/JBOD volume; Kubernetes does not prescribe how the underlying devices are arranged.
## Primary Partitions

Primary partitions are shared partitions that can provide ephemeral local storage. The two supported primary partitions are:

### Root

This partition holds the kubelet’s root directory (`/var/lib/kubelet` by default) and the `/var/log` directory. It may be shared between user pods, the OS, and Kubernetes system daemons. Pods consume this partition via EmptyDir volumes, container logs, image layers, and container writable layers. Kubelet will manage shared access and isolation of this partition. “Sharing” here means that these directories are backed by the same partition, not that pods can access each other’s data; without capacity isolation, one pod can impact other pods by filling up the partition, which this proposal addresses. This partition is “ephemeral” and applications cannot expect any performance SLAs (disk IOPS, for example) from it.
### Runtime

This is an optional partition which runtimes can use for overlay filesystems, i.e. container image layers and container writable layers. If a runtime partition exists, image layers and writable layers are stored on it instead of on the root partition; this mirrors the existing imagefs/nodefs distinction used for eviction. Kubelet will attempt to identify this partition and provide shared access along with isolation for it. This partition is configured at the kubelet level and is not exposed to pods. Note that on many node images all of these directories live on a single root filesystem, in which case there is no separate runtime partition.
## Secondary Partitions

All other partitions are exposed as persistent volumes (PVs). The PV interface allows for varying storage configurations to be supported, while hiding specific configuration details from the pod. Applications can continue to use their existing PVC specifications with minimal changes to request local storage. An addon DaemonSet discovers the secondary partitions mounted under a well-known directory on each node and statically creates a PV object for each of them, sized to the entire partition; administrators can also run their own addon to manage partitions mounted elsewhere. Partitioning and filesystem creation remain the administrator’s responsibility, and dedicating a whole disk to a single partition is recommended when IOPS isolation is required.
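
For illustration, a minimal sketch of such an addon DaemonSet is shown below. The image name is hypothetical and not part of this proposal; the discovery directory matches the `/var/lib/kubelet/storage-partitions` path used in the PV examples later in this document.

```yaml
# Sketch only: the image name is an assumption; the discovery directory is the
# same well-known location referenced by the PV examples below.
apiVersion: extensions/v1beta1
kind: DaemonSet
metadata:
  name: local-volume-addon
  namespace: kube-system
spec:
  template:
    metadata:
      labels:
        app: local-volume-addon
    spec:
      containers:
      - name: provisioner
        image: example.com/local-volume-addon:latest   # hypothetical addon image
        volumeMounts:
        - name: discovery-dir
          mountPath: /var/lib/kubelet/storage-partitions
      volumes:
      - name: discovery-dir
        hostPath:
          path: /var/lib/kubelet/storage-partitions
```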
# User Workflows
### Alice manages a deployment and requires “Guaranteed” ephemeral storage

1. Kubelet running across all nodes will identify the primary partition and expose capacity and allocatable for the primary “root” partition; allocatable excludes storage reserved for system daemons. The runtime partition is an implementation detail and is not exposed outside the node.
```yaml
apiVersion: v1
kind: Node
metadata:
  name: foo
status:
  capacity:
    storage: 100Gi
  allocatable:
    storage: 90Gi
```
2. Alice adds a “storage” requirement to her pod as follows. Note: the `storage-logs` and `storage-overlay` resource names and the EmptyDir `capacity` field used below are new fields introduced by this proposal.
```yaml
apiVersion: v1
kind: Pod
metadata:
  name: foo
spec:
  containers:
  - name: fooc
    resources:
      limits:
        storage-logs: 500Mi
        storage-overlay: 1Gi
  volumes:
  - name: myEmptyDir
    emptyDir:
      capacity: 20Gi
```

Note: flat resource names (`storage-logs`, `storage-overlay`) are used rather than nested ones because LimitRange does not support nested limits.
3. Alice’s pod “foo” is Guaranteed a total of “21.5Gi” of local storage (500Mi logs + 1Gi writable layer + 20Gi EmptyDir). The container “fooc” in her pod cannot consume more than 1Gi for its writable layer and 500Mi for logs, and the “myEmptyDir” volume cannot consume more than 20Gi. The scheduler uses the node’s allocatable storage to place the pod, and kubelet monitors the usage of the EmptyDir volume and the containers to enforce these limits.
4. Alice’s pod is not provided any IO guarantees.
5. Kubelet will rotate logs to keep the logs usage of “fooc” under 500Mi. Log rotation becomes per-container rather than per-node; `kubectl logs` is unaffected.
6. Kubelet will attempt to hard limit the local storage consumed by pod “foo” if the Linux project quota feature is available and the runtime supports storage isolation.
7. With hard limits, containers will receive an ENOSPC error if they consume all of their reserved storage. Without hard limits, the pod will instead be evicted by kubelet.
8. Health is monitored by an external entity like the “Node Problem Detector”, which is expected to place appropriate taints.
9. If a primary partition becomes unhealthy, the node is tainted and all pods running on it will be evicted by default, unless they tolerate that taint (see the sketch below). Kubelet’s behavior on a node with an unhealthy primary partition is undefined. Cluster administrators are expected to fix unhealthy primary partitions on nodes.
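
For illustration, a minimal sketch of such a taint and a tolerating pod is shown here. The taint key is a hypothetical name chosen by the external health monitor, not an API defined by this proposal, and the pod image is a placeholder.

```yaml
# Hypothetical taint placed by an external health monitor on a node whose
# primary partition is unhealthy, plus a pod that chooses to tolerate it.
apiVersion: v1
kind: Node
metadata:
  name: foo
spec:
  taints:
  - key: storage.kubernetes.io/primary-partition-unhealthy   # assumed taint key
    effect: NoExecute
---
apiVersion: v1
kind: Pod
metadata:
  name: storage-tolerant-pod
spec:
  tolerations:
  - key: storage.kubernetes.io/primary-partition-unhealthy
    operator: Exists
    effect: NoExecute
  containers:
  - name: app
    image: gcr.io/google_containers/pause   # placeholder image
```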
### Bob runs batch workloads and is unsure of “storage” requirements

1. Bob can create pods without any “storage” resource requirements.
```yaml
apiVersion: v1
kind: Pod
metadata:
  name: foo
  namespace: myns
spec:
  containers:
  - name: fooc
  volumes:
  - name: myEmptyDir
    emptyDir: {}
```
2. His cluster administrator, being aware of the issues with disk reclamation latencies, has intelligently decided not to allow overcommitting primary partitions. The cluster administrator has installed a LimitRange in the “myns” namespace that will set a default storage size. Note: A cluster administrator can also specify burst ranges and a host of other features supported by LimitRange for local storage. If neither the pod nor a namespace default specifies storage limits, the situation is the same as it is today.
```yaml
apiVersion: v1
kind: LimitRange
metadata:
  name: mylimits
spec:
  limits:
  - default:
      storage-logs: 200Mi
      storage-overlay: 200Mi
    type: Container
  - default:
      storage: 1Gi
    type: Pod
```
3. The LimitRange defaults will be applied to the pod specification as follows:
```yaml
apiVersion: v1
kind: Pod
metadata:
  name: foo
  namespace: myns
spec:
  containers:
  - name: fooc
    resources:
      limits:
        storage-logs: 200Mi
        storage-overlay: 200Mi
  volumes:
  - name: myEmptyDir
    emptyDir:
      capacity: 1Gi
```
4. Bob’s “foo” pod can use up to “200Mi” each for its container’s logs and writable layer, and “1Gi” for its “myEmptyDir” volume.
5. If Bob’s pod “foo” exceeds the “default” storage limits and gets evicted, then Bob can set a minimum storage requirement for his containers and a higher “capacity” for his EmptyDir volumes.
```yaml
apiVersion: v1
kind: Pod
metadata:
  name: foo
  namespace: myns
spec:
  containers:
  - name: fooc
    resources:
      requests:
        storage-logs: 500Mi
        storage-overlay: 500Mi
  volumes:
  - name: myEmptyDir
    emptyDir:
      capacity: 2Gi
```
6. Note that it is not possible to specify a minimum storage requirement for EmptyDir volumes, because we intend to limit overcommitment of local storage due to the high latency in reclaiming it from terminated pods.
### Alice manages a Database which needs access to “durable” and fast scratch space
1. The cluster administrator provisions machines with local SSDs and brings up the cluster.
2. When a new node instance starts up, an addon DaemonSet discovers local “secondary” partitions which are mounted at a well-known location and creates HostPath PVs for them if they don’t exist already. There is a 1:1 mapping between a PV and a partition; how devices are partitioned and formatted is left to the administrator (for example, one partition spanning a whole disk when IOPS isolation is required). The PVs will include a path to the secondary device mount points, along with additional labels and annotations that help tie them to a specific node and identify the storage medium.
```yaml
kind: PersistentVolume
apiVersion: v1
metadata:
  name: local-pv-1
  annotations:
    storage.kubernetes.io/node: node-1
  labels:
    storage.kubernetes.io/medium: ssd
spec:
  volume-type: local
  storage-type: filesystem
  capacity:
    storage: 100Gi
  hostpath:
    path: /var/lib/kubelet/storage-partitions/local-pv-1
  accessModes:
  - ReadWriteOnce
  persistentVolumeReclaimPolicy: Delete
```
```
$ kubectl get pv
NAME         CAPACITY   ACCESSMODES   RECLAIMPOLICY   STATUS      CLAIM   …   NODE
local-pv-1   100Gi      RWO           Delete          Available               node-1
local-pv-2   10Gi       RWO           Delete          Available               node-1
local-pv-1   100Gi      RWO           Delete          Available               node-2
local-pv-2   10Gi       RWO           Delete          Available               node-2
local-pv-1   100Gi      RWO           Delete          Available               node-3
local-pv-2   10Gi       RWO           Delete          Available               node-3
```
3. The addon will monitor the health of secondary partitions and taint PVs whenever the backing local storage device becomes unhealthy.
4. Alice creates a StatefulSet that uses local PVCs.
```yaml
apiVersion: apps/v1beta1
kind: StatefulSet
metadata:
  name: web
spec:
  serviceName: "nginx"
  replicas: 3
  template:
    metadata:
      labels:
        app: nginx
    spec:
      terminationGracePeriodSeconds: 10
      containers:
      - name: nginx
        image: gcr.io/google_containers/nginx-slim:0.8
        ports:
        - containerPort: 80
          name: web
        volumeMounts:
        - name: www
          mountPath: /usr/share/nginx/html
        - name: log
          mountPath: /var/log/nginx
  volumeClaimTemplates:
  - metadata:
      name: www
      labels:
        storage.kubernetes.io/medium: ssd
    spec:
      volume-type: local
      accessModes: [ "ReadWriteOnce" ]
      resources:
        requests:
          storage: 100Gi
  - metadata:
      name: log
      labels:
        storage.kubernetes.io/medium: hdd
    spec:
      volume-type: local
      accessModes: [ "ReadWriteOnce" ]
      resources:
        requests:
          storage: 1Gi
```
5. The scheduler identifies, for each pod, nodes that can satisfy its CPU, memory, and storage requirements and that also contain free local PVs to satisfy the pod’s PVCs. It then binds the pod’s PVCs to specific PVs on the node and binds the pod to the node.
```
$ kubectl get pvc
NAME              STATUS   VOLUME       CAPACITY   ACCESSMODES   …   NODE
www-local-pvc-1   Bound    local-pv-1   100Gi      RWO               node-1
www-local-pvc-2   Bound    local-pv-1   100Gi      RWO               node-2
www-local-pvc-3   Bound    local-pv-1   100Gi      RWO               node-3
log-local-pvc-1   Bound    local-pv-2   10Gi       RWO               node-1
log-local-pvc-2   Bound    local-pv-2   10Gi       RWO               node-2
log-local-pvc-3   Bound    local-pv-2   10Gi       RWO               node-3
```
```
$ kubectl get pv
NAME         CAPACITY   …   STATUS   CLAIM             NODE
local-pv-1   100Gi          Bound    www-local-pvc-1   node-1
local-pv-2   10Gi           Bound    log-local-pvc-1   node-1
local-pv-1   100Gi          Bound    www-local-pvc-2   node-2
local-pv-2   10Gi           Bound    log-local-pvc-2   node-2
local-pv-1   100Gi          Bound    www-local-pvc-3   node-3
local-pv-2   10Gi           Bound    log-local-pvc-3   node-3
```
6. If a pod dies and is replaced by a new one that reuses existing PVCs, the pod will be placed on the same node where the corresponding PVs exist. Stateful pods are expected to have a high enough priority that they can preempt lower-priority pods if necessary to run on a specific node. Priority and preemption are not strict prerequisites for this feature, but they make this workflow smoother.
7. If a new pod fails to get scheduled while attempting to reuse an old PVC, the StatefulSet controller is expected to give up on the old PVC (delete and recycle it) and instead create a new PVC, based on some policy. This is to guarantee scheduling of stateful pods.
8. If a PV gets tainted as unhealthy, the StatefulSet is expected to delete its pods if they cannot tolerate PV failures. Unhealthy PVs, once released, will not be added back to the cluster until the corresponding local storage device is healthy again.
9. Once Alice decides to delete the database, the PVCs are expected to be deleted (by default by the user; an “inline PVC” feature may allow the StatefulSet to delete them automatically in the future). The PVs will then get recycled and deleted, and the addon adds them back to the cluster. The reclaim policy can be set to “Retain” if data needs to be recovered after deletion.
### Bob manages a distributed filesystem which needs access to all available storage on each node
1. The cluster that Bob is using is provisioned with nodes that contain one or more secondary partitions.
2. The cluster administrator runs a DaemonSet addon that discovers secondary partitions across all nodes and creates corresponding PVs for them.
3. The addon will monitor the health of secondary partitions and taint PVs whenever the backing local storage device becomes unhealthy.
4. Bob creates a specialized controller (Operator) for his distributed filesystem and deploys it.
5. The operator will identify all the nodes that it can schedule pods onto and discover the PVs available on each of those nodes. The operator has a label selector that identifies the specific PVs that it can use (this helps preserve fast PVs for databases, for example).
6. The operator will then create PVCs and manually bind them to individual local PVs across all its nodes, as sketched below.
7. It will then create pods, manually place them on specific nodes (similar to a DaemonSet) with high enough priority and have them use all the PVCs create by the Operator on those nodes. | ||
*created
8. If a pod dies, it will get replaced with a new pod that uses the same set of PVCs that the old pod had used.
9. If a PV gets tainted as unhealthy, the Operator is expected to delete pods if they cannot tolerate device failures.
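To make steps 5 and 6 concrete, here is a rough sketch of a claim the operator might create and pre-bind to one specific discovered PV. This is illustrative only: the claim name, the `app` label, and the PV name are made up, and the medium label simply follows the strawman labelling convention used in the examples below.

```yaml
kind: PersistentVolumeClaim
apiVersion: v1
metadata:
  name: fs-data-node-1              # one claim per node/PV, created by the operator
  labels:
    app: bobs-distributed-fs        # operator's own bookkeeping label (hypothetical)
spec:
  # Only match PVs carrying the label the operator is allowed to consume,
  # e.g. so fast SSD-backed PVs stay reserved for databases.
  selector:
    matchLabels:
      storage.kubernetes.io/medium: hdd
  # Pre-bind this claim to the specific local PV discovered on the target node.
  volumeName: local-pv-node-1-disk-2
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 500Gi
```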
### Bob manages a specialized application that needs access to Block level storage
1. The cluster that Bob uses has nodes that contain raw block devices that have not been formatted yet.
2. The cluster admin creates a DaemonSet addon that discovers all the raw block devices on a node that are available within a specific directory and creates corresponding PVs for them with a ‘StorageType’ of ‘Block’.
Will this DaemonSet do formatting for the raw block devices? And what will the filesystem type be?

No, the assumption is that if an application wants a raw block device, then there is no filesystem created. We won't do any formatting at all and just expose the device. On deletion, we may want to consider some zeroing option in order to clean up the data, but that could add considerable time to the operation. Do you know of any use cases that need raw devices? Right now this is low on the priority list, so we haven't been focusing much on designing this feature.

So in the example below,

That is what we are thinking, but we haven't looked into the technical aspects of this yet. If anyone is aware of any limitations/challenges, please let us know.

(should also be

Just another note: Instead of using a boolean or type flag to define the access to a volume, another approach could be to have a separate property to enumerate the "raw volumes", i.e.:

```yaml
kind: Pod
apiVersion: v1
metadata:
  name: mypod
spec:
  containers:
  - name: myfrontend
    image: dockerfile/nginx
    volumeMounts:
    - mountPath: "/var/www/html"
      name: mypd
    # alternatively
    blockVolumes:
    - deviceNode: /dev/sda
      name: mypd
  volumes:
  - name: mypd
    persistentVolumeClaim:
      claimName: myclaim
```

On the node or cluster level it could be evaluated if the referenced … This would prevent us from blowing up the volume spec.

Does it reduce the total amount of changes to a spec? To me, it looks like it just moves the attribute from the PVC/PV spec to the container spec. The advantages I see for keeping the attribute in the PVC/PV spec:

@msau42 I do agree, it has benefits to explicitly make the storage access type a property of the PV. This information helps in several places, provisioning is one of them.

What do people think of the names of these optional new fields:

another use case is ceph osd's need raw devices as well.
```yaml
kind: PersistentVolume
apiVersion: v1
metadata:
  name: foo
  annotations:
    storage.kubernetes.io/node: k8s-node
  labels:
    storage.kubernetes.io/medium: ssd
spec:
  volume-type: local
  storage-type: block
  capacity:
    storage: 100Gi
  hostpath:
    path: /var/lib/kubelet/storage-raw-devices/foo
  accessModes:
    - ReadWriteOnce
  persistentVolumeReclaimPolicy: Delete
```
3. Bob creates a pod with a PVC that requests block level access, and, similar to the StatefulSet scenario, the scheduler will identify nodes that can satisfy the pod's request.
```yaml
kind: PersistentVolumeClaim
apiVersion: v1
metadata:
  name: myclaim
  labels:
    storage.kubernetes.io/medium: ssd
spec:
  volume-type: local
  storage-type: block
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 80Gi
```

Should this be

Would storageLevel be better?
*The lifecycle of the block level PV will be similar to that of the scenarios explained earlier.*
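On the pod side, a minimal sketch could reference the claim the same way filesystem-backed claims are referenced today; how the raw device is actually surfaced inside the container (a mount point, a device node, or a separate field such as the `blockVolumes` idea floated above) is still an open question, as the thread below discusses. The image name and mount path here are illustrative.

```yaml
kind: Pod
apiVersion: v1
metadata:
  name: bobs-block-app
spec:
  containers:
  - name: app
    image: example/specialized-app      # hypothetical image
    volumeMounts:
    - mountPath: /data                  # placeholder; exact surfacing of the raw device is TBD
      name: block-storage
  volumes:
  - name: block-storage
    persistentVolumeClaim:
      claimName: myclaim                # the block-level claim defined above
```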
How will Bob's pod receive the block device? Will kubelet bind the whole /dev from the host into the pod? That might be insecure. On the other hand, the pod needs to be privileged to access a raw device anyway, so probably nobody cares. Still, how will the pod find the right device(s) in /dev? There can be many of them.

IIUIC whitelisting access to individual block devices can be achieved by the devices cgroup subsystem.

This definitely needs some more investigation into the details. I did a quick search and saw that docker does support a device option that doesn't require privileged containers. And you can set read, write and mknod permissions on each device. But we'll have to see what CRI supports.
# Open Questions & Discussion points
* Single vs split “limit” for storage across writable layer and logs (a rough sketch of the split form appears after this list)
i prefer split. for device mapper, we are limited by the

Ack.
* Local Persistent Volume bindings happening in the scheduler vs in PV controller
  * Should the PV controller fold into the scheduler
At present local PVs have the above problem: the PV and PVC are bound before scheduling, and once bound the scheduler selects the node via the PV's node affinity. But if that node does not have enough CPU, memory, etc., the pod fails to schedule indefinitely. Is there a plan to solve this?

Yes, that is the limitation in the first phase. We hope to solve it in the next release, but no concrete ideas yet, we're just prototyping at this point. At a high level, the PVC binding needs to be delayed until a pod is scheduled, so that it can take into account all the other scheduling requirements of the pod.

It would be great to solve delayed binding for local volume PVs/PVCs; right now my project team is worried about this problem, so we don't dare use the local volume plugin because pods can easily fail to schedule.

Yes, it will not work well right now in general purpose situations. But if you use the critical pod feature, or you run your workload that needs local storage first, then that may work better. Still, the PVs may not get spread very well because the PV controller doesn't know that all these PVCs are replicas in the same workload. You may be able to work around the issue by labeling the PVs per workload.

OK, thanks. I will pay close attention to v1.8 (Scheduler predicate for prebound local PVCs #43640).
* Supporting dedicated partitions for logs and volumes in Kubelet in addition to runtime overlay filesystem
  * This complicates the kubelet. Not sure what value it adds to end users.
Logs should usually not accumulate as they should be collected to a central location. Overlay FS data can be used, but for heavy use or increased storage needs we do recommend and provide emptyDirs and the new local PVs. As emptyDirs might be used for caches and heavy IO it might make sense to let this be separated from the planned root PV. Complicating the Kubelet for logs and overlay doesn't seem to make sense. We should definitely think about the usage pattern of emptyDirs after local PVs are available. Definitely needs to be clearly documented what use-cases each one solves.

Yes, since we don't plan to provide IOPS isolation for emptydir, then local PV should be used instead for those use cases. One question we have is: are there use cases that need ephemeral IOPS guarantees that cannot be adapted to use local PVs? Do we need to look into an "inline" local PV feature where the PV gets created and destroyed with the pod?

Depends on the automation of creating and using local PVs. The path I see as reasonable would be:
This would lead to no huge complexity additions in the kubelet as root, emptyDir, log and overlay FS are kept on the primary partition in the first iteration. As an additional note:

Good plan! LocalDisk as the actual volume plugin name sounds good.

For clarification, we would have:
We are planning to recommend using LocalDisk only through the PV/PVC interfaces for the following reasons:
Thanks for the clarification. Always great to have that documented. So recommendation: Small addition:

I will update this doc to clarify that, thanks! I agree the PV name could be misleading since the local disk can only offer semi-persistence, and has different semantics than normal PVs. I can add a section about the different semantics. Also, because of the different behavior, and its very targeted use cases, I want to make sure in the API layer that the user explicitly selects this option, and that they cannot use a local PV "accidentally".
* Providing IO isolation for ephemeral storage
  * IO isolation is difficult. Use local PVs for performance.
I worry this is not cost effective for many of our users and is not a viable long term state. I think we need to provide some reasonable IO isolation for ephemeral storage.

Is this proposal saying that we should never handle IO isolation, or that we don't plan to tackle this as part of this set of changes? I was under the impression that long term we would eventually provide disk IO, and network isolation and scheduling as well. But I think it makes sense to tackle disk first.

This proposal is not going to tackle IO isolation for the primary partition, but that doesn't mean it can't be handled in the future. This proposal will provide IO isolation for secondary partitions through PVs. So for ephemeral IO isolation use cases, perhaps we can have an "inline" PV feature where the PVC gets created and destroyed with the pod.
* Support for encrypted partitions
Seems more a concern of the underlying infrastructure to provide the partition on top of full disk encryption. Not sure k8s should by default take care of this. We don't do this currently for any storage. If it's planned for the root disk and current PVs please let me know.

We were thinking encrypted volumes could offer a more secure way of wiping data after the volume gets destroyed. You just need to delete and change the key. It's currently not planned, but it could be considered as a future feature.

That use case I definitely agree with. Using encryption to basically reduce wiping/reuse latency seems like a good optional feature, especially for pods with a high turnaround. For security purposes itself, I think the cluster admin should just use full disk encryption. As already said, we are not offering encryption on networked volumes or the root disk. If encrypted volumes would be used for a faster wiping/reuse cycle, that should be done on other persistent disks too. (Makes more sense for manually created rare PVs, but would still be useful.)
* Do applications need access to performant local storage for ephemeral use cases? Ex. A pod requesting local SSDs for use as ephemeral scratch space.
  * Typically referred to as “inline PVs” in kube land
* Should block level storage devices be auto formatted to be used as file level storage?
No - Block level storage should be untouched to allow pods to consume them as-is.

Agreed. For raw device access, no formatting should be done. For the normal use case of file level access, we could consider auto formatting to simplify the administrator's role. With the current proposal, the administrator has to preconfigure all the filesystem partitions. As an alternative design, we could instead take a list of raw devices and form a pool, and then format them when the PV is created. Then that's one task the admin does not have to do, and it could work better with dynamic provisioning. But some downsides are:
* Flexibility vs complexity
  * Do we really need this?
* Repair/replace scenarios.
  * What are the implications of removing a disk and replacing it with a new one?
  * We may not do anything in the system, but may need a special workflow
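For the first bullet in the list above (single vs split limit), the split form could look roughly like the following sketch. The resource names here are assumptions for illustration, not settled API:

```yaml
kind: Pod
apiVersion: v1
metadata:
  name: storage-limits-example
spec:
  containers:
  - name: app
    image: busybox
    command: ["sh", "-c", "sleep 3600"]
    resources:
      limits:
        # Assumed resource names: one cap for the container's writable
        # (overlay) layer and a separate cap for its logs.
        storage.kubernetes.io/overlay: 1Gi
        storage.kubernetes.io/logs: 500Mi
```

A single combined limit would instead collapse these into one storage figure covering both.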
# Recommended Storage best practices
* Have the primary partition on a reliable storage device
  * Consider using RAID and SSDs (for performance)
* Partition the rest of the storage devices based on the application needs
  * SSDs can be statically partitioned and they might still meet IO requirements of apps.
  * TODO: Identify common durable storage requirements for most databases
  * Avoid having multiple logical partitions on hard drives to avoid IO isolation issues
I want to challenge this more. This assumption avoids complexity by just increasing operator cost. I think IO isolation will still be required for the primary partition.

As it stands today, this solution is not changing primary partition IO isolation semantics, and doesn't increase operator cost for managing the primary partition. For local persistent use cases, the only way to get IO isolation today is to use HostPath volumes, and data gravity with node affinity, all of which already have a high operator cost. I believe this solution is an improvement on this existing method and will decrease operator cost for both the cluster admin and the application developer. But if you see any ways that operator cost is increasing compared to the current methods, please let me know.

@derekwaynecarr Take a look at the updated proposal. I think I covered most of the IO isolation challenges.
* Run a reliable cluster level logging service to drain logs from the nodes before they get rotated or deleted
* The runtime partition for overlayfs is optional. You do not **need** one.
* Alert on primary partition failures and act on it immediately. Primary partition failures will render your node unusable.
* Use EmptyDir for all scratch space requirements of your apps.
EmptyDir is effectively best effort local storage. Best effort is cheap, easy, and a known quantity. Local volumes are Burstable / Guaranteed storage. Best effort is "fair", but not predictable. Like CPU and memory, best effort storage shouldn't be able to impact guaranteed / burstable storage.
* Make the container’s writable layer `readonly` if possible.
  * Another option is to keep the writable layer on tmpfs. Such a setup will allow you to eventually migrate away from using local storage for anything but super fast caching purposes or distributed databases, leading to higher reliability & uptime for nodes. (A combined sketch of these practices appears below.)
Doesn't seem to be a huge benefit in writable container layers in general, other than general laziness / adapting existing workloads. I find it hard to imagine we care enough about the writable layer to implement something like this - I suspect making volume mounts easier and more predictable is a better investment. The vast majority of workloads need 0..3 volumes (the third is just for weird cases), and the writable layer seems like it's only a win when you are lazy. Most people with weird workloads aren't lazy (it takes work to get into that state).
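Pulling a few of the practices above together, here is a minimal, illustrative sketch of a pod that keeps its writable layer read-only, uses an EmptyDir for scratch space, and keeps a small writable area on tmpfs:

```yaml
kind: Pod
apiVersion: v1
metadata:
  name: best-practices-example
spec:
  containers:
  - name: app
    image: busybox
    command: ["sh", "-c", "sleep 3600"]
    securityContext:
      readOnlyRootFilesystem: true   # avoid writing to the container's writable layer
    volumeMounts:
    - mountPath: /cache              # scratch space backed by EmptyDir
      name: scratch
    - mountPath: /tmp                # small writable area kept in RAM (tmpfs)
      name: tmp
  volumes:
  - name: scratch
    emptyDir: {}
  - name: tmp
    emptyDir:
      medium: Memory
```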
Can you elaborate on what support for "random access storage devices only" means? Does this mean using RAM as storage?
Good question. I took this to mean DASD (i.e. this will not work for Tape drives or other sequential access storage media). Is this not the case?
Yes, it means not supporting tape.