-
Notifications
You must be signed in to change notification settings - Fork 1.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
KEP: Support for CSI Plugins on Windows Nodes #1141
Conversation
/sig storage |
/assign @yujuhong @msau42 @PatrickLang @jingxu97 @michmike |
|
||
### Non-Goals | ||
|
||
1. Support CSI controller plugin operations from Windows nodes: This may be considered in the future but not an immediate priority. If all the worker nodes in the cluster are Windows nodes and Linux master nodes have scheduling disabled then CSI controller plugins cannot be scheduled for now. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
would be good to clarify that this doesn't necessarily require privilege and could already be done today.
|
||
### Risks and Mitigations | ||
|
||
Any pod on a Windows node can be configured to mount `\\.\pipe\csi-proxy` and perform privileged operations. Thus csi-proxy presents a potential security risk. To mitigate the risk, an admission webhook will need to be implemented and deployed that will reject all containers that bind mount `\\.\pipe\csi-proxy` as a hostPath volume but does not have `privileged` flag set in the pod's securityContext specification. This allows the `privileged` setting to be used for Windows pods as an indication the container will perform privileged operations. Other cluster-wide policies (e.g. PSP) that act on the `privileged` setting in a container's `securityContext` can enforce the same for CSI Node plugins targeting Windows nodes. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should this be unix socket or pipe?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
https://github.com/open-policy-agent/gatekeeper would also work great - could write a policy such as only allow unix socket mounts if in kube-system namespace
rpc LinkVolume(LinkVolumeRequest) returns (LinkVolumeResponse) {} | ||
|
||
// GetDiskNumberWithLocation finds a disk device with requested Location <Bus, Target, LUN ID> | ||
rpc GetDiskNumberWithLocation(GetDiskNumberWithLocationRequest) returns (GetDiskNumberWithLocationResponse) {} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What if the Location information isn't available anywhere? The AWS EBS driver would like to run a ebsnvme-id.exe
on the host to map the cloud's reported volume-ID to the disk number: https://github.com/kubernetes/kubernetes/pull/79552/files#diff-238adf22aa109b4ad41399fe48280ec8R214. GCE PD may be similar, it runs some powershell cmdlets on the host: https://github.com/kubernetes/kubernetes/blob/master/pkg/volume/gcepd/attacher.go#L147
Adding a proxy call to Get-Disk
to the API will be a suitable workaround for EBS (ebsnvme-id.exe is not strictly necessary, at the moment Get-Disk
is guaranteed to return the same information), but I'm not sure about GCE PD.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jingxu97 to determine if this api works for GCE PD or we need a special service to map PD devices.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As part of the prototype to make sure everything does work end-2-end from a OS plumbing perspective, I ported the GCE PD CSI driver to work on Windows node. As part of that, I got the SCSI Page 83 queries and mapping working (required for the mapping of GCE PD and potentially by other vendors too) by creating a generic version of the Get-GcePdName
cmdlet that directly issues IOCTL_STORAGE_QUERY_PROPERTY
and matches StorageIdCodeSetAscii
IDs of type StorageIdAssocDevice
with the passed ID.
In case of GCE PD, the ID is conveniently just the PVC name. For other storage systems, their CSI plugins may require more complicated logic to figure out what the SCSI Page 83 ID should be. The generic wrapper for identifying a disk with the desired SCSI Page 83 is exposed through the GetDiskNumberWithSCSIID
API mentioned in the KEP.
For the GetDiskNumberWithLocation
API, I was indeed planning to use the information returned through Get-Disk
@wongma7.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I smiled when I saw GetDiskNumberWithSCSIID
, I knew exactly why it was needed...
The GCE Persistent Disk driver (in-tree at least, haven't looked at the CSI version but assuming it's the same...) needs to map the PD "friendly name" to a Windows device id. We're starting to investigate if we can provide this information natively through the Get-Disk
cmdlet (kubernetes/kubernetes#74674). Even if we can change GCE to support this though, it looks like the PD driver would still end up using GetDiskNumberWithSCSIID
rather than GetDiskNumberWithLocation
since the latter will use <Bus, Target, LUN ID> as its key. Is that right?
Is the PD driver the only storage driver that uses such a "friendly name" for identifying the volume the pod wants to use? If not then perhaps an API method that accepts that as the argument would be useful; if so then GetDiskNumberWithSCSIID
seems probably suitable.
rpc LinkVolume(LinkVolumeRequest) returns (LinkVolumeResponse) {} | ||
|
||
// GetDiskNumberWithLocation finds a disk device with requested Location <Bus, Target, LUN ID> | ||
rpc GetDiskNumberWithLocation(GetDiskNumberWithLocationRequest) returns (GetDiskNumberWithLocationResponse) {} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jingxu97 to determine if this api works for GCE PD or we need a special service to map PD devices.
|
||
A CSI Node Plugin will interact with csi-proxy.exe using a named pipe: `\\.\pipe\csi-proxy` (exposed by csi-proxy.exe). The pipe will need to be mounted into the Node Plugin container from the host using the pod's volume mount specifications. Note that domain sockets cannot be used in this scenario since Windows blocks a containerized process from interacting with a host process that is listening on a domain socket. | ||
|
||
A GRPC based interface will be used by CSI Node Plugins to invoke privileged operations through csi-proxy.exe. The GRPC API will be versioned and any release of csi-proxy.exe binary will maintain backward compatibility across all prior versions of the API. This avoids having to run multiple versions of the csi-proxy.exe binary on the same Windows host if multiple CSI Node Plugins have been configured and the version of the csi-proxy API required by the plugins are wide apart. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
backwards compatibility even for alpha?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for writing this up!
|
||
## Summary | ||
|
||
Container Storage Interface (CSI) is a modern GRPC based standard for implementing external storage plugins (maintained by storage vendors, cloud providers, etc.) for container orchestrators like Kubernetes. Persistent storage requirements of containerized workloads can be satisfied from a diverse array of storage systems by installing and configuring the CSI plugins supported by the desired storage system. This KEP covers the enhancements necessary in Kubernetes core and CSI related out-of-tree components (specific to Kubernetes) to support CSI plugins for Windows nodes in a Kubernetes cluster. With the enhancements proposed in this KEP, Kubernetes operators will be able to leverage modern CSI plugins to satisfy the persistent storage requirements of Windows workloads in Kubernetes. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: add a link to CSI
|
||
## Motivation | ||
|
||
Support for containerized Windows workloads on Windows nodes in a Kubernetes cluster reached GA status in v1.14. For persistent storage requirements, Windows workloads today depend on: (1) Powershell based FlexVolume plugins maintained by Microsoft that support mounting remote storage volumes over SMB and iSCSI protocols and (2) In-tree plugins in Kubernetes core (kubernetes/kubernetes repository) for popular cloud environments that support formatting and mounting direct attached disks on Windows nodes. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: add a link to the FlexVolume plugins.
|
||
### Non-Goals | ||
|
||
1. Support CSI controller plugin operations from Windows nodes: This may be considered in the future but not an immediate priority. If all the worker nodes in the cluster are Windows nodes and Linux master nodes have scheduling disabled then CSI controller plugins cannot be scheduled for now. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You could state that this is orthogonal to the proposal.
hostPath: | ||
path: C:\var\lib\kubelet\ | ||
type: Directory | ||
- name: test-dir |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What's the test-dir for?
|
||
The main drawback associated with the approach leveraging csi-proxy.exe is that the life cycle of that binary will need to be managed out-of-band from Kubernetes. However, cluster administrators need to maintain and manage life cycle of other core binaries like kubeproxy.exe, kubelet.exe, dockerd.exe and containerd.exe (in the future). Therefore csi-proxy.exe will be one additional binary that will need to be treated in a similar way. | ||
|
||
The API versioning scheme described above, as of now, will maintain backward compatibility indefinitely. This requires the scope of csi-proxy.exe to be limited to a very scoped down fundamental set of operations. Maintainers therefore will need to be very cautious when accepting suggestions for new APIs and enhancements. This may slow progress at times. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Adding another API and maintaining is my biggest concern with the approach. (I think we've also discussed this in the @PatrickLang's doc) Does SIG storage think the API surface will be manageable and stable?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What if instead of a common proxy we have a per-driver proxy that is maintained and deployed by each driver vendor? It pushes the responsibility to the driver maintainers, but API and versioning compatibility is simpler since it's all under the control of one vendor. However, this does imply we need to give CSI drivers permissions to install binaries directly on the host.
But even for Linux CSI drivers today, some already need to do this to an extent (ie iscsd doesn't work well in a container)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
At that point, why not use that same deployment mechanism to install and run the CSI driver directly on the host?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One alternative to having a set of operations as a scoped down API exposed by the proxy is: surfacing a simple Exec
API in the proxy that accepts arbitrary parameters and returns stdout results. That can supersede all the operations mentioned here as containers will be free to run arbitrary commands on the host. I can list this as an alternative. The main disadvantage is this makes the proxy open to massive abuse.
With this approach + bind mounting a dedicated per-plugin path in the host, plugins can drop arbitrary exes they may need (like ebsnvme-id.exe
above) and run it through the exec interface if desired. Thoughts?
|
||
### Version Skew Strategy | ||
|
||
Different nodes in the cluster may be configured with different versions of csi-proxy.exe as part of a rolling upgrade of csi-proxy.exe. In such a scenario, it is recommended that csi-proxy.exe is upgraded first and once that is complete, the CSI Node Plugins that can take advantage of the new version of csi-proxy.exe is rolled out. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How would the failure caused by the version skew be manifested to the users?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will call out explicitly that plugins need to check version of the proxy using the version API call and fail initialization the plugin detects an incompatibility.
|
||
### Package CSI Node Plugins as containers and deployed as processes running on the host: | ||
|
||
With this approach, the container run time is enhanced to be able to launch binaries directly on the host after pulling down a container image and extracting the binary from the image. While usual Kubernetes constructs may be used to launch pods with a special RuntimeClass that can handle launching of the binaries as processes on hosts, Kubernetes won't be able to fully monitor and control the whole life-cycle of the binaries post launch. Collection of logs from the plugins also become problematic and will require tooling out-of-band from Kubernetes. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think managing lifecycles and logs is possible, but require more work in the container runtime(s).
The cons mentioned here (i.e., lifecycle management and logs) also present in the approach above "Deploying CSI Node Plugins as binaries and deployed as processes running on the host".
|
||
### Support for Privileged Operations and Bi-directional mount propagation in Windows containers: | ||
|
||
At some point in the future, it is expected that a Windows LTS release will implement support for execution of privileged operations and bi-directional mount propagation from inside containers. At that point, the requirement of a proxy to handle privileged operations on behalf of containers will disappear. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think this is confirmed yet (or maybe I've missed the update). I think it'd be good to point out that this option won't be available in a year, hence the need for a interim or even permanent solution.
|
||
## Infrastructure Needed | ||
|
||
The code for csi-proxy as well as the GRPC API will be maintained in a dedicated repo: github.com/kubernetes-csi/windows-csi-proxy |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will there be a csi proxy for linux as well?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
At a high level, a CSI proxy for Linux in the context of operations enumerated in the API above, is unnecessary as the operations can be executed from privileged containers directly in Linux. However, as @msau42 pointed out above there are some special scenarios in Linux where a host footprint may be necessary for Linux plugins as well. iSCSI is an interesting corner case today for Linux where the iscsiadm
binary is typically mounted into the plugin container from the host while interactions with the iSCSI kernel modules is facilitated through iscsid
running on the host.
I will call out that the proxy called out in the KEP focusses on Windows today and may be expanded for Linux scenarios in the future. Does that sound good @msau42 @yujuhong?
@msau42 @PatrickLang @yujuhong Thanks for the detailed review so far. I have addressed all the comments thus far and also added a diagram clarifying the interactions on the node side with csi-proxy. PTAL. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The KEP is well-written and clear, thanks a lot Deep!
|
||
#### Enhancements in CSI Node Plugins | ||
|
||
Code for CSI Node Plugins need to be refactored to support CSI Node APIs in both Linux and Windows nodes. While the code targeting Linux nodes can assume privileged access to the host, the code targeting Windows nodes need to invoke the GRPC client API associated with the desired version of the CSIProxyService described above. Thus the codebase for most plugins that wishes to support Windows nodes may be common across Windows and Linux with some amount of platform specific logic to call CSIProxyService in case of Windows. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you know approximately how many CSI Node Plugins already exist?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My very rough approximation is more than 20 at this point (targeting Linux). Pretty much all the major storage vendors are working on beta versions of CSI plugins. Some of them are working on multiple plugins targeting different product lines. The major clouds are also developing a CSI plugin and sometimes multiple ones for different storage backends (shared storage vs block).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There are 40+ CSI plugins listed publicly: https://kubernetes-csi.github.io/docs/drivers.html
rpc LinkVolume(LinkVolumeRequest) returns (LinkVolumeResponse) {} | ||
|
||
// GetDiskNumberWithLocation finds a disk device with requested Location <Bus, Target, LUN ID> | ||
rpc GetDiskNumberWithLocation(GetDiskNumberWithLocationRequest) returns (GetDiskNumberWithLocationResponse) {} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I smiled when I saw GetDiskNumberWithSCSIID
, I knew exactly why it was needed...
The GCE Persistent Disk driver (in-tree at least, haven't looked at the CSI version but assuming it's the same...) needs to map the PD "friendly name" to a Windows device id. We're starting to investigate if we can provide this information natively through the Get-Disk
cmdlet (kubernetes/kubernetes#74674). Even if we can change GCE to support this though, it looks like the PD driver would still end up using GetDiskNumberWithSCSIID
rather than GetDiskNumberWithLocation
since the latter will use <Bus, Target, LUN ID> as its key. Is that right?
Is the PD driver the only storage driver that uses such a "friendly name" for identifying the volume the pod wants to use? If not then perhaps an API method that accepts that as the argument would be useful; if so then GetDiskNumberWithSCSIID
seems probably suitable.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
great KEP @ddebroy! Just a couple comments and questions
1. Ability to handle Windows file paths in the Kubelet plugin watcher for domain sockets on Windows nodes. | ||
2. Refactor code in the CSI Node Driver Registrar so that it can be compiled for Windows. | ||
3. Build official CSI Node Driver Registrar container images based on Windows base images. | ||
4. Refactor code in existing CSI Node Plugins to support Windows. All privileged operations will need to be driven through an API exposed by a "privileged proxy" binary described below. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should we leave this step to individual plugin owners? Maybe producing a guide for driver developers to follow is more scalable
5. Build CSI Node Plugin container images based on Windows base images. | ||
6. Create DaemonSet YAMLs referring to CSI Node Driver Registrar container images [3] and CSI Node Plugin container images [5] targeting Windows. | ||
|
||
Besides the above enhancements, a new "privileged proxy" binary, named csi-proxy.exe is a key aspect of this KEP. csi-proxy.exe will run as a native Windows process on the Windows nodes configured as a Windows Service. csi-proxy.exe will expose an API (through GRPC over a named pipe) for executing privileged storage related operations on Windows hosts on behalf of Windows containers like CSI Node Plugins. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In general can you line break around 80 chars for review ease (can comment on specific lines)
|
||
![Kubernetes CSI Components](https://raw.githubusercontent.com/kubernetes/community/master/contributors/design-proposals/storage/container-storage-interface_diagram1.png?raw=true "Kubernetes CSI Components") | ||
|
||
Support for Unix Domain Sockets has been introduced in Windows Server 2019 and works across containers as well as host and container as long as the processes running in containers are listening on the socket. If a process from within a container wishes to connect to a domain socket that a process on the host OS is listening on, Windows returns a permission error. This scenario however does not arise in the context of interactions between Kubelet, CSI Node Driver Registrar and CSI Node Plugin as these involve a process in a container listening on a domain socket (CSI Node Driver Registrar or CSI Node Plugin) that a process on the host (Kubelet) connects to. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What about the proxy? Host listening and the process in the container connecting. Also is there documentation or reasoning on why this restriction exists?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The proxy exposes a Windows named pipe. Will call it out here and mention it is detailed in later sections. Regarding why this restriction exists, given the very specific permission error when trying to connect to a host socket from a containerized process, my assumption is it's a conscious security choice at the OS layer. Unfortunately, this behavior or error is not clearly documented in MSDN.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
my assumption is it's a conscious security choice at the OS layer
can we get more information about whether that is the case, and whether what is proposed here is undoing that choice?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@liggitt Windows named pipes are an established way to have a container communicate with a host process for a while now. Support was added back in 2016: moby/moby#19911 in Docker engine and was specifically intended to provide a mechanism similar to Unix Domain Sockets in Windows before support for Domain Sockets was enabled in Windows later in 2019. /cc @PatrickLang for any additional comments on any specific reasons why Domain Sockets in the container process => host process path is not allowed while named pipes allow it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Honestly I'm not sure why it's being blocked. It may be a regression in Windows and it's something we can root cause as we implement this. Named pipes and unix domain sockets are functionally equivalent and both work as container mounts, so I think we'll decide on which to enable as part of this work. We've already shown that named pipes work, so if unix domain sockets can be "fixed" we can choose to enable that later.
|
||
Based on the above, we can conclude that some of the fundamental support in the OS and compiler with regards to domain sockets in the context of CSI plugin discovery and a channel for API invocation is present in a stable state in Windows Server 2019 today. Although there are some observed limitations with respect to domain sockets in Windows Server 2019, they are not major blockers in the context of CSI Node Plugins. In the section below, we call out the components in the context of CSI Node Plugins in Kubernetes that will need to be enhanced to properly account for Windows paths and make use of domain sockets in Windows in a manner very similar to Linux. | ||
|
||
CSI Node Plugins need to execute certain privileged operations at the host level as well as propagate mount points in response to the CSI API calls. Such operations involve: scanning disk identities to map the node OS's view of a disk device to a CSI volume provisioned and attached by CSI controller plugins, partitioning a disk and formatting it when necessary, bind-mounting volumes from the host to the container workload, resizing of the file system as part of a volume resize, etc. These operations cannot be invoked from a container in Windows today. As a result containerized CSI Node Plugins in Windows require some mechanism to perform these privileged operations on their behalf on the Windows host OS. csi-proxy.exe, described below serves that role by performing the storage related privileged operations on behalf of containers. Alternative approaches to csi-proxy.exe are described further below in the Alternatives section. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These operations cannot be invoked from a container in Windows today.
Which of these operations can't be performed from inside a container... all of them?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Correct - all of them. No operations in the host namespace is permitted from inside a container.
Golang supports domain socket operations for Windows since go version 1.12. It was found that in Windows, `os.ModeSocket` is not set on the `os.FileMode` associated with domain socket files in Windows. Therefore determining whether a file is a domain socket file using `os.ModeSocket` does not work on Windows. | ||
|
||
Based on the above, we can conclude that some of the fundamental support in the OS and compiler with regards to domain sockets in the context of CSI plugin discovery and a channel for API invocation is present in a stable state in Windows Server 2019 today. Although there are some observed limitations with respect to domain sockets in Windows Server 2019, they are not major blockers in the context of CSI Node Plugins. In the section below, we call out the components in the context of CSI Node Plugins in Kubernetes that will need to be enhanced to properly account for Windows paths and make use of domain sockets in Windows in a manner very similar to Linux. | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Did we consider instead of using the proxy to deploy the drivers uncontainerized completely? Might be worth noting somewhere if that was an option considered and why it's not the final design
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will add a note here pointing to the alternative design sections later down.
|
||
#### Beta -> GA Graduation | ||
|
||
- All in-tree storage plugins that implements support for Windows can use csi-proxy.exe along with other enhancements listed above to successfully deploy CSI plugins on Windows nodes. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"all"? think we should clarify here, I think all would encompass things like "emptydir", "hostpath".
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point. Will make this more specific.
|
||
- All in-tree storage plugins that implements support for Windows can use csi-proxy.exe along with other enhancements listed above to successfully deploy CSI plugins on Windows nodes. | ||
- csi-proxy.exe supports v1 stable version of the CSIProxyService API. | ||
- Successful usage of csi-proxy.exe with support for v1 version of CSIProxyService API in Windows nodes by at least two storage vendors. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure if you want to bundle graduation criteria with exact implementation details (maybe implementation might change?), instead it might be helpful to focus on what the end result would be and the ways we can validate those results? Not sure about this comment though, maybe we do want to speak about exact implementation details for more accuracy
|
||
The CSIProxyService API will be a defined set of operations that will need to expand over time as new CSI APIs are introduced that require new operations on every node as well as desire for richer operations by CSI plugin authors. Unfortunately this comes with a maintenance burden associated with tracking and graduating new RPCs across versions. | ||
|
||
An alternative approach that simplifies the above involves exposing a single Exec interface in CSIProxyService that supports passing an arbitrary set of parameters to arbitrary executables and powershell cmdlets on the Windows host and collecting and returning the stdout and stderr back to the invoking containerized process. Since all the currently enumerated operations in the CSIProxyService API can be driven through the generic Exec interface, the set of desired privileged operations necessary becomes a decision for plugin authors rather than maintainers of csi-proxy. The main drawback of this highly flexible mechanism is that it drastically increases the potential for abuse in the host by 3rd party plugin authors. The effect of bugs or vulnerabilities in individual plugins and ability to exploit them to take control of the host becomes critical. Depending on the adoption of csi-proxy in the Alpha and Beta phases and the need for specialized privileged operations, we may consider adding a generic Exec interface in the future. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If a driver is compromised, would exec even pose an even greater risk than the api surface in the main design? Looks like you could expose arbitrary directories to arbitrary places effectively giving you control over the entire system (maybe root through some finite number of steps). Not a security or windows expert though :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agree that with a compromised plugin, a lot of damage can be done through the privileged interface. I will clarify that the security concern should not be a reason to pre-clue an Exec
interface if we do need one.
|
||
#### Support for Privileged Operations and Bi-directional mount propagation in Windows containers: | ||
|
||
At some point in the future, a Windows LTS release may implement support for execution of privileged operations and bi-directional mount propagation from inside containers. At that point, the requirement of a proxy to handle privileged operations on behalf of containers will disappear. However, before such support is committed to and implemented in a Windows LTS release (which is not expected in at least a year), we need solutions as described in the KEP. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
are we aware of any active development in this direction?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is no publicly declared timeline/commitment around this.
/assign @liggitt |
/lgtm - let's get this draft merged and work on promoting to |
rpc LinkVolume(LinkVolumeRequest) returns (LinkVolumeResponse) {} | ||
|
||
// GetDiskLocations returns locations <Bus, Target, LUN ID> of all disk devices enumerated by Windows | ||
rpc GetDiskLocations(GetDiskLocationsRequest) returns (GetDiskLocationsResponse) {} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: List
instead of Get
|
||
### Upgrade / Downgrade Strategy | ||
|
||
In order to install a CSI Node Plugin or upgrade to a version of a CSI Node Plugin that uses an updated version of the CSIProxyService API not supported by the currently deployed version of csi-proxy.exe in the cluster, csi-proxy.exe will need to be upgraded first on all nodes of the cluster before deploying or upgrading the CSI Node Plugin. In case there is a very old CSI Node Plugin in the cluster that relies on a version of CSIProxyService API that is no longer supported by the new version of csi-proxy.exe, the previously installed version of csi-proxy.exe should not be uninstalled from the nodes. Such scenarios are expected to be an exception. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Another strategy for the old apiversion scenario is to have csiproxy support multiple api versions for a few releases according to deprecation policy?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I will call out a deprecation policy mirroring Kubernetes policy but again emphasize that goal should be to maintain compatibility as much as possible.
For provisional, this lgtm from storage perspective. To move to implementable, we should:
|
Can you squash your commits? |
- name: csi-proxy-pipe | ||
hostPath: | ||
path: \\.\pipe\csi-proxy | ||
type: null |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
type: null?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
there's no filesystem - do you have another suggestion?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
absent or "" matches what would be specified in the API (and are equivalent, I think)
|
||
![Kubernetes CSI Components](https://raw.githubusercontent.com/kubernetes/community/master/contributors/design-proposals/storage/container-storage-interface_diagram1.png?raw=true "Kubernetes CSI Components") | ||
|
||
Support for Unix Domain Sockets has been introduced in Windows Server 2019 and works across containers as well as host and container as long as the processes running in containers are listening on the socket. If a process from within a container wishes to connect to a domain socket that a process on the host OS is listening on, Windows returns a permission error. This scenario however does not arise in the context of interactions between Kubelet, CSI Node Driver Registrar and CSI Node Plugin as these involve a process in a container listening on a domain socket (CSI Node Driver Registrar or CSI Node Plugin) that a process on the host (Kubelet) connects to. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
my assumption is it's a conscious security choice at the OS layer
can we get more information about whether that is the case, and whether what is proposed here is undoing that choice?
|
||
Registration of CSI Node Plugins on a Kubernetes node is handled by the Kubelet plugin watcher using the fsnotify package. This component needs to convert paths detected by fsnotify to Windows paths in handleCreateEvent() and handleDeleteEvent() before the paths are passed to AddOrUpdatePlugin() RemovePlugin() routines in desiredStateOfTheWorld. A new utility function, NormalizeWindowsPath(), will be added in utils to handle this. | ||
|
||
Given `os.ModeSocket` is not set on a socket file's `os.FileMode` in Windows, a specific check for `os.ModeSocket` in handleCreateEvent() will need to be relaxed. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is that a bug in go that should be reported/resolved?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It does appear to be a bug in go and I will report it and a pointer to it here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will mention golang/go#33357 in the KEP.
Signed-off-by: Deep Debroy <[email protected]>
commits squashed |
/lgtm |
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: ddebroy, msau42, PatrickLang The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
- name: plugin-dir | ||
mountPath: C:\csi | ||
- name: csi-proxy-pipe | ||
mountPath: \\.\pipe\csi-proxy |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
here it is using relative path?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jingxu97 No. Named pipe paths in Windows are always absolute. The format is \\.\pipe\pipe-name
. It is missing the version suffix mentioned in the KEP and I will at that to the name.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry for not getting these comments in before this merged. Overall, I don't see any problems with this approach, it seems on par with the security model for other plugins that expose privileged sockets (e.g. CSI plugin itself).
One mitigation that would be good to add is audit logging. At a minimum, it would be good to log each request to the API (with options), but ideally we'd have some notion of the source as well. If the proxy was authenticating, then that would be easy.
Support will need to be implemented in AllowsHostVolumePath to handle Windows pipe paths. | ||
|
||
#### Mitigation using a webhook | ||
An admission webhook can be implemented and deployed in clusters with Windows nodes that will reject all containers that mount `\\.\pipe\csi-proxy` as a hostPath volume but does not have privileged flag set in the pod's securityContext specification. This allows the privileged setting to be used for Windows pods as an indication the container will perform privileged operations. Other cluster-wide policies (e.g. PSP) that act on the privileged setting in a container's securityContext can enforce the same for CSI Node plugins targeting Windows nodes. Note that this does not in any way change how the privileged setting is used today for Linux nodes. If in the future, full privileged container support is introduced in Windows (as in Linux today), functionality of existing CSI Node Plugin DaemonSets (targeting Windows) with the privileged flag set should not get negatively impacted as they will be launched as privileged containers. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure how symlinks are handled on windows, but on linux this would be easy to circumvent.
In general, mounting of hostpaths should be thought of as a highly privileged operation, especially if they aren't read-only (can these be mounted read-only?). If the only way to get access to the socket is via a hostpath mount, I'm not too worried.
If we did want to add more enforcement, I'd prefer to add authentication & authorization to the socket itself. E.g. ues token review + subject access review to ensure the container using the proxy is authorized to do so. I'm not sure this is necessary though.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
symlinks to named pipes are not supported in Windows.
The purpose of this channel is to set up a shared communication place for the working group on supporting CSI windows The proposal for enhancements to support CSI plugins on Windows nodes is here kubernetes/enhancements#1141
Proposal for enhancements to support CSI plugins on Windows nodes.