
KEP-4639: adding OCI VolumeSource #4642

Merged Jun 21, 2024 (37 commits)

Conversation

@sallyom (Contributor, author) commented May 17, 2024

@k8s-ci-robot k8s-ci-robot added kind/kep Categorizes KEP tracking issues and PRs modifying the KEP directory cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. sig/node Categorizes an issue or PR as relevant to SIG Node. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels May 17, 2024
@sftim (Contributor) commented May 18, 2024

Can this feature be provided out-of-tree (for example, using CSI)?

Comment on lines 784 to 788
<!--
What other approaches did you consider, and why did you rule them out? These do
not need to be as detailed as the proposal, but should include enough
information to express the idea and why it was not acceptable.
-->
Contributor:

If CSI is an option for this, explain how that alternative could look.

Contributor (author):

will add this

Comment:

A CSI driver would be a better approach IMO

Member:

Runtimes already pull and handle images using the CRI. I'm wondering how a CSI parallel pull, auth and storage implementation is better than extending an existing solution.


Contributor (author):

@vsoch very cool! is it possible to gate pulls based on signature verification & policy?

Comment:

Sorry missed this! Not at the moment, but I don't see why not.

Comment on lines 181 to 182
We propose to add a new `VolumeSource` that supports OCI images and/or artifacts. This `VolumeSource` will allow users to mount an OCI image or artifact directly into a pod,
making the files within the image accessible to the containers without the need for a shell or additional image manipulation.
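For concreteness, the proposed shape could look something like the sketch below. The field names (`oci`, `reference`, `pullPolicy`) and the helper function are hypothetical, modeled on existing Kubernetes volume-source conventions rather than on any API shape the KEP had settled at this point:

```python
# Hypothetical pod spec using the proposed OCI volume source; the "oci"
# volume field and its sub-fields are illustrative only, not the KEP's API.
pod_spec = {
    "volumes": [
        {
            "name": "model-data",
            "oci": {
                "reference": "registry.example.com/models/weights:v1",
                "pullPolicy": "IfNotPresent",
            },
        }
    ],
    "containers": [
        {
            "name": "app",
            "image": "registry.example.com/app:v2",
            "volumeMounts": [
                # Files from the OCI image appear read-only under /models,
                # with no shell or extra image manipulation in the container.
                {"name": "model-data", "mountPath": "/models", "readOnly": True}
            ],
        }
    ],
}

def oci_volume_references(spec):
    """Collect the OCI references a kubelet would have to pull for this pod."""
    return [v["oci"]["reference"] for v in spec.get("volumes", []) if "oci" in v]
```

The helper illustrates the kubelet-side view: image-volume references become additional pull targets for the pod, resolved before the containers start.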
Contributor:

Which kinds of artifacts? Not all OCI artifacts represent filesystems.

Contributor (author):

Here I'm referencing files, the equivalent of an OCI scratch image, for container volume mounts. An artifact that can be represented as files is all that makes sense here, I think; but really, anything that can be mounted within a pod's or container's filesystem.

Contributor:

How will the volume driver know what kind of OCI artifact it's trying to pull and mount? I know “container image” is one of the options.

Member:

Should we handle artifacts/images shared across a pod's containers in a completely different way than artifacts/images of individual containers... hmm

@sftim (Contributor) commented May 21, 2024

If this does end up in-tree, I'd hope to see a dedicated artifact type that means “here are files, don't try to run them”; this is just data.

An example: you run an antimalware service with scanning patterns for email messages. The scanner uses a blue/green approach where the same application container image gets deployed alongside an updated scanning-patterns image, n times an hour. Both the application image artifact and the scanning-pattern artifact are signed. The files in that signed patterns image consist of compiled patterns that would work as a volume source, but there's no executable code in there.

(I do like the idea of making signature verification an eventual goal; that's a useful feature to leave room for).

@sftim (Contributor) commented May 24, 2024

To me what needs clarifying is:

  1. are we planning to support OCI container images as a read-only volume type, using CRI to handle those images but not to run any code?
  2. are we planning to support arbitrary OCI artifacts as volume sources, including OCI container images as one of potentially many kinds of artifact you can mount?

If ① then CRI feels like the better fit; that way we can use the Pod's RuntimeClass to fetch the image, in case it matters. If ② then we can't assume CRI knows about that kind of artifact.

Contributor:

Either way we should document the alternative, even if we rule it out. It's good for KEPs to mention a sketch of the design(s) we didn't pick as well as the one we did.

Member:

On the arbitrary OCI artifacts note: this needs investigation/discussion. We are talking about setting up a call to discuss next week.

There are a number of ways artifacts have been implemented pre- and post-OCI 1.1 specs.

@sftim, to your proposal question 1: What does it mean to provide a mount volume into the pod's containers but not to run any code? I get that we are not (yet) expanding a pod to hold an arbitrary container image/artifact with its own command to run. Sure. But what does it mean to mount a container image or artifact without running any code there or from there? Does the volume include all platforms/versions, or just the one image the OCI tooling would have chosen if it were being run on the node as a container by a runtime engine? Does the volume include the entire index manifest tree/chain, with manifests, configs, and layers, or just the layers that would otherwise make up the container's rootfs?

To your proposal question 2: we also can't assume any CSI driver would know about all particular kinds of artifacts and registries, and how to pull, parse & mount them.

Comment:

> On the arbitrary OCI artifacts note.. needs investigation/discussion.. We are talking about setting up a call to discuss next week.

The OCI volume source should be a mountable artifact in the first place.
Arbitrary OCI artifacts may not work very well: ORAS supports pushing any object to a registry as an OCI artifact, but not all of them can be mounted as volumes.

Member:

> If ① then CRI feels like the better fit; that way we can use the Pod's RuntimeClass to fetch the image, in case it matters

This may matter for VM-based runtimes that use block devices to expose the image rootfs.

> But what does it mean to mount a container image or artifact, but not to run any code there or from there? Does the volume include platforms/versions or just the one image the oci tooling would've chosen if it was being run on the node as a container by a runtime engine, does the volume include the entire index manifest tree/chain, manifest configs layers and all? Or just the layers that would otherwise makeup the container's rootfs?

As a basic idea here: the one image the container runtime would have chosen (i.e., matching platforms in the case of an index) as a union (or flattened), but without the config (since the config does not affect any files). IOW the union of the layers that would otherwise make up the container's rootfs (and probably without any runtime-generated files such as /etc/hosts and without runtime-generated mounts like /proc or /sys).
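The "union of the layers" semantics described above can be sketched in a few lines. This is a toy model, not CRI or runtime code: layers are plain `{path: content}` dicts, and the `.wh.` filename prefix is the OCI image-spec whiteout marker that deletes the shadowed path from lower layers:

```python
# Flatten an ordered list of layers into a single rootfs view.
# A file named ".wh.<name>" in a layer removes "<name>" from the result,
# mirroring the whiteout convention in the OCI image layer spec.
WHITEOUT = ".wh."

def flatten_layers(layers):
    rootfs = {}
    for layer in layers:  # lower layers first, upper layers override
        for path, content in layer.items():
            directory, _, name = path.rpartition("/")
            if name.startswith(WHITEOUT):
                # Whiteout: delete the shadowed path instead of adding a file.
                victim = (directory + "/" if directory else "") + name[len(WHITEOUT):]
                rootfs.pop(victim, None)
            else:
                rootfs[path] = content
    return rootfs
```

The config blob plays no part here, matching the point above that it does not affect any files; runtime-generated entries like /etc/hosts, /proc, or /sys would likewise never enter this view.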

> If ② then we can't assume CRI knows about that kind of artifact.

> To your 2. proposal we also can't assume any CSI driver would know about all particular kinds of artifacts, registries and how to pull, parse & mount them.

An argument in favor of the CSI approach is that a cluster admin could deploy a CSI driver that matches whatever artifact types they expect to have used in the cluster. On the other hand, another approach that we could take on the runtime side is to build out an extension mechanism such that the runtime delegates to some sort of mount-provider program based on the mediatype that is found after pulling the artifact.
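The "mount-provider per mediatype" extension mechanism floated above could be sketched as a simple registry keyed on the mediatype found after the pull. The container-image config mediatype below is the real OCI one; the scan-patterns type and the handler bodies are made up purely for illustration:

```python
# Dispatch table from artifact mediatype to a mount-provider callable.
MOUNT_PROVIDERS = {}

def mount_provider(media_type):
    """Decorator registering a handler for one mediatype."""
    def register(fn):
        MOUNT_PROVIDERS[media_type] = fn
        return fn
    return register

@mount_provider("application/vnd.oci.image.config.v1+json")  # real OCI mediatype
def mount_container_image(artifact):
    # A container image mounts as the union of its layers (see flatten sketch).
    return f"union-mount of {len(artifact['layers'])} layers"

@mount_provider("application/vnd.example.scan-patterns.v1")  # hypothetical type
def mount_scan_patterns(artifact):
    return "copy files read-only"

def mount(artifact):
    """Delegate to whichever provider claims this artifact's mediatype."""
    try:
        handler = MOUNT_PROVIDERS[artifact["mediaType"]]
    except KeyError:
        raise ValueError(f"no mount provider for {artifact['mediaType']}")
    return handler(artifact)
```

Under this design a cluster admin extends support by installing a new provider, without the runtime itself needing to understand every artifact kind.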

@sallyom (Contributor, author) commented May 21, 2024

> Can this feature be provided out-of-tree (for example, using CSI)?

Anyone know what happened with this? I'll add it as an Alternative; I'm gathering info on csi-driver-image-populator.

@rchincha commented
Noticing that there are a lot of groups working on "similar" things.
Getting like-minded folks together ...
https://github.com/converged-computing/oras-csi

- Update Kubelet and CRI to recognize and handle new media types associated with OCI artifacts.
- Ensure that pulling and storing these artifacts is as efficient and secure as with OCI images.

**Lifecycling and Garbage Collection:**



### [KEP 1495: Volume Populators](https://github.com/kubernetes/enhancements/tree/master/keps/sig-storage/1495-volume-populators)

The volume-populators API extension allows you to populate a volume with data from an external data source when the volume is created.

Comment:

A volume populator with a data source pointing to an OCI artifact on a registry could still be an option. But with all the options k8s has, it's prudent to pick one.

### Custom CSI Plugin

See [https://github.com/warm-metal/container-image-csi-driver](https://github.com/warm-metal/container-image-csi-driver)


volumes and container images. The in-tree OCI VolumeSource will utilize these existing mechanisms.

2. **Integration with Kubernetes:**
- **Optimal Performance:** Deep integration with the scheduler and kubelet ensures optimal performance and

Comment:

More details are needed about how one would handle large OCI artifacts (the AI/ML inference-data use case, for example). The real win we are looking for here is saving the latency/network cost of pulling said artifact by not needing to repeat the pull across multiple pod restarts.

@rchincha commented May 25, 2024

https://kubernetes.io/docs/concepts/storage/volumes/#gitrepo
^ could be similar to this, though that was deprecated in favor of initContainer/emptyDir

We likely need a "persistent" (cache?) volume source.

Comment:

> https://kubernetes.io/docs/concepts/storage/volumes/#gitrepo ^ could be similar to this, though that was deprecated in favor of initContainer/emptyDir

> We likely need a "persistent" (cache?) volume source.

On-demand loading with a separate standalone cache can handle large artifact sizes very well. We have done a POC on this.

If the artifacts support lazy loading, then only some metadata needs to be saved locally, and there is no need to worry too much about garbage collection or layer data reuse. We only need a local cache system to save temporary data, and the management of images/artifacts can be easier.
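The cache discussion above leans on the fact that OCI blobs are content-addressed: a blob named by its sha256 digest can never change, so once verified it can be reused across pod restarts without re-pulling. A toy sketch of such a digest-keyed cache (not a real kubelet or CSI mechanism; `fetch` stands in for a registry pull):

```python
import hashlib

class BlobCache:
    """Digest-keyed blob store: pull once, verify, reuse forever after."""

    def __init__(self):
        self._blobs = {}
        self.pulls = 0  # how many times we actually hit the registry

    def get(self, digest, fetch):
        blob = self._blobs.get(digest)
        if blob is None:
            self.pulls += 1
            blob = fetch()
            # Content addressing makes verification trivial: recompute and compare.
            if "sha256:" + hashlib.sha256(blob).hexdigest() != digest:
                raise ValueError("digest mismatch; refusing to cache")
            self._blobs[digest] = blob
        return blob
```

Garbage collection then reduces to dropping unreferenced digests, and lazy-loading variants only need to keep the manifest metadata local, as the comment above suggests.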

an additional operational burden. For a generic, vendor-agnostic, and widely-adopted solution this would not make sense.
- External CSI plugins implement their own lifecycle management and garbage collection mechanisms,
yet these already exist in-tree for OCI images.
- Performance: There is additional overhead with an out-of-tree CSI plugin, especially in scenarios requiring frequent image pulls
Member:

Can you expand more on what this additional overhead is?

Signed-off-by: Sascha Grunert <[email protected]>
@saschagrunert (Member) commented

Requesting approval from @kubernetes/sig-node-feature-requests

@k8s-ci-robot k8s-ci-robot added the kind/feature Categorizes issue or PR as related to a new feature. label Jun 19, 2024
saschagrunert and others added 2 commits June 20, 2024 09:09
Co-authored-by: Brandon Mitchell <[email protected]>
Signed-off-by: Sascha Grunert <[email protected]>
@SergeyKanzhelev (Member) left a comment

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jun 20, 2024
@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jun 21, 2024
@haircommander (Contributor) commented

excellent work here!

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jun 21, 2024
@mrunalp (Contributor) commented Jun 21, 2024

/approve

@saschagrunert (Member) commented

@kubernetes/prod-readiness-reviewers PTAL

@soltysh (Contributor) commented Jun 21, 2024

/approve
David previously approved the PRR, and he's currently out, so replacing him on that job 😉

@k8s-ci-robot (Contributor) commented

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: mrunalp, sallyom, soltysh

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jun 21, 2024
@k8s-ci-robot k8s-ci-robot merged commit fa8ed51 into kubernetes:master Jun 21, 2024
4 checks passed
@k8s-ci-robot k8s-ci-robot added this to the v1.31 milestone Jun 21, 2024

To clear old volumes, all workloads using the `VolumeSource` needs to be
recreated after restarting the kubelets. The kube-apiserver does only the API
validation whereas the kubelets serve the implementation. This means means that
Member:

nit:
s/means means/means/

Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/feature Categorizes issue or PR as related to a new feature. kind/kep Categorizes KEP tracking issues and PRs modifying the KEP directory lgtm "Looks good to me", indicates that a PR is ready to be merged. sig/node Categorizes an issue or PR as relevant to SIG Node. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files.