[PCIe] Community Guidelines and Roadmap #4894
docs/pci/contribution-guidelines.md
Outdated
if possible, merged in rust-vmm, unless explicit exemption is granted by the maintainers.
* Contributors should provide design documents in case of features spanning multiple PRs to receive
early guidance from maintainers.
* Contributors should not leave open PRs stale for more than two weeks.
Will the maintainers commit to not let the PRs go stale from their end?
Should've just read ahead 😄
sure
we do.
Maintainers will review new PRs to the feature branch within one week.
Of course, we are peers in this relationship. That is why we committed to answering within one week (half the time allowed to contributors): it is our duty to set the example and guarantee a good developer experience.
- notes from previous meeting
- add a goals (mvp, stretch, out-of-scope) section
- be more clear about PCIe and not PCI
- incorporate feedback on PCIe topology

Signed-off-by: Riccardo Mancini <[email protected]>
PCIe Support in Firecracker Community Roadmap
This document describes the high-level changes required to support PCIe and
device passthrough in Firecracker and the main responsibilities of the
maintainers and the community to achieve the success of the initiative. This
document was first reviewed on November 6, 2024 and will go through a second
round of review on December 4, 2024. I will upload this document as a PR to the
poc/pcie branch so that everybody will have the opportunity to leave comments
along the way.
Goals
MVP
This is the minimal set of goals that we are aiming to achieve:
bus topology.
with up to 256 buses, if we add support for multiple buses.
performance of virtio devices, over legacy IRQ, and opening the door for I/O
scalability / MQ devices.
Challenges below)
supported.
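The 256-bus figure mentioned above matches the 8-bit bus number in a PCIe bus/device/function (BDF) address. As a rough illustration only, the sketch below shows how a BDF maps to an offset in an ECAM configuration window, where each function gets 4 KiB and each bus 1 MiB. The `Bdf` type and function names are hypothetical and are not taken from Firecracker or rust-vmm.

```rust
/// A PCIe function address: bus (8 bits), device (5 bits), function (3 bits).
/// Hypothetical type for illustration; not a Firecracker or rust-vmm API.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct Bdf {
    pub bus: u8,
    pub device: u8,
    pub function: u8,
}

impl Bdf {
    /// Returns `None` if `device` or `function` do not fit their bit widths.
    pub fn new(bus: u8, device: u8, function: u8) -> Option<Self> {
        if device < 32 && function < 8 {
            Some(Self { bus, device, function })
        } else {
            None
        }
    }

    /// Byte offset of this function's 4 KiB configuration space inside an
    /// ECAM region: bus in bits 27:20, device in 19:15, function in 14:12.
    pub fn ecam_offset(self) -> u64 {
        (u64::from(self.bus) << 20)
            | (u64::from(self.device) << 15)
            | (u64::from(self.function) << 12)
    }
}

/// Size in bytes of the ECAM window needed to cover `bus_count` buses
/// (1 MiB per bus, so 256 buses need 256 MiB).
pub fn ecam_size(bus_count: u16) -> u64 {
    u64::from(bus_count) << 20
}
```

With a single bus the window is only 1 MiB, which is one reason a VMM might start with a flat topology and add buses later.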
Stretch Goals
While we would like to achieve these goals, their priority will need to be
revisited once we have completed the MVP:
Out of Scope
We are not looking to support the following features in the medium term, to
focus on the core PCIe implementation. These will be reprioritized after the
goals above have been completed.
the tools offered by Firecracker, like vsock and network ports.
Challenges
Supporting PCIe in Firecracker and, in particular, device pass-through,
introduces new challenges. Namely:
the boot time and memory overheads of Firecracker VMs.
could be toggled on a per-VM basis through VM config or the HTTP API. This
would allow for use cases that don't want to enable PCIe to keep the
overheads and kernel footprint low (lightweight virtualization).
would initially be offered as an optional compilation feature.
VMM to allocate the entire physical memory of the VM to allow for DMA from the
device.
swiotlb and PCI ATS/PRI
may change the security posture of Firecracker.
interferences.
security posture needs to be carefully evaluated.
devices, therefore, snapshot/resume will not be supported for active/online
passed-through devices.
Contribution Guidelines
Before diving deeper into the required changes in Firecracker, it’s important to
be clear on the responsibility split between the maintainers and the community
contributors. As this is a community-driven initiative, it will be the
responsibility of contributors to propose designs, make changes, and work with
the upstream rust-vmm community. Maintainers of Firecracker will provide
guidance, code reviews, project organization, facilitate rust-vmm interactions,
and automated testing of the new features.
Maintainers
features/pcie
and periodically rebase it on top of main (every 3 weeks or on demand in case of
required dependencies)
basic PCIe support:
poc/pcie.
The POC is just a scrappy implementation and will need to be rewritten from
scratch to meet the quality and security bars of Firecracker.
separate artifacts with PCIe support (e.g. guest kernels)
specific artifacts
feature branch
project using public GitHub issues
to organize the work (proposed every 2 weeks)
approvals from maintainers are required to merge a PR. Maintainers should
provide the required approvals or guidance to unblock the PR within
two weeks.
changes before every merge of the feature branch into main. Any findings will be
shared with the community to help address the issues.
Contributors
multiple PRs to receive early guidance from maintainers.
into main and a PCI-specific part merged into the feature branch. For example,
if we need to rework FC device management to support PCI, the development will
need to be done in main and then merged into the PCIe feature branch.
upstream rust-vmm community, and, if possible, merged in rust-vmm, unless
explicit exemption is granted by the maintainers.
CONTRIBUTING.md.
Acceptance Criteria
A proposal of the different milestones of the project is defined in the
following sections. Each milestone identifies a point in the project where a
merge of the developed features in the main branch is possible. In order to
accept the merge:
planned.
all supported devices.
disabled for all devices, when compared to MMIO devices. In other words:
PCIe is opted out.
virtio-MMIO, in case PCI is opted in.
In case of regressions, details and reproducers will be shared with the
community.
blockers, details will be shared with the community.
significantly (more than 5%)
example, in the case of VFIO support).
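A threshold such as the 5% figure above can be checked mechanically. The sketch below is only an illustration for a lower-is-better metric (for example, a boot time); the function names are hypothetical, and the choice of metric and the direction of "worse" are assumptions rather than something the criteria specify.

```rust
/// Relative change of `candidate` with respect to `baseline`, as a fraction
/// (0.06 means 6% higher than the baseline).
pub fn relative_change(baseline: f64, candidate: f64) -> f64 {
    (candidate - baseline) / baseline
}

/// For a lower-is-better metric, returns true when `candidate` is worse than
/// `baseline` by more than `threshold` (0.05 for 5%).
/// Assumption: improvements (negative changes) never count as regressions.
pub fn regresses(baseline: f64, candidate: f64, threshold: f64) -> bool {
    relative_change(baseline, candidate) > threshold
}
```

A real comparison would also need to account for measurement noise across runs, which this sketch ignores.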
Milestones
This section describes a proposed high-level plan of action to be discussed with
the community. A more detailed plan will need to be provided by contributors
before starting the implementation, which maintainers will help refine.
0. Proof of Concept and Definition of Goals
It is important that both maintainers and the community build confidence with
the changes and verify that it’s possible to achieve the respective goals with
this solution. For this reason, the Firecracker team has built a public
proof-of-concept with basic PCI passthrough and virtio-pci support:
poc/pcie.
The implementation of the POC is scrappy and would require a complete rewrite
from scratch that meets Firecracker quality and security bars, but it showcases
the main features (and drawbacks) of PCIe-passthrough and virtio-pci devices.
Before starting the actual implementation below, we need to be able to answer:
in Firecracker?
1. virtio-pci support
The first milestone will be the support of the virtio-pci transport layer for
virtio. This is not strictly required for PCIe device passthrough, but we
believe it is the easiest way to get the bulk of the PCIe code merged into
Firecracker and rust-vmm, as there shouldn’t be any concerns from the security
and over-subscription point of view.
With this milestone, Firecracker customers will be able to configure any virtual
device to be attached to the PCIe root complex instead of the MMIO bus through a
per-device config. If no device in the VM uses PCIe, no PCIe functionality will
be created and there will be no changes over the current state. PCIe support
will be a first-class citizen of Firecracker and will be compiled in the
official releases of Firecracker.
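The per-device opt-in described above could be modelled roughly as follows. This is a minimal sketch under assumptions: the `Transport` enum, the `DeviceConfig` struct and the `needs_pcie_root` helper are hypothetical names invented for illustration, not the actual Firecracker configuration schema.

```rust
/// Which bus a virtio device is attached to.
/// Hypothetical type; MMIO is the default so existing configs are unchanged.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Default)]
pub enum Transport {
    #[default]
    Mmio,
    Pci,
}

/// Minimal stand-in for a per-device configuration entry (hypothetical).
#[derive(Debug, Clone)]
pub struct DeviceConfig {
    pub id: String,
    pub transport: Transport,
}

/// The PCIe root complex is only needed when at least one device opts in,
/// so a VM with only MMIO devices creates no PCIe functionality at all.
pub fn needs_pcie_root(devices: &[DeviceConfig]) -> bool {
    devices.iter().any(|d| d.transport == Transport::Pci)
}
```

Making MMIO the default keeps the opted-out path identical to the current behaviour, in line with the "no changes over the current state" statement above.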
Maintainers will:
tests on the new branch).
A proposed high-level plan for the contributions is presented below. A more
detailed plan will need to be provided by contributors before starting the
implementation.
work with the PCIe bus.
pci).
configuration space.
starting point is cloud-hypervisor implementation.
allowing for up to 2048 interrupt lines per device
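The 2048 limit comes from the MSI-X capability itself: the Table Size field in the Message Control register is 11 bits wide and encodes the number of entries minus one. The sketch below decodes that field; the constant and function names are hypothetical and not taken from any existing crate.

```rust
/// Bits 10:0 of the MSI-X Message Control register hold the table size
/// encoded as N - 1; the upper bits carry the function mask and enable flags.
const MSIX_TABLE_SIZE_MASK: u16 = 0x07ff;

/// Each MSI-X table entry is 16 bytes: message address (low and high),
/// message data, and vector control.
const MSIX_ENTRY_SIZE: usize = 16;

/// Number of MSI-X vectors advertised by a Message Control value (1..=2048).
pub fn msix_table_entries(message_control: u16) -> u16 {
    (message_control & MSIX_TABLE_SIZE_MASK) + 1
}

/// Size in bytes of the MSI-X table that the device model must back.
pub fn msix_table_bytes(message_control: u16) -> usize {
    usize::from(msix_table_entries(message_control)) * MSIX_ENTRY_SIZE
}
```

A fully populated table therefore occupies 32 KiB of BAR space, which a device model needs to budget for when laying out its BARs.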
Open questions:
using rust-vmm crates not yet used in Firecracker (vm-devices, vm-allocator,
...)? How much work will it be to refactor FC device management to start using
those crates as well?
devices?
2. PCIe-passthrough support design
The second milestone will be the design of the support of VFIO-based
PCIe-passthrough which will allow passing to the guest any physical PCIe device
from the host. This design will need to answer the still open questions around
snapshot/resume and VM oversubscriptability, and will guide the implementation
of the following milestones.
In particular, the main problems to solve are:
kernel address space to remove sensitive information from it, protecting it
from speculative execution attacks.
hotplugging?
To enable prototyping of this milestone, maintainers will set up test artifacts
and infrastructure to test on Nvidia GPUs on PR and nightly. Maintainers will
also start early consultation with Amazon Security to identify additional
requirements.
3. Basic PCIe-passthrough support implementation
This proposed milestone will cover the basic implementation of PCIe
device-passthrough via VFIO. With this milestone, Firecracker customers will be
able to attach any number of VFIO devices of any kind to the VM before boot. However,
customers will not be able to oversubscribe memory of VMs with PCI-passthrough
devices, as the entire guest physical memory needs to be allocated for DMA. It
should be possible, depending on the investigations in milestone 2, to
snapshot/resume a VM with an offlined VFIO device.
We expect this change to be fairly modular and self-contained as it builds upon
the first milestone, adding just an additional device type. We expect the
biggest hurdles to be the thorough security review, as it’s a change in the
current Firecracker threat model, and the considerations around its usefulness
for internal customers. Furthermore, a path forward towards full
oversubscribability needs to be identified and prototyped for this milestone to
be accepted.
Stretch Goals
Once we reach the MVP goals with the milestones above, we'll need to prioritize
the stretch goals:
Memory Oversubscription
Depending on the investigations in milestone 2, we need to implement a way to
oversubscribe memory from VMs with PCI-passthrough devices. The challenge is
that the hypervisor needs to know in advance which guest physical memory ranges
will be used by DMA.
One way to do it would be to ask the guest to configure a virtual IOMMU to
enable DMA from the device. In this case, the hypervisor will know which memory
ranges the guest is using for DMA so that they can be granularly pre-allocated.
This could be done through the virtio-iommu device. One alternative could be
PCI ATS/PRI or using a swiotlb in the guest.
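To make the virtual IOMMU idea more concrete, the sketch below shows the kind of bookkeeping a hypervisor could keep: each guest map request records a page-aligned guest-physical range, and only those ranges would need to be pre-allocated and pinned for DMA. This is a simplified illustration under assumptions. `DmaMappings` and its methods are hypothetical names, a real virtio-iommu also tracks the I/O virtual address side of each mapping, and the actual pinning through VFIO is left out entirely.

```rust
use std::collections::BTreeMap;

const PAGE_SIZE: u64 = 4096;

/// Tracks guest-physical ranges the guest has mapped for device DMA.
/// Hypothetical structure for illustration only.
#[derive(Debug, Default)]
pub struct DmaMappings {
    /// Start guest-physical address -> length in bytes (both page aligned).
    ranges: BTreeMap<u64, u64>,
}

impl DmaMappings {
    /// Records a map request. Returns false if the range is empty, unaligned,
    /// wraps around the address space, or overlaps an existing mapping.
    pub fn map(&mut self, gpa: u64, len: u64) -> bool {
        if len == 0 || gpa % PAGE_SIZE != 0 || len % PAGE_SIZE != 0 {
            return false;
        }
        let Some(end) = gpa.checked_add(len) else {
            return false;
        };
        // The closest range starting at or below `gpa` must end before it.
        if let Some((&start, &l)) = self.ranges.range(..=gpa).next_back() {
            if start + l > gpa {
                return false;
            }
        }
        // The closest range starting above `gpa` must begin at or after `end`.
        if let Some((&start, _)) = self.ranges.range(gpa..).next() {
            if start < end {
                return false;
            }
        }
        self.ranges.insert(gpa, len);
        true
    }

    /// Removes the mapping starting at `gpa`, returning its length if present.
    pub fn unmap(&mut self, gpa: u64) -> Option<u64> {
        self.ranges.remove(&gpa)
    }

    /// Total bytes that would have to stay resident for DMA.
    pub fn pinned_bytes(&self) -> u64 {
        self.ranges.values().sum()
    }
}
```

Everything outside the tracked ranges would remain eligible for oversubscription, which is the point of pushing the mapping decision to the guest.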
PCIe hotplugging
This needs to be investigated further, but it's a highly requested feature for
the containerization world (e.g. Kata Containers). One challenge to keep in mind
is the PCIe aperture size of the devices to be hotplugged, which might not be
known in advance, and which requires additional care.
Appendix
Meeting Notes
November 6, 2024
MVP, refining the acceptance criteria to narrow down the targeted use-cases.
about P2P? We are aiming for simple support of a single GPU.
we're focusing on PF, but VF is something we want and we will call it out
explicitly
cold-plugging first? In the MVP, we want to focus on simple cold plugging
with the intention to support hotplugging in the future.
their API. Also, it introduces issues around detecting PCI root port
topology as the required aperture size might not be known in advance as
it depends on GPU.
supported in the first iterations.
supporting iommufd as backend. We will look into these.
including Firecracker GPU support. Details on how they interact with hardware
devices can be found here (thanks @zvonkok):
https://github.com/kata-containers/kata-containers/blob/main/docs/design/kata-vra.md
https://docs.google.com/presentation/d/13TDKyASpMfDrVBSRj4JiU6gFeChx0ws4DTenBN1qUnA/edit?usp=sharing
Pass-through resource allocations from runtime-config (CRI) to oci-spec cri-o/cri-o#8321,
Pass-through resource allocations from runtime-config (CRI) to oci-spec containerd/containerd#10282
Next steps:
identified in the meeting [PCIe] Community Guidelines and Roadmap #4894
milestone (just virtio-pci device support, no GPU or device passthrough yet).
s3://spec.ccfc.min/firecracker-ci/v1.11-pcie-poc/$ARCH