Commit 933425f

Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm
Pull KVM updates from Paolo Bonzini:
 "First batch of KVM changes for 4.4.

  s390:
     A bunch of fixes and optimizations for interrupt and time handling.

  PPC:
     Mostly bug fixes.

  ARM:
     No big features, but many small fixes and prerequisites including:

      - a number of fixes for the arch-timer

      - introducing proper level-triggered semantics for the arch-timers

      - a series of patches to synchronously halt a guest (prerequisite
        for IRQ forwarding)

      - some tracepoint improvements

      - a tweak for the EL2 panic handlers

      - some more VGIC cleanups getting rid of redundant state

  x86:
     Quite a few changes:

      - support for VT-d posted interrupts (i.e. PCI devices can inject
        interrupts directly into vCPUs). This introduces a new
        component (in virt/lib/) that connects VFIO and KVM together.
        The same infrastructure will be used for ARM interrupt
        forwarding as well.

      - more Hyper-V features, though the main one Hyper-V synthetic
        interrupt controller will have to wait for 4.5. These will let
        KVM expose Hyper-V devices.

      - nested virtualization now supports VPID (same as PCID but for
        vCPUs) which makes it quite a bit faster

      - for future hardware that supports NVDIMM, there is support for
        clflushopt, clwb, pcommit

      - support for "split irqchip", i.e. LAPIC in kernel +
        IOAPIC/PIC/PIT in userspace, which reduces the attack surface of
        the hypervisor

      - obligatory smattering of SMM fixes

      - on the guest side, stable scheduler clock support was rewritten
        to not require help from the hypervisor"

* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (123 commits)
  KVM: VMX: Fix commit which broke PML
  KVM: x86: obey KVM_X86_QUIRK_CD_NW_CLEARED in kvm_set_cr0()
  KVM: x86: allow RSM from 64-bit mode
  KVM: VMX: fix SMEP and SMAP without EPT
  KVM: x86: move kvm_set_irq_inatomic to legacy device assignment
  KVM: device assignment: remove pointless #ifdefs
  KVM: x86: merge kvm_arch_set_irq with kvm_set_msi_inatomic
  KVM: x86: zero apic_arb_prio on reset
  drivers/hv: share Hyper-V SynIC constants with userspace
  KVM: x86: handle SMBASE as physical address in RSM
  KVM: x86: add read_phys to x86_emulate_ops
  KVM: x86: removing unused variable
  KVM: don't pointlessly leave KVM_COMPAT=y in non-KVM configs
  KVM: arm/arm64: Merge vgic_set_lr() and vgic_sync_lr_elrsr()
  KVM: arm/arm64: Clean up vgic_retire_lr() and surroundings
  KVM: arm/arm64: Optimize away redundant LR tracking
  KVM: s390: use simple switch statement as multiplexer
  KVM: s390: drop useless newline in debugging data
  KVM: s390: SCA must not cross page boundaries
  KVM: arm: Do not indent the arguments of DECLARE_BITMAP
  ...
2 parents a3e7531 + a3eaa86 commit 933425f

89 files changed

+2956
-1029
lines changed

Documentation/kernel-parameters.txt

+1
Original file line numberDiff line numberDiff line change
@@ -1585,6 +1585,7 @@ bytes respectively. Such letter suffixes can also be entirely omitted.
15851585
nosid disable Source ID checking
15861586
no_x2apic_optout
15871587
BIOS x2APIC opt-out request will be ignored
1588+
nopost disable Interrupt Posting
15881589

15891590
iomem= Disable strict checking of access to MMIO memory
15901591
strict regions from userspace.

Documentation/virtual/kvm/api.txt

+47-5
Original file line numberDiff line numberDiff line change
@@ -401,10 +401,9 @@ Capability: basic
401401
Architectures: x86, ppc, mips
402402
Type: vcpu ioctl
403403
Parameters: struct kvm_interrupt (in)
404-
Returns: 0 on success, -1 on error
404+
Returns: 0 on success, negative on failure.
405405

406-
Queues a hardware interrupt vector to be injected. This is only
407-
useful if in-kernel local APIC or equivalent is not used.
406+
Queues a hardware interrupt vector to be injected.
408407

409408
/* for KVM_INTERRUPT */
410409
struct kvm_interrupt {
@@ -414,7 +413,14 @@ struct kvm_interrupt {
414413

415414
X86:
416415

417-
Note 'irq' is an interrupt vector, not an interrupt pin or line.
416+
Returns: 0 on success,
417+
-EEXIST if an interrupt is already enqueued
418+
-EINVAL if the irq number is invalid
419+
-ENXIO if the PIC is in the kernel
420+
-EFAULT if the pointer is invalid
421+
422+
Note 'irq' is an interrupt vector, not an interrupt pin or line. This
423+
ioctl is useful if the in-kernel PIC is not used.
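As an illustration of the x86 error list above, here is a minimal sketch of how a userspace VMM might turn a failed KVM_INTERRUPT call into a readable message. The helper name kvm_interrupt_strerror() is hypothetical and not part of the KVM API. It assumes the usual ioctl convention in which the call returns -1 and the positive error code is found in errno.

```c
#include <assert.h>
#include <errno.h>
#include <string.h>

/*
 * Hypothetical helper (not part of the KVM API): describe why a
 * KVM_INTERRUPT ioctl failed on x86. 'err' is the positive errno value
 * observed by userspace, not the negated kernel return code.
 */
static const char *kvm_interrupt_strerror(int err)
{
    assert(err >= 0);

    switch (err) {
    case 0:      return "success";
    case EEXIST: return "an interrupt is already enqueued";
    case EINVAL: return "the irq number is invalid";
    case ENXIO:  return "the PIC is in the kernel";
    case EFAULT: return "the pointer is invalid";
    default:     return strerror(err); /* anything not listed above */
    }
}
```

A caller would typically pass errno right after the ioctl returns a negative value.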
418424

419425
PPC:
420426

@@ -1598,7 +1604,7 @@ provided event instead of triggering an exit.
15981604
struct kvm_ioeventfd {
15991605
__u64 datamatch;
16001606
__u64 addr; /* legal pio/mmio address */
1601-
__u32 len; /* 1, 2, 4, or 8 bytes */
1607+
__u32 len; /* 0, 1, 2, 4, or 8 bytes */
16021608
__s32 fd;
16031609
__u32 flags;
16041610
__u8 pad[36];
@@ -1621,6 +1627,10 @@ to the registered address is equal to datamatch in struct kvm_ioeventfd.
16211627
For virtio-ccw devices, addr contains the subchannel id and datamatch the
16221628
virtqueue index.
16231629

1630+
With KVM_CAP_IOEVENTFD_ANY_LENGTH, a zero length ioeventfd is allowed, and
1631+
the kernel will ignore the length of guest write and may get a faster vmexit.
1632+
The speedup may only apply to specific architectures, but the ioeventfd will
1633+
work anyway.
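The effect of a zero length can be pictured with a small matching model. This is a sketch for illustration, not the kernel's implementation: struct ioeventfd_model and ioeventfd_model_matches() are hypothetical names, and the model assumes that a zero-length registration does not use datamatch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stddef.h>

/* Hypothetical userspace model of one registered ioeventfd. */
struct ioeventfd_model {
    uint64_t addr;          /* registered pio/mmio address */
    uint32_t len;           /* 0, 1, 2, 4, or 8 bytes */
    bool     use_datamatch; /* models the datamatch flag */
    uint64_t datamatch;
};

/* Would a guest write of 'len' bytes of 'val' at 'addr' signal the eventfd? */
static bool ioeventfd_model_matches(const struct ioeventfd_model *e,
                                    uint64_t addr, uint32_t len, uint64_t val)
{
    assert(e != NULL);

    if (addr != e->addr)
        return false;
    if (e->len == 0)
        return true;        /* zero length: the width of the write is ignored */
    if (len != e->len)
        return false;
    return !e->use_datamatch || val == e->datamatch;
}
```

In this model a zero-length entry matches on the address alone, which is what allows the length check to be skipped.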
16241634

16251635
4.60 KVM_DIRTY_TLB
16261636

@@ -3309,6 +3319,18 @@ Valid values for 'type' are:
33093319
to ignore the request, or to gather VM memory core dump and/or
33103320
reset/shutdown of the VM.
33113321

3322+
/* KVM_EXIT_IOAPIC_EOI */
3323+
struct {
3324+
__u8 vector;
3325+
} eoi;
3326+
3327+
Indicates that the VCPU's in-kernel local APIC received an EOI for a
3328+
level-triggered IOAPIC interrupt. This exit only triggers when the
3329+
IOAPIC is implemented in userspace (i.e. KVM_CAP_SPLIT_IRQCHIP is enabled);
3330+
the userspace IOAPIC should process the EOI and retrigger the interrupt if
3331+
it is still asserted. Vector is the LAPIC interrupt vector for which the
3332+
EOI was received.
3333+
33123334
/* Fix the size of the union. */
33133335
char padding[256];
33143336
};
@@ -3627,6 +3649,26 @@ struct {
36273649

36283650
KVM handlers should exit to userspace with rc = -EREMOTE.
36293651

3652+
7.5 KVM_CAP_SPLIT_IRQCHIP
3653+
3654+
Architectures: x86
3655+
Parameters: args[0] - number of routes reserved for userspace IOAPICs
3656+
Returns: 0 on success, -1 on error
3657+
3658+
Create a local apic for each processor in the kernel. This can be used
3659+
instead of KVM_CREATE_IRQCHIP if the userspace VMM wishes to emulate the
3660+
IOAPIC and PIC (and also the PIT, even though this has to be enabled
3661+
separately).
3662+
3663+
This capability also enables in kernel routing of interrupt requests;
3664+
when KVM_CAP_SPLIT_IRQCHIP is enabled, only routes of KVM_IRQ_ROUTING_MSI type are
3665+
used in the IRQ routing table. The first args[0] MSI routes are reserved
3666+
for the IOAPIC pins. Whenever the LAPIC receives an EOI for these routes,
3667+
a KVM_EXIT_IOAPIC_EOI vmexit will be reported to userspace.
3668+
3669+
Fails if VCPU has already been created, or if the irqchip is already in the
3670+
kernel (i.e. KVM_CREATE_IRQCHIP has already been called).
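To make the KVM_EXIT_IOAPIC_EOI contract concrete, the following sketch models what a userspace IOAPIC might do when it is told about an EOI: clear the delivery state of every level-triggered pin using that vector and retrigger the interrupt if the line is still asserted. All types and function names here are hypothetical, the pin count of 24 is an assumption, and model_ioapic_inject() merely stands in for whatever mechanism the VMM uses to deliver the pin's MSI route to KVM.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stddef.h>

#define MODEL_IOAPIC_PINS 24 /* assumption: pin count of the emulated IOAPIC */

struct model_ioapic_pin {
    uint8_t vector;          /* LAPIC vector programmed for this pin */
    bool    level_triggered;
    bool    line_asserted;   /* current level of the device's line */
    bool    remote_irr;      /* delivered and waiting for an EOI */
};

struct model_ioapic {
    struct model_ioapic_pin pin[MODEL_IOAPIC_PINS];
    unsigned int injected;   /* number of injections, for illustration only */
};

/* Stand-in for delivering the pin's interrupt to the in-kernel LAPIC. */
static void model_ioapic_inject(struct model_ioapic *io, unsigned int n)
{
    io->pin[n].remote_irr = true;
    io->injected++;
}

/* Called when a VCPU exits with KVM_EXIT_IOAPIC_EOI for 'vector'. */
static void model_ioapic_handle_eoi(struct model_ioapic *io, uint8_t vector)
{
    assert(io != NULL);

    for (unsigned int n = 0; n < MODEL_IOAPIC_PINS; n++) {
        struct model_ioapic_pin *p = &io->pin[n];

        if (!p->level_triggered || !p->remote_irr || p->vector != vector)
            continue;
        p->remote_irr = false;          /* the EOI has been processed */
        if (p->line_asserted)
            model_ioapic_inject(io, n); /* still asserted: retrigger */
    }
}
```

In a real VMM the vector would come from the eoi.vector field of the exit structure shown earlier in this file.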
3671+
36303672

36313673
8. Other capabilities.
36323674
----------------------
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,187 @@
1+
KVM/ARM VGIC Forwarded Physical Interrupts
2+
==========================================
3+
4+
The KVM/ARM code implements software support for the ARM Generic
5+
Interrupt Controller's (GIC's) hardware support for virtualization by
6+
allowing software to inject virtual interrupts to a VM, which the guest
7+
OS sees as regular interrupts. The code is famously known as the VGIC.
8+
9+
Some of these virtual interrupts, however, correspond to physical
10+
interrupts from real physical devices. One example could be the
11+
architected timer, which itself supports virtualization, and therefore
12+
lets a guest OS program the hardware device directly to raise an
13+
interrupt at some point in time. When such an interrupt is raised, the
14+
host OS initially handles the interrupt and must somehow signal this
15+
event as a virtual interrupt to the guest. Another example could be a
16+
passthrough device, where the physical interrupts are initially handled
17+
by the host, but the device driver for the device lives in the guest OS
18+
and KVM must therefore somehow inject a virtual interrupt on behalf of
19+
the physical one to the guest OS.
20+
21+
These virtual interrupts corresponding to a physical interrupt on the
22+
host are called forwarded physical interrupts, but are also sometimes
23+
referred to as 'virtualized physical interrupts' and 'mapped interrupts'.
24+
25+
Forwarded physical interrupts are handled slightly differently compared
26+
to virtual interrupts generated purely by a software emulated device.
27+
28+
29+
The HW bit
30+
----------
31+
Virtual interrupts are signalled to the guest by programming the List
32+
Registers (LRs) on the GIC before running a VCPU. The LR is programmed
33+
with the virtual IRQ number and the state of the interrupt (Pending,
34+
Active, or Pending+Active). When the guest ACKs and EOIs a virtual
35+
interrupt, the LR state moves from Pending to Active, and finally to
36+
inactive.
37+
38+
The LRs include an extra bit, called the HW bit. When this bit is set,
39+
KVM must also program an additional field in the LR, the physical IRQ
40+
number, to link the virtual with the physical IRQ.
41+
42+
When the HW bit is set, KVM must EITHER set the Pending OR the Active
43+
bit, never both at the same time.
44+
45+
Setting the HW bit causes the hardware to deactivate the physical
46+
interrupt on the physical distributor when the guest deactivates the
47+
corresponding virtual interrupt.
48+
49+
50+
Forwarded Physical Interrupts Life Cycle
51+
----------------------------------------
52+
53+
The state of forwarded physical interrupts is managed in the following way:
54+
55+
- The physical interrupt is acked by the host, and becomes active on
56+
the physical distributor (*).
57+
- KVM sets the LR.Pending bit, because this is the only way the GICV
58+
interface is going to present it to the guest.
59+
- LR.Pending will stay set as long as the guest has not acked the interrupt.
60+
- LR.Pending transitions to LR.Active on the guest read of the IAR, as
61+
expected.
62+
- On guest EOI, the *physical distributor* active bit gets cleared,
63+
but the LR.Active is left untouched (set).
64+
- KVM clears the LR on VM exits when the physical distributor
65+
active state has been cleared.
66+
67+
(*): The host handling is slightly more complicated. For some forwarded
68+
interrupts (shared), KVM directly sets the active state on the physical
69+
distributor before entering the guest, because the interrupt is never actually
70+
handled on the host (see details on the timer as an example below). For other
71+
forwarded interrupts (non-shared) the host does not deactivate the interrupt
72+
when the host ISR completes, but leaves the interrupt active until the guest
73+
deactivates it. Leaving the interrupt active is allowed, because Linux
74+
configures the physical GIC with EOIMode=1, which causes EOI operations to
75+
perform a priority drop allowing the GIC to receive other interrupts of the
76+
default priority.
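The life cycle listed above can be written down as a tiny state machine. This is only a sketch of the documented transitions, with hypothetical names throughout. Because Pending and Active are never set together when the HW bit is set, a three-value enum is enough to model the LR state.

```c
#include <assert.h>
#include <stdbool.h>

/* With the HW bit set, Pending+Active is never used, so three states suffice. */
enum lr_state { LR_INACTIVE, LR_PENDING, LR_ACTIVE };

/* Hypothetical model of one forwarded physical interrupt. */
struct fwd_irq_model {
    enum lr_state lr;          /* state held in the list register */
    bool          phys_active; /* active state on the physical distributor */
};

/* Host acks the physical interrupt and KVM sets LR.Pending. */
static void fwd_inject(struct fwd_irq_model *m)
{
    m->phys_active = true;
    m->lr = LR_PENDING;
}

/* Guest reads the IAR: LR.Pending becomes LR.Active. */
static void fwd_guest_ack(struct fwd_irq_model *m)
{
    assert(m->lr == LR_PENDING);
    m->lr = LR_ACTIVE;
}

/* Guest EOI: the physical active bit is cleared, LR.Active stays set. */
static void fwd_guest_eoi(struct fwd_irq_model *m)
{
    assert(m->lr == LR_ACTIVE);
    m->phys_active = false;
}

/* On VM exit, KVM clears the LR once the physical active state is gone. */
static void fwd_vm_exit(struct fwd_irq_model *m)
{
    if (m->lr == LR_ACTIVE && !m->phys_active)
        m->lr = LR_INACTIVE;
}
```

Calling fwd_vm_exit() before the guest EOI leaves the LR untouched, which mirrors the rule that the LR is only cleared after the physical distributor's active state has been cleared.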
77+
78+
79+
Forwarded Edge and Level Triggered PPIs and SPIs
80+
------------------------------------------------
81+
Forwarded physical interrupts injected should always be active on the
82+
physical distributor when injected to a guest.
83+
84+
Level-triggered interrupts will keep the interrupt line to the GIC
85+
asserted, typically until the guest programs the device to deassert the
86+
line. This means that the interrupt will remain pending on the physical
87+
distributor until the guest has reprogrammed the device. Since we
88+
always run the VM with interrupts enabled on the CPU, a pending
89+
interrupt will exit the guest as soon as we switch into the guest,
90+
preventing the guest from ever making progress as the process repeats
91+
over and over. Therefore, the active state on the physical distributor
92+
must be set when entering the guest, preventing the GIC from forwarding
93+
the pending interrupt to the CPU. As soon as the guest deactivates the
94+
interrupt, the physical line is sampled by the hardware again and the host
95+
takes a new interrupt if and only if the physical line is still asserted.
96+
97+
Edge-triggered interrupts do not exhibit the same problem with
98+
preventing guest execution that level-triggered interrupts do. One
99+
option is to not use HW bit at all, and inject edge-triggered interrupts
100+
from a physical device as pure virtual interrupts. But that would
101+
potentially slow down handling of the interrupt in the guest, because a
102+
physical interrupt occurring in the middle of the guest ISR would
103+
preempt the guest for the host to handle the interrupt. Additionally,
104+
if you configure the system to handle interrupts on a separate physical
105+
core from that running your VCPU, you still have to interrupt the VCPU
106+
to queue the pending state onto the LR, even though the guest won't use
107+
this information until the guest ISR completes. Therefore, the HW
108+
bit should always be set for forwarded edge-triggered interrupts. With
109+
the HW bit set, the virtual interrupt is injected and additional
110+
physical interrupts occurring before the guest deactivates the interrupt
111+
simply mark the state on the physical distributor as Pending+Active. As
112+
soon as the guest deactivates the interrupt, the host takes another
113+
interrupt if and only if there was a physical interrupt between injecting
114+
the forwarded interrupt to the guest and the guest deactivating the
115+
interrupt.
116+
117+
Consequently, whenever we schedule a VCPU with one or more LRs with the
118+
HW bit set, the interrupt must also be active on the physical
119+
distributor.
120+
121+
122+
Forwarded LPIs
123+
--------------
124+
LPIs, introduced in GICv3, are always edge-triggered and do not have an
125+
active state. They become pending when a device signals them, and as
126+
soon as they are acked by the CPU, they are inactive again.
127+
128+
It therefore doesn't make sense, and is not supported, to set the HW bit
129+
for physical LPIs that are forwarded to a VM as virtual interrupts,
130+
typically virtual SPIs.
131+
132+
For LPIs, there is no other choice than to preempt the VCPU thread if
133+
necessary, and queue the pending state onto the LR.
134+
135+
136+
Putting It Together: The Architected Timer
137+
------------------------------------------
138+
The architected timer is a device that signals interrupts with level
139+
triggered semantics. The timer hardware is directly accessed by VCPUs
140+
which program the timer to fire at some point in time. Each VCPU on a
141+
system programs the timer to fire at different times, and therefore the
142+
hardware is multiplexed between multiple VCPUs. This is implemented by
143+
context-switching the timer state along with each VCPU thread.
144+
145+
However, this means that a scenario like the following is entirely
146+
possible, and in fact, typical:
147+
148+
1. KVM runs the VCPU
149+
2. The guest programs the timer to fire in T+100
150+
3. The guest is idle and calls WFI (wait-for-interrupts)
151+
4. The hardware traps to the host
152+
5. KVM stores the timer state to memory and disables the hardware timer
153+
6. KVM schedules a soft timer to fire in T+(100 - time since step 2)
154+
7. KVM puts the VCPU thread to sleep (on a waitqueue)
155+
8. The soft timer fires, waking up the VCPU thread
156+
9. KVM reprograms the timer hardware with the VCPU's values
157+
10. KVM marks the timer interrupt as active on the physical distributor
158+
11. KVM injects a forwarded physical interrupt to the guest
159+
12. KVM runs the VCPU
160+
161+
Notice that KVM injects a forwarded physical interrupt in step 11 without
162+
the corresponding interrupt having actually fired on the host. That is
163+
exactly why we mark the timer interrupt as active in step 10, because
164+
the active state on the physical distributor is part of the state
165+
belonging to the timer hardware, which is context-switched along with
166+
the VCPU thread.
167+
168+
If the guest does not idle because it is busy, the flow looks like this
169+
instead:
170+
171+
1. KVM runs the VCPU
172+
2. The guest programs the timer to fire in T+100
173+
3. At T+100 the timer fires and a physical IRQ causes the VM to exit
174+
(note that this initially only traps to EL2 and does not run the host ISR
175+
until KVM has returned to the host).
176+
4. With interrupts still disabled on the CPU coming back from the guest, KVM
177+
stores the virtual timer state to memory and disables the virtual hw timer.
178+
5. KVM looks at the timer state (in memory) and injects a forwarded physical
179+
interrupt because it concludes the timer has expired.
180+
6. KVM marks the timer interrupt as active on the physical distributor
181+
7. KVM enables the timer, enables interrupts, and runs the VCPU
182+
183+
Notice that again the forwarded physical interrupt is injected to the
184+
guest without having actually been handled on the host. In this case it
185+
is because the physical interrupt is never actually seen by the host because the
186+
timer is disabled upon guest return, and the virtual forwarded interrupt is
187+
injected on the KVM guest entry path.
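Both timer flows above rest on two small computations over the saved timer state: deciding whether the timer has already expired, and working out how long a soft timer should run when it has not. The sketch below illustrates them under stated assumptions. The structure and function names are hypothetical, time is expressed in counter ticks, and the control bits follow the architected timer's control register layout (ENABLE in bit 0, IMASK in bit 1).

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stddef.h>

#define TIMER_CTL_ENABLE (1u << 0) /* timer enabled */
#define TIMER_CTL_IMASK  (1u << 1) /* timer interrupt masked */

/* Hypothetical copy of the guest's timer state saved to memory. */
struct saved_timer {
    uint32_t ctl;  /* control register value */
    uint64_t cval; /* compare value, in counter ticks */
};

/* Has the saved timer expired, so that an interrupt should be injected? */
static bool saved_timer_expired(const struct saved_timer *t, uint64_t now)
{
    assert(t != NULL);

    if (!(t->ctl & TIMER_CTL_ENABLE) || (t->ctl & TIMER_CTL_IMASK))
        return false;
    return t->cval <= now;
}

/* Ticks left until expiry: the delay for the soft timer in the idle flow. */
static uint64_t soft_timer_delay(const struct saved_timer *t, uint64_t now)
{
    assert(t != NULL);

    return t->cval > now ? t->cval - now : 0;
}
```

For a timer programmed to fire at tick 100 and a guest that blocks at tick 40, the soft timer would be armed for 60 ticks, matching the "100 - time since step 2" expression in the idle flow.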

Documentation/virtual/kvm/devices/arm-vgic.txt

+10-8
Original file line numberDiff line numberDiff line change
@@ -44,28 +44,29 @@ Groups:
4444
Attributes:
4545
The attr field of kvm_device_attr encodes two values:
4646
bits: | 63 .... 40 | 39 .. 32 | 31 .... 0 |
47-
values: | reserved | cpu id | offset |
47+
values: | reserved | vcpu_index | offset |
4848

4949
All distributor regs are (rw, 32-bit)
5050

5151
The offset is relative to the "Distributor base address" as defined in the
5252
GICv2 specs. Getting or setting such a register has the same effect as
53-
reading or writing the register on the actual hardware from the cpu
54-
specified with cpu id field. Note that most distributor fields are not
55-
banked, but return the same value regardless of the cpu id used to access
56-
the register.
53+
reading or writing the register on the actual hardware from the cpu whose
54+
index is specified with the vcpu_index field. Note that most distributor
55+
fields are not banked, but return the same value regardless of the
56+
vcpu_index used to access the register.
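The attr layout in the table above (vcpu_index in bits 39..32, offset in bits 31..0) can be packed and unpacked with a few shifts. The macro and function names below are hypothetical and chosen for this sketch only. Treating the reserved bits 63..40 as zero is also an assumption.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical names; the bit positions come from the table above. */
#define VGIC_ATTR_VCPU_SHIFT  32
#define VGIC_ATTR_VCPU_MASK   0xffULL
#define VGIC_ATTR_OFFSET_MASK 0xffffffffULL

/* Build the attr value for a register access on behalf of one VCPU. */
static uint64_t vgic_attr_encode(uint8_t vcpu_index, uint32_t offset)
{
    return ((uint64_t)vcpu_index << VGIC_ATTR_VCPU_SHIFT) | offset;
}

/* Extract the vcpu_index field (bits 39..32). */
static uint8_t vgic_attr_vcpu_index(uint64_t attr)
{
    assert((attr >> 40) == 0); /* assumption: reserved bits are zero */
    return (uint8_t)((attr >> VGIC_ATTR_VCPU_SHIFT) & VGIC_ATTR_VCPU_MASK);
}

/* Extract the register offset (bits 31..0). */
static uint32_t vgic_attr_offset(uint64_t attr)
{
    return (uint32_t)(attr & VGIC_ATTR_OFFSET_MASK);
}
```

The resulting value is what would be placed in the attr field of struct kvm_device_attr.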
5757
Limitations:
5858
- Priorities are not implemented, and registers are RAZ/WI
5959
- Currently only implemented for KVM_DEV_TYPE_ARM_VGIC_V2.
6060
Errors:
61-
-ENODEV: Getting or setting this register is not yet supported
61+
-ENXIO: Getting or setting this register is not yet supported
6262
-EBUSY: One or more VCPUs are running
63+
-EINVAL: Invalid vcpu_index supplied
6364

6465
KVM_DEV_ARM_VGIC_GRP_CPU_REGS
6566
Attributes:
6667
The attr field of kvm_device_attr encodes two values:
6768
bits: | 63 .... 40 | 39 .. 32 | 31 .... 0 |
68-
values: | reserved | cpu id | offset |
69+
values: | reserved | vcpu_index | offset |
6970

7071
All CPU interface regs are (rw, 32-bit)
7172

@@ -91,8 +92,9 @@ Groups:
9192
- Priorities are not implemented, and registers are RAZ/WI
9293
- Currently only implemented for KVM_DEV_TYPE_ARM_VGIC_V2.
9394
Errors:
94-
-ENODEV: Getting or setting this register is not yet supported
95+
-ENXIO: Getting or setting this register is not yet supported
9596
-EBUSY: One or more VCPUs are running
97+
-EINVAL: Invalid vcpu_index supplied
9698

9799
KVM_DEV_ARM_VGIC_GRP_NR_IRQS
98100
Attributes:

Documentation/virtual/kvm/locking.txt

+12
Original file line numberDiff line numberDiff line change
@@ -166,3 +166,15 @@ Comment: The srcu read lock must be held while accessing memslots (e.g.
166166
MMIO/PIO address->device structure mapping (kvm->buses).
167167
The srcu index can be stored in kvm_vcpu->srcu_idx per vcpu
168168
if it is needed by multiple functions.
169+
170+
Name: blocked_vcpu_on_cpu_lock
171+
Type: spinlock_t
172+
Arch: x86
173+
Protects: blocked_vcpu_on_cpu
174+
Comment: This is a per-CPU lock and it is used for VT-d posted-interrupts.
175+
When VT-d posted-interrupts is supported and the VM has assigned
176+
devices, we put the blocked vCPU on the list blocked_vcpu_on_cpu
177+
protected by blocked_vcpu_on_cpu_lock. When VT-d hardware issues
178+
a wakeup notification event because an external interrupt from an
179+
assigned device has arrived, we find the vCPU on the list and
180+
wake it up.

MAINTAINERS

+7
Original file line numberDiff line numberDiff line change
@@ -11348,6 +11348,13 @@ L: [email protected]
1134811348
S: Maintained
1134911349
F: drivers/net/ethernet/via/via-velocity.*
1135011350

11351+
VIRT LIB
11352+
M: Alex Williamson <[email protected]>
11353+
M: Paolo Bonzini <[email protected]>
11354+
11355+
S: Supported
11356+
F: virt/lib/
11357+
1135111358
VIVID VIRTUAL VIDEO DRIVER
1135211359
M: Hans Verkuil <[email protected]>
1135311360
