Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

kernel: use TPR for interrupt reentrancy protection #568

Merged
merged 2 commits into from
Dec 18, 2024

Conversation

msft-jlange
Copy link
Collaborator

Using TPR for interrupt reentrancy protection allows high-priority interrupts to be serviced while low-priority interrupts are masked. This PR introduces TPR management logic and integrates TPR management into locking mechanisms.

@msft-jlange msft-jlange force-pushed the tpr branch 2 times, most recently from fa35f13 to 0f3880c Compare December 11, 2024 17:38
@msft-jlange msft-jlange added the wait-for-review PR needs for approval by reviewers label Dec 11, 2024
Copy link
Member

@joergroedel joergroedel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall the changes look good to me, I left two comments to improve the readability.

Also, on my KVM setup, the test-in-svsm run failed with this message and stack-trace:

[SVSM] test cpu::irq_state::tests::tpr_guard_test ...
[SVSM] ERROR: Panic: CPU[0] panicked at kernel/src/cpu/irq_state.rs:445:9:
assertion `left == right` failed
  left: 0
 right: 9
[SVSM] ---BACKTRACE---:
[SVSM]   [0xffffff8000144e73]
[SVSM]   [0xffffff800013eb3a]
[SVSM]   [0xffffff80000480a9]
[SVSM]   [0xffffff80000713c1]
[SVSM]   [0xffffff800010dd01]
[SVSM]   [0xffffff80000da236]
[SVSM]   [0xffffff80000175d5]
[SVSM]   [0xffffff800009e93e]
[SVSM]   [0xffffff80000cdeea]
[SVSM]   [0xffffff80000cdef0]
[SVSM] ---END---

Decoded stack-trace is:

core::panicking::assert_failed_inner
??:?
core::panicking::assert_failed
??:?
svsm::cpu::irq_state::tests::tpr_guard_test
/home/joro/src/svsm/kernel/src/cpu/irq_state.rs:?
svsm::cpu::irq_state::tests::tpr_guard_test::{{closure}}
/home/joro/src/svsm/kernel/src/cpu/irq_state.rs:428
core::ops::function::FnOnce::call_once
/rustc/bc5cf994db9fb46712cefd89f78ad7fc51f184a2/library/core/src/ops/function.rs:250
svsm::testing::svsm_test_runner
/home/joro/src/svsm/kernel/src/testing.rs:80
svsm::test_main
/home/joro/src/svsm/kernel/src/lib.rs:?
svsm_main
/home/joro/src/svsm/kernel/src/svsm.rs:335
svsm::task::tasks::task_exit
/home/joro/src/svsm/kernel/src/task/tasks.rs:787

kernel/src/cpu/idt/svsm.rs Outdated Show resolved Hide resolved
kernel/src/cpu/irq_state.rs Outdated Show resolved Hide resolved
@joergroedel joergroedel added wait-for-update PR is waiting to be updated to address review comments in-review PR is under active review and not yet approved and removed wait-for-review PR needs for approval by reviewers labels Dec 16, 2024
@msft-jlange
Copy link
Collaborator Author

Overall the changes look good to me, I left two comments to improve the readability.

That suggests a fatal problem with TPR use in an environment that doesn't support the use of interrupts.

I cannot think of any reason that the SVSM would observe a TPR of 9, particularly if the firmware has not yet started executing. But regardless, if vTPR is shared between the SVSM and the guest OS, then it is not usable for any reason in the SVSM, and is certainly not usable for lock management.

For the sake of making progress on TPR and making it usable for IPIs (another scenario that depends on the use of SVSM interrupts), I will plan to split out the scheduler lock changes into a separate PR which we will postpone until SVSM interrupts are fully supported. I will also change the unit test so it only executes on systems that support SVSM interrupts.

I would be grateful if you could try to figure out where the raised TPR is coming from, since that only happens on KVM systems. We should at least understand why it appears random in the test environment, since that is unexpected.

@msft-jlange
Copy link
Collaborator Author

I would be grateful if you could try to figure out where the raised TPR is coming from, since that only happens on KVM systems. We should at least understand why it appears random in the test environment, since that is unexpected.

It turns out that the 9 issue was my bug, and it is now fixed. However, I still believe that TPR cannot be used in lock handling if the kernel cannot make use of interrupts, so I will still defer the scheduler changes to a future point in time.

@msft-jlange msft-jlange removed the wait-for-update PR is waiting to be updated to address review comments label Dec 16, 2024
@joergroedel
Copy link
Member

I just ran into a fundamental problem with KVM and these patches. As it turns out, with the 6.8 kernel the TPR is still shared across VMPL levels, which makes the SVSM crash when Linux changes the TPR. The road forward here is to update the SVSM development branches for kernel and QEMU.

@msft-jlange
Copy link
Collaborator Author

I just ran into a fundamental problem with KVM and these patches. As it turns out, with the 6.8 kernel the TPR is still shared across VMPL levels, which makes the SVSM crash when Linux changes the TPR. The road forward here is to update the SVSM development branches for kernel and QEMU.

In what way does that cause this change to break? I've removed the changes to consume TPR-safe locks, so the only reference to TPR should be in interrupt dispatch, and since the SVSM does not support interrupts when running under KVM, there should be no issue. I would expect that everything now functions correctly under KVM. Are you still seeing issues?

@joergroedel
Copy link
Member

In what way does that cause this change to break? I've removed the changes to consume TPR-safe locks, so the only reference to TPR should be in interrupt dispatch, and since the SVSM does not support interrupts when running under KVM, there should be no issue. I would expect that everything now functions correctly under KVM. Are you still seeing issues?

The assert for tpr == 0 in the scheduling code triggers at some point during guest boot, because apparently Linux set the TPR to 1 before calling into the SVSM. I have updated the Linux and kernel branches and the documentation to use the 6.11 based kernel and 9.0 QEMU with Roys patches. With that KVM version this PR works, but we need to give people some time to update their environments before merging this PR (because it will break things when used on the old kernel).

The priority scheme inherently associated with interrupts permits
masking interrupts of lower priority classes while handling
higher-priority interrupts. This makes it possible to enable interrupts
while in an interrupt handler, allowing high-priority interrupts to be
handled while masking lower-priorty interrupts.  By raising TPR during
interrupt handling, it becomes possible to be safely preempted by
higher-priority interrupts during interrupt handling.

This change also defines a mechanism where code unrelated to interrupt
handlers can temporarily raise TPR to mask low-priority interrupts while
permitting the delivery of high-priority interrupts.  In the future, it
will be possible to enable spin locks or other locks to be associated
with TPR so that a lock can safely be acquired and prevent reentracncy
by low-priority interrupts while still permitting the handilng of
high-priority interrupts.

Signed-off-by: Jon Lange <[email protected]>
Using TPR for reentrancy protection means that higher priority
interrupts can continue to be delivered while locks are held.  Each
TPR-protected lock defines the TPR with which it is associated, so any
attempt to acquire the lock from a higher priority interrupt will panic
due to TPR inversion.

Signed-off-by: Jon Lange <[email protected]>
@msft-jlange
Copy link
Collaborator Author

The assert for tpr == 0 in the scheduling code triggers at some point during guest boot, because apparently Linux set the TPR to 1 before calling into the SVSM.

Thanks for identifying the issue. I've changed the assert so it will only fire when interrupts are usable (since the assert only matters if interrupts are usable). I believe this PR will now work without requiring a kernel change.

@joergroedel
Copy link
Member

Thanks for identifying the issue. I've changed the assert so it will only fire when interrupts are usable (since the assert only matters if interrupts are usable). I believe this PR will now work without requiring a kernel change.

Thanks for updating the PR. Testing looks good now.

@joergroedel joergroedel merged commit 08c12fd into coconut-svsm:main Dec 18, 2024
4 checks passed
@msft-jlange msft-jlange deleted the tpr branch December 18, 2024 15:22
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
in-review PR is under active review and not yet approved
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants