Live Migration: Instruction_Abort when executing restored VM #27
Comments
Hi, have you "restored" all of the RAM regions? With RMM-1.0, we cannot "load any content into Realm memory" after the ACTIVATE step. Since you are restoring a Realm VM from a previous state, you need to make sure the entire RAM region is POPULATED (not just INIT_RIPAS) before ACTIVATE. A future RMM spec might add support for "Paging", which could let the VMM load previously captured content into an ACTIVE Realm, with some guarantees from the RMM on the contents. Looking at the logs: ESR=0x820000a5 => EC => Instruction Abort from a lower Exception level, IFSC = "Granule Protection Fault on translation table walk or hardware update of translation table, level 1". This should never happen, as the RMM must ensure that any Granule mapped in Protected space is in the Realm PAS.
I assume this is because the DRAM region is left in the ripas=EMPTY state. An instruction fetch from a RIPAS_EMPTY page is (by the RMM spec) reported to the Realm (the exception is injected), not to the Host. The Realm's exception handler also resides in RIPAS_EMPTY DRAM, so the Realm gets into an endless loop of generating instruction aborts from the first instruction of its exception handler.
I agree with Suzuki's analysis, but it is quite difficult to figure out why the abort happened, so I am thinking about how to simplify the case. As the reported abort is "Instruction Abort from a lower Exception level, IFSC = Granule Protection Fault on translation table walk or hardware update of translation table, level 1", can you run (and migrate) a Realm that runs with the stage 1 MMU disabled? A simple while(true) loop would do. (1) If the test passes OK, we'll know definitively that it is something about the stage 1 MMU (likely an EL1 sysreg misconfiguration, so e.g. you may focus your search on REC migration). (2) If it still fails, there is something more serious, but it would be easier to figure out on a simpler test.
Thanks for your help!
The VMM does have this ordering — its `rme_vm_state_change()` goes like this:

```c
static void rme_vm_state_change(void *opaque, bool running, RunState state)
{
    // .....
    /*
     * When booting an RME VM from a QemuFile snapshot,
     * it goes like this:
     */
    // Init RIPAS for the entire RAM region
    kvm_vm_ioctl(kvm_state, KVM_CAP_ARM_RME_INIT_IPA_REALM);

    // `cgs_migration()` goes like this:
    // populate every guest page in the KVM memslots using `smc_data_create`
    kvm_vm_ioctl(kvm_state, CCA_MIGRATION_LOAD_RAM);

    // load the REC registers
    kvm_vm_ioctl(kvm_state, CCA_MIGRATION_REC_LOAD, &input);

    // set Realm status to `ACTIVE`
    kvm_vm_ioctl(kvm_state, KVM_CAP_ARM_RME_ACTIVATE_REALM);
    // .....
}
```

In these two hours, I rechecked the populate problem you mentioned, and I am sure that I have populated all the pages at their correct IPAs. I did an experiment in RMM to verify this after population had completed; for example, I checked where the Guest Kernel Image is mapped and verified its contents. Therefore, I think the EXPORT and IMPORT (init and populate) of the RAM pages should be correct.
As for this, I don't know how to do further testing at the moment.
Thanks! I'll check the REC's sysregs as you suggested. By the way, these Realm VMs were being migrated in the early stages of kernel boot (simply because I don't have the patience to wait for it to finish); I don't know if this will have any impact.
I will redo some experiments according to your suggestion.
Thanks for your advice, I migrated the single-core VM. Can we now confirm that the sysregs are correct? If indeed we have encountered this situation, how should we proceed with debugging?
Please can you confirm: does the Realm still abort if the guest entry point simply spins with the MMU disabled, e.g.:

```
start:
    b start
```
Okay, sorry, I didn't think it through before and misunderstood the meaning of "disabling the MMU in a while(true) loop". I will modify the guest kernel's boot assembly and retry.
I think it worked. I modified the guest kernel's boot assembly at the entry point. Now, when the restored VM crashes, the new ESR is different. Next, I will start checking the initialization of the sysregs. Besides checking whether the values of the EL1 registers are the same before and after migration, do you have any other suggestions?
Update: now, in most cases, the ESR is `0x82000025`. However, in some abnormal cases, the ESR differs.
Update: we found that the number of host cores may have an impact on the migration result. The Realm VM can be migrated successfully when host smp=1 and guest smp=1 (all the migrated Realm VMs are single-core here). However, if we use a host QEMU with 2 cores, it encounters a GPF.
The ESR indicates the following: EC = Inst. Abort, S1PTW=1 => fault on the S1 page table walk. Have you made sure that the "RAM" was restored properly, without any errors? Are you able to provide more information / collect RMM logs to clearly pinpoint what the race condition looks like? Without proper logs, it is hard to predict what has gone wrong. Given that your case shows a GPF and a valid HPFAR (0x41dd70 => IPA = 0x41dd70000), are you able to collect the relevant calls that dealt with that IPA (DATA_LOAD = PA for the IPA, the DELEGATE calls for the PA, and also any DATA_DESTROY calls that could have been made)?
If your emulation platform supports "trace", it would be helpful to collect trace information about the TLB operations too.
One reason SMP may affect the behavior of software is that this is when the effects of caches, and of incoherency between the contents of memory and the caches, start to matter. Is the FAR value a valid address for the Realm VM?
Recently, while developing Realm VM live migration, we encountered an instruction_abort issue. The specific scenario is as follows.

When importing the Realm VM on the destination platform:

1. Use `smc_rtt_init_ripas` to set the entire RAM area of the Realm VM as unassigned RAM.
2. `delegate` the dst_granule.
3. Populate the page contents with `smc_data_create`.
4. If `smc_data_create` fails with `RMI_ERROR_RTT`, we create the missing RTT and retry.

(I've omitted the parts related to Qemu, describing only the operations in RMM here.)
After `smc_rec_enter`, the vCPU executes the first instruction pointed to by the PC, which results in an instruction_abort.

Environment:
● The simulation platform is FVP, ShrinkWrap cca-3-world.
● All components (QemuVMM, KVM and RMM) in this cca-3-world environment have new code added, but we have kept the original interfaces unchanged.
● This bug might be impracticable to reproduce, so I'll try my best to describe it.
ShrinkWrap Log:
Discussions:
The RMM spec describes the cause of instruction abort as follows:
However, for S2TTEs with a valid IPA, the states of RIPAS and HIPAS will not be checked; refer to issue #21:
By the way, if we don't populate the Realm VM's memory, only load the REC registers, and start running, the Realm VM will enter an endless loop because the RMM chooses to handle the instruction_abort itself. However, with the memory populated, the RMM forwards the instruction_abort to KVM, and the system panics. Therefore, I guess the memory import is at least partially correct.

Conclusion:
We are not concerned about privacy and performance at this stage; we only wish to verify whether the VM can restart successfully on the dst platform after populating all the plaintext-exported guest pages back to their original IPAs.

Our questions can be summarized into two: what is causing this instruction_abort, and how can we debug it further?

We sincerely appreciate your ongoing assistance. If you need more information or have any suggestions, please let me know.