Commit eb2863d
olethanh and hoh authored
Problem: QEMUVM killed before shutdown command (#698)
* Problem: QEMUVM killed before shutdown command

When running the controller as a systemd service (the normal use case in prod) and stopping the service, the QEMU process was always killed BEFORE the shutdown command could be sent, so the VM could not properly clean up.

As an additional symptom, this error appeared and confused devs and users:

```
Sep 05 12:24:03 testing-hetzner python3[2468548]: 2024-09-05 12:24:03,187 | ERROR | Task exception was never retrieved
Sep 05 12:24:03 testing-hetzner python3[2468548]: future: <Task finished name='Task-3' coro=<QemuVM.stop() done, defined at /opt/aleph-vm/aleph/vm/hypervisors/qemu/qemuvm.py:149> exception=QMPCapabilitiesError()>
Sep 05 12:24:03 testing-hetzner python3[2468548]: Traceback (most recent call last):
Sep 05 12:24:03 testing-hetzner python3[2468548]: File "/opt/aleph-vm/aleph/vm/hypervisors/qemu/qemuvm.py", line 151, in stop
Sep 05 12:24:03 testing-hetzner python3[2468548]: self.send_shutdown_message()
Sep 05 12:24:03 testing-hetzner python3[2468548]: File "/opt/aleph-vm/aleph/vm/hypervisors/qemu/qemuvm.py", line 141, in send_shutdown_message
Sep 05 12:24:03 testing-hetzner python3[2468548]: client = self._get_qmpclient()
Sep 05 12:24:03 testing-hetzner python3[2468548]: ^^^^^^^^^^^^^^^^^^^^^
Sep 05 12:24:03 testing-hetzner python3[2468548]: File "/opt/aleph-vm/aleph/vm/hypervisors/qemu/qemuvm.py", line 136, in _get_qmpclient
Sep 05 12:24:03 testing-hetzner python3[2468548]: client.connect()
Sep 05 12:24:03 testing-hetzner python3[2468548]: File "/opt/aleph-vm/qmp.py", line 162, in connect
Sep 05 12:24:03 testing-hetzner python3[2468548]: return self.__negotiate_capabilities()
Sep 05 12:24:03 testing-hetzner python3[2468548]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Sep 05 12:24:03 testing-hetzner python3[2468548]: File "/opt/aleph-vm/qmp.py", line 88, in __negotiate_capabilities
Sep 05 12:24:03 testing-hetzner python3[2468548]: raise QMPCapabilitiesError
Sep 05 12:24:03 testing-hetzner python3[2468548]: qmp.QMPCapabilitiesError
Sep 05 12:24:03 testing-hetzner python3[2468548]: 2024-09-05 12:24:03,285 | WARNING | Process terminated with 0
```

Solution: Use mixed kill mode in systemd, which at first only sends the TERM signal to the main process and gives the VM time to properly clean up and shut down.

Note that sometimes the "shutdown" command is not acted upon, so stopping the process is still necessary. It seems to happen when the boot is not completed yet. So a fallback kill is done after a timeout.

---------

Co-authored-by: Hugo Herter <[email protected]>
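The "Task exception was never retrieved" message in the log above is what asyncio prints when a task created with `create_task()` fails and nothing ever reads its result. The following is a minimal sketch of that mechanism and of one way to retrieve the exception explicitly. It is not part of this commit: `ShutdownError`, `failing_stop`, and `make_on_done` are hypothetical names used only for illustration, and `ShutdownError` merely stands in for `QMPCapabilitiesError`.

```python
import asyncio
import logging

logger = logging.getLogger(__name__)


class ShutdownError(Exception):
    """Hypothetical error standing in for QMPCapabilitiesError."""


async def failing_stop() -> None:
    # Simulates a stop() whose shutdown command fails because the
    # hypervisor process is already gone.
    raise ShutdownError("hypervisor is no longer reachable")


def make_on_done(errors: list):
    """Build a done-callback that retrieves and records the task's exception."""

    def on_done(task: asyncio.Task) -> None:
        if task.cancelled():
            return
        exc = task.exception()  # reading it marks the exception as retrieved
        if exc is not None:
            errors.append(exc)
            logger.warning("stop() failed: %r", exc)

    return on_done


async def stop_and_report() -> list:
    """Run failing_stop() as a background task and collect its exception."""
    errors: list = []
    task = asyncio.get_running_loop().create_task(failing_stop())
    task.add_done_callback(make_on_done(errors))
    await asyncio.wait([task])  # waits for completion without re-raising
    return errors
```

Calling `asyncio.run(stop_and_report())` returns a list holding the single `ShutdownError`, and asyncio no longer reports an unretrieved exception for that task.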
1 parent 7ee7d04 commit eb2863d

2 files changed, +7 -0 lines changed

packaging/aleph-vm/etc/systemd/system/[email protected]

Lines changed: 6 additions & 0 deletions
@@ -11,6 +11,12 @@ WorkingDirectory=/opt/aleph-vm
 Environment=PYTHONPATH=/opt/aleph-vm/:$PYTHONPATH
 ExecStart=/usr/bin/python3 -m aleph.vm.controllers --config=/var/lib/aleph/vm/%i-controller.json
 Restart=on-failure
+# KillMode=Mixed is used so initially only the Python controller process receives the SIGTERM signal.
+# The controller catches it and sends a QEMU command to shut down the Guest VM, allowing it to clean up
+# properly and avoid disk corruption.
+# After 30s (TimeoutStopSec), if the process is still running, both the controller and subprocesses receive SIGKILL.
+KillMode=mixed
+TimeoutStopSec=30

 [Install]
 WantedBy=multi-user.target
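The stop sequence described in the unit file comments (SIGTERM first, SIGKILL once `TimeoutStopSec` expires) can be illustrated with a small Python sketch. This is only an approximation for a single child process on POSIX, not how systemd is implemented: systemd with `KillMode=mixed` sends SIGTERM to the main process and SIGKILL to every remaining process in the unit's control group. The helper names `stop_with_fallback` and `spawn_child` are hypothetical.

```python
import subprocess
import sys


def stop_with_fallback(proc: subprocess.Popen, timeout: float) -> str:
    """Send SIGTERM, wait up to `timeout` seconds, then fall back to SIGKILL.

    Returns "terminated" if the process exited after SIGTERM, or "killed"
    if the fallback was needed.
    """
    proc.terminate()  # SIGTERM on POSIX
    try:
        proc.wait(timeout=timeout)
        return "terminated"
    except subprocess.TimeoutExpired:
        proc.kill()  # SIGKILL on POSIX; cannot be caught or ignored
        proc.wait()
        return "killed"


def spawn_child(ignore_sigterm: bool) -> subprocess.Popen:
    """Start a hypothetical child that sleeps, optionally ignoring SIGTERM."""
    code = "import signal, time\n"
    if ignore_sigterm:
        code += "signal.signal(signal.SIGTERM, signal.SIG_IGN)\n"
    code += "print('ready', flush=True)\ntime.sleep(60)\n"
    proc = subprocess.Popen([sys.executable, "-c", code], stdout=subprocess.PIPE)
    proc.stdout.readline()  # block until the child has set up its signal handling
    return proc
```

A child that honours SIGTERM is reported as "terminated", while one that ignores it (similar to a guest that never acts on the shutdown command) is reported as "killed" once the timeout expires.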

src/aleph/vm/controllers/__main__.py

Lines changed: 1 addition & 0 deletions
@@ -90,6 +90,7 @@ async def handle_persistent_vm(config: Configuration, execution: MicroVM | QemuV

     def callback():
         """Callback for the signal handler to stop the VM and cleanup properly on SIGTERM."""
+        logger.debug("Received SIGTERM")
         loop.create_task(execution.stop())

     loop.add_signal_handler(signal.SIGTERM, callback)
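The handler pattern in this hunk can be tried in isolation with the sketch below. It assumes a Unix event loop running in the main thread. `FakeVM` and `run_until_sigterm` are hypothetical stand-ins, not the aleph-vm classes, and the process sends SIGTERM to itself to imitate what systemd does on `systemctl stop`.

```python
import asyncio
import os
import signal


class FakeVM:
    """Hypothetical stand-in for the execution object; not the aleph-vm class."""

    def __init__(self) -> None:
        self.stopped = asyncio.Event()

    async def stop(self) -> None:
        # A real implementation would send a shutdown command to the guest here.
        await asyncio.sleep(0.01)
        self.stopped.set()


async def run_until_sigterm(vm: FakeVM) -> bool:
    """Install a SIGTERM handler that schedules vm.stop(), then wait for it."""
    loop = asyncio.get_running_loop()
    tasks = []  # keep references so the stop task is not garbage-collected

    def callback() -> None:
        tasks.append(loop.create_task(vm.stop()))

    loop.add_signal_handler(signal.SIGTERM, callback)
    try:
        # Imitate the service manager: deliver SIGTERM to this process.
        os.kill(os.getpid(), signal.SIGTERM)
        await asyncio.wait_for(vm.stopped.wait(), timeout=5)
        await asyncio.gather(*tasks)  # surfaces any exception raised by stop()
    finally:
        loop.remove_signal_handler(signal.SIGTERM)
    return vm.stopped.is_set()


async def demo() -> bool:
    # Create the FakeVM inside the running loop so its Event is bound correctly.
    return await run_until_sigterm(FakeVM())
```

Running `asyncio.run(demo())` returns `True` once the handler has fired and `stop()` has completed. Unlike the hunk above, this sketch keeps a reference to the created task and awaits it, which is a design choice of the example rather than something the commit does.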
