olethanh
Collaborator

Related Jira ticket : ALEPH-549

Problem:
After a full server reboot, instances had no internet access and were unreachable on the network, because their network interfaces were not properly recreated.

Analysis:

When an instance restarts automatically after the server reboots, its network interface does not exist yet, so QEMU creates some kind of default network that is not set up as needed (tunneling, IP address, NAT, IPv6, etc.). As a result, the network inside the instance was non-functional.

Solution:
In the controller, wait for the network interface to be recreated by the supervisor.
In the supervisor, recreate the network interface and set it up if it does not exist.
It is done this way because it is not possible to add tuntap after the interface is created.
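The controller-side wait can be sketched as follows. This is a minimal illustration, not the code merged in this PR: the function name `wait_for_interface`, its parameters, and the choice of polling `/sys/class/net` are all assumptions made for the example.

```python
import time
from pathlib import Path


def wait_for_interface(name, timeout=60.0, poll_interval=1.0,
                       sys_net=Path("/sys/class/net")):
    """Poll until the network interface `name` exists.

    Hypothetical helper: on Linux, each network interface appears as an
    entry under /sys/class/net, so its presence is used as the signal.
    Returns True once the interface is seen, False if `timeout` seconds
    elapse first.
    """
    deadline = time.monotonic() + timeout
    while True:
        if (sys_net / name).exists():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll_interval)
```

A caller would presumably start QEMU only once `wait_for_interface("vmtap4")` returns True, and log or fail otherwise. The `sys_net` parameter exists only so the helper can be exercised against a temporary directory.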

How to test:

  • Stop the supervisor
  • Stop the instance controller (via systemctl stop aleph-vm-controller@ )
  • Delete the network interface: ip link del vmtapX (e.g. vmtap4 if it is the one the instance is using)
  • Start the instance again (systemctl start aleph-vm-controller@... )
  • Check that it is waiting for the network interface (journalctl -u aleph-vm-controller@... -f )
  • Start the supervisor again; the log should show it creating the network interface
  • The instance controller should detect it and launch QEMU
  • Check that the interface exists and has an IP address: ip a
  • SSH into the instance and check that the network is working as intended
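The "interface exists and has an IP address" check above could also be scripted. The sketch below assumes an iproute2 version that supports JSON output (`ip -j`); the helper names `addresses_from_ip_json` and `interface_addresses` are hypothetical and not part of aleph-vm.

```python
import json
import subprocess


def addresses_from_ip_json(text, ifname):
    """Extract (family, address, prefixlen) tuples for `ifname`.

    `text` is expected to be the JSON printed by `ip -j addr show`:
    a list of links, each with an "ifname" and an "addr_info" list.
    """
    result = []
    for link in json.loads(text):
        if link.get("ifname") != ifname:
            continue
        for info in link.get("addr_info", []):
            result.append(
                (info.get("family"), info.get("local"), info.get("prefixlen"))
            )
    return result


def interface_addresses(ifname):
    """Run `ip -j addr show dev <ifname>` and parse its output.

    Raises subprocess.CalledProcessError if `ip` exits non-zero,
    which is what happens when the interface does not exist.
    """
    out = subprocess.run(
        ["ip", "-j", "addr", "show", "dev", ifname],
        check=True, capture_output=True, text=True,
    ).stdout
    return addresses_from_ip_json(out, ifname)
```

After the supervisor has recreated the interface, `interface_addresses("vmtap4")` should return a non-empty list; an empty list means the interface exists but has no address yet.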


codecov bot commented Jul 23, 2025

Codecov Report

Attention: Patch coverage is 12.50000% with 7 lines in your changes missing coverage. Please review.

Project coverage is 63.86%. Comparing base (c978419) to head (cdebb90).

| Files with missing lines | Patch % | Lines |
|---|---|---|
| src/aleph/vm/controllers/__main__.py | 16.66% | 5 Missing ⚠️ |
| src/aleph/vm/pool.py | 0.00% | 2 Missing ⚠️ |
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #823      +/-   ##
==========================================
- Coverage   63.90%   63.86%   -0.04%     
==========================================
  Files          86       86              
  Lines        7931     7937       +6     
  Branches      706      708       +2     
==========================================
+ Hits         5068     5069       +1     
- Misses       2651     2656       +5     
  Partials      212      212              

@nesitor nesitor merged commit 91891de into main Jul 24, 2025
51 of 55 checks passed
@nesitor nesitor deleted the ol-ALEPH-549-crn-reboot-network branch July 24, 2025 10:05