Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

#4547: Refactor vmm builder code to simplify logic that creates the microVM to boot #4910

Open
wants to merge 30 commits into
base: main
Choose a base branch
from

Conversation

tommady
Copy link
Contributor

@tommady tommady commented Nov 13, 2024

fix(4547): Refactor vmm builder code to simplify logic

Changes

Refactoring Vmm builder code

Reason

close issue #4547

...

License Acceptance

By submitting this pull request, I confirm that my contribution is made under
the terms of the Apache 2.0 license. For more information on following Developer
Certificate of Origin and signing off your commits, please check
CONTRIBUTING.md.

PR Checklist

  • I have read and understand CONTRIBUTING.md.
  • I have run tools/devtool checkstyle to verify that the PR passes the
    automated style checks.
  • I have described what is done in these changes, why they are needed, and
    how they are solving the problem in a clear and encompassing way.
  • I have updated any relevant documentation (both in code and in the docs)
    in the PR.
  • I have mentioned all user-facing changes in CHANGELOG.md.
  • If a specific issue led to this PR, this PR closes the issue.
  • When making API changes, I have followed the
    Runbook for Firecracker API changes.
  • I have tested all new and changed functionalities in unit tests and/or
    integration tests.
  • I have linked an issue to every new TODO.

  • This functionality cannot be added in rust-vmm.

@tommady tommady force-pushed the fix-4547 branch 3 times, most recently from 38e9537 to 6b97c07 Compare November 15, 2024 21:50
@tommady tommady marked this pull request as ready for review November 15, 2024 21:56
Copy link
Contributor

@ShadowCurse ShadowCurse left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @tommady, thx for doing the PR. Overall changes seems solid, but I left some small notes here and there. Also can you change names of the commits to refactor(builder): .... This is usually how we treat refactor commits.

src/vmm/src/builder.rs Outdated Show resolved Hide resolved
src/vmm/src/builder.rs Outdated Show resolved Hide resolved
src/vmm/src/builder.rs Outdated Show resolved Hide resolved
src/vmm/src/builder.rs Outdated Show resolved Hide resolved
src/vmm/src/builder.rs Outdated Show resolved Hide resolved
@tommady
Copy link
Contributor Author

tommady commented Nov 22, 2024

Hi @ShadowCurse

I’ve addressed your comments, with one note regarding the set_stdout_nonblocking function. Please review and provide your guidance when you have a moment.

Thank you!

Copy link
Contributor

@ShadowCurse ShadowCurse left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Now you need to make sure the changes you made pass our CI. You can look at the buildkite/firecracker-pr line at the bottom of the PR and click details to look at the CI status. Start with making sure changes do compile. I can see that on x86_86 only errors are absence of doc comments for public functions. For aarch64, well, it does not even compile. If you don't have access to arm system, you can try to use cross to cross compile.
After everything compiles, please move all changes to corresponding commits.
Also please rebase the PR so it only includes your commits.

@tommady
Copy link
Contributor Author

tommady commented Nov 23, 2024

Hi @ShadowCurse

Thank you for your patience and thorough review—it has been an incredible learning experience for me!
I’ve revisited your feedback and ensured that the entire architecture compiles cleanly, with no documentation warnings.
Please take a look whenever you have the time. 😊

Copy link
Contributor

@ShadowCurse ShadowCurse left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Now that it at least compiles, you need to make sure it also does function as expected. If you look at the Buildkite pipelines output: https://buildkite.com/firecracker/firecracker-pr/builds/11856#0193640b-6e7c-4f49-bbcf-cb60c1c946ef/62-392 you will see that there are some unit tests that fail in the same place src/vmm/src/builder.rs:519:51. Quick look around and it seems the issue is that when pio_device_manager registers itself, it registers interrupts for a VM, but the interrupt controller is not yet created at that point. So you need to either move creation of interrupt_controller before pio_device_manager creation, or move this .register_device(..) call.

You can reproduce all of this errors if you run CI locally. You can find exact commands CI is running looking at the very beginning of the logs like this: https://buildkite.com/firecracker/firecracker-pr/builds/11856#0193640b-6e7c-4f49-bbcf-cb60c1c946ef/62-63

Start by running unit tests. There are a couple of ways to do this:

  • ./tools/devtool test -- integration_tests/build/test_unittests.py - this will use the docker container to compile and run tests
  • Compile test with cargo test -p vmm --no-run and then run with sudo ./build/cargo_target/debug/deps/vmm-.... (there will be a full path at the end of cargo test) With this method some network related unit tests will fail most likely, but you can skip those.

Also you can make sure your changes are sound by to simply trying to launch a VM with them. We have a doc for this: https://github.com/firecracker-microvm/firecracker/blob/main/docs/getting-started.md (in the section about getting a rootfs you can skip the ssh key creation if you don't want to use/test network)

@tommady
Copy link
Contributor Author

tommady commented Nov 27, 2024

Hi @ShadowCurse,
Thank you once again for your patience and for crafting such a detailed guide! 🙏
I’ve fixed the unit test as per your suggestions and ran the command:

./tools/devtool test -- integration_tests/build/test_unittests.py
...
integration_tests/build/test_unittests.py::test_unittests PASSED                                                                                         [ 50%]
integration_tests/build/test_unittests.py::test_benchmarks_compile PASSED                                                                                [100%]

All tests passed successfully! 🎉

When you have a moment, please review it again.

Copy link

codecov bot commented Nov 27, 2024

Codecov Report

Attention: Patch coverage is 89.05775% with 36 lines in your changes missing coverage. Please review.

Project coverage is 84.17%. Comparing base (d7734e2) to head (659b0ec).
Report is 2 commits behind head on main.

Current head 659b0ec differs from pull request most recent head 02d3905

Please upload reports for the commit 02d3905 to get more accurate results.

Files with missing lines Patch % Lines
src/vmm/src/builder.rs 88.85% 36 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #4910      +/-   ##
==========================================
+ Coverage   84.07%   84.17%   +0.09%     
==========================================
  Files         251      251              
  Lines       28059    28165     +106     
==========================================
+ Hits        23592    23708     +116     
+ Misses       4467     4457      -10     
Flag Coverage Δ
5.10-c5n.metal 84.53% <66.04%> (-0.12%) ⬇️
5.10-m5n.metal 84.51% <66.04%> (-0.12%) ⬇️
5.10-m6a.metal 83.82% <66.04%> (-0.13%) ⬇️
5.10-m6g.metal 80.70% <68.00%> (-0.05%) ⬇️
5.10-m6i.metal 84.50% <66.04%> (-0.12%) ⬇️
5.10-m7g.metal 80.70% <68.00%> (-0.05%) ⬇️
6.1-c5n.metal 84.53% <66.04%> (-0.12%) ⬇️
6.1-m5n.metal 84.51% <66.04%> (-0.13%) ⬇️
6.1-m6a.metal 83.82% <66.04%> (-0.12%) ⬇️
6.1-m6g.metal 80.70% <68.00%> (-0.05%) ⬇️
6.1-m6i.metal 84.51% <66.04%> (-0.12%) ⬇️
6.1-m7g.metal 80.70% <68.00%> (-0.05%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@ShadowCurse
Copy link
Contributor

With x86_64 version done, now you need to make sure aarhc64 also compiles/works. If you look at the BK run again: https://buildkite.com/firecracker/firecracker-pr/builds/11886#01936ce9-9b52-4459-ac0a-f8cfbe7a3690/62-624, https://buildkite.com/firecracker/firecracker-pr/builds/11886#01936ce9-9b62-4713-b1eb-33926eacc8b0/62-882 you will see some compilation errors in unit tests and CI test for serial port in a restored VM failing. As I mentioned before, you can use cross to cross compile FC to aarch64. It might even allow some subset of unit tests to be run locally. Or you can use qemu to create an emulated aarch64 VM and test your code there.

@tommady
Copy link
Contributor Author

tommady commented Nov 29, 2024

Oh, I see… I assumed the devtool test would handle all platform testing in one sweep. Let me double-check again.

@tommady tommady requested a review from ShadowCurse December 2, 2024 05:17
@tommady
Copy link
Contributor Author

tommady commented Dec 2, 2024

hmmm so still https://buildkite.com/firecracker/firecracker-pr/builds/11939#019386b4-8ab4-4c7e-9fd4-1507dbddbafb/63-882
I'll run an aarch64 VM to run the integration myself and try to fix it.

Copy link
Contributor

@ShadowCurse ShadowCurse left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Left couple of comments and as I think, the reason for CI failures. After addressing everything please rebase PR on top of main so only your commits are in here.

src/vmm/src/device_manager/persist.rs Show resolved Hide resolved
src/vmm/src/builder.rs Outdated Show resolved Hide resolved
src/vmm/src/builder.rs Outdated Show resolved Hide resolved
@tommady tommady requested a review from ShadowCurse December 5, 2024 04:32
src/vmm/src/builder.rs Outdated Show resolved Hide resolved
src/vmm/src/builder.rs Outdated Show resolved Hide resolved
@ShadowCurse
Copy link
Contributor

Alright, CI functional and performance tests are green. The last thing is to make commits look better. As I mentioned before you need to rebase PR on top of main branch so only your commits are visible. This should be as simple as updating you fork of Firecracker and running git rebase origin/main. After that update commits with meaningful titles and messages and squash some of them if necessary.

zulinx86 and others added 23 commits December 6, 2024 05:15
We bumped the snapshot version up twice recently, requiring users to
regenerate their snapshot, but the user action isn't clearly stated.

Signed-off-by: Takahiro Itazuri <[email protected]>

refactor(builder): Refactor vmm builder code to simplify logic

eliminate the unnecessary usage of the event_manager argument and
fix up aarch64 attach_legacy_devices_aarch64 fn

Signed-off-by: tommady <[email protected]>

refactor(builder): address comments

remove the aarch64 suffix from the attach_legacy_devices_aarch64
function and ensure that aarch64 smt is always set to false int the
configure_system_for_boot function

Signed-off-by: tommady <[email protected]>
`TcpIPv4Handler` for MMDS network stack preallocates several buffers
whose sizes are saved into a snapshot as `max_connections` and
`max_pending_resets` in `MmdsNetworkStackState`. But they are always the
same constant hardcoded values (`DEFAULT_MAX_CONNECTIONS` and
`DEFAULT_MAX_PENDING_RESETS`) as of today, which means there is no need
to save them into a snapshot. Even if we change the hardcoded sizes
across Firecracker versions, that should not be a problem. This is
because the snapshot feature does not support migration of network
connections and those buffers are initialized with empty on snapshot
restoration. When migrating from a Firecracker version with larger
buffers to another version with smaller ones, guest workloads that
worked previously might start to fail due to the less buffer spaces.
However, the issue is not a problem of the snapshot feature and it
should also occur even on a purely booted microVM (not restored from a
snapshot). Thus, it is fine to remove those fields from a snapshot.

Since this is a breaking change of the snapshot format, bumps the major
version.

Signed-off-by: Takahiro Itazuri <[email protected]>
We bumped the snapshot version up twice recently, requiring users to
regenerate their snapshot, but the user action isn't clearly stated.

Signed-off-by: Takahiro Itazuri <[email protected]>

refactor(builder): address comments

let the x86_64 and aarch64 architectures code can compile and without
warnning

Signed-off-by: tommady <[email protected]>
make the test all passed

Signed-off-by: tommady <[email protected]>
make the aarch64 test all passed

Signed-off-by: tommady <[email protected]>
refactor(builder): address comments

address comments

Signed-off-by: tommady <[email protected]>
There is some information in the Microvm class that we don't save in the
snapshot. Some tests do depend on those, so to make the booted/restored
case homogenous, make room in the snapshot to save metadata that we can
then restore.

Signed-off-by: Pablo Barbáchano <[email protected]>
Some further simplifications:
- Simplify the "_host" tests as they will provide the same result in
  both A/B situations.
- Add a second MicrovmFactory fixture with the "A" firecracker.
- Add a uvm_any fixture that includes all CPU templates and also a
  boot/restored dimension.

This decreases the need for separate tests. For example the guest test
test_check_vulnerability_files_ab can now test all variants:

2 (restored/booted) * 3 (kernels) * 9 (cpu templates) = 54 tests

Signed-off-by: Pablo Barbáchano <[email protected]>
Use a new uvm_any_booted fixture to make the test simpler.

Signed-off-by: Pablo Barbáchano <[email protected]>
Use the new uvm_any fixture to make the tests simpler

Signed-off-by: Pablo Barbáchano <[email protected]>
All tests in the file are x86_64 specific, so do the skip at the
top-level, once.

Signed-off-by: Pablo Barbáchano <[email protected]>
Make the common feature flags to get some clarity on what are the
differences between the CPUs.

Signed-off-by: Pablo Barbáchano <[email protected]>
This is the same test spread over two files. Putting it together in a
new file so they don't drift over time.

Signed-off-by: Pablo Barbáchano <[email protected]>
This is a fairly common repeated pattern, so extract it into a method to
make writing tests simpler.

Signed-off-by: Pablo Barbáchano <[email protected]>
Rewrite test_rng using uvm_any.

Signed-off-by: Pablo Barbáchano <[email protected]>
This fixture is not used anymore, as it is mostly superseded by
cpu_template_any.

Signed-off-by: Pablo Barbáchano <[email protected]>
Since we only have one rootfs, simplify the fixture to return a single
value. This will have also not show it as part of the test instance
name.

Signed-off-by: Pablo Barbáchano <[email protected]>
Add a parameter so that we can control vcpu number and memory.

Right now it uses a hardcoded value, but it can be overridden per test.

Signed-off-by: Pablo Barbáchano <[email protected]>
@tommady tommady force-pushed the fix-4547 branch 2 times, most recently from aca5101 to d807ab3 Compare December 6, 2024 05:32
…boot

Streamline the event_manager to avoid passing it through unnecessary
functions, isolating the code with distinct architectural designs.
Consolidate architecture-specific configurations exclusively within the
build_microvm_for_boot and build_microvm_from_snapshot functions for
clarity and efficiency.

Signed-off-by: tommady <[email protected]>
move interrupt controller into create_vmm_and_vcpus
and fix aarm64 integration test

Signed-off-by: tommady <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants