-
Notifications
You must be signed in to change notification settings - Fork 1.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fix: do not panic if virtio device activation return Err(...) #4665
fix: do not panic if virtio device activation return Err(...) #4665
Conversation
Combine the `interrupt_evt` and `interrupt_status` methods into a single method `interrupt_trigger` that returns a `IrqTrigger` reference (which essentially combines the two objects originally returned by the status and evt methods). The advantage to this is that `IrqTrigger` exposes a `trigger_irq` method, which I'd like to use in the next commit. Signed-off-by: Patrick Roy <[email protected]>
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #4665 +/- ##
==========================================
+ Coverage 82.08% 82.11% +0.02%
==========================================
Files 255 255
Lines 31267 31261 -6
==========================================
+ Hits 25666 25670 +4
+ Misses 5601 5591 -10
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. |
Correctly mark the doc comment as an inner doc comment for the enclosing macro, instead of an outer doc comment for the second import. Signed-off-by: Patrick Roy <[email protected]>
When the guest driver sets a virtio devices status to `DRIVER_OK`, we proceed with calling `VirtioDevice::activate`. However, our MMIO transport layer assumes that this activation cannot go wrong, and calls `.expect(...)` on the result. For most devices, this is fine, as the activate method doesn't do much besides writing to an event_fd (and I can't think of a scenario in which this could go wrong). However, our vhost-user-blk device has some non-trivial logic inside of its `activate` method, which includes communication with the vhost-user-backend via a unix socket. If this unix socket gets closed early, this causes `activate` to return an error, and thus consequently a panic in the MMIO code. The virtio spec, in Section 2.2, has the following to say [1]: > The device SHOULD set DEVICE_NEEDS_RESET when it enters an error state that a reset is needed. If DRIVER_OK is set, after it sets DEVICE_NEEDS_RESET, the device MUST send a device configuration change notification to the driver. So the spec-conform way of handling an activation error is setting the `DEVICE_NEEDS_RESET` flag in the device_status field (which is what this commit does). This will fix the panic, however it will most certainly still not result in correct device operations (e.g. a vhost-user-backend dying will still be unrecoverable). This is because Firecracker does not actually implement device reset, see also firecracker-microvm#3074. Thus, the device will simply be "dead" to the guest: All operations where we currently simply abort and do nothing if the device is in the FAILED state will do the same in the DEVICE_NEEDS_RESET state. [1]: https://docs.oasis-open.org/virtio/virtio/v1.1/csprd01/virtio-v1.1-csprd01.pdf Signed-off-by: Patrick Roy <[email protected]>
Add a unittest to deal with the case where virtio device activation fails. In this case, the device state needs to be put to DEVICE_NEEDS_RESET, and an interrupt should have been generated. Signed-off-by: Patrick Roy <[email protected]>
776b969
to
53338f7
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
however it will most certainly still not result
in correct device operations (e.g. a vhost-user-backend dying will still
be unrecoverable)
When you speak about recovery, do you mean Firecracker will need to try to reconnect to the backend when the device is reset? This isn't going to solve the problem for test_vhost_user_block_disconnect
where the backend never comes back though?
Yeah, it won't solve the problem where the backend never comes online, I would expect in this case the device will just remain in either |
Log activation failures at the only call-site of `activate`, instead of inside each individual `activate` function. For this, untangle some of the `ActivationError` variants - `BadActivate` was almost exclusively used in the case where writing to the activation eventfd failed, except in the vsock device, where it was also used to indicate that the number of queues the guest gave us was wrong (which this commit factors out into its own error variant). Signed-off-by: Patrick Roy <[email protected]>
Make sure that whenever the activation of a virtio device fails, we set the `activate_fails` metric. Signed-off-by: Patrick Roy <[email protected]>
e97b8ef
to
8eb2229
Compare
When the guest driver sets a virtio devices status to
DRIVER_OK
, weproceed with calling
VirtioDevice::activate
. However, our MMIOtransport layer assumes that this activation cannot go wrong, and calls
.expect(...)
on the result. For most devices, this is fine, as theactivate method doesn't do much besides writing to an event_fd (and I
can't think of a scenario in which this could go wrong). However, our
vhost-user-blk device has some non-trivial logic inside of its
activate
method, which includes communication with thevhost-user-backend via a unix socket. If this unix socket gets closed
early, this causes
activate
to return an error, and thus consequentlya panic in the MMIO code.
The virtio spec, in Section 2.2, has the following to say [1]:
So the spec-conform way of handling an activation error is setting
the
DEVICE_NEEDS_RESET
flag in the device_status field (which is whatthis commit does).
This will fix the panic, however it will most certainly still not result
in correct device operations (e.g. a vhost-user-backend dying will still
be unrecoverable). This is because Firecracker does not actually
implement device reset, see also #3074. Thus, the device will simply be
"dead" to the guest. But at least Firecracker won't crash anymore.
License Acceptance
By submitting this pull request, I confirm that my contribution is made under
the terms of the Apache 2.0 license. For more information on following Developer
Certificate of Origin and signing off your commits, please check
CONTRIBUTING.md
.PR Checklist
PR.
CHANGELOG.md
.TODO
s link to an issue.contribution quality standards.
rust-vmm
.