Wait for submissions to complete on `Queue` drop #6413

teoxoy · 2024-10-15T14:11:33Z

This PR addresses leaks caused by circular references: the `Device` still holds strong references to resources after it is removed from the registry while submissions are still active. `Global::device_drop` wrongly assumed that `device_poll` with `Maintain::Wait` had been called, but this is not a documented invariant and only `wgpu` was upholding it.

Instead of calling device_poll in device_drop (which would solve the issue) this PR takes a different approach since we want to remove the registries in the future (#5121):

- It moves the `LifetimeTracker` and `PendingWrites` into the `Queue` so that we never have circular references.
- It adds the wait for active submissions in `Queue`'s `Drop` impl.
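To illustrate the ownership change, here is a minimal sketch. It is not the actual wgpu-core code: `Device`, `Buffer`, `LifetimeTracker` and `Queue` are simplified stand-ins, the `in_flight` counter and `dropped` flag are hypothetical, and the wait is simulated by draining a counter rather than waiting on a real fence.

```rust
use std::sync::atomic::{AtomicBool, AtomicUsize, Ordering};
use std::sync::{Arc, Mutex};

/// Simplified stand-in for the device. It holds no `Arc`s to resources.
struct Device {
    /// Hypothetical counter of submissions the simulated GPU has not finished.
    in_flight: AtomicUsize,
    /// Hypothetical flag so the example can observe when the device is freed.
    dropped: Arc<AtomicBool>,
}

impl Drop for Device {
    fn drop(&mut self) {
        self.dropped.store(true, Ordering::SeqCst);
    }
}

/// Simplified resource; like real resources it keeps the device alive.
struct Buffer {
    _device: Arc<Device>,
}

/// Simplified stand-in for `LifetimeTracker`: strong refs to in-use resources.
struct LifetimeTracker {
    active: Vec<Arc<Buffer>>,
}

/// The queue owns the tracker, so the strong references only point one way:
/// queue -> resources -> device.
struct Queue {
    device: Arc<Device>,
    life_tracker: Mutex<LifetimeTracker>,
}

impl Queue {
    fn submit(&self, buffer: Arc<Buffer>) {
        self.device.in_flight.fetch_add(1, Ordering::SeqCst);
        self.life_tracker.lock().unwrap().active.push(buffer);
    }
}

impl Drop for Queue {
    fn drop(&mut self) {
        // Simulated wait: a real implementation would block on a fence here.
        while self.device.in_flight.load(Ordering::SeqCst) > 0 {
            self.device.in_flight.fetch_sub(1, Ordering::SeqCst);
        }
        // Submissions are done, so the tracked resources can be released.
        self.life_tracker.get_mut().unwrap().active.clear();
    }
}

/// Returns whether the device was freed (before, after) the queue was dropped.
fn run_scenario() -> (bool, bool) {
    let dropped = Arc::new(AtomicBool::new(false));
    let device = Arc::new(Device {
        in_flight: AtomicUsize::new(0),
        dropped: dropped.clone(),
    });
    let queue = Queue {
        device: device.clone(),
        life_tracker: Mutex::new(LifetimeTracker { active: Vec::new() }),
    };
    let buffer = Arc::new(Buffer { _device: device.clone() });
    queue.submit(buffer.clone());

    // The user drops their handles while the submission is still active.
    drop(buffer);
    drop(device);
    let before = dropped.load(Ordering::SeqCst);

    drop(queue);
    let after = dropped.load(Ordering::SeqCst);
    (before, after)
}

fn main() {
    let (before, after) = run_scenario();
    println!("device freed before queue drop: {before}, after: {after}");
}
```

In this sketch the device outlives the user's handles only as long as the queue does, and nothing is left alive once the queue is dropped.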

One thing to note is that ideally we shouldn't be waiting in the Queue's Drop implementation, but that's what wgpu was previously doing (by calling device_poll) and the alternative is more involved: we could put the burden of keeping the Queue alive on wgpu-core users while there are any active submissions. With the changes in this PR we will panic if we hit a timeout or run into any errors; the alternative would avoid that too. I want to talk about this approach at the next maintainers call before making any changes though.

teoxoy · 2024-10-15T17:46:06Z

CI was failing because we were timing out when targeting WebGL. eade5cf fixes CI and behaves the same as current trunk, where the timeout is ignored (related: #3601 & #4589), but this is not correct.

This is another argument for requiring users of wgpu-core to keep the Queue alive and keep polling it until submissions complete.

teoxoy · 2024-10-16T17:02:53Z

Conclusions after discussing with the other maintainers the idea of requiring users of wgpu-core to keep the Queue alive and keep polling it until submissions complete:

- Most users don’t call poll so we shouldn’t make it a requirement (maintain gets called implicitly on submit).
- @cwfitzgerald said it should be fine to drop everything early on WebGL since WebGL implementations will still keep the objects alive (I will double check with @kdashg).
- We should still implement the idea for Firefox since we shouldn't be blocking the GPU process.

teoxoy · 2024-10-16T17:05:56Z

Some follow-up work is still needed though: instead of panicking, we need to add a retry mechanism for when we time out or run into OOMs. Device loss needs to be propagated to the device (making it invalid).
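A rough sketch of what such a retry could look like follows. Everything in it is an assumption made for illustration: `WaitError`, `Outcome`, `wait_with_retry`, the 100 ms timeout and the attempt limit are hypothetical names and values, not wgpu-core's API or what this PR implements.

```rust
use std::time::Duration;

/// Hypothetical errors a wait on the last submission could report.
#[derive(Debug, PartialEq)]
enum WaitError {
    Timeout,
    OutOfMemory,
    DeviceLost,
}

/// Hypothetical result of the retry loop.
#[derive(Debug, PartialEq)]
enum Outcome {
    /// The wait succeeded on the given (1-based) attempt.
    Completed { attempts: u32 },
    /// The caller should mark the device as invalid.
    DeviceLost,
    /// Every attempt timed out or hit an OOM.
    GaveUp,
}

/// Retries `wait` on timeouts and OOMs instead of panicking.
/// Device loss is not retried; it is reported to the caller.
fn wait_with_retry<F>(mut wait: F, max_attempts: u32) -> Outcome
where
    F: FnMut(Duration) -> Result<(), WaitError>,
{
    // Assumed per-attempt timeout, chosen arbitrarily for the example.
    let timeout = Duration::from_millis(100);
    for attempt in 1..=max_attempts {
        match wait(timeout) {
            Ok(()) => return Outcome::Completed { attempts: attempt },
            Err(WaitError::DeviceLost) => return Outcome::DeviceLost,
            Err(WaitError::Timeout) | Err(WaitError::OutOfMemory) => continue,
        }
    }
    Outcome::GaveUp
}

fn main() {
    // Simulated fence that times out twice before signalling completion.
    let mut calls = 0;
    let outcome = wait_with_retry(
        |_timeout| {
            calls += 1;
            if calls < 3 {
                Err(WaitError::Timeout)
            } else {
                Ok(())
            }
        },
        5,
    );
    println!("{:?}", outcome);

    // Device loss stops the loop immediately.
    let lost = wait_with_retry(|_timeout| Err(WaitError::DeviceLost), 5);
    println!("{:?}", lost);
}
```

Whether the retries should be bounded at all is a design question; the limit here only keeps the example finite.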

teoxoy · 2024-11-07T10:21:32Z

I revised the comment explaining why it's actually ok that we time out on WebGL.

teoxoy · 2024-11-07T16:34:28Z

I added the retry mechanism as well. This should be ready for review.

The `Device` should not contain any `Arc`s to resources as that creates cycles (since all resources hold strong references to the `Device`). Note that `PendingWrites` internally has `Arc`s to resources. I think this change also makes more sense conceptually since most operations that use `PendingWrites` are on the `Queue`.
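The cycle described here can be reproduced in a few lines. This is only an illustration with simplified stand-in types: the `tracked` field is a hypothetical placeholder for any `Arc`s to resources held by the device, not the real `Device` layout.

```rust
use std::sync::{Arc, Mutex, Weak};

/// Simplified device that (problematically) holds strong refs to resources.
struct Device {
    tracked: Mutex<Vec<Arc<Buffer>>>,
}

/// Simplified resource; it holds a strong reference back to the device.
struct Buffer {
    _device: Arc<Device>,
}

/// Returns `true` if the device is still alive after every user-held
/// handle has been dropped, i.e. if the cycle leaked it.
fn device_leaks_after_handles_dropped() -> bool {
    let device = Arc::new(Device {
        tracked: Mutex::new(Vec::new()),
    });
    let buffer = Arc::new(Buffer {
        _device: device.clone(),
    });

    // device -> buffer and buffer -> device: a strong reference cycle.
    device.tracked.lock().unwrap().push(buffer.clone());

    // A weak reference lets us observe the device without keeping it alive.
    let weak: Weak<Device> = Arc::downgrade(&device);

    drop(buffer);
    drop(device);

    // The two objects keep each other alive, so the upgrade still succeeds.
    weak.upgrade().is_some()
}

fn main() {
    println!("leaked: {}", device_leaks_after_handles_dropped());
}
```

Neither reference count can reach zero on its own here, which is why holding those `Arc`s somewhere other than the `Device` removes the leak.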

wgpu-core/src/device/mod.rs

wgpu-core/src/device/queue.rs

wgpu-hal/src/gles/device.rs

wgpu-hal/src/gles/mod.rs

wgpu-core/src/device/queue.rs

ErichDonGubler

Inclined to approve if we can get my questions answered.

The `Device` should not contain any `Arc`s to resources as that creates cycles (since all resources hold strong references to the `Device`). Note that `LifetimeTracker` internally has `Arc`s to resources.

…he `Queue`

…d `Queue` instead

…rop`

…ission This gets the `wgpu_test::ray_tracing::as_build::out_of_order_as_build` test to pass. This seems to be an issue even on trunk: looking at the number of calls to `create_command_encoder` and `destroy_command_encoder` in hal, they are not equal, so I'm not sure why the validation layers don't raise `VUID-vkDestroyDevice-device-05137`. There is still an issue with previous command buffers being leaked, but I will fix this in a follow-up.

ErichDonGubler · 2024-11-13T20:16:02Z

LGTM, with the discussions we've had. If you're still interested in @jimblandy's feedback, I'd suggest giving 'im...maybe 24 hours, if you're feeling patient? Otherwise, I think it's reasonable to do review after merging, if really necessary.

teoxoy · 2024-11-14T14:26:53Z

Let's land it; I already had to rebase it and fix conflicts a few times.

teoxoy self-assigned this Oct 15, 2024

teoxoy requested a review from a team as a code owner October 15, 2024 14:11

teoxoy force-pushed the device-drop branch 5 times, most recently from 2f16a44 to eade5cf Compare October 15, 2024 17:41

teoxoy force-pushed the device-drop branch 2 times, most recently from bb90304 to f03855a Compare November 7, 2024 10:20

teoxoy force-pushed the device-drop branch 2 times, most recently from 2d27279 to 86c4b4d Compare November 7, 2024 16:33

teoxoy force-pushed the device-drop branch from 86c4b4d to e304151 Compare November 12, 2024 11:10

ErichDonGubler self-assigned this Nov 12, 2024

ErichDonGubler added the `area: correctness` and `type: bug` labels Nov 12, 2024

rely on our ownership model to keep the device alive while there are …

c912323

…still active submissions `Global::device_drop` was wrongly assuming `device_poll` with `Maintain::Wait` was called but this is not a documented invariant and only `wgpu` was upholding this.

teoxoy force-pushed the device-drop branch 2 times, most recently from c421f62 to 004a982 Compare November 12, 2024 14:38

ErichDonGubler requested review from ErichDonGubler and jimblandy November 12, 2024 16:59

ErichDonGubler assigned jimblandy Nov 12, 2024

teoxoy added 2 commits November 12, 2024 19:11

Remove outdated locking comments

26d90a9

We should rely on the ranks in `wgpu-core\src\lock\rank.rs`.

teoxoy force-pushed the device-drop branch from 004a982 to 8fbfd7e Compare November 12, 2024 19:20