Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Wait for submissions to complete on Queue drop #6413

Merged
merged 12 commits into from
Nov 14, 2024
Merged

Conversation

teoxoy
Copy link
Member

@teoxoy teoxoy commented Oct 15, 2024

This PR addresses leaks caused by circular references due to the Device still holding strong references to resources when removed from the registry and while submissions are still active. Global::device_drop was wrongly assuming device_poll with Maintain::Wait was called but this is not a documented invariant and only wgpu was upholding this.

Instead of calling device_poll in device_drop (which would solve the issue) this PR takes a different approach since we want to remove the registries in the future (#5121):

  • it moves the LifetimeTracker and PendingWrites into the Queue so that we never have circular references
  • it adds the wait for active submissions in Queue's Drop impl

One thing to note is that ideally we shouldn't be waiting in the Queue's Drop implementation but that's what wgpu was previously doing (by calling device_poll) and the alternative is more involved: We could put the burden of keeping the Queue alive on wgpu-core users if there are any active submissions. With the changes in this PR we will panic if we hit a timeout or ran into any errors; this would solve that too. I want to talk about this approach at the next maintainers call before making any changes though.

@teoxoy teoxoy self-assigned this Oct 15, 2024
@teoxoy teoxoy requested a review from a team as a code owner October 15, 2024 14:11
@teoxoy teoxoy force-pushed the device-drop branch 5 times, most recently from 2f16a44 to eade5cf Compare October 15, 2024 17:41
@teoxoy
Copy link
Member Author

teoxoy commented Oct 15, 2024

CI was failing because when targeting WebGL we were timing out. eade5cf fixes CI and behaves the same as current trunk where the timeout is ignored (related: #3601 & #4589) but is not correct.

This is another argument for requiring users of wgpu-core to keep the Queue alive and keep polling it until submissions complete.

@teoxoy
Copy link
Member Author

teoxoy commented Oct 16, 2024

Conclusions after discussing the idea of requiring users of wgpu-core to keep the Queue alive and keep polling it until submissions complete with the other maintainers:

  • Most users don’t call poll so we shouldn’t make it a requirement (maintain gets called implicitly on submit).
  • @cwfitzgerald said it should be fine to drop everything early on WebGL since WebGL implementations will still keep the objects alive (I will double check with @kdashg).
  • We should still implement the idea for Firefox since we shouldn't be blocking the GPU process.

@teoxoy
Copy link
Member Author

teoxoy commented Oct 16, 2024

Some follow-up work is still needed though, instead of panicking, we need to add a retry mechanism if we time out or if we run into OOMs. Device loss needs to be propagated to the device (making it invalid).

@teoxoy teoxoy force-pushed the device-drop branch 2 times, most recently from bb90304 to f03855a Compare November 7, 2024 10:20
@teoxoy
Copy link
Member Author

teoxoy commented Nov 7, 2024

I revised the comment explaining why it's actually ok that we time out on WebGL.

@teoxoy teoxoy force-pushed the device-drop branch 2 times, most recently from 2d27279 to 86c4b4d Compare November 7, 2024 16:33
@teoxoy
Copy link
Member Author

teoxoy commented Nov 7, 2024

I added the retry mechanism as well, this should be ready for review.

@ErichDonGubler

This comment was marked as resolved.

@ErichDonGubler ErichDonGubler self-assigned this Nov 12, 2024
@ErichDonGubler ErichDonGubler added area: correctness We're behaving incorrectly type: bug Something isn't working labels Nov 12, 2024
…still active submissions

`Global::device_drop` was wrongly assuming `device_poll` with `Maintain::Wait` was called but this is not a documented invariant and only `wgpu` was upholding this.
The `Device` should not contain any `Arc`s to resources as that creates cycles (since all resources hold strong references to the `Device`).
Note that `PendingWrites` internally has `Arc`s to resources.

I think this change also makes more sense conceptually since most operations that use `PendingWrites` are on the `Queue`.
We should rely on the ranks in `wgpu-core\src\lock\rank.rs`.
Copy link
Member

@ErichDonGubler ErichDonGubler left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Inclined to approve if we can get my questions answered.

The `Device` should not contain any `Arc`s to resources as that creates cycles (since all resources hold strong references to the `Device`).
Note that `LifetimeTracker` internally has `Arc`s to resources.
…ission

This gets the `wgpu_test::ray_tracing::as_build::out_of_order_as_build` test to pass.

This seems to be an issue even on trunk, looking at the nr of calls to `create_command_encoder` & `destroy_command_encoder` in hal, they are not equal. So, I'm not sure why the validation layers don't raise the `VUID-vkDestroyDevice-device-05137`.

There is still an issue with previous command buffers being leaked but I will fix this in a follow-up.
@ErichDonGubler
Copy link
Member

LGTM, with the discussions we've had. If you're still interested in @jimblandy's feedback, I'd suggest giving 'im...maybe 24 hours, if you're feeling patient? Otherwise, I think it's reasonable to do review after merging, if really necessary.

@teoxoy
Copy link
Member Author

teoxoy commented Nov 14, 2024

Let's land it, I already had to rebase it and fix conflicts a few times.

@teoxoy teoxoy enabled auto-merge (rebase) November 14, 2024 14:27
@teoxoy teoxoy merged commit 278620b into gfx-rs:trunk Nov 14, 2024
27 checks passed
@teoxoy teoxoy deleted the device-drop branch November 14, 2024 14:28
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area: correctness We're behaving incorrectly type: bug Something isn't working
Projects
Status: Done
Development

Successfully merging this pull request may close these issues.

3 participants