Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

CI Flakes on VK discarding_either_depth_or_stencil_aspect_test #4740

Closed
cwfitzgerald opened this issue Nov 21, 2023 · 13 comments · Fixed by #4782
Closed

CI Flakes on VK discarding_either_depth_or_stencil_aspect_test #4740

cwfitzgerald opened this issue Nov 21, 2023 · 13 comments · Fixed by #4782
Labels
api: vulkan Issues with Vulkan type: bug Something isn't working

Comments

@cwfitzgerald
Copy link
Member

  TRY 3 FAIL [   0.460s] wgpu-test::gpu-tests [Executed] [Vulkan/llvmpipe (LLVM 15.0.7, 256 bits)/0] gpu_tests::zero_init_texture_after_discard::discarding_either_depth_or_stencil_aspect_test

thread '<unnamed>' panicked at 'texture was not fully cleared', tests/tests/zero_init_texture_after_discard.rs:306:9

https://github.com/gfx-rs/wgpu/actions/runs/6947331918/job/18900887783?pr=4723

@cwfitzgerald cwfitzgerald added area: infrastructure Testing, building, coordinating issues type: bug Something isn't working api: vulkan Issues with Vulkan and removed area: infrastructure Testing, building, coordinating issues labels Nov 21, 2023
@cwfitzgerald cwfitzgerald changed the title CI Flakes on discarding_either_depth_or_stencil_aspect_test CI Flakes on VK discarding_either_depth_or_stencil_aspect_test Nov 21, 2023
@teoxoy
Copy link
Member

teoxoy commented Nov 21, 2023

Related: #4739 & #4245

@teoxoy teoxoy added this to the WebGPU Specification V1 milestone Nov 21, 2023
@teoxoy
Copy link
Member

teoxoy commented Nov 22, 2023

If we revert the changes to the test (#4739), it seems to be broken on MoltenVK as well and even D3D12 #4739 (comment).

Looking at the errors thrown by Vulkan & D3D12, I suspect part of the texture tracking code is not working properly.

@cwfitzgerald
Copy link
Member Author

cwfitzgerald commented Nov 22, 2023

Yeah, something weird is going on here. When the resource is finally uses as copy_src after initialization, the device tracker thinks it hasn't seen it and just inserts the state without inserting a barrier at all.

@cwfitzgerald
Copy link
Member Author

cwfitzgerald commented Nov 22, 2023

We end up in https://github.com/gfx-rs/wgpu/blob/trunk/wgpu-core/src/track/texture.rs#L404-L436 during triage_suspected for the texture. external_count == 2 and the existing_ref_count == 2 and 2 <= 1 + 2, so the texture gets removed from the tracker, despite the texture being very much alive.

@cwfitzgerald
Copy link
Member Author

This part of the tracking and maintenance code is new after arcinization, so is new to me, still getting up to speed. cc @gents83

@cwfitzgerald
Copy link
Member Author

cwfitzgerald commented Nov 22, 2023

The difference is which submission each submission is determined done at. If the submission is determined finished in the same submit call the texture will be removed.

The order of operations is:

  • Build submission
  • Submit
  • Do some stuff
  • Maintain.

If the maintain call finds that the submission they just kicked off is done, you'll get

Texture Id { index: 0, epoch: 1, backend: Vulkan } is not tracked anymore due to ref count 3. External count: 2 every single submission.

If a submission lasts until the next submit call, you'll get:

Texture Id { index: 0, epoch: 1, backend: Vulkan } is still referenced with ref count 4. External count: 2 instead.

log-success.txt
log-fail.txt

Also go our logging!

@gents83
Copy link
Contributor

gents83 commented Nov 23, 2023

Yep! The situation is exactly as you described. In some situations the texture is not referenced and so can be released at the end of the frame.

When I was investigating this anyway I had the impression that when submitting right every operations the barrier transition operation doesn't work as intended.
While if all operations are in the same sumbit the resource state is following the correct flow.

So I think that our best option is to add some logging to barriers and transitions to check what is going on in that sense probably

@cwfitzgerald
Copy link
Member Author

When I was investigating this anyway I had the impression that when submitting right every operations the barrier transition operation doesn't work as intended.
While if all operations are in the same sumbit the resource state is following the correct flow.

The only reason it's not working as intended is that the trackers are removing the textures every frame from the trackers. If that didn't happen the barriers would be correct.

So I think that our best option is to add some logging to barriers and transitions to check what is going on in that sense probably

It already does :)

@gents83
Copy link
Contributor

gents83 commented Nov 23, 2023

Ok so I'll have to dig a bit into barriers to find out why they are not referencing the resource needed and allow it to be destroyed 👍🏼 hopefully will be on this in a couple of days

@gents83
Copy link
Contributor

gents83 commented Nov 26, 2023

@teoxoy & @cwfitzgerald I need your input here.
Once reverted the zero_init_texture_after_discard test, I've the following log, where everything seems correct to me a part from the fact that the texture instead of being in DEPTH_STENCIL_WRITE (that is last state) is in COPY_SRC.

[2023-11-26T15:57:41Z DEBUG wgpu_core::device::resource] Create view for texture Id(0,3,d3d12) filters usages to TextureUses(DEPTH_STENCIL_READ | DEPTH_STENCIL_WRITE)
[2023-11-26T15:57:41Z TRACE wgpu_core::storage] User is inserting TextureViewId(1,10,d3d12)
[2023-11-26T15:57:41Z TRACE wgpu_core::device::global] Texture::create_view(Id(0,3,d3d12)) -> Id(1,10,d3d12)
[2023-11-26T15:57:41Z TRACE wgpu_core::command::render] Encoding render pass begin in command buffer Id(0,19,d3d12)
[2023-11-26T15:57:41Z TRACE wgpu_core::command::render] Merging renderpass into cmd_buf Id(0,19,d3d12)
[2023-11-26T15:57:41Z TRACE wgpu_core::track::texture] tex 0: insert start TextureUses(DEPTH_STENCIL_WRITE)
[2023-11-26T15:57:41Z TRACE wgpu_core::track::texture] tex 0: insert start TextureUses(DEPTH_STENCIL_WRITE)
[2023-11-26T15:57:41Z TRACE wgpu_hal::dx12::command] List 0x1596bf63c20 buffer transitions
[2023-11-26T15:57:41Z TRACE wgpu_hal::dx12::command] List 0x1596bf63c20 texture transitions
[2023-11-26T15:57:41Z TRACE wgpu_core::device::global] TextureView::drop Id(1,10,d3d12)
[2023-11-26T15:57:41Z TRACE wgpu_core::storage] User is removing TextureViewId(1,10,d3d12)
[2023-11-26T15:57:41Z TRACE wgpu_core::command] Command buffer Id(0,19,d3d12)
[2023-11-26T15:57:41Z TRACE wgpu_core::device::queue] Queue::submit Id(0,1,d3d12)
[2023-11-26T15:57:41Z TRACE wgpu_core::command] Extracting BakedCommands from CommandBuffer "[] Id(0,19,d3d12)"
[2023-11-26T15:57:41Z TRACE wgpu_core::device::queue] Stitching command buffer Id(0,19,d3d12) before submission
[2023-11-26T15:57:41Z TRACE wgpu_core::track::texture] tex 0: insert start TextureUses(DEPTH_STENCIL_WRITE)
[2023-11-26T15:57:41Z TRACE wgpu_core::track::texture] tex 0: insert end TextureUses(DEPTH_STENCIL_WRITE)
[2023-11-26T15:57:41Z TRACE wgpu_hal::dx12::command] List 0x1596bf38950 buffer transitions
[2023-11-26T15:57:41Z TRACE wgpu_hal::dx12::command] List 0x1596bf38950 texture transitions
[2023-11-26T15:57:41Z TRACE wgpu_core::device::queue] Device after submission 19
[2023-11-26T15:57:41Z TRACE wgpu_core::device::queue] SUBMIT
[2023-11-26T15:57:41Z TRACE wgpu_core::track::stateless] StatelessTracker::remove_abandoned Id(1,10,d3d12)
[2023-11-26T15:57:41Z TRACE wgpu_core::track::stateless] TextureView Id(1,10,d3d12) is not tracked anymore
[2023-11-26T15:57:41Z TRACE wgpu_core::track::texture] Texture Id(0,3,d3d12) is not tracked anymore
[2023-11-26T15:57:41Z DEBUG wgpu_core::device::life] Active submission 19 is done
[2023-11-26T15:57:41Z TRACE wgpu_core::resource] Destroy raw TextureView [] Id(1,10,d3d12)
[2023-11-26T15:57:41Z TRACE wgpu_core::device::global] Device::poll
[2023-11-26T15:57:41Z TRACE wgpu_core::storage] User is inserting CommandBufferId(0,20,d3d12)
[2023-11-26T15:57:41Z TRACE wgpu_core::device::global] Device::create_command_encoder -> Id(0,20,d3d12)
[2023-11-26T15:57:41Z TRACE wgpu_core::command::transfer] CommandEncoder::copy_texture_to_buffer Id(0,3,d3d12) -> Id(2,2,d3d12) Extent3d { width: 256, height: 256, depth_or_array_layers: 1 }
[2023-11-26T15:57:41Z TRACE wgpu_core::track::texture] tex 0: insert start TextureUses(COPY_SRC)
[2023-11-26T15:57:41Z TRACE wgpu_core::track::buffer] buf 2: insert BufferUses(COPY_DST)..BufferUses(COPY_DST)
[2023-11-26T15:57:41Z TRACE wgpu_hal::dx12::command] List 0x1596bf4a680 buffer transitions
[2023-11-26T15:57:41Z TRACE wgpu_hal::dx12::command] List 0x1596bf4a680 texture transitions
[2023-11-26T15:57:41Z TRACE wgpu_core::command] Command buffer Id(0,20,d3d12)
[2023-11-26T15:57:41Z TRACE wgpu_core::device::queue] Queue::submit Id(0,1,d3d12)
[2023-11-26T15:57:41Z TRACE wgpu_core::command] Extracting BakedCommands from CommandBuffer "[] Id(0,20,d3d12)"
[2023-11-26T15:57:41Z TRACE wgpu_core::device::queue] Stitching command buffer Id(0,20,d3d12) before submission
[2023-11-26T15:57:41Z TRACE wgpu_core::track::buffer] buf 2: transition BufferUses(COPY_DST) -> BufferUses(COPY_DST)
[2023-11-26T15:57:41Z TRACE wgpu_core::track::texture] tex 0: insert start TextureUses(COPY_SRC)
[2023-11-26T15:57:41Z TRACE wgpu_core::track::texture] tex 0: insert end TextureUses(COPY_SRC)
[2023-11-26T15:57:41Z TRACE wgpu_hal::dx12::command] List 0x1596bf63c20 buffer transitions
[2023-11-26T15:57:41Z TRACE wgpu_hal::dx12::command] 0x1596b75f240: usage BufferUses(COPY_DST)..BufferUses(COPY_DST)
[2023-11-26T15:57:41Z TRACE wgpu_hal::dx12::command] List 0x1596bf63c20 texture transitions
[2023-11-26T15:57:41Z TRACE wgpu_core::device::queue] Device after submission 20
[2023-11-26T15:57:41Z TRACE wgpu_core::device::queue] SUBMIT
[2023-11-26T15:57:41Z ERROR wgpu_hal::auxil::dxgi::exception] ID3D12CommandQueue::ExecuteCommandLists: Using CopyTextureRegion on Command List (0x000001596BF4A680:'Discard Stencil'): Resource state (0xFBCF4640: D3D12_RESOURCE_STATE_DEPTH_WRITE) of resource (0x000001596BF240D0:'RenderTarget') (subresource: 0) is invalid for use as a source resource. Expected State Bits (all): 0xFBCF4620: D3D12_RESOURCE_STATE_COPY_SOURCE, Actual State: 0xFBCF4600: D3D12_RESOURCE_STATE_DEPTH_WRITE, Missing State: 0x800: D3D12_RESOURCE_STATE_COPY_SOURCE. [ EXECUTION ERROR # 538: INVALID_SUBRESOURCE_STATE]

If I understood correctly you were suggesting that the cause could be that the texture is not tracked anymore, after previous submit, even if the resource is still alive, but it seems that everything is consistent to what I was expecting, because the resource has been submitted and we cannot know that it'll be required again in next submit.

Any input on this?

@gents83
Copy link
Contributor

gents83 commented Nov 26, 2023

Digging a bit:

  • The resource was removed from tracker (because not needed anymore - the only one keeping it alive is the user)
  • command_encoder_copy_texture_to_buffer is called
  • tracker.textures.set_single() is called - and the texture is not owned anymore - so insert_or_barrier_update() actually insert without starting from previous state and forcing COPY_SRC

@gents83
Copy link
Contributor

gents83 commented Nov 26, 2023

I think that probably we should keep the resource alive in the device tracker IF it's alive in the user land - I'll test a possible fix (and cleanup also)

@gents83
Copy link
Contributor

gents83 commented Nov 26, 2023

Created PR #4782 to fix the issue 👍

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
api: vulkan Issues with Vulkan type: bug Something isn't working
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants