vulkan: async and event fixes #20518

Merged
0cc4m merged 11 commits into master from 0cc4m/vulkan-async-fixes
Mar 17, 2026

@0cc4m
Contributor

0cc4m commented Mar 13, 2026

I noticed incoherent output on my multi-GPU setup as well while investigating issues like #20462. Disabling cpy_tensor_async makes it go away, so the problem is in the async path. I narrowed it down to these problems:

  • events were set, but the wait command was never submitted to the queue, so the event_wait function didn't do anything
  • events were resetting command buffers that had long since been reused, because they didn't track whether a buffer had been reused. This was causing validation errors and perhaps driver issues or crashes
  • there was a race condition between an event being set, being waited on within the GPU, and being reset by the host. The only reason this didn't lead to deadlocks is that event_wait was not working. I fixed it by doing the event reset in the queue instead of outside of it, so that it happens immediately before the next set command. To avoid the same issue with the fence, I replaced it with a timeline semaphore, which does not need manual resets by the host.
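
To illustrate why a timeline semaphore avoids the host-side reset, here is a minimal host-only sketch of its semantics. It is a model, not the Vulkan API and not the code in this PR: `TimelineModel`, `reserve`, `signal` and `reached` are hypothetical names. In real Vulkan the equivalent pieces would be a semaphore created with `VK_SEMAPHORE_TYPE_TIMELINE` and a host wait through `vkWaitSemaphores`.

```cpp
#include <cassert>
#include <cstdint>

// Host-side model of a timeline semaphore: a counter that only ever increases.
// Hypothetical illustration only; this does not call into Vulkan.
struct TimelineModel {
    uint64_t counter    = 0;  // last value signaled by the "queue"
    uint64_t next_value = 1;  // next value the host will hand out

    // Host: reserve the value that the next submission will signal.
    uint64_t reserve() { return next_value++; }

    // "Queue": mark a submission as finished by raising the counter.
    void signal(uint64_t value) {
        assert(value > counter);  // timeline values must strictly increase
        counter = value;
    }

    // Host: a wait on `value` is satisfied once the counter has reached it.
    bool reached(uint64_t value) const { return counter >= value; }
};
```

Because each submission waits on its own value and the counter never goes backwards, there is no state for the host to reset between submissions, which is the property a fence lacks.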

This may help with #20462, #20029, #20517 and maybe some more.

0cc4m requested a review from jeffbolznv March 13, 2026 15:53
github-actions bot added the Vulkan (issues specific to the Vulkan backend) and ggml (changes relating to the ggml tensor library for machine learning) labels Mar 13, 2026
@0cc4m
Contributor Author

0cc4m commented Mar 13, 2026

I'm not sure about resetting the event asynchronously right before setting it. It does seem to work, but maybe we need multiple events instead, to avoid reuse until a full queue synchronization happens.

@jeffbolznv
Contributor

I don't think the reset right before the set will be safe. This was one of the things that made VkEvents very hard to use.

@0cc4m
Contributor Author

0cc4m commented Mar 13, 2026

Probably not, yeah. I'll take another look tomorrow and figure out something better. I guess we need multiple events. I thought about using the timeline semaphore for synchronization within the queue as well, but I think that would be heavier than events and isn't really what timeline semaphores are meant for.

@0cc4m
Contributor Author

0cc4m commented Mar 14, 2026

@jeffbolznv I've got something that should work now. I'm not sure if it could be done in a simpler way, but this was the best I found that still actually reuses events without a manual synchronization step anywhere. Without reuse we might get into trouble with too many events during loading, similar to command buffers.

@jeffbolznv
Contributor

The recycling of events seems to assume that they will only be waited on once. Is that a valid assumption?

@0cc4m
Contributor Author

0cc4m commented Mar 14, 2026

No, they do get waited on multiple times. But the assumption is that after a new event is recorded, the previous one cannot be waited on any longer. So when the new event gets synchronized (cpu-waited), all event waits of the previous one are also done, because they were submitted into the queue before the new event got set.

vkev->events_submitted.insert(vkev->events_submitted.end(), vkev->events_pending.begin(), vkev->events_pending.end());
vkev->events_pending.clear();
// Move existing event into pending
vkev->events_pending.push_back(vkev->event);

Contributor

I'm not fully following the logic here. The rest of this function will get an event and submit it. Why doesn't that immediately go into events_submitted? And can there ever be more than one pending event?

Contributor Author

Pending and submitted are meant in relation to wait commands in the queue, not set commands. So the flow is like this, for example:

Event 1 recorded
Waits for event 1
Event 2 recorded, event 1 goes into pending. It is no longer possible to wait for event 1
Waits for event 2
Event 3 recorded, event 2 goes into pending. Event 1 goes into submitted.
Waits for event 3
Synchronize event 3, events 1 and 2 can now be reused because all of their waits were queued before event 3.

or

Event 1 recorded
Waits for event 1
Synchronize event 1, wait commands for event 1 may still be in the queue, so no reuse yet
Event 2 recorded, event 1 moves into pending
Waits for event 2
Synchronize event 2, event 1 can now be reused

But I think you're right. The pending stage isn't needed. A synchronization means the queue has reached the set command of the event being synchronized, so all waits for previous events must also be done.
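
The simplification in the last paragraph can be sketched as a small host-only model. This is an illustration of the reasoning above under stated assumptions, not the merged implementation: `EventRecycler`, `EventId`, `free_pool`, `created`, `record` and `synchronize` are hypothetical names, plain integers stand in for `VkEvent` handles, and only `event` and `events_submitted` mirror fields from the snippet under review.

```cpp
#include <cassert>
#include <vector>

using EventId = int;                  // stand-in for a VkEvent handle (assumption)
constexpr EventId NO_EVENT = -1;

// Model of event recycling without a separate "pending" stage.
struct EventRecycler {
    std::vector<EventId> free_pool;         // events that are safe to reset and reuse
    std::vector<EventId> events_submitted;  // older events whose waits may still be queued
    EventId event   = NO_EVENT;             // event that new waits currently target
    int     created = 0;                    // number of events ever allocated

    // Record a new "set": retire the current event and pick the next one.
    EventId record() {
        if (event != NO_EVENT) {
            // No new waits can target the old event from here on.
            events_submitted.push_back(event);
        }
        if (free_pool.empty()) {
            event = created++;              // allocate a fresh event
        } else {
            event = free_pool.back();       // reuse a recycled event
            free_pool.pop_back();
        }
        return event;
    }

    // The host has observed that the queue reached the set of the current event.
    // Waits on older events were queued before that set, so they are finished.
    void synchronize() {
        assert(event != NO_EVENT);
        free_pool.insert(free_pool.end(), events_submitted.begin(), events_submitted.end());
        events_submitted.clear();
    }
};
```

In this model, recording events 1, 2 and 3 and then synchronizing event 3 returns events 1 and 2 to the pool, matching the first flow above. Synchronizing the current event never frees that event itself, which matches the "no reuse yet" step in the second flow.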

@HumerousGorgon

Hi @0cc4m,

I can confirm that this PR definitely fixes the problems I'm having with inference on Intel GPUs.
Thank you for this!

@sinister-cat

I tried merging this into master locally. Everything builds fine, but the assert fires during the warmup phase with qwen3.5. I was able to use the exact same command on master with no immediate issues.

log.txt

Commenting out the assert causes a segfault, and --no-warmup also causes a segfault. Any ideas on why this could be?

@0cc4m
Contributor Author

0cc4m commented Mar 17, 2026

@jeffbolznv I guess that answers the "valid use" question. I'll revert the assertion.

0cc4m merged commit 3a5cb62 into master Mar 17, 2026
49 checks passed
0cc4m deleted the 0cc4m/vulkan-async-fixes branch March 17, 2026 13:27