-
Notifications
You must be signed in to change notification settings - Fork 19.8k
Optimize Vulkan buffer transfers on UMA (Unified Memory Architecture) devices #22462
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Closed
winstonma
wants to merge
27
commits into
ggml-org:master
from
winstonma:winston/vk-uma-read-threshold
Closed
Changes from 2 commits
Commits
Show all changes
27 commits
Select commit
Hold shift + click to select a range
80e793a
vulkan: avoid preferring transfer queue on AMD UMA devices
winstonma da5e315
Optimize Vulkan buffer transfers on UMA (Unified Memory Architecture)…
winstonma fe1eb03
vulkan: centralize UMA read heuristic and configurable threshold
winstonma bd5db36
fix incorrect async/event ordering on Vulkan, where a host read could…
winstonma 91176d3
implement UMA write threshold to avoid non-cached memory penalty
winstonma dfcc950
implement an automatic calibration system for UMA (Unified Memory Arc…
winstonma 8819a4f
refactoring and cleanup pass for this PR
winstonma f1c0532
reverses the decision criteria for using direct memory access when th…
winstonma 4ecad4f
Merge commit 'refs/pull/22455/head' of https://github.com/ggml-org/ll…
winstonma 3139dcf
fixes measurement bias that was causing suboptimal transfer strategy …
winstonma 6133345
removes dead code structure in ggml_vk_buffer_read_2d_asyn
winstonma 9309d72
Fixed the indentation inconsistency
winstonma bd7701b
added UMA write thresholding for ggml_vk_buffer_memset_async and ggml…
winstonma 6f85dc0
revert prefers_transfer_queue definition and comments
winstonma 8f2fb72
use read/write barrier to address potential race conditions
winstonma d38def3
cleanup and deduplication
winstonma e7db9e1
cleanup and deduplication #2
winstonma 630716e
follows best practices for handling platform-specific size differences
winstonma 5c78cdd
added a two-line comment explaining the contract
winstonma ab18b5a
Revert "fix incorrect async/event ordering on Vulkan"
winstonma b58976a
making the barrier a silent no-op
winstonma bd0a0ff
removed the premature UMA direct transfer check in ggml_vk_buffer_rea…
winstonma 2aad038
fixing slow read speed
winstonma c52584d
fix calibration
winstonma e176a81
remove flush cache
winstonma 8c67e77
refactor and optimize the calibration process
winstonma 0ae5d6a
adjusted the read/write logic
winstonma File filter
Filter by extension
Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
There are no files selected for viewing
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think this is correct for the same reasons I commented in #20018. The async copies need to run on the queue to stay in order with other commands.
Uh oh!
There was an error while loading. Please reload this page.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the review. I am not familiar with these. I asked Codex to write a test case to verify the the async copies and passes the test case. And here is the follow up question that I asked:
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You need to be familiar with it. Copy-pasting AI responses into maintainer questions is not allowed because we do not have time or patience to debate an AI that can make up wrong claims way faster than any human could debunk them.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Frankly I'm not quite sure I follow the question. But I tried to add some log and see if the question is being answered. This is the debug log:
According to the log, the Vulkan Timeline Semaphore have created a system where the Compute Queue is physically incapable of outrunning the data being moved by the Transfer Queue. Thus the ordering is maintained. Also, the Compute Queue is hardware-blocked (bound by a Vulkan Timeline Semaphore wait operation) until the Transfer Queue signals completion, there is no risk of the GPU reading "stale" or partially written memory.
Disabling Transfer Queue on AMD UMA
I also submitted another PR to disable to the transfer queue on the AMD UMA. If the Transfer Queue is disabled, the code path would naturally fall back to a single-queue model where all operations are submitted to the Compute Queue. In this scenario, ordering is maintained by default due to the sequential nature of command submission within a single Vulkan queue.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Regardless of the transfer queue or compute queue, ordering is maintained for commands you submit to the queue. That does not apply to deferred memcpys.
in_memcpysrun on queue submission.out_memcpysrun (in specific cases) after a fence wait that makes sure all queue commands are done. This will not work with the backend async read/write functions because those assume that the commands run in the right order in the queue.It may work in your tests because you get lucky and the order works out, but this is not guaranteed. This change is fundamentally unsafe.
Uh oh!
There was an error while loading. Please reload this page.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for detailed explanation. I made a commit based on previous comment.
The commit moves the execution of
out_memcpysis deferred fromggml_vk_compute_forwardtoggml_vk_synchronize(ensures that the memcpy only occurs after the GPU fence has signaled completion). Also prevents dropped copies that might occur if a tensor's weak context reference was unset before the synchronization happened. Thus the ordering is enforced by the fence.in_memcpysis consumed before GPU work submission, also it is cleared after submission (inggml_vk_submit_transfer_ctx) instead of being cleared later during synchronization (avoid it from being re-executed). Thus ensure transfer complete before GPU work begins.Conclusion
in_memcpysis executed before submitIn the UMA write path
ggml_vk_buffer_write_2d_async, instead of going through a staging buffer +vkCmdCopyBuffer, the data is deferred intosubctx->in_memcpys. These are then memcpy'd beforeggml_vk_submitis called, at two sites:ggml_vk_compute_forward: loopsin_memcpys→memcpy→ then submits the command buffer.ggml_vk_synchronize: same pattern for any remainingin_memcpyson the compute context.ggml_vk_submit_transfer_ctx: same for the transfer queue context.out_memcpysexecuted after fenceIn the UMA read path
ggml_vk_buffer_read_2d_async, the read is deferred intosubctx->out_memcpys. These are consumed only afterwaitForFencessucceeds inggml_vk_synchronize