vulkan: add UMA zero-copy async transfers and fix event_record deferred memcpy handling #20018
Open
neilopet wants to merge 2 commits into
Skip staging buffers for HOST_VISIBLE mapped buffers on UMA devices by routing through deferred_memcpy in write_2d_async and read_2d_async. Fix event_record to drain in_memcpys before submit and preserve out_memcpys across context reset, preventing data loss during the set_tensor_async -> event_record -> synchronize model-loading pattern.
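The event_record fix described above can be pictured with a toy model. Everything in it is hypothetical: `DeferredCopy`, `ComputeCtx`, `run_copies`, `event_record` and `synchronize` are illustrative stand-ins, not ggml-vulkan's actual types or signatures, and the Vulkan submission and fence wait are reduced to comments. The point is the ordering: pending host writes (`in_memcpys`) are applied before submit, and pending host reads (`out_memcpys`) are carried across the context reset so that synchronize can still perform them.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <utility>
#include <vector>

// Hypothetical stand-in for one deferred host copy (not ggml's actual type).
struct DeferredCopy {
    void *dst;
    const void *src;
    std::size_t n;
};

// Hypothetical compute context holding the two deferred-copy queues.
struct ComputeCtx {
    std::vector<DeferredCopy> in_memcpys;   // host writes to apply before submit
    std::vector<DeferredCopy> out_memcpys;  // host reads to apply after the GPU finishes
};

// Performs the queued copies in order, then empties the queue.
static void run_copies(std::vector<DeferredCopy> &copies) {
    for (const DeferredCopy &c : copies) {
        std::memcpy(c.dst, c.src, c.n);
    }
    copies.clear();
}

// Sketch of event_record: drain in_memcpys, submit, reset the context,
// and restore out_memcpys so the reset does not silently drop them.
static void event_record(ComputeCtx &ctx) {
    run_copies(ctx.in_memcpys);                          // drain before submit
    // (ctx_end + queue submission would happen here)
    std::vector<DeferredCopy> pending = std::move(ctx.out_memcpys);
    ctx = ComputeCtx{};                                  // reset/recreate
    ctx.out_memcpys = std::move(pending);                // restore
}

// Sketch of synchronize: after waiting on the GPU (omitted), apply deferred reads.
static void synchronize(ComputeCtx &ctx) {
    run_copies(ctx.out_memcpys);
}
```

Under these assumptions, a write queued into `in_memcpys` reaches the mapped memory at event_record, and a read queued into `out_memcpys` is still performed at synchronize; losing either list at the reset is the kind of data loss the commit message describes.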
Four tests covering the UMA zero-copy deferred memcpy lifecycle: round-trip integrity, repeated event records, empty queues, and out_memcpy persistence through event_record context reset.
jeffbolznv (Contributor) requested changes on Mar 2, 2026 and left a comment:
A change like this needs clearer before/after perf results.
Review comment on:
    return true;
}

// UMA zero-copy: destination is directly mapped, skip staging buffer

Contributor:
I don't think this is correct. When called from set_tensor_async, I think the operation needs to be ordered against other work submitted to the same backend. When called from set_tensor, we already have a CPU copy path so this would be redundant.
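The ordering concern in this comment can be illustrated with a toy model. `ToyBackend` and `kernel_sees` are hypothetical and are not the ggml backend interface: GPU work is queued and only runs at synchronize, a staged copy is queued behind earlier work, and a zero-copy write lands in mapped memory immediately. In the PR the host copy is deferred to the next submit rather than performed at once, but, assuming the reviewer's reading, in both cases it runs on the host without being ordered after GPU work that was already recorded or submitted; the model collapses that distinction.

```cpp
#include <cassert>
#include <functional>
#include <vector>

// Toy model (hypothetical, not the ggml API): GPU work only runs at synchronize().
struct ToyBackend {
    float mapped = 0.0f;                       // stand-in for a HOST_VISIBLE buffer on a UMA device
    float result = 0.0f;                       // output written by the queued "kernel"
    std::vector<std::function<void()>> queue;  // recorded/submitted but not yet executed GPU work

    // A kernel that reads the buffer when it eventually executes.
    void enqueue_read_kernel() {
        queue.push_back([this] { result = mapped; });
    }
    // Staged path: the copy is queued, so it runs after earlier work.
    void set_async_ordered(float v) {
        queue.push_back([this, v] { mapped = v; });
    }
    // Zero-copy path: the host writes the mapped memory right away, bypassing the queue.
    void set_async_zero_copy(float v) {
        mapped = v;
    }
    void synchronize() {
        for (auto &op : queue) {
            op();
        }
        queue.clear();
    }
};

// buffer = 1, enqueue a kernel that reads it, then set_tensor_async(2), then synchronize.
// Stream ordering says the kernel should observe 1.
float kernel_sees(bool zero_copy) {
    ToyBackend b;
    b.mapped = 1.0f;
    b.enqueue_read_kernel();
    if (zero_copy) {
        b.set_async_zero_copy(2.0f);
    } else {
        b.set_async_ordered(2.0f);
    }
    b.synchronize();
    return b.result;
}
```

With the ordered copy the kernel observes the old value, as stream ordering requires; with the unordered host write it observes the new one.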
Review comment on the new test file:
@@ -0,0 +1,343 @@
// Regression tests for UMA zero-copy async buffer transfers.

Contributor:
FWIW, I do think we need more thorough tests of the set/get paths and ordering guarantees of the backend interfaces. But if we have 300 lines of AI-generated tests per PR nobody will be able to maintain it.
Problem:
Changes:
- `ggml_vk_buffer_write_2d_async` and `ggml_vk_buffer_read_2d_async`.
- `ggml_backend_vk_event_record`, drain pending `in_memcpys` before submit and preserve/restore `out_memcpys` across compute-context reset.
- `tests/test-vulkan-uma.cpp` for round-trip, multiple event-record sequences, empty-queue path, and out-memcpy persistence.

Validation:
- `test-vulkan-uma`: 4/4 pass.
- `test-backend-ops`: only the RDNA f16 DIV failures that also occur on the baseline.
- `llama-bench`, local UMA (t/s):
  - Qwen3.5-27B-Q4_K_M: pp512 = 152.75, tg128 = 11.23
  - Fine-tuned private Qwen3-Coder-Next-Q4_K_M: pp512 = 460.61, tg128 = 42.69

Scope / consistency:
- `ctx_end -> submit -> reset/recreate`), adding only deferred-copy correctness handling.
- `device->uma && HOST_VISIBLE`).

Related: #16059, #18302, #18047
AI Disclosure: AI tools were used for review and test planning.