Skip to content

vulkan: add UMA zero-copy async transfers and fix event_record deferred memcpy handling#20018

Open
neilopet wants to merge 2 commits into
ggml-org:masterfrom
neilopet:vulkan-uma-zero-copy
Open

vulkan: add UMA zero-copy async transfers and fix event_record deferred memcpy handling#20018
neilopet wants to merge 2 commits into
ggml-org:masterfrom
neilopet:vulkan-uma-zero-copy

Conversation

@neilopet
Copy link
Copy Markdown

@neilopet neilopet commented Mar 1, 2026

Problem:

  • On UMA, host-visible mapped buffers were still using staging in async transfer paths.
  • When the UMA zero-copy async path is used, deferred memcpy queues can be pending at event_record; without drain/preserve across context reset, copies can be dropped.

Changes:

  • Add UMA zero-copy async fast paths for host-visible mapped buffers in ggml_vk_buffer_write_2d_async and ggml_vk_buffer_read_2d_async.
  • In ggml_backend_vk_event_record, drain pending in_memcpys before submit and preserve/restore out_memcpys across compute-context reset.
  • Add regression tests in tests/test-vulkan-uma.cpp for round-trip, multiple event-record sequences, empty-queue path, and out-memcpy persistence.

Validation:

  • test-vulkan-uma: 4/4 pass.
  • test-backend-ops: baseline-matched RDNA f16 DIV failures only.
  • llama-bench local UMA:
    • Qwen3.5-27B-Q4_K_M: pp512=152.75, tg128=11.23
    • Fine-tuned private Qwen3-Coder-Next-Q4_K_M: pp512=460.61, tg128=42.69

Scope / consistency:

  • This builds on vulkan: improve partial offloading performance on AMD #19976 transfer-context changes and only adds deferred memcpy lifecycle handling for UMA zero-copy paths.
  • Preserves existing event lifecycle structure (ctx_end -> submit -> reset/recreate), adding only deferred-copy correctness handling.
  • UMA zero-copy path remains gated by (device->uma && HOST_VISIBLE).
  • No shader changes.
  • No changes to non-UMA transfer behavior.

Related: #16059, #18302, #18047

AI Disclosure: AI tools were used for review and test planning.

neilopet added 2 commits March 1, 2026 14:53
Skip staging buffers for HOST_VISIBLE mapped buffers on UMA devices
by routing through deferred_memcpy in write_2d_async and read_2d_async.

Fix event_record to drain in_memcpys before submit and preserve
out_memcpys across context reset, preventing data loss during the
set_tensor_async -> event_record -> synchronize model-loading pattern.
Four tests covering the UMA zero-copy deferred memcpy lifecycle:
round-trip integrity, repeated event records, empty queues, and
out_memcpy persistence through event_record context reset.
@github-actions github-actions Bot added testing Everything test related Vulkan Issues specific to the Vulkan backend ggml changes relating to the ggml tensor library for machine learning labels Mar 1, 2026
@neilopet neilopet changed the title vulkan: add UMA zero-copy async path and fix event_record deferred memcpy handling vulkan: add UMA zero-copy async transfers and fix event_record deferred memcpy handling Mar 1, 2026
@neilopet neilopet marked this pull request as ready for review March 1, 2026 20:16
@neilopet neilopet requested review from 0cc4m and ggerganov as code owners March 1, 2026 20:16
Copy link
Copy Markdown
Contributor

@jeffbolznv jeffbolznv left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A change like this needs more clear before/after perf results.

return true;
}

// UMA zero-copy: destination is directly mapped, skip staging buffer
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think this is correct. When called from set_tensor_async, I think the operation needs to be ordered against other work submitted to the same backend. When called from set_tensor, we already have a CPU copy path so this would be redundant.

Comment thread tests/test-vulkan-uma.cpp
@@ -0,0 +1,343 @@
// Regression tests for UMA zero-copy async buffer transfers.
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FWIW, I do think we need more thorough tests of the set/get paths and ordering guarantees of the backend interfaces. But if we have 300 lines of AI-generated tests per PR nobody will be able to maintain it.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

ggml changes relating to the ggml tensor library for machine learning testing Everything test related Vulkan Issues specific to the Vulkan backend

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants