Remove Legacy Copy-OP Pointer Indirection Code #16485
Since changing addresses of cpy operations in CUDA graphs is no longer supported, the exception for GGML_OP_CPY in ggml_graph_node_has_matching_properties should also be removed.
The indirections in cpy ops should also be removed, along with ggml_cuda_cpy_dest_ptrs_copy and ggml_cuda_graph::cpy_dest_ptrs, since their only purpose was to allow this.
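To illustrate the first point, here is a minimal host-only sketch of a captured-graph property check with and without a copy-op exception. The types `Op`, `Node`, `NodeProps` and the function `node_has_matching_properties` are simplified, hypothetical stand-ins and not the actual ggml definitions; the `allow_cpy_address_change` flag exists only to contrast the two behaviours.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical, simplified stand-ins for ggml types (assumption, not the real structs).
enum class Op { Add, Cpy };

struct Node {
    Op          op;
    const void *data; // output address of the node
    const void *src;  // address of a single input (real nodes have several)
};

// Properties recorded when the CUDA graph was captured.
struct NodeProps {
    Op          op;
    const void *node_address;
    const void *src_address;
};

// Returns true if the node still matches what was recorded at capture time.
// allow_cpy_address_change == true mimics a legacy exception that let copy ops
// change addresses; false treats copy ops like every other op.
bool node_has_matching_properties(const Node &node, const NodeProps &props,
                                  bool allow_cpy_address_change) {
    assert(node.data != nullptr); // a captured node must have an output buffer
    if (node.op != props.op) {
        return false;
    }
    const bool skip_address_check = allow_cpy_address_change && node.op == Op::Cpy;
    if (!skip_address_check && node.data != props.node_address) {
        return false;
    }
    if (!skip_address_check && node.src != props.src_address) {
        return false;
    }
    return true;
}
```

With the exception disabled, a copy node whose destination address has moved no longer matches its recorded properties, so the caller would treat the captured graph as stale instead of patching pointers through an indirection table.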
@CISC, I've applied the function rename in the latest commit as suggested. Could you please take a look and let me know if the changes look good, or if there's anything else you'd recommend updating before merge?
* cuda : remove legacy copy-op pointer indirection code (ggml-org#16485)
  * remove legacy copy-op pointer indirection code
  * further removal of copy-op indirection code
  * renamed check_node_graph_compatibility_and_refresh_copy_ops function
* CUDA: add fp kernel for larger batch size MoE (ggml-org#16512)
  * CUDA: kernel for larger batch sizes for MoE
  * WIP
  * WIP
  * WIP
  * WIP
  * WIP
  * WIP
  * fixup
  * tests
  * Move mmq_ids_helper to mmid
  * cleanup
  * Remove redundant checks
* CUDA: use fastdiv + ggml_cuda_mad for mmvf (ggml-org#16557)
  * CUDA: use fastdiv + ggml_cuda_mad for mmvf
  * use bf16 directly + fix formatting
  * Add exception for HIP code
* CUDA: enable FA for FP32 KV cache (ggml-org#16546)
* vulkan: Improve build time for MSVC (ggml-org#16545)
  Enable CMP0147 so custom build steps (invoking vulkan-shader-gen) are run in parallel.
  Enable /MP so source files are compiled in parallel.
* vulkan: Support FA with K/V in F32 (ggml-org#16543)
* CUDA + openCL: fix bug in accessing rms_norm->src while doing fusion (ggml-org#16577)
* vulkan: Add ACC_TYPE_VEC2 implementation (ggml-org#16203)
  Signed-off-by: Stefan Savic <[email protected]>
  Co-authored-by: Stefan Savic <[email protected]>
* metal : avoid using Metal's gpuAddress property (ggml-org#16576)
  * metal : avoid using Metal's gpuAddress property
  * metal : fix rope kernels buffer check

---------

Signed-off-by: Stefan Savic <[email protected]>
Co-authored-by: Anav Prasad <[email protected]>
Co-authored-by: Aman Gupta <[email protected]>
Co-authored-by: Johannes Gäßler <[email protected]>
Co-authored-by: Jeff Bolz <[email protected]>
Co-authored-by: SavicStefan <[email protected]>
Co-authored-by: Stefan Savic <[email protected]>
Co-authored-by: Georgi Gerganov <[email protected]>
* remove legacy copy-op pointer indirection code
* further removal of copy-op indirection code
* renamed check_node_graph_compatibility_and_refresh_copy_ops function
* origin/master:
  Add server-driven parameter defaults and syncing (ggml-org#16515)
  metal: optimise `GGML_OP_SUM` (ggml-org#16559)
  server : fix img token logs (ggml-org#16595)
  llama-quant: add support for mmproj (ggml-org#16592)
  CUDA: Changing the CUDA scheduling strategy to spin (ggml-org#16585)
  server : fix mtmd checkpoints (ggml-org#16591)
  metal : avoid using Metal's gpuAddress property (ggml-org#16576)
  vulkan: Add ACC_TYPE_VEC2 implementation (ggml-org#16203)
  CUDA + openCL: fix bug in accessing rms_norm->src while doing fusion (ggml-org#16577)
  vulkan: Support FA with K/V in F32 (ggml-org#16543)
  vulkan: Improve build time for MSVC (ggml-org#16545)
  CUDA: enable FA for FP32 KV cache (ggml-org#16546)
  CUDA: use fastdiv + ggml_cuda_mad for mmvf (ggml-org#16557)
  CUDA: add fp kernel for larger batch size MoE (ggml-org#16512)
  cuda : remove legacy copy-op pointer indirection code (ggml-org#16485)
  server : dynamic token limit for prompt cache (ggml-org#16560)
As discussed in PR #16471, this PR removes the legacy copy-op pointer indirection code. This change allows cudaMemcpyAsync to be used instead of the CUDA copy kernel for contiguous F32 tensors, resulting in a ~4% performance improvement for the Nemotron Nano v2 (NemotronH) model on an RTX 5090.
Results:
Weights: bartowski/nvidia_NVIDIA-Nemotron-Nano-9B-v2-GGUF
Quantization: Q4_K_M
Performance before:
Performance after:
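The dispatch decision described above can be sketched on the host as follows. This is only an illustration under stated assumptions: `DType`, `Tensor`, `CopyPath`, `choose_copy_path` and `copy_contiguous_f32` are hypothetical names, not ggml or llama.cpp APIs, and `std::memcpy` stands in for the device-to-device `cudaMemcpyAsync` call so the snippet runs without a GPU.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Hypothetical, simplified tensor description (assumption, not ggml_tensor).
enum class DType { F32, F16 };

struct Tensor {
    DType  type;
    bool   contiguous;
    size_t n_elems;
    void  *data;
};

enum class CopyPath { Memcpy, Kernel };

// Pick the plain memory copy only when both tensors are contiguous F32;
// anything else would need an element-wise copy/convert kernel.
CopyPath choose_copy_path(const Tensor &src, const Tensor &dst) {
    const bool both_f32 = src.type == DType::F32 && dst.type == DType::F32;
    return (both_f32 && src.contiguous && dst.contiguous) ? CopyPath::Memcpy
                                                          : CopyPath::Kernel;
}

// Host stand-in for the fast path. On a GPU this would be roughly:
//   cudaMemcpyAsync(dst.data, src.data, nbytes, cudaMemcpyDeviceToDevice, stream);
void copy_contiguous_f32(const Tensor &src, Tensor &dst) {
    assert(choose_copy_path(src, dst) == CopyPath::Memcpy);
    assert(src.n_elems == dst.n_elems);
    const size_t nbytes = src.n_elems * sizeof(float);
    std::memcpy(dst.data, src.data, nbytes);
}
```

The point of the sketch is that the fast path needs nothing beyond the source and destination addresses and a byte count, so once copy destinations are fixed for a captured graph there is no reason to route contiguous F32 copies through a kernel with an extra pointer lookup.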