
Conversation

@CISC (Collaborator) commented Oct 26, 2025

Before:

  CPY(type_src=f32,type_dst=f16,ne=[512,3072,1,1],permute_src=[0,0,0,0],permute_dst=[0,0,0,0]):                32769 runs -    31.43 us/run -     9216 kB/run -  279.70 GB/s
  CPY(type_src=f32,type_dst=f32,ne=[8192,512,2,1],permute_src=[0,2,1,3],permute_dst=[0,0,0,0]):                 6156 runs -   162.96 us/run -    65536 kB/run -  383.90 GB/s
  CPY(type_src=f32,type_dst=f32,ne=[3072,512,2,1],permute_src=[0,2,1,3],permute_dst=[0,0,0,0]):                16392 runs -    62.51 us/run -    24576 kB/run -  375.07 GB/s
  CPY(type_src=f32,type_dst=q4_0,ne=[8192,512,2,1],permute_src=[0,0,0,0],permute_dst=[0,0,0,0]):                3592 runs -   296.98 us/run -    37376 kB/run -  120.14 GB/s
  CPY(type_src=q4_0,type_dst=f32,ne=[8192,512,2,1],permute_src=[0,0,0,0],permute_dst=[0,0,0,0]):                 898 runs -  5036.64 us/run -    37376 kB/run -    7.08 GB/s

After:

  CPY(type_src=f32,type_dst=f16,ne=[512,3072,1,1],permute_src=[0,0,0,0],permute_dst=[0,0,0,0]):                65538 runs -    15.95 us/run -     9216 kB/run -  550.99 GB/s
  CPY(type_src=f32,type_dst=f16,ne=[8192,512,2,1],permute_src=[0,2,1,3],permute_dst=[0,0,0,0]):                12977 runs -    79.29 us/run -    49152 kB/run -  592.95 GB/s
  CPY(type_src=f32,type_dst=f16,ne=[3072,512,2,1],permute_src=[0,2,1,3],permute_dst=[0,0,0,0]):                32778 runs -    30.57 us/run -    18432 kB/run -  575.63 GB/s
  CPY(type_src=f32,type_dst=q4_0,ne=[8192,512,2,1],permute_src=[0,0,0,0],permute_dst=[0,0,0,0]):                3592 runs -   298.89 us/run -    37376 kB/run -  119.37 GB/s
  CPY(type_src=q4_0,type_dst=f32,ne=[8192,512,2,1],permute_src=[0,0,0,0],permute_dst=[0,0,0,0]):                 898 runs -  5089.87 us/run -    37376 kB/run -    7.00 GB/s

Note/Edit: I fudged the permuted tests by making them contiguous (and changed the type) just to verify that different shapes are OK; normally they would not be faster.
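
For context, the CUDA-side fast path the PR title describes reduces, for contiguous tensors of different types, to a flat element-wise conversion with a single linear index instead of per-element 4-D stride arithmetic. A minimal sketch of that idea, with made-up names rather than the actual ggml-cuda kernels:

    // Illustrative sketch only; function names are made up, not ggml-cuda's.
    #include <cstdint>
    #include <cuda_fp16.h>

    // Flat f32 -> f16 conversion: with both tensors contiguous, a single linear
    // index over the total element count is enough, regardless of the 4-D shape.
    static __global__ void cpy_flat_f32_to_f16(const float * src, half * dst, const int64_t ne) {
        const int64_t i = (int64_t) blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= ne) {
            return;
        }
        dst[i] = __float2half(src[i]);
    }

    static void cpy_contiguous_f32_to_f16(const float * src, half * dst, const int64_t ne, cudaStream_t stream) {
        const int block_size = 256;
        const int64_t num_blocks = (ne + block_size - 1) / block_size;
        cpy_flat_f32_to_f16<<<(unsigned int) num_blocks, block_size, 0, stream>>>(src, dst, ne);
    }

The diff below only touches tests/test-backend-ops.cpp so the permuted perf cases exercise this contiguous path.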

diff --git a/tests/test-backend-ops.cpp b/tests/test-backend-ops.cpp
index 33ac27ff5..97a2bcde2 100644
--- a/tests/test-backend-ops.cpp
+++ b/tests/test-backend-ops.cpp
@@ -2541,6 +2541,8 @@ struct test_cpy : public test_case {
 
         if (_src_use_permute) {
             src = ggml_permute(ctx, src, permute_src[0], permute_src[1], permute_src[2], permute_src[3]);
+            if (type_src == GGML_TYPE_F32 || type_src == GGML_TYPE_F16 || type_src == GGML_TYPE_BF16)
+                src = ggml_cont(ctx, src);
             ggml_set_name(src, "src_permuted");
         }
 
@@ -2549,6 +2551,8 @@ struct test_cpy : public test_case {
 
         if (_dst_use_permute) {
             dst = ggml_permute(ctx, dst, permute_dst[0], permute_dst[1], permute_dst[2], permute_dst[3]);
+            if (type_dst == GGML_TYPE_F32 || type_dst == GGML_TYPE_F16 || type_dst == GGML_TYPE_BF16 || type_dst == GGML_TYPE_I32)
+                dst = ggml_cont(ctx, dst);
             ggml_set_name(dst, "dst_permuted");
         }
 
@@ -7213,8 +7217,8 @@ static std::vector<std::unique_ptr<test_case>> make_test_cases_perf() {
     test_cases.emplace_back(new test_bin_bcast(ggml_add, GGML_TYPE_F32, {4096, 1, 1, 1}, {1, 512, 1, 1}));
 
     test_cases.emplace_back(new test_cpy(GGML_TYPE_F32,  GGML_TYPE_F16,  {512, 3072, 1, 1}));
-    test_cases.emplace_back(new test_cpy(GGML_TYPE_F32,  GGML_TYPE_F32,  {8192, 512, 2, 1}, {0, 2, 1, 3}));
-    test_cases.emplace_back(new test_cpy(GGML_TYPE_F32,  GGML_TYPE_F32,  {3072, 512, 2, 1}, {0, 2, 1, 3}));
+    test_cases.emplace_back(new test_cpy(GGML_TYPE_F32,  GGML_TYPE_F16,  {8192, 512, 2, 1}, {0, 2, 1, 3}));
+    test_cases.emplace_back(new test_cpy(GGML_TYPE_F32,  GGML_TYPE_F16,  {3072, 512, 2, 1}, {0, 2, 1, 3}));
     test_cases.emplace_back(new test_cpy(GGML_TYPE_F32,  GGML_TYPE_Q4_0, {8192, 512, 2, 1}));
     test_cases.emplace_back(new test_cpy(GGML_TYPE_Q4_0, GGML_TYPE_F32,  {8192, 512, 2, 1}));
 

CISC requested a review from JohannesGaessler on October 26, 2025 at 18:43
@CISC (Collaborator, Author) commented Oct 26, 2025

Though, does the shape matter here? We already assert that it's the same number of elements...

Also, we should probably use int64_t for ne here.
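
For reference, when both sides are contiguous only the total element count matters (which ggml_nelements() already returns as an int64_t), so the 4-D shape drops out entirely. A hypothetical stand-alone equivalent, not ggml API, just to make that point:

    #include <cstdint>

    // Hypothetical helper, not part of ggml: the product of the four dimensions is
    // all a flat contiguous copy needs; two tensors with different shapes but the
    // same product are indexed identically.
    static int64_t total_elements(const int64_t ne[4]) {
        return ne[0] * ne[1] * ne[2] * ne[3];
    }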

github-actions bot added the Nvidia GPU (Issues specific to Nvidia GPUs) and ggml (changes relating to the ggml tensor library for machine learning) labels on Oct 26, 2025
@JohannesGaessler (Collaborator) commented:

If the tensors are contiguous, did you try just using cudaMemcpyAsync?

@CISC (Collaborator, Author) commented Oct 26, 2025

> If the tensors are contiguous, did you try just using cudaMemcpyAsync?

That surely only works when types are equal, which is caught at the top.
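
For reference, the equal-type contiguous case mentioned above is the one where a raw byte copy applies; a sketch of that early-out, in assumed form rather than quoted from the ggml-cuda code:

    #include <cstddef>
    #include <cuda_runtime.h>

    // Assumed form, for illustration: same type plus contiguity on both sides means
    // no conversion kernel is needed, only a byte-for-byte device-to-device copy.
    static void cpy_same_type_contiguous(void * dst, const void * src, size_t nbytes, cudaStream_t stream) {
        cudaMemcpyAsync(dst, src, nbytes, cudaMemcpyDeviceToDevice, stream);
    }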

CISC changed the title from "cuda : use fast copy when src and dst are contiguous and same shape" to "cuda : use fast copy when src and dst are contiguous" on Oct 26, 2025
CISC changed the title from "cuda : use fast copy when src and dst are contiguous" to "cuda : use fast copy when src and dst are of different type and contiguous" on Oct 26, 2025
@JohannesGaessler (Collaborator) left a comment:

Ah, you're right, sorry. Do you need me to click the merge button or do you have the permissions to do it yourself?

@CISC (Collaborator, Author) commented Oct 26, 2025

> Ah, you're right, sorry. Do you need me to click the merge button or do you have the permissions to do it yourself?

I have the power. :)

CISC merged commit bd562fe into master on Oct 26, 2025; 72 checks passed.
CISC deleted the cisc/cuda-cont-shape-cpy branch on October 26, 2025 at 20:31.
pwilkin pushed a commit to pwilkin/llama.cpp that referenced this pull request Oct 27, 2025
…guous (ggml-org#16789)

* use fast copy when src and dst are contiguous and same shape

* use int64_t ne and ignore shape
theo77186 pushed a commit to theo77186/llama.cpp that referenced this pull request Oct 28, 2025
…guous (ggml-org#16789)

* use fast copy when src and dst are contiguous and same shape

* use int64_t ne and ignore shape