
Conversation

@CISC (Collaborator) commented Oct 26, 2025

Before:

  CPY(type_src=f32,type_dst=f16,ne=[512,3072,1,1],permute_src=[0,0,0,0],permute_dst=[0,0,0,0]):                32769 runs -    31.43 us/run -     9216 kB/run -  279.70 GB/s
  CPY(type_src=f32,type_dst=f32,ne=[8192,512,2,1],permute_src=[0,2,1,3],permute_dst=[0,0,0,0]):                 6156 runs -   162.96 us/run -    65536 kB/run -  383.90 GB/s
  CPY(type_src=f32,type_dst=f32,ne=[3072,512,2,1],permute_src=[0,2,1,3],permute_dst=[0,0,0,0]):                16392 runs -    62.51 us/run -    24576 kB/run -  375.07 GB/s
  CPY(type_src=f32,type_dst=q4_0,ne=[8192,512,2,1],permute_src=[0,0,0,0],permute_dst=[0,0,0,0]):                3592 runs -   296.98 us/run -    37376 kB/run -  120.14 GB/s
  CPY(type_src=q4_0,type_dst=f32,ne=[8192,512,2,1],permute_src=[0,0,0,0],permute_dst=[0,0,0,0]):                 898 runs -  5036.64 us/run -    37376 kB/run -    7.08 GB/s

After:

  CPY(type_src=f32,type_dst=f16,ne=[512,3072,1,1],permute_src=[0,0,0,0],permute_dst=[0,0,0,0]):                65538 runs -    15.95 us/run -     9216 kB/run -  550.99 GB/s
  CPY(type_src=f32,type_dst=f16,ne=[8192,512,2,1],permute_src=[0,2,1,3],permute_dst=[0,0,0,0]):                12977 runs -    79.29 us/run -    49152 kB/run -  592.95 GB/s
  CPY(type_src=f32,type_dst=f16,ne=[3072,512,2,1],permute_src=[0,2,1,3],permute_dst=[0,0,0,0]):                32778 runs -    30.57 us/run -    18432 kB/run -  575.63 GB/s
  CPY(type_src=f32,type_dst=q4_0,ne=[8192,512,2,1],permute_src=[0,0,0,0],permute_dst=[0,0,0,0]):                3592 runs -   298.89 us/run -    37376 kB/run -  119.37 GB/s
  CPY(type_src=q4_0,type_dst=f32,ne=[8192,512,2,1],permute_src=[0,0,0,0],permute_dst=[0,0,0,0]):                 898 runs -  5089.87 us/run -    37376 kB/run -    7.00 GB/s

Note/Edit: I fudged the permuted tests by making them contiguous (and changed the type) just to verify that different shapes are OK; normally they would not be faster.
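
For context, the CUDA-side fast path the PR title describes reduces, for contiguous tensors of different types, to a flat element-wise conversion with a single linear index instead of per-element 4-D stride arithmetic. A minimal sketch of that idea, with made-up names rather than the actual ggml-cuda kernels:

    // Illustrative sketch only; function names are made up, not ggml-cuda's.
    #include <cstdint>
    #include <cuda_fp16.h>

    // Flat f32 -> f16 conversion: with both tensors contiguous, a single linear
    // index over the total element count is enough, regardless of the 4-D shape.
    static __global__ void cpy_flat_f32_to_f16(const float * src, half * dst, const int64_t ne) {
        const int64_t i = (int64_t) blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= ne) {
            return;
        }
        dst[i] = __float2half(src[i]);
    }

    static void cpy_contiguous_f32_to_f16(const float * src, half * dst, const int64_t ne, cudaStream_t stream) {
        const int block_size = 256;
        const int64_t num_blocks = (ne + block_size - 1) / block_size;
        cpy_flat_f32_to_f16<<<(unsigned int) num_blocks, block_size, 0, stream>>>(src, dst, ne);
    }

The diff below only touches tests/test-backend-ops.cpp so the permuted perf cases exercise this contiguous path.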

diff --git a/tests/test-backend-ops.cpp b/tests/test-backend-ops.cpp
index 33ac27ff5..97a2bcde2 100644
--- a/tests/test-backend-ops.cpp
+++ b/tests/test-backend-ops.cpp
@@ -2541,6 +2541,8 @@ struct test_cpy : public test_case {
 
         if (_src_use_permute) {
             src = ggml_permute(ctx, src, permute_src[0], permute_src[1], permute_src[2], permute_src[3]);
+            if (type_src == GGML_TYPE_F32 || type_src == GGML_TYPE_F16 || type_src == GGML_TYPE_BF16)
+                src = ggml_cont(ctx, src);
             ggml_set_name(src, "src_permuted");
         }
 
@@ -2549,6 +2551,8 @@ struct test_cpy : public test_case {
 
         if (_dst_use_permute) {
             dst = ggml_permute(ctx, dst, permute_dst[0], permute_dst[1], permute_dst[2], permute_dst[3]);
+            if (type_dst == GGML_TYPE_F32 || type_dst == GGML_TYPE_F16 || type_dst == GGML_TYPE_BF16 || type_dst == GGML_TYPE_I32)
+                dst = ggml_cont(ctx, dst);
             ggml_set_name(dst, "dst_permuted");
         }
 
@@ -7213,8 +7217,8 @@ static std::vector<std::unique_ptr<test_case>> make_test_cases_perf() {
     test_cases.emplace_back(new test_bin_bcast(ggml_add, GGML_TYPE_F32, {4096, 1, 1, 1}, {1, 512, 1, 1}));
 
     test_cases.emplace_back(new test_cpy(GGML_TYPE_F32,  GGML_TYPE_F16,  {512, 3072, 1, 1}));
-    test_cases.emplace_back(new test_cpy(GGML_TYPE_F32,  GGML_TYPE_F32,  {8192, 512, 2, 1}, {0, 2, 1, 3}));
-    test_cases.emplace_back(new test_cpy(GGML_TYPE_F32,  GGML_TYPE_F32,  {3072, 512, 2, 1}, {0, 2, 1, 3}));
+    test_cases.emplace_back(new test_cpy(GGML_TYPE_F32,  GGML_TYPE_F16,  {8192, 512, 2, 1}, {0, 2, 1, 3}));
+    test_cases.emplace_back(new test_cpy(GGML_TYPE_F32,  GGML_TYPE_F16,  {3072, 512, 2, 1}, {0, 2, 1, 3}));
     test_cases.emplace_back(new test_cpy(GGML_TYPE_F32,  GGML_TYPE_Q4_0, {8192, 512, 2, 1}));
     test_cases.emplace_back(new test_cpy(GGML_TYPE_Q4_0, GGML_TYPE_F32,  {8192, 512, 2, 1}));
 

CISC requested a review from JohannesGaessler on October 26, 2025 at 18:43
@CISC (Collaborator, Author) commented Oct 26, 2025

Though, does the shape matter here? We already assert that it's the same number of elements...

Also, we should probably use int64_t for ne here.
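
For reference, when both sides are contiguous only the total element count matters (which ggml_nelements() already returns as an int64_t), so the 4-D shape drops out entirely. A hypothetical stand-alone equivalent, not ggml API, just to make that point:

    #include <cstdint>

    // Hypothetical helper, not part of ggml: the product of the four dimensions is
    // all a flat contiguous copy needs; two tensors with different shapes but the
    // same product are indexed identically.
    static int64_t total_elements(const int64_t ne[4]) {
        return ne[0] * ne[1] * ne[2] * ne[3];
    }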

github-actions bot added the Nvidia GPU (Issues specific to Nvidia GPUs) and ggml (changes relating to the ggml tensor library for machine learning) labels on Oct 26, 2025
@JohannesGaessler (Collaborator) commented:

If the tensors are contiguous, did you try just using cudaMemcpyAsync?

@CISC (Collaborator, Author) commented Oct 26, 2025

> If the tensors are contiguous, did you try just using cudaMemcpyAsync?

That surely only works when types are equal, which is caught at the top.
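
For reference, the equal-type contiguous case mentioned above is the one where a raw byte copy applies; a sketch of that early-out, in assumed form rather than quoted from the ggml-cuda code:

    #include <cstddef>
    #include <cuda_runtime.h>

    // Assumed form, for illustration: same type plus contiguity on both sides means
    // no conversion kernel is needed, only a byte-for-byte device-to-device copy.
    static void cpy_same_type_contiguous(void * dst, const void * src, size_t nbytes, cudaStream_t stream) {
        cudaMemcpyAsync(dst, src, nbytes, cudaMemcpyDeviceToDevice, stream);
    }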

CISC changed the title from "cuda : use fast copy when src and dst are contiguous and same shape" to "cuda : use fast copy when src and dst are contiguous" on Oct 26, 2025
CISC changed the title from "cuda : use fast copy when src and dst are contiguous" to "cuda : use fast copy when src and dst are of different type and contiguous" on Oct 26, 2025
@JohannesGaessler (Collaborator) left a comment:

Ah, you're right, sorry. Do you need me to click the merge button or do you have the permissions to do it yourself?

@CISC (Collaborator, Author) commented Oct 26, 2025

> Ah, you're right, sorry. Do you need me to click the merge button or do you have the permissions to do it yourself?

I have the power. :)

CISC merged commit bd562fe into master on Oct 26, 2025; 72 checks passed.
CISC deleted the cisc/cuda-cont-shape-cpy branch on October 26, 2025 at 20:31.
pwilkin pushed a commit to pwilkin/llama.cpp that referenced this pull request Oct 27, 2025
…guous (ggml-org#16789)

* use fast copy when src and dst are contiguous and same shape

* use int64_t ne and ignore shape
theo77186 pushed a commit to theo77186/llama.cpp that referenced this pull request Oct 28, 2025
…guous (ggml-org#16789)

* use fast copy when src and dst are contiguous and same shape

* use int64_t ne and ignore shape