Skip to content

Revert "Faster prompt processing on CUDA (#1687)"#1700

Merged
ikawrakow merged 1 commit into
mainfrom
ik/revert_1687
Apr 28, 2026
Merged

Revert "Faster prompt processing on CUDA (#1687)"#1700
ikawrakow merged 1 commit into
mainfrom
ik/revert_1687

Conversation

@ikawrakow
Copy link
Copy Markdown
Owner

This reverts commit 3a945af.

Apparently #1687 is causing issues for some people

If #1687 is improving your performance and you don't want it to be reverted, come here to object.

@magikRUKKOLA
Copy link
Copy Markdown

Interestingly, I don't see much difference in PP either way for the full offload of Qwen3.5-397B-IQ4_KSS (that is, mist.bin).

main


 main: n_kv_max = 260096, n_batch = 3072, n_ubatch = 3072, flash_attn = 1, n_gpu_layers = 99, n_threads = 1, n_threads_batch = 1

 |    PP |     TG |   N_KV |   T_PP s | S_PP t/s |   T_TG s | S_TG t/s |
 |-------|--------|--------|----------|----------|----------|----------|
 |  3072 |    768 |      0 |    1.965 |  1563.50 |   13.081 |    58.71 |
 |  3072 |    768 |   3072 |    2.028 |  1515.02 |   13.270 |    57.87 |
 |  3072 |    768 |   6144 |    2.036 |  1508.78 |   13.411 |    57.27 |
 |  3072 |    768 |   9216 |    2.080 |  1477.09 |   13.588 |    56.52 |
 |  3072 |    768 |  12288 |    2.122 |  1447.61 |   13.855 |    55.43 |
 |  3072 |    768 |  15360 |    2.172 |  1414.67 |   14.047 |    54.67 |
 |  3072 |    768 |  18432 |    2.220 |  1383.85 |   14.178 |    54.17 |
 |  3072 |    768 |  21504 |    2.262 |  1357.81 |   14.467 |    53.09 |
 |  3072 |    768 |  24576 |    2.315 |  1327.24 |   14.592 |    52.63 |
 |  3072 |    768 |  27648 |    2.355 |  1304.30 |   14.663 |    52.38 |
 |  3072 |    768 |  30720 |    2.417 |  1271.17 |   14.751 |    52.07 |
 |  3072 |    768 |  33792 |    2.457 |  1250.21 |   15.013 |    51.16 |
 |  3072 |    768 |  36864 |    2.506 |  1226.01 |   15.121 |    50.79 |
 |  3072 |    768 |  39936 |    2.544 |  1207.62 |   15.281 |    50.26 |
 |  3072 |    768 |  43008 |    2.596 |  1183.19 |   15.579 |    49.30 |
 |  3072 |    768 |  46080 |    2.639 |  1163.92 |   15.691 |    48.95 |
 |  3072 |    768 |  49152 |    2.676 |  1147.78 |   15.839 |    48.49 |
 |  3072 |    768 |  52224 |    2.729 |  1125.65 |   16.076 |    47.77 |
 |  3072 |    768 |  55296 |    2.773 |  1107.90 |   16.215 |    47.36 |
 |  3072 |    768 |  58368 |    2.820 |  1089.26 |   16.430 |    46.74 |
 |  3072 |    768 |  61440 |    2.870 |  1070.41 |   16.552 |    46.40 |
 |  3072 |    768 |  64512 |    2.904 |  1057.85 |   16.882 |    45.49 |

this PR (revert):

|    PP |     TG |   N_KV |   T_PP s | S_PP t/s |   T_TG s | S_TG t/s |
|-------|--------|--------|----------|----------|----------|----------|
|  3072 |    768 |      0 |    1.977 |  1554.02 |   13.093 |    58.66 |
|  3072 |    768 |   3072 |    2.054 |  1495.27 |   13.254 |    57.94 |
|  3072 |    768 |   6144 |    2.049 |  1499.53 |   13.434 |    57.17 |
|  3072 |    768 |   9216 |    2.090 |  1469.65 |   13.596 |    56.49 |
|  3072 |    768 |  12288 |    2.133 |  1439.92 |   13.876 |    55.35 |
|  3072 |    768 |  15360 |    2.174 |  1412.97 |   14.066 |    54.60 |
|  3072 |    768 |  18432 |    2.227 |  1379.72 |   14.201 |    54.08 |
|  3072 |    768 |  21504 |    2.265 |  1356.50 |   14.445 |    53.17 |
|  3072 |    768 |  24576 |    2.323 |  1322.55 |   14.601 |    52.60 |
|  3072 |    768 |  27648 |    2.367 |  1297.60 |   14.710 |    52.21 |
|  3072 |    768 |  30720 |    2.422 |  1268.19 |   14.791 |    51.92 |
|  3072 |    768 |  33792 |    2.467 |  1245.29 |   15.056 |    51.01 |
|  3072 |    768 |  36864 |    2.500 |  1228.94 |   15.193 |    50.55 |
|  3072 |    768 |  39936 |    2.556 |  1201.87 |   15.325 |    50.12 |
|  3072 |    768 |  43008 |    2.596 |  1183.37 |   15.539 |    49.42 |
|  3072 |    768 |  46080 |    2.651 |  1158.74 |   15.688 |    48.95 |
|  3072 |    768 |  49152 |    2.688 |  1143.04 |   15.807 |    48.59 |
|  3072 |    768 |  52224 |    2.727 |  1126.43 |   16.060 |    47.82 |
|  3072 |    768 |  55296 |    2.779 |  1105.40 |   16.229 |    47.32 |
|  3072 |    768 |  58368 |    2.825 |  1087.59 |   16.418 |    46.78 |
|  3072 |    768 |  61440 |    2.852 |  1077.08 |   16.534 |    46.45 |
|  3072 |    768 |  64512 |    2.917 |  1053.21 |   16.821 |    45.66 |

@Ph0rk0z
Copy link
Copy Markdown

Ph0rk0z commented Apr 27, 2026

Its not great with fully offloaded gemma.

                 CUDA_VISIBLE_DEVICES=3,2,1,0 ./bin/llama-sweep-bench \
-m /Gemma4-31b-it/Gemma4-31b-it-Q8_0.gguf \
--alias gemma \
-t 48 \
-c 65536 \
-ngl 99 \
-ctk q8_0 \
-ctv q8_0 \
-fa 1 \
-ub 2048 \
-sm graph \
--jinja \
--no-mmap \
--verbose \
-khad \
-vhad \
-grt bf16 \
-gr \
-wgt 1 \

gemma bad

PP TG N_KV T_PP s S_PP t/s T_TG s S_TG t/s
2048 512 0 3.952 518.16 9.453 54.16
2048 512 2048 3.718 550.86 9.839 52.04
2048 512 4096 3.752 545.87 9.953 51.44
2048 512 6144 3.809 537.72 10.044 50.98
2048 512 8192 3.842 533.06 10.136 50.51
2048 512 10240 3.863 530.13 10.221 50.09
2048 512 12288 3.868 529.52 10.330 49.56
2048 512 14336 3.931 520.98 10.426 49.11

gemma good

PP TG N_KV T_PP s S_PP t/s T_TG s S_TG t/s
2048 512 0 0.996 2055.59 9.462 54.11
2048 512 2048 0.939 2180.85 9.830 52.09
2048 512 4096 0.969 2112.93 9.954 51.44
2048 512 6144 1.003 2042.47 10.055 50.92
2048 512 8192 1.026 1995.42 10.135 50.52
2048 512 10240 1.056 1940.13 10.216 50.12
2048 512 12288 1.085 1887.40 10.334 49.55
2048 512 14336 1.114 1838.25 10.433 49.07

Simply duplicated main and reverted the commit.

@magikRUKKOLA
Copy link
Copy Markdown

magikRUKKOLA commented Apr 28, 2026

@Ph0rk0z

But why would you use more than one thread with full offload ?

[EDIT]:

Anyways.

main:


main: n_kv_max = 65536, n_batch = 2048, n_ubatch = 2048, flash_attn = 1, n_gpu_layers = 99, n_threads = 1, n_threads_batch = 1

|    PP |     TG |   N_KV |   T_PP s | S_PP t/s |   T_TG s | S_TG t/s |
|-------|--------|--------|----------|----------|----------|----------|
|  2048 |    512 |      0 |    0.588 |  3481.88 |    9.377 |    54.60 |
|  2048 |    512 |   2048 |    0.621 |  3296.41 |    9.694 |    52.82 |

pr:


main: n_kv_max = 65536, n_batch = 2048, n_ubatch = 2048, flash_attn = 1, n_gpu_layers = 99, n_threads = 1, n_threads_batch = 1

|    PP |     TG |   N_KV |   T_PP s | S_PP t/s |   T_TG s | S_TG t/s |
|-------|--------|--------|----------|----------|----------|----------|
|  2048 |    512 |      0 |    0.590 |  3470.16 |    9.399 |    54.48 |
|  2048 |    512 |   2048 |    0.623 |  3285.91 |    9.683 |    52.88 |

/opt/ik_llama.cpp/ik_llama.cpp/build/bin/llama-sweep-bench \
    --warmup-batch \
    --model /opt/DevQuasar/Gemma4-31b-it/Q8_0/google.gemma-4-31B-it.Q8_0.gguf \
    --alias DevQuasar/gemma4-31b \
    -ngl 99 \
    --threads 1 \
    --threads-batch 1 \
    -ctk q8_0 \
    -ctv q8_0 \
    --temp 0.0 --top-k 0 --top-p 1.0 \
    --ctx-size $((64 * 1024)) \
    -fa 1 \
    --jinja \
    --no-mmap \
    --mlock \
    --verbose \
    -ub $((2 * 1024)) \
    -gr \
    --graph-reduce-type bf16 \
    -wgt 1 \
    --split-mode graph

@Ph0rk0z
Copy link
Copy Markdown

Ph0rk0z commented Apr 28, 2026

But why would you use more than one thread with full offload ?

I don't. I just copy and paste configs. The thread count on fully offloaded models isn't really respected. It will use one unless you enable that setting to have one thread per GPU. I think it was async.

Wonder why your system isn't affected. Extra bandwidth? Fancy SIMD? Driver/cuda version?

@magikRUKKOLA
Copy link
Copy Markdown

Wonder why your system isn't affected. Extra bandwidth?

x16 GPUs. Cuda 13.1

@ikawrakow
Copy link
Copy Markdown
Owner Author

@Ph0rk0z

Wonder why your system isn't affected. Extra bandwidth? Fancy SIMD? Driver/cuda version?

No, the wondering goes the other way around: why is your system affected in this way, while no other system is? Obviously the change is not affected in any way shape or form by PCI-E speed. It might be affected by CUDA version (it wouldn't be the first time that we hear a recent CUDA version miscompiling some code).

@Ph0rk0z
Copy link
Copy Markdown

Ph0rk0z commented Apr 28, 2026

I am on cuda 12.6 so it's not the newer version. The driver is the latest though.

@resynth
Copy link
Copy Markdown

resynth commented Apr 28, 2026

I'm on CUDA 12.1 with a 3080ti mobile on Ubuntu 24.04. I noticed a significant increase in prompt processing t/s for Qwen3.6 27B with this commit, sad to see it go!

@Ph0rk0z
Copy link
Copy Markdown

Ph0rk0z commented Apr 28, 2026

You can just revert it in your local copy and keep using it, probably for a while.

@magikRUKKOLA
Copy link
Copy Markdown

@Ph0rk0z

Wonder why your system isn't affected. Extra bandwidth? Fancy SIMD? Driver/cuda version?

Hardware details: #1668 (comment)

Additionally, I have used the following GPUs:

export CUDA_VISIBLE_DEVICES=10,7,6,0

@magikRUKKOLA
Copy link
Copy Markdown

magikRUKKOLA commented Apr 29, 2026

@Ph0rk0z

And these are the results for:

export CUDA_VISIBLE_DEVICES=10,0,1,2

main: n_kv_max = 65536, n_batch = 2048, n_ubatch = 2048, flash_attn = 1, n_gpu_layers = 99, n_threads = 1, n_threads_batch = 1

|    PP |     TG |   N_KV |   T_PP s | S_PP t/s |   T_TG s | S_TG t/s |
|-------|--------|--------|----------|----------|----------|----------|
|  2048 |    512 |      0 |    0.997 |  2053.27 |    9.203 |    55.63 |
|  2048 |    512 |   2048 |    1.035 |  1978.50 |    9.536 |    53.69 |
|  2048 |    512 |   4096 |    1.055 |  1940.62 |    9.633 |    53.15 |
|  2048 |    512 |   6144 |    1.078 |  1900.08 |    9.703 |    52.76 |
|  2048 |    512 |   8192 |    1.099 |  1863.40 |    9.788 |    52.31 |
^C

@magikRUKKOLA
Copy link
Copy Markdown

@Ph0rk0z

Are you using the GPUs of different families?

Check out the Qwen3.5 explanation:

Details

1. Architecture-Specific Kernel Launch Configuration (The "Smoking Gun")

The most critical change causing heterogeneous issues is found in ggml/src/ggml-cuda/mmq_id_common.cuh.

In the optimized version (Reverted):

// Optimized Code
const dim3 block_nums_stream_k(GGML_CUDA_CC_IS_NVIDIA(cc) && tiles_efficiency_percent >= 90 ? ntiles_dst : nsm, 1, 1);

In the reverted version:

// Reverted Code
const dim3 block_nums_stream_k(nsm, 1, 1);

Why this breaks heterogeneous setups:

  • cc (Compute Capability): The optimized code queried the GPU's Compute Capability (e.g., 8.6 for 3090, 8.9 for 4090, 9.0 for Blackwell) to decide how many CUDA blocks to launch (ntiles_dst vs nsm).
  • Single Source of Truth: In a multi-GPU llama.cpp setup, this configuration is typically calculated once on the host CPU based on the primary GPU's properties.
  • The Mismatch: If your primary GPU is a 4090 (high efficiency), the code might choose ntiles_dst (a larger, more aggressive grid size). However, if you have a 3090 in the same setup, it receives the same kernel launch configuration. The 3090 might not have the same scheduling resources or SM count to handle that specific grid size efficiently, leading to occupancy drops or work imbalance.
  • The Fix: The revert forces a standard block count (nsm), which is a conservative, robust default that works consistently across different GPU architectures, preventing the faster card from waiting on the slower card due to misconfigured parallelism.

2. Register Pressure and Occupancy Variance

The optimization introduced a custom integer division implementation (fastdiv.cuh) to replace standard / and % operators.

In the optimized version:

  • It precomputed "magic numbers" (mp, L) on the CPU.
  • It passed these as uint3 structures to the GPU kernel.
  • It used __umulhi (multiply high) instructions inside the kernel.

Why this hurts heterogeneous setups:

  • Register Usage: While __umulhi is faster than division, the logic required to unpack the uint3 struct and manage the intermediate variables (hi, mp, L) increases register pressure (the number of registers each thread needs).
  • Hardware Differences: Different GPU families have different register file sizes per Streaming Multiprocessor (SM).
    • RTX 4090/5090: Have massive register files. The extra registers required by fastdiv might not impact them at all.
    • RTX 3090: Has a smaller register file per SM. The extra pressure might force the compiler to spill registers to local memory (slow) or reduce occupancy (fewer active warps).
  • The Bottleneck: In tensor-parallel inference, all GPUs must synchronize at every layer. If the fastdiv optimization causes the 3090 to run at 50% occupancy while the 4090 runs at 100%, the entire system waits for the 3090.
  • The Fix: Standard integer division (/, %) is handled by the CUDA compiler (NVCC). The compiler is highly optimized to balance register usage vs. instruction latency for the specific architecture it is compiling for. Reverting to standard operators allows the compiler to make the best trade-off for each specific GPU in the mix, rather than forcing a manual optimization that favors newer architectures.

Summary of the Regression

Feature Optimized Version (#1687) Reverted Version (#1700) Impact on Heterogeneous Setup
Division Math Manual fastdiv (Multiply + Shift) Standard / and % fastdiv increased register pressure, hurting older/smaller GPUs (3090) more than newer ones (4090/5090).
Grid Size Dynamic based on Compute Capability Static (nsm) Tuning for one GPU's architecture caused load imbalance on other GPU families in the same rig.
Kernel Args Passed uint3 magic values Passed plain int Reduced kernel argument overhead and complexity.

Conclusion:
The author reverted the commit because the "optimization" was non-portable. It assumed a homogeneous environment where all GPUs could benefit equally from aggressive register usage and architecture-specific grid sizing. In a heterogeneous setup, the "weakest link" (usually the older GPU suffering from register spilling or bad grid config) dictated the overall speed, making the system slower than if it had just used standard, compiler-optimized division.

@Ph0rk0z
Copy link
Copy Markdown

Ph0rk0z commented Apr 29, 2026

Nope, 4x 3090. I have exactly what I had before. There is a 2080ti 22g but I use it to run co-models like image gen.

The only thing I noticed is that when nvcc was compiling with this commit it mentioned something about being unable to determine "native" arch and falling back to default. I don't think I saw it happen when I recompiled after removing the commit.

Relating back to your AI explanation, it's possible that an issue arises since I'm compiling SM75 code in addition to SM86, where people who had success with it were only compiling ampere+? I know this matters on other engines that use python and sometimes I have to do CUDA_VISIBLE_DEVICES but was under the impression IK compiles for all arch by default.

@magikRUKKOLA
Copy link
Copy Markdown

@Ph0rk0z

There is a tool called nsys.

You could run the ik_llama.cpp with say like:

nsys profile --stats=true -o profile_report \

Then:

 nsys stats profile_report.nsys-rep > nsys.txt

You should be able to get something like this:

(Qwen3.5)

Details
File: /root/nsys.txt
```txt

NOTICE: Existing SQLite export found: profile_report.sqlite
        It is assumed file was previously exported from: profile_report.nsys-rep
        Consider using --force-export=true if needed.

Processing [profile_report.sqlite] with [/opt/nvidia/nsight-systems/2025.5.2/host-linux-x64/reports/nvtx_sum.py]... 

 ** NVTX Range Summary (nvtx_sum):

 Time (%)  Total Time (ns)  Instances     Avg (ns)         Med (ns)        Min (ns)       Max (ns)     StdDev (ns)   Style          Range        
 --------  ---------------  ---------  ---------------  ---------------  -------------  -------------  -----------  -------  --------------------
    100.0    2,272,780,509          1  2,272,780,509.0  2,272,780,509.0  2,272,780,509  2,272,780,509          0.0  PushPop  NCCL:ncclCommInitAll

Processing [profile_report.sqlite] with [/opt/nvidia/nsight-systems/2025.5.2/host-linux-x64/reports/osrt_sum.py]... 

 ** OS Runtime Summary (osrt_sum):

 Time (%)   Total Time (ns)    Num Calls      Avg (ns)          Med (ns)        Min (ns)       Max (ns)        StdDev (ns)              Name         
 --------  ------------------  ---------  ----------------  -----------------  -----------  ---------------  ----------------  ----------------------
     66.2  15,501,550,699,372        175  88,580,289,710.7  119,358,760,327.0      127,606  119,396,891,292  49,401,577,097.6  pthread_cond_wait     
     25.3   5,921,719,394,809     43,563     135,934,609.5      100,131,793.0        1,001    1,001,646,475     177,680,497.6  poll                  
      7.5   1,758,339,394,874      3,201     549,309,401.7      500,092,414.0  500,026,246    4,000,581,967     412,122,448.2  pthread_cond_timedwait
      0.5     119,326,496,862          3  39,775,498,954.0   26,570,833,337.0    2,480,972   92,753,182,553  47,764,480,071.3  accept4               
      0.3      66,533,425,147      2,191      30,366,693.4       21,298,186.0      139,199    1,000,098,671      91,201,274.1  pthread_cond_clockwait
      0.1      21,307,771,608      2,198       9,694,163.6           73,081.0        1,002      134,270,142      31,176,538.4  fread                 
      0.0       7,056,768,832     14,314         492,997.7           12,995.0        1,001      546,046,279      11,927,893.2  pthread_mutex_lock    
      0.0       5,246,132,863     31,284         167,693.8           86,507.0        1,002      393,578,889       2,744,196.8  ioctl                 
      0.0       2,143,254,025     31,070          68,981.5            1,764.0        1,042    2,082,159,912      11,812,533.3  accept                
      0.0         974,544,420     17,558          55,504.3           55,146.0        7,865          130,291           3,643.4  usleep                
      0.0         458,032,939        239       1,916,455.8           15,661.0        1,002       44,890,445       8,487,567.9  pthread_rwlock_wrlock 
      0.0         204,298,554        520         392,881.8            4,609.0        1,452      120,489,446       6,318,981.8  mmap                  
      0.0         160,673,326        761         211,134.5          202,219.0      144,298          684,329          39,673.0  pthread_create        
      0.0         143,855,395      2,447          58,788.5            5,060.0        1,001        4,110,213         185,269.4  recv                  
      0.0         137,720,979      3,446          39,965.5            7,515.0        1,012        3,910,789         181,087.2  send                  
      0.0         133,427,748        671         198,849.1           12,274.0        8,125       59,572,457       2,808,856.4  pthread_join          
      0.0          55,659,866         11       5,059,987.8        5,057,219.0    5,056,007        5,090,242          10,051.1  nanosleep             
      0.0          38,478,797        445          86,469.2            3,777.0        1,633       36,528,346       1,731,403.0  open64                
      0.0          32,906,539        822          40,032.3            3,337.0        1,002       29,565,623       1,031,082.0  fopen                 
      0.0          28,616,474          1      28,616,474.0       28,616,474.0   28,616,474       28,616,474               0.0  mlock                 
      0.0          26,060,056          4       6,515,014.0           15,074.0        1,012       26,028,896      13,009,256.6  pthread_rwlock_rdlock 
      0.0          12,837,818        254          50,542.6            8,061.0        4,128        1,228,488         161,632.1  mmap64                
      0.0           8,947,655        101          88,590.6           49,626.0       29,336          482,810         111,063.6  sem_timedwait         
      0.0           8,040,737      2,175           3,696.9            3,536.0        1,042          193,573           4,252.7  pthread_cond_broadcast
      0.0           1,752,550         81          21,636.4           15,961.0       10,210           63,532          12,773.2  connect               
      0.0           1,465,576        409           3,583.3            2,886.0        1,964           32,613           3,192.9  munmap                
      0.0           1,325,431        265           5,001.6            4,960.0        1,002           15,350           2,556.4  fputs                 
      0.0           1,108,889        152           7,295.3            6,032.0        1,042           46,630           6,876.6  shutdown              
      0.0           1,082,664        306           3,538.1            2,274.0        1,001           18,636           2,647.0  fclose                
      0.0             912,289         11          82,935.4            3,507.0        1,032          422,965         166,857.8  futex                 
      0.0             808,713        213           3,796.8            2,665.0        1,002           55,487           5,004.7  close                 
      0.0             723,805         59          12,267.9           10,430.0        1,252           35,027           8,665.0  open                  
      0.0             685,331        136           5,039.2            4,053.0        1,533           15,169           2,704.3  socket                
      0.0             589,864        426           1,384.7            1,122.0        1,002            8,065             880.1  read                  
      0.0             251,673        111           2,267.3            1,894.0        1,222            6,803           1,097.4  write                 
      0.0             122,066         49           2,491.1            1,783.0        1,102            9,708           1,947.6  bind                  
      0.0             122,038         46           2,653.0            2,716.0        1,243            5,331           1,037.9  pthread_cond_signal   
      0.0             109,812          4          27,453.0           26,982.0       21,722           34,126           6,501.3  fgets                 
      0.0              54,064          9           6,007.1            1,122.0        1,022           23,565           8,328.0  fflush                
      0.0              33,863         25           1,354.5            1,402.0        1,002            1,713             222.8  fcntl                 
      0.0              33,093         11           3,008.5            3,196.0        1,102            3,667             695.3  fwrite                
      0.0              30,720         18           1,706.7            1,363.0        1,022            3,757             811.7  listen                
      0.0              14,458          4           3,614.5            3,562.0        2,184            5,150           1,395.8  pipe2                 
      0.0               6,403          1           6,403.0            6,403.0        6,403            6,403               0.0  pipe                  
      0.0               1,533          1           1,533.0            1,533.0        1,533            1,533               0.0  signal                
      0.0               1,022          1           1,022.0            1,022.0        1,022            1,022               0.0  sigaction             

Processing [profile_report.sqlite] with [/opt/nvidia/nsight-systems/2025.5.2/host-linux-x64/reports/cuda_api_sum.py]... 

 ** CUDA API Summary (cuda_api_sum):

 Time (%)  Total Time (ns)  Num Calls    Avg (ns)      Med (ns)    Min (ns)    Max (ns)     StdDev (ns)                      Name                   
 --------  ---------------  ---------  -------------  -----------  ---------  -----------  -------------  ------------------------------------------
     53.6   44,750,711,629    167,886      266,554.2      3,917.0        721  360,418,475    4,826,715.0  cudaStreamSynchronize                     
     26.3   21,910,533,898    723,057       30,302.6      4,599.0        130   44,618,771      496,649.7  cudaMemcpyAsync                           
      7.1    5,961,609,486    520,180       11,460.7     10,199.0      5,681      818,388        4,883.3  cudaGraphLaunch_v10000                    
      4.2    3,497,717,194    699,431        5,000.8      4,268.0        802   16,467,364       54,804.9  cudaLaunchKernel                          
      4.2    3,492,644,193  1,070,058        3,264.0      3,006.0        861   10,430,747       19,868.0  cudaStreamWaitEvent                       
      1.4    1,161,651,156          7  165,950,165.1  2,453,560.0     30,128  752,001,317  298,537,958.4  cudaMallocHost                            
      1.0      859,379,576  1,071,054          802.4        701.0        591      845,219        1,646.0  cudaEventRecord                           
      0.6      509,333,392      4,186      121,675.4    114,525.5      3,617    3,043,416      234,708.0  cudaMalloc                                
      0.3      285,405,233          7   40,772,176.1  1,132,443.0     27,793  178,824,759   71,971,883.7  cudaFreeHost                              
      0.3      210,656,795        886      237,761.6    130,381.5     13,637   41,705,064    1,708,539.6  cudaGraphInstantiate_v12000               
      0.2      196,708,266     23,904        8,229.1      7,855.0        230       93,500        2,986.4  cudaMemcpyPeerAsync                       
      0.1      114,213,406    699,431          163.3        150.0        110      773,792          940.8  cuKernelGetName                           
      0.1      112,648,580        150      750,990.5    444,196.5     93,411    2,918,566      669,131.3  cuMemSetAccess                            
      0.1       96,184,663         62    1,551,365.5    112,922.5      3,096   12,629,055    3,120,644.3  cudaFree                                  
      0.1       71,828,172         10    7,182,817.2  6,363,777.0  6,296,658   11,979,322    1,835,779.5  cuMemUnmap                                
      0.1       57,235,731         28    2,044,133.3  1,424,521.0    318,975    8,046,832    1,932,919.8  cuLibraryLoadData                         
      0.1       42,480,484      2,618       16,226.3     13,055.0        521    1,959,878       38,885.0  cudaGraphDestroy_v10000                   
      0.0       31,810,420         40      795,260.5    620,876.5     35,919    2,988,380      689,942.7  cudaHostAlloc                             
      0.0       29,190,202        886       32,946.1     28,269.5      5,380      186,469       14,872.1  cudaGraphExecDestroy_v10000               
      0.0       23,409,602      2,618        8,941.8      7,755.0        682      157,774        7,026.3  cudaGraphExecUpdate_v10020                
      0.0        9,087,966      2,618        3,471.3      2,504.5      1,413       45,457        3,011.5  cudaStreamBeginCapture_v10000             
      0.0        7,770,997        151       51,463.6     47,732.0     18,666      243,389       31,438.9  cuMemCreate                               
      0.0        3,301,508      2,618        1,261.1      1,143.0        411        7,875          558.6  cudaStreamEndCapture_v10000               
      0.0        1,953,121        450        4,340.3      3,246.0      2,866       21,772        2,444.6  cudaMemset                                
      0.0        1,098,841         90       12,209.3      8,081.0      5,220       40,839        8,983.5  cuMemAddressReserve                       
      0.0        1,015,674         80       12,695.9      7,795.0      5,922       56,549       11,596.9  cudaMemsetAsync                           
      0.0        1,003,567         10      100,356.7    110,458.0     42,732      167,533       44,120.8  cudaMemPoolCreate_v11020                  
      0.0          970,256        150        6,468.4      3,932.5      1,604      189,075       15,704.8  cuMemMap                                  
      0.0          772,890        794          973.4        902.0        831       30,910        1,106.7  cudaEventSynchronize                      
      0.0          394,548         20       19,727.4     13,160.5     11,582       83,681       18,596.7  cudaMemGetInfo                            
      0.0          386,817         10       38,681.7     22,593.5     17,134      163,645       44,569.6  cuMemAddressFree                          
      0.0          321,800         41        7,848.8      6,051.0      2,425       27,483        4,940.3  cudaStreamCreateWithFlags                 
      0.0          316,974        720          440.2        331.0        270        5,861          488.2  cudaThreadExchangeStreamCaptureMode_v10010
      0.0          303,853      1,347          225.6        181.0        130        1,854          111.9  cuGetProcAddress_v2                       
      0.0          303,666         40        7,591.6      4,343.5      1,923       40,348        8,030.0  cudaDeviceSynchronize                     
      0.0          299,624        254        1,179.6        501.0        420       19,879        1,777.3  cudaEventCreateWithFlags                  
      0.0          212,990      1,180          180.5        150.0        120        1,442           85.7  cuStreamGetCaptureInfo_v2                 
      0.0          134,819        204          660.9        455.5        370        6,883          704.4  cudaEventDestroy                          
      0.0          125,762         11       11,432.9     11,061.0      6,592       15,330        2,545.1  cudaStreamDestroy                         
      0.0          104,617         71        1,473.5        310.0        190       82,139        9,710.4  cuMemRelease                              
      0.0           83,684         62        1,349.7        757.0        270       14,849        2,062.9  cuLibraryGetKernel                        
      0.0           43,766         67          653.2        381.0        300        2,355          558.7  cudaGetDriverEntryPointByVersion_v12050   
      0.0           38,283         10        3,828.3      3,902.5      2,514        5,080          760.8  cudaEventQuery                            
      0.0           26,507        101          262.4        211.0        130        2,023          226.9  cuMemGetAllocationGranularity             
      0.0           21,613         10        2,161.3      1,257.5        872        9,689        2,676.6  cudaMemPoolSetAttribute_v11020            
      0.0            9,972         20          498.6        481.5        140        1,052          353.3  cudaGetDeviceProperties_v12000            
      0.0            7,533          5        1,506.6      1,713.0        751        2,295          694.1  cuInit                                    
      0.0            3,188          4          797.0        256.0        141        2,535        1,161.5  cuModuleGetLoadingMode                    

Processing [profile_report.sqlite] with [/opt/nvidia/nsight-systems/2025.5.2/host-linux-x64/reports/cuda_gpu_kern_sum.py]... 

 ** CUDA GPU Kernel Summary (cuda_gpu_kern_sum):

 Time (%)  Total Time (ns)  Instances   Avg (ns)      Med (ns)    Min (ns)    Max (ns)   StdDev (ns)                                                  Name                                                
 --------  ---------------  ---------  -----------  ------------  ---------  ----------  -----------  ----------------------------------------------------------------------------------------------------
     26.2    6,828,456,505      1,062  6,429,808.4   6,411,069.0  5,145,335   9,315,232    598,981.4  void mul_mat_q_id<(ggml_type)146, (int)128, (bool)0>(const char *, const int *, const int *, const …
     21.6    5,618,343,082      1,062  5,290,341.9   5,303,069.0  4,463,050   6,257,717    270,351.6  void mul_mat_q_id<(ggml_type)144, (int)128, (bool)0>(const char *, const int *, const int *, const …
     11.4    2,963,785,195    522,000      5,677.7       5,792.0      4,000      57,696        762.0  void k_reduce_add_T<float, (int)256, (int)2>(copy_task)                                             
      8.2    2,143,309,810      2,596    825,620.1   1,020,516.5      1,760   1,246,468    392,892.1  void mmq_ids_helper<(int)0>(const int *, int *, int *, int *, int, int, int, int, int)              
      6.9    1,793,200,482        180  9,962,224.9  11,827,276.0     80,736  15,318,717  4,773,104.8  void flash_attn_mma_ext_f16<(int)256, (int)8, (int)8, (int)4, (int)32, (int)2, (bool)0>(const char …
      5.6    1,458,343,304      4,376    333,259.4     124,705.0     51,200   1,895,819    444,086.3  void mul_mat_q<(ggml_type)133, (int)128, (int)8, (bool)0>(const char *, const char *, float *, floa…
      4.3    1,122,358,207        300  3,741,194.0   3,834,938.0  2,995,324   4,275,034    309,977.9  void delta_net_recurrent_f32<(int)128, (int)128>(const float *, const float *, const float *, const…
      2.4      626,839,077      5,642    111,102.3       2,400.0      2,303     653,699    226,370.2  mul_multi_add_f32(int, long, long, long, long, long, long, long, const char *, const char *, char *)
      2.3      603,128,370      2,596    232,329.9      90,048.0      1,472     537,376    209,936.2  void quantize_mmq_q8_1<(mmq_q8_1_ds_layout)0>(const float *, const int *, void *, long, long, long,…
      1.4      369,572,924      1,362    271,345.8     112,176.5     86,688     915,746    305,112.0  void mul_mat_q<(ggml_type)8, (int)128, (int)8, (bool)0>(const char *, const char *, float *, float …
      1.2      309,847,110     10,900     28,426.3       1,248.0      1,087     189,632     61,158.6  void k_add_same<(int)256, float>(long, const T2 *, const T2 *, T2 *)                                
      1.0      269,408,494      4,344     62,018.5      62,017.0     61,632      62,496        133.0  void iqk_mul_mat_vec_q<(ggml_type)144, (int)4, &vec_dot_iq4_ks_q8_1, (int)1, (int)1>(const void *, …
      0.8      217,376,402      4,076     53,330.8      69,184.0      5,824      77,600     26,575.6  void quantize_mmq_q8_1<(mmq_q8_1_ds_layout)0>(const float *, void *, long, long, long)              
      0.8      211,850,171      1,298    163,212.8     203,072.5      1,600     221,312     77,116.1  fused_mul_silu_f32(const float *, float *, int, int)                                                
      0.7      183,124,395      6,182     29,622.2       3,360.0      3,008     123,681     46,336.7  void fused_rms_norm_f32<(int)1024, float>(const T2 *, const float *, float *, int, float)           
      0.7      176,020,697      2,124     82,872.3      84,513.0     66,496      92,320      5,523.4  void k_add<float, (int)256>(int, const T1 *, T1 *)                                                  
      0.5      119,424,665      1,298     92,006.7     113,825.0      1,152     127,937     43,391.5  k_mul_fast(int, int, const float *, const float *, float *)                                         
      0.4      116,140,964      4,344     26,735.9      26,720.0     26,208      27,488        192.1  void iqk_fused_mul_mat_vec_q<(ggml_type)146, (int)4, &vec_dot_iq4_kss_q8_1, (int)1, (int)1>(const v…
      0.3       76,584,946      4,376     17,501.1      16,768.0     12,512      27,008      2,530.1  void mul_mat_q_stream_k_fixup<(ggml_type)133, (int)128, (int)8, (bool)0>(float *, const float *, in…
      0.3       76,316,468      1,080     70,663.4      11,808.0      1,920     261,569    103,497.6  void cpy_flt<&cpy_1_flt<float, float>>(const char *, char *, int, int, int, int, int, int, int, int…
      0.3       68,205,036      1,658     41,136.9      18,976.0      1,120     169,313     56,830.5  fused_mul_silu_f32(const float *, const float *, float *, int)                                      
      0.3       65,454,575      5,642     11,601.3       7,744.0      7,552      34,464      8,058.6  void topk_moe_cuda<(unsigned long)512, (bool)1>(const float *, float *, int *, const float *, int, …
      0.2       62,896,611      1,062     59,224.7      60,256.0     48,352      66,176      3,728.1  std::enable_if<!T7, void>::type internal::gemvx::kernel<int, int, float, float, float, float, (bool…
      0.2       52,625,878        360    146,183.0     180,737.0      3,328     189,249     65,384.2  void ssm_conv_single_seq_f32_nc4<(int)32>(const float *, const float *, const float *, float *, int…
      0.2       47,956,050        360    133,211.3     166,081.0      1,184     171,457     60,425.1  silu_f32(const float *, float *, int)                                                               
      0.2       47,862,038        236    202,805.2     189,296.5    124,449     307,394     40,672.2  void mul_mat_q_id<(ggml_type)146, (int)8, (bool)0>(const char *, const int *, const int *, const in…
      0.1       37,053,856        236    157,007.9     154,560.5    140,288     205,473     12,341.5  void mul_mat_q_id<(ggml_type)144, (int)8, (bool)0>(const char *, const int *, const int *, const in…
      0.1       35,767,713        360     99,354.8     122,144.5      2,016     130,752     44,416.2  void fused_rms_norm_f32<(int)128, float>(const T2 *, const float *, float *, int, float)            
      0.1       30,665,119     22,608      1,356.4       1,248.0      1,216       2,048        213.6  quantize_q8_1(const float *, void *, long, long)                                                    
      0.1       29,069,077      4,346      6,688.7       6,688.0      6,528      18,273        252.0  void mul_mat_vec_q<(ggml_type)133, (int)1, (int)4>(const void *, const void *, float *, const char …
      0.1       26,087,325      4,344      6,005.4       6,016.0      5,824       6,240         64.7  void fused_mul_mat_vec_q<(ggml_type)133, (int)1, (int)4>(const void *, const void *, const void *, …
      0.1       24,133,066        180    134,072.6     166,673.0      1,600     168,961     60,565.0  void k_bin_bcast<&op_mul, float, float, float>(const T2 *, const T3 *, T4 *, int, int, int, int, in…
      0.1       23,964,905      1,362     17,595.4      16,928.0     15,776      21,568      1,490.8  void mul_mat_q_stream_k_fixup<(ggml_type)8, (int)128, (int)8, (bool)0>(float *, const float *, int,…
      0.1       22,729,377        600     37,882.3      37,248.0     29,696      43,392      3,222.4  void mul_mat_q<(ggml_type)8, (int)128, (int)8, (bool)1>(const char *, const char *, float *, float …
      0.1       19,881,474        360     55,226.3      10,464.0      1,536     131,489     56,724.3  void fused_rms_norm_f32<(int)256, float>(const T2 *, const float *, float *, int, float)            
      0.1       18,896,432        720     26,245.0      32,128.0      1,376      33,793     11,337.7  void l2_norm_f32_nc<(int)32>(const float *, float *, int, long, long, long, float)                  
      0.1       18,708,424      4,344      4,306.7       4,288.0      4,064       4,608        126.2  void mul_mat_vec_q<(ggml_type)8, (int)1, (int)4>(const void *, const void *, float *, const char *,…
      0.1       18,176,103        180    100,978.4     125,280.5      1,792     130,752     45,447.1  void rope_rope_multi<(bool)1, (bool)0, float>(const T3 *, const T3 *, T3 *, T3 *, int, int, int, in…
      0.1       17,886,071      1,478     12,101.5       1,280.0      1,088     113,985     32,331.5  sigmoid_f32(const float *, float *, int)                                                            
      0.0       11,699,696      1,062     11,016.7      10,720.0      8,288      15,200      1,828.3  void mul_mat_q_stream_k_fixup_id<(ggml_type)146, (int)128, (bool)0>(const int *, const int *, float…
      0.0       11,075,457      4,344      2,549.6       2,528.0      2,496       2,944         40.6  void mul_mat_row<float, (int)256>(int, const T1 *, const float *, float *)                          
      0.0        9,461,156        180     52,562.0      50,880.0     49,536      64,672      3,495.1  void flash_attn_mma_stream_k_fixup<(int)256, (int)8, (int)8, (int)32>(float *, const float2 *, int,…
      0.0        9,047,118        360     25,130.9      30,240.0      1,696      33,344     10,695.1  void cpy_f32_q<&cpy_blck_f32_q8_0, (int)32>(const char *, char *, int, int, int, int, int, int, int…
      0.0        8,985,186      1,062      8,460.6       7,968.0      6,400      13,184      1,742.8  void mul_mat_q_stream_k_fixup_id<(ggml_type)144, (int)128, (bool)0>(const int *, const int *, float…
      0.0        8,893,738        600     14,822.9      14,720.0     14,048      16,608        403.1  void mul_mat_q_stream_k_fixup<(ggml_type)8, (int)128, (int)8, (bool)1>(float *, const float *, int,…
      0.0        7,475,328        592     12,627.2       7,648.0      5,376      36,641      9,460.1  void mul_mat_vec_q<(ggml_type)133, (int)5, (int)2>(const void *, const void *, float *, const char …
      0.0        7,113,536        360     19,759.8      20,432.0     12,896      25,920      3,886.9  void dequantize_block_q8_0_f16<(bool)0>(const void *, __half *, long)                               
      0.0        5,708,754      4,344      1,314.2       1,312.0      1,280       1,440         16.9  fused_mul_sigmoid_f32(int, const float *, const float *, float *, int)                              
      0.0        3,139,405        298     10,534.9       6,720.0      5,184      28,288      8,684.0  void mul_mat_vec_q<(ggml_type)8, (int)5, (int)2>(const void *, const void *, float *, const char *,…
      0.0        2,418,925        150     16,126.2      16,720.0     12,256      18,209      1,864.6  void flash_attn_mask_to_KV_min_max<(int)8, (bool)0>(const __half2 *, int2 *, int, int, int)         
      0.0        1,798,022        354      5,079.2       4,384.0      3,872       6,816      1,100.2  void mul_mat_vec_q<(ggml_type)133, (int)2, (int)4>(const void *, const void *, float *, const char …
      0.0        1,353,706        360      3,760.3       3,744.0      3,232       4,608        292.2  concat_f32_dim0(const float *, const float *, float *, long, long)                                  
      0.0          872,616        360      2,423.9       2,592.0      1,312       2,945        494.9  k_fused_softplus(int, int, const float *, const float *, const float *, float *)                    
      0.0          850,818        236      3,605.2       3,616.0      3,264       3,936        126.4  void mul_mat_q_stream_k_fixup_id<(ggml_type)146, (int)8, (bool)0>(const int *, const int *, float *…
      0.0          830,179        360      2,306.1       2,208.0      1,760       3,296        394.1  ssm_conv_single_seq_final_state_f32(const float *, const float *, float *, int, int, int, int, int,…
      0.0          777,318        236      3,293.7       3,296.0      3,040       3,584         76.5  void mul_mat_q_stream_k_fixup_id<(ggml_type)144, (int)8, (bool)0>(const int *, const int *, float *…
      0.0          711,171         60     11,852.9      11,840.0     11,488      12,256        148.8  void delta_net_recurrent_f32<(int)128, (int)256>(const float *, const float *, const float *, const…
      0.0          532,801        118      4,515.3       4,480.0      4,224       4,928        176.0  void mul_mat_vec_q<(ggml_type)8, (int)2, (int)4>(const void *, const void *, float *, const char *,…
      0.0          417,668        236      1,769.8       1,760.0      1,664       1,952         69.4  void dot_kernel<float, (int)128, (int)0, cublasDotParams<cublasGemvTensorStridedBatched<const float…
      0.0          384,132        236      1,627.7       1,632.0      1,568       1,760         34.8  void reduce_1Block_kernel<float, (int)128, (int)7, cublasGemvTensorStridedBatched<float>, cublasGem…
      0.0            4,992          3      1,664.0       1,504.0      1,504       1,984        277.1  void k_get_rows_float<float, float>(const T1 *, const int *, T2 *, long, long, long, unsigned long,…

Processing [profile_report.sqlite] with [/opt/nvidia/nsight-systems/2025.5.2/host-linux-x64/reports/cuda_gpu_mem_time_sum.py]... 

 ** CUDA GPU MemOps Summary (by Time) (cuda_gpu_mem_time_sum):

 Time (%)  Total Time (ns)   Count   Avg (ns)   Med (ns)  Min (ns)   Max (ns)   StdDev (ns)           Operation          
 --------  ---------------  -------  ---------  --------  --------  ----------  -----------  ----------------------------
     64.4   14,169,797,872  460,532   30,768.3     416.0       384  44,591,702    623,818.0  [CUDA memcpy Host-to-Device]
     34.0    7,493,966,990   23,895  313,620.7   2,016.0     1,056   5,150,304    726,277.1  [CUDA memcpy Peer-to-Peer]  
      1.6      350,517,643    3,885   90,223.3  51,776.0    51,200     208,962     46,534.5  [CUDA memcpy Device-to-Host]
      0.0          414,561      530      782.2     736.0       544       1,920        225.5  [CUDA memset]               

Processing [profile_report.sqlite] with [/opt/nvidia/nsight-systems/2025.5.2/host-linux-x64/reports/cuda_gpu_mem_size_sum.py]... 

 ** CUDA GPU MemOps Summary (by Size) (cuda_gpu_mem_size_sum):

 Total (MB)    Count   Avg (MB)  Med (MB)  Min (MB)  Max (MB)  StdDev (MB)           Operation          
 -----------  -------  --------  --------  --------  --------  -----------  ----------------------------
 236,058.421  460,532     0.513     0.000     0.000   578.814        9.465  [CUDA memcpy Host-to-Device]
 108,315.132   23,895     4.533     0.016     0.016    50.332        9.733  [CUDA memcpy Peer-to-Peer]  
   5,872.589    3,885     1.512     0.993     0.993     2.171        0.585  [CUDA memcpy Device-to-Host]
       1.099      530     0.002     0.002     0.000     0.024        0.003  [CUDA memset]               



@Ph0rk0z
Copy link
Copy Markdown

Ph0rk0z commented Apr 29, 2026

Thanks for the suggestion, I'm gonna chase some bottlenecks now :P Have to make sure my nsight is working.

A simpler thing would be to simply see if excluding the 2080ti from compile with this commit makes a difference.

@Ph0rk0z
Copy link
Copy Markdown

Ph0rk0z commented Apr 29, 2026

Well here is a log of when it's "good".
nsys.txt

Need to try more models, especially hybrid.

@resynth
Copy link
Copy Markdown

resynth commented May 2, 2026

@Ph0rk0z - did you ever get to the bottom of why it was slowing things down for you?
I notice 1 of the 2 crash reports was AMD GPU and CPU so wonder if that was the issue there.

@Ph0rk0z
Copy link
Copy Markdown

Ph0rk0z commented May 3, 2026

Not yet. Possibly a cuda bug due to compile ignoring GPU architecture. I have been meaning to test with explicitly defined CUDA_ARCH like I have now and with less threads but haven't had the time. The motivation isn't super high because I only see one report of it improving speed, but I am curious.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants