Revert "Faster prompt processing on CUDA (#1687)" #1700
This reverts commit 3a945af.
|
Interestingly, I don't see much difference in PP either way for the full offload of
this PR (revert): |
|
It's not great with fully offloaded gemma. gemma bad
gemma good
Simply duplicated main and reverted the commit. |
|
But why would you use more than one thread with full offload? [EDIT]: Anyways. main: pr: |
I don't. I just copy and paste configs. The thread count on fully offloaded models isn't really respected. It will use one unless you enable that setting to have one thread per GPU. I think it was async. Wonder why your system isn't affected. Extra bandwidth? Fancy SIMD? Driver/cuda version? |
x16 GPUs. Cuda 13.1 |
No, the wondering goes the other way around: why is your system affected in this way, while no other system is? Obviously the change is not affected in any way shape or form by PCI-E speed. It might be affected by CUDA version (it wouldn't be the first time that we hear a recent CUDA version miscompiling some code). |
|
I am on cuda 12.6 so it's not the newer version. The driver is the latest though. |
|
I'm on CUDA 12.1 with a 3080ti mobile on Ubuntu 24.04. I noticed a significant increase in prompt processing t/s for Qwen3.6 27B with this commit, sad to see it go! |
|
You can just revert it in your local copy and keep using it, probably for a while. |
Hardware details: #1668 (comment) Additionally, I have used the following GPUs: export CUDA_VISIBLE_DEVICES=10,7,6,0 |
|
And these are the results for: export CUDA_VISIBLE_DEVICES=10,0,1,2 |
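For readers comparing the two runs: `CUDA_VISIBLE_DEVICES` both filters and renumbers GPUs, so logical device 0 in the process is the first index in the list. Below is a minimal sketch of that mapping. The helper name `parse_visible_devices` is hypothetical, and it assumes plain integer indices (no UUIDs) and stops at the first entry that is not one. It does not query any real hardware.

```cpp
#include <cassert>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper: parse a CUDA_VISIBLE_DEVICES-style list.
// result[i] is the physical GPU index that the process would see as device i.
// Assumption: entries are plain non-negative integers; parsing stops at the
// first entry that is empty or contains a non-digit character.
std::vector<int> parse_visible_devices(const std::string & value) {
    std::vector<int> physical;
    std::stringstream ss(value);
    std::string item;
    while (std::getline(ss, item, ',')) {
        if (item.empty() || item.find_first_not_of("0123456789") != std::string::npos) {
            break;
        }
        physical.push_back(std::stoi(item));
    }
    return physical;
}
```

With `10,7,6,0` the process sees physical GPU 10 as device 0 and physical GPU 0 as device 3. With `10,0,1,2` only the first device is shared between the two runs. Which physical card each index refers to also depends on `CUDA_DEVICE_ORDER`.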
|
Are you using GPUs from different families? Check out the Qwen3.5 explanation:
Details
1. Architecture-Specific Kernel Launch Configuration (The "Smoking Gun")
The most critical change causing heterogeneous issues is found in ggml/src/ggml-cuda/mmq_id_common.cuh.
In the optimized version (Reverted):
    // Optimized Code
    const dim3 block_nums_stream_k(GGML_CUDA_CC_IS_NVIDIA(cc) && tiles_efficiency_percent >= 90 ? ntiles_dst : nsm, 1, 1);
In the reverted version:
    // Reverted Code
    const dim3 block_nums_stream_k(nsm, 1, 1);
Why this breaks heterogeneous setups:
2. Register Pressure and Occupancy Variance
The optimization introduced a custom integer division implementation (fastdiv.cuh) to replace standard / and % operators.
In the optimized version:
Why this hurts heterogeneous setups:
Summary of the Regression
Conclusion: |
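The quoted explanation mentions a custom integer division replacing `/` and `%`, but does not show it. For context, here is a minimal host-side sketch of the usual multiply-and-shift technique for dividing by a divisor that is fixed ahead of time. The names `FastDiv`, `make_fastdiv`, `fastdiv` and `fastmod` are illustrative and are not taken from the repository. A GPU version would typically compute the high half of the product with the `__umulhi` intrinsic instead of a 64-bit multiply.

```cpp
#include <cassert>
#include <cstdint>

// Precomputed constants for dividing by a fixed 32-bit divisor d.
struct FastDiv {
    uint32_t mp; // magic multiplier
    uint32_t L;  // shift amount: smallest L with 2^L >= d
    uint32_t d;  // original divisor, kept for the modulo
};

// Compute the constants once per divisor (done on the host in this sketch).
FastDiv make_fastdiv(uint32_t d) {
    assert(d != 0);
    uint32_t L = 0;
    while (L < 32 && (uint64_t{1} << L) < d) {
        ++L;
    }
    // mp = floor(2^32 * (2^L - d) / d) + 1, which always fits in 32 bits.
    const uint64_t mp = (uint64_t{1} << 32) * ((uint64_t{1} << L) - d) / d + 1;
    return {static_cast<uint32_t>(mp), L, d};
}

// n / d using one multiply, one add and one shift.
uint32_t fastdiv(uint32_t n, const FastDiv & f) {
    // High 32 bits of the 64-bit product n * mp.
    const uint64_t hi = (static_cast<uint64_t>(n) * f.mp) >> 32;
    // The sum is kept in 64 bits so it cannot overflow for large n.
    return static_cast<uint32_t>((hi + n) >> f.L);
}

// n % d derived from the fast quotient.
uint32_t fastmod(uint32_t n, const FastDiv & f) {
    return n - fastdiv(n, f) * f.d;
}
```

The trade-off in this kind of scheme is that each divisor needs its constants carried into the kernel, which is one plausible way register usage could change. Whether that matters here is not established by anything in this thread.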
|
Nope, 4x 3090. I have exactly what I had before. There is a 2080ti 22g but I use it to run co-models like image gen. The only thing I noticed is that when nvcc was compiling with this commit it mentioned something about being unable to determine "native" arch and falling back to default. I don't think I saw it happen when I recompiled after removing the commit. Relating back to your AI explanation, it's possible that an issue arises since I'm compiling SM75 code in addition to SM86, where people who had success with it were only compiling ampere+? I know this matters on other engines that use python and sometimes I have to do CUDA_VISIBLE_DEVICES but was under the impression IK compiles for all arch by default. |
|
There is a tool called nsys. You could run:
nsys profile --stats=true -o profile_report \
Then:
nsys stats profile_report.nsys-rep > nsys.txt
You should be able to get something like this: (Qwen3.5) Details |
|
Thanks for the suggestion, I'm gonna chase some bottlenecks now :P Have to make sure my nsight is working. A simpler thing would be to simply see if excluding the 2080ti from compile with this commit makes a difference. |
|
Well here is a log of when it's "good". Need to try more models, especially hybrid. |
|
@Ph0rk0z - did you ever get to the bottom of why it was slowing things down for you? |
|
Not yet. Possibly a CUDA bug due to the compile ignoring GPU architecture. I have been meaning to test with explicitly defined CUDA_ARCH like I have now and with fewer threads but haven't had the time. The motivation isn't super high because I only see one report of it improving speed, but I am curious. |
Apparently #1687 is causing issues for some people
If #1687 is improving your performance and you don't want it to be reverted, come here to object.