
@ikawrakow
Owner

To help the quest for the world's smallest DeepSeek model, this PR adds a CUDA implementation for IQ1_M_R4.

GEMM is done via dequantize+cuBLAS, so it may require building with cmake -DGGML_CUDA_IQK_FORCE_BF16=ON.
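
For context, a minimal sketch of what a dequantize+cuBLAS path looks like, assuming that cmake flag selects bf16 as the dequantization target. The names are made up and the actual IQ1_M_R4 decoding is stubbed out, so treat it as an illustration of the shape of the path, not the code in this PR: the quantized weights are first expanded into a temporary bf16 buffer, and the GEMM is then handed to cublasGemmEx with f32 accumulation.

```cpp
// Sketch of a dequantize + cuBLAS GEMM path. Hypothetical names; the IQ1_M_R4
// decoding is a placeholder, not the real unpacking logic.
#include <cuda_runtime.h>
#include <cublas_v2.h>
#include <cuda_bf16.h>

// Placeholder kernel: expand quantized weights into a contiguous bf16 buffer.
__global__ void dequantize_to_bf16(const void * qweights, __nv_bfloat16 * dst, int64_t nelem) {
    const int64_t i = (int64_t)blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= nelem) return;
    (void)qweights;                  // unused in this stub
    const float v = 0.0f;            // stand-in for the decoded weight value
    dst[i] = __float2bfloat16(v);
}

// C (m x n, f32) = W (m x k, quantized) * X (k x n, bf16), accumulated in f32.
static void gemm_dequant_cublas(cublasHandle_t handle, cudaStream_t stream,
                                const void * qweights,      // quantized weights, row-major m x k
                                const __nv_bfloat16 * x,    // activations, k values per column
                                float * c, int m, int n, int k,
                                __nv_bfloat16 * w_bf16)     // scratch buffer for m*k bf16 values
{
    const int64_t nelem = (int64_t)m * k;
    dequantize_to_bf16<<<(nelem + 255) / 256, 256, 0, stream>>>(qweights, w_bf16, nelem);

    const float alpha = 1.0f, beta = 0.0f;
    cublasSetStream(handle, stream);
    cublasGemmEx(handle, CUBLAS_OP_T, CUBLAS_OP_N, m, n, k,
                 &alpha,
                 w_bf16, CUDA_R_16BF, k,    // dequantized weights
                 x,      CUDA_R_16BF, k,    // activations
                 &beta,
                 c,      CUDA_R_32F,  m,    // f32 result
                 CUBLAS_COMPUTE_32F, CUBLAS_GEMM_DEFAULT);
}
```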

Performance is on par with, or even a tiny bit better than, IQ1_M.

Here is a sweep bench for LLaMA-3-8B on an RTX-4080:

IQ1_M

|   PP |  TG |  N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
|-----:|----:|------:|-------:|---------:|-------:|---------:|
| 2048 | 512 |     0 |  0.347 |  5909.51 |  2.466 |   207.66 |
| 2048 | 512 |  2048 |  0.329 |  6216.59 |  2.657 |   192.69 |
| 2048 | 512 |  4096 |  0.356 |  5745.00 |  2.928 |   174.88 |
| 2048 | 512 |  6144 |  0.384 |  5332.11 |  3.162 |   161.91 |
| 2048 | 512 |  8192 |  0.411 |  4983.68 |  3.380 |   151.50 |
| 2048 | 512 | 10240 |  0.438 |  4678.79 |  3.634 |   140.88 |
| 2048 | 512 | 12288 |  0.466 |  4398.46 |  3.830 |   133.68 |
| 2048 | 512 | 14336 |  0.494 |  4149.40 |  4.095 |   125.03 |

IQ1_M_R4 (PR)

|   PP |  TG |  N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
|-----:|----:|------:|-------:|---------:|-------:|---------:|
| 2048 | 512 |     0 |  0.338 |  6058.78 |  2.440 |   209.81 |
| 2048 | 512 |  2048 |  0.323 |  6337.42 |  2.639 |   193.99 |
| 2048 | 512 |  4096 |  0.350 |  5859.50 |  2.914 |   175.71 |
| 2048 | 512 |  6144 |  0.379 |  5409.73 |  3.151 |   162.47 |
| 2048 | 512 |  8192 |  0.405 |  5054.63 |  3.371 |   151.90 |
| 2048 | 512 | 10240 |  0.432 |  4742.62 |  3.618 |   141.52 |
| 2048 | 512 | 12288 |  0.458 |  4471.08 |  3.804 |   134.59 |
| 2048 | 512 | 14336 |  0.486 |  4210.13 |  4.067 |   125.90 |

@ubergarm
Contributor

ubergarm commented Jun 5, 2025

Amazing, you've done it! The pieces of the puzzle are in place. Congrats, ik, on the world's smallest working DeepSeek-R1-0528 quant! 🎉

With the new DDR5 2x64GB DIMM kits becoming available, an AM5 gaming-class rig + GPU can barely fit this little beast!

(sweep-bench plot: thud-sweep-R1-0528-IQ1_S_R4-PR494)

I'm going to double-check that llama-perplexity still runs clean, but partial offload is now running at great speed!

👈 Commands and Logs

Pull and Build

git branch | grep '*'
* ik/cuda_iq1_m_r4

git rev-parse --short HEAD
8ed7825f

cmake -B ./build -DGGML_CUDA=ON -DGGML_BLAS=OFF -DGGML_SCHED_MAX_COPIES=1 -DGGML_CUDA_IQK_FORCE_BF16=1
cmake --build ./build --config Release -j $(nproc)

llama-sweep-bench

model=/mnt/raid/hf/DeepSeek-R1-0528-GGUF/IQ1_S_R4/DeepSeek-R1-0528-IQ1_S_R4-00001-of-00003.gguf

./build/bin/llama-sweep-bench \
  --model "$model" \
  -c 16384 \
  -ctk f16 \
  -mla 3 -fa \
  -amb 512 \
  -fmoe \
  -ngl 99 \
  -ot "blk\.(3|4|5|6|7|8|9|10|11|12|13|13|14|15|16|17|18|19|20)\.ffn_.*=CUDA0" \
  -ot "blk\.(21|22|23|24|25|26|27|28|29|30|31|32|33|34|35|36|37|38)\.ffn_.*=CUDA1" \
  -ot exps=CPU \
  -b 4096 -ub 4096 \
  --warmup-batch \
  --threads 24
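
For anyone reading the -ot flags above: the overrides are regex patterns matched against tensor names (first match wins, as I understand the behaviour), so here the ffn tensors of layers 3-20 go to CUDA0, layers 21-38 go to CUDA1, and whatever expert tensors remain fall through to the CPU. A small stand-alone C++ sketch of that first-match logic, with made-up tensor names (this is not code from the repo):

```cpp
// Toy illustration of first-match -ot overrides; hypothetical tensor names.
#include <cstdio>
#include <regex>
#include <string>
#include <utility>
#include <vector>

int main() {
    // (pattern, target) pairs in the same order as the -ot flags above.
    const std::vector<std::pair<std::regex, std::string>> overrides = {
        { std::regex("blk\\.(3|4|5|6|7|8|9|10|11|12|13|14|15|16|17|18|19|20)\\.ffn_.*"),        "CUDA0" },
        { std::regex("blk\\.(21|22|23|24|25|26|27|28|29|30|31|32|33|34|35|36|37|38)\\.ffn_.*"), "CUDA1" },
        { std::regex("exps"),                                                                   "CPU"   },
    };
    // A few made-up tensor names to show where each one would land.
    const std::vector<std::string> names = {
        "blk.4.ffn_down_exps.weight",   // layer 4 experts  -> CUDA0
        "blk.25.ffn_up_exps.weight",    // layer 25 experts -> CUDA1
        "blk.45.ffn_gate_exps.weight",  // layer 45 experts -> CPU (only "exps" matches)
        "blk.45.attn_kv_b.weight",      // no override, stays wherever -ngl puts it
    };
    for (const auto & name : names) {
        std::string target = "default (-ngl)";
        for (const auto & [re, dst] : overrides) {
            if (std::regex_search(name, re)) { target = dst; break; }
        }
        std::printf("%-30s -> %s\n", name.c_str(), target.c_str());
    }
    return 0;
}
```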

ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 2 CUDA devices:
  Device 0: NVIDIA RTX A6000, compute capability 8.6, VMM: yes
  Device 1: NVIDIA RTX A6000, compute capability 8.6, VMM: yes

llama_model_loader: - type  f32:  361 tensors
llama_model_loader: - type q4_0:   61 tensors
llama_model_loader: - type iq4_ks:  551 tensors
llama_model_loader: - type iq1_s_r4:  116 tensors
llama_model_loader: - type iq1_m_r4:   58 tensors

llm_load_print_meta: model type       = 671B
llm_load_print_meta: model ftype      = IQ1_S_R4 - 1.5 bpw
llm_load_print_meta: model params     = 672.050 B
llm_load_print_meta: model size       = 130.203 GiB (1.664 BPW)
llm_load_print_meta: repeating layers = 129.285 GiB (1.657 BPW, 670.196 B parameters)
llm_load_print_meta: general.name     = DeepSeek R1 0528

llm_load_tensors: offloaded 62/62 layers to GPU
llm_load_tensors:        CPU buffer size =  5994.06 MiB
llm_load_tensors:        CPU buffer size = 44211.82 MiB
llm_load_tensors:        CPU buffer size =   469.99 MiB
llm_load_tensors:      CUDA0 buffer size = 42859.65 MiB
llm_load_tensors:      CUDA1 buffer size = 43061.37 MiB

llama_kv_cache_init:      CUDA0 KV buffer size =   576.00 MiB
llama_kv_cache_init:      CUDA1 KV buffer size =   522.00 MiB
llama_new_context_with_model: KV self size  = 1098.00 MiB, c^KV (f16): 1098.00 MiB, kv^T: not used
llama_new_context_with_model:  CUDA_Host  output buffer size =     0.49 MiB
llama_new_context_with_model: pipeline parallelism enabled (n_copies=1)
llama_new_context_with_model:      CUDA0 compute buffer size =  2824.02 MiB
llama_new_context_with_model:      CUDA1 compute buffer size =  2520.01 MiB
llama_new_context_with_model:  CUDA_Host compute buffer size =   368.05 MiB
llama_new_context_with_model: graph nodes  = 5500
llama_new_context_with_model: graph splits = 111
|   PP |   TG |  N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
|-----:|-----:|------:|-------:|---------:|-------:|---------:|
| 4096 | 1024 |     0 |  9.959 |   411.28 | 70.744 |    14.47 |
| 4096 | 1024 |  4096 | 12.460 |   328.73 | 73.277 |    13.97 |
| 4096 | 1024 |  8192 | 14.947 |   274.04 | 76.418 |    13.40 |
| 4096 | 1024 | 12288 | 17.442 |   234.84 | 78.654 |    13.02 |

@ubergarm
Contributor

ubergarm commented Jun 5, 2025

4.8805 +/- 0.02876 perplexity, not great, not terrible.
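
(For reference on what that number means: llama-perplexity reports token-level perplexity, i.e. the exponentiated mean negative log-likelihood over the evaluated tokens,

$$\mathrm{PPL} = \exp\!\left(-\frac{1}{N}\sum_{i=1}^{N}\log p(x_i \mid x_{<i})\right),$$

and the +/- is the estimated statistical uncertainty of that measurement.)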

Importantly, it runs clean with no NaNs!!! Ship it! 🚢 🐿️ 🚀

ikawrakow merged commit eded4e2 into main on Jun 5, 2025.