
@ikawrakow
Owner

To help the quest for the world's smallest DeepSeek model, this PR adds a CUDA implementation for IQ1_M_R4.

GEMM is done via dequantize+cuBLAS, so it may require building with cmake -DGGML_CUDA_IQK_FORCE_BF16=ON.
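
For context, a minimal sketch of what a dequantize+cuBLAS path looks like, assuming that cmake flag selects bf16 as the dequantization target. The names are made up and the actual IQ1_M_R4 decoding is stubbed out, so treat it as an illustration of the shape of the path, not the code in this PR: the quantized weights are first expanded into a temporary bf16 buffer, and the GEMM is then handed to cublasGemmEx with f32 accumulation.

```cpp
// Sketch of a dequantize + cuBLAS GEMM path. Hypothetical names; the IQ1_M_R4
// decoding is a placeholder, not the real unpacking logic.
#include <cuda_runtime.h>
#include <cublas_v2.h>
#include <cuda_bf16.h>

// Placeholder kernel: expand quantized weights into a contiguous bf16 buffer.
__global__ void dequantize_to_bf16(const void * qweights, __nv_bfloat16 * dst, int64_t nelem) {
    const int64_t i = (int64_t)blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= nelem) return;
    (void)qweights;                  // unused in this stub
    const float v = 0.0f;            // stand-in for the decoded weight value
    dst[i] = __float2bfloat16(v);
}

// C (m x n, f32) = W (m x k, quantized) * X (k x n, bf16), accumulated in f32.
static void gemm_dequant_cublas(cublasHandle_t handle, cudaStream_t stream,
                                const void * qweights,      // quantized weights, row-major m x k
                                const __nv_bfloat16 * x,    // activations, k values per column
                                float * c, int m, int n, int k,
                                __nv_bfloat16 * w_bf16)     // scratch buffer for m*k bf16 values
{
    const int64_t nelem = (int64_t)m * k;
    dequantize_to_bf16<<<(nelem + 255) / 256, 256, 0, stream>>>(qweights, w_bf16, nelem);

    const float alpha = 1.0f, beta = 0.0f;
    cublasSetStream(handle, stream);
    cublasGemmEx(handle, CUBLAS_OP_T, CUBLAS_OP_N, m, n, k,
                 &alpha,
                 w_bf16, CUDA_R_16BF, k,    // dequantized weights
                 x,      CUDA_R_16BF, k,    // activations
                 &beta,
                 c,      CUDA_R_32F,  m,    // f32 result
                 CUBLAS_COMPUTE_32F, CUBLAS_GEMM_DEFAULT);
}
```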

Performance is on par with, or even a tiny bit better than, IQ1_M.

Here is a sweep bench for LLaMA-3-8B on an RTX-4080:

IQ1_M

|   PP |  TG |  N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
|-----:|----:|------:|-------:|---------:|-------:|---------:|
| 2048 | 512 |     0 |  0.347 |  5909.51 |  2.466 |   207.66 |
| 2048 | 512 |  2048 |  0.329 |  6216.59 |  2.657 |   192.69 |
| 2048 | 512 |  4096 |  0.356 |  5745.00 |  2.928 |   174.88 |
| 2048 | 512 |  6144 |  0.384 |  5332.11 |  3.162 |   161.91 |
| 2048 | 512 |  8192 |  0.411 |  4983.68 |  3.380 |   151.50 |
| 2048 | 512 | 10240 |  0.438 |  4678.79 |  3.634 |   140.88 |
| 2048 | 512 | 12288 |  0.466 |  4398.46 |  3.830 |   133.68 |
| 2048 | 512 | 14336 |  0.494 |  4149.40 |  4.095 |   125.03 |

IQ1_M_R4 (PR)

|   PP |  TG |  N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
|-----:|----:|------:|-------:|---------:|-------:|---------:|
| 2048 | 512 |     0 |  0.338 |  6058.78 |  2.440 |   209.81 |
| 2048 | 512 |  2048 |  0.323 |  6337.42 |  2.639 |   193.99 |
| 2048 | 512 |  4096 |  0.350 |  5859.50 |  2.914 |   175.71 |
| 2048 | 512 |  6144 |  0.379 |  5409.73 |  3.151 |   162.47 |
| 2048 | 512 |  8192 |  0.405 |  5054.63 |  3.371 |   151.90 |
| 2048 | 512 | 10240 |  0.432 |  4742.62 |  3.618 |   141.52 |
| 2048 | 512 | 12288 |  0.458 |  4471.08 |  3.804 |   134.59 |
| 2048 | 512 | 14336 |  0.486 |  4210.13 |  4.067 |   125.90 |

@ubergarm
Contributor

ubergarm commented Jun 5, 2025

Amazing, you've done it! The pieces of the puzzle are in place. Congrats, ik, on the world's smallest working DeepSeek-R1-0528 quant! 🎉

With the new DDR5 2x64GB DIMM kits becoming available, an AM5 gaming-class rig + GPU can barely fit this little beast!

(sweep-bench plot: thud-sweep-R1-0528-IQ1_S_R4-PR494)

I'm going to double-check that llama-perplexity still runs clean, but partial offload is now running at great speed!

👈 Commands and Logs

Pull and Build

git branch | grep '*'
* ik/cuda_iq1_m_r4

git rev-parse --short HEAD
8ed7825f

cmake -B ./build -DGGML_CUDA=ON -DGGML_BLAS=OFF -DGGML_SCHED_MAX_COPIES=1 -DGGML_CUDA_IQK_FORCE_BF16=1
cmake --build ./build --config Release -j $(nproc)

llama-sweep-bench

model=/mnt/raid/hf/DeepSeek-R1-0528-GGUF/IQ1_S_R4/DeepSeek-R1-0528-IQ1_S_R4-00001-of-00003.gguf

./build/bin/llama-sweep-bench \
  --model "$model" \
  -c 16384 \
  -ctk f16 \
  -mla 3 -fa \
  -amb 512 \
  -fmoe \
  -ngl 99 \
  -ot "blk\.(3|4|5|6|7|8|9|10|11|12|13|13|14|15|16|17|18|19|20)\.ffn_.*=CUDA0" \
  -ot "blk\.(21|22|23|24|25|26|27|28|29|30|31|32|33|34|35|36|37|38)\.ffn_.*=CUDA1" \
  -ot exps=CPU \
  -b 4096 -ub 4096 \
  --warmup-batch \
  --threads 24
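
For anyone reading the -ot flags above: the overrides are regex patterns matched against tensor names (first match wins, as I understand the behaviour), so here the ffn tensors of layers 3-20 go to CUDA0, layers 21-38 go to CUDA1, and whatever expert tensors remain fall through to the CPU. A small stand-alone C++ sketch of that first-match logic, with made-up tensor names (this is not code from the repo):

```cpp
// Toy illustration of first-match -ot overrides; hypothetical tensor names.
#include <cstdio>
#include <regex>
#include <string>
#include <utility>
#include <vector>

int main() {
    // (pattern, target) pairs in the same order as the -ot flags above.
    const std::vector<std::pair<std::regex, std::string>> overrides = {
        { std::regex("blk\\.(3|4|5|6|7|8|9|10|11|12|13|14|15|16|17|18|19|20)\\.ffn_.*"),        "CUDA0" },
        { std::regex("blk\\.(21|22|23|24|25|26|27|28|29|30|31|32|33|34|35|36|37|38)\\.ffn_.*"), "CUDA1" },
        { std::regex("exps"),                                                                   "CPU"   },
    };
    // A few made-up tensor names to show where each one would land.
    const std::vector<std::string> names = {
        "blk.4.ffn_down_exps.weight",   // layer 4 experts  -> CUDA0
        "blk.25.ffn_up_exps.weight",    // layer 25 experts -> CUDA1
        "blk.45.ffn_gate_exps.weight",  // layer 45 experts -> CPU (only "exps" matches)
        "blk.45.attn_kv_b.weight",      // no override, stays wherever -ngl puts it
    };
    for (const auto & name : names) {
        std::string target = "default (-ngl)";
        for (const auto & [re, dst] : overrides) {
            if (std::regex_search(name, re)) { target = dst; break; }
        }
        std::printf("%-30s -> %s\n", name.c_str(), target.c_str());
    }
    return 0;
}
```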

ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 2 CUDA devices:
  Device 0: NVIDIA RTX A6000, compute capability 8.6, VMM: yes
  Device 1: NVIDIA RTX A6000, compute capability 8.6, VMM: yes

llama_model_loader: - type  f32:  361 tensors
llama_model_loader: - type q4_0:   61 tensors
llama_model_loader: - type iq4_ks:  551 tensors
llama_model_loader: - type iq1_s_r4:  116 tensors
llama_model_loader: - type iq1_m_r4:   58 tensors

llm_load_print_meta: model type       = 671B
llm_load_print_meta: model ftype      = IQ1_S_R4 - 1.5 bpw
llm_load_print_meta: model params     = 672.050 B
llm_load_print_meta: model size       = 130.203 GiB (1.664 BPW)
llm_load_print_meta: repeating layers = 129.285 GiB (1.657 BPW, 670.196 B parameters)
llm_load_print_meta: general.name     = DeepSeek R1 0528

llm_load_tensors: offloaded 62/62 layers to GPU
llm_load_tensors:        CPU buffer size =  5994.06 MiB
llm_load_tensors:        CPU buffer size = 44211.82 MiB
llm_load_tensors:        CPU buffer size =   469.99 MiB
llm_load_tensors:      CUDA0 buffer size = 42859.65 MiB
llm_load_tensors:      CUDA1 buffer size = 43061.37 MiB

llama_kv_cache_init:      CUDA0 KV buffer size =   576.00 MiB
llama_kv_cache_init:      CUDA1 KV buffer size =   522.00 MiB
llama_new_context_with_model: KV self size  = 1098.00 MiB, c^KV (f16): 1098.00 MiB, kv^T: not used
llama_new_context_with_model:  CUDA_Host  output buffer size =     0.49 MiB
llama_new_context_with_model: pipeline parallelism enabled (n_copies=1)
llama_new_context_with_model:      CUDA0 compute buffer size =  2824.02 MiB
llama_new_context_with_model:      CUDA1 compute buffer size =  2520.01 MiB
llama_new_context_with_model:  CUDA_Host compute buffer size =   368.05 MiB
llama_new_context_with_model: graph nodes  = 5500
llama_new_context_with_model: graph splits = 111
|   PP |   TG |  N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
|-----:|-----:|------:|-------:|---------:|-------:|---------:|
| 4096 | 1024 |     0 |  9.959 |   411.28 | 70.744 |    14.47 |
| 4096 | 1024 |  4096 | 12.460 |   328.73 | 73.277 |    13.97 |
| 4096 | 1024 |  8192 | 14.947 |   274.04 | 76.418 |    13.40 |
| 4096 | 1024 | 12288 | 17.442 |   234.84 | 78.654 |    13.02 |

@ubergarm
Contributor

ubergarm commented Jun 5, 2025

4.8805 +/- 0.02876 perplexity, not great, not terrible.
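
(For reference on what that number means: llama-perplexity reports token-level perplexity, i.e. the exponentiated mean negative log-likelihood over the evaluated tokens,

$$\mathrm{PPL} = \exp\!\left(-\frac{1}{N}\sum_{i=1}^{N}\log p(x_i \mid x_{<i})\right),$$

and the +/- is the estimated statistical uncertainty of that measurement.)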

Importantly, it runs clean with no NaNs!!! Ship it! 🚢 🐿️ 🚀

ikawrakow merged commit eded4e2 into main on Jun 5, 2025.