CUDA: fix q_nope_absorbed precision for Deepseek 2 Lite f16 by JohannesGaessler · Pull Request #13137 · ggml-org/llama.cpp

JohannesGaessler · 2025-04-27T15:08:48Z

I noticed that Deepseek 2 Lite, when calculating perplexity on Wikitext, returned worse results with FP16 weights than with q4_0 weights: over 10 chunks of 512 tokens, FP16 resulted in a perplexity of 27.6215 while q4_0 resulted in 8.1775. The problem seems to be numerical issues in the calculation of q_nope_absorbed, specifically with CUDA and batch sizes > 1. If FP32 precision is used, the perplexity becomes 7.9094.

On master, ggml_cuda_mul_mat does not always respect the precision set via ggml_mul_mat_set_prec: ggml_cuda_mul_mat_batched_cublas only supports FP16, FP16 -> FP16 GEMM but is used regardless of the requested precision. This PR makes it so that, if higher precision is requested, ggml_cuda_op_mul_mat_cublas is used instead (which supports FP32 precision). Long-term I think we should aim to remove ggml_cuda_op_mul_mat and refactor the cuBLAS code. I'm currently working towards the former (removing ggml_cuda_op_mul_mat); what I think is specifically needed is MMQ support for batched and non-contiguous inputs and backend-agnostic support for tensor parallelism.
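The dispatch change described above can be pictured with a small sketch. This is not the code from the PR: `prec`, `mul_mat_path`, `mul_mat_request` and `choose_path` are hypothetical stand-ins, and the real conditions in ggml_cuda_mul_mat involve more cases (quantized types, MMQ/MMVQ, tensor shapes). It only illustrates the idea that the FP16-only batched path should be skipped when a higher precision was requested.

```cpp
#include <cassert>

// Hypothetical stand-in for the precision requested via ggml_mul_mat_set_prec.
enum class prec { DEFAULT, F32 };

// Hypothetical labels for the two cuBLAS code paths named in the description.
enum class mul_mat_path {
    BATCHED_CUBLAS_F16, // stands for ggml_cuda_mul_mat_batched_cublas (FP16 only)
    OP_CUBLAS           // stands for ggml_cuda_op_mul_mat_cublas (supports FP32)
};

// Hypothetical, heavily simplified description of a matrix multiplication.
struct mul_mat_request {
    bool src0_is_f16; // weights are FP16
    bool batched;     // more than one matrix in the batch dimensions
    prec requested;   // precision requested for the result
};

// Behavior on master as described above: the requested precision is ignored.
mul_mat_path choose_path_before(const mul_mat_request & r) {
    return (r.src0_is_f16 && r.batched) ? mul_mat_path::BATCHED_CUBLAS_F16
                                        : mul_mat_path::OP_CUBLAS;
}

// Behavior after the fix as described above: the FP16-only batched path is
// taken only when no higher precision was requested.
mul_mat_path choose_path(const mul_mat_request & r) {
    const bool fp16_ok = r.requested == prec::DEFAULT;
    return (r.src0_is_f16 && r.batched && fp16_ok) ? mul_mat_path::BATCHED_CUBLAS_F16
                                                   : mul_mat_path::OP_CUBLAS;
}
```

With this model, a batched FP16 multiplication that requests `prec::F32` goes to the FP16-only path before the change and to the FP32-capable path after it, while requests with default precision are unaffected.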

jukofyork · 2025-04-27T21:00:10Z

Yeah, q_nope_absorbed is basically doing the KQ multiplication, so it will likely suffer from the same overflow problems. I had similar problems when I tried to use FP16 for it too.
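To make the overflow concern concrete, here is a minimal CPU-side sketch of a dot product whose running sum is stored in FP16 versus FP32. It is an illustration under assumptions, not what cuBLAS does internally: `round_to_fp16`, `dot_fp16` and `dot_fp32` are hypothetical helpers that emulate IEEE binary16 rounding in software, and the input values are made up. The largest finite FP16 value is 65504, so a sum that exceeds it becomes infinity.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Round a float to the nearest value representable in IEEE binary16 (FP16),
// using round-to-nearest-even (the default floating-point rounding mode).
float round_to_fp16(float x) {
    if (x == 0.0f || std::isnan(x) || std::isinf(x)) return x;
    int e = 0;
    const float m = std::frexp(x, &e);  // x = m * 2^e with 0.5 <= |m| < 1
    int bits = 11;                      // 11 significant bits for normal values
    if (e < -13) bits -= (-13 - e);     // fewer bits in the subnormal range
    const float y = std::ldexp(std::nearbyint(std::ldexp(m, bits)), e - bits);
    if (std::fabs(y) > 65504.0f) return std::copysign(INFINITY, x); // overflow
    return y;
}

// Dot product where every product and the running sum are rounded to FP16.
float dot_fp16(const std::vector<float> & a, const std::vector<float> & b) {
    float acc = 0.0f;
    for (std::size_t i = 0; i < a.size(); ++i) {
        const float prod = round_to_fp16(round_to_fp16(a[i]) * round_to_fp16(b[i]));
        acc = round_to_fp16(acc + prod);
    }
    return acc;
}

// Same FP16 inputs, but products and the running sum are kept in FP32.
float dot_fp32(const std::vector<float> & a, const std::vector<float> & b) {
    float acc = 0.0f;
    for (std::size_t i = 0; i < a.size(); ++i) {
        acc += round_to_fp16(a[i]) * round_to_fp16(b[i]);
    }
    return acc;
}
```

For example, with two vectors of 512 elements all equal to 16, each product is 256 and the exact sum is 131072: `dot_fp32` returns that value, while `dot_fp16` overflows to infinity once the running sum passes 65504.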

CUDA: fix q_nope_absorbed prec for DS 2 Lite f16

ac5cc2f

github-actions bot added the labels "Nvidia GPU" (Issues specific to Nvidia GPUs) and "ggml" (changes relating to the ggml tensor library for machine learning) on Apr 27, 2025

ggerganov approved these changes Apr 28, 2025

JohannesGaessler merged commit 69699be into ggml-org:master Apr 28, 2025
48 checks passed

jukofyork mentioned this pull request Apr 28, 2025

DeepSeek V2/V3 MLA implementation #12801

Merged

This was referenced Apr 28, 2025

CUDA: fix non-cont. inputs for batched mat mul #13155

Merged

test: non-cont. b in test-backend-ops -o MUL_MAT #13187

Merged

timwu pushed a commit to timwu/llama.cpp that referenced this pull request Dec 20, 2025

CUDA: fix q_nope_absorbed prec for DS 2 Lite f16 (ggml-org#13137)

d8b4ba7

JohannesGaessler merged 1 commit into ggml-org:master from JohannesGaessler:ds2l-prec
