
NvFP4 quantized LM head support#23046

Merged
CISC merged 5 commits into
ggml-org:masterfrom
ynankani:ynankani/nvfp4_lm_head_quant_support
May 16, 2026

@ynankani
Contributor

Overview

Add support for quantized LM head.

  1. Create output_scale tensor with weight tying check.
  2. Pass output_s in build_lora_mm, consistent with other scale tensors like wq_s.
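
As a rough illustration of point 2, the sketch below assumes output_s is a single per-tensor scale applied to the result of the LM head mat-mul. The helper names `matvec` and `lm_head_logits` are hypothetical and do not come from llama.cpp. It only shows that scaling after the mat-mul gives the same logits as scaling the weights first, which is why the scale can be carried as a separate tensor.

```python
def matvec(weight, x):
    """Plain matrix-vector product: weight is [n_vocab][n_embd], x is [n_embd]."""
    return [sum(w * v for w, v in zip(row, x)) for row in weight]


def lm_head_logits(weight_q, output_s, x):
    """Hypothetical LM head: mat-mul on the stored weights, then apply the scale."""
    return [output_s * logit for logit in matvec(weight_q, x)]


# Toy values chosen so that every intermediate result is exact in binary.
weight_q = [[1.0, -2.0, 0.5], [0.0, 4.0, -1.5]]
output_s = 0.25  # assumption: one scalar scale for the whole tensor
x = [2.0, 1.0, -4.0]

scaled_after = lm_head_logits(weight_q, output_s, x)
scaled_before = matvec([[output_s * w for w in row] for row in weight_q], x)
print(scaled_after, scaled_before)
```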

Additional information

vLLM is also adding support for a quantized LM head with this PR.

Requirements

  • I have read and agree with the contributing guidelines
  • AI usage disclosure: Yes, for redundant work and testing the change

Signed-off-by: ynankani <ynankani@nvidia.com>
@ynankani ynankani requested review from CISC and ggerganov as code owners May 14, 2026 10:28
Comment thread src/llama-model.cpp Outdated
Signed-off-by: ynankani <ynankani@nvidia.com>
Comment thread src/llama-model.cpp Outdated
@github-actions github-actions Bot added the model Model specific label May 14, 2026
@CISC
Member

CISC commented May 14, 2026

Looks like this somehow broke phimoe?

Signed-off-by: ynankani <ynankani@nvidia.com>
@ynankani
Contributor Author

ynankani commented May 14, 2026

Looks like this somehow broke phimoe?

The issue is that the pointers for tok_embd and output are different, and on the test path tensor creation creates a tensor by default whenever dimensions are given, ignoring TENSOR_NOT_REQUIRED. This results in a tensor count mismatch.

  1. Compare the names instead of the pointers for output and tok_embd.
  2. Check the dtype of the LM head and create the output_s tensor only if it is NvFP4, at least on the test path.
  3. Remove the weight tying check and just add an assert for the case where weights are tied and the LM head is NvFP4.
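
Options 2 and 3 can be combined into one decision, sketched below. This is an illustration only: `needs_output_scale` and the string type names are hypothetical, and the real check lives in the C++ tensor-loading code. The point is that a non-NvFP4 LM head never requests the scale tensor, so the tensor count on the test path stays unchanged.

```python
def needs_output_scale(output_type: str, tied_to_tok_embd: bool) -> bool:
    """Decide whether an output_s tensor should be created (hypothetical helper)."""
    if output_type != "NVFP4":
        # Option 2: no scale tensor for other dtypes, tied or not.
        return False
    if tied_to_tok_embd:
        # Option 3: stands in for the assert on NvFP4 LM head + tied embeddings.
        raise ValueError("NvFP4 LM head with tied embeddings is not supported")
    return True


print(needs_output_scale("F16", tied_to_tok_embd=True))
print(needs_output_scale("NVFP4", tied_to_tok_embd=False))
```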

Comment thread src/llama-model.cpp Outdated
Comment thread src/llama-model.cpp Outdated
Comment thread src/llama-model.cpp Outdated
Signed-off-by: ynankani <ynankani@nvidia.com>
@ynankani
Contributor Author

ynankani commented May 15, 2026

Test cases are failing because the Vulkan tanh shader doesn't handle near-zero values. If I revert #10723, I don't see the test failures.
Option 1

Use: data_d[i] = D_TYPE(tanh(float(data_a[i])));
instead of: data_d[i] = D_TYPE(1. - 2. / (exp(2.*data_a[i]) + 1.));

Option 2

float x = float(data_a[i]);  // read the input element (data_a), not the output
float result;
if (abs(x) < 2.4414e-4) {
    // tanh(x) ≈ x - x^3/3 + 2x^5/15: no exp, no cancellation
    float x2 = x * x;
    result = x * (1.0 - x2 * (1.0/3.0 - x2 * (2.0/15.0)));
} else {
    result = 1.0 - 2.0 / (exp(2.0 * x) + 1.0);
}
data_d[i] = D_TYPE(result);

@0cc4m Please let me know if the #10723 PR is still needed for Vulkan
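
The cancellation in the exp-based form can be checked outside the shader. The sketch below is not the Vulkan code: it emulates single precision in Python by rounding to binary32 after every operation (the helper `f32` and both function names are assumptions made for this illustration) and compares the two formulas against `math.tanh` for one near-zero input.

```python
import math
import struct


def f32(x: float) -> float:
    """Round a Python float (binary64) to the nearest binary32 value."""
    return struct.unpack("f", struct.pack("f", x))[0]


def tanh_exp_f32(x: float) -> float:
    """1 - 2 / (exp(2x) + 1), rounded to binary32 after every operation."""
    x = f32(x)
    e = f32(math.exp(f32(2.0 * x)))
    return f32(1.0 - f32(2.0 / f32(e + 1.0)))


def tanh_series_f32(x: float) -> float:
    """x - x^3/3 + 2x^5/15 in nested form, rounded to binary32 at each step."""
    x = f32(x)
    x2 = f32(x * x)
    inner = f32(f32(1.0 / 3.0) - f32(x2 * f32(2.0 / 15.0)))
    return f32(x * f32(1.0 - f32(x2 * inner)))


x = 1e-5  # well below the 2.4414e-4 threshold proposed above
ref = math.tanh(f32(x))
err_exp = abs(tanh_exp_f32(x) - ref) / ref
err_series = abs(tanh_series_f32(x) - ref) / ref
print(f"exp form relative error:    {err_exp:.2e}")
print(f"series form relative error: {err_series:.2e}")
```

In this emulation the exp form can only return multiples of 2^-24, because the subtraction from 1 discards the low bits, while the series form keeps the full precision of x. How large the effect is on real hardware depends on the precision the shader actually uses, which is not stated here.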

Signed-off-by: ynankani <ynankani@nvidia.com>
@0cc4m
Contributor

0cc4m commented May 15, 2026

I don't understand this change yet. How does it cause different use of tanh?

@ynankani
Contributor Author

I don't understand this change yet. How does it cause different use of tanh?

This change adds an output_s tensor for an NvFP4 LM head. On the test path there is no GGUF file, so a tensor is created with the given dimensions and filled with values drawn from a normal distribution. output_s is then used in the mat-mul. tanh behaves differently on Vulkan and on the CPU for near-zero values, which causes a high NMSE on the Vulkan path.

  1. logits = output @ x (normal magnitudes)
  2. logits *= output_s (N(0, 0.01), so ~0.001-0.1)
  3. logits /= f_final_logit_softcapping (30) (~3e-5 to 3e-3)
  4. logits = tanh(logits) <= near-zero values need to be handled here
  5. logits *= 30 (back to ~0.001-0.1)
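
Steps 3 to 5 and the NMSE comparison can be sketched as follows. The input magnitudes are illustrative values in the range quoted in step 2, the binary32 emulation via `f32` is an assumption of this sketch, and the resulting NMSE is not expected to match the CI figure, since that depends on the precision used by the backend.

```python
import math
import struct


def f32(x: float) -> float:
    """Round a Python float (binary64) to the nearest binary32 value."""
    return struct.unpack("f", struct.pack("f", x))[0]


def tanh_exp_f32(x: float) -> float:
    """1 - 2 / (exp(2x) + 1), rounded to binary32 after every operation."""
    x = f32(x)
    e = f32(math.exp(f32(2.0 * x)))
    return f32(1.0 - f32(2.0 / f32(e + 1.0)))


def softcap(logit: float, cap: float, tanh_fn) -> float:
    """Steps 3-5: divide by the cap, apply tanh, multiply by the cap again."""
    return cap * tanh_fn(logit / cap)


def nmse(test, ref) -> float:
    """Normalized mean squared error of test against ref."""
    num = sum((t - r) ** 2 for t, r in zip(test, ref))
    return num / sum(r * r for r in ref)


CAP = 30.0  # f_final_logit_softcapping from step 3
scaled_logits = [0.001, 0.003, 0.01, 0.03, 0.1]  # illustrative, after step 2

tanh_inputs = [v / CAP for v in scaled_logits]
ref = [softcap(v, CAP, math.tanh) for v in scaled_logits]
approx = [softcap(v, CAP, tanh_exp_f32) for v in scaled_logits]

print("tanh inputs:", [f"{t:.1e}" for t in tanh_inputs])
print(f"NMSE of exp-form softcap vs reference: {nmse(approx, ref):.3e}")
```

With inputs this small, soft-capping is close to the identity, so any error introduced by tanh passes almost unchanged into the final logits.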

@gaugarg-nv
Contributor

IMO, the current change to add the tensors only when the type is NVFP4 should be fine. Running the CI again to see if it fixes the Vulkan failure.

@CISC CISC merged commit 42928bc into ggml-org:master May 16, 2026
50 checks passed
dandm1 pushed a commit to dandm1/llama.cpp that referenced this pull request May 16, 2026
* NvFP4 quantized LM head support

Signed-off-by: ynankani <ynankani@nvidia.com>

* Address review commnets

Signed-off-by: ynankani <ynankani@nvidia.com>

* Add assert for NvFp4 lm head and tied embeddings

Signed-off-by: ynankani <ynankani@nvidia.com>

* Address review commnets

Signed-off-by: ynankani <ynankani@nvidia.com>

* Create output_s tensor only when LM head NvFp4

Signed-off-by: ynankani <ynankani@nvidia.com>

---------

Signed-off-by: ynankani <ynankani@nvidia.com>
jimbothigpen added a commit to jimbothigpen/llama.cpp that referenced this pull request May 21, 2026
…gml-org#23046 follow-up)

Ygg-local file not touched by mainline cherry-pick; mirror the same
signature update applied to all 100 mainline model files in 42928bc.