NvFP4 quantized LM head support by ynankani · Pull Request #23046 · ggml-org/llama.cpp

ynankani · 2026-05-14T10:28:31Z

Overview

Add support for a quantized LM head.

Create the output_scale tensor, with a weight-tying check.
Pass output_s to build_lora_mm, consistent with other scale tensors such as wq_s.

Additional information

vLLM is also supporting Quantized LM head with this PR

Requirements

I have read and agree with the contributing guidelines
AI usage disclosure: Yes, for redundant work and testing the change

Signed-off-by: ynankani <ynankani@nvidia.com>

CISC · 2026-05-14T14:16:35Z

Looks like this somehow broke phimoe?

ynankani · 2026-05-14T14:33:30Z

> Looks like this somehow broke phimoe?

The issue here is that the pointers for tok_embd and output are different, and on the test path create tensor creates a tensor by default whenever dimensions are given, ignoring TENSOR_NOT_REQUIRED. So there is a count mismatch.

One option is to compare output and tok_embd by name instead of by pointer.
Another is to check the LM head dtype and create the output_s tensor only when it is NvFP4, at least on the test path.
A third is to remove the weight-tying check and just add an assert for the case of tied weights with an NvFP4 LM head.
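
The second and third options can be combined into one small decision function. The following is only a sketch under assumptions: `TensorType` and `should_create_output_scale` are hypothetical stand-ins invented for illustration, not llama.cpp types or functions, and the real loader code may be organized differently.

```cpp
#include <cassert>

// Hypothetical stand-in for the tensor type enum; not the ggml type.
enum class TensorType { F16, Q4_K, NVFP4 };

// Returns true when a separate scale tensor (output_s) should be created
// for the LM head. Assumption: the scale is only meaningful for an NvFP4
// head, and an NvFP4 head with tied embeddings is treated as unsupported.
inline bool should_create_output_scale(TensorType lm_head_type, bool weights_tied) {
    if (lm_head_type != TensorType::NVFP4) {
        // Non-NvFP4 heads never get output_s, so the test path does not
        // create an extra tensor and the tensor count stays unchanged.
        return false;
    }
    assert(!weights_tied && "NvFP4 LM head with tied embeddings is not handled");
    return true;
}
```

Keying the decision on the dtype rather than on a pointer comparison avoids relying on tok_embd and output sharing a pointer, which the comment above says does not hold on the test path.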

ynankani · 2026-05-15T09:16:31Z

Test cases are failing because the Vulkan tanh shader doesn't handle near-zero values accurately. If I revert #10723, I don't see the test failures.
Option 1

Use: data_d[i] = D_TYPE(tanh(float(data_a[i])));
instead of: data_d[i] = D_TYPE(1. - 2. / (exp(2.*data_a[i]) + 1.));

Option 2

Another solution is to use:

float x = float(data_a[i]);  // read the input element, not the output buffer
float result;
if (abs(x) < 2.4414e-4) {
    // tanh(x) ≈ x - x^3/3 + 2x^5/15: no exp, no cancellation
    float x2 = x * x;
    result = x * (1.0 - x2 * (1.0/3.0 - x2 * (2.0/15.0)));
} else {
    result = 1.0 - 2.0 / (exp(2.0 * x) + 1.0);
}
data_d[i] = D_TYPE(result);
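
To try the two variants on the CPU, the shader expressions can be transcribed into single-precision C++. This is a rough sketch, not the Vulkan shader itself: the function names `tanh_exp_form`, `tanh_piecewise`, and `rel_error` are made up for illustration, and host float arithmetic is only assumed to approximate what the GPU does.

```cpp
#include <cmath>

// Exp-based form from the current shader: 1 - 2 / (exp(2x) + 1), in float.
// For tiny x, exp(2x) is close to 1, so the final subtraction cancels most
// of the significant digits.
inline float tanh_exp_form(float x) {
    return 1.0f - 2.0f / (std::exp(2.0f * x) + 1.0f);
}

// Piecewise variant from Option 2: a short Taylor series near zero,
// the exp-based form everywhere else.
inline float tanh_piecewise(float x) {
    const float kSmall = 2.4414e-4f;  // threshold quoted in the comment above
    if (std::fabs(x) < kSmall) {
        const float x2 = x * x;
        return x * (1.0f - x2 * (1.0f / 3.0f - x2 * (2.0f / 15.0f)));
    }
    return 1.0f - 2.0f / (std::exp(2.0f * x) + 1.0f);
}

// Relative error of a float approximation against double-precision tanh.
inline double rel_error(float approx, float x) {
    const double exact = std::tanh(static_cast<double>(x));
    return std::fabs(static_cast<double>(approx) - exact) / std::fabs(exact);
}
```

Calling `rel_error(tanh_exp_form(x), x)` and `rel_error(tanh_piecewise(x), x)` for inputs around 3e-5 gives a quick way to compare the two forms in the range the failing test exercises.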

@0cc4m Please let me know whether PR #10723 is still needed for Vulkan.

0cc4m · 2026-05-15T09:44:49Z

I don't understand this change yet. How does it cause different use of tanh?

ynankani · 2026-05-15T09:57:37Z

> I don't understand this change yet. How does it cause different use of tanh?

This change adds an output_s tensor for the LM head in NvFP4. On the test path there is no GGUF file, so a tensor is created with the given dimensions and its values are drawn from a normal distribution. output_s is then used in the mat-mul. tanh behaves differently on Vulkan and CPU for near-zero values, which causes a high NMSE on the Vulkan path.

logits = output @ x (normal magnitudes)
logits *= output_s (N(0, 0.01), so ~0.001-0.1)
logits /= f_final_logit_softcapping (30) (~3e-5 to 3e-3)
logits = tanh(logits) <= Need to handle near zero value
logits *= 30 (back to ~0.001-0.1)
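
The chain above can be reproduced on the host to see how a tanh implementation affects the soft-capped logits. This is a sketch under assumptions: `softcap` and `nmse` are illustrative helpers written for this note, and the NMSE formula used here (squared error divided by the squared norm of the reference) is one common definition that may not match the test harness exactly.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Soft-capping as in the steps above: cap * tanh(logit / cap).
// The tanh implementation is passed in so that variants can be compared.
template <typename TanhFn>
float softcap(float logit, float cap, TanhFn tanh_fn) {
    return cap * tanh_fn(logit / cap);
}

// Normalized mean squared error of `out` against `ref`:
// sum((out - ref)^2) / sum(ref^2), accumulated in double.
inline double nmse(const std::vector<float>& ref, const std::vector<float>& out) {
    double err = 0.0;
    double norm = 0.0;
    for (std::size_t i = 0; i < ref.size() && i < out.size(); ++i) {
        const double d = static_cast<double>(out[i]) - static_cast<double>(ref[i]);
        err += d * d;
        norm += static_cast<double>(ref[i]) * static_cast<double>(ref[i]);
    }
    return norm > 0.0 ? err / norm : 0.0;
}
```

Because the scaled logits are themselves small, a small absolute error in tanh near zero becomes a large relative error after multiplying back by 30, and a normalized metric such as NMSE reflects that directly.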

gaugarg-nv · 2026-05-15T10:58:05Z

IMO, the current change to add the tensors only when the type is NVFP4 should be fine. Running the CI again to see if it fixes the Vulkan failure.

* NvFP4 quantized LM head support
* Address review commnets
* Add assert for NvFp4 lm head and tied embeddings
* Address review commnets
* Create output_s tensor only when LM head NvFp4

Signed-off-by: ynankani <ynankani@nvidia.com>

…gml-org#23046 follow-up) Ygg-local file not touched by mainline cherry-pick; mirror the same signature update applied to all 100 mainline model files in 42928bc.


NvFP4 quantized LM head support

d84f3c5

Signed-off-by: ynankani <ynankani@nvidia.com>

ynankani requested review from CISC and ggerganov as code owners May 14, 2026 10:28

CISC approved these changes May 14, 2026

Comment thread src/llama-model.cpp Outdated

Address review commnets

2b46cda

Signed-off-by: ynankani <ynankani@nvidia.com>

gaugarg-nv reviewed May 14, 2026

Comment thread src/llama-model.cpp Outdated

github-actions Bot added the model Model specific label May 14, 2026

Add assert for NvFp4 lm head and tied embeddings

b88eca2

Signed-off-by: ynankani <ynankani@nvidia.com>

gaugarg-nv reviewed May 14, 2026

Comment thread src/llama-model.cpp Outdated

Comment thread src/llama-model.cpp Outdated

Comment thread src/llama-model.cpp Outdated

Address review commnets

db614c4

Signed-off-by: ynankani <ynankani@nvidia.com>

gaugarg-nv approved these changes May 15, 2026

Create output_s tensor only when LM head NvFp4

55618a3

Signed-off-by: ynankani <ynankani@nvidia.com>

gaugarg-nv approved these changes May 16, 2026

CISC merged commit 42928bc into ggml-org:master May 16, 2026
50 checks passed

CISC merged 5 commits into ggml-org:master from ynankani:ynankani/nvfp4_lm_head_quant_support

4 participants
