
fix(graph): remove duplicate wo_s scale after build_attn (Qwen3, LLaMA)#22421

Merged
CISC merged 1 commit into ggml-org:master from ynankani:ynankani/qwen3_nvfp4_fix on Apr 27, 2026

Conversation

@ynankani
Contributor

Overview

build_attn in llama-graph.cpp already applies the NVFP4 per-tensor scale (wo_s),
either via build_lora_mm(wo, cur, wo_s) or via an explicit wo_s multiply.
The model builders for qwen3, qwen3moe, and llama then multiplied the
result by wo_s again, so wo_s was applied twice whenever companion
blk.*.attn_output.scale tensors were present.
That crushed the attention residual (roughly wo_s^2 per layer) and broke
NVFP4 GGUFs for LLM_ARCH_QWEN3, LLM_ARCH_QWEN3MOE, and LLM_ARCH_LLAMA /
LLM_ARCH_LLAMA_EMBED.
This PR removes the redundant ggml_mul blocks in:

  • src/models/qwen3.cpp
  • src/models/qwen3moe.cpp
  • src/models/llama.cpp

Non-NVFP4 GGUFs keep wo_s == nullptr, so behavior is unchanged there.

Additional information

This issue was observed when running inference on a GGUF model converted from https://huggingface.co/nvidia/Qwen3-8B-NVFP4.

Signed-off-by: Yash Nankani <ynankani@nvidia.com>
@ynankani ynankani requested a review from CISC as a code owner April 27, 2026 07:02
@ynankani ynankani changed the title fix(graph): remove duplicate wo_s scale after build_attn (Qwen1, LLaMA) fix(graph): remove duplicate wo_s scale after build_attn (Qwen3, LLaMA) Apr 27, 2026
@github-actions github-actions Bot added the model Model specific label Apr 27, 2026
Member

@CISC CISC left a comment


Ooops, nice catch. :)

@CISC CISC added the merge ready A maintainer can use this label to indicate that they consider the changes final and ready to merge. label Apr 27, 2026
@CISC CISC merged commit 0f1bb60 into ggml-org:master Apr 27, 2026
44 of 46 checks passed
IntelNav pushed a commit to IntelNav/llama.cpp that referenced this pull request Apr 29, 2026
rsenthilkumar6 pushed a commit to rsenthilkumar6/llama.cpp that referenced this pull request May 1, 2026
samuraieng pushed a commit to samuraieng/llama.cpp that referenced this pull request May 6, 2026
ljubomirj pushed a commit to ljubomirj/llama.cpp that referenced this pull request May 6, 2026
meh pushed a commit to meh/llama.cpp that referenced this pull request May 10, 2026
