
UPSTREAM PR #20505: convert: Fix Qwen3.5/Qwen3.5 Moe NVFP4 Conversions #1267

Open

loci-dev wants to merge 2 commits into main from loci/pr-20505-nvfp4-fix-qwen-conversions

Conversation

@loci-dev

Note

Source pull request: ggml-org/llama.cpp#20505

This PR fixes several errors that occur when attempting to convert Qwen3.5/Qwen3.5 MoE models. To keep this PR's scope specific, a separate PR, ggml-org/llama.cpp#20506, adds support for loading these newly converted models.

Bug:
When attempting to use convert_hf_to_gguf.py on various Qwen3.5 and Qwen3.5 MoE models, it would abort with the following error(s):

ValueError: Can not map tensor 'model.language_model.layers.0.mlp.shared_expert.down_proj.weight'
ValueError: Can not map tensor 'model.language_model.layers.0.linear_attn.in_proj_a.weight'

This occurred because these models now wrap tensor names with a model.language_model or language_model prefix. The fix strips these prefixes instead of failing, which allows the conversion to continue.
However, stripping the names and continuing was not enough to convert the models properly; a new error followed:
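The stripping step can be sketched as follows. This is a minimal illustration, not the exact convert_hf_to_gguf.py code; the helper name is invented, and the prefixes are the ones that appear in the errors above:

```python
# Hypothetical sketch of the prefix-stripping fix (not the exact
# convert_hf_to_gguf.py implementation).
def strip_language_model_prefix(name: str) -> str:
    # HF checkpoints for these models wrap tensor names in one of two
    # prefixes; map both back to the plain "model." form that the
    # existing tensor-name mapping expects.
    for prefix in ("model.language_model.", "language_model."):
        if name.startswith(prefix):
            return "model." + name[len(prefix):]
    return name
```

With this in place, a name like `model.language_model.layers.0.mlp.shared_expert.down_proj.weight` maps back to `model.layers.0.mlp.shared_expert.down_proj.weight`, which the tensor map can resolve.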

RuntimeError: shape '[16, 3, 1]' is invalid for input of size 1

This is because Qwen3.5's linear attention weights get reordered in modify_tensors():

# original order:  [q, k, v, z] * head_count
# corrected order: [q * head_count, k * head_count, v * head_count, z * head_count]
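The reordering above can be sketched with NumPy. This is a simplified illustration under assumed per-component row sizes (the function and parameter names are invented, and the real conversion code operates on torch tensors):

```python
import numpy as np

# Hypothetical sketch of the per-head de-interleaving described above.
# Assumes the first dimension of w holds head_count consecutive blocks,
# each laid out as [q rows, k rows, v rows, z rows].
def deinterleave_qkvz(w: np.ndarray, head_count: int,
                      q_dim: int, k_dim: int, v_dim: int, z_dim: int) -> np.ndarray:
    block = q_dim + k_dim + v_dim + z_dim
    # Split into per-head blocks: shape (head_count, block, ...)
    per_head = w.reshape(head_count, block, *w.shape[1:])
    # Slice each block into its q/k/v/z components along the row axis.
    q, k, v, z = np.split(per_head, np.cumsum([q_dim, k_dim, v_dim]), axis=1)
    # Regroup: all q heads first, then all k, all v, all z heads.
    return np.concatenate(
        [c.reshape(-1, *w.shape[1:]) for c in (q, k, v, z)], axis=0
    )
```

For example, with two heads and one row per component, rows ordered `[q0, k0, v0, z0, q1, k1, v1, z1]` come out as `[q0, q1, k0, k1, v0, v1, z0, z1]`.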

However, NVFP4 bypasses modify_tensors() and does its own repacking, so linear_attn.in_proj_a.input_scale was seen as a [num_v_heads] tensor that the repacking code tried to reshape into [16, 3, 1].
This is fixed by skipping tensors in the write loop that have already been repacked:

if self._is_nvfp4:
    if name.endswith(".weight") and name.replace(".weight", ".weight_scale") in self.model_tensors:
        continue
    if name.endswith((".weight_scale", ".weight_scale_2", ".input_scale", "k_scale", ".v_scale")):
        continue

Updated: added k_scale and v_scale above.

and by applying the same reordering for:

linear_attn.in_proj_qkv
linear_attn.in_proj_z
linear_attn.in_proj_a
linear_attn.in_proj_b
linear_attn.out_proj

This now produces a correct Qwen3.5/Qwen3.5 MoE NVFP4 GGUF file. A separate PR must be applied to load these files.
This fixed the issue with both Qwen3.5-122B-A10B-NVFP4 and Qwen3.5-27B-NVFP4, and both produced proper output.
Qwen3.5-35B-A3B-NVFP4.gguf was also tested after returning k_scale and v_scale to the skip list.

Note: some Qwen3.5 NVFP4 HF models produce the following tokenizer error while others don't, even for the same model:

ValueError: Tokenizer class TokenizersBackend does not exist or is not currently imported.

Workaround:
Edit the model's tokenizer_config.json and change tokenizer_class from TokenizersBackend to Qwen2Tokenizer.
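The workaround can be scripted as a one-off patch. This is a hypothetical helper (name and structure invented here), assuming tokenizer_config.json sits at the top of the model directory:

```python
import json
from pathlib import Path

# Hypothetical helper applying the workaround above: point the model's
# tokenizer_config.json at Qwen2Tokenizer instead of TokenizersBackend.
def patch_tokenizer_class(model_dir: str) -> bool:
    cfg_path = Path(model_dir) / "tokenizer_config.json"
    cfg = json.loads(cfg_path.read_text())
    if cfg.get("tokenizer_class") != "TokenizersBackend":
        return False  # nothing to do
    cfg["tokenizer_class"] = "Qwen2Tokenizer"
    cfg_path.write_text(json.dumps(cfg, indent=2, ensure_ascii=False))
    return True
```

Run it against the local model directory before invoking convert_hf_to_gguf.py; it leaves already-correct configs untouched.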

@loci-review

loci-review bot commented Mar 18, 2026

No meaningful performance changes were detected across 120755 analyzed functions in the following binaries: build.bin.llama-tts, build.bin.libllama.so, build.bin.llama-cvector-generator, build.bin.libmtmd.so, build.bin.llama-bench, build.bin.libggml-base.so, build.bin.libggml-cpu.so, build.bin.libggml.so, build.bin.llama-quantize, build.bin.llama-qwen2vl-cli, build.bin.llama-tokenize, build.bin.llama-gemma3-cli, build.bin.llama-gguf-split, build.bin.llama-llava-cli, build.bin.llama-minicpmv-cli.

🔎 Full breakdown: Loci Inspector
💬 Questions? Tag @loci-dev

@loci-dev loci-dev force-pushed the main branch 5 times, most recently from 0e8e1d6 to 7dcdda5 Compare March 21, 2026 02:16