UPSTREAM PR #20505: convert: Fix Qwen3.5/Qwen3.5 Moe NVFP4 Conversions#1267
No meaningful performance changes were detected across 120755 analyzed functions in the following binaries: build.bin.llama-tts, build.bin.libllama.so, build.bin.llama-cvector-generator, build.bin.libmtmd.so, build.bin.llama-bench, build.bin.libggml-base.so, build.bin.libggml-cpu.so, build.bin.libggml.so, build.bin.llama-quantize, build.bin.llama-qwen2vl-cli, build.bin.llama-tokenize, build.bin.llama-gemma3-cli, build.bin.llama-gguf-split, build.bin.llama-llava-cli, build.bin.llama-minicpmv-cli.
Note
Source pull request: ggml-org/llama.cpp#20505
This PR fixes several errors that occur when attempting to convert Qwen3.5/Qwen3.5 MoE models. To keep this PR's scope focused, a separate PR, ggml-org/llama.cpp#20506, adds loading support for the newly converted models.
Bug:

When attempting to convert various Qwen3.5 and Qwen3.5 MoE models with `convert_hf_to_gguf.py`, the script aborted with errors. This occurred because these models now carry `model.language_model` or `language_model` prefixes on their tensor names. The fix strips these wrapper prefixes instead of failing, which allows conversion to continue.

Stripping the prefixes alone was not enough to convert the models properly; it exposed a second error.
This is because Qwen3.5's linear attention weights get reordered in `modify_tensors()`. However, NVFP4 bypasses `modify_tensors()` and has its own repacking, so `linear_attn.in_proj_a.input_scale` was seen as a `[num_v_heads]` tensor and the code tried to reshape it into `[16, 3, 1]`. This is fixed by skipping tensors in the write loop that were already repacked, and by applying the same reordering in the NVFP4 repacking path.
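The "skip already-repacked tensors in the write loop" part of the fix can be sketched like this. All names here are illustrative assumptions; the actual conversion code tracks this differently.

```python
# Hedged sketch: the NVFP4 repacking pass records which tensor names it has
# already consumed, and the generic write loop skips those names so they are
# not processed (and mis-reshaped) a second time.

def tensors_to_write(all_tensors, repacked_names):
    """Yield only tensors that the NVFP4 repacking pass has not already emitted."""
    for name, tensor in all_tensors:
        if name in repacked_names:
            continue  # already handled (and reordered) by the repacking pass
        yield name, tensor
```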
This now produces correct Qwen3.5/Qwen3.5 MoE NVFP4 GGUF files. A separate PR must be applied to load them.

This fixed the issue with both Qwen3.5-122B-A10B-NVFP4 and Qwen3.5-27B-NVFP4, and both produced proper output.
Qwen3.5-35B-A3B-NVFP4.gguf was also tested after returning k_scale and v_scale to the skip list.
Note: some Qwen3.5 NVFP4 HF models produce a tokenizer error while others don't, even for the same model.
Workaround:

Edit the model's `tokenizer_config.json` and change `tokenizer_class` from `TokenizersBackend` to `Qwen2Tokenizer`.
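The edit can also be applied with a small script rather than by hand. A minimal sketch, assuming a standard HF model directory layout (the function name is ours, not part of the PR):

```python
import json

def patch_tokenizer_class(path: str) -> bool:
    """Rewrite tokenizer_class from TokenizersBackend to Qwen2Tokenizer in place.

    `path` points at the model's tokenizer_config.json. Returns True if the
    file was changed, False if no patch was needed.
    """
    with open(path) as f:
        cfg = json.load(f)
    if cfg.get("tokenizer_class") != "TokenizersBackend":
        return False
    cfg["tokenizer_class"] = "Qwen2Tokenizer"
    with open(path, "w") as f:
        json.dump(cfg, f, indent=2)
    return True
```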