[Qwen 3.5][gemma4] Qwen35 and Gemma 4 fast inference #588
Conversation
Signed-off-by: Datta Nimmaturi <venkatadattasainimmaturi@gmail.com>
There was a problem hiding this comment.
Code Review
This pull request adds support for Qwen 3.5 and Gated Delta Net (GDN) architectures, including new configuration attributes, layer extraction logic for GDN, and enhanced vLLM-to-Hugging Face conversion. Key changes include the implementation of extract_gdn_layers, improved lm_head detection, and clearing rope_deltas during RL training. Review feedback suggests using .detach() instead of .data for safer tensor extraction and notes a semantic naming issue where conv1d was added to a list of normalization layers.
| conv_w = gdn.conv1d.weight.data | ||
| state_dict[f"{prefix}.conv1d.weight"] = conv_w | ||
| quant_state_dict[f"{prefix}.conv1d.weight"] = conv_w | ||
|
|
||
| state_dict[f"{prefix}.dt_bias"] = gdn.dt_bias.data | ||
| quant_state_dict[f"{prefix}.dt_bias"] = gdn.dt_bias.data | ||
| state_dict[f"{prefix}.A_log"] = gdn.A_log.data | ||
| quant_state_dict[f"{prefix}.A_log"] = gdn.A_log.data |
There was a problem hiding this comment.
Using .data is generally discouraged in modern PyTorch as it can bypass safety checks and lead to unexpected behavior. It is recommended to use .detach() instead to obtain a tensor that does not track gradients, which is sufficient for state dict extraction.
| conv_w = gdn.conv1d.weight.data | |
| state_dict[f"{prefix}.conv1d.weight"] = conv_w | |
| quant_state_dict[f"{prefix}.conv1d.weight"] = conv_w | |
| state_dict[f"{prefix}.dt_bias"] = gdn.dt_bias.data | |
| quant_state_dict[f"{prefix}.dt_bias"] = gdn.dt_bias.data | |
| state_dict[f"{prefix}.A_log"] = gdn.A_log.data | |
| quant_state_dict[f"{prefix}.A_log"] = gdn.A_log.data | |
| conv_w = gdn.conv1d.weight.detach() | |
| state_dict[f"{prefix}.conv1d.weight"] = conv_w | |
| quant_state_dict[f"{prefix}.conv1d.weight"] = conv_w | |
| state_dict[f"{prefix}.dt_bias"] = gdn.dt_bias.detach() | |
| quant_state_dict[f"{prefix}.dt_bias"] = gdn.dt_bias.detach() | |
| state_dict[f"{prefix}.A_log"] = gdn.A_log.detach() | |
| quant_state_dict[f"{prefix}.A_log"] = gdn.A_log.detach() |
| "norm1", # Qwen2.5-VL vision encoder | ||
| "norm2", # Qwen2.5-VL vision encoder | ||
| "norm", | ||
| "conv1d", |
There was a problem hiding this comment.
Adding "conv1d" to the layernorm_names list is semantically incorrect as it is a convolutional layer, not a normalization layer. While the current implementation reuses the direct parameter assignment logic in the else block (line 1421) which works for conv1d.weight, this naming is misleading and could lead to maintenance issues. Consider renaming the list to something more generic like simple_parameter_modules or adding a comment to explain this exception.
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: b7052b55a5
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| qkv_weight = weight[offsets[0]:offsets[3]] | ||
| z_weight = weight[offsets[3]:offsets[4]] | ||
| store(f"{prefix}.in_proj_qkv.weight", qkv_weight) | ||
| store(f"{prefix}.in_proj_z.weight", z_weight) |
There was a problem hiding this comment.
Preserve quant states when splitting GDN fused qkvz
When linear_attn.in_proj_qkvz is present, this path slices proj.weight into in_proj_qkv/in_proj_z and stores only raw weights, but it never carries over per-shard bnb_quant_state. For 4-bit/BnB checkpoints this means these layers lose their quant metadata, and convert_vllm_to_huggingface will no longer take the Linear4bit branch for them, producing incorrect dense modules (often with uint8-backed weights) and broken inference/training numerics on Qwen3.5 GDN models.
Useful? React with 👍 / 👎.
| if hasattr( | ||
| BitsAndBytesModelLoader._stack_quantization_states, | ||
| "_unsloth_gemma4_k_eq_v_patch", |
There was a problem hiding this comment.
Guard Gemma4 k_eq_v patch for missing loader method
This accesses BitsAndBytesModelLoader._stack_quantization_states directly before verifying the attribute exists. Because load_vllm now calls this patch unconditionally for Gemma4 vision models, environments with a different vLLM loader API will raise AttributeError during model load (even when not using 4-bit quantization). Add a defensive getattr/version gate and no-op when the method is unavailable.
Useful? React with 👍 / 👎.
|
Possible duplicate of a trusted maintainer's PR. This PR looks like it solves the same underlying problem as unslothai/unsloth-zoo#580 by @danielhanchen (trusted maintainer).
Canonical PR summary: This PR fixes Gemma-4 inference and generation crashes by patching the temporary compatibility layer so zero The auto-review is still running against this PR — reviewers will factor in the canonical above. If this PR is genuinely different, call out the delta in the review discussion so the maintainer can decide which to merge. |
…e_huggingface_model
- patch_gemma4_vllm_lora_support: use functools.wraps on patched_create_lora_manager so
_call_create_lora_manager's signature inspection still sees vllm_config; pass model
positionally to lora_manager_cls to avoid "multiple values for 'model'".
- patch_gemma4_vllm_k_eq_v_support: also handle split k_proj/v_proj layout (current
upstream Gemma4) by duplicating k quant-state to synthetic v entry; keep packed
qkv_proj path as fallback.
- load_vllm: gate Gemma4 patches on enable_lora / use_bitsandbytes (not is_vision_model),
so text-only Gemma4 + LoRA / BnB also works.
- extract_gdn_layers: derive qkvz offsets from gdn.key_dim/value_dim when
ColumnParallelLinear has no output_sizes; manually split in_proj_ba into b/a instead
of calling get_state_dict with kk=1 (IndexError); preserve BnB quant_state sidecars;
handle FP8 weight_scale (not only weight_scale_inv) and dynamic/row-wise FP8;
export linear_attn.norm.weight.
- finalize_huggingface_model: fix layer_idx for standard causal LMs (not only VLM path);
rebuild Gemma4 vision rotary_emb from vision_config with fp32 buffers; guard
rotary_pos_emb on vision_config availability; mirror language_model detection from
set_additional_modules.
- get_model_layer_config: register Gemma4 per_layer_input_gate / per_layer_projection /
post_per_layer_input_norm; add Qwen3.5 visual.merger.linear_fc1 / linear_fc2 and drop
the broken linear_fc{kk} template.
- set_dtype_in_config (hf_utils): prefer the modern 'dtype' field; fall back to
'torch_dtype' only when 'dtype' is absent, avoiding the deprecation warning on
current transformers.
- vllm_utils state-dict loop: skip layer.mlp extraction for linear-attn-only layers
(defensive) while still capturing layer_scalar.
- _normalize_state_dict_tensor: guard is_sparse behind isinstance(value, torch.Tensor)
so non-tensor state-dict values pass through.
…n, finalize_huggingface_model
- patch_gemma4_vllm_lora_support: use functools.wraps on patched_create_lora_manager so
_call_create_lora_manager's signature inspection still sees vllm_config; pass model
positionally to lora_manager_cls to avoid "multiple values for 'model'".
- patch_gemma4_vllm_k_eq_v_support: also handle split k_proj/v_proj layout (current
upstream Gemma4) by duplicating k quant-state to synthetic v entry; keep packed
qkv_proj path as fallback.
- load_vllm: gate Gemma4 patches on enable_lora / use_bitsandbytes (not is_vision_model),
so text-only Gemma4 + LoRA / BnB also works.
- extract_gdn_layers: derive qkvz offsets from gdn.key_dim/value_dim when
ColumnParallelLinear has no output_sizes; manually split in_proj_ba into b/a instead
of calling get_state_dict with kk=1 (IndexError); preserve BnB quant_state sidecars;
handle FP8 weight_scale (not only weight_scale_inv) and dynamic/row-wise FP8;
export linear_attn.norm.weight.
- finalize_huggingface_model: fix layer_idx for standard causal LMs (not only VLM path);
rebuild Gemma4 vision rotary_emb from vision_config with fp32 buffers; guard
rotary_pos_emb on vision_config availability; mirror language_model detection from
set_additional_modules.
- get_model_layer_config: register Gemma4 per_layer_input_gate / per_layer_projection /
post_per_layer_input_norm; add Qwen3.5 visual.merger.linear_fc1 / linear_fc2 and drop
the broken linear_fc{kk} template.
- set_dtype_in_config (hf_utils): prefer the modern 'dtype' field; fall back to
'torch_dtype' only when 'dtype' is absent, avoiding the deprecation warning on
current transformers.
- vllm_utils state-dict loop: skip layer.mlp extraction for linear-attn-only layers
(defensive) while still capturing layer_scalar.
- _normalize_state_dict_tensor: guard is_sparse behind isinstance(value, torch.Tensor)
so non-tensor state-dict values pass through.
|
Auto-review verdict: Approved PR #588 adds vLLM-to-HF fast inference support for Qwen3.5 (Gated Delta Net linear attention) and Gemma4 VLMs (LoRA + k_eq_v BnB patches), plus consolidates post-conversion rotary/device setup into a new finalize_huggingface_model helper. It meaningfully extends Unsloth's model coverage for two important next-gen architectures with correct conversion behavior after the review fixes. Reason: All 20 accepted findings fixed on pr-588-head (commit e4f530c); 14 consolidated tests pass on fixed code and catch the pre-fix regressions. |
# Conflicts: # unsloth_zoo/empty_model.py # unsloth_zoo/vllm_utils.py
|
Auto-review verdict: Approved PR adds vLLM-to-HF conversion support for Qwen 3.5 Gated Delta Net and Gemma4 (LoRA, k_eq_v BnB, per-layer inputs) plus a unified finalize_huggingface_model helper; it unblocks fast inference for both new model families and, with the applied fixes to rotary handling, GDN dequantize slicing, and lm_head lookup, is correct and ready to merge. Reason: All real review findings were fixed; no unfixed regressions remain and 43/43 tests pass. |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 086aa52c82
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| "model.visual.merger.linear_fc1", | ||
| "model.visual.merger.linear_fc2", |
There was a problem hiding this comment.
Keep merger FCs on quant-aware conversion path
Moving model.visual.merger.linear_fc1/linear_fc2 into non_layered_components routes them through set_additional_modules, which assigns raw tensors as nn.Parameter and never consults *.weight.quant_state (see the generic assignment path in set_additional_modules). For prequantized 4-bit checkpoints, these FCs are therefore rebuilt as plain nn.Linear with packed uint8 weights instead of Linear4bit, which can trigger dtype/runtime errors or silently wrong numerics in Qwen3-VL conversion.
Useful? React with 👍 / 👎.
|
Possible duplicate of a trusted maintainer's PR. This PR looks like it solves the same underlying problem as unslothai/unsloth-zoo#580 by @danielhanchen (trusted maintainer).
Canonical PR summary: This PR fixes Gemma-4 inference and generation crashes by patching the temporary compatibility layer so zero Other related PRs:
The auto-review is still running against this PR — reviewers will factor in the canonical above. If this PR is genuinely different, call out the delta in the review discussion so the maintainer can decide which to merge. |
- vllm_utils.py: convert_vllm_to_huggingface parameter-list path regex now handles trailing-digit segments (e.g. `embed_tokens_per_layer.0`) via the same anchor-or-end pattern used below for `exec` assignment. - empty_model.py: finalize_huggingface_model rotary reinit no longer swallows failures silently; the float32 buffer lift is skipped when the reinit raised, and the exception is logged so wrong-shape rotary state does not propagate. - empty_model.py: patch_gemma4_vllm_lora_support guards both the `gemma4_mm` and `gemma4` imports, does not clobber a pre-existing `embedding_modules` registry, and delegates to the original `create_lora_manager` so vLLM shim kwargs (e.g. `vllm_config`) reach the manager constructor correctly. - empty_model.py: patch_gemma4_vllm_k_eq_v_support uses getattr for `_stack_quantization_states` so absent / renamed private attrs do not crash, and adds `model.language_model` to the k/v name-prefix search so HF-style Gemma4 multimodal parameters match. - empty_model.py + vllm_utils.py: both Gemma4 gates share a new `_is_gemma4_config` helper that accepts `gemma4` and the text-only `gemma4_text` model_type. - vllm_utils.py: Gemma4 patch block runs after the BnB autodetect so that `-bnb-4bit` / `quant_method == bitsandbytes` models trigger the k_eq_v patch even when the caller did not pass use_bitsandbytes=True.
…in set_additional_modules
Adds 12 regression tests covering the iter-1 hardening (trailing-digit regex path, rotary reinit success guard, _is_gemma4_config helper, gemma4 gate migration, gemma4_mm import guard, private loader attr guard, HF-style k_eq_v prefix, lora manager delegation, behavioral no-op tests that stub missing vLLM modules). Updates test_gemma4_lora_patch_preserves_signature_for_inspect and test_gemma4_k_eq_v_set_hoists_constant_check to match the new source shape.
# Conflicts: # unsloth_zoo/empty_model.py # unsloth_zoo/vllm_utils.py
|
Auto-review verdict: Approved Adds vLLM <-> HF fast-inference support for Qwen 3.5 (Gated Delta Net linear-attention extraction) and Gemma 4 (per-layer input modules, k_eq_v attention, LoRA patch), plus a factored-out finalize_huggingface_model that fixes layer_idx, rotary reinit, and dtype propagation. Useful because it unlocks conversion/inference for two newly-supported architectures while the refactor cleans up rotary / dtype / lm_head paths that were already fragile. Reason: All 9 accepted review findings fixed and verified by 55/55 passing tests; merge-back landed cleanly on upstream. |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: ba822fcb87
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| rotary_emb = getattr(module, "rotary_emb", None) | ||
| if rotary_emb is None: | ||
| continue | ||
| for buffer_name, buffer in list(rotary_emb._buffers.items()): | ||
| if torch.is_tensor(buffer) and buffer.is_floating_point(): |
There was a problem hiding this comment.
Keep rotary_emb_local buffers in float32 after final cast
finalize_huggingface_model reinitializes rotary_emb_local earlier, but in the final non-quantized path it calls new_model.to(..., dtype=dtype) and then only restores float32 buffers for rotary_emb. Models that expose rotary_emb_local (local/global RoPE split variants) will therefore have local RoPE buffers downcast to bf16/fp16, which regresses RoPE precision versus the pre-cast state and can subtly change generation quality.
Useful? React with 👍 / 👎.
|
Auto-review verdict: Approved Adds vLLM fast-inference support for Qwen3.5 (Gated Delta Net linear-attention extraction/rebuild) and Gemma4 (k_eq_v attention, per-layer input modules, LoRA + BnB vLLM patches), plus a centralized finalize_huggingface_model for dtype/rotary/layer_idx cleanup. Review iterations added defensive GDN quant-state handling, restored in_proj_z emission on the multi-quant path, routed merger.linear_fc1/fc2 through the quantized loop, and extended fp32 lift to rotary_emb_local and rotary_pos_emb. Reason: All real issues found during review were fixed and verified by tests; no remaining regressions. |
|
danielhanchen#5 |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 93f1143594
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| if kk not in gemma4_k_eq_v_layers and kk not in gemma4_kv_shared_layers: | ||
| get_state_dict(f"{prefix}.v_proj", 2, state_dict, qkv_proj) |
There was a problem hiding this comment.
Extract v_proj weights for Gemma4 k_eq_v layers
Do not skip v_proj extraction for attention_k_eq_v layers here: Gemma4’s split-attention layout still exposes v_proj weights in many builds, and this guard drops them from quant_state_dict. During convert_vllm_to_huggingface, missing ...self_attn.v_proj.weight entries are silently skipped, so those layers keep the tiny empty-model placeholders instead of real weights, which corrupts converted Gemma4 checkpoints (especially full-attention layers with attention_k_eq_v=True).
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 1680006be3
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| if _is_gemma4_config(config) and getattr(text_config, "attention_k_eq_v", False): | ||
| gemma4_k_eq_v_layers = { | ||
| kk | ||
| for kk, layer_type in enumerate(getattr(text_config, "layer_types", ())) | ||
| if layer_type == "full_attention" |
There was a problem hiding this comment.
Guard Gemma4 layer_types iteration against None
When attention_k_eq_v is enabled, this comprehension iterates text_config.layer_types directly; on Gemma4 configs where layer_types is present but set to None (the documented optional/default form), enumerate(None) raises TypeError and aborts state-dict extraction. This makes conversion fail before any weights are exported for those checkpoints, so this path should coerce None to an empty/default iterable (for example getattr(..., None) or ()).
Useful? React with 👍 / 👎.
|
|
||
| param_names = set(name for name, _ in model.named_parameters()) | ||
| pairs = [] | ||
| for layer_idx, layer_type in enumerate(getattr(text_config, "layer_types", ())): |
There was a problem hiding this comment.
Handle None layer_types in Gemma4 k_eq_v loader patch
The Gemma4 bitsandbytes patch assumes text_config.layer_types is iterable whenever attention_k_eq_v is true; if that field is None, this loop throws TypeError during vLLM quant-state stacking, causing Gemma4 4-bit loading to fail before model initialization completes. The loop should defensively treat a None value as empty/default instead of iterating it directly.
Useful? React with 👍 / 👎.
Tried to make sure the changes are minimal so when we detect linear_attn, we hand it off to a separate function