[Qwen 3.5][gemma4] Qwen35 and Gemma 4 fast inference (clean-base mirror of #588)#10
[Qwen 3.5][gemma4] Qwen35 and Gemma 4 fast inference (clean-base mirror of #588)#10danielhanchen wants to merge 4 commits into
Conversation
|
/gemini review |
There was a problem hiding this comment.
Code Review
This pull request introduces support for Gemma 4 and Qwen 3.5 (GDN) architectures, adding specialized vLLM patches for LoRA and k_eq_v weight handling and a model finalization utility. It also expands layer configuration templates and improves dtype management. Reviewer feedback recommends simplifying helper functions like _get_model_device and _normalize_state_dict_tensor, and streamlining attribute-setting logic to improve code readability.
| def _get_model_device(model): | ||
| for tensor in model.parameters(): | ||
| return tensor.device | ||
| for tensor in model.buffers(): | ||
| return tensor.device | ||
| return torch.device("cpu") | ||
| pass |
There was a problem hiding this comment.
| try: | ||
| setattr(config, field, runtime_dtype) | ||
| success = True | ||
| continue | ||
| except Exception: | ||
| pass | ||
|
|
||
| try: | ||
| config.__dict__[field] = runtime_dtype | ||
| success = True | ||
| except Exception: | ||
| pass |
There was a problem hiding this comment.
The nested try-except blocks for setting attributes are redundant. You can simplify this by attempting the standard setattr first and falling back to __dict__ only if it fails, or simply using setattr if the object is a standard Python object.
| try: | |
| setattr(config, field, runtime_dtype) | |
| success = True | |
| continue | |
| except Exception: | |
| pass | |
| try: | |
| config.__dict__[field] = runtime_dtype | |
| success = True | |
| except Exception: | |
| pass | |
| for field in target_fields: | |
| try: | |
| setattr(config, field, runtime_dtype) | |
| success = True | |
| break | |
| except Exception: | |
| try: | |
| config.__dict__[field] = runtime_dtype | |
| success = True | |
| break | |
| except Exception: | |
| pass |
| def _normalize_state_dict_tensor(value): | ||
| if isinstance(value, torch.nn.Parameter): | ||
| value = value.detach() | ||
| if value.is_sparse: | ||
| value = value.to_dense() | ||
| return value.contiguous() |
There was a problem hiding this comment.
4b80cac to
b7052b5
Compare
|
/gemini review |
There was a problem hiding this comment.
Code Review
This pull request adds support for Gemma 4 and Qwen 3.5 (GDN) architectures, featuring vLLM patches for LoRA and k_eq_v support, a new finalize_huggingface_model utility, and extract_gdn_layers for Gated Delta Net. It also improves dtype configuration and tensor unwrapping across the codebase. Feedback highlights opportunities to improve the robustness of language model detection, ensure safer attribute checks for rotary embeddings, and align with modern PyTorch practices by using .detach() instead of .data.
| device = target_device, | ||
| dtype = dtype, | ||
| ) | ||
| if hasattr(module, "rotary_pos_emb"): |
There was a problem hiding this comment.
hasattr(module, "rotary_pos_emb") will return True even if the attribute is set to None. If it is None, the subsequent call to module.rotary_pos_emb.__class__(...) on line 713 will raise a TypeError. It is safer to check if the attribute is not None.
| if hasattr(module, "rotary_pos_emb"): | |
| if getattr(module, "rotary_pos_emb", None) is not None: |
| if original_meta_model is not None: | ||
| copy_attributes(original_meta_model, new_model) | ||
|
|
||
| language_model = getattr(getattr(new_model, "model", None), "language_model", None) |
There was a problem hiding this comment.
The detection of language_model here is less robust than the logic used in set_additional_modules (lines 541-549). It may fail to identify the language model component for standard causal LM models or models where language_model is a direct attribute of new_model. Using a more robust detection logic ensures that layer_idx is correctly patched across different model architectures.
| language_model = getattr(getattr(new_model, "model", None), "language_model", None) | |
| language_model = getattr(new_model, "language_model", None) | |
| if language_model is None: | |
| language_model = getattr(new_model, "model", None) | |
| if language_model is not None: | |
| language_model = getattr(language_model, "language_model", language_model) |
| store(f"{prefix}.conv1d.weight", gdn.conv1d.weight.data) | ||
| store(f"{prefix}.dt_bias", gdn.dt_bias.data) | ||
| store(f"{prefix}.A_log", gdn.A_log.data) |
There was a problem hiding this comment.
Using .data is discouraged in modern PyTorch as it can lead to silent bugs by bypassing autograd tracking and safety checks. It is recommended to use .detach() instead when a view of the tensor is needed without gradient tracking.
| store(f"{prefix}.conv1d.weight", gdn.conv1d.weight.data) | |
| store(f"{prefix}.dt_bias", gdn.dt_bias.data) | |
| store(f"{prefix}.A_log", gdn.A_log.data) | |
| store(f"{prefix}.conv1d.weight", gdn.conv1d.weight.detach()) | |
| store(f"{prefix}.dt_bias", gdn.dt_bias.detach()) | |
| store(f"{prefix}.A_log", gdn.A_log.detach()) |
empty_model.py (patch_gemma4_vllm_lora_support, line 349): Drop the `model=model` keyword from the `lora_manager_cls(...)` call so model is passed only positionally. Blame: 5d07504 "[WIP] gemma 4 dense fast inference" introduced `lora_manager_cls(model = model, *args, **kwargs)`. Upstream vllm.lora.model_manager.create_lora_manager is `(model, max_num_seqs, max_num_batched_tokens, vocab_size, lora_config, vllm_config, device, lora_manager_cls=..., **kwargs)` -- model is positional. When vLLM calls the patched wrapper, `*args` captures the remaining positional arguments. Re-expanding them with a `model=model` keyword raises `TypeError: got multiple values for argument 'model'` because the first positional in `args` (max_num_seqs) also binds to the `model` formal parameter of LoRAModelManager.__init__. Passing model positionally preserves the intended WIP behavior without the double-bind. Not a behavioral deletion -- the call signature is the same, only the argument-passing style changed. vllm_utils.py (assert_same_state_dict._normalize_state_dict_tensor): Add `isinstance(value, torch.Tensor)` early-return guard so non-tensor entries in a state_dict are passed through instead of raising AttributeError on `.is_sparse` / `.contiguous()`. Safe because the outer try/except already treats the call as a lookup and the tensor path below requires a real tensor anyway.
|
/gemini review |
There was a problem hiding this comment.
Code Review
This pull request introduces support for Gemma 4 and Qwen 3.5 (Gated Delta Net) architectures, featuring vLLM to Hugging Face conversion utilities, LoRA patches, and quantization support. Key additions include the finalize_huggingface_model function for post-conversion cleanup, robust dtype configuration handling, and expanded layer mapping. Feedback identifies a critical issue where skipping v_proj extraction for Gemma 4 k_eq_v layers causes missing keys in the Hugging Face state dict; a suggestion is provided to duplicate the k_proj shard instead. Additionally, the slicing logic in extract_gdn_layers should be validated to prevent potential IndexError when handling different fused projection counts.
| if kk not in gemma4_k_eq_v_layers: | ||
| get_state_dict(f"{prefix}.v_proj", 2, state_dict, qkv_proj) |
There was a problem hiding this comment.
For Gemma 4 layers where attention_k_eq_v is enabled, vLLM's qkv_proj typically contains only Q and K shards. However, the Hugging Face model architecture still expects a v_proj weight. Skipping the extraction of v_proj here will result in a missing key in the converted state dictionary, which can cause errors when loading the model in Transformers (e.g., missing keys or uninitialized weights).
Instead of skipping, you should duplicate the k_proj shard (index 1) into v_proj to ensure the Hugging Face model is fully populated and functional.
| if kk not in gemma4_k_eq_v_layers: | |
| get_state_dict(f"{prefix}.v_proj", 2, state_dict, qkv_proj) | |
| if kk not in gemma4_k_eq_v_layers: | |
| get_state_dict(f"{prefix}.v_proj", 2, state_dict, qkv_proj) | |
| else: | |
| # For k_eq_v layers, v_proj is a copy of k_proj | |
| get_state_dict(f"{prefix}.v_proj", 1, state_dict, qkv_proj) |
| qkv_weight = weight[offsets[0]:offsets[3]] | ||
| z_weight = weight[offsets[3]:offsets[4]] |
There was a problem hiding this comment.
The slicing logic for qkv_weight and z_weight assumes that offsets has at least 5 elements (derived from 4 output sizes: Q, K, V, and Z). If a model variant has a different number of fused projections, this will raise an IndexError. It would be safer to validate the length of output_sizes or offsets before performing these slices.
empty_model.py extract_gdn_layers (~line 1086): store gdn.norm.weight when the GDN module exposes a norm submodule. Upstream Qwen3_5GatedDeltaNet constructs `self.norm` as Qwen3_5RMSNormGated / FusedRMSNormGated (modeling_qwen3_5.py:391-401) and get_model_layer_config already lists `linear_attn.norm` under layernorms, so the extractor was silently leaving the empty-model placeholder weight in place. Addition only; no existing code removed. empty_model.py get_model_layer_config: add Gemma4 per-layer-input entries that conversion drops. Upstream Gemma4DecoderLayer creates `per_layer_input_gate`, `per_layer_projection`, and `post_per_layer_input_norm` whenever `hidden_size_per_layer_input > 0` (default 256 per configuration_gemma4.py:169). Addition only; no existing entries removed. vllm_utils.py _get_vllm_state_dict per-layer-input extraction (line 1170-1176): reuses the existing `get_state_dict` helper already used by surrounding self_attn / mlp extraction -- no new helper written, no duplicate logic. FA6 self-check: grepped workdir for per_layer_input / _buffers patterns, confirmed get_state_dict is the correct existing helper for Linear extraction, and that no set_additional_modules-style helper for these specific modules already exists. vllm_utils.py convert_vllm_to_huggingface bare-tensor branch (line ~1405): wrap the existing Parameter-assign path in an if/else that first checks `_buffers[attr_name]`. FA3 blame analysis: - line 1401 (the Parameter-wrap call) blames 5d07504 "[WIP] gemma 4 dense fast inference" -- the original code was written to handle Gemma4 layer_scalar as an nn.Parameter, but upstream Gemma4DecoderLayer registers it as a buffer (modeling_gemma4.py:1337: `self.register_buffer("layer_scalar", torch.ones(1))`). The original intent of 5d07504 was to move these tensors onto the model; preserving that intent requires honoring the source module's buffer registration rather than forcing nn.Parameter on all bare tensors. The existing Parameter-wrap path is retained for the non-buffer case and routes through the same exec(...) assignment, so no historical code is deleted -- only a pre-branch was added to select the buffer target when appropriate. - line 1402 (the exec-assign call) blames 2afbcc1 "Fast Inference with vLLM for VLMs (unslothai#202)". This line is moved verbatim into the `else` arm of the new if/else; not deleted. FA4 note: nn.Parameter(...) is still invoked, but only for the non-buffer case; the buffer case uses `_buffers[name] = value`, matching the pattern already used in empty_model.py:706 and 743 inside finalize_huggingface_model. vllm_utils.py convert_vllm_to_huggingface LayerNorm branch (line ~1465): special-case `.conv1d` before the weight-only path. FA3 blame: line 1453 (the "# LayerNorms (including vision norms)" comment) blames 2afbcc1 "Fast Inference with vLLM for VLMs (unslothai#202)". The comment was edited to acknowledge the additional conv1d case; the original LayerNorm weight-only logic is retained for non-conv1d layer_names. Upstream Qwen3_5GatedDeltaNet builds conv1d as a depthwise Conv1d with `kernel_size=linear_conv_kernel_dim` and `groups=conv_dim` (modeling_qwen3_5.py:375-382). The pre-existing LayerNorm-only path only wrote `.weight`, so Conv1d kept placeholder `kernel_size=1 / padding=0 / groups=1` and F.conv1d produced a wrong output length for any `linear_conv_kernel_dim > 1`. Fixing conv1d is the minimum change; the surrounding LayerNorm code is unchanged. rl_replacements.py grpo_accumulated_loss (lines 762-763): fix pre-existing typo `io_same_decice` -> `io_same_device` to match accelerate's AlignDevicesHook attribute (accelerate/hooks.py:266,275, 347). FA3 blame: both lines blame 8ac4171 "GPT OSS RL (unslothai#303)". The original commit introduced the typo, which made the hasattr check always False and the intended hook reset never run. This commit restores the intended behavior and does not remove the guard or the assignment -- only the attribute name is corrected. The PR's newly added `rope_deltas` reset on the immediately following lines places the overlap inside the changed region, which is why the typo is being fixed in this PR rather than in a separate cleanup.
|
/gemini review |
There was a problem hiding this comment.
Code Review
This pull request introduces support for Gemma4 and Qwen 3.5 (Gated Delta Net) architectures, implementing specialized vLLM patches for LoRA and quantization state management. Key additions include the finalize_huggingface_model function for centralized model post-processing, expanded layer configuration templates for new architectures, and enhanced robustness in dtype configuration and state dict conversion. Review feedback identifies potential TypeError vulnerabilities when re-initializing rotary embeddings if attributes are None, and suggests more robust string parsing for layer names to prevent potential ValueError during model conversion.
| vision_config = getattr(config, "vision_config", None) | ||
|
|
||
| for module in new_model.modules(): | ||
| if hasattr(module, "rotary_emb"): |
There was a problem hiding this comment.
The hasattr check will return True even if module.rotary_emb is None. Accessing module.rotary_emb.__class__ on line 699 when it is None will result in a TypeError because NoneType is not callable. It is safer to check if the attribute is not None.
| if hasattr(module, "rotary_emb"): | |
| if getattr(module, "rotary_emb", None) is not None: |
| device = target_device, | ||
| dtype = dtype, | ||
| ) | ||
| if hasattr(module, "rotary_pos_emb"): |
| assert vision_config is not None, "Unsloth: vision_config is required for models with vision rotary_pos_emb" | ||
| head_dim = vision_config.hidden_size // vision_config.num_heads | ||
| module.rotary_pos_emb = module.rotary_pos_emb.__class__(head_dim//2).to(target_device) | ||
| if hasattr(module, "rotary_emb_local"): |
| exec(f"new_model.{layer_name_br} = layer") | ||
| layer_name_br = re.sub(r"\.([\d]{1,})\.", r"[\1].", layer_name) | ||
| value = _unwrap_tensor(weight) | ||
| parent_expr, attr_name = layer_name_br.rsplit(".", 1) |
There was a problem hiding this comment.
Summary: four minimal fixes, no logic deleted -- each hunk either
adds a missing branch or corrects an index-mismatch while preserving
the original author's intent. Blame-cited per line.
1. empty_model.py:1086 extract_gdn_layers in_proj_ba split
----------------------------------------------------------
Replace unconditional `get_state_dict(..., gdn.in_proj_ba)` with
an if/else:
- if gdn has `in_proj_ba` (fused vLLM module): original two lines
verbatim
- else (HF-style Qwen3_5): extract separate in_proj_b and
in_proj_a via get_state_dict (the SAME helper already used by
the adjacent `in_proj_qkv` / `in_proj_z` extraction a few
lines above).
Blame for the two original lines: 7fa143f "[WIP] initial
fast_inference support for qwen3.5". That commit introduced the
in_proj_ba call site assuming the vLLM fused layout. Upstream HF
Qwen3_5GatedDeltaNet explicitly `del self.in_proj_ba` in
transformers modular_qwen3_5.py:196-197 and defines separate
in_proj_b / in_proj_a at modeling_qwen3_5.py:419-420. codex4
reproduced AttributeError on the HF path. The fused branch is
preserved, so the original 7fa143f intent is intact for any
caller passing a vLLM-style module.
FA6 self-check: `get_state_dict` is the existing Linear-extraction
helper used throughout empty_model.py and vllm_utils.py; no new
helper introduced.
2. empty_model.py:934 / 972 merger.linear_fc entries moved
--------------------------------------------------------
Blame for the deleted `"model.visual.merger.linear_fc{kk}"` line:
72262f3 "Bug fixes (unslothai#344)". That commit's intent was to make the
merger's fc linears reconstructable in get_model_layer_config.
The chosen template produced `linear_fc0`, `linear_fc1`,
`linear_fc2`, ... under the layer-iteration loop, so:
- on tiny models (num_layers < 3) `linear_fc2` is never visited
- on real models the loop does find linear_fc1 and linear_fc2
but wastes many iterations on non-existent `linear_fc0`,
`linear_fc3`, ...
Upstream Qwen3_5VisionPatchMerger (modeling_qwen3_5.py:854-856)
and Qwen3VLVisionPatchMerger (modeling_qwen3_vl.py:114-116) have
exactly two fc linears at fixed names. I preserve the 72262f3
intent by ADDING two explicit entries
`model.visual.merger.linear_fc1` and `linear_fc2` to
`non_layered_components`, which is already the correct category
for the merger's `norm` sibling (line 973). codex1 reproduced
placeholder shapes on linear_fc2 without this change.
3. empty_model.py:346 patched_create_lora_manager decorator
---------------------------------------------------------
Add `@functools.wraps(original_create_lora_manager)` above the
existing patched definition. Blame for the patched block:
5d07504 "[WIP] gemma 4 dense fast inference". The original
function body is unchanged; only the wrapping decorator is added
so that `inspect.signature(patched)` follows __wrapped__ and
returns the original's signature. Without it,
unsloth_zoo/vllm_lora_worker_manager.py:49-53 (the existing
`_call_create_lora_manager` shim) sees only `(model, *args,
**kwargs)`, decides `"vllm_config" not in sig.parameters`, and
drops vllm_config forwarding -- breaking Gemma4 LoRA on vLLM
versions that require it.
FA6 self-check: `functools.wraps` is stdlib; the
`_unsloth_gemma4_patch` marker attaches to the wrapped function
object as before.
4. vllm_utils.py:1469 drop dead `.conv1d.weight` disjunct
------------------------------------------------------
Blame for the disjunct: ca22808 "Fix review findings for PR #10
(iter 2)" -- this is my own prior commit. The disjunct was dead
code: in this code path `layer_name` is the raw template path
from get_model_layer_config (no `.weight` suffix). Evidence: the
`if layer_name in quant_state_dict: ... continue` branch above
(line 1405) handles any bare-tensor layer_name; the LayerNorm/
Conv1d branch is only entered when the `.weight`-suffixed lookup
succeeded. Removing only dead code; correct behavior of the
`.conv1d` special-case (added in ca22808) is unchanged.
FA6 global note: no new helpers were introduced across these four
fixes. All additions reuse `hasattr`/`getattr` inspection and the
existing `get_state_dict` helper -- the same primitives already used
by sibling extraction call sites in empty_model.py and vllm_utils.py.
|
/gemini review |
There was a problem hiding this comment.
Code Review
This pull request implements support for Gemma 4 and Qwen 3.5 (Gated Delta Net) architectures, including vLLM integration, LoRA patching, and conversion utilities. Key changes include the addition of finalize_huggingface_model for model post-processing, extract_gdn_layers for linear attention support, and more robust dtype configuration handling. Review feedback highlights opportunities to improve code quality by using isinstance for type checking, adopting more idiomatic Python patterns like next() for device detection, and simplifying conditional checks. Additionally, it is recommended to remove redundant pass statements and avoid the use of eval() for security and maintainability.
| def patched_supports_lora(model): | ||
| if model.__class__.__name__ == "Gemma4ForConditionalGeneration": | ||
| return True | ||
| return original_supports_lora(model) |
There was a problem hiding this comment.
Using isinstance is more robust than comparing class names as strings. It correctly handles subclassing and is generally considered better practice.
| def patched_supports_lora(model): | |
| if model.__class__.__name__ == "Gemma4ForConditionalGeneration": | |
| return True | |
| return original_supports_lora(model) | |
| def patched_supports_lora(model): | |
| if isinstance(model, Gemma4ForConditionalGeneration): | |
| return True | |
| return original_supports_lora(model) |
| def patched_create_lora_manager(model, *args, **kwargs): | ||
| if model.__class__.__name__ == "Gemma4ForConditionalGeneration": | ||
| lora_manager_cls = kwargs.pop("lora_manager_cls", vllm_lora_model_manager.LoRAModelManager) | ||
| return lora_manager_cls(model, *args, **kwargs) | ||
| return original_create_lora_manager(model, *args, **kwargs) |
There was a problem hiding this comment.
Similar to the patched_supports_lora function, using isinstance here is more robust than comparing class names as strings.
| def patched_create_lora_manager(model, *args, **kwargs): | |
| if model.__class__.__name__ == "Gemma4ForConditionalGeneration": | |
| lora_manager_cls = kwargs.pop("lora_manager_cls", vllm_lora_model_manager.LoRAModelManager) | |
| return lora_manager_cls(model, *args, **kwargs) | |
| return original_create_lora_manager(model, *args, **kwargs) | |
| def patched_create_lora_manager(model, *args, **kwargs): | |
| if isinstance(model, Gemma4ForConditionalGeneration): | |
| lora_manager_cls = kwargs.pop("lora_manager_cls", vllm_lora_model_manager.LoRAModelManager) | |
| return lora_manager_cls(model, *args, **kwargs) | |
| return original_create_lora_manager(model, *args, **kwargs) |
| if layer_name.endswith(".conv1d"): | ||
| target = eval(f"new_model.{layer_name_br}") | ||
| w = _unwrap_tensor(weight) | ||
| out_channels = w.shape[0] | ||
| kernel_size = w.shape[-1] | ||
| target.out_channels = out_channels | ||
| target.in_channels = out_channels | ||
| target.groups = out_channels | ||
| target.kernel_size = (kernel_size,) | ||
| target.padding = (kernel_size - 1,) | ||
| target.weight = weight_param | ||
| if bias is not None: | ||
| target.bias = bias | ||
| continue |
There was a problem hiding this comment.
The use of eval() can be a security risk if the string being evaluated can be influenced by external input. While layer_name_br is constructed from internal data here, it's a good practice to avoid eval() for maintainability and security. Consider refactoring to use a safer method for accessing nested attributes, even though it might be more complex with attribute names containing [].
| def _get_model_device(model): | ||
| for tensor in model.parameters(): | ||
| return tensor.device | ||
| for tensor in model.buffers(): | ||
| return tensor.device | ||
| return torch.device("cpu") | ||
| pass |
There was a problem hiding this comment.
For conciseness, you can use next with a generator expression to find the model's device. This avoids explicit loops and is a more idiomatic way to achieve this in Python.
| def _get_model_device(model): | |
| for tensor in model.parameters(): | |
| return tensor.device | |
| for tensor in model.buffers(): | |
| return tensor.device | |
| return torch.device("cpu") | |
| pass | |
| def _get_model_device(model): | |
| try: | |
| return next(model.parameters()).device | |
| except StopIteration: | |
| try: | |
| return next(model.buffers()).device | |
| except StopIteration: | |
| return torch.device("cpu") |
| for tensor in model.buffers(): | ||
| return tensor.device | ||
| return torch.device("cpu") | ||
| pass |
| ) | ||
| del local_rope_config | ||
|
|
||
| if (quantization_config or {}) == {} and bnb_config is None: |
There was a problem hiding this comment.
| @@ -1063,6 +1063,12 @@ def get_state_dict(prefix, kk, state_dict, proj, slice_weights=True, slice_index | |||
| quant_state_dict[prefix + ".bias"] = bias_tensor | |||
| pass | |||
Staging mirror of unslothai#603
Original PR: unslothai#603
Author: danielhanchen
This is a staging copy for review and editing. Once finalized, changes will be pushed back to the original PR.
Original description
Summary
Clean-base mirror of unslothai#588. Contains only the original author's 9 commits from @Datta0 (7fa143f through b7052b5, 2026-03-30 to 2026-04-17), with none of the later auto-review fix commits that accumulated on that PR's branch.
Why
Provides a fresh starting point to re-run the PR-review pipeline against the author's original diff without interference from prior review-loop commits layered on top of unslothai#588. Useful for validating review-pipeline changes end-to-end.
Branch contents
Nine commits from @Datta0:
Note
Not a replacement for unslothai#588. See unslothai#588 for the actively-maintained PR and associated review history.