Skip to content

[Qwen 3.5][gemma4] Qwen35 and Gemma 4 fast inference (clean-base mirror of #588)#10

Open
danielhanchen wants to merge 4 commits into
mainfrom
pr-603-head
Open

[Qwen 3.5][gemma4] Qwen35 and Gemma 4 fast inference (clean-base mirror of #588)#10
danielhanchen wants to merge 4 commits into
mainfrom
pr-603-head

Conversation

@danielhanchen

Copy link
Copy Markdown
Owner

Staging mirror of unslothai#603

Original PR: unslothai#603
Author: danielhanchen

This is a staging copy for review and editing. Once finalized, changes will be pushed back to the original PR.


Original description

Summary

Clean-base mirror of unslothai#588. Contains only the original author's 9 commits from @Datta0 (7fa143f through b7052b5, 2026-03-30 to 2026-04-17), with none of the later auto-review fix commits that accumulated on that PR's branch.

Why

Provides a fresh starting point to re-run the PR-review pipeline against the author's original diff without interference from prior review-loop commits layered on top of unslothai#588. Useful for validating review-pipeline changes end-to-end.

Branch contents

Nine commits from @Datta0:

  • 7fa143f [WIP] initial fast_inference support for qwen3.5
  • 31c1bc3 [WIP] fixes for rope deltas
  • 112f1b5 cleanup
  • a31dee8 fix lm_head detection and remove moe
  • 5d07504 [WIP] gemma 4 dense fast inference
  • a85a4f4 Fix dtype setting
  • d2f7983 Merge remote-tracking branch 'origin/main' into qwen35-shared-final
  • 43ed24f Fix gemma4 load on vllm 0.19.0
  • b7052b5 fix bnb loader for gemam4

Note

Not a replacement for unslothai#588. See unslothai#588 for the actively-maintained PR and associated review history.

@danielhanchen

Copy link
Copy Markdown
Owner Author

/gemini review

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces support for Gemma 4 and Qwen 3.5 (GDN) architectures, adding specialized vLLM patches for LoRA and k_eq_v weight handling and a model finalization utility. It also expands layer configuration templates and improves dtype management. Reviewer feedback recommends simplifying helper functions like _get_model_device and _normalize_state_dict_tensor, and streamlining attribute-setting logic to improve code readability.

Comment on lines +313 to +319
def _get_model_device(model):
for tensor in model.parameters():
return tensor.device
for tensor in model.buffers():
return tensor.device
return torch.device("cpu")
pass

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

low

The _get_model_device function can be simplified using next() with a generator expression to avoid explicit loops and improve readability.

def _get_model_device(model):
    return next((t.device for t in model.parameters()), next((t.device for t in model.buffers()), torch.device("cpu")))
pass

Comment thread unsloth_zoo/hf_utils.py
Comment on lines +66 to +77
try:
setattr(config, field, runtime_dtype)
success = True
continue
except Exception:
pass

try:
config.__dict__[field] = runtime_dtype
success = True
except Exception:
pass

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

low

The nested try-except blocks for setting attributes are redundant. You can simplify this by attempting the standard setattr first and falling back to __dict__ only if it fails, or simply using setattr if the object is a standard Python object.

Suggested change
try:
setattr(config, field, runtime_dtype)
success = True
continue
except Exception:
pass
try:
config.__dict__[field] = runtime_dtype
success = True
except Exception:
pass
for field in target_fields:
try:
setattr(config, field, runtime_dtype)
success = True
break
except Exception:
try:
config.__dict__[field] = runtime_dtype
success = True
break
except Exception:
pass

Comment thread unsloth_zoo/vllm_utils.py
Comment on lines +1216 to +1221
def _normalize_state_dict_tensor(value):
if isinstance(value, torch.nn.Parameter):
value = value.detach()
if value.is_sparse:
value = value.to_dense()
return value.contiguous()

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

low

The _normalize_state_dict_tensor function can be simplified by chaining the operations directly.

    def _normalize_state_dict_tensor(value):
        if isinstance(value, torch.nn.Parameter):
            value = value.detach()
        return value.to_dense().contiguous() if value.is_sparse else value.contiguous()

@danielhanchen

Copy link
Copy Markdown
Owner Author

/gemini review

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request adds support for Gemma 4 and Qwen 3.5 (GDN) architectures, featuring vLLM patches for LoRA and k_eq_v support, a new finalize_huggingface_model utility, and extract_gdn_layers for Gated Delta Net. It also improves dtype configuration and tensor unwrapping across the codebase. Feedback highlights opportunities to improve the robustness of language model detection, ensure safer attribute checks for rotary embeddings, and align with modern PyTorch practices by using .detach() instead of .data.

device = target_device,
dtype = dtype,
)
if hasattr(module, "rotary_pos_emb"):

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

hasattr(module, "rotary_pos_emb") will return True even if the attribute is set to None. If it is None, the subsequent call to module.rotary_pos_emb.__class__(...) on line 713 will raise a TypeError. It is safer to check if the attribute is not None.

Suggested change
if hasattr(module, "rotary_pos_emb"):
if getattr(module, "rotary_pos_emb", None) is not None:

if original_meta_model is not None:
copy_attributes(original_meta_model, new_model)

language_model = getattr(getattr(new_model, "model", None), "language_model", None)

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The detection of language_model here is less robust than the logic used in set_additional_modules (lines 541-549). It may fail to identify the language model component for standard causal LM models or models where language_model is a direct attribute of new_model. Using a more robust detection logic ensures that layer_idx is correctly patched across different model architectures.

Suggested change
language_model = getattr(getattr(new_model, "model", None), "language_model", None)
language_model = getattr(new_model, "language_model", None)
if language_model is None:
language_model = getattr(new_model, "model", None)
if language_model is not None:
language_model = getattr(language_model, "language_model", language_model)

Comment on lines +1083 to +1085
store(f"{prefix}.conv1d.weight", gdn.conv1d.weight.data)
store(f"{prefix}.dt_bias", gdn.dt_bias.data)
store(f"{prefix}.A_log", gdn.A_log.data)

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

low

Using .data is discouraged in modern PyTorch as it can lead to silent bugs by bypassing autograd tracking and safety checks. It is recommended to use .detach() instead when a view of the tensor is needed without gradient tracking.

Suggested change
store(f"{prefix}.conv1d.weight", gdn.conv1d.weight.data)
store(f"{prefix}.dt_bias", gdn.dt_bias.data)
store(f"{prefix}.A_log", gdn.A_log.data)
store(f"{prefix}.conv1d.weight", gdn.conv1d.weight.detach())
store(f"{prefix}.dt_bias", gdn.dt_bias.detach())
store(f"{prefix}.A_log", gdn.A_log.detach())

empty_model.py (patch_gemma4_vllm_lora_support, line 349):
Drop the `model=model` keyword from the `lora_manager_cls(...)` call
so model is passed only positionally.

Blame: 5d07504 "[WIP] gemma 4 dense fast inference" introduced
`lora_manager_cls(model = model, *args, **kwargs)`. Upstream
vllm.lora.model_manager.create_lora_manager is
`(model, max_num_seqs, max_num_batched_tokens, vocab_size, lora_config,
vllm_config, device, lora_manager_cls=..., **kwargs)` -- model is
positional. When vLLM calls the patched wrapper, `*args` captures the
remaining positional arguments. Re-expanding them with a `model=model`
keyword raises `TypeError: got multiple values for argument 'model'`
because the first positional in `args` (max_num_seqs) also binds to
the `model` formal parameter of LoRAModelManager.__init__. Passing
model positionally preserves the intended WIP behavior without the
double-bind. Not a behavioral deletion -- the call signature is the
same, only the argument-passing style changed.

vllm_utils.py (assert_same_state_dict._normalize_state_dict_tensor):
Add `isinstance(value, torch.Tensor)` early-return guard so non-tensor
entries in a state_dict are passed through instead of raising
AttributeError on `.is_sparse` / `.contiguous()`. Safe because the
outer try/except already treats the call as a lookup and the tensor
path below requires a real tensor anyway.
@danielhanchen

Copy link
Copy Markdown
Owner Author

/gemini review

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces support for Gemma 4 and Qwen 3.5 (Gated Delta Net) architectures, featuring vLLM to Hugging Face conversion utilities, LoRA patches, and quantization support. Key additions include the finalize_huggingface_model function for post-conversion cleanup, robust dtype configuration handling, and expanded layer mapping. Feedback identifies a critical issue where skipping v_proj extraction for Gemma 4 k_eq_v layers causes missing keys in the Hugging Face state dict; a suggestion is provided to duplicate the k_proj shard instead. Additionally, the slicing logic in extract_gdn_layers should be validated to prevent potential IndexError when handling different fused projection counts.

Comment thread unsloth_zoo/vllm_utils.py
Comment on lines +1116 to +1117
if kk not in gemma4_k_eq_v_layers:
get_state_dict(f"{prefix}.v_proj", 2, state_dict, qkv_proj)

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

For Gemma 4 layers where attention_k_eq_v is enabled, vLLM's qkv_proj typically contains only Q and K shards. However, the Hugging Face model architecture still expects a v_proj weight. Skipping the extraction of v_proj here will result in a missing key in the converted state dictionary, which can cause errors when loading the model in Transformers (e.g., missing keys or uninitialized weights).

Instead of skipping, you should duplicate the k_proj shard (index 1) into v_proj to ensure the Hugging Face model is fully populated and functional.

Suggested change
if kk not in gemma4_k_eq_v_layers:
get_state_dict(f"{prefix}.v_proj", 2, state_dict, qkv_proj)
if kk not in gemma4_k_eq_v_layers:
get_state_dict(f"{prefix}.v_proj", 2, state_dict, qkv_proj)
else:
# For k_eq_v layers, v_proj is a copy of k_proj
get_state_dict(f"{prefix}.v_proj", 1, state_dict, qkv_proj)

Comment on lines +1057 to +1058
qkv_weight = weight[offsets[0]:offsets[3]]
z_weight = weight[offsets[3]:offsets[4]]

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The slicing logic for qkv_weight and z_weight assumes that offsets has at least 5 elements (derived from 4 output sizes: Q, K, V, and Z). If a model variant has a different number of fused projections, this will raise an IndexError. It would be safer to validate the length of output_sizes or offsets before performing these slices.

empty_model.py extract_gdn_layers (~line 1086): store gdn.norm.weight
when the GDN module exposes a norm submodule. Upstream
Qwen3_5GatedDeltaNet constructs `self.norm` as Qwen3_5RMSNormGated /
FusedRMSNormGated (modeling_qwen3_5.py:391-401) and
get_model_layer_config already lists `linear_attn.norm` under
layernorms, so the extractor was silently leaving the empty-model
placeholder weight in place. Addition only; no existing code removed.

empty_model.py get_model_layer_config: add Gemma4 per-layer-input
entries that conversion drops. Upstream Gemma4DecoderLayer creates
`per_layer_input_gate`, `per_layer_projection`, and
`post_per_layer_input_norm` whenever `hidden_size_per_layer_input > 0`
(default 256 per configuration_gemma4.py:169). Addition only; no
existing entries removed.

vllm_utils.py _get_vllm_state_dict per-layer-input extraction
(line 1170-1176): reuses the existing `get_state_dict` helper already
used by surrounding self_attn / mlp extraction -- no new helper
written, no duplicate logic. FA6 self-check: grepped workdir for
per_layer_input / _buffers patterns, confirmed get_state_dict is the
correct existing helper for Linear extraction, and that no
set_additional_modules-style helper for these specific modules already
exists.

vllm_utils.py convert_vllm_to_huggingface bare-tensor branch
(line ~1405): wrap the existing Parameter-assign path in an
if/else that first checks `_buffers[attr_name]`. FA3 blame analysis:

  - line 1401 (the Parameter-wrap call) blames 5d07504
    "[WIP] gemma 4 dense fast inference" -- the original code was
    written to handle Gemma4 layer_scalar as an nn.Parameter, but
    upstream Gemma4DecoderLayer registers it as a buffer
    (modeling_gemma4.py:1337: `self.register_buffer("layer_scalar",
    torch.ones(1))`). The original intent of 5d07504 was to move
    these tensors onto the model; preserving that intent requires
    honoring the source module's buffer registration rather than
    forcing nn.Parameter on all bare tensors. The existing
    Parameter-wrap path is retained for the non-buffer case and
    routes through the same exec(...) assignment, so no historical
    code is deleted -- only a pre-branch was added to select the
    buffer target when appropriate.
  - line 1402 (the exec-assign call) blames 2afbcc1 "Fast Inference
    with vLLM for VLMs (unslothai#202)". This line is moved verbatim into the
    `else` arm of the new if/else; not deleted.

  FA4 note: nn.Parameter(...) is still invoked, but only for the
  non-buffer case; the buffer case uses `_buffers[name] = value`,
  matching the pattern already used in empty_model.py:706 and 743
  inside finalize_huggingface_model.

vllm_utils.py convert_vllm_to_huggingface LayerNorm branch
(line ~1465): special-case `.conv1d` before the weight-only path.
FA3 blame: line 1453 (the "# LayerNorms (including vision norms)"
comment) blames 2afbcc1 "Fast Inference with vLLM for VLMs (unslothai#202)".
The comment was edited to acknowledge the additional conv1d case; the
original LayerNorm weight-only logic is retained for non-conv1d
layer_names. Upstream Qwen3_5GatedDeltaNet builds conv1d as a
depthwise Conv1d with `kernel_size=linear_conv_kernel_dim` and
`groups=conv_dim` (modeling_qwen3_5.py:375-382). The pre-existing
LayerNorm-only path only wrote `.weight`, so Conv1d kept placeholder
`kernel_size=1 / padding=0 / groups=1` and F.conv1d produced a wrong
output length for any `linear_conv_kernel_dim > 1`. Fixing conv1d is
the minimum change; the surrounding LayerNorm code is unchanged.

rl_replacements.py grpo_accumulated_loss (lines 762-763): fix
pre-existing typo `io_same_decice` -> `io_same_device` to match
accelerate's AlignDevicesHook attribute (accelerate/hooks.py:266,275,
347). FA3 blame: both lines blame 8ac4171 "GPT OSS RL (unslothai#303)". The
original commit introduced the typo, which made the hasattr check
always False and the intended hook reset never run. This commit
restores the intended behavior and does not remove the guard or the
assignment -- only the attribute name is corrected. The PR's newly
added `rope_deltas` reset on the immediately following lines places
the overlap inside the changed region, which is why the typo is being
fixed in this PR rather than in a separate cleanup.
@danielhanchen

Copy link
Copy Markdown
Owner Author

/gemini review

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces support for Gemma4 and Qwen 3.5 (Gated Delta Net) architectures, implementing specialized vLLM patches for LoRA and quantization state management. Key additions include the finalize_huggingface_model function for centralized model post-processing, expanded layer configuration templates for new architectures, and enhanced robustness in dtype configuration and state dict conversion. Review feedback identifies potential TypeError vulnerabilities when re-initializing rotary embeddings if attributes are None, and suggests more robust string parsing for layer names to prevent potential ValueError during model conversion.

vision_config = getattr(config, "vision_config", None)

for module in new_model.modules():
if hasattr(module, "rotary_emb"):

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The hasattr check will return True even if module.rotary_emb is None. Accessing module.rotary_emb.__class__ on line 699 when it is None will result in a TypeError because NoneType is not callable. It is safer to check if the attribute is not None.

Suggested change
if hasattr(module, "rotary_emb"):
if getattr(module, "rotary_emb", None) is not None:

device = target_device,
dtype = dtype,
)
if hasattr(module, "rotary_pos_emb"):

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Similar to rotary_emb, module.rotary_pos_emb should be checked for None before accessing its __class__ on line 713 to avoid a TypeError.

Suggested change
if hasattr(module, "rotary_pos_emb"):
if getattr(module, "rotary_pos_emb", None) is not None:

assert vision_config is not None, "Unsloth: vision_config is required for models with vision rotary_pos_emb"
head_dim = vision_config.hidden_size // vision_config.num_heads
module.rotary_pos_emb = module.rotary_pos_emb.__class__(head_dim//2).to(target_device)
if hasattr(module, "rotary_emb_local"):

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Similar to rotary_emb, module.rotary_emb_local should be checked for None before accessing its __class__ on line 718 to avoid a TypeError.

Suggested change
if hasattr(module, "rotary_emb_local"):
if getattr(module, "rotary_emb_local", None) is not None:

Comment thread unsloth_zoo/vllm_utils.py
exec(f"new_model.{layer_name_br} = layer")
layer_name_br = re.sub(r"\.([\d]{1,})\.", r"[\1].", layer_name)
value = _unwrap_tensor(weight)
parent_expr, attr_name = layer_name_br.rsplit(".", 1)

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

low

If layer_name_br does not contain a dot (e.g., if a top-level attribute is added to layer_names), rsplit(".", 1) will raise a ValueError. While current templates include dots, adding a guard or using a safer way to split would be more robust for future changes.

Summary: four minimal fixes, no logic deleted -- each hunk either
adds a missing branch or corrects an index-mismatch while preserving
the original author's intent. Blame-cited per line.

1. empty_model.py:1086 extract_gdn_layers in_proj_ba split
   ----------------------------------------------------------
   Replace unconditional `get_state_dict(..., gdn.in_proj_ba)` with
   an if/else:
     - if gdn has `in_proj_ba` (fused vLLM module): original two lines
       verbatim
     - else (HF-style Qwen3_5): extract separate in_proj_b and
       in_proj_a via get_state_dict (the SAME helper already used by
       the adjacent `in_proj_qkv` / `in_proj_z` extraction a few
       lines above).
   Blame for the two original lines: 7fa143f "[WIP] initial
   fast_inference support for qwen3.5". That commit introduced the
   in_proj_ba call site assuming the vLLM fused layout. Upstream HF
   Qwen3_5GatedDeltaNet explicitly `del self.in_proj_ba` in
   transformers modular_qwen3_5.py:196-197 and defines separate
   in_proj_b / in_proj_a at modeling_qwen3_5.py:419-420. codex4
   reproduced AttributeError on the HF path. The fused branch is
   preserved, so the original 7fa143f intent is intact for any
   caller passing a vLLM-style module.
   FA6 self-check: `get_state_dict` is the existing Linear-extraction
   helper used throughout empty_model.py and vllm_utils.py; no new
   helper introduced.

2. empty_model.py:934 / 972 merger.linear_fc entries moved
   --------------------------------------------------------
   Blame for the deleted `"model.visual.merger.linear_fc{kk}"` line:
   72262f3 "Bug fixes (unslothai#344)". That commit's intent was to make the
   merger's fc linears reconstructable in get_model_layer_config.
   The chosen template produced `linear_fc0`, `linear_fc1`,
   `linear_fc2`, ... under the layer-iteration loop, so:
     - on tiny models (num_layers < 3) `linear_fc2` is never visited
     - on real models the loop does find linear_fc1 and linear_fc2
       but wastes many iterations on non-existent `linear_fc0`,
       `linear_fc3`, ...
   Upstream Qwen3_5VisionPatchMerger (modeling_qwen3_5.py:854-856)
   and Qwen3VLVisionPatchMerger (modeling_qwen3_vl.py:114-116) have
   exactly two fc linears at fixed names. I preserve the 72262f3
   intent by ADDING two explicit entries
   `model.visual.merger.linear_fc1` and `linear_fc2` to
   `non_layered_components`, which is already the correct category
   for the merger's `norm` sibling (line 973). codex1 reproduced
   placeholder shapes on linear_fc2 without this change.

3. empty_model.py:346 patched_create_lora_manager decorator
   ---------------------------------------------------------
   Add `@functools.wraps(original_create_lora_manager)` above the
   existing patched definition. Blame for the patched block:
   5d07504 "[WIP] gemma 4 dense fast inference". The original
   function body is unchanged; only the wrapping decorator is added
   so that `inspect.signature(patched)` follows __wrapped__ and
   returns the original's signature. Without it,
   unsloth_zoo/vllm_lora_worker_manager.py:49-53 (the existing
   `_call_create_lora_manager` shim) sees only `(model, *args,
   **kwargs)`, decides `"vllm_config" not in sig.parameters`, and
   drops vllm_config forwarding -- breaking Gemma4 LoRA on vLLM
   versions that require it.
   FA6 self-check: `functools.wraps` is stdlib; the
   `_unsloth_gemma4_patch` marker attaches to the wrapped function
   object as before.

4. vllm_utils.py:1469 drop dead `.conv1d.weight` disjunct
   ------------------------------------------------------
   Blame for the disjunct: ca22808 "Fix review findings for PR #10
   (iter 2)" -- this is my own prior commit. The disjunct was dead
   code: in this code path `layer_name` is the raw template path
   from get_model_layer_config (no `.weight` suffix). Evidence: the
   `if layer_name in quant_state_dict: ... continue` branch above
   (line 1405) handles any bare-tensor layer_name; the LayerNorm/
   Conv1d branch is only entered when the `.weight`-suffixed lookup
   succeeded. Removing only dead code; correct behavior of the
   `.conv1d` special-case (added in ca22808) is unchanged.

FA6 global note: no new helpers were introduced across these four
fixes. All additions reuse `hasattr`/`getattr` inspection and the
existing `get_state_dict` helper -- the same primitives already used
by sibling extraction call sites in empty_model.py and vllm_utils.py.
@danielhanchen

Copy link
Copy Markdown
Owner Author

/gemini review

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request implements support for Gemma 4 and Qwen 3.5 (Gated Delta Net) architectures, including vLLM integration, LoRA patching, and conversion utilities. Key changes include the addition of finalize_huggingface_model for model post-processing, extract_gdn_layers for linear attention support, and more robust dtype configuration handling. Review feedback highlights opportunities to improve code quality by using isinstance for type checking, adopting more idiomatic Python patterns like next() for device detection, and simplifying conditional checks. Additionally, it is recommended to remove redundant pass statements and avoid the use of eval() for security and maintainability.

Comment on lines +335 to +338
def patched_supports_lora(model):
if model.__class__.__name__ == "Gemma4ForConditionalGeneration":
return True
return original_supports_lora(model)

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Using isinstance is more robust than comparing class names as strings. It correctly handles subclassing and is generally considered better practice.

Suggested change
def patched_supports_lora(model):
if model.__class__.__name__ == "Gemma4ForConditionalGeneration":
return True
return original_supports_lora(model)
def patched_supports_lora(model):
if isinstance(model, Gemma4ForConditionalGeneration):
return True
return original_supports_lora(model)

Comment on lines +348 to +352
def patched_create_lora_manager(model, *args, **kwargs):
if model.__class__.__name__ == "Gemma4ForConditionalGeneration":
lora_manager_cls = kwargs.pop("lora_manager_cls", vllm_lora_model_manager.LoRAModelManager)
return lora_manager_cls(model, *args, **kwargs)
return original_create_lora_manager(model, *args, **kwargs)

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Similar to the patched_supports_lora function, using isinstance here is more robust than comparing class names as strings.

Suggested change
def patched_create_lora_manager(model, *args, **kwargs):
if model.__class__.__name__ == "Gemma4ForConditionalGeneration":
lora_manager_cls = kwargs.pop("lora_manager_cls", vllm_lora_model_manager.LoRAModelManager)
return lora_manager_cls(model, *args, **kwargs)
return original_create_lora_manager(model, *args, **kwargs)
def patched_create_lora_manager(model, *args, **kwargs):
if isinstance(model, Gemma4ForConditionalGeneration):
lora_manager_cls = kwargs.pop("lora_manager_cls", vllm_lora_model_manager.LoRAModelManager)
return lora_manager_cls(model, *args, **kwargs)
return original_create_lora_manager(model, *args, **kwargs)

Comment thread unsloth_zoo/vllm_utils.py
Comment on lines +1469 to +1482
if layer_name.endswith(".conv1d"):
target = eval(f"new_model.{layer_name_br}")
w = _unwrap_tensor(weight)
out_channels = w.shape[0]
kernel_size = w.shape[-1]
target.out_channels = out_channels
target.in_channels = out_channels
target.groups = out_channels
target.kernel_size = (kernel_size,)
target.padding = (kernel_size - 1,)
target.weight = weight_param
if bias is not None:
target.bias = bias
continue

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The use of eval() can be a security risk if the string being evaluated can be influenced by external input. While layer_name_br is constructed from internal data here, it's a good practice to avoid eval() for maintainability and security. Consider refactoring to use a safer method for accessing nested attributes, even though it might be more complex with attribute names containing [].

Comment on lines +314 to +320
def _get_model_device(model):
for tensor in model.parameters():
return tensor.device
for tensor in model.buffers():
return tensor.device
return torch.device("cpu")
pass

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

low

For conciseness, you can use next with a generator expression to find the model's device. This avoids explicit loops and is a more idiomatic way to achieve this in Python.

Suggested change
def _get_model_device(model):
for tensor in model.parameters():
return tensor.device
for tensor in model.buffers():
return tensor.device
return torch.device("cpu")
pass
def _get_model_device(model):
try:
return next(model.parameters()).device
except StopIteration:
try:
return next(model.buffers()).device
except StopIteration:
return torch.device("cpu")

for tensor in model.buffers():
return tensor.device
return torch.device("cpu")
pass

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

low

The pass statement here is redundant and can be removed for cleaner code.

)
del local_rope_config

if (quantization_config or {}) == {} and bnb_config is None:

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

low

This condition can be simplified. Checking for not quantization_config is more Pythonic for checking if a dictionary is empty or None.

Suggested change
if (quantization_config or {}) == {} and bnb_config is None:
if not quantization_config and bnb_config is None:

Comment thread unsloth_zoo/vllm_utils.py
@@ -1063,6 +1063,12 @@ def get_state_dict(prefix, kk, state_dict, proj, slice_weights=True, slice_index
quant_state_dict[prefix + ".bias"] = bias_tensor
pass

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

low

The pass statement here is redundant and can be removed for cleaner code.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant