Skip to content

[Qwen 3.5][gemma4] Qwen35 and Gemma 4 fast inference #588

Open
Datta0 wants to merge 40 commits into
unslothai:mainfrom
Datta0:qwen35-shared-final
Open

[Qwen 3.5][gemma4] Qwen35 and Gemma 4 fast inference #588
Datta0 wants to merge 40 commits into
unslothai:mainfrom
Datta0:qwen35-shared-final

Conversation

@Datta0
Copy link
Copy Markdown
Collaborator

@Datta0 Datta0 commented Apr 10, 2026

  • fast inference support for Qwen 3.5 with vLLM :)
    Tried to make sure the changes are minimal so when we detect linear_attn, we hand it off to a separate function

Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request adds support for Qwen 3.5 and Gated Delta Net (GDN) architectures, including new configuration attributes, layer extraction logic for GDN, and enhanced vLLM-to-Hugging Face conversion. Key changes include the implementation of extract_gdn_layers, improved lm_head detection, and clearing rope_deltas during RL training. Review feedback suggests using .detach() instead of .data for safer tensor extraction and notes a semantic naming issue where conv1d was added to a list of normalization layers.

Comment thread unsloth_zoo/empty_model.py Outdated
Comment on lines +839 to +846
conv_w = gdn.conv1d.weight.data
state_dict[f"{prefix}.conv1d.weight"] = conv_w
quant_state_dict[f"{prefix}.conv1d.weight"] = conv_w

state_dict[f"{prefix}.dt_bias"] = gdn.dt_bias.data
quant_state_dict[f"{prefix}.dt_bias"] = gdn.dt_bias.data
state_dict[f"{prefix}.A_log"] = gdn.A_log.data
quant_state_dict[f"{prefix}.A_log"] = gdn.A_log.data
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Using .data is generally discouraged in modern PyTorch as it can bypass safety checks and lead to unexpected behavior. It is recommended to use .detach() instead to obtain a tensor that does not track gradients, which is sufficient for state dict extraction.

Suggested change
conv_w = gdn.conv1d.weight.data
state_dict[f"{prefix}.conv1d.weight"] = conv_w
quant_state_dict[f"{prefix}.conv1d.weight"] = conv_w
state_dict[f"{prefix}.dt_bias"] = gdn.dt_bias.data
quant_state_dict[f"{prefix}.dt_bias"] = gdn.dt_bias.data
state_dict[f"{prefix}.A_log"] = gdn.A_log.data
quant_state_dict[f"{prefix}.A_log"] = gdn.A_log.data
conv_w = gdn.conv1d.weight.detach()
state_dict[f"{prefix}.conv1d.weight"] = conv_w
quant_state_dict[f"{prefix}.conv1d.weight"] = conv_w
state_dict[f"{prefix}.dt_bias"] = gdn.dt_bias.detach()
quant_state_dict[f"{prefix}.dt_bias"] = gdn.dt_bias.detach()
state_dict[f"{prefix}.A_log"] = gdn.A_log.detach()
quant_state_dict[f"{prefix}.A_log"] = gdn.A_log.detach()

Comment thread unsloth_zoo/vllm_utils.py Outdated
"norm1", # Qwen2.5-VL vision encoder
"norm2", # Qwen2.5-VL vision encoder
"norm",
"conv1d",
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Adding "conv1d" to the layernorm_names list is semantically incorrect as it is a convolutional layer, not a normalization layer. While the current implementation reuses the direct parameter assignment logic in the else block (line 1421) which works for conv1d.weight, this naming is misleading and could lead to maintenance issues. Consider renaming the list to something more generic like simple_parameter_modules or adding a comment to explain this exception.

@Datta0 Datta0 changed the title [Qwen 3.5] Qwen35 fast inference [Qwen 3.5][gemma4] Qwen35 and Gemma 4 fast inference Apr 15, 2026
@Datta0 Datta0 marked this pull request as ready for review April 17, 2026 10:44
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: b7052b55a5

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread unsloth_zoo/empty_model.py Outdated
Comment on lines +1057 to +1060
qkv_weight = weight[offsets[0]:offsets[3]]
z_weight = weight[offsets[3]:offsets[4]]
store(f"{prefix}.in_proj_qkv.weight", qkv_weight)
store(f"{prefix}.in_proj_z.weight", z_weight)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Preserve quant states when splitting GDN fused qkvz

When linear_attn.in_proj_qkvz is present, this path slices proj.weight into in_proj_qkv/in_proj_z and stores only raw weights, but it never carries over per-shard bnb_quant_state. For 4-bit/BnB checkpoints this means these layers lose their quant metadata, and convert_vllm_to_huggingface will no longer take the Linear4bit branch for them, producing incorrect dense modules (often with uint8-backed weights) and broken inference/training numerics on Qwen3.5 GDN models.

Useful? React with 👍 / 👎.

Comment thread unsloth_zoo/empty_model.py Outdated
Comment on lines +367 to +369
if hasattr(
BitsAndBytesModelLoader._stack_quantization_states,
"_unsloth_gemma4_k_eq_v_patch",
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Guard Gemma4 k_eq_v patch for missing loader method

This accesses BitsAndBytesModelLoader._stack_quantization_states directly before verifying the attribute exists. Because load_vllm now calls this patch unconditionally for Gemma4 vision models, environments with a different vLLM loader API will raise AttributeError during model load (even when not using 4-bit quantization). Add a defensive getattr/version gate and no-op when the method is unavailable.

Useful? React with 👍 / 👎.

@danielhanchen
Copy link
Copy Markdown
Member

Possible duplicate of a trusted maintainer's PR. This PR looks like it solves the same underlying problem as unslothai/unsloth-zoo#580 by @danielhanchen (trusted maintainer).

Directly patches Gemma-4 temporary inference code in the same gemma4.py path and fixes the same runtime failure class.

Canonical PR summary: This PR fixes Gemma-4 inference and generation crashes by patching the temporary compatibility layer so zero num_kv_shared_layers builds caches correctly and KV-sharing models still work with use_cache=False. It does this with a config proxy and cache/forward wrappers that inject and then strip temporary caches in unsloth_zoo/temporary_patches/gemma4.py.

The auto-review is still running against this PR — reviewers will factor in the canonical above. If this PR is genuinely different, call out the delta in the review discussion so the maintainer can decide which to merge.

@danielhanchen danielhanchen added the auto-has-duplicate Pre-flight: similar to a trusted maintainer's PR label Apr 19, 2026
…e_huggingface_model

- patch_gemma4_vllm_lora_support: use functools.wraps on patched_create_lora_manager so
  _call_create_lora_manager's signature inspection still sees vllm_config; pass model
  positionally to lora_manager_cls to avoid "multiple values for 'model'".
- patch_gemma4_vllm_k_eq_v_support: also handle split k_proj/v_proj layout (current
  upstream Gemma4) by duplicating k quant-state to synthetic v entry; keep packed
  qkv_proj path as fallback.
- load_vllm: gate Gemma4 patches on enable_lora / use_bitsandbytes (not is_vision_model),
  so text-only Gemma4 + LoRA / BnB also works.
- extract_gdn_layers: derive qkvz offsets from gdn.key_dim/value_dim when
  ColumnParallelLinear has no output_sizes; manually split in_proj_ba into b/a instead
  of calling get_state_dict with kk=1 (IndexError); preserve BnB quant_state sidecars;
  handle FP8 weight_scale (not only weight_scale_inv) and dynamic/row-wise FP8;
  export linear_attn.norm.weight.
- finalize_huggingface_model: fix layer_idx for standard causal LMs (not only VLM path);
  rebuild Gemma4 vision rotary_emb from vision_config with fp32 buffers; guard
  rotary_pos_emb on vision_config availability; mirror language_model detection from
  set_additional_modules.
- get_model_layer_config: register Gemma4 per_layer_input_gate / per_layer_projection /
  post_per_layer_input_norm; add Qwen3.5 visual.merger.linear_fc1 / linear_fc2 and drop
  the broken linear_fc{kk} template.
- set_dtype_in_config (hf_utils): prefer the modern 'dtype' field; fall back to
  'torch_dtype' only when 'dtype' is absent, avoiding the deprecation warning on
  current transformers.
- vllm_utils state-dict loop: skip layer.mlp extraction for linear-attn-only layers
  (defensive) while still capturing layer_scalar.
- _normalize_state_dict_tensor: guard is_sparse behind isinstance(value, torch.Tensor)
  so non-tensor state-dict values pass through.
…n, finalize_huggingface_model

- patch_gemma4_vllm_lora_support: use functools.wraps on patched_create_lora_manager so
  _call_create_lora_manager's signature inspection still sees vllm_config; pass model
  positionally to lora_manager_cls to avoid "multiple values for 'model'".
- patch_gemma4_vllm_k_eq_v_support: also handle split k_proj/v_proj layout (current
  upstream Gemma4) by duplicating k quant-state to synthetic v entry; keep packed
  qkv_proj path as fallback.
- load_vllm: gate Gemma4 patches on enable_lora / use_bitsandbytes (not is_vision_model),
  so text-only Gemma4 + LoRA / BnB also works.
- extract_gdn_layers: derive qkvz offsets from gdn.key_dim/value_dim when
  ColumnParallelLinear has no output_sizes; manually split in_proj_ba into b/a instead
  of calling get_state_dict with kk=1 (IndexError); preserve BnB quant_state sidecars;
  handle FP8 weight_scale (not only weight_scale_inv) and dynamic/row-wise FP8;
  export linear_attn.norm.weight.
- finalize_huggingface_model: fix layer_idx for standard causal LMs (not only VLM path);
  rebuild Gemma4 vision rotary_emb from vision_config with fp32 buffers; guard
  rotary_pos_emb on vision_config availability; mirror language_model detection from
  set_additional_modules.
- get_model_layer_config: register Gemma4 per_layer_input_gate / per_layer_projection /
  post_per_layer_input_norm; add Qwen3.5 visual.merger.linear_fc1 / linear_fc2 and drop
  the broken linear_fc{kk} template.
- set_dtype_in_config (hf_utils): prefer the modern 'dtype' field; fall back to
  'torch_dtype' only when 'dtype' is absent, avoiding the deprecation warning on
  current transformers.
- vllm_utils state-dict loop: skip layer.mlp extraction for linear-attn-only layers
  (defensive) while still capturing layer_scalar.
- _normalize_state_dict_tensor: guard is_sparse behind isinstance(value, torch.Tensor)
  so non-tensor state-dict values pass through.
danielhanchen added a commit to shimmyshimmer/unsloth-zoo-staging-2 that referenced this pull request Apr 19, 2026
@danielhanchen
Copy link
Copy Markdown
Member

Auto-review verdict: Approved

PR #588 adds vLLM-to-HF fast inference support for Qwen3.5 (Gated Delta Net linear attention) and Gemma4 VLMs (LoRA + k_eq_v BnB patches), plus consolidates post-conversion rotary/device setup into a new finalize_huggingface_model helper. It meaningfully extends Unsloth's model coverage for two important next-gen architectures with correct conversion behavior after the review fixes.

Reason: All 20 accepted findings fixed on pr-588-head (commit e4f530c); 14 consolidated tests pass on fixed code and catch the pre-fix regressions.

@danielhanchen danielhanchen added the auto-approved Auto-review approved the PR label Apr 19, 2026
# Conflicts:
#	unsloth_zoo/empty_model.py
#	unsloth_zoo/vllm_utils.py
@danielhanchen danielhanchen removed the auto-reviewing Auto-review in progress label Apr 20, 2026
@danielhanchen
Copy link
Copy Markdown
Member

Auto-review verdict: Approved

PR adds vLLM-to-HF conversion support for Qwen 3.5 Gated Delta Net and Gemma4 (LoRA, k_eq_v BnB, per-layer inputs) plus a unified finalize_huggingface_model helper; it unblocks fast inference for both new model families and, with the applied fixes to rotary handling, GDN dequantize slicing, and lm_head lookup, is correct and ready to merge.

Reason: All real review findings were fixed; no unfixed regressions remain and 43/43 tests pass.

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 086aa52c82

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines +980 to +981
"model.visual.merger.linear_fc1",
"model.visual.merger.linear_fc2",
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Keep merger FCs on quant-aware conversion path

Moving model.visual.merger.linear_fc1/linear_fc2 into non_layered_components routes them through set_additional_modules, which assigns raw tensors as nn.Parameter and never consults *.weight.quant_state (see the generic assignment path in set_additional_modules). For prequantized 4-bit checkpoints, these FCs are therefore rebuilt as plain nn.Linear with packed uint8 weights instead of Linear4bit, which can trigger dtype/runtime errors or silently wrong numerics in Qwen3-VL conversion.

Useful? React with 👍 / 👎.

@danielhanchen
Copy link
Copy Markdown
Member

Possible duplicate of a trusted maintainer's PR. This PR looks like it solves the same underlying problem as unslothai/unsloth-zoo#580 by @danielhanchen (trusted maintainer).

Gemma4 inference crash fix on the same runtime path; target adds Gemma4 conversion and finalization support.

Canonical PR summary: This PR fixes Gemma-4 inference and generation crashes by patching the temporary compatibility layer so zero num_kv_shared_layers builds caches correctly and KV-sharing models still work with use_cache=False. It does this with a config proxy and cache/forward wrappers that inject and then strip temporary caches in unsloth_zoo/temporary_patches/gemma4.py.

Other related PRs:

The auto-review is still running against this PR — reviewers will factor in the canonical above. If this PR is genuinely different, call out the delta in the review discussion so the maintainer can decide which to merge.

@danielhanchen danielhanchen added the auto-reviewing Auto-review in progress label Apr 20, 2026
- vllm_utils.py: convert_vllm_to_huggingface parameter-list path regex now
  handles trailing-digit segments (e.g. `embed_tokens_per_layer.0`) via the
  same anchor-or-end pattern used below for `exec` assignment.
- empty_model.py: finalize_huggingface_model rotary reinit no longer
  swallows failures silently; the float32 buffer lift is skipped when the
  reinit raised, and the exception is logged so wrong-shape rotary state
  does not propagate.
- empty_model.py: patch_gemma4_vllm_lora_support guards both the
  `gemma4_mm` and `gemma4` imports, does not clobber a pre-existing
  `embedding_modules` registry, and delegates to the original
  `create_lora_manager` so vLLM shim kwargs (e.g. `vllm_config`) reach the
  manager constructor correctly.
- empty_model.py: patch_gemma4_vllm_k_eq_v_support uses getattr for
  `_stack_quantization_states` so absent / renamed private attrs do not
  crash, and adds `model.language_model` to the k/v name-prefix search so
  HF-style Gemma4 multimodal parameters match.
- empty_model.py + vllm_utils.py: both Gemma4 gates share a new
  `_is_gemma4_config` helper that accepts `gemma4` and the text-only
  `gemma4_text` model_type.
- vllm_utils.py: Gemma4 patch block runs after the BnB autodetect so that
  `-bnb-4bit` / `quant_method == bitsandbytes` models trigger the
  k_eq_v patch even when the caller did not pass use_bitsandbytes=True.
Adds 12 regression tests covering the iter-1 hardening (trailing-digit
regex path, rotary reinit success guard, _is_gemma4_config helper,
gemma4 gate migration, gemma4_mm import guard, private loader attr
guard, HF-style k_eq_v prefix, lora manager delegation, behavioral
no-op tests that stub missing vLLM modules). Updates
test_gemma4_lora_patch_preserves_signature_for_inspect and
test_gemma4_k_eq_v_set_hoists_constant_check to match the new source
shape.
# Conflicts:
#	unsloth_zoo/empty_model.py
#	unsloth_zoo/vllm_utils.py
@danielhanchen
Copy link
Copy Markdown
Member

Auto-review verdict: Approved

Adds vLLM <-> HF fast-inference support for Qwen 3.5 (Gated Delta Net linear-attention extraction) and Gemma 4 (per-layer input modules, k_eq_v attention, LoRA patch), plus a factored-out finalize_huggingface_model that fixes layer_idx, rotary reinit, and dtype propagation. Useful because it unlocks conversion/inference for two newly-supported architectures while the refactor cleans up rotary / dtype / lm_head paths that were already fragile.

Reason: All 9 accepted review findings fixed and verified by 55/55 passing tests; merge-back landed cleanly on upstream.

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: ba822fcb87

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread unsloth_zoo/empty_model.py Outdated
Comment on lines +783 to +787
rotary_emb = getattr(module, "rotary_emb", None)
if rotary_emb is None:
continue
for buffer_name, buffer in list(rotary_emb._buffers.items()):
if torch.is_tensor(buffer) and buffer.is_floating_point():
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Keep rotary_emb_local buffers in float32 after final cast

finalize_huggingface_model reinitializes rotary_emb_local earlier, but in the final non-quantized path it calls new_model.to(..., dtype=dtype) and then only restores float32 buffers for rotary_emb. Models that expose rotary_emb_local (local/global RoPE split variants) will therefore have local RoPE buffers downcast to bf16/fp16, which regresses RoPE precision versus the pre-cast state and can subtly change generation quality.

Useful? React with 👍 / 👎.

@danielhanchen
Copy link
Copy Markdown
Member

Auto-review verdict: Approved

Adds vLLM fast-inference support for Qwen3.5 (Gated Delta Net linear-attention extraction/rebuild) and Gemma4 (k_eq_v attention, per-layer input modules, LoRA + BnB vLLM patches), plus a centralized finalize_huggingface_model for dtype/rotary/layer_idx cleanup. Review iterations added defensive GDN quant-state handling, restored in_proj_z emission on the multi-quant path, routed merger.linear_fc1/fc2 through the quantized loop, and extended fp32 lift to rotary_emb_local and rotary_pos_emb.

Reason: All real issues found during review were fixed and verified by tests; no remaining regressions.

@Datta0
Copy link
Copy Markdown
Collaborator Author

Datta0 commented Apr 21, 2026

danielhanchen#5
Still have unaddressed comments from here

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 93f1143594

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread unsloth_zoo/vllm_utils.py
Comment on lines +1156 to +1157
if kk not in gemma4_k_eq_v_layers and kk not in gemma4_kv_shared_layers:
get_state_dict(f"{prefix}.v_proj", 2, state_dict, qkv_proj)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Extract v_proj weights for Gemma4 k_eq_v layers

Do not skip v_proj extraction for attention_k_eq_v layers here: Gemma4’s split-attention layout still exposes v_proj weights in many builds, and this guard drops them from quant_state_dict. During convert_vllm_to_huggingface, missing ...self_attn.v_proj.weight entries are silently skipped, so those layers keep the tiny empty-model placeholders instead of real weights, which corrupts converted Gemma4 checkpoints (especially full-attention layers with attention_k_eq_v=True).

Useful? React with 👍 / 👎.

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 1680006be3

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread unsloth_zoo/vllm_utils.py
Comment on lines +1092 to +1096
if _is_gemma4_config(config) and getattr(text_config, "attention_k_eq_v", False):
gemma4_k_eq_v_layers = {
kk
for kk, layer_type in enumerate(getattr(text_config, "layer_types", ()))
if layer_type == "full_attention"
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Guard Gemma4 layer_types iteration against None

When attention_k_eq_v is enabled, this comprehension iterates text_config.layer_types directly; on Gemma4 configs where layer_types is present but set to None (the documented optional/default form), enumerate(None) raises TypeError and aborts state-dict extraction. This makes conversion fail before any weights are exported for those checkpoints, so this path should coerce None to an empty/default iterable (for example getattr(..., None) or ()).

Useful? React with 👍 / 👎.


param_names = set(name for name, _ in model.named_parameters())
pairs = []
for layer_idx, layer_type in enumerate(getattr(text_config, "layer_types", ())):
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Handle None layer_types in Gemma4 k_eq_v loader patch

The Gemma4 bitsandbytes patch assumes text_config.layer_types is iterable whenever attention_k_eq_v is true; if that field is None, this loop throws TypeError during vLLM quant-state stacking, causing Gemma4 4-bit loading to fail before model initialization completes. The loop should defensively treat a None value as empty/default instead of iterating it directly.

Useful? React with 👍 / 👎.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

auto-approved Auto-review approved the PR auto-has-duplicate Pre-flight: similar to a trusted maintainer's PR

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants