Fast Inference with vLLM for VLMs #202
Merged
vLLM merged them recently. ref jeejeelee/vllm@a71e476
* Wake up when generating if needed
* Patch vLLM only when standby is enabled
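A minimal sketch of the sleep/wake pattern these two bullets describe, assuming vLLM's sleep-mode API (`enable_sleep_mode`, `LLM.sleep`, `LLM.wake_up`). The `standby` flag, the wrapper function, and the model name are illustrative stand-ins, not the actual Unsloth patch:

```python
from vllm import LLM, SamplingParams

standby = True  # hypothetical flag standing in for Unsloth's standby option

llm = LLM(model="Qwen/Qwen2-VL-2B-Instruct", enable_sleep_mode=True)
params = SamplingParams(max_tokens=64)

def generate(prompts):
    if standby:
        llm.wake_up()       # restore weights / KV cache before generating
    outputs = llm.generate(prompts, params)
    if standby:
        llm.sleep(level=1)  # release GPU memory while idle
    return outputs
```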
| "model.vision_model.pre_tile_positional_embedding.embedding", | ||
| "model.vision_model.gated_positional_embedding", | ||
| "model.vision_model.post_tile_positional_embedding.embedding", | ||
| "model.vision_model.pre_tile_positional_embedding.gate" |
danielhanchen (Member) requested changes on Sep 7, 2025, leaving a comment:

Nice work - final points
Force-pushed from acc1ed7 to b0a1081.
logger.info(f"Unsloth: Enabling vLLM standby mode")

def __init__(self):
danielhanchen added a commit to danielhanchen/unsloth-zoo-staging-1 that referenced this pull request on Apr 20, 2026:
* empty_model.py extract_gdn_layers (~line 1086): store gdn.norm.weight when the GDN module exposes a norm submodule. Upstream Qwen3_5GatedDeltaNet constructs `self.norm` as Qwen3_5RMSNormGated / FusedRMSNormGated (modeling_qwen3_5.py:391-401), and get_model_layer_config already lists `linear_attn.norm` under layernorms, so the extractor was silently leaving the empty-model placeholder weight in place. Addition only; no existing code removed.
* empty_model.py get_model_layer_config: add the Gemma4 per-layer-input entries that conversion drops. Upstream Gemma4DecoderLayer creates `per_layer_input_gate`, `per_layer_projection`, and `post_per_layer_input_norm` whenever `hidden_size_per_layer_input > 0` (default 256 per configuration_gemma4.py:169). Addition only; no existing entries removed.
* vllm_utils.py _get_vllm_state_dict per-layer-input extraction (lines 1170-1176): reuses the existing `get_state_dict` helper already used by the surrounding self_attn / mlp extraction; no new helper written, no duplicate logic. FA6 self-check: grepped the workdir for per_layer_input / _buffers patterns, confirmed get_state_dict is the correct existing helper for Linear extraction, and confirmed that no set_additional_modules-style helper for these specific modules already exists.
* vllm_utils.py convert_vllm_to_huggingface bare-tensor branch (~line 1405): wrap the existing Parameter-assign path in an if/else that first checks `_buffers[attr_name]` (see the sketches after this list). FA3 blame analysis:
  - Line 1401 (the Parameter-wrap call) blames 5d07504 "[WIP] gemma 4 dense fast inference". The original code was written to handle Gemma4 layer_scalar as an nn.Parameter, but upstream Gemma4DecoderLayer registers it as a buffer (modeling_gemma4.py:1337: `self.register_buffer("layer_scalar", torch.ones(1))`). The intent of 5d07504 was to move these tensors onto the model; preserving that intent requires honoring the source module's buffer registration rather than forcing nn.Parameter on all bare tensors. The existing Parameter-wrap path is retained for the non-buffer case and routes through the same exec(...) assignment, so no historical code is deleted; only a pre-branch was added to select the buffer target when appropriate.
  - Line 1402 (the exec-assign call) blames 2afbcc1 "Fast Inference with vLLM for VLMs (unslothai#202)". This line is moved verbatim into the `else` arm of the new if/else, not deleted.
  - FA4 note: nn.Parameter(...) is still invoked, but only for the non-buffer case; the buffer case uses `_buffers[name] = value`, matching the pattern already used in empty_model.py:706 and 743 inside finalize_huggingface_model.
* vllm_utils.py convert_vllm_to_huggingface LayerNorm branch (~line 1465): special-case `.conv1d` before the weight-only path. FA3 blame: line 1453 (the "# LayerNorms (including vision norms)" comment) blames 2afbcc1 "Fast Inference with vLLM for VLMs (unslothai#202)". The comment was edited to acknowledge the additional conv1d case; the original LayerNorm weight-only logic is retained for non-conv1d layer_names. Upstream Qwen3_5GatedDeltaNet builds conv1d as a depthwise Conv1d with `kernel_size=linear_conv_kernel_dim` and `groups=conv_dim` (modeling_qwen3_5.py:375-382). The pre-existing LayerNorm-only path wrote only `.weight`, so the Conv1d kept its placeholder `kernel_size=1 / padding=0 / groups=1` and F.conv1d produced the wrong output length for any `linear_conv_kernel_dim > 1`. Fixing conv1d is the minimum change; the surrounding LayerNorm code is unchanged.
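The bare-tensor branch lends itself to a short illustration. This is a minimal sketch, not the actual vllm_utils.py code: `assign_bare_tensor` is a hypothetical name, and the real code routes the Parameter case through an exec(...) assignment rather than setattr:

```python
import torch
import torch.nn as nn

def assign_bare_tensor(module: nn.Module, attr_name: str, value: torch.Tensor):
    # Buffer case: upstream modules like Gemma4DecoderLayer register
    # tensors such as layer_scalar via register_buffer(...), so the
    # converted tensor must go back in as a buffer, not a Parameter.
    if attr_name in module._buffers:
        module._buffers[attr_name] = value
    else:
        # Non-buffer case: the pre-existing path, wrapping the bare
        # tensor in nn.Parameter before attaching it to the module.
        setattr(module, attr_name, nn.Parameter(value, requires_grad=False))

class Demo(nn.Module):
    def __init__(self):
        super().__init__()
        self.register_buffer("layer_scalar", torch.ones(1))

m = Demo()
assign_bare_tensor(m, "layer_scalar", torch.full((1,), 2.0))
assert not isinstance(m.layer_scalar, nn.Parameter)  # stayed a buffer
```

And a sketch of the conv1d special case, again with illustrative names: a depthwise Conv1d cannot be restored by writing `.weight` alone, because kernel_size / groups / padding live on the module, so the placeholder is rebuilt from the real weight's shape. The `padding = kernel_size - 1` choice mirrors the causal-conv convention in the upstream GDN code and is an assumption here:

```python
def restore_depthwise_conv1d(parent, name, weight, bias=None):
    # Depthwise Conv1d weights are (conv_dim, 1, kernel_size); the
    # empty-model placeholder had kernel_size=1 / groups=1, which
    # F.conv1d turns into the wrong output length.
    conv_dim, _, kernel_size = weight.shape
    conv = nn.Conv1d(conv_dim, conv_dim, kernel_size=kernel_size,
                     groups=conv_dim, padding=kernel_size - 1,
                     bias=bias is not None)
    with torch.no_grad():
        conv.weight.copy_(weight)
        if bias is not None:
            conv.bias.copy_(bias)
    setattr(parent, name, conv)
```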
* rl_replacements.py grpo_accumulated_loss (lines 762-763): fix the pre-existing typo `io_same_decice` -> `io_same_device` to match accelerate's AlignDevicesHook attribute (accelerate/hooks.py:266, 275, 347). FA3 blame: both lines blame 8ac4171 "GPT OSS RL (unslothai#303)". That commit introduced the typo, which made the hasattr check always False, so the intended hook reset never ran. This commit restores the intended behavior without removing the guard or the assignment; only the attribute name is corrected. The PR's newly added `rope_deltas` reset on the immediately following lines places the overlap inside the changed region, which is why the typo is being fixed in this PR rather than in a separate cleanup.
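The mechanics of the typo are easy to demonstrate. A self-contained sketch with a stand-in class for accelerate's AlignDevicesHook (the real hook does expose io_same_device; the reset value here is illustrative):

```python
class FakeAlignDevicesHook:
    """Stand-in for accelerate.hooks.AlignDevicesHook, for illustration."""
    def __init__(self):
        self.io_same_device = True

hook = FakeAlignDevicesHook()

# The guard from 8ac4171: always False because of the misspelling,
# so the reset below it never executed.
if hasattr(hook, "io_same_decice"):
    hook.io_same_decice = False  # dead code

# Corrected guard: matches the hook's real attribute name.
if hasattr(hook, "io_same_device"):
    hook.io_same_device = False  # value shown for illustration only
```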
Needed for: unslothai/unsloth#2975
TODO: (mostly in a follow-up PR later)