Fast Inference with vLLM for VLMs #202

Merged
danielhanchen merged 73 commits into unslothai:main from Datta0:vlm_fast_infer
Sep 16, 2025

Conversation

@Datta0 (Collaborator) commented Jul 15, 2025

Needed for: unslothai/unsloth#2975

TODO (mostly in a follow-up PR):

  • Improve memory calculation for vLLM, especially for VLMs
  • Maybe use a logger instead of print statements to make the whole process cleaner

Comment thread unsloth_zoo/empty_model.py
"model.vision_model.pre_tile_positional_embedding.embedding",
"model.vision_model.gated_positional_embedding",
"model.vision_model.post_tile_positional_embedding.embedding",
"model.vision_model.pre_tile_positional_embedding.gate"
Member

Very good work

Comment thread unsloth_zoo/vllm_utils.py Outdated
@danielhanchen (Member) left a comment

Nice work - final points

Comment thread unsloth_zoo/vllm_utils.py

logger.info(f"Unsloth: Enabling vLLM standby mode")

def __init__(self):
Member

Here

@danielhanchen merged commit 2afbcc1 into unslothai:main Sep 16, 2025
danielhanchen added a commit to danielhanchen/unsloth-zoo-staging-1 that referenced this pull request Apr 20, 2026
empty_model.py extract_gdn_layers (~line 1086): store gdn.norm.weight
when the GDN module exposes a norm submodule. Upstream
Qwen3_5GatedDeltaNet constructs `self.norm` as Qwen3_5RMSNormGated /
FusedRMSNormGated (modeling_qwen3_5.py:391-401) and
get_model_layer_config already lists `linear_attn.norm` under
layernorms, so the extractor was silently leaving the empty-model
placeholder weight in place. Addition only; no existing code removed.
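
A minimal sketch of the shape of this addition, assuming the extractor
collects tensors into a dict keyed by dotted parameter names; `gdn`,
`prefix`, and `layers` are illustrative stand-ins, not the real locals:

```python
# Illustrative sketch: `gdn` is the GatedDeltaNet module being extracted,
# `prefix` its dotted layer name, e.g. "model.layers.3.linear_attn".
norm = getattr(gdn, "norm", None)
if norm is not None and hasattr(norm, "weight"):
    # Overwrite the empty-model placeholder so the real RMSNormGated
    # weight is carried over instead of staying randomly initialized.
    layers[f"{prefix}.norm.weight"] = norm.weight
```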

empty_model.py get_model_layer_config: add the Gemma4 per-layer-input
entries that conversion was dropping. Upstream Gemma4DecoderLayer creates
`per_layer_input_gate`, `per_layer_projection`, and
`post_per_layer_input_norm` whenever `hidden_size_per_layer_input > 0`
(default 256 per configuration_gemma4.py:169). Addition only; no
existing entries removed.
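
Sketched below is the shape of those entries; the container layout of
get_model_layer_config is an assumption here, and only the three module
names and the layernorms category come from the text above:

```python
# Illustrative sketch: dict layout assumed; module names match upstream
# Gemma4DecoderLayer when hidden_size_per_layer_input > 0.
gemma4_per_layer_input = {
    "linears":    ["per_layer_input_gate", "per_layer_projection"],
    "layernorms": ["post_per_layer_input_norm"],  # weight-only, like other norms
}
```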

vllm_utils.py _get_vllm_state_dict per-layer-input extraction
(lines 1170-1176): reuses the existing `get_state_dict` helper already
used by the surrounding self_attn / mlp extraction -- no new helper
written, no duplicate logic. FA6 self-check: grepped the workdir for
per_layer_input / _buffers patterns, confirmed get_state_dict is the
correct existing helper for Linear extraction, and that no
set_additional_modules-style helper for these specific modules already
exists.
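
A hedged sketch of that extraction; `get_state_dict` is the real
pre-existing helper, but its exact signature and the `layer` / `prefix`
locals are assumptions:

```python
# Illustrative sketch: reuse the same helper the neighboring
# self_attn / mlp extraction calls, one invocation per Linear.
for name in ("per_layer_input_gate", "per_layer_projection"):
    submodule = getattr(layer, name, None)
    if submodule is not None:
        get_state_dict(f"{prefix}.{name}", submodule)  # signature assumed
```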

vllm_utils.py convert_vllm_to_huggingface bare-tensor branch
(line ~1405): wrap the existing Parameter-assign path in an
if/else that first checks `_buffers[attr_name]`. FA3 blame analysis:

  - line 1401 (the Parameter-wrap call) blames 5d07504
    "[WIP] gemma 4 dense fast inference" -- the original code was
    written to handle Gemma4 layer_scalar as an nn.Parameter, but
    upstream Gemma4DecoderLayer registers it as a buffer
    (modeling_gemma4.py:1337: `self.register_buffer("layer_scalar",
    torch.ones(1))`). The original intent of 5d07504 was to move
    these tensors onto the model; preserving that intent requires
    honoring the source module's buffer registration rather than
    forcing nn.Parameter on all bare tensors. The existing
    Parameter-wrap path is retained for the non-buffer case and
    routes through the same exec(...) assignment, so no historical
    code is deleted -- only a preceding branch was added to select the
    buffer target when appropriate.
  - line 1402 (the exec-assign call) blames 2afbcc1 "Fast Inference
    with vLLM for VLMs (unslothai#202)". This line is moved verbatim into the
    `else` arm of the new if/else; not deleted.

  FA4 note: nn.Parameter(...) is still invoked, but only for the
  non-buffer case; the buffer case uses `_buffers[name] = value`,
  matching the pattern already used in empty_model.py:706 and 743
  inside finalize_huggingface_model.
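
In outline, the new branch looks like the following; `module`,
`attr_name`, and `tensor` are stand-in names, and the exec(...)
assignment from 2afbcc1 is paraphrased here with setattr:

```python
import torch

# Illustrative sketch of the bare-tensor branch in
# convert_vllm_to_huggingface.
if attr_name in module._buffers:
    # Upstream registered this tensor as a buffer, e.g. Gemma4's
    # register_buffer("layer_scalar", torch.ones(1)), so honor that
    # registration rather than forcing an nn.Parameter.
    module._buffers[attr_name] = tensor
else:
    # Pre-existing path, retained for the non-buffer case; the real
    # code routes through an exec(...) assignment, not setattr.
    setattr(module, attr_name, torch.nn.Parameter(tensor))
```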

vllm_utils.py convert_vllm_to_huggingface LayerNorm branch
(line ~1465): special-case `.conv1d` before the weight-only path.
FA3 blame: line 1453 (the "# LayerNorms (including vision norms)"
comment) blames 2afbcc1 "Fast Inference with vLLM for VLMs (unslothai#202)".
The comment was edited to acknowledge the additional conv1d case; the
original LayerNorm weight-only logic is retained for non-conv1d
layer_names. Upstream Qwen3_5GatedDeltaNet builds conv1d as a
depthwise Conv1d with `kernel_size=linear_conv_kernel_dim` and
`groups=conv_dim` (modeling_qwen3_5.py:375-382). The pre-existing
LayerNorm-only path only wrote `.weight`, so Conv1d kept placeholder
`kernel_size=1 / padding=0 / groups=1` and F.conv1d produced a wrong
output length for any `linear_conv_kernel_dim > 1`. Fixing conv1d is
the minimum change; the surrounding LayerNorm code is unchanged.
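
A sketch of the special case, assuming the fix must correct both the
placeholder's weights and its Conv1d metadata; `get_tensor` and the
causal padding value are assumptions, not the PR's actual code:

```python
import torch

# Illustrative sketch: repair the placeholder depthwise Conv1d whose
# real weight has shape (conv_dim, 1, kernel_size).
if layer_name.endswith(".conv1d"):
    weight = get_tensor(f"{layer_name}.weight")  # lookup helper assumed
    conv_dim, _, kernel_size = weight.shape
    module.weight = torch.nn.Parameter(weight)
    # Patch the metadata the kernel_size=1 / groups=1 / padding=0
    # placeholder got wrong, so F.conv1d sees the depthwise shape.
    module.kernel_size = (kernel_size,)
    module.groups = conv_dim
    module.in_channels = module.out_channels = conv_dim
    module.padding = (kernel_size - 1,)  # causal padding: assumption
    bias = get_tensor(f"{layer_name}.bias")  # may be None for bias=False
    if bias is not None:
        module.bias = torch.nn.Parameter(bias)
else:
    ...  # pre-existing LayerNorm weight-only path, unchanged
```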

rl_replacements.py grpo_accumulated_loss (lines 762-763): fix
pre-existing typo `io_same_decice` -> `io_same_device` to match
accelerate's AlignDevicesHook attribute (accelerate/hooks.py:266,275,
347). FA3 blame: both lines blame 8ac4171 "GPT OSS RL (unslothai#303)". The
original commit introduced the typo, which made the hasattr check
always False, so the intended hook reset never ran. This commit
restores the intended behavior and does not remove the guard or the
assignment -- only the attribute name is corrected. The PR's newly
added `rope_deltas` reset on the immediately following lines places
the overlap inside the changed region, which is why the typo is being
fixed in this PR rather than in a separate cleanup.
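
The corrected guard, in outline; the reset value and surrounding code
are paraphrased, and only the attribute rename is the actual change:

```python
# Illustrative sketch: `hook` is accelerate's AlignDevicesHook on the
# module. With the old "io_same_decice" spelling, this hasattr was
# always False and the body never ran.
if hasattr(hook, "io_same_device"):
    hook.io_same_device = False  # reset value paraphrased, not verbatim
```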