Fast Inference with vLLM for VLMs #202

Merged
danielhanchen merged 73 commits into unslothai:main from Datta0:vlm_fast_infer
Sep 16, 2025

Conversation

@Datta0 (Collaborator) commented Jul 15, 2025

Needed for: unslothai/unsloth#2975

TODO (mostly in a follow-up PR):

  • Improve memory calculation for vLLM, especially for VLMs
  • Maybe use a logger instead of print statements to make the whole process cleaner

Comment thread unsloth_zoo/empty_model.py
"model.vision_model.pre_tile_positional_embedding.embedding",
"model.vision_model.gated_positional_embedding",
"model.vision_model.post_tile_positional_embedding.embedding",
"model.vision_model.pre_tile_positional_embedding.gate"
Member

Very good work

Comment thread unsloth_zoo/vllm_utils.py Outdated
@danielhanchen (Member) left a comment

Nice work - final points

Comment thread unsloth_zoo/vllm_utils.py

logger.info(f"Unsloth: Enabling vLLM standby mode")

def __init__(self):
Member

Here

@danielhanchen merged commit 2afbcc1 into unslothai:main Sep 16, 2025
danielhanchen added a commit to danielhanchen/unsloth-zoo-staging-1 that referenced this pull request Apr 20, 2026
empty_model.py extract_gdn_layers (~line 1086): store gdn.norm.weight
when the GDN module exposes a norm submodule. Upstream
Qwen3_5GatedDeltaNet constructs `self.norm` as Qwen3_5RMSNormGated /
FusedRMSNormGated (modeling_qwen3_5.py:391-401) and
get_model_layer_config already lists `linear_attn.norm` under
layernorms, so the extractor was silently leaving the empty-model
placeholder weight in place. Addition only; no existing code removed.
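
A minimal sketch of the shape of this addition, assuming the extractor
collects tensors into a dict keyed by dotted parameter names; `gdn`,
`prefix`, and `layers` are illustrative stand-ins, not the real locals:

```python
# Illustrative sketch: `gdn` is the GatedDeltaNet module being extracted,
# `prefix` its dotted layer name, e.g. "model.layers.3.linear_attn".
norm = getattr(gdn, "norm", None)
if norm is not None and hasattr(norm, "weight"):
    # Overwrite the empty-model placeholder so the real RMSNormGated
    # weight is carried over instead of staying randomly initialized.
    layers[f"{prefix}.norm.weight"] = norm.weight
```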

empty_model.py get_model_layer_config: add the Gemma4 per-layer-input
entries that conversion was dropping. Upstream Gemma4DecoderLayer creates
`per_layer_input_gate`, `per_layer_projection`, and
`post_per_layer_input_norm` whenever `hidden_size_per_layer_input > 0`
(default 256 per configuration_gemma4.py:169). Addition only; no
existing entries removed.
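
Sketched below is the shape of those entries; the container layout of
get_model_layer_config is an assumption here, and only the three module
names and the layernorms category come from the text above:

```python
# Illustrative sketch: dict layout assumed; module names match upstream
# Gemma4DecoderLayer when hidden_size_per_layer_input > 0.
gemma4_per_layer_input = {
    "linears":    ["per_layer_input_gate", "per_layer_projection"],
    "layernorms": ["post_per_layer_input_norm"],  # weight-only, like other norms
}
```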

vllm_utils.py _get_vllm_state_dict per-layer-input extraction
(lines 1170-1176): reuses the existing `get_state_dict` helper already
used by the surrounding self_attn / mlp extraction -- no new helper
written, no duplicate logic. FA6 self-check: grepped the workdir for
per_layer_input / _buffers patterns, confirmed get_state_dict is the
correct existing helper for Linear extraction, and that no
set_additional_modules-style helper for these specific modules already
exists.
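
A hedged sketch of that extraction; `get_state_dict` is the real
pre-existing helper, but its exact signature and the `layer` / `prefix`
locals are assumptions:

```python
# Illustrative sketch: reuse the same helper the neighboring
# self_attn / mlp extraction calls, one invocation per Linear.
for name in ("per_layer_input_gate", "per_layer_projection"):
    submodule = getattr(layer, name, None)
    if submodule is not None:
        get_state_dict(f"{prefix}.{name}", submodule)  # signature assumed
```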

vllm_utils.py convert_vllm_to_huggingface bare-tensor branch
(line ~1405): wrap the existing Parameter-assign path in an
if/else that first checks `_buffers[attr_name]`. FA3 blame analysis:

  - line 1401 (the Parameter-wrap call) blames 5d07504
    "[WIP] gemma 4 dense fast inference" -- the original code was
    written to handle Gemma4 layer_scalar as an nn.Parameter, but
    upstream Gemma4DecoderLayer registers it as a buffer
    (modeling_gemma4.py:1337: `self.register_buffer("layer_scalar",
    torch.ones(1))`). The original intent of 5d07504 was to move
    these tensors onto the model; preserving that intent requires
    honoring the source module's buffer registration rather than
    forcing nn.Parameter on all bare tensors. The existing
    Parameter-wrap path is retained for the non-buffer case and
    routes through the same exec(...) assignment, so no historical
    code is deleted -- only a preceding branch was added to select the
    buffer target when appropriate.
  - line 1402 (the exec-assign call) blames 2afbcc1 "Fast Inference
    with vLLM for VLMs (unslothai#202)". This line is moved verbatim into the
    `else` arm of the new if/else; not deleted.

  FA4 note: nn.Parameter(...) is still invoked, but only for the
  non-buffer case; the buffer case uses `_buffers[name] = value`,
  matching the pattern already used in empty_model.py:706 and 743
  inside finalize_huggingface_model.
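
In outline, the new branch looks like the following; `module`,
`attr_name`, and `tensor` are stand-in names, and the exec(...)
assignment from 2afbcc1 is paraphrased here with setattr:

```python
import torch

# Illustrative sketch of the bare-tensor branch in
# convert_vllm_to_huggingface.
if attr_name in module._buffers:
    # Upstream registered this tensor as a buffer, e.g. Gemma4's
    # register_buffer("layer_scalar", torch.ones(1)), so honor that
    # registration rather than forcing an nn.Parameter.
    module._buffers[attr_name] = tensor
else:
    # Pre-existing path, retained for the non-buffer case; the real
    # code routes through an exec(...) assignment, not setattr.
    setattr(module, attr_name, torch.nn.Parameter(tensor))
```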

vllm_utils.py convert_vllm_to_huggingface LayerNorm branch
(line ~1465): special-case `.conv1d` before the weight-only path.
FA3 blame: line 1453 (the "# LayerNorms (including vision norms)"
comment) blames 2afbcc1 "Fast Inference with vLLM for VLMs (unslothai#202)".
The comment was edited to acknowledge the additional conv1d case; the
original LayerNorm weight-only logic is retained for non-conv1d
layer_names. Upstream Qwen3_5GatedDeltaNet builds conv1d as a
depthwise Conv1d with `kernel_size=linear_conv_kernel_dim` and
`groups=conv_dim` (modeling_qwen3_5.py:375-382). The pre-existing
LayerNorm-only path only wrote `.weight`, so Conv1d kept placeholder
`kernel_size=1 / padding=0 / groups=1` and F.conv1d produced a wrong
output length for any `linear_conv_kernel_dim > 1`. Fixing conv1d is
the minimum change; the surrounding LayerNorm code is unchanged.
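
A sketch of the special case, assuming the fix must correct both the
placeholder's weights and its Conv1d metadata; `get_tensor` and the
causal padding value are assumptions, not the PR's actual code:

```python
import torch

# Illustrative sketch: repair the placeholder depthwise Conv1d whose
# real weight has shape (conv_dim, 1, kernel_size).
if layer_name.endswith(".conv1d"):
    weight = get_tensor(f"{layer_name}.weight")  # lookup helper assumed
    conv_dim, _, kernel_size = weight.shape
    module.weight = torch.nn.Parameter(weight)
    # Patch the metadata the kernel_size=1 / groups=1 / padding=0
    # placeholder got wrong, so F.conv1d sees the depthwise shape.
    module.kernel_size = (kernel_size,)
    module.groups = conv_dim
    module.in_channels = module.out_channels = conv_dim
    module.padding = (kernel_size - 1,)  # causal padding: assumption
    bias = get_tensor(f"{layer_name}.bias")  # may be None for bias=False
    if bias is not None:
        module.bias = torch.nn.Parameter(bias)
else:
    ...  # pre-existing LayerNorm weight-only path, unchanged
```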

rl_replacements.py grpo_accumulated_loss (lines 762-763): fix
pre-existing typo `io_same_decice` -> `io_same_device` to match
accelerate's AlignDevicesHook attribute (accelerate/hooks.py:266,275,
347). FA3 blame: both lines blame 8ac4171 "GPT OSS RL (unslothai#303)". The
original commit introduced the typo, which made the hasattr check
always False, so the intended hook reset never ran. This commit
restores the intended behavior and does not remove the guard or the
assignment -- only the attribute name is corrected. The PR's newly
added `rope_deltas` reset on the immediately following lines places
the overlap inside the changed region, which is why the typo is being
fixed in this PR rather than in a separate cleanup.
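
The corrected guard, in outline; the reset value and surrounding code
are paraphrased, and only the attribute rename is the actual change:

```python
# Illustrative sketch: `hook` is accelerate's AlignDevicesHook on the
# module. With the old "io_same_decice" spelling, this hasattr was
# always False and the body never ran.
if hasattr(hook, "io_same_device"):
    hook.io_same_device = False  # reset value paraphrased, not verbatim
```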