
feat: native Qwen3-VL video support in MLLM mode#150

Merged
waybarrios merged 7 commits into waybarrios:main from patanet7:feat/native-video-support
Mar 21, 2026

Conversation

Contributor

@patanet7 patanet7 commented Mar 10, 2026

Summary

  • Fix video_url content type being silently ignored in MLLM chat/stream_chat
  • Fix video frame tokens not counted in prompt token reporting
  • Fix has_media always False for MLLM mode in server.py
  • Add native video pipeline for Qwen3-VL family models (temporal 3D conv + M-RoPE + timestamp interleaving) using mlx-vlm's existing process_vision_info infrastructure

Benchmarks (M-series, ZwZ-8B via mlx-vlm)

| Setting | Frames | video_grid_thw | Prompt tokens | Prefill | Gen speed |
|---|---|---|---|---|---|
| 1 fps (4f) | 4 | [2, 36, 64] | 1,189 | ~3s | 23.9 tok/s |
| 2 fps (4f) | 4 | [2, 36, 64] | 1,200 | ~3s | 23.9 tok/s |
| 60 fps (354f) | 354 | [5, 36, 64] | 2,935 | ~12s | 5.7 tok/s |

Old approach (frames-as-images) produced ~32,666 prompt tokens for the same video.

Owner

@waybarrios waybarrios left a comment


Good feature idea; the native video pipeline for Qwen3-VL is definitely the right direction. Found a few things that need attention before merging, though.

```python
translated.append({"role": role, "content": content})
continue

if not isinstance(content, list):
```
Owner


This whole method manually constructs input_ids, pixel_values, mask, video_grid_thw by calling the processor directly and converting to mx.array. That's reimplementing the preprocessing pipeline that mlx-vlm already handles internally. The project convention (CLAUDE.md) is to wrap mlx-vlm's public API, not reimplement its internals.

Would be better to see if mlx_vlm.generate() can accept video inputs directly, or open an upstream issue if it can't.


```python
if not isinstance(item, dict):
    continue
item_type = item.get("type", "")
```
Owner


This collapses streaming into a single blocking call + one yield. For a 354-frame video with ~12s prefill, the event loop is completely blocked the entire time. The docstring says "Yields incremental text chunks" which isn't true for this path.

Either use mlx-vlm's streaming API here or at least document that native video doesn't support streaming yet and fall back gracefully.

```python
)
return MLLMOutput(text=str(result), finish_reason="stop")

def _translate_messages_for_native_video(
```
Owner


tools from **kwargs is never extracted or forwarded to apply_chat_template. The regular chat() path routes tools through get_chat_template() properly. Any video request with tool definitions (agent use case) will silently drop tool calling.

This is a regression from the tool support added in PR #124.

```python
    messages: list[dict],
    video_fps: float,
    video_max_frames: int,
) -> list[dict]:
```
Owner


from mlx_vlm.video_generate import process_vision_info without a try/except. This submodule doesn't exist in older mlx-vlm versions, so this will crash at inference time (not startup) for users with older installations. The rest of the codebase guards these imports.

```diff
@@ -729,7 +729,12 @@ def load(self) -> None:
         self.config = load_config(self.model_name)

         self._loaded = True
```
Owner


_video_native is only set here in load() but never initialized in __init__. Every other boolean flag on the class (_loaded, enable_cache) gets a default in __init__. If load fails or hasn't run yet, accessing self._video_native in chat()/stream_chat() raises AttributeError.

Just need self._video_native = False in __init__.

```python
)

has_media = bool(images or videos)
if engine.is_mllm and (request.video_fps or request.video_max_frames):
```
Owner


Using request.video_fps or request.video_max_frames as a proxy for video content is fragile. A client that sends a video_url in the message content without setting those params will still have has_media=False. Better to check if any message content actually contains video_url/video types.
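A sketch of the content-scan approach, assuming OpenAI-style message parts; the helper name is hypothetical:

```python
def message_has_media(messages: list[dict]) -> bool:
    """Return True if any message part actually carries image or video
    content, instead of inferring media from request params like
    video_fps. Hypothetical helper illustrating the review suggestion."""
    media_types = {"image_url", "image", "video_url", "video"}
    for msg in messages:
        content = msg.get("content")
        if not isinstance(content, list):
            continue  # plain string content: text only
        for part in content:
            if isinstance(part, dict) and part.get("type") in media_types:
                return True
    return False
```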

@waybarrios
Owner

Yo @patanet7, pushed a commit on top of yours to knock out a few things I spotted during review. Here's the rundown:

_video_native wasn't initialized in __init__

Only got set inside load() so anything touching it before that would blow up with AttributeError.

```python
self._loaded = False
self._video_native = False  # now it defaults to False from the start
```

process_vision_info import had no guard

Older mlx-vlm versions don't have video_generate at all, so this would just die at inference time with a cryptic ImportError instead of telling the user what's up.

```python
try:
    from mlx_vlm.video_generate import process_vision_info
except ImportError:
    raise ImportError(
        "mlx_vlm.video_generate is required for native video support. "
        "Upgrade with: pip install --upgrade mlx-vlm"
    )
```

Empty video_source was slipping through

If vid_url.get("url", "") came back empty, it'd go straight into process_video_input("") and throw a ValueError, surfacing as a 500 to the client. Added a simple bail-out:

```python
if not video_source:
    continue
```

has_media detection was too fragile

Was using request.video_fps as a proxy for whether there was actual video content, which totally missed requests that send a video_url without those params. Rewrote it to actually scan message content for media types instead of guessing from request params.

Couple things still on your plate though:

_generate_native_video doesn't forward tools to apply_chat_template. The regular chat() path does this through get_chat_template() with template_extra_kwargs, but the native video path skips all of that. So any video request that also has tool definitions will just silently not generate tool calls, which kinda breaks what PR #124 added.

The stream_chat native path is also doing a single blocking generate() call and yielding everything at once. For something like your 354-frame benchmark that takes ~12s prefill, that's the event loop completely locked up the whole time. Worth checking if mlx-vlm has a streaming API you could use here, or at minimum throwing it in a thread so the loop stays alive.

Would you mind re-running your benchmarks with this commit on top? I doubt anything shifted since these are mostly guard/init type fixes, but always good to sanity check.

@waybarrios
Owner

heads up - PR #124 just got the tool forwarding fix landed (kwargs.pop instead of kwargs.get so tools don't leak into mlx_vlm generate calls). you'll want to rebase on top of that once it's merged, since the native video path in _generate_native_video has the same pattern where kwargs could carry unexpected stuff through.

also worth double-checking that the native video codepath plays nice with tool definitions if someone sends a video request with tools attached (agent use case). right now _generate_native_video doesn't extract or forward tools to apply_chat_template at all, so any tool defs would just get silently dropped for video requests.

@waybarrios
Owner

just merged #124 btw, so the tool forwarding fix is on main now. when you rebase you'll get the kwargs.pop pattern for free in chat() and stream_chat(). just make sure _generate_native_video picks up the same approach for extracting tools before spreading kwargs downstream

@janhilgard
Collaborator

+1, @waybarrios' review covers everything important.

Fixes already pushed look good:

  • _video_native = False init in __init__ — prevents AttributeError before load()
  • process_vision_info import guard with actionable error message
  • Empty video_source bail-out — no more 500s from process_video_input("")
  • has_media rewritten to scan actual message content instead of relying on request.video_fps

Agree with the two remaining items for @patanet7:

  1. tools not forwarded in _generate_native_video: apply_chat_template is called directly without template_extra_kwargs, so tool definitions are silently dropped. This breaks the tool-call support from PR #124 (fix: forward tool definitions through MLLM code path) for any video request that also includes tools.

  2. Blocking generate() in stream_chat native path — yields a single chunk after the full generation completes. With 354 frames / ~12s prefill, the event loop is completely locked. At minimum wrapping in asyncio.to_thread() would keep the loop alive; ideally mlx-vlm's streaming API (if available) would give true incremental output.

One minor nit: the first-pass video extraction loop (_msg_video_inputs collection) is duplicated verbatim (~30 lines) between chat() and stream_chat(). Could be a private helper, but that's a refactor for later.

patanet7 and others added 5 commits March 21, 2026 13:54
Three bugs fixed:

1. video_url content type silently ignored in MLLM chat() and stream_chat().
   The OpenAI API video format uses {"type": "video_url", "video_url": {"url": ...}}
   but only "video" type was handled. Fixes waybarrios#120.

2. Video frames extracted AFTER chat template built, causing token count
   mismatch (template has 0 image tokens but vision encoder produces N*frame
   features). Restructured to two-pass approach: extract video frames first,
   then build chat template with correct frame counts.

3. server.py has_media always False in MLLM mode because images/videos are
   extracted from messages internally (set to []). Added MLLM-specific check
   so video_fps/video_max_frames params still reach chat() via chat_kwargs.
For models with video_token_id (Qwen-family), video inputs now flow through
mlx-vlm's native video pipeline instead of being treated as individual images.

This activates:
- 3D conv frame pairing (temporal_patch_size=2)
- M-RoPE temporal position IDs (interleaved layout)
- Timestamp-frame interleaving in the prompt
- Proper video_grid_thw for the vision encoder

Falls back to frame-as-images for non-video models.

Adds _generate_native_video() and _translate_messages_for_native_video()
to MLXMultimodalLM, plus unit tests for video URL parsing, frame count
alignment, and message translation.
…y, video_generate wiring

- Forward tools to apply_chat_template in native video path (fixes
  silent tool-call drop, regression from PR waybarrios#124)
- Pop tools, use_cache, video_fps, video_max_frames from kwargs
  before native video branch in chat() and stream_chat() to prevent
  leaking into mlx_vlm.generate()
- Extract _collect_video_inputs() to deduplicate video extraction
  between chat() and stream_chat()
- Split _generate_native_video into _prepare_native_video_inputs
  (preprocessing) + _generate_native_video (generation) wired
  through mlx_vlm.video_generate for clearer intent and easier
  adoption of upstream improvements
- Add ImportError guard on video_generate import in
  _generate_native_video to match codebase convention
- Document blocking stream_chat native video path — no upstream
  streaming API, engine wraps in asyncio.to_thread()
- Add tests for multi-message videos, multiple videos per message,
  video_url translation, Pydantic handling, tool forwarding,
  video_generate import verification
@patanet7 patanet7 force-pushed the feat/native-video-support branch from b97be66 to 7b3f875 on March 21, 2026 at 22:00
@patanet7
Contributor Author

patanet7 commented Mar 21, 2026

Hey @waybarrios @janhilgard — rebased on main and addressed all review items. Here's the breakdown:


Previously fixed (your commit, verified)

  • _video_native = False init in __init__
  • process_vision_info import guard with actionable error
  • Empty video_source bail-out
  • has_media rewritten to scan message content

New in this push

Tool forwarding in native video path

tools is now popped from kwargs before the native video branch in both chat() and stream_chat(), and forwarded explicitly through _generate_native_video to processor.apply_chat_template(). Video requests with tool definitions will no longer silently drop them.

use_cache kwargs leak fix

use_cache was being popped ~150 lines after the native video branch, meaning it leaked into mlx_vlm.generate() via **kwargs on the native path. Moved all kwargs pops (video_fps, video_max_frames, tools, use_cache) to a single block before any branching in both chat() and stream_chat().
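The centralized pop block can be sketched like this; the helper name and defaults are illustrative, not the actual code:

```python
def pop_engine_options(kwargs: dict) -> dict:
    """Pop every engine-level option in one place, before any branching,
    so neither the native video path nor the image path can leak them
    into mlx_vlm.generate(**kwargs). Defaults here are illustrative."""
    return {
        "tools": kwargs.pop("tools", None),
        "use_cache": kwargs.pop("use_cache", True),
        "video_fps": kwargs.pop("video_fps", 1.0),
        "video_max_frames": kwargs.pop("video_max_frames", 16),
    }
```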

Deduplicated video extraction

The ~30-line first-pass video collection loop is now _collect_video_inputs(), shared between chat() and stream_chat().
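The shape of that shared helper, assuming OpenAI-style message parts (a sketch; the real one may return richer metadata than plain source strings):

```python
def collect_video_inputs(messages: list[dict]) -> list[str]:
    """Single-pass extraction of video sources from chat messages,
    handling both {"type": "video", ...} and
    {"type": "video_url", "video_url": {"url": ...}} parts."""
    videos = []
    for msg in messages:
        content = msg.get("content")
        if not isinstance(content, list):
            continue
        for part in content:
            if not isinstance(part, dict):
                continue
            if part.get("type") == "video_url":
                url = part.get("video_url", {}).get("url", "")
            elif part.get("type") == "video":
                url = part.get("video", "")
            else:
                continue
            if url:  # skip empty sources instead of 500-ing downstream
                videos.append(url)
    return videos
```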

Refactored to wire through mlx_vlm.video_generate

Split _generate_native_video into:

  • _prepare_native_video_inputs — message translation → tensors
  • _generate_native_video — thin wrapper calling mlx_vlm.video_generate.generate()

This makes intent explicit and means if upstream adds a higher-level video API or streaming support, we just change the import.


Design decisions

On the mlx-vlm public API concern

Investigated thoroughly. mlx_vlm.video_generate.main() does the exact same manual tensor construction we do (processor → input_ids, pixel_values, mask, video_grid_thw → generate). There is no higher-level "pass video paths, get text" API; video_generate.generate is literally mlx_vlm.generate.generate re-exported. Added a docstring noting this is currently Qwen-family-specific.

On blocking stream_chat

mlx_vlm.video_generate has no stream_generate function. Its generate() internally calls stream_generate from the base mlx_vlm.generate module, which doesn't know about video tensors. True token-level streaming for native video requires upstream support.

However, the event loop is not blocked at the server level: SimpleEngine.generate_stream() already wraps stream_chat in asyncio.to_thread() (engine/simple.py:348-360), so other requests are served concurrently. The only user-facing impact is that streaming clients see the full response arrive at once rather than token-by-token for native video.
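The wrapping pattern looks roughly like this, sketched with a stand-in for the blocking generator (not the actual SimpleEngine code):

```python
import asyncio


async def generate_stream(blocking_stream_chat, *args, **kwargs):
    """Drain a blocking stream_chat generator in a worker thread so the
    event loop stays free to serve other requests. Chunks still arrive
    all at once for native video, but nothing else is starved."""
    def _drain():
        return list(blocking_stream_chat(*args, **kwargs))

    return await asyncio.to_thread(_drain)
```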


Benchmarks

Re-run on mlx-community/Qwen3-VL-8B-Instruct-4bit (local):

| Config | Frame Extract | Native Video | Speedup |
|---|---|---|---|
| fps=1.0, max_frames=4 | 12.37s (8.1 tok/s) | 7.24s (13.8 tok/s) | 1.7x |
| fps=2.0, max_frames=8 | 21.42s (4.7 tok/s) | 6.93s (14.4 tok/s) | 3.1x |
| fps=2.0, max_frames=16 | 20.78s (4.8 tok/s) | 5.89s (15.3 tok/s) | 3.5x |

  • Tools + video tested end-to-end — no crash, 5.81s
  • stream_chat native path tested — 1 chunk, correct response
  • Performance consistent pre/post guard fixes

Test coverage

Added: multi-message videos, multiple videos per message, video_url type translation, Pydantic v1/v2 model handling, tool forwarding signature verification, video_generate import check.

Full suite: 958 passed, 0 failures.

@janhilgard
Collaborator

janhilgard commented Mar 21, 2026

Thanks @patanet7 — great work on the rebase and addressing all the review feedback. I went through the updated diff and verified all points from @waybarrios's review are addressed:

  • _video_native = False init in __init__ ✅
  • process_vision_info import guard with try/except and actionable error ✅
  • Tool forwarding through the full native video path (_generate_native_video → _prepare_native_video_inputs → apply_chat_template) ✅
  • has_media rewritten — scans message content in server.py instead of relying on fps/max_frames proxy ✅
  • use_cache kwargs leak — all pops centralized before branching in chat() and stream_chat() ✅
  • _collect_video_inputs() helper — deduplicates video extraction ✅

The asyncio.to_thread() wrapping in SimpleEngine is a pragmatic solution for the streaming limitation — no event loop blocking at the server level.

Benchmarks look great: 3.5x speedup at 16 frames. 958 tests, 0 failures.

LGTM from my side — @waybarrios still needs to re-review to clear the requested changes.

@waybarrios
Owner

```
would reformat /home/runner/work/vllm-mlx/vllm-mlx/tests/test_video.py
would reformat /home/runner/work/vllm-mlx/vllm-mlx/vllm_mlx/engine_core.py
```

@waybarrios waybarrios self-requested a review March 21, 2026 22:17
@waybarrios
Owner

all review items from the previous rounds have been addressed, @janhilgard gave LGTM

i resolved the merge conflicts with main and fixed the black formatting on test_video.py. CI should pass now

the refactoring looks clean, tool forwarding works through the native video path, kwargs are centralized before branching, and the benchmarks show 3.5x speedup. approving and merging

@waybarrios waybarrios merged commit 2a79216 into waybarrios:main Mar 21, 2026
raullenchai pushed a commit to raullenchai/Rapid-MLX that referenced this pull request Mar 26, 2026
…ection, served-model-name

Merge 16 upstream commits (22dcbf8..d235c37) into our fork:

- feat: SpecPrefill — attention-based sparse prefill for TTFT reduction (waybarrios#180)
- feat: native Qwen3-VL video pipeline with temporal 3D conv + M-RoPE (waybarrios#150)
- fix: Disable MambaCache monkey-patch for hybrid models, add MTP auto-injection (waybarrios#97)
- feat: Add --served-model-name CLI parameter (waybarrios#125)
- feat: Add Qwen3.5 text-only loading and dynamic memory threshold (waybarrios#127)
- fix(mllm_scheduler): add adaptive periodic cache clearing (waybarrios#157)
- fix: Metal resource leak under high concurrency (waybarrios#92)

Conflict resolution strategy: keep all fork features (DeltaNet snapshots,
fast SSE templates, tool injection, cloud routing, prompt cache, etc.)
while incorporating upstream's new functionality.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Labels

enhancement New feature or request UNDER REVIEW


3 participants