feat: native Qwen3-VL video support in MLLM mode #150

waybarrios merged 7 commits into waybarrios:main
Conversation
waybarrios left a comment:
Good feature idea, native video pipeline for Qwen3-VL is definitely the right direction. Found a few things that need attention before merging though.
```python
            translated.append({"role": role, "content": content})
            continue

        if not isinstance(content, list):
```
This whole method manually constructs input_ids, pixel_values, mask, video_grid_thw by calling the processor directly and converting to mx.array. That's reimplementing the preprocessing pipeline that mlx-vlm already handles internally. The project convention (CLAUDE.md) is to wrap mlx-vlm's public API, not reimplement its internals.
Would be better to see if mlx_vlm.generate() can accept video inputs directly, or open an upstream issue if it can't.
vllm_mlx/models/mllm.py
Outdated
```python
            if not isinstance(item, dict):
                continue
            item_type = item.get("type", "")
```
This collapses streaming into a single blocking call + one yield. For a 354-frame video with ~12s prefill, the event loop is completely blocked the entire time. The docstring says "Yields incremental text chunks" which isn't true for this path.
Either use mlx-vlm's streaming API here or at least document that native video doesn't support streaming yet and fall back gracefully.
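The graceful-fallback option could look something like this minimal sketch. The function name and the `generate_fn` callable are stand-ins for the actual method and blocking generate call, not the project's real API; the point is that the blocking behavior is documented and the caller is told to off-load it:

```python
from typing import Callable, Iterator


def stream_chat_native_video(prompt: str, generate_fn: Callable[[str], str]) -> Iterator[str]:
    """Native-video 'streaming' fallback.

    NOTE: there is no upstream streaming API for the native video path,
    so this blocks for the entire generation and yields exactly once.
    Callers on an event loop should wrap this in asyncio.to_thread().
    """
    # Single blocking call; the whole response arrives as one chunk.
    yield generate_fn(prompt)
```

A caller would then do `async for` over a thread-offloaded wrapper rather than expecting incremental chunks.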
```python
        )
        return MLLMOutput(text=str(result), finish_reason="stop")

    def _translate_messages_for_native_video(
```
`tools` is never extracted from `**kwargs` or forwarded to `apply_chat_template`. The regular chat() path routes tools through get_chat_template() properly. Any video request with tool definitions (agent use case) will silently drop tool calling.
This is a regression from the tool support added in PR #124.
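A minimal sketch of the fix being requested, assuming hypothetical names (`pop_generation_kwargs` and the default values are illustrative, not the project's actual helper): extract template-only options from kwargs before anything is spread into a downstream generate call, and hand `tools` to the chat template instead of dropping it.

```python
def pop_generation_kwargs(kwargs: dict) -> tuple:
    """Extract chat-template-only options so they never leak into
    a downstream generate(**kwargs) call."""
    tools = kwargs.pop("tools", None)           # forwarded to apply_chat_template
    video_fps = kwargs.pop("video_fps", 1.0)    # preprocessing-only parameter
    video_max_frames = kwargs.pop("video_max_frames", 16)
    return tools, video_fps, video_max_frames, kwargs
```

The returned `tools` would then be passed to `apply_chat_template(..., tools=tools)` on the native video path, mirroring what the text path already does.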
```python
        messages: list[dict],
        video_fps: float,
        video_max_frames: int,
    ) -> list[dict]:
```
The code does `from mlx_vlm.video_generate import process_vision_info` without a try/except. This submodule doesn't exist in older mlx-vlm versions, so it will crash at inference time (not startup) for users with older installations. The rest of the codebase guards these imports.
```diff
@@ -729,7 +729,12 @@ def load(self) -> None:
         self.config = load_config(self.model_name)

         self._loaded = True
```
_video_native is only set here in load() but never initialized in __init__. Every other boolean flag on the class (_loaded, enable_cache) gets a default in __init__. If load fails or hasn't run yet, accessing self._video_native in chat()/stream_chat() raises AttributeError.
Just need self._video_native = False in __init__.
vllm_mlx/server.py
Outdated
```python
    )

    has_media = bool(images or videos)
    if engine.is_mllm and (request.video_fps or request.video_max_frames):
```
Using request.video_fps or request.video_max_frames as a proxy for video content is fragile. A client that sends a video_url in the message content without setting those params will still have has_media=False. Better to check if any message content actually contains video_url/video types.
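The content-based check could be sketched like this (the helper name is illustrative; the message shapes follow the OpenAI content-parts format the PR already handles):

```python
def message_has_video(messages: list[dict]) -> bool:
    """Detect video content by inspecting message parts rather than
    relying on video_fps/video_max_frames being set by the client."""
    for msg in messages:
        content = msg.get("content")
        if not isinstance(content, list):
            continue  # plain-string content can't carry video parts
        for item in content:
            if isinstance(item, dict) and item.get("type") in ("video", "video_url"):
                return True
    return False
```

This way a request carrying a `video_url` part but no tuning parameters still sets `has_media=True`.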
Yo @patanet7, pushed a commit on top of yours to knock out a few things I spotted during review. Here's the rundown:

- `_video_native` only got set inside `load()`; added `self._video_native = False` in `__init__` so it defaults to `False` from the start.
- Older mlx-vlm versions don't have `video_generate`, so the import is now guarded:

```python
try:
    from mlx_vlm.video_generate import process_vision_info
except ImportError:
    raise ImportError(
        "mlx_vlm.video_generate is required for native video support. "
        "Upgrade with: pip install --upgrade mlx-vlm"
    )
```

- Empty video sources are skipped with `if not video_source: continue`.

Couple things still on your plate though:

Would you mind re-running your benchmarks with this commit on top? I doubt anything shifted since these are mostly guard/init type fixes, but always good to sanity check.
heads up - PR #124 just got the tool forwarding fix landed (kwargs.pop instead of kwargs.get so tools don't leak into mlx_vlm generate calls). you'll want to rebase on top of that once it's merged since the native video path in _generate_native_video has the same pattern where kwargs could carry unexpected stuff through. also worth double checking that the native video codepath plays nice with tool definitions if someone sends a video request with tools attached (agent use case). right now _generate_native_video doesn't extract or forward tools to apply_chat_template at all, so any tool defs would just get silently dropped for video requests.
just merged #124 btw, so the tool forwarding fix is on main now. when you rebase you'll get the kwargs.pop pattern for free in chat() and stream_chat(). just make sure _generate_native_video picks up the same approach for extracting tools before spreading kwargs downstream.
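To make the get-vs-pop distinction concrete, here is a hedged side-by-side sketch (function names are illustrative, not from the codebase). With `.get()`, the key stays in the dict and later arrives at a `generate(**kwargs)` call that doesn't accept it; `.pop()` removes it first:

```python
def build_template_kwargs_get(kwargs: dict):
    # BUG pattern: .get() leaves "tools" in kwargs, so a later
    # generate(**kwargs) receives an argument it doesn't expect.
    tools = kwargs.get("tools")
    return tools, kwargs


def build_template_kwargs_pop(kwargs: dict):
    # Fixed pattern: .pop() removes "tools" before kwargs is spread.
    tools = kwargs.pop("tools", None)
    return tools, kwargs
```

Both variants hand `tools` to the chat template; only the second leaves kwargs clean for the generate call.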
+1, @waybarrios' review covers everything important. Fixes already pushed look good:
Agree with the two remaining items for @patanet7:
One minor nit: the first-pass video extraction loop (
Three bugs fixed:
1. video_url content type silently ignored in MLLM chat() and stream_chat().
The OpenAI API video format uses {"type": "video_url", "video_url": {"url": ...}}
but only "video" type was handled. Fixes waybarrios#120.
2. Video frames extracted AFTER chat template built, causing token count
mismatch (template has 0 image tokens but vision encoder produces N*frame
features). Restructured to two-pass approach: extract video frames first,
then build chat template with correct frame counts.
3. server.py has_media always False in MLLM mode because images/videos are
extracted from messages internally (set to []). Added MLLM-specific check
so video_fps/video_max_frames params still reach chat() via chat_kwargs.
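The first fix (handling the OpenAI `video_url` content type) can be sketched as a small normalization pass. The internal `{"type": "video", "video": <url>}` shape is an assumption for illustration; the helper name is hypothetical:

```python
def translate_video_items(content: list[dict]) -> list[dict]:
    """Normalize OpenAI-style {"type": "video_url"} content parts to an
    internal {"type": "video", "video": <url>} shape so they are no
    longer silently ignored."""
    out = []
    for item in content:
        if isinstance(item, dict) and item.get("type") == "video_url":
            url = (item.get("video_url") or {}).get("url", "")
            if not url:
                continue  # skip items with an empty video source
            out.append({"type": "video", "video": url})
        else:
            out.append(item)  # pass text/image parts through unchanged
    return out
```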
For models with video_token_id (Qwen-family), video inputs now flow through mlx-vlm's native video pipeline instead of being treated as individual images. This activates:

- 3D conv frame pairing (temporal_patch_size=2)
- M-RoPE temporal position IDs (interleaved layout)
- Timestamp-frame interleaving in the prompt
- Proper video_grid_thw for the vision encoder

Falls back to frames-as-images for non-video models.

Adds _generate_native_video() and _translate_messages_for_native_video() to MLXMultimodalLM, plus unit tests for video URL parsing, frame count alignment, and message translation.
…y, video_generate wiring

- Forward tools to apply_chat_template in native video path (fixes silent tool-call drop, regression from PR waybarrios#124)
- Pop tools, use_cache, video_fps, video_max_frames from kwargs before the native video branch in chat() and stream_chat() to prevent leaking into mlx_vlm.generate()
- Extract _collect_video_inputs() to deduplicate video extraction between chat() and stream_chat()
- Split _generate_native_video into _prepare_native_video_inputs (preprocessing) + _generate_native_video (generation) wired through mlx_vlm.video_generate for clearer intent and easier adoption of upstream improvements
- Add ImportError guard on video_generate import in _generate_native_video to match codebase convention
- Document blocking stream_chat native video path (no upstream streaming API; engine wraps it in asyncio.to_thread())
- Add tests for multi-message videos, multiple videos per message, video_url translation, Pydantic handling, tool forwarding, video_generate import verification
Force-pushed b97be66 to 7b3f875
Hey @waybarrios @janhilgard, rebased on main and addressed all review items. Here's the breakdown:

Previously fixed (your commit, verified)

New in this push: tool forwarding in native video path
| Config | Frame Extract | Native Video | Speedup |
|---|---|---|---|
| fps=1.0, max_frames=4 | 12.37s (8.1 tok/s) | 7.24s (13.8 tok/s) | 1.7x |
| fps=2.0, max_frames=8 | 21.42s (4.7 tok/s) | 6.93s (14.4 tok/s) | 3.1x |
| fps=2.0, max_frames=16 | 20.78s (4.8 tok/s) | 5.89s (15.3 tok/s) | 3.5x |
- Tools + video tested end-to-end — no crash, 5.81s
- stream_chat native path tested — 1 chunk, correct response
- Performance consistent pre/post guard fixes
Test coverage
Added: multi-message videos, multiple videos per message, video_url type translation, Pydantic v1/v2 model handling, tool forwarding signature verification, video_generate import check.
Full suite: 958 passed, 0 failures.
Thanks @patanet7 — great work on the rebase and addressing all the review feedback. I went through the updated diff and verified all points from @waybarrios's review are addressed:
Benchmarks look great: 3.5x speedup at 16 frames. 958 tests, 0 failures. LGTM from my side — @waybarrios still needs to re-review to clear the requested changes.
would reformat /home/runner/work/vllm-mlx/vllm-mlx/tests/test_video.py
all review items from the previous rounds have been addressed, and @janhilgard gave LGTM. i resolved the merge conflicts with main and fixed the black formatting on test_video.py, so CI should pass now. the refactoring looks clean, tool forwarding works through the native video path, kwargs are centralized before branching, and the benchmarks show 3.5x speedup. approving and merging.
…ection, served-model-name

Merge 16 upstream commits (22dcbf8..d235c37) into our fork:

- feat: SpecPrefill — attention-based sparse prefill for TTFT reduction (waybarrios#180)
- feat: native Qwen3-VL video pipeline with temporal 3D conv + M-RoPE (waybarrios#150)
- fix: Disable MambaCache monkey-patch for hybrid models, add MTP auto-injection (waybarrios#97)
- feat: Add --served-model-name CLI parameter (waybarrios#125)
- feat: Add Qwen3.5 text-only loading and dynamic memory threshold (waybarrios#127)
- fix(mllm_scheduler): add adaptive periodic cache clearing (waybarrios#157)
- fix: Metal resource leak under high concurrency (waybarrios#92)

Conflict resolution strategy: keep all fork features (DeltaNet snapshots, fast SSE templates, tool injection, cloud routing, prompt cache, etc.) while incorporating upstream's new functionality.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Summary
- video_url content type being silently ignored in MLLM chat/stream_chat
- has_media always False for MLLM mode in server.py
- process_vision_info infrastructure

Benchmarks (M-series, ZwZ-8B via mlx-vlm)
Old approach (frames-as-images) produced ~32,666 prompt tokens for the same video.