
feat: native Qwen3-VL video support in MLLM mode#150

Merged
waybarrios merged 7 commits into waybarrios:main from patanet7:feat/native-video-support
Mar 21, 2026

Conversation

Contributor

@patanet7 patanet7 commented Mar 10, 2026

Summary

  • Fix video_url content type being silently ignored in MLLM chat/stream_chat
  • Fix video frame tokens not counted in prompt token reporting
  • Fix has_media always False for MLLM mode in server.py
  • Add native video pipeline for Qwen3-VL family models (temporal 3D conv + M-RoPE + timestamp interleaving) using mlx-vlm's existing process_vision_info infrastructure

Benchmarks (M-series, ZwZ-8B via mlx-vlm)

| Setting | Frames | video_grid_thw | Prompt tokens | Prefill | Gen speed |
|---|---|---|---|---|---|
| 1 fps (4f) | 4 | [2, 36, 64] | 1,189 | ~3s | 23.9 tok/s |
| 2 fps (4f) | 4 | [2, 36, 64] | 1,200 | ~3s | 23.9 tok/s |
| 60 fps (354f) | 354 | [5, 36, 64] | 2,935 | ~12s | 5.7 tok/s |

Old approach (frames-as-images) produced ~32,666 prompt tokens for the same video.

Owner

@waybarrios waybarrios left a comment


Good feature idea; the native video pipeline for Qwen3-VL is definitely the right direction. Found a few things that need attention before merging, though.

```python
translated.append({"role": role, "content": content})
continue

if not isinstance(content, list):
```
Owner


This whole method manually constructs input_ids, pixel_values, mask, video_grid_thw by calling the processor directly and converting to mx.array. That's reimplementing the preprocessing pipeline that mlx-vlm already handles internally. The project convention (CLAUDE.md) is to wrap mlx-vlm's public API, not reimplement its internals.

Would be better to see if mlx_vlm.generate() can accept video inputs directly, or open an upstream issue if it can't.


```python
if not isinstance(item, dict):
    continue
item_type = item.get("type", "")
```
Owner


This collapses streaming into a single blocking call + one yield. For a 354-frame video with ~12s prefill, the event loop is completely blocked the entire time. The docstring says "Yields incremental text chunks" which isn't true for this path.

Either use mlx-vlm's streaming API here or at least document that native video doesn't support streaming yet and fall back gracefully.

```python
)
return MLLMOutput(text=str(result), finish_reason="stop")

def _translate_messages_for_native_video(
```
Owner


tools from **kwargs is never extracted or forwarded to apply_chat_template. The regular chat() path routes tools through get_chat_template() properly. Any video request with tool definitions (agent use case) will silently drop tool calling.

This is a regression from the tool support added in PR #124.

```python
    messages: list[dict],
    video_fps: float,
    video_max_frames: int,
) -> list[dict]:
```
Owner


from mlx_vlm.video_generate import process_vision_info without a try/except. This submodule doesn't exist in older mlx-vlm versions, so this will crash at inference time (not startup) for users with older installations. The rest of the codebase guards these imports.

```diff
@@ -729,7 +729,12 @@ def load(self) -> None:
         self.config = load_config(self.model_name)

         self._loaded = True
```
Owner


_video_native is only set here in load() but never initialized in __init__. Every other boolean flag on the class (_loaded, enable_cache) gets a default in __init__. If load fails or hasn't run yet, accessing self._video_native in chat()/stream_chat() raises AttributeError.

Just need self._video_native = False in __init__.

```python
)

has_media = bool(images or videos)
if engine.is_mllm and (request.video_fps or request.video_max_frames):
```
Owner


Using request.video_fps or request.video_max_frames as a proxy for video content is fragile. A client that sends a video_url in the message content without setting those params will still have has_media=False. Better to check if any message content actually contains video_url/video types.
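A sketch of the content-scan approach, assuming OpenAI-style message parts; the helper name is hypothetical:

```python
def message_has_media(messages: list[dict]) -> bool:
    """Return True if any message part actually carries image or video
    content, instead of inferring media from request params like
    video_fps. Hypothetical helper illustrating the review suggestion."""
    media_types = {"image_url", "image", "video_url", "video"}
    for msg in messages:
        content = msg.get("content")
        if not isinstance(content, list):
            continue  # plain string content: text only
        for part in content:
            if isinstance(part, dict) and part.get("type") in media_types:
                return True
    return False
```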

@waybarrios
Owner

Yo @patanet7, pushed a commit on top of yours to knock out a few things I spotted during review. Here's the rundown:

_video_native wasn't initialized in __init__

Only got set inside load() so anything touching it before that would blow up with AttributeError.

```python
self._loaded = False
self._video_native = False  # now it defaults to False from the start
```

process_vision_info import had no guard

Older mlx-vlm versions don't have video_generate at all, so this would just die at inference time with a cryptic ImportError instead of telling the user what's up.

```python
try:
    from mlx_vlm.video_generate import process_vision_info
except ImportError:
    raise ImportError(
        "mlx_vlm.video_generate is required for native video support. "
        "Upgrade with: pip install --upgrade mlx-vlm"
    )
```

Empty video_source was slipping through

If vid_url.get("url", "") came back empty, it'd go straight into process_video_input("") and throw a ValueError, surfacing as a 500 to the client. Added a simple bail-out:

```python
if not video_source:
    continue
```

has_media detection was too fragile

Was using request.video_fps as a proxy for whether there was actual video content, which totally missed requests that send a video_url without those params. Rewrote it to actually scan message content for media types instead of guessing from request params.

Couple things still on your plate though:

_generate_native_video doesn't forward tools to apply_chat_template. The regular chat() path does this through get_chat_template() with template_extra_kwargs, but the native video path skips all of that. So any video request that also has tool definitions will just silently not generate tool calls, which kinda breaks what PR #124 added.

The stream_chat native path is also doing a single blocking generate() call and yielding everything at once. For something like your 354-frame benchmark that takes ~12s prefill, that's the event loop completely locked up the whole time. Worth checking if mlx-vlm has a streaming API you could use here, or at minimum throwing it in a thread so the loop stays alive.

Would you mind re-running your benchmarks with this commit on top? I doubt anything shifted since these are mostly guard/init type fixes, but always good to sanity check.

@waybarrios
Owner

heads up - PR #124 just got the tool forwarding fix landed (kwargs.pop instead of kwargs.get so tools don't leak into mlx_vlm generate calls). you'll want to rebase on top of that once it's merged, since the native video path in _generate_native_video has the same pattern where kwargs could carry unexpected stuff through.

also worth double-checking that the native video codepath plays nice with tool definitions if someone sends a video request with tools attached (agent use case). right now _generate_native_video doesn't extract or forward tools to apply_chat_template at all, so any tool defs would just get silently dropped for video requests.

@waybarrios
Owner

just merged #124 btw, so the tool forwarding fix is on main now. when you rebase you'll get the kwargs.pop pattern for free in chat() and stream_chat(). just make sure _generate_native_video picks up the same approach for extracting tools before spreading kwargs downstream

@janhilgard
Collaborator

+1, @waybarrios' review covers everything important.

Fixes already pushed look good:

  • _video_native = False init in __init__ — prevents AttributeError before load()
  • process_vision_info import guard with actionable error message
  • Empty video_source bail-out — no more 500s from process_video_input("")
  • has_media rewritten to scan actual message content instead of relying on request.video_fps

Agree with the two remaining items for @patanet7:

  1. tools not forwarded in _generate_native_video: apply_chat_template is called directly without template_extra_kwargs, so tool definitions are silently dropped. This breaks the tool-call support from PR #124 (fix: forward tool definitions through MLLM code path) for any video request that also includes tools.

  2. Blocking generate() in stream_chat native path — yields a single chunk after the full generation completes. With 354 frames / ~12s prefill, the event loop is completely locked. At minimum wrapping in asyncio.to_thread() would keep the loop alive; ideally mlx-vlm's streaming API (if available) would give true incremental output.

One minor nit: the first-pass video extraction loop (_msg_video_inputs collection) is duplicated verbatim (~30 lines) between chat() and stream_chat(). Could be a private helper, but that's a refactor for later.

patanet7 and others added 5 commits March 21, 2026 13:54
Three bugs fixed:

1. video_url content type silently ignored in MLLM chat() and stream_chat().
   The OpenAI API video format uses {"type": "video_url", "video_url": {"url": ...}}
   but only "video" type was handled. Fixes waybarrios#120.

2. Video frames extracted AFTER chat template built, causing token count
   mismatch (template has 0 image tokens but vision encoder produces N*frame
   features). Restructured to two-pass approach: extract video frames first,
   then build chat template with correct frame counts.

3. server.py has_media always False in MLLM mode because images/videos are
   extracted from messages internally (set to []). Added MLLM-specific check
   so video_fps/video_max_frames params still reach chat() via chat_kwargs.
For models with video_token_id (Qwen-family), video inputs now flow through
mlx-vlm's native video pipeline instead of being treated as individual images.

This activates:
- 3D conv frame pairing (temporal_patch_size=2)
- M-RoPE temporal position IDs (interleaved layout)
- Timestamp-frame interleaving in the prompt
- Proper video_grid_thw for the vision encoder

Falls back to frame-as-images for non-video models.

Adds _generate_native_video() and _translate_messages_for_native_video()
to MLXMultimodalLM, plus unit tests for video URL parsing, frame count
alignment, and message translation.
…y, video_generate wiring

- Forward tools to apply_chat_template in native video path (fixes
  silent tool-call drop, regression from PR waybarrios#124)
- Pop tools, use_cache, video_fps, video_max_frames from kwargs
  before native video branch in chat() and stream_chat() to prevent
  leaking into mlx_vlm.generate()
- Extract _collect_video_inputs() to deduplicate video extraction
  between chat() and stream_chat()
- Split _generate_native_video into _prepare_native_video_inputs
  (preprocessing) + _generate_native_video (generation) wired
  through mlx_vlm.video_generate for clearer intent and easier
  adoption of upstream improvements
- Add ImportError guard on video_generate import in
  _generate_native_video to match codebase convention
- Document blocking stream_chat native video path — no upstream
  streaming API, engine wraps in asyncio.to_thread()
- Add tests for multi-message videos, multiple videos per message,
  video_url translation, Pydantic handling, tool forwarding,
  video_generate import verification
@patanet7 patanet7 force-pushed the feat/native-video-support branch from b97be66 to 7b3f875 on March 21, 2026 at 22:00
@patanet7
Contributor Author

patanet7 commented Mar 21, 2026

Hey @waybarrios @janhilgard — rebased on main and addressed all review items. Here's the breakdown:


Previously fixed (your commit, verified)

  • _video_native = False init in __init__
  • process_vision_info import guard with actionable error
  • Empty video_source bail-out
  • has_media rewritten to scan message content

New in this push

Tool forwarding in native video path

tools is now popped from kwargs before the native video branch in both chat() and stream_chat(), and forwarded explicitly through _generate_native_video to processor.apply_chat_template(). Video requests with tool definitions will no longer silently drop them.

use_cache kwargs leak fix

use_cache was being popped ~150 lines after the native video branch, meaning it leaked into mlx_vlm.generate() via **kwargs on the native path. Moved all kwargs pops (video_fps, video_max_frames, tools, use_cache) to a single block before any branching in both chat() and stream_chat().
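The centralized pop block can be sketched like this; the helper name and defaults are illustrative, not the actual code:

```python
def pop_engine_options(kwargs: dict) -> dict:
    """Pop every engine-level option in one place, before any branching,
    so neither the native video path nor the image path can leak them
    into mlx_vlm.generate(**kwargs). Defaults here are illustrative."""
    return {
        "tools": kwargs.pop("tools", None),
        "use_cache": kwargs.pop("use_cache", True),
        "video_fps": kwargs.pop("video_fps", 1.0),
        "video_max_frames": kwargs.pop("video_max_frames", 16),
    }
```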

Deduplicated video extraction

The ~30-line first-pass video collection loop is now _collect_video_inputs(), shared between chat() and stream_chat().
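The shape of that shared helper, assuming OpenAI-style message parts (a sketch; the real one may return richer metadata than plain source strings):

```python
def collect_video_inputs(messages: list[dict]) -> list[str]:
    """Single-pass extraction of video sources from chat messages,
    handling both {"type": "video", ...} and
    {"type": "video_url", "video_url": {"url": ...}} parts."""
    videos = []
    for msg in messages:
        content = msg.get("content")
        if not isinstance(content, list):
            continue
        for part in content:
            if not isinstance(part, dict):
                continue
            if part.get("type") == "video_url":
                url = part.get("video_url", {}).get("url", "")
            elif part.get("type") == "video":
                url = part.get("video", "")
            else:
                continue
            if url:  # skip empty sources instead of 500-ing downstream
                videos.append(url)
    return videos
```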

Refactored to wire through mlx_vlm.video_generate

Split _generate_native_video into:

  • _prepare_native_video_inputs — message translation → tensors
  • _generate_native_video — thin wrapper calling mlx_vlm.video_generate.generate()

This makes intent explicit and means if upstream adds a higher-level video API or streaming support, we just change the import.


Design decisions

On the mlx-vlm public API concern

Investigated thoroughly. mlx_vlm.video_generate.main() does the exact same manual tensor construction we do (processor → input_ids, pixel_values, mask, video_grid_thw → generate). There is no higher-level "pass video paths, get text" API; video_generate.generate is literally mlx_vlm.generate.generate re-exported. Added a docstring noting this is currently Qwen-family-specific.

On blocking stream_chat

mlx_vlm.video_generate has no stream_generate function. Its generate() internally calls stream_generate from the base mlx_vlm.generate module, which doesn't know about video tensors. True token-level streaming for native video requires upstream support.

However, the event loop is not blocked at the server level: SimpleEngine.generate_stream() already wraps stream_chat in asyncio.to_thread() (engine/simple.py:348-360), so other requests are served concurrently. The only user-facing impact is that streaming clients see the full response arrive at once rather than token-by-token for native video.
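The wrapping pattern looks roughly like this, sketched with a stand-in for the blocking generator (not the actual SimpleEngine code):

```python
import asyncio


async def generate_stream(blocking_stream_chat, *args, **kwargs):
    """Drain a blocking stream_chat generator in a worker thread so the
    event loop stays free to serve other requests. Chunks still arrive
    all at once for native video, but nothing else is starved."""
    def _drain():
        return list(blocking_stream_chat(*args, **kwargs))

    return await asyncio.to_thread(_drain)
```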


Benchmarks

Re-run on mlx-community/Qwen3-VL-8B-Instruct-4bit (local):

| Config | Frame Extract | Native Video | Speedup |
|---|---|---|---|
| fps=1.0, max_frames=4 | 12.37s (8.1 tok/s) | 7.24s (13.8 tok/s) | 1.7x |
| fps=2.0, max_frames=8 | 21.42s (4.7 tok/s) | 6.93s (14.4 tok/s) | 3.1x |
| fps=2.0, max_frames=16 | 20.78s (4.8 tok/s) | 5.89s (15.3 tok/s) | 3.5x |

  • Tools + video tested end-to-end — no crash, 5.81s
  • stream_chat native path tested — 1 chunk, correct response
  • Performance consistent pre/post guard fixes

Test coverage

Added: multi-message videos, multiple videos per message, video_url type translation, Pydantic v1/v2 model handling, tool forwarding signature verification, video_generate import check.

Full suite: 958 passed, 0 failures.

@janhilgard
Collaborator

janhilgard commented Mar 21, 2026

Thanks @patanet7 — great work on the rebase and addressing all the review feedback. I went through the updated diff and verified all points from @waybarrios's review are addressed:

  • _video_native = False init in __init__ ✅
  • process_vision_info import guard with try/except and actionable error ✅
  • Tool forwarding through the full native video path (_generate_native_video → _prepare_native_video_inputs → apply_chat_template) ✅
  • has_media rewritten — scans message content in server.py instead of relying on fps/max_frames proxy ✅
  • use_cache kwargs leak — all pops centralized before branching in chat() and stream_chat() ✅
  • _collect_video_inputs() helper — deduplicates video extraction ✅

The asyncio.to_thread() wrapping in SimpleEngine is a pragmatic solution for the streaming limitation — no event loop blocking at the server level.

Benchmarks look great: 3.5x speedup at 16 frames. 958 tests, 0 failures.

LGTM from my side — @waybarrios still needs to re-review to clear the requested changes.

@waybarrios
Owner

```
would reformat /home/runner/work/vllm-mlx/vllm-mlx/tests/test_video.py
would reformat /home/runner/work/vllm-mlx/vllm-mlx/vllm_mlx/engine_core.py
```

@waybarrios waybarrios self-requested a review March 21, 2026 22:17
@waybarrios
Owner

all review items from the previous rounds have been addressed, @janhilgard gave LGTM

i resolved the merge conflicts with main and fixed the black formatting on test_video.py. CI should pass now

the refactoring looks clean, tool forwarding works through the native video path, kwargs are centralized before branching, and the benchmarks show 3.5x speedup. approving and merging

@waybarrios waybarrios merged commit 2a79216 into waybarrios:main Mar 21, 2026
raullenchai pushed a commit to raullenchai/Rapid-MLX that referenced this pull request Mar 26, 2026
…ection, served-model-name

Merge 16 upstream commits (22dcbf8..d235c37) into our fork:

- feat: SpecPrefill — attention-based sparse prefill for TTFT reduction (waybarrios#180)
- feat: native Qwen3-VL video pipeline with temporal 3D conv + M-RoPE (waybarrios#150)
- fix: Disable MambaCache monkey-patch for hybrid models, add MTP auto-injection (waybarrios#97)
- feat: Add --served-model-name CLI parameter (waybarrios#125)
- feat: Add Qwen3.5 text-only loading and dynamic memory threshold (waybarrios#127)
- fix(mllm_scheduler): add adaptive periodic cache clearing (waybarrios#157)
- fix: Metal resource leak under high concurrency (waybarrios#92)

Conflict resolution strategy: keep all fork features (DeltaNet snapshots,
fast SSE templates, tool injection, cloud routing, prompt cache, etc.)
while incorporating upstream's new functionality.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Labels

enhancement New feature or request UNDER REVIEW


3 participants