
fix(gemma3): remove auto SLIDING_WINDOW=0 that breaks multimodal#11

Closed
lubauss wants to merge 6 commits into waybarrios:main from
lubauss:patch/fixgemma3-remove-auto-slidingwindow0-that-breaks-m

Conversation

@lubauss
Contributor

@lubauss lubauss commented Jan 20, 2026

Summary

Synced from local development patches.

Files changed

  • vllm_mlx/engine/batched.py
  • vllm_mlx/api/utils.py

🤖 Generated with Claude Code

lubauss and others added 6 commits January 19, 2026 20:45
* fix: Enable vision and streaming for MLLM models

This patch fixes two critical issues with multimodal language models (MLLM):

## Vision Fix (server.py, simple.py)
- Preserve original messages when calling MLLM models
- The engine was passing only the prompt string, losing image data
- Now passes full message objects with images to MLLM.chat()

## Streaming Fix (mllm.py, simple.py)
- Add stream_chat() method to MLLMMultimodalLM class
- Uses mlx_vlm.stream_generate() for true token-by-token streaming
- Update engine to call stream_chat() for MLLM models
- Properly yields GenerationOutput with new_text for SSE streaming
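The streaming flow above can be sketched roughly as follows. This is a hypothetical illustration, not the PR's actual code: `GenerationOutput` here is a stand-in dataclass mirroring the description, and the token source stands in for whatever `mlx_vlm.stream_generate()` yields.

```python
from dataclasses import dataclass
from typing import Callable, Iterator

@dataclass
class GenerationOutput:
    # Hypothetical container mirroring the PR's description: the engine
    # yields one of these per decoded chunk for SSE streaming.
    new_text: str
    finished: bool = False

def stream_chat(token_stream: Callable[[], Iterator[str]]) -> Iterator[GenerationOutput]:
    """Wrap a token-by-token text stream (e.g. from a VLM generator)
    into GenerationOutput chunks, marking the final one as finished."""
    prev = None
    for text in token_stream():
        if prev is not None:
            yield GenerationOutput(new_text=prev)
        prev = text  # buffer one chunk so we know which one is last
    if prev is not None:
        yield GenerationOutput(new_text=prev, finished=True)
```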

Tested with:
- mlx-community/Qwen3-VL-30B-A3B-Instruct-4bit
- Text streaming: 5 tokens streamed correctly
- Vision streaming: Image analysis works with streaming

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* feat: Add Gemma 3 to MLLM detection patterns

Gemma 3 models are multimodal but weren't being detected as VLMs.
This adds "gemma-3" and "gemma3" to MLLM_PATTERNS so vllm-mlx
correctly loads them with vision support via mlx-vlm.
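A minimal sketch of the substring-based detection this commit extends. The pattern list below is illustrative (the real `MLLM_PATTERNS` lives in `vllm_mlx/api/utils.py` and may contain other entries):

```python
# Hypothetical subset of MLLM_PATTERNS; the commit adds the two Gemma entries.
MLLM_PATTERNS = ["qwen2-vl", "qwen3-vl", "llava", "gemma-3", "gemma3"]

def is_mllm(model_id: str) -> bool:
    """Return True if the model id matches a known multimodal pattern."""
    name = model_id.lower()
    return any(pattern in name for pattern in MLLM_PATTERNS)
```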

Tested with mlx-community/gemma-3-27b-it-4bit:
- Vision: ✅ Working (cat, Kali, Ganesha images)
- Streaming: ✅ Working (40 chunks)
- Long context: ✅ Up to ~5K tokens

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* docs: Add Gemma 3 support section with long context patch instructions

- Document Gemma 3 MLLM detection (already patched in utils.py)
- Add mlx-vlm long context patch for GEMMA3_SLIDING_WINDOW env var
- Include benchmark results showing 5x improvement (10K → 50K tokens)
- Explain Metal GPU timeout limitation and workaround
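The env-var override described above might look like the following sketch. The function name and config plumbing are assumptions; only the `GEMMA3_SLIDING_WINDOW` variable name comes from the commit:

```python
import os

def resolve_sliding_window(config_value: int) -> int:
    """Let GEMMA3_SLIDING_WINDOW override the model config's sliding
    window so long prompts aren't constrained by the default."""
    override = os.environ.get("GEMMA3_SLIDING_WINDOW")
    if override is not None:
        return int(override)
    return config_value
```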

* feat: Enable continuous batching for MLLM models

This patch enables continuous batching (with prefix caching) for
multimodal LLM models like Qwen3-VL and Gemma 3.

Changes:
- Add MLLMModelWrapper to extract logits from LanguageModelOutput
- Fix tokenizer.encode to work with processors (Qwen3VLProcessor)
- Fix tokenizer.decode to use nested tokenizer for processors
- Fix _get_stop_tokens to check both processor and tokenizer
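The wrapper idea in the first bullet can be sketched like this (hypothetical shape; the real class sits in `vllm_mlx/engine/batched.py`): the batched engine expects raw logits, while mlx-vlm models return a structured output object, so the wrapper unwraps it.

```python
class MLLMModelWrapper:
    """Adapt a multimodal model for the batched engine by extracting
    logits from a LanguageModelOutput-style result."""

    def __init__(self, model):
        self.model = model

    def __call__(self, *args, **kwargs):
        output = self.model(*args, **kwargs)
        # Structured outputs carry a .logits attribute; plain models
        # already return the logits tensor directly.
        return getattr(output, "logits", output)
```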

Performance improvement on M4 Max 128GB with Qwen3-VL-30B:
- First request (cache miss): ~22s for 17K tokens
- Subsequent requests (cache hit): ~0.8-1.2s
- Speedup: 10-28x faster with prefix caching

Multi-turn conversation (6 turns, 90K char document):
- 90.7% faster on average
- 10.76x speedup vs uncached

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
…ching (#4)

Gemma 3's model __call__() requires pixel_values as a positional argument,
unlike Qwen2-VL which makes it optional. This caused "missing required
positional argument: 'pixel_values'" errors when using continuous batching
with text-only requests.

The MLLMModelWrapper now injects pixel_values=None for text-only requests,
enabling Gemma 3 to work with continuous batching and prefix caching.
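The injection can be sketched as a thin call adapter (hypothetical names; the actual change lives inside MLLMModelWrapper):

```python
class TextOnlyCallWrapper:
    """Forward pixel_values explicitly (None for text-only requests) so
    models like Gemma 3, whose __call__ requires the argument, don't
    raise a missing-positional-argument TypeError."""

    def __init__(self, model):
        self.model = model

    def __call__(self, input_ids, pixel_values=None, **kwargs):
        return self.model(input_ids, pixel_values=pixel_values, **kwargs)
```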

Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
…batch mode (#5)

Synced from local patches in .venv-vllm-mlx

Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
#6)

Synced from local patches in .venv-vllm-mlx

Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
Synced from local patches in .venv-vllm-mlx

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
@lubauss lubauss closed this Jan 20, 2026
@lubauss lubauss deleted the patch/fixgemma3-remove-auto-slidingwindow0-that-breaks-m branch January 20, 2026 01:09
waybarrios added a commit that referenced this pull request Jan 26, 2026
* Fix --api-key argument for serve command (fixes #7)

* Document --api-key, --rate-limit and --timeout options in CLI reference

* fix: Enable vision and streaming for MLLM models + Gemma 3 support (#2)


* fix: disable skip_prompt_processing for multimodal to prevent garbled output

For MLLM with images, skip_prompt_processing cannot be used because:
- Vision encoder must run each time to provide visual context
- The skip path only calls language_model() which has no vision
- Using it produces garbled output like 'TheTheTheThe...'

Text-only caching still works with 6x+ speedup.
Multimodal correctly gets no speedup but produces coherent output.
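The guard reduces to a simple predicate, sketched here with assumed names:

```python
def should_skip_prompt_processing(has_cached_prefix: bool, has_images: bool) -> bool:
    """The cached "skip" path only runs the language model, so it is
    safe for text-only requests but must be bypassed for images."""
    if has_images:
        return False  # vision encoder must run; skipping drops visual context
    return has_cached_prefix
```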

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

---------

Co-authored-by: Wayner Barrios <waybarrios@gmail.com>
Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
raullenchai referenced this pull request in raullenchai/Rapid-MLX Feb 25, 2026
Logprobs: Propagate mlx-lm per-token logprobs through the full stack
(StreamingOutput → GenerationOutput → API response). Supports both
streaming and non-streaming chat completions with top_logprobs (0-20).

Structural tags: Extend MiniMaxToolLogitsProcessor with parameter value
schema tracking — biases toward valid JSON types (string/number/boolean/
object/array) at the start of parameter values. SimpleEngine gets
post-generation validation with warning logs for schema mismatches.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
sean-esk pushed a commit to sean-esk/vllm-mlx that referenced this pull request Mar 3, 2026
…rios#11)

* fix: 5 bugs from code review — init crash, JSON corruption, GC leak, cloud gaps

1. SimpleEngine._inject_shared_model: set missing MLXLanguageModel attributes
   (prefill_step_size, kv_bits, kv_group_size, _prompt_cache, _cached_token_ids,
   _cache_lock) that __new__ skips, preventing AttributeError on first generate

2. Non-streaming chat: guard extract_json_from_response with `if response_format`
   so plain text responses aren't corrupted by JSON extraction

3. stream_chat_completion: wrap generator body in try/finally so gc.enable()
   runs even on client disconnect, preventing permanent GC disable

4. Cloud streaming: wrap with _disconnect_guard like local streaming path

5. Cloud routing: forward response_format to cloud provider so structured
   output works consistently regardless of routing decision
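Fix 3 is a general pattern worth showing. A minimal sketch (function name assumed) of wrapping a streaming generator so garbage collection is restored even when the client disconnects mid-stream:

```python
import gc

def stream_with_gc_guard(chunks):
    """Disable GC during the hot streaming loop, but restore it in
    finally so a client disconnect (GeneratorExit) can't leave GC off
    for the rest of the process."""
    gc.disable()
    try:
        for chunk in chunks:
            yield chunk
    finally:
        gc.enable()
```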

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: cloud routing sends pre-mutation messages and forwards stop/tool_choice

Cloud routing was using locally-mutated messages (tool→user conversion,
developer→system normalization, suffix injection) instead of original
OpenAI-format messages. Also forward stop and tool_choice parameters.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: prefix cache pin stability and guided.py return types

PrefixCacheManager.pin_prefix was silently undone by store_cache and
_touch_lru re-adding entries to LRU. Added _pinned set to track pinned
entries, ensuring they stay out of LRU. Pinned entries now count toward
capacity to prevent unbounded cache growth.
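The pinning scheme can be sketched as follows (hypothetical simplification of PrefixCacheManager): pinned keys live outside the LRU order, so `store`/touch can't silently re-add them, and pinned entries still count toward capacity.

```python
from collections import OrderedDict

class PrefixCache:
    def __init__(self, max_size: int):
        self.max_size = max_size
        self._entries = {}
        self._lru = OrderedDict()   # evictable keys only, oldest first
        self._pinned = set()

    def store(self, key, value) -> bool:
        if key not in self._entries and len(self._entries) >= self.max_size:
            if not self._lru:
                return False  # everything pinned; refuse to exceed capacity
            old, _ = self._lru.popitem(last=False)  # never evicts pinned keys
            del self._entries[old]
        self._entries[key] = value
        if key not in self._pinned:
            self._lru[key] = None
        return True

    def pin(self, key) -> bool:
        if len(self._pinned) >= self.max_size:
            return False  # capacity guard: reject rather than overcommit
        self._pinned.add(key)
        self._lru.pop(key, None)  # keep pinned keys out of eviction order
        return True
```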

Fixed generate_json/generate_json_object return type from str to str|None
to match actual behavior (returns None on failure).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: rate limiter stale key cleanup and demote user content log to DEBUG

RateLimiter._requests dict grew unbounded with unique client keys that
stopped making requests. Added periodic purge of stale keys when dict
exceeds 100 entries.

Demoted user message preview logging from INFO to DEBUG to prevent
PII/sensitive content from appearing in production logs.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: cloud response_format passthrough, inject config values, pin cap

1. CloudRouter._build_call_kwargs now forwards response_format to
   litellm so structured output works on cloud-routed requests.

2. _inject_shared_model uses engine config (self._prefill_step_size,
   self._kv_bits, self._kv_group_size) instead of hardcoded defaults.

3. pin_prefix rejects when pinned count reaches max_size, preventing
   capacity from becoming unenforceable.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* test: regression tests for cloud response_format, inject config, pin cap

- test_passes_through_response_format: verifies response_format is
  forwarded through _build_call_kwargs (was silently dropped)
- TestInjectSharedModelConfig: verifies _inject_shared_model propagates
  engine config (prefill_step_size, kv_bits, kv_group_size) instead of
  hardcoded defaults
- TestPrefixCachePinning: verifies pin survives store/touch, capacity
  guard rejects at max_size, unpin restores evictability, clear resets

Also adds docstring note to pin_prefix about capacity policy.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* test: rate limiter stale purge and JSON extraction guard coverage

- test_rate_limiter_stale_key_purge: verifies stale client keys are
  purged when dict exceeds 100 entries
- TestExtractJsonFromResponse: documents why extract_json_from_response
  must be guarded by `if response_format` — it corrupts plain text
  that ends with balanced braces

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

---------

Co-authored-by: Your Name <you@example.com>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>