
Fix asyncio.get_event_loop() deprecation for Python 3.10+#2

Merged
waybarrios merged 1 commit into main from fix/asyncio-event-loop-deprecation on Jan 8, 2026

Conversation

@waybarrios
Owner

Summary

Fixes #1

  • Replace deprecated asyncio.get_event_loop() with asyncio.new_event_loop() + asyncio.set_event_loop() for Python 3.10+ compatibility
  • This resolves the issue when running vllm-mlx serve in simple mode

Changes

  • vllm_mlx/server.py: Use new_event_loop() pattern instead of deprecated get_event_loop()
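The patched code isn't shown in the PR body; below is a minimal sketch of the replacement pattern it describes. `serve()` is a stand-in for the server's actual async entry point, not vllm-mlx's real function.

```python
import asyncio

async def serve():
    # stand-in for the server's real async entry point
    return "ok"

# asyncio.get_event_loop() emits a DeprecationWarning on Python 3.10+
# (and raises on 3.12+) when called with no running event loop, so the
# fix creates and installs a fresh loop explicitly:
loop = asyncio.new_event_loop()
asyncio.set_event_loop(loop)
try:
    result = loop.run_until_complete(serve())
finally:
    loop.close()
```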

Test

vllm-mlx serve mlx-community/Llama-3.2-3B-Instruct-4bit --port 8000

Thanks @rhychung for reporting and suggesting the fix!

@waybarrios
Owner Author

Benchmark Results

Tested with the model from the issue report:

============================================================
BENCHMARK RESULTS
============================================================

Model          mlx-community/Llama-3.2-3B-Instruct-4bit
Hardware       M4 Max (128 GB)
Total Runs     3
Input Tokens   21
Output Tokens  283
Total Time     1.58s

Performance Metrics:
Metric                        Mean         P95/Max
----------------------------  -----------  -----------
TTFT (Time to First Token)    58.9 ms      62.1 ms
TPOT (Time Per Output Token)  4.99 ms      5.01 ms
Generation Speed              200.3 tok/s  200.6 tok/s
Processing Speed              118.5 tok/s  -
Latency (per request)         0.52s        0.55s

Throughput:
Total Throughput  192.9 tok/s
Requests/Second   1.90 req/s

Resource Usage:
Process Memory (peak)  2.43 GB
MLX Peak Memory        1.72 GB
MLX Cache Memory       0.03 GB
System Memory          41.3 / 128 GB (32%)
============================================================

All tests pass and the benchmark runs successfully with the fix.

@waybarrios waybarrios merged commit 1df74fa into main Jan 8, 2026
waybarrios pushed a commit that referenced this pull request Jan 26, 2026
* fix: Enable vision and streaming for MLLM models

This patch fixes two critical issues with multimodal language models (MLLM):

## Vision Fix (server.py, simple.py)
- Preserve original messages when calling MLLM models
- The engine was passing only the prompt string, losing image data
- Now passes full message objects with images to MLLM.chat()

## Streaming Fix (mllm.py, simple.py)
- Add stream_chat() method to MLLMMultimodalLM class
- Uses mlx_vlm.stream_generate() for true token-by-token streaming
- Update engine to call stream_chat() for MLLM models
- Properly yields GenerationOutput with new_text for SSE streaming

Tested with:
- mlx-community/Qwen3-VL-30B-A3B-Instruct-4bit
- Text streaming: 5 tokens streamed correctly
- Vision streaming: Image analysis works with streaming

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
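The streaming shape described above can be sketched without mlx-vlm installed. Everything here is illustrative: `token_source` stands in for the `mlx_vlm.stream_generate()` call, and `GenerationOutput` mirrors only the `new_text` field the commit mentions.

```python
from dataclasses import dataclass
from typing import Dict, Iterator, List

@dataclass
class GenerationOutput:
    # illustrative stand-in; only the new_text field from the commit is shown
    new_text: str

def stream_chat(messages: List[Dict],
                token_source: Iterator[str]) -> Iterator[GenerationOutput]:
    # The real method would pass the full messages (including images) to
    # mlx_vlm.stream_generate(); token_source stands in for that call so
    # the token-by-token SSE loop is visible.
    for token in token_source:
        yield GenerationOutput(new_text=token)

chunks = [out.new_text for out in
          stream_chat([{"role": "user", "content": "hi"}],
                      iter(["Hel", "lo"]))]
```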

* feat: Add Gemma 3 to MLLM detection patterns

Gemma 3 models are multimodal but weren't being detected as VLMs.
This adds "gemma-3" and "gemma3" to MLLM_PATTERNS so vllm-mlx
correctly loads them with vision support via mlx-vlm.

Tested with mlx-community/gemma-3-27b-it-4bit:
- Vision: ✅ Working (cat, Kali, Ganesha images)
- Streaming: ✅ Working (40 chunks)
- Long context: ✅ Up to ~5K tokens

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
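As a rough sketch of this kind of substring-based detection (the list contents beyond "gemma-3"/"gemma3" and the function name are illustrative, not vllm-mlx's actual utils):

```python
# Illustrative pattern list; only "gemma-3"/"gemma3" come from the commit above.
MLLM_PATTERNS = ["gemma-3", "gemma3", "qwen3-vl", "llava"]

def is_mllm(model_id: str) -> bool:
    # case-insensitive substring match against the model repo name
    name = model_id.lower()
    return any(pattern in name for pattern in MLLM_PATTERNS)

detected = is_mllm("mlx-community/gemma-3-27b-it-4bit")          # True
text_only = is_mllm("mlx-community/Llama-3.2-3B-Instruct-4bit")  # False
```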

* docs: Add Gemma 3 support section with long context patch instructions

- Document Gemma 3 MLLM detection (already patched in utils.py)
- Add mlx-vlm long context patch for GEMMA3_SLIDING_WINDOW env var
- Include benchmark results showing 5x improvement (10K → 50K tokens)
- Explain Metal GPU timeout limitation and workaround

---------

Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
waybarrios added a commit that referenced this pull request Jan 26, 2026
* Fix --api-key argument for serve command (fixes #7)

* Document --api-key, --rate-limit and --timeout options in CLI reference

* fix: Enable vision and streaming for MLLM models + Gemma 3 support (#2)

* fix: Enable vision and streaming for MLLM models

This patch fixes two critical issues with multimodal language models (MLLM):

## Vision Fix (server.py, simple.py)
- Preserve original messages when calling MLLM models
- The engine was passing only the prompt string, losing image data
- Now passes full message objects with images to MLLM.chat()

## Streaming Fix (mllm.py, simple.py)
- Add stream_chat() method to MLLMMultimodalLM class
- Uses mlx_vlm.stream_generate() for true token-by-token streaming
- Update engine to call stream_chat() for MLLM models
- Properly yields GenerationOutput with new_text for SSE streaming

Tested with:
- mlx-community/Qwen3-VL-30B-A3B-Instruct-4bit
- Text streaming: 5 tokens streamed correctly
- Vision streaming: Image analysis works with streaming

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* feat: Add Gemma 3 to MLLM detection patterns

Gemma 3 models are multimodal but weren't being detected as VLMs.
This adds "gemma-3" and "gemma3" to MLLM_PATTERNS so vllm-mlx
correctly loads them with vision support via mlx-vlm.

Tested with mlx-community/gemma-3-27b-it-4bit:
- Vision: ✅ Working (cat, Kali, Ganesha images)
- Streaming: ✅ Working (40 chunks)
- Long context: ✅ Up to ~5K tokens

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* docs: Add Gemma 3 support section with long context patch instructions

- Document Gemma 3 MLLM detection (already patched in utils.py)
- Add mlx-vlm long context patch for GEMMA3_SLIDING_WINDOW env var
- Include benchmark results showing 5x improvement (10K → 50K tokens)
- Explain Metal GPU timeout limitation and workaround

---------

Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>

* fix: disable skip_prompt_processing for multimodal to prevent garbled output

For MLLM with images, skip_prompt_processing cannot be used because:
- Vision encoder must run each time to provide visual context
- The skip path only calls language_model() which has no vision
- Using it produces garbled output like 'TheTheTheThe...'

Text-only caching still works with 6x+ speedup.
Multimodal correctly gets no speedup but produces coherent output.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
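The gating rule described above can be sketched as follows; the function name and signature are hypothetical, the real check lives inside the engine.

```python
def can_skip_prompt_processing(has_images: bool, cache_hit: bool) -> bool:
    # The skip path only replays the cached language-model state; the vision
    # encoder never runs on that path, so any request carrying images must
    # process the full prompt again or the output degenerates
    # ('TheTheTheThe...').
    return cache_hit and not has_images

text_fast = can_skip_prompt_processing(has_images=False, cache_hit=True)
vision_full = can_skip_prompt_processing(has_images=True, cache_hit=True)
```

Text-only requests with a cache hit keep the 6x+ speedup; multimodal requests always take the full pass.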

---------

Co-authored-by: Wayner Barrios <waybarrios@gmail.com>
Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
@waybarrios waybarrios deleted the fix/asyncio-event-loop-deprecation branch February 3, 2026 01:21
WainWong pushed a commit to WainWong/vllm-mlx that referenced this pull request Mar 2, 2026
feat: Tier 1 optimizations — streaming tool fix, frequency-aware cache, block reuse


Development

Successfully merging this pull request may close these issues.

Simple mode (single user, max throughput) failed with vllm-mlx serve mlx-community/Llama-3.2-3B-Instruct-4bit --port 8000
