
Fix asyncio.get_event_loop() deprecation for Python 3.10+#2

Merged
waybarrios merged 1 commit into main from fix/asyncio-event-loop-deprecation on Jan 8, 2026

Conversation

@waybarrios
Owner

Summary

Fixes #1

  • Replace deprecated asyncio.get_event_loop() with asyncio.new_event_loop() + asyncio.set_event_loop() for Python 3.10+ compatibility
  • This resolves the issue when running vllm-mlx serve in simple mode

Changes

  • vllm_mlx/server.py: Use new_event_loop() pattern instead of deprecated get_event_loop()
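The patched code isn't shown in the PR body; below is a minimal sketch of the replacement pattern it describes. `serve()` is a stand-in for the server's actual async entry point, not vllm-mlx's real function.

```python
import asyncio

async def serve():
    # stand-in for the server's real async entry point
    return "ok"

# asyncio.get_event_loop() emits a DeprecationWarning on Python 3.10+
# (and raises on 3.12+) when called with no running event loop, so the
# fix creates and installs a fresh loop explicitly:
loop = asyncio.new_event_loop()
asyncio.set_event_loop(loop)
try:
    result = loop.run_until_complete(serve())
finally:
    loop.close()
```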

Test

vllm-mlx serve mlx-community/Llama-3.2-3B-Instruct-4bit --port 8000

Thanks @rhychung for reporting and suggesting the fix!

@waybarrios
Owner Author

Benchmark Results

Tested with the model from the issue report:

============================================================
BENCHMARK RESULTS
============================================================

Model          mlx-community/Llama-3.2-3B-Instruct-4bit
Hardware       M4 Max (128 GB)
Total Runs     3
Input Tokens   21
Output Tokens  283
Total Time     1.58s

Performance Metrics:
Metric                        Mean         P95/Max
----------------------------  -----------  -----------
TTFT (Time to First Token)    58.9 ms      62.1 ms
TPOT (Time Per Output Token)  4.99 ms      5.01 ms
Generation Speed              200.3 tok/s  200.6 tok/s
Processing Speed              118.5 tok/s  -
Latency (per request)         0.52s        0.55s

Throughput:
Total Throughput  192.9 tok/s
Requests/Second   1.90 req/s

Resource Usage:
Process Memory (peak)  2.43 GB
MLX Peak Memory        1.72 GB
MLX Cache Memory       0.03 GB
System Memory          41.3 / 128 GB (32%)
============================================================

All tests pass and the benchmark runs successfully with the fix.

@waybarrios waybarrios merged commit 1df74fa into main Jan 8, 2026
waybarrios pushed a commit that referenced this pull request Jan 26, 2026
* fix: Enable vision and streaming for MLLM models

This patch fixes two critical issues with multimodal language models (MLLM):

## Vision Fix (server.py, simple.py)
- Preserve original messages when calling MLLM models
- The engine was passing only the prompt string, losing image data
- Now passes full message objects with images to MLLM.chat()

## Streaming Fix (mllm.py, simple.py)
- Add stream_chat() method to MLLMMultimodalLM class
- Uses mlx_vlm.stream_generate() for true token-by-token streaming
- Update engine to call stream_chat() for MLLM models
- Properly yields GenerationOutput with new_text for SSE streaming

Tested with:
- mlx-community/Qwen3-VL-30B-A3B-Instruct-4bit
- Text streaming: 5 tokens streamed correctly
- Vision streaming: Image analysis works with streaming

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
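The streaming shape described above can be sketched without mlx-vlm installed. Everything here is illustrative: `token_source` stands in for the `mlx_vlm.stream_generate()` call, and `GenerationOutput` mirrors only the `new_text` field the commit mentions.

```python
from dataclasses import dataclass
from typing import Dict, Iterator, List

@dataclass
class GenerationOutput:
    # illustrative stand-in; only the new_text field from the commit is shown
    new_text: str

def stream_chat(messages: List[Dict],
                token_source: Iterator[str]) -> Iterator[GenerationOutput]:
    # The real method would pass the full messages (including images) to
    # mlx_vlm.stream_generate(); token_source stands in for that call so
    # the token-by-token SSE loop is visible.
    for token in token_source:
        yield GenerationOutput(new_text=token)

chunks = [out.new_text for out in
          stream_chat([{"role": "user", "content": "hi"}],
                      iter(["Hel", "lo"]))]
```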

* feat: Add Gemma 3 to MLLM detection patterns

Gemma 3 models are multimodal but weren't being detected as VLMs.
This adds "gemma-3" and "gemma3" to MLLM_PATTERNS so vllm-mlx
correctly loads them with vision support via mlx-vlm.

Tested with mlx-community/gemma-3-27b-it-4bit:
- Vision: ✅ Working (cat, Kali, Ganesha images)
- Streaming: ✅ Working (40 chunks)
- Long context: ✅ Up to ~5K tokens

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
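As a rough sketch of this kind of substring-based detection (the list contents beyond "gemma-3"/"gemma3" and the function name are illustrative, not vllm-mlx's actual utils):

```python
# Illustrative pattern list; only "gemma-3"/"gemma3" come from the commit above.
MLLM_PATTERNS = ["gemma-3", "gemma3", "qwen3-vl", "llava"]

def is_mllm(model_id: str) -> bool:
    # case-insensitive substring match against the model repo name
    name = model_id.lower()
    return any(pattern in name for pattern in MLLM_PATTERNS)

detected = is_mllm("mlx-community/gemma-3-27b-it-4bit")          # True
text_only = is_mllm("mlx-community/Llama-3.2-3B-Instruct-4bit")  # False
```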

* docs: Add Gemma 3 support section with long context patch instructions

- Document Gemma 3 MLLM detection (already patched in utils.py)
- Add mlx-vlm long context patch for GEMMA3_SLIDING_WINDOW env var
- Include benchmark results showing 5x improvement (10K → 50K tokens)
- Explain Metal GPU timeout limitation and workaround

---------

Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
waybarrios added a commit that referenced this pull request Jan 26, 2026
* Fix --api-key argument for serve command (fixes #7)

* Document --api-key, --rate-limit and --timeout options in CLI reference

* fix: Enable vision and streaming for MLLM models + Gemma 3 support (#2)

* fix: Enable vision and streaming for MLLM models

This patch fixes two critical issues with multimodal language models (MLLM):

## Vision Fix (server.py, simple.py)
- Preserve original messages when calling MLLM models
- The engine was passing only the prompt string, losing image data
- Now passes full message objects with images to MLLM.chat()

## Streaming Fix (mllm.py, simple.py)
- Add stream_chat() method to MLLMMultimodalLM class
- Uses mlx_vlm.stream_generate() for true token-by-token streaming
- Update engine to call stream_chat() for MLLM models
- Properly yields GenerationOutput with new_text for SSE streaming

Tested with:
- mlx-community/Qwen3-VL-30B-A3B-Instruct-4bit
- Text streaming: 5 tokens streamed correctly
- Vision streaming: Image analysis works with streaming

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* feat: Add Gemma 3 to MLLM detection patterns

Gemma 3 models are multimodal but weren't being detected as VLMs.
This adds "gemma-3" and "gemma3" to MLLM_PATTERNS so vllm-mlx
correctly loads them with vision support via mlx-vlm.

Tested with mlx-community/gemma-3-27b-it-4bit:
- Vision: ✅ Working (cat, Kali, Ganesha images)
- Streaming: ✅ Working (40 chunks)
- Long context: ✅ Up to ~5K tokens

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* docs: Add Gemma 3 support section with long context patch instructions

- Document Gemma 3 MLLM detection (already patched in utils.py)
- Add mlx-vlm long context patch for GEMMA3_SLIDING_WINDOW env var
- Include benchmark results showing 5x improvement (10K → 50K tokens)
- Explain Metal GPU timeout limitation and workaround

---------

Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>

* fix: disable skip_prompt_processing for multimodal to prevent garbled output

For MLLM with images, skip_prompt_processing cannot be used because:
- Vision encoder must run each time to provide visual context
- The skip path only calls language_model() which has no vision
- Using it produces garbled output like 'TheTheTheThe...'

Text-only caching still works with 6x+ speedup.
Multimodal correctly gets no speedup but produces coherent output.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
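The gating rule described above can be sketched as follows; the function name and signature are hypothetical, the real check lives inside the engine.

```python
def can_skip_prompt_processing(has_images: bool, cache_hit: bool) -> bool:
    # The skip path only replays the cached language-model state; the vision
    # encoder never runs on that path, so any request carrying images must
    # process the full prompt again or the output degenerates
    # ('TheTheTheThe...').
    return cache_hit and not has_images

text_fast = can_skip_prompt_processing(has_images=False, cache_hit=True)
vision_full = can_skip_prompt_processing(has_images=True, cache_hit=True)
```

Text-only requests with a cache hit keep the 6x+ speedup; multimodal requests always take the full pass.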

---------

Co-authored-by: Wayner Barrios <waybarrios@gmail.com>
Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
@waybarrios waybarrios deleted the fix/asyncio-event-loop-deprecation branch February 3, 2026 01:21
WainWong pushed a commit to WainWong/vllm-mlx that referenced this pull request Mar 2, 2026
feat: Tier 1 optimizations — streaming tool fix, frequency-aware cache, block reuse


Development

Successfully merging this pull request may close these issues.

Simple mode (single user, max throughput) failed with vllm-mlx serve mlx-community/Llama-3.2-3B-Instruct-4bit --port 8000
