fix(mllm_scheduler): add adaptive periodic cache clearing#157
Merged
waybarrios merged 1 commit into waybarrios:main on Mar 20, 2026
Conversation
janhilgard added a commit to janhilgard/vllm-mlx that referenced this pull request on Mar 21, 2026
…sion

Cherry-pick upstream merged fixes + local improvements:
- PR waybarrios#92: Eager-eval batch.tokens after mx.concatenate() to release Metal AGXAllocation handles; adaptive cache-clear interval scales with concurrency
- PR waybarrios#154: Drain the self.requests dict in MLLM _cleanup_finished() to prevent linear memory growth; add mx.clear_cache() after cleanup
- PR waybarrios#157: Port adaptive periodic mx.clear_cache() from the LLM scheduler to the MLLM scheduler (interval scales inversely with active sequences)
- PR waybarrios#124: Forward tool definitions through MLLM chat/stream_chat paths in SimpleEngine and MLXMultimodalLM (get_chat_template)
- PR waybarrios#126: Hash the full base64 string with SHA-256 instead of MD5 on the first 1000 chars to prevent cross-request image-cache collisions

Additional fixes:
- batched.py: Disable thinking mode for coder models; promote MLLM stats
- mllm_batch_generator.py: Downgrade the prompt-size guard from error to warning
- qwen3_parser.py: Treat tagless output as reasoning (max_tokens truncation)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
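The image-cache fix cherry-picked from PR waybarrios#126 above can be sketched as follows. The helper name `image_cache_key` is an illustrative assumption, not the actual vllm-mlx API:

```python
import hashlib

def image_cache_key(image_b64: str) -> str:
    # Hypothetical helper illustrating the PR waybarrios#126 fix.
    # Old behavior: hashlib.md5(image_b64[:1000].encode()).hexdigest()
    # collides whenever two images share a 1000-char base64 prefix
    # (easy to hit, since image headers are near-identical).
    # New behavior: hash the *entire* payload with SHA-256.
    return hashlib.sha256(image_b64.encode("utf-8")).hexdigest()
```

Two images that share a long base64 prefix now map to distinct cache keys, whereas the old prefix-MD5 scheme returned the same key for both.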
raullenchai pushed a commit to raullenchai/Rapid-MLX that referenced this pull request on Mar 26, 2026
…ection, served-model-name

Merge 16 upstream commits (22dcbf8..d235c37) into our fork:
- feat: SpecPrefill — attention-based sparse prefill for TTFT reduction (waybarrios#180)
- feat: Native Qwen3-VL video pipeline with temporal 3D conv + M-RoPE (waybarrios#150)
- fix: Disable the MambaCache monkey-patch for hybrid models; add MTP auto-injection (waybarrios#97)
- feat: Add --served-model-name CLI parameter (waybarrios#125)
- feat: Add Qwen3.5 text-only loading and dynamic memory threshold (waybarrios#127)
- fix(mllm_scheduler): Add adaptive periodic cache clearing (waybarrios#157)
- fix: Metal resource leak under high concurrency (waybarrios#92)

Conflict-resolution strategy: keep all fork features (DeltaNet snapshots, fast SSE templates, tool injection, cloud routing, prompt cache, etc.) while incorporating upstream's new functionality.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Summary
Adds adaptive periodic mx.clear_cache() to MLLMScheduler.step(), with the step counter tracked alongside the SchedulerContext.
Follow-up to #154 and @waybarrios' suggestion.
MLLMScheduler currently clears the Metal buffer pool only when requests finish. That covers cleanup at completion time, but long multimodal generations can run for many scheduler steps without any completions, so freed memory accumulates in the MLX Metal buffer pool.

The LLM Scheduler already addresses this with adaptive periodic cache clearing. This change ports that interval logic to the MLLM path: increment a step counter and call mx.clear_cache() every N steps, with N decreasing as concurrency rises.

The existing clear-on-finish behavior remains in place; this change adds a second cleanup path for long-running generations that stay active across many scheduler steps.
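A minimal sketch of the interval logic described above, using MLX's real `mx.clear_cache()`. The class shape, attribute names, and constants (`BASE_CLEAR_INTERVAL`, `MIN_CLEAR_INTERVAL`) are illustrative assumptions, not the actual scheduler code:

```python
BASE_CLEAR_INTERVAL = 64  # steps between clears at concurrency 1 (assumed value)
MIN_CLEAR_INTERVAL = 8    # floor so we never clear on every step (assumed value)

def clear_interval(active_seqs: int,
                   base: int = BASE_CLEAR_INTERVAL,
                   floor: int = MIN_CLEAR_INTERVAL) -> int:
    """N shrinks as the number of active sequences rises, clamped to a floor."""
    return max(floor, base // max(1, active_seqs))

class MLLMScheduler:
    """Skeleton showing only the periodic-clear path; real scheduling elided."""

    def __init__(self):
        self.requests = {}   # active request id -> state (illustrative)
        self._step_count = 0

    def step(self):
        # ... batch scheduling / decode work would happen here ...
        self._step_count += 1
        if self._step_count % clear_interval(len(self.requests)) == 0:
            import mlx.core as mx  # imported lazily; assumes MLX is installed
            mx.clear_cache()       # release pooled Metal buffers
```

At one active sequence the pool is cleared every 64 steps; at eight it drops to every 8 steps and stays there, so high-concurrency workloads reclaim memory sooner without paying the clear cost on every step.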