Closed
Changes from all commits (32 commits)
6d6f085
feat: Add Anthropic Messages API endpoint with tool call parser fallback
janhilgard Feb 6, 2026
a158cef
Fix VRAM spike at end of generation via lazy-eval-safe cache handling
janhilgard Feb 6, 2026
af46e0d
Add mid-prefill cache saving, disconnect detection, and cancellation …
janhilgard Feb 6, 2026
3b6f756
Fix lint: rename ambiguous variable `l`, remove unused import
janhilgard Feb 6, 2026
dd00521
Fix mid-prefill cache restore: convert BatchKVCache to KVCache
janhilgard Feb 6, 2026
44b1cea
Fix cache fetch: prefer supersequence match over prefix match
janhilgard Feb 6, 2026
ae687bb
Add prefix cache sharing for requests with common prefix
janhilgard Feb 6, 2026
49ef8e6
Fix Metal SIGABRT crash when aborting chunked prefill mid-stream
janhilgard Feb 6, 2026
21e282e
Abort chunked prefill immediately on client disconnect
janhilgard Feb 7, 2026
aa97fe4
Fix streaming tool call parsing: emit structured delta.tool_calls
janhilgard Feb 7, 2026
3452b44
Fix nested JSON serialization in Nemotron XML tool parser
janhilgard Feb 7, 2026
0ce82e9
Fix prompt cache for hybrid Mamba+Transformer models
janhilgard Feb 7, 2026
06b7af4
Add prefix-subset eviction to reduce cache memory ~6x
janhilgard Feb 7, 2026
e9791ef
Fix multi-turn tool calling: enable native tool format for Hermes parser
janhilgard Feb 7, 2026
c714334
Fix <|im_end|> leaking into streaming OpenAI responses
janhilgard Feb 7, 2026
e7cf750
Fix stop token leaking as text: skip decoding EOS tokens in streaming
janhilgard Feb 7, 2026
65c5830
Fix prompt cache for chunked prefill (enables Anthropic endpoint cach…
janhilgard Feb 7, 2026
43ce701
Stop clearing prefix cache on sampler param changes
janhilgard Feb 7, 2026
b921a5e
Fix crash loop on cache shape mismatch after BatchGenerator recreation
janhilgard Feb 7, 2026
0ec2faf
Fix prefix cache for agentic multi-turn on hybrid Mamba+Transformer m…
janhilgard Feb 7, 2026
3f8b006
Optimize inference: bisect cache lookup, remove deepcopy, reduce clea…
janhilgard Feb 8, 2026
52977e9
Add GET /v1/status endpoint with real-time per-request monitoring
janhilgard Feb 8, 2026
ccba569
Merge origin/main into feature/anthropic-endpoint
janhilgard Feb 8, 2026
25557e2
fix: update native tool format test for HermesToolParser
janhilgard Feb 8, 2026
34e2e93
fix: add missing --default-temperature and --default-top-p CLI args
janhilgard Feb 8, 2026
a7ecc45
Pass request context to tool parsers for tool name validation
waybarrios Feb 8, 2026
2902742
Add Anthropic endpoint docs, tests, and CI integration
waybarrios Feb 8, 2026
2f06a7a
Fix black formatting in test_anthropic_adapter
waybarrios Feb 8, 2026
50ca1f0
Expand Anthropic endpoint and status docs with full examples
waybarrios Feb 8, 2026
d9c97fa
fix: route text-only requests through MLLM scheduler for vision models
janhilgard Feb 9, 2026
4a62193
Merge origin/main into fix/mllm-text-only-crash
janhilgard Feb 9, 2026
7bf1881
Merge remote-tracking branch 'origin/main' into fix/mllm-text-only-crash
janhilgard Feb 10, 2026
14 changes: 8 additions & 6 deletions vllm_mlx/engine/batched.py
@@ -419,8 +419,9 @@ async def generate(
         if not self._loaded:
             await self.start()

-        if self._is_mllm and self._mllm_scheduler and (images or videos):
-            # Use MLLM scheduler for multimodal
+        if self._is_mllm and self._mllm_scheduler:
+            # Use MLLM scheduler for all requests on vision models
+            # (both multimodal and text-only, since LLM engine is not loaded for MLLM)
             output = await self._mllm_scheduler.generate(
                 prompt=prompt,
                 images=images,
@@ -437,7 +438,7 @@
                 finish_reason=output.finish_reason,
             )

-        # Use LLM engine for text-only
+        # Use LLM engine for text-only (non-MLLM models)
         from ..request import SamplingParams

         sampling_params = SamplingParams(
@@ -491,8 +492,9 @@ async def stream_generate(
         if not self._loaded:
             await self.start()

-        if self._is_mllm and self._mllm_scheduler and (images or videos):
-            # Use MLLM scheduler for multimodal streaming
+        if self._is_mllm and self._mllm_scheduler:
+            # Use MLLM scheduler for all requests on vision models
+            # (both multimodal and text-only, since LLM engine is not loaded for MLLM)
             request_id = await self._mllm_scheduler.add_request_async(
                 prompt=prompt,
                 images=images,
@@ -513,7 +515,7 @@
             )
             return

-        # Use LLM engine for text-only
+        # Use LLM engine for text-only (non-MLLM models)
         from ..request import SamplingParams

         sampling_params = SamplingParams(
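The routing change in both hunks follows the same rule: on a vision (MLLM) model the text-only LLM engine is never loaded, so every request, with or without media, must be dispatched to the MLLM scheduler. A minimal sketch of that dispatch rule (class and method names here are hypothetical, not taken from the PR):

```python
# Hypothetical sketch of the routing fix: vision models always use the
# MLLM scheduler, because no separate text-only LLM engine is loaded.

class EngineRouter:
    def __init__(self, is_mllm, mllm_scheduler=None):
        self._is_mllm = is_mllm
        self._mllm_scheduler = mllm_scheduler

    def pick_backend(self, images=None, videos=None):
        # Old behavior: `if self._is_mllm and scheduler and (images or videos)`
        # sent text-only requests on vision models to the missing LLM engine.
        # New behavior: drop the media check for MLLM models.
        if self._is_mllm and self._mllm_scheduler is not None:
            return "mllm"
        return "llm"


router = EngineRouter(is_mllm=True, mllm_scheduler=object())
print(router.pick_backend())                     # text-only -> "mllm"
print(router.pick_backend(images=["a.png"]))     # multimodal -> "mllm"
print(EngineRouter(is_mllm=False).pick_backend())  # text model -> "llm"
```

With the old condition, the first call would have returned "llm" and crashed, since vision models never instantiate that engine.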