Add --served-model-name CLI parameter#125

Merged
waybarrios merged 3 commits into waybarrios:main from otarkhan:feature/served-model-name
Mar 21, 2026
Conversation

@otarkhan
Contributor

Summary

  • Adds --served-model-name option to vllm-mlx serve, matching vLLM's behavior. When set, API responses use the specified name instead of the model path (e.g., serve mlx-community/Llama-3.2-3B-Instruct-4bit but have it appear as my-llama in /v1/models and all responses). When omitted, behavior is unchanged.
  • Validates incoming request model names against the served name, returning a 404 if they don't match — previously the server accepted any model name without checking.
  • All API responses (/v1/completions, /v1/chat/completions, /v1/messages, and their streaming variants) now consistently use the served model name.

Changes

  • vllm_mlx/cli.py: Added --served-model-name argument and passed it through to load_model().
  • vllm_mlx/server.py: Added served_model_name parameter to load_model(), added _validate_model_name() helper, wired validation into all three endpoints, and replaced all response-facing uses of request.model with _model_name.
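The CLI wiring in vllm_mlx/cli.py might look roughly like this. This is an illustrative sketch using argparse; the actual parser structure and defaults in the repository may differ.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    # Hypothetical recreation of the `vllm-mlx serve` CLI wiring.
    # The flag name matches vLLM's --served-model-name.
    parser = argparse.ArgumentParser(prog="vllm-mlx")
    subparsers = parser.add_subparsers(dest="command")
    serve = subparsers.add_parser("serve")
    serve.add_argument("model", help="Model path or Hub ID to load.")
    serve.add_argument(
        "--served-model-name",
        default=None,
        help="Name reported in API responses instead of the model path. "
        "When omitted, responses use the model path (unchanged behavior).",
    )
    return parser
```

The parsed value is then passed through to `load_model()`, which falls back to the model path when the flag is omitted.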

otarkhan and others added 3 commits February 28, 2026 20:09

Allow users to serve a model under a different name in API responses, matching vLLM's --served-model-name behavior.

The cache directory was derived from _model_name, which could be overridden by --served-model-name, causing cache misses when the served name changed. Use the actual model path instead.
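The cache fix in that commit amounts to keying the cache on the real model path rather than the display name. A minimal sketch of the idea, with a hypothetical `cache_dir` helper and cache location (neither is taken from the repository):

```python
import hashlib
from pathlib import Path


def cache_dir(model_path: str) -> Path:
    """Derive a cache directory from the actual model path.

    Keying on the model path, not the served alias, means changing
    --served-model-name does not invalidate existing cache entries.
    """
    digest = hashlib.sha256(model_path.encode("utf-8")).hexdigest()[:16]
    return Path.home() / ".cache" / "vllm-mlx" / digest
```

With this scheme, serving the same model under `my-llama` or under its full path resolves to the same cache directory.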
@waybarrios
Owner

Resolved the merge conflicts with main and pushed to your branch. The PR looks good: a clean feature that matches vLLM's --served-model-name behavior. Merging now.

@waybarrios waybarrios merged commit c609b59 into waybarrios:main Mar 21, 2026
6 checks passed
raullenchai pushed a commit to raullenchai/Rapid-MLX that referenced this pull request Mar 26, 2026
…ection, served-model-name

Merge 16 upstream commits (22dcbf8..d235c37) into our fork:

- feat: SpecPrefill — attention-based sparse prefill for TTFT reduction (waybarrios#180)
- feat: native Qwen3-VL video pipeline with temporal 3D conv + M-RoPE (waybarrios#150)
- fix: Disable MambaCache monkey-patch for hybrid models, add MTP auto-injection (waybarrios#97)
- feat: Add --served-model-name CLI parameter (waybarrios#125)
- feat: Add Qwen3.5 text-only loading and dynamic memory threshold (waybarrios#127)
- fix(mllm_scheduler): add adaptive periodic cache clearing (waybarrios#157)
- fix: Metal resource leak under high concurrency (waybarrios#92)

Conflict resolution strategy: keep all fork features (DeltaNet snapshots,
fast SSE templates, tool injection, cloud routing, prompt cache, etc.)
while incorporating upstream's new functionality.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>