Add --served-model-name CLI parameter#125

Merged
waybarrios merged 3 commits into waybarrios:main from otarkhan:feature/served-model-name
Mar 21, 2026
Conversation

@otarkhan
Contributor

Summary

  • Adds --served-model-name option to vllm-mlx serve, matching vLLM's behavior. When set, API responses use the specified name instead of the model path (e.g., serve mlx-community/Llama-3.2-3B-Instruct-4bit but have it appear as my-llama in /v1/models and all responses). When omitted, behavior is unchanged.
  • Validates incoming request model names against the served name, returning a 404 if they don't match — previously the server accepted any model name without checking.
  • All API responses (/v1/completions, /v1/chat/completions, /v1/messages, and their streaming variants) now consistently use the served model name.

Changes

  • vllm_mlx/cli.py: Added --served-model-name argument and passed it through to load_model().
  • vllm_mlx/server.py: Added served_model_name parameter to load_model(), added _validate_model_name() helper, wired validation into all three endpoints, and replaced all response-facing uses of request.model with _model_name.
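The CLI wiring in vllm_mlx/cli.py might look roughly like this. This is an illustrative sketch using argparse; the actual parser structure and defaults in the repository may differ.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    # Hypothetical recreation of the `vllm-mlx serve` CLI wiring.
    # The flag name matches vLLM's --served-model-name.
    parser = argparse.ArgumentParser(prog="vllm-mlx")
    subparsers = parser.add_subparsers(dest="command")
    serve = subparsers.add_parser("serve")
    serve.add_argument("model", help="Model path or Hub ID to load.")
    serve.add_argument(
        "--served-model-name",
        default=None,
        help="Name reported in API responses instead of the model path. "
        "When omitted, responses use the model path (unchanged behavior).",
    )
    return parser
```

The parsed value is then passed through to `load_model()`, which falls back to the model path when the flag is omitted.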

otarkhan and others added 3 commits February 28, 2026 20:09

Allow users to serve a model under a different name in API responses, matching vLLM's --served-model-name behavior.

The cache directory was derived from _model_name, which could be overridden by --served-model-name, causing cache misses when the served name changed. Use the actual model path instead.
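The cache fix in that commit amounts to keying the cache on the real model path rather than the display name. A minimal sketch of the idea, with a hypothetical `cache_dir` helper and cache location (neither is taken from the repository):

```python
import hashlib
from pathlib import Path


def cache_dir(model_path: str) -> Path:
    """Derive a cache directory from the actual model path.

    Keying on the model path, not the served alias, means changing
    --served-model-name does not invalidate existing cache entries.
    """
    digest = hashlib.sha256(model_path.encode("utf-8")).hexdigest()[:16]
    return Path.home() / ".cache" / "vllm-mlx" / digest
```

With this scheme, serving the same model under `my-llama` or under its full path resolves to the same cache directory.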
@waybarrios
Owner

Resolved the merge conflicts with main and pushed to your branch. The PR looks good: a clean feature that matches vLLM's --served-model-name behavior. Merging now.

@waybarrios waybarrios merged commit c609b59 into waybarrios:main Mar 21, 2026
6 checks passed
raullenchai pushed a commit to raullenchai/Rapid-MLX that referenced this pull request Mar 26, 2026
…ection, served-model-name

Merge 16 upstream commits (22dcbf8..d235c37) into our fork:

- feat: SpecPrefill — attention-based sparse prefill for TTFT reduction (waybarrios#180)
- feat: native Qwen3-VL video pipeline with temporal 3D conv + M-RoPE (waybarrios#150)
- fix: Disable MambaCache monkey-patch for hybrid models, add MTP auto-injection (waybarrios#97)
- feat: Add --served-model-name CLI parameter (waybarrios#125)
- feat: Add Qwen3.5 text-only loading and dynamic memory threshold (waybarrios#127)
- fix(mllm_scheduler): add adaptive periodic cache clearing (waybarrios#157)
- fix: Metal resource leak under high concurrency (waybarrios#92)

Conflict resolution strategy: keep all fork features (DeltaNet snapshots,
fast SSE templates, tool injection, cloud routing, prompt cache, etc.)
while incorporating upstream's new functionality.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>