
[DSA][MLA] Tiny refactor on DeepSeek to make it reusable for different backends#26656

Merged
vllm-bot merged 3 commits into vllm-project:main from MengqingCao:mlakwargs
Oct 15, 2025
Conversation

@MengqingCao
Contributor

@MengqingCao MengqingCao commented Oct 12, 2025

Purpose

Tiny refactor on DeepSeek to make it reusable for different backends

  • allow `MLAAttention` to accept variable-length parameter lists
  • add `enable_dsa_topk_indices_buffer` in `AttentionBackend` to flexibly determine whether to create the top-k indices buffer
  • remove hard-coded CUDA device strings
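The three bullets above can be sketched as follows. This is a hypothetical miniature, not vLLM's actual code: the class and flag names come from the PR description, but the bodies, the `device()` helper, and the `CudaDSABackend` subclass are illustrative assumptions.

```python
class AttentionBackend:
    # Flag named in the PR description: each backend decides whether
    # the DSA top-k indices buffer should be created at all.
    enable_dsa_topk_indices_buffer: bool = False

    @staticmethod
    def device() -> str:
        # Hypothetical helper: backends report their own device instead
        # of callers hard-coding device="cuda".
        return "cpu"


class CudaDSABackend(AttentionBackend):
    enable_dsa_topk_indices_buffer = True

    @staticmethod
    def device() -> str:
        return "cuda"


class MLAAttention:
    # **extra_impl_args lets backend-specific keyword arguments flow
    # through without widening the constructor for every new backend.
    def __init__(self, num_heads, backend, **extra_impl_args):
        self.num_heads = num_heads
        self.extra_impl_args = extra_impl_args
        self.topk_indices_buffer = None
        if backend.enable_dsa_topk_indices_buffer:
            # Placeholder for a real tensor allocated on backend.device().
            self.topk_indices_buffer = {
                "device": backend.device(),
                "topk": extra_impl_args.get("topk", 2048),
            }


print(MLAAttention(16, AttentionBackend()).topk_indices_buffer)  # None
print(MLAAttention(16, CudaDSABackend(), topk=256).topk_indices_buffer)
```

The point of the flag is that a non-CUDA backend simply leaves it `False` and pays no buffer cost, while the `**kwargs` pass-through keeps the core signature stable as backends diverge.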

Test Plan

Tests pass with DeepSeek-V3.2-Exp.

Test Result


Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
  • (Optional) Release notes update. If your change is user facing, please update the release notes draft in the Google Doc.

Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces a small refactor to make the DeepSeek implementation more reusable across different backends. The changes include making MLAAttention accept variable-length parameter lists, adding a feature flag for DSA topk indices buffer creation, and removing hardcoded CUDA device strings. While the changes are generally positive, I've identified a critical issue related to a potential KV cache layout mismatch and a high-severity issue concerning code duplication that should be addressed.

@chatgpt-codex-connector

💡 Codex Review

https://github.com/vllm-project/vllm/blob/e860103d33ced506474fdd9b88610f0d0bc43101/vllm/model_executor/models/deepseek_v2.py#L472-L476
P1: Align DeepSeek indexer num_kv_heads with backend

The new get_kv_cache_spec now reports num_kv_heads=8, but the associated backend (DeepseekV32IndexerBackend.get_kv_cache_shape) still asserts num_kv_heads == 1. During KV-cache initialization the runner passes this spec to the backend, triggering the assertion and aborting execution for DeepSeek V3.2 indexer layers. The remaining cache code still stores data as a single head ([num_blocks, block_size, head_dim + 1]), so either the backend must be updated for 8 heads or the spec here should stay at 1.
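The layout mismatch Codex describes can be sketched in miniature. The shapes and the assertion follow the comment above; the function body is an assumed simplification of the real `DeepseekV32IndexerBackend.get_kv_cache_shape`, not its actual code.

```python
def get_kv_cache_shape(num_blocks, block_size, num_kv_heads, head_dim):
    # Simplified stand-in for the backend method Codex flags: the
    # indexer cache is still stored as a single head of head_dim + 1
    # entries per slot, so the backend rejects any other head count.
    assert num_kv_heads == 1, "indexer cache layout assumes a single KV head"
    return (num_blocks, block_size, head_dim + 1)


# A spec reporting num_kv_heads=8 trips the assertion when the runner
# hands it to the backend during KV-cache initialization:
try:
    get_kv_cache_shape(num_blocks=4, block_size=16, num_kv_heads=8, head_dim=128)
except AssertionError as exc:
    print("init aborted:", exc)

# Keeping the spec at num_kv_heads=1 matches the stored layout:
print(get_kv_cache_shape(4, 16, 1, 128))  # (4, 16, 129)
```

Either the backend must be taught an 8-head layout or the spec must stay at 1; the two sides of the interface have to agree on one layout.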


@mergify
Contributor

mergify bot commented Oct 13, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @MengqingCao.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Oct 13, 2025
…t backends

  * allow MLAAttention to accept variable length parameter lists
  * add `enable_dsa_topk_indices_buffer` in `AttentionBackend` to flexibly determine whether to create topk indices buffer
  * remove cuda hard code

Signed-off-by: MengqingCao <cmq0113@163.com>
Signed-off-by: MengqingCao <cmq0113@163.com>
Signed-off-by: MengqingCao <cmq0113@163.com>
@MengqingCao
Contributor Author

@DarkLight1337 Could you help review this PR? I'm hoping it can be included in 0.11.1.

@DarkLight1337 DarkLight1337 enabled auto-merge (squash) October 15, 2025 02:50
@github-actions github-actions bot added the ready ONLY add when PR is ready to merge/full CI is needed label Oct 15, 2025
@MengqingCao
Contributor Author

Tried to reproduce the failed CI locally, but everything passes. I don't think this is a bug introduced by this PR. @DarkLight1337

VLLM_USE_MODELSCOPE=True pytest -sv tests/entrypoints/openai/test_completion_with_function_calling.py::test_tool_id_kimi_k2
INFO 10-15 13:58:18 [__init__.py:224] Automatically detected platform cuda.
/home/vllm_25fall/cmq/vllm/cmq/lib/python3.12/site-packages/pytest_asyncio/plugin.py:208: PytestDeprecationWarning: The configuration option "asyncio_default_fixture_loop_scope" is unset.
The event loop scope for asynchronous fixtures will default to the fixture caching scope. Future versions of pytest-asyncio will default the loop scope for asynchronous fixtures to function scope. Set the default fixture loop scope explicitly in order to avoid unexpected behavior in the future. Valid fixture loop scopes are: "function", "class", "module", "package", "session"

  warnings.warn(PytestDeprecationWarning(_DEFAULT_FIXTURE_LOOP_SCOPE_UNSET))
=============================================================================== test session starts ===============================================================================
platform linux -- Python 3.12.11, pytest-8.3.5, pluggy-1.5.0 -- /home/vllm_25fall/cmq/vllm/cmq/bin/python3
cachedir: .pytest_cache
hypothesis profile 'default' -> database=DirectoryBasedExampleDatabase(PosixPath('/home/vllm_25fall/cmq/vllm/.hypothesis/examples'))
rootdir: /home/vllm_25fall/cmq/vllm
configfile: pyproject.toml
plugins: forked-1.6.0, schemathesis-3.39.15, asyncio-0.24.0, subtests-0.14.1, cov-6.3.0, rerunfailures-14.0, shard-0.1.2, hypothesis-6.131.0, mock-3.14.0, buildkite-test-collector-0.1.9, hydra-core-1.3.2, timeout-2.3.1, anyio-4.6.2.post1
asyncio: mode=Mode.STRICT, default_loop_scope=None
collecting ... [2025-10-15 13:58:21] INFO config.py:54: PyTorch version 2.8.0+cu128 available.
[2025-10-15 13:58:21] INFO config.py:66: Polars version 1.29.0 available.
collected 2 items                                                                                                                                                                 
Running 2 items in this shard: tests/entrypoints/openai/test_completion_with_function_calling.py::test_tool_id_kimi_k2[required-True-Qwen/Qwen3-0.6B], tests/entrypoints/openai/test_completion_with_function_calling.py::test_tool_id_kimi_k2[required-False-Qwen/Qwen3-0.6B]

tests/entrypoints/openai/test_completion_with_function_calling.py::test_tool_id_kimi_k2[required-True-Qwen/Qwen3-0.6B] Downloading Model from https://www.modelscope.cn to directory: /home/vllm_25fall/.cache/modelscope/hub/models/Qwen/Qwen3-0.6B
2025-10-15 13:58:22,088 - modelscope - INFO - Target directory already exists, skipping creation.
Downloading Model from https://www.modelscope.cn to directory: /home/vllm_25fall/.cache/modelscope/hub/models/Qwen/Qwen3-0.6B
2025-10-15 13:58:22,474 - modelscope - INFO - Target directory already exists, skipping creation.
ERROR 10-15 13:58:22 [registry.py:587] Cached model info for class vllm.model_executor.models.qwen3.Qwen3ForCausalLM error. 
ERROR 10-15 13:58:22 [registry.py:587] Traceback (most recent call last):
ERROR 10-15 13:58:22 [registry.py:587]   File "/home/vllm_25fall/cmq/vllm/vllm/model_executor/models/registry.py", line 585, in _load_modelinfo_from_cache
ERROR 10-15 13:58:22 [registry.py:587]     return _ModelInfo(**mi_dict["modelinfo"])
ERROR 10-15 13:58:22 [registry.py:587]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 10-15 13:58:22 [registry.py:587] TypeError: _ModelInfo.__init__() missing 1 required positional argument: 'supports_v0_only'
INFO 10-15 13:58:28 [model.py:645] Resolved architecture: Qwen3ForCausalLM
`torch_dtype` is deprecated! Use `dtype` instead!
WARNING 10-15 13:58:28 [model.py:1932] Casting torch.bfloat16 to torch.float16.
INFO 10-15 13:58:28 [model.py:1706] Using max model len 40960
Downloading Model from https://www.modelscope.cn to directory: /home/vllm_25fall/.cache/modelscope/hub/models/Qwen/Qwen3-0.6B
2025-10-15 13:58:28,536 - modelscope - INFO - Target directory already exists, skipping creation.
Launching RemoteOpenAIServer with: vllm serve Qwen/Qwen3-0.6B --dtype half --enable-auto-tool-choice --structured-outputs-config.backend xgrammar --tool-call-parser hermes --reasoning-parser qwen3 --gpu-memory-utilization 0.4 --port 43433 --seed 0 --hf-overrides {"model_type": "kimi_k2", "kv_lora_rank": null}
INFO 10-15 13:58:30 [__init__.py:224] Automatically detected platform cuda.
(APIServer pid=2285197) INFO 10-15 13:58:34 [api_server.py:1870] vLLM API server version 0.11.1rc2.dev25+g7cfa420f4
(APIServer pid=2285197) INFO 10-15 13:58:34 [utils.py:239] non-default args: {'model_tag': 'Qwen/Qwen3-0.6B', 'port': 43433, 'enable_auto_tool_choice': True, 'tool_call_parser': 'hermes', 'dtype': 'half', 'seed': 0, 'hf_overrides': {'model_type': 'kimi_k2', 'kv_lora_rank': None}, 'reasoning_parser': 'qwen3', 'gpu_memory_utilization': 0.4, 'structured_outputs_config': StructuredOutputsConfig(backend='xgrammar', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_parser='')}
(APIServer pid=2285197) Downloading Model from https://www.modelscope.cn to directory: /home/vllm_25fall/.cache/modelscope/hub/models/Qwen/Qwen3-0.6B
(APIServer pid=2285197) 2025-10-15 13:58:34,757 - modelscope - INFO - Target directory already exists, skipping creation.
(APIServer pid=2285197) Downloading Model from https://www.modelscope.cn to directory: /home/vllm_25fall/.cache/modelscope/hub/models/Qwen/Qwen3-0.6B
(APIServer pid=2285197) 2025-10-15 13:58:35,148 - modelscope - INFO - Target directory already exists, skipping creation.
(APIServer pid=2285197) INFO 10-15 13:58:35 [model.py:645] Resolved architecture: Qwen3ForCausalLM
(APIServer pid=2285197) `torch_dtype` is deprecated! Use `dtype` instead!
(APIServer pid=2285197) WARNING 10-15 13:58:35 [model.py:1932] Casting torch.bfloat16 to torch.float16.
(APIServer pid=2285197) INFO 10-15 13:58:35 [model.py:1706] Using max model len 40960
(APIServer pid=2285197) Downloading Model from https://www.modelscope.cn to directory: /home/vllm_25fall/.cache/modelscope/hub/models/Qwen/Qwen3-0.6B
(APIServer pid=2285197) 2025-10-15 13:58:35,595 - modelscope - INFO - Target directory already exists, skipping creation.
(APIServer pid=2285197) INFO 10-15 13:58:35 [scheduler.py:225] Chunked prefill is enabled with max_num_batched_tokens=8192.
(APIServer pid=2285197) Downloading Model from https://www.modelscope.cn to directory: /home/vllm_25fall/.cache/modelscope/hub/models/Qwen/Qwen3-0.6B
(APIServer pid=2285197) 2025-10-15 13:58:36,154 - modelscope - INFO - Target directory already exists, skipping creation.
(APIServer pid=2285197) Downloading Model from https://www.modelscope.cn to directory: /home/vllm_25fall/.cache/modelscope/hub/models/Qwen/Qwen3-0.6B
(APIServer pid=2285197) 2025-10-15 13:58:36,844 - modelscope - INFO - Target directory already exists, skipping creation.
INFO 10-15 13:58:39 [__init__.py:224] Automatically detected platform cuda.
(EngineCore_DP0 pid=2287306) INFO 10-15 13:58:42 [core.py:727] Waiting for init message from front-end.
(EngineCore_DP0 pid=2287306) INFO 10-15 13:58:42 [core.py:94] Initializing a V1 LLM engine (v0.11.1rc2.dev25+g7cfa420f4) with config: model='Qwen/Qwen3-0.6B', speculative_config=None, tokenizer='Qwen/Qwen3-0.6B', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=40960, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, data_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, device_config=cuda, structured_outputs_config=StructuredOutputsConfig(backend='xgrammar', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_parser='qwen3'), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None), seed=0, served_model_name=Qwen/Qwen3-0.6B, enable_prefix_caching=True, chunked_prefill_enabled=True, pooler_config=None, compilation_config={'level': 3, 'debug_dump_path': None, 'cache_dir': '', 'backend': '', 'custom_ops': [], 'splitting_ops': ['vllm::unified_attention', 'vllm::unified_attention_with_output', 'vllm::unified_mla_attention', 'vllm::unified_mla_attention_with_output', 'vllm::mamba_mixer2', 'vllm::mamba_mixer', 'vllm::short_conv', 'vllm::linear_attention', 'vllm::plamo2_mamba_mixer', 'vllm::gdn_attention', 'vllm::sparse_attn_indexer'], 'use_inductor': True, 'compile_sizes': [], 'inductor_compile_config': {'enable_auto_functionalized_v2': False}, 'inductor_passes': {}, 'cudagraph_mode': <CUDAGraphMode.FULL_AND_PIECEWISE: (2, 1)>, 'use_cudagraph': True, 'cudagraph_num_of_warmups': 1, 'cudagraph_capture_sizes': [512, 504, 496, 488, 480, 472, 464, 456, 448, 440, 432, 424, 416, 408, 400, 392, 384, 376, 368, 360, 352, 344, 336, 328, 320, 312, 304, 296, 288, 280, 272, 264, 256, 248, 240, 232, 224, 216, 208, 200, 192, 184, 176, 168, 160, 152, 144, 136, 128, 120, 
112, 104, 96, 88, 80, 72, 64, 56, 48, 40, 32, 24, 16, 8, 4, 2, 1], 'cudagraph_copy_inputs': False, 'full_cuda_graph': True, 'use_inductor_graph_partition': False, 'pass_config': {}, 'max_capture_size': 512, 'local_cache_dir': None}
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
(EngineCore_DP0 pid=2287306) INFO 10-15 13:58:45 [parallel_state.py:1325] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, TP rank 0, EP rank 0
(EngineCore_DP0 pid=2287306) INFO 10-15 13:58:45 [topk_topp_sampler.py:58] Using FlashInfer for top-p & top-k sampling.
(EngineCore_DP0 pid=2287306) INFO 10-15 13:58:45 [gpu_model_runner.py:2849] Starting to load model Qwen/Qwen3-0.6B...
(EngineCore_DP0 pid=2287306) INFO 10-15 13:58:45 [gpu_model_runner.py:2879] Loading model from scratch...
(EngineCore_DP0 pid=2287306) INFO 10-15 13:58:45 [cuda.py:405] Using Flash Attention backend on V1 engine.
(EngineCore_DP0 pid=2287306) Downloading Model from https://www.modelscope.cn to directory: /home/vllm_25fall/.cache/modelscope/hub/models/Qwen/Qwen3-0.6B
(EngineCore_DP0 pid=2287306) 2025-10-15 13:58:46,360 - modelscope - INFO - Target directory already exists, skipping creation.
Loading safetensors checkpoint shards:   0% Completed | 0/1 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:00<00:00,  5.41it/s]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:00<00:00,  5.41it/s]
(EngineCore_DP0 pid=2287306) 
(EngineCore_DP0 pid=2287306) INFO 10-15 13:58:46 [default_loader.py:314] Loading weights took 0.28 seconds
(EngineCore_DP0 pid=2287306) INFO 10-15 13:58:46 [gpu_model_runner.py:2911] Model loading took 1.1201 GiB and 0.817729 seconds
(EngineCore_DP0 pid=2287306) INFO 10-15 13:58:50 [backends.py:604] Using cache directory: /home/vllm_25fall/.cache/vllm/torch_compile_cache/a7ccd4e3e2/rank_0_0/backbone for vLLM's torch.compile
(EngineCore_DP0 pid=2287306) INFO 10-15 13:58:50 [backends.py:618] Dynamo bytecode transform time: 3.69 s
(EngineCore_DP0 pid=2287306) WARNING 10-15 13:58:50 [partition_rules.py:41] Failed to resolve operator for Inductor partition: vllm::mamba_mixer2
(EngineCore_DP0 pid=2287306) WARNING 10-15 13:58:50 [partition_rules.py:41] Failed to resolve operator for Inductor partition: vllm::mamba_mixer
(EngineCore_DP0 pid=2287306) WARNING 10-15 13:58:50 [partition_rules.py:41] Failed to resolve operator for Inductor partition: vllm::short_conv
(EngineCore_DP0 pid=2287306) WARNING 10-15 13:58:50 [partition_rules.py:41] Failed to resolve operator for Inductor partition: vllm::linear_attention
(EngineCore_DP0 pid=2287306) WARNING 10-15 13:58:50 [partition_rules.py:41] Failed to resolve operator for Inductor partition: vllm::plamo2_mamba_mixer
(EngineCore_DP0 pid=2287306) WARNING 10-15 13:58:50 [partition_rules.py:41] Failed to resolve operator for Inductor partition: vllm::gdn_attention
(EngineCore_DP0 pid=2287306) INFO 10-15 13:58:53 [backends.py:243] Cache the graph for dynamic shape for later use
(EngineCore_DP0 pid=2287306) INFO 10-15 13:59:06 [backends.py:270] Compiling a graph for dynamic shape takes 15.32 s
(EngineCore_DP0 pid=2287306) INFO 10-15 13:59:11 [monitor.py:33] torch.compile takes 19.01 s in total
(EngineCore_DP0 pid=2287306) INFO 10-15 13:59:52 [gpu_worker.py:315] Available KV cache memory: 52.59 GiB
(EngineCore_DP0 pid=2287306) INFO 10-15 13:59:52 [kv_cache_utils.py:1199] GPU KV cache size: 492,320 tokens
(EngineCore_DP0 pid=2287306) INFO 10-15 13:59:52 [kv_cache_utils.py:1204] Maximum concurrency for 40,960 tokens per request: 12.02x
(EngineCore_DP0 pid=2287306) 2025-10-15 13:59:52,920 - INFO - autotuner.py:256 - flashinfer.jit: [Autotuner]: Autotuning process starts ...
(EngineCore_DP0 pid=2287306) 2025-10-15 13:59:52,979 - INFO - autotuner.py:262 - flashinfer.jit: [Autotuner]: Autotuning process ends
Capturing CUDA graphs (mixed prefill-decode, PIECEWISE): 100%|█████████████████████████████████████████████████████████████████████████████████████| 67/67 [00:02<00:00, 32.63it/s]
Capturing CUDA graphs (decode, FULL): 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████| 67/67 [00:01<00:00, 42.39it/s]
(EngineCore_DP0 pid=2287306) INFO 10-15 13:59:57 [gpu_model_runner.py:3801] Graph capturing finished in 4 secs, took 0.59 GiB
(EngineCore_DP0 pid=2287306) INFO 10-15 13:59:57 [core.py:240] init engine (profile, create kv cache, warmup model) took 70.19 seconds
(EngineCore_DP0 pid=2287306) Downloading Model from https://www.modelscope.cn to directory: /home/vllm_25fall/.cache/modelscope/hub/models/Qwen/Qwen3-0.6B
(EngineCore_DP0 pid=2287306) 2025-10-15 13:59:57,462 - modelscope - INFO - Target directory already exists, skipping creation.
(APIServer pid=2285197) INFO 10-15 13:59:58 [loggers.py:173] Engine 000: vllm cache_config_info with initialization after num_gpu_blocks is: 30770
(EngineCore_DP0 pid=2287306) INFO 10-15 13:59:58 [gc_utils.py:40] GC Debug Config. enabled:False,top_objects:-1
(APIServer pid=2285197) INFO 10-15 13:59:58 [api_server.py:1628] Supported tasks: ['generate']
(APIServer pid=2285197) Downloading Model from https://www.modelscope.cn to directory: /home/vllm_25fall/.cache/modelscope/hub/models/Qwen/Qwen3-0.6B
(APIServer pid=2285197) 2025-10-15 13:59:58,425 - modelscope - INFO - Target directory already exists, skipping creation.
(APIServer pid=2285197) WARNING 10-15 13:59:58 [model.py:1573] Default sampling parameters have been overridden by the model's Hugging Face generation config recommended from the model creator. If this is not intended, please relaunch vLLM instance with `--generation-config vllm`.
(APIServer pid=2285197) INFO 10-15 13:59:58 [serving_responses.py:147] Using default chat sampling params from model: {'temperature': 0.6, 'top_k': 20, 'top_p': 0.95}
(APIServer pid=2285197) INFO 10-15 13:59:58 [serving_responses.py:183] "auto" tool choice has been enabled please note that while the parallel_tool_calls client option is preset for compatibility reasons, it will be ignored.
(APIServer pid=2285197) INFO 10-15 13:59:58 [serving_engine.py:285] "auto" tool choice has been enabled please note that while the parallel_tool_calls client option is preset for compatibility reasons, it will be ignored.
(APIServer pid=2285197) Downloading Model from https://www.modelscope.cn to directory: /home/vllm_25fall/.cache/modelscope/hub/models/Qwen/Qwen3-0.6B
(APIServer pid=2285197) 2025-10-15 13:59:58,818 - modelscope - INFO - Target directory already exists, skipping creation.
(APIServer pid=2285197) INFO 10-15 13:59:58 [serving_chat.py:131] Using default chat sampling params from model: {'temperature': 0.6, 'top_k': 20, 'top_p': 0.95}
(APIServer pid=2285197) Downloading Model from https://www.modelscope.cn to directory: /home/vllm_25fall/.cache/modelscope/hub/models/Qwen/Qwen3-0.6B
(APIServer pid=2285197) 2025-10-15 13:59:59,202 - modelscope - INFO - Target directory already exists, skipping creation.
(APIServer pid=2285197) INFO 10-15 13:59:59 [serving_completion.py:67] Using default completion sampling params from model: {'temperature': 0.6, 'top_k': 20, 'top_p': 0.95}
(APIServer pid=2285197) INFO 10-15 13:59:59 [api_server.py:1939] Starting vLLM API server 0 on http://0.0.0.0:43433
(APIServer pid=2285197) INFO 10-15 13:59:59 [launcher.py:38] Available routes are:
(APIServer pid=2285197) INFO 10-15 13:59:59 [launcher.py:46] Route: /openapi.json, Methods: HEAD, GET
(APIServer pid=2285197) INFO 10-15 13:59:59 [launcher.py:46] Route: /docs, Methods: HEAD, GET
(APIServer pid=2285197) INFO 10-15 13:59:59 [launcher.py:46] Route: /docs/oauth2-redirect, Methods: HEAD, GET
(APIServer pid=2285197) INFO 10-15 13:59:59 [launcher.py:46] Route: /redoc, Methods: HEAD, GET
(APIServer pid=2285197) INFO 10-15 13:59:59 [launcher.py:46] Route: /health, Methods: GET
(APIServer pid=2285197) INFO 10-15 13:59:59 [launcher.py:46] Route: /load, Methods: GET
(APIServer pid=2285197) INFO 10-15 13:59:59 [launcher.py:46] Route: /ping, Methods: POST
(APIServer pid=2285197) INFO 10-15 13:59:59 [launcher.py:46] Route: /ping, Methods: GET
(APIServer pid=2285197) INFO 10-15 13:59:59 [launcher.py:46] Route: /tokenize, Methods: POST
(APIServer pid=2285197) INFO 10-15 13:59:59 [launcher.py:46] Route: /detokenize, Methods: POST
(APIServer pid=2285197) INFO 10-15 13:59:59 [launcher.py:46] Route: /v1/models, Methods: GET
(APIServer pid=2285197) INFO 10-15 13:59:59 [launcher.py:46] Route: /version, Methods: GET
(APIServer pid=2285197) INFO 10-15 13:59:59 [launcher.py:46] Route: /v1/responses, Methods: POST
(APIServer pid=2285197) INFO 10-15 13:59:59 [launcher.py:46] Route: /v1/responses/{response_id}, Methods: GET
(APIServer pid=2285197) INFO 10-15 13:59:59 [launcher.py:46] Route: /v1/responses/{response_id}/cancel, Methods: POST
(APIServer pid=2285197) INFO 10-15 13:59:59 [launcher.py:46] Route: /v1/chat/completions, Methods: POST
(APIServer pid=2285197) INFO 10-15 13:59:59 [launcher.py:46] Route: /v1/completions, Methods: POST
(APIServer pid=2285197) INFO 10-15 13:59:59 [launcher.py:46] Route: /v1/embeddings, Methods: POST
(APIServer pid=2285197) INFO 10-15 13:59:59 [launcher.py:46] Route: /pooling, Methods: POST
(APIServer pid=2285197) INFO 10-15 13:59:59 [launcher.py:46] Route: /classify, Methods: POST
(APIServer pid=2285197) INFO 10-15 13:59:59 [launcher.py:46] Route: /score, Methods: POST
(APIServer pid=2285197) INFO 10-15 13:59:59 [launcher.py:46] Route: /v1/score, Methods: POST
(APIServer pid=2285197) INFO 10-15 13:59:59 [launcher.py:46] Route: /v1/audio/transcriptions, Methods: POST
(APIServer pid=2285197) INFO 10-15 13:59:59 [launcher.py:46] Route: /v1/audio/translations, Methods: POST
(APIServer pid=2285197) INFO 10-15 13:59:59 [launcher.py:46] Route: /rerank, Methods: POST
(APIServer pid=2285197) INFO 10-15 13:59:59 [launcher.py:46] Route: /v1/rerank, Methods: POST
(APIServer pid=2285197) INFO 10-15 13:59:59 [launcher.py:46] Route: /v2/rerank, Methods: POST
(APIServer pid=2285197) INFO 10-15 13:59:59 [launcher.py:46] Route: /scale_elastic_ep, Methods: POST
(APIServer pid=2285197) INFO 10-15 13:59:59 [launcher.py:46] Route: /is_scaling_elastic_ep, Methods: POST
(APIServer pid=2285197) INFO 10-15 13:59:59 [launcher.py:46] Route: /invocations, Methods: POST
(APIServer pid=2285197) INFO 10-15 13:59:59 [launcher.py:46] Route: /metrics, Methods: GET
(APIServer pid=2285197) INFO:     Started server process [2285197]
(APIServer pid=2285197) INFO:     Waiting for application startup.
(APIServer pid=2285197) INFO:     Application startup complete.
(APIServer pid=2285197) INFO:     127.0.0.1:36700 - "GET /health HTTP/1.1" 200 OK
(APIServer pid=2285197) /home/vllm_25fall/cmq/vllm/cmq/lib/python3.12/site-packages/pydantic/_internal/_generate_schema.py:2249: UnsupportedFieldAttributeWarning: The 'deprecated' attribute with value 'max_tokens is deprecated in favor of the max_completion_tokens field' was provided to the `Field()` function, which has no effect in the context it was used. 'deprecated' is field-specific metadata, and can only be attached to a model field using `Annotated` metadata or by assignment. This may have happened because an `Annotated` type alias using the `type` statement was used, or if the `Field()` function was attached to a single member of a union type.
(APIServer pid=2285197)   warnings.warn(
(APIServer pid=2285197) INFO 10-15 13:59:59 [chat_utils.py:545] Detected the chat template content format to be 'string'. You can set `--chat-template-content-format` to override this.
(APIServer pid=2285197) INFO:     127.0.0.1:36716 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-10-15 13:59:59] INFO _client.py:1786: HTTP Request: POST http://localhost:43433/v1/chat/completions "HTTP/1.1 200 OK"
PASSED
tests/entrypoints/openai/test_completion_with_function_calling.py::test_tool_id_kimi_k2[required-False-Qwen/Qwen3-0.6B] (APIServer pid=2285197) INFO:     127.0.0.1:55960 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-10-15 14:00:03] INFO _client.py:1786: HTTP Request: POST http://localhost:43433/v1/chat/completions "HTTP/1.1 200 OK"
PASSED(APIServer pid=2285197) INFO 10-15 14:00:04 [launcher.py:110] Shutting down FastAPI HTTP server.
[rank0]:[W1015 14:00:04.047004909 ProcessGroupNCCL.cpp:1538] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())
nanobind: leaked 2 instances!
 - leaked instance 0x7f106100b438 of type "xgrammar.xgrammar_bindings.CompiledGrammar"
 - leaked instance 0x7f106100a898 of type "xgrammar.xgrammar_bindings.GrammarMatcher"
nanobind: leaked 5 types!
 - leaked type "xgrammar.xgrammar_bindings.CompiledGrammar"
 - leaked type "xgrammar.xgrammar_bindings.TokenizerInfo"
 - leaked type "xgrammar.xgrammar_bindings.Grammar"
 - leaked type "xgrammar.xgrammar_bindings.GrammarMatcher"
 - leaked type "xgrammar.xgrammar_bindings.GrammarCompiler"
nanobind: leaked 47 functions!
 - leaked function "serialize_json"
 - leaked function "reset"
 - leaked function "__init__"
 - leaked function "union"
 - leaked function "serialize_json"
 - leaked function "to_string"
 - leaked function ""
 - leaked function ""
 - leaked function "__init__"
 - leaked function ""
 - leaked function "from_ebnf"
 - leaked function ""
 - leaked function "from_vocab_and_metadata"
 - leaked function "deserialize_json"
 - leaked function "concat"
 - leaked function "from_json_schema"
 - leaked function "compile_regex"
 - leaked function ""
 - leaked function ""
 - leaked function "serialize_json"
 - leaked function "deserialize_json"
 - leaked function "clear_cache"
 - leaked function "_detect_metadata_from_hf"
 - leaked function ""
 - leaked function "accept_string"
 - leaked function "from_regex"
 - leaked function "accept_token"
 - leaked function "dump_metadata"
 - leaked function "_debug_print_internal_state"
 - leaked function "compile_grammar"
 - leaked function "compile_structural_tag"
 - leaked function "is_terminated"
 - leaked function ""
 - leaked function "__init__"
 - leaked function ""
 - leaked function ""
 - leaked function "compile_json_schema"
 - leaked function "rollback"
 - leaked function "get_cache_size_bytes"
 - leaked function "from_structural_tag"
 - leaked function "find_jump_forward_string"
 - leaked function ""
 - leaked function "builtin_json_grammar"
 - leaked function "compile_builtin_json_grammar"
 - leaked function "deserialize_json"
 - leaked function "fill_next_token_bitmask"
 - leaked function ""
nanobind: this is likely caused by a reference counting issue in the binding code.
(APIServer pid=2285197) INFO:     Shutting down
(APIServer pid=2285197) INFO:     Waiting for application shutdown.
(APIServer pid=2285197) INFO:     Application shutdown complete.


================================================================================ warnings summary =================================================================================
<frozen importlib._bootstrap>:488
  <frozen importlib._bootstrap>:488: DeprecationWarning: builtin type SwigPyPacked has no __module__ attribute

<frozen importlib._bootstrap>:488
  <frozen importlib._bootstrap>:488: DeprecationWarning: builtin type SwigPyObject has no __module__ attribute

cmq/lib/python3.12/site-packages/schemathesis/generation/coverage.py:305
  /home/vllm_25fall/cmq/vllm/cmq/lib/python3.12/site-packages/schemathesis/generation/coverage.py:305: DeprecationWarning: jsonschema.exceptions.RefResolutionError is deprecated as of version 4.18.0. If you wish to catch potential reference resolution errors, directly catch referencing.exceptions.Unresolvable.
    ref_error: type[Exception] = jsonschema.RefResolutionError,

-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
==================================================================== 2 passed, 3 warnings in 105.28s (0:01:45) ====================================================================

@vllm-bot vllm-bot merged commit 302ef40 into vllm-project:main Oct 15, 2025
56 of 58 checks passed
bbartels pushed a commit to bbartels/vllm that referenced this pull request Oct 16, 2025
…t backends (vllm-project#26656)

Signed-off-by: MengqingCao <cmq0113@163.com>
Signed-off-by: bbartels <benjamin@bartels.dev>
@Yikun Yikun added this to the v0.11.1 milestone Oct 16, 2025
albertoperdomo2 pushed a commit to albertoperdomo2/vllm that referenced this pull request Oct 16, 2025
…t backends (vllm-project#26656)

Signed-off-by: MengqingCao <cmq0113@163.com>
Signed-off-by: Alberto Perdomo <aperdomo@redhat.com>
lywa1998 pushed a commit to lywa1998/vllm that referenced this pull request Oct 20, 2025
…t backends (vllm-project#26656)

Signed-off-by: MengqingCao <cmq0113@163.com>
alhridoy pushed a commit to alhridoy/vllm that referenced this pull request Oct 24, 2025
…t backends (vllm-project#26656)

Signed-off-by: MengqingCao <cmq0113@163.com>
0xrushi pushed a commit to 0xrushi/vllm that referenced this pull request Oct 26, 2025
…t backends (vllm-project#26656)

Signed-off-by: MengqingCao <cmq0113@163.com>
Signed-off-by: 0xrushi <6279035+0xrushi@users.noreply.github.com>
0xrushi pushed a commit to 0xrushi/vllm that referenced this pull request Oct 26, 2025
…t backends (vllm-project#26656)

Signed-off-by: MengqingCao <cmq0113@163.com>
Signed-off-by: 0xrushi <6279035+0xrushi@users.noreply.github.com>
rtourgeman pushed a commit to rtourgeman/vllm that referenced this pull request Nov 10, 2025
…t backends (vllm-project#26656)

Signed-off-by: MengqingCao <cmq0113@163.com>
devpatelio pushed a commit to SumanthRH/vllm that referenced this pull request Nov 29, 2025
…t backends (vllm-project#26656)

Signed-off-by: MengqingCao <cmq0113@163.com>

Labels

deepseek (Related to DeepSeek models), ready (ONLY add when PR is ready to merge/full CI is needed), v1

5 participants