[DSA][MLA] Tiny refactor on DeepSeek to make it reusable for different backends #26656
vllm-bot merged 3 commits into vllm-project:main
Conversation
Code Review
This pull request introduces a small refactor to make the DeepSeek implementation more reusable across different backends. The changes include making MLAAttention accept variable-length parameter lists, adding a feature flag for DSA top-k indices buffer creation, and removing hardcoded CUDA device strings. While the changes are generally positive, I've identified a critical issue related to a potential KV cache layout mismatch and a high-severity issue concerning code duplication that should be addressed.
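The variable-length parameter change mentioned above can be pictured as follows. This is a hedged, self-contained sketch, not vLLM's actual MLAAttention signature; the extra keyword names (`indexer_topk`, `use_sparse`, `device_type`) are purely illustrative.

```python
# Hedged sketch: letting an attention layer accept a variable-length
# parameter list so backend-specific options can ride along without
# changing the core signature. Not vLLM's real MLAAttention code; the
# extra keyword names below are illustrative only.

class MLAAttention:
    def __init__(self, num_heads: int, head_size: int, **extra_impl_args):
        self.num_heads = num_heads
        self.head_size = head_size
        # Backend-specific options are captured and forwarded untouched,
        # so a new backend can add parameters without editing this class.
        self.extra_impl_args = extra_impl_args


# Different backends can pass different extras through the same signature:
attn_a = MLAAttention(8, 64, indexer_topk=2048)
attn_b = MLAAttention(8, 64, use_sparse=True, device_type="npu")
print(attn_a.extra_impl_args)  # {'indexer_topk': 2048}
print(attn_b.extra_impl_args)  # {'use_sparse': True, 'device_type': 'npu'}
```

The point of the design is that the shared layer stays backend-agnostic: only the backend implementation that ultimately consumes `extra_impl_args` needs to know what each option means.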
💡 Codex Review: https://github.com/vllm-project/vllm/blob/e860103d33ced506474fdd9b88610f0d0bc43101/vllm/model_executor/models/deepseek_v2.py#L472-L476
If Codex has suggestions, it will comment; otherwise it will react with 👍.
This pull request has merge conflicts that must be resolved before it can be merged.
…t backends

* allow MLAAttention to accept variable-length parameter lists
* add `enable_dsa_topk_indices_buffer` in `AttentionBackend` to flexibly determine whether to create the top-k indices buffer
* remove hardcoded CUDA device strings

Signed-off-by: MengqingCao <cmq0113@163.com>
Signed-off-by: MengqingCao <cmq0113@163.com>
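The `enable_dsa_topk_indices_buffer` flag described in the commit message can be sketched as a backend capability attribute. This is a hedged illustration, not vLLM's actual code: `SparseMLABackend` is a hypothetical subclass name, and a nested list stands in for a real tensor so the sketch runs anywhere.

```python
# Hedged sketch of a backend capability flag like the PR's
# `enable_dsa_topk_indices_buffer`: the base backend defaults to False,
# a backend that supports DSA opts in, and the runner consults the flag
# before allocating the top-k indices buffer.

class AttentionBackend:
    # Backends override this to request a DSA top-k indices buffer.
    enable_dsa_topk_indices_buffer: bool = False


class SparseMLABackend(AttentionBackend):  # hypothetical opt-in backend
    enable_dsa_topk_indices_buffer = True


def maybe_create_topk_indices_buffer(backend, num_tokens: int, topk: int):
    # Only allocate when the backend declares support for DSA top-k indices;
    # other backends skip the allocation entirely.
    if backend.enable_dsa_topk_indices_buffer:
        return [[0] * topk for _ in range(num_tokens)]  # stand-in for a tensor
    return None


print(maybe_create_topk_indices_buffer(SparseMLABackend, 4, 2))  # 4x2 buffer
print(maybe_create_topk_indices_buffer(AttentionBackend, 4, 2))  # None
```

A class-level flag like this keeps the decision with the backend rather than scattering backend-name checks through the model runner, which is what makes the DeepSeek path reusable for non-CUDA backends.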
Signed-off-by: MengqingCao <cmq0113@163.com>
@DarkLight1337 Could you help review this PR? Hoping it can be included in 0.11.1.
Tried to reproduce the failed CI locally, but everything passes. I think it is not a bug introduced by this PR? @DarkLight1337

VLLM_USE_MODELSCOPE=True pytest -sv tests/entrypoints/openai/test_completion_with_function_calling.py::test_tool_id_kimi_k2
INFO 10-15 13:58:18 [__init__.py:224] Automatically detected platform cuda.
/home/vllm_25fall/cmq/vllm/cmq/lib/python3.12/site-packages/pytest_asyncio/plugin.py:208: PytestDeprecationWarning: The configuration option "asyncio_default_fixture_loop_scope" is unset.
The event loop scope for asynchronous fixtures will default to the fixture caching scope. Future versions of pytest-asyncio will default the loop scope for asynchronous fixtures to function scope. Set the default fixture loop scope explicitly in order to avoid unexpected behavior in the future. Valid fixture loop scopes are: "function", "class", "module", "package", "session"
warnings.warn(PytestDeprecationWarning(_DEFAULT_FIXTURE_LOOP_SCOPE_UNSET))
=============================================================================== test session starts ===============================================================================
platform linux -- Python 3.12.11, pytest-8.3.5, pluggy-1.5.0 -- /home/vllm_25fall/cmq/vllm/cmq/bin/python3
cachedir: .pytest_cache
hypothesis profile 'default' -> database=DirectoryBasedExampleDatabase(PosixPath('/home/vllm_25fall/cmq/vllm/.hypothesis/examples'))
rootdir: /home/vllm_25fall/cmq/vllm
configfile: pyproject.toml
plugins: forked-1.6.0, schemathesis-3.39.15, asyncio-0.24.0, subtests-0.14.1, cov-6.3.0, rerunfailures-14.0, shard-0.1.2, hypothesis-6.131.0, mock-3.14.0, buildkite-test-collector-0.1.9, hydra-core-1.3.2, timeout-2.3.1, anyio-4.6.2.post1
asyncio: mode=Mode.STRICT, default_loop_scope=None
collecting ... [2025-10-15 13:58:21] INFO config.py:54: PyTorch version 2.8.0+cu128 available.
[2025-10-15 13:58:21] INFO config.py:66: Polars version 1.29.0 available.
collected 2 items
Running 2 items in this shard: tests/entrypoints/openai/test_completion_with_function_calling.py::test_tool_id_kimi_k2[required-True-Qwen/Qwen3-0.6B], tests/entrypoints/openai/test_completion_with_function_calling.py::test_tool_id_kimi_k2[required-False-Qwen/Qwen3-0.6B]
tests/entrypoints/openai/test_completion_with_function_calling.py::test_tool_id_kimi_k2[required-True-Qwen/Qwen3-0.6B] Downloading Model from https://www.modelscope.cn to directory: /home/vllm_25fall/.cache/modelscope/hub/models/Qwen/Qwen3-0.6B
2025-10-15 13:58:22,088 - modelscope - INFO - Target directory already exists, skipping creation.
Downloading Model from https://www.modelscope.cn to directory: /home/vllm_25fall/.cache/modelscope/hub/models/Qwen/Qwen3-0.6B
2025-10-15 13:58:22,474 - modelscope - INFO - Target directory already exists, skipping creation.
ERROR 10-15 13:58:22 [registry.py:587] Cached model info for class vllm.model_executor.models.qwen3.Qwen3ForCausalLM error.
ERROR 10-15 13:58:22 [registry.py:587] Traceback (most recent call last):
ERROR 10-15 13:58:22 [registry.py:587] File "/home/vllm_25fall/cmq/vllm/vllm/model_executor/models/registry.py", line 585, in _load_modelinfo_from_cache
ERROR 10-15 13:58:22 [registry.py:587] return _ModelInfo(**mi_dict["modelinfo"])
ERROR 10-15 13:58:22 [registry.py:587] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 10-15 13:58:22 [registry.py:587] TypeError: _ModelInfo.__init__() missing 1 required positional argument: 'supports_v0_only'
INFO 10-15 13:58:28 [model.py:645] Resolved architecture: Qwen3ForCausalLM
`torch_dtype` is deprecated! Use `dtype` instead!
WARNING 10-15 13:58:28 [model.py:1932] Casting torch.bfloat16 to torch.float16.
INFO 10-15 13:58:28 [model.py:1706] Using max model len 40960
Downloading Model from https://www.modelscope.cn to directory: /home/vllm_25fall/.cache/modelscope/hub/models/Qwen/Qwen3-0.6B
2025-10-15 13:58:28,536 - modelscope - INFO - Target directory already exists, skipping creation.
Launching RemoteOpenAIServer with: vllm serve Qwen/Qwen3-0.6B --dtype half --enable-auto-tool-choice --structured-outputs-config.backend xgrammar --tool-call-parser hermes --reasoning-parser qwen3 --gpu-memory-utilization 0.4 --port 43433 --seed 0 --hf-overrides {"model_type": "kimi_k2", "kv_lora_rank": null}
INFO 10-15 13:58:30 [__init__.py:224] Automatically detected platform cuda.
(APIServer pid=2285197) INFO 10-15 13:58:34 [api_server.py:1870] vLLM API server version 0.11.1rc2.dev25+g7cfa420f4
(APIServer pid=2285197) INFO 10-15 13:58:34 [utils.py:239] non-default args: {'model_tag': 'Qwen/Qwen3-0.6B', 'port': 43433, 'enable_auto_tool_choice': True, 'tool_call_parser': 'hermes', 'dtype': 'half', 'seed': 0, 'hf_overrides': {'model_type': 'kimi_k2', 'kv_lora_rank': None}, 'reasoning_parser': 'qwen3', 'gpu_memory_utilization': 0.4, 'structured_outputs_config': StructuredOutputsConfig(backend='xgrammar', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_parser='')}
(APIServer pid=2285197) Downloading Model from https://www.modelscope.cn to directory: /home/vllm_25fall/.cache/modelscope/hub/models/Qwen/Qwen3-0.6B
(APIServer pid=2285197) 2025-10-15 13:58:34,757 - modelscope - INFO - Target directory already exists, skipping creation.
(APIServer pid=2285197) Downloading Model from https://www.modelscope.cn to directory: /home/vllm_25fall/.cache/modelscope/hub/models/Qwen/Qwen3-0.6B
(APIServer pid=2285197) 2025-10-15 13:58:35,148 - modelscope - INFO - Target directory already exists, skipping creation.
(APIServer pid=2285197) INFO 10-15 13:58:35 [model.py:645] Resolved architecture: Qwen3ForCausalLM
(APIServer pid=2285197) `torch_dtype` is deprecated! Use `dtype` instead!
(APIServer pid=2285197) WARNING 10-15 13:58:35 [model.py:1932] Casting torch.bfloat16 to torch.float16.
(APIServer pid=2285197) INFO 10-15 13:58:35 [model.py:1706] Using max model len 40960
(APIServer pid=2285197) Downloading Model from https://www.modelscope.cn to directory: /home/vllm_25fall/.cache/modelscope/hub/models/Qwen/Qwen3-0.6B
(APIServer pid=2285197) 2025-10-15 13:58:35,595 - modelscope - INFO - Target directory already exists, skipping creation.
(APIServer pid=2285197) INFO 10-15 13:58:35 [scheduler.py:225] Chunked prefill is enabled with max_num_batched_tokens=8192.
(APIServer pid=2285197) Downloading Model from https://www.modelscope.cn to directory: /home/vllm_25fall/.cache/modelscope/hub/models/Qwen/Qwen3-0.6B
(APIServer pid=2285197) 2025-10-15 13:58:36,154 - modelscope - INFO - Target directory already exists, skipping creation.
(APIServer pid=2285197) Downloading Model from https://www.modelscope.cn to directory: /home/vllm_25fall/.cache/modelscope/hub/models/Qwen/Qwen3-0.6B
(APIServer pid=2285197) 2025-10-15 13:58:36,844 - modelscope - INFO - Target directory already exists, skipping creation.
INFO 10-15 13:58:39 [__init__.py:224] Automatically detected platform cuda.
(EngineCore_DP0 pid=2287306) INFO 10-15 13:58:42 [core.py:727] Waiting for init message from front-end.
(EngineCore_DP0 pid=2287306) INFO 10-15 13:58:42 [core.py:94] Initializing a V1 LLM engine (v0.11.1rc2.dev25+g7cfa420f4) with config: model='Qwen/Qwen3-0.6B', speculative_config=None, tokenizer='Qwen/Qwen3-0.6B', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=40960, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, data_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, device_config=cuda, structured_outputs_config=StructuredOutputsConfig(backend='xgrammar', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_parser='qwen3'), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None), seed=0, served_model_name=Qwen/Qwen3-0.6B, enable_prefix_caching=True, chunked_prefill_enabled=True, pooler_config=None, compilation_config={'level': 3, 'debug_dump_path': None, 'cache_dir': '', 'backend': '', 'custom_ops': [], 'splitting_ops': ['vllm::unified_attention', 'vllm::unified_attention_with_output', 'vllm::unified_mla_attention', 'vllm::unified_mla_attention_with_output', 'vllm::mamba_mixer2', 'vllm::mamba_mixer', 'vllm::short_conv', 'vllm::linear_attention', 'vllm::plamo2_mamba_mixer', 'vllm::gdn_attention', 'vllm::sparse_attn_indexer'], 'use_inductor': True, 'compile_sizes': [], 'inductor_compile_config': {'enable_auto_functionalized_v2': False}, 'inductor_passes': {}, 'cudagraph_mode': <CUDAGraphMode.FULL_AND_PIECEWISE: (2, 1)>, 'use_cudagraph': True, 'cudagraph_num_of_warmups': 1, 'cudagraph_capture_sizes': [512, 504, 496, 488, 480, 472, 464, 456, 448, 440, 432, 424, 416, 408, 400, 392, 384, 376, 368, 360, 352, 344, 336, 328, 320, 312, 304, 296, 288, 280, 272, 264, 256, 248, 240, 232, 224, 216, 208, 200, 192, 184, 176, 168, 160, 152, 144, 136, 128, 120, 
112, 104, 96, 88, 80, 72, 64, 56, 48, 40, 32, 24, 16, 8, 4, 2, 1], 'cudagraph_copy_inputs': False, 'full_cuda_graph': True, 'use_inductor_graph_partition': False, 'pass_config': {}, 'max_capture_size': 512, 'local_cache_dir': None}
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
(EngineCore_DP0 pid=2287306) INFO 10-15 13:58:45 [parallel_state.py:1325] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, TP rank 0, EP rank 0
(EngineCore_DP0 pid=2287306) INFO 10-15 13:58:45 [topk_topp_sampler.py:58] Using FlashInfer for top-p & top-k sampling.
(EngineCore_DP0 pid=2287306) INFO 10-15 13:58:45 [gpu_model_runner.py:2849] Starting to load model Qwen/Qwen3-0.6B...
(EngineCore_DP0 pid=2287306) INFO 10-15 13:58:45 [gpu_model_runner.py:2879] Loading model from scratch...
(EngineCore_DP0 pid=2287306) INFO 10-15 13:58:45 [cuda.py:405] Using Flash Attention backend on V1 engine.
(EngineCore_DP0 pid=2287306) Downloading Model from https://www.modelscope.cn to directory: /home/vllm_25fall/.cache/modelscope/hub/models/Qwen/Qwen3-0.6B
(EngineCore_DP0 pid=2287306) 2025-10-15 13:58:46,360 - modelscope - INFO - Target directory already exists, skipping creation.
Loading safetensors checkpoint shards: 0% Completed | 0/1 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:00<00:00, 5.41it/s]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:00<00:00, 5.41it/s]
(EngineCore_DP0 pid=2287306)
(EngineCore_DP0 pid=2287306) INFO 10-15 13:58:46 [default_loader.py:314] Loading weights took 0.28 seconds
(EngineCore_DP0 pid=2287306) INFO 10-15 13:58:46 [gpu_model_runner.py:2911] Model loading took 1.1201 GiB and 0.817729 seconds
(EngineCore_DP0 pid=2287306) INFO 10-15 13:58:50 [backends.py:604] Using cache directory: /home/vllm_25fall/.cache/vllm/torch_compile_cache/a7ccd4e3e2/rank_0_0/backbone for vLLM's torch.compile
(EngineCore_DP0 pid=2287306) INFO 10-15 13:58:50 [backends.py:618] Dynamo bytecode transform time: 3.69 s
(EngineCore_DP0 pid=2287306) WARNING 10-15 13:58:50 [partition_rules.py:41] Failed to resolve operator for Inductor partition: vllm::mamba_mixer2
(EngineCore_DP0 pid=2287306) WARNING 10-15 13:58:50 [partition_rules.py:41] Failed to resolve operator for Inductor partition: vllm::mamba_mixer
(EngineCore_DP0 pid=2287306) WARNING 10-15 13:58:50 [partition_rules.py:41] Failed to resolve operator for Inductor partition: vllm::short_conv
(EngineCore_DP0 pid=2287306) WARNING 10-15 13:58:50 [partition_rules.py:41] Failed to resolve operator for Inductor partition: vllm::linear_attention
(EngineCore_DP0 pid=2287306) WARNING 10-15 13:58:50 [partition_rules.py:41] Failed to resolve operator for Inductor partition: vllm::plamo2_mamba_mixer
(EngineCore_DP0 pid=2287306) WARNING 10-15 13:58:50 [partition_rules.py:41] Failed to resolve operator for Inductor partition: vllm::gdn_attention
(EngineCore_DP0 pid=2287306) INFO 10-15 13:58:53 [backends.py:243] Cache the graph for dynamic shape for later use
(EngineCore_DP0 pid=2287306) INFO 10-15 13:59:06 [backends.py:270] Compiling a graph for dynamic shape takes 15.32 s
(EngineCore_DP0 pid=2287306) INFO 10-15 13:59:11 [monitor.py:33] torch.compile takes 19.01 s in total
(EngineCore_DP0 pid=2287306) INFO 10-15 13:59:52 [gpu_worker.py:315] Available KV cache memory: 52.59 GiB
(EngineCore_DP0 pid=2287306) INFO 10-15 13:59:52 [kv_cache_utils.py:1199] GPU KV cache size: 492,320 tokens
(EngineCore_DP0 pid=2287306) INFO 10-15 13:59:52 [kv_cache_utils.py:1204] Maximum concurrency for 40,960 tokens per request: 12.02x
(EngineCore_DP0 pid=2287306) 2025-10-15 13:59:52,920 - INFO - autotuner.py:256 - flashinfer.jit: [Autotuner]: Autotuning process starts ...
(EngineCore_DP0 pid=2287306) 2025-10-15 13:59:52,979 - INFO - autotuner.py:262 - flashinfer.jit: [Autotuner]: Autotuning process ends
Capturing CUDA graphs (mixed prefill-decode, PIECEWISE): 100%|█████████████████████████████████████████████████████████████████████████████████████| 67/67 [00:02<00:00, 32.63it/s]
Capturing CUDA graphs (decode, FULL): 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████| 67/67 [00:01<00:00, 42.39it/s]
(EngineCore_DP0 pid=2287306) INFO 10-15 13:59:57 [gpu_model_runner.py:3801] Graph capturing finished in 4 secs, took 0.59 GiB
(EngineCore_DP0 pid=2287306) INFO 10-15 13:59:57 [core.py:240] init engine (profile, create kv cache, warmup model) took 70.19 seconds
(EngineCore_DP0 pid=2287306) Downloading Model from https://www.modelscope.cn to directory: /home/vllm_25fall/.cache/modelscope/hub/models/Qwen/Qwen3-0.6B
(EngineCore_DP0 pid=2287306) 2025-10-15 13:59:57,462 - modelscope - INFO - Target directory already exists, skipping creation.
(APIServer pid=2285197) INFO 10-15 13:59:58 [loggers.py:173] Engine 000: vllm cache_config_info with initialization after num_gpu_blocks is: 30770
(EngineCore_DP0 pid=2287306) INFO 10-15 13:59:58 [gc_utils.py:40] GC Debug Config. enabled:False,top_objects:-1
(APIServer pid=2285197) INFO 10-15 13:59:58 [api_server.py:1628] Supported tasks: ['generate']
(APIServer pid=2285197) Downloading Model from https://www.modelscope.cn to directory: /home/vllm_25fall/.cache/modelscope/hub/models/Qwen/Qwen3-0.6B
(APIServer pid=2285197) 2025-10-15 13:59:58,425 - modelscope - INFO - Target directory already exists, skipping creation.
(APIServer pid=2285197) WARNING 10-15 13:59:58 [model.py:1573] Default sampling parameters have been overridden by the model's Hugging Face generation config recommended from the model creator. If this is not intended, please relaunch vLLM instance with `--generation-config vllm`.
(APIServer pid=2285197) INFO 10-15 13:59:58 [serving_responses.py:147] Using default chat sampling params from model: {'temperature': 0.6, 'top_k': 20, 'top_p': 0.95}
(APIServer pid=2285197) INFO 10-15 13:59:58 [serving_responses.py:183] "auto" tool choice has been enabled please note that while the parallel_tool_calls client option is preset for compatibility reasons, it will be ignored.
(APIServer pid=2285197) INFO 10-15 13:59:58 [serving_engine.py:285] "auto" tool choice has been enabled please note that while the parallel_tool_calls client option is preset for compatibility reasons, it will be ignored.
(APIServer pid=2285197) Downloading Model from https://www.modelscope.cn to directory: /home/vllm_25fall/.cache/modelscope/hub/models/Qwen/Qwen3-0.6B
(APIServer pid=2285197) 2025-10-15 13:59:58,818 - modelscope - INFO - Target directory already exists, skipping creation.
(APIServer pid=2285197) INFO 10-15 13:59:58 [serving_chat.py:131] Using default chat sampling params from model: {'temperature': 0.6, 'top_k': 20, 'top_p': 0.95}
(APIServer pid=2285197) Downloading Model from https://www.modelscope.cn to directory: /home/vllm_25fall/.cache/modelscope/hub/models/Qwen/Qwen3-0.6B
(APIServer pid=2285197) 2025-10-15 13:59:59,202 - modelscope - INFO - Target directory already exists, skipping creation.
(APIServer pid=2285197) INFO 10-15 13:59:59 [serving_completion.py:67] Using default completion sampling params from model: {'temperature': 0.6, 'top_k': 20, 'top_p': 0.95}
(APIServer pid=2285197) INFO 10-15 13:59:59 [api_server.py:1939] Starting vLLM API server 0 on http://0.0.0.0:43433
(APIServer pid=2285197) INFO 10-15 13:59:59 [launcher.py:38] Available routes are:
(APIServer pid=2285197) INFO 10-15 13:59:59 [launcher.py:46] Route: /openapi.json, Methods: HEAD, GET
(APIServer pid=2285197) INFO 10-15 13:59:59 [launcher.py:46] Route: /docs, Methods: HEAD, GET
(APIServer pid=2285197) INFO 10-15 13:59:59 [launcher.py:46] Route: /docs/oauth2-redirect, Methods: HEAD, GET
(APIServer pid=2285197) INFO 10-15 13:59:59 [launcher.py:46] Route: /redoc, Methods: HEAD, GET
(APIServer pid=2285197) INFO 10-15 13:59:59 [launcher.py:46] Route: /health, Methods: GET
(APIServer pid=2285197) INFO 10-15 13:59:59 [launcher.py:46] Route: /load, Methods: GET
(APIServer pid=2285197) INFO 10-15 13:59:59 [launcher.py:46] Route: /ping, Methods: POST
(APIServer pid=2285197) INFO 10-15 13:59:59 [launcher.py:46] Route: /ping, Methods: GET
(APIServer pid=2285197) INFO 10-15 13:59:59 [launcher.py:46] Route: /tokenize, Methods: POST
(APIServer pid=2285197) INFO 10-15 13:59:59 [launcher.py:46] Route: /detokenize, Methods: POST
(APIServer pid=2285197) INFO 10-15 13:59:59 [launcher.py:46] Route: /v1/models, Methods: GET
(APIServer pid=2285197) INFO 10-15 13:59:59 [launcher.py:46] Route: /version, Methods: GET
(APIServer pid=2285197) INFO 10-15 13:59:59 [launcher.py:46] Route: /v1/responses, Methods: POST
(APIServer pid=2285197) INFO 10-15 13:59:59 [launcher.py:46] Route: /v1/responses/{response_id}, Methods: GET
(APIServer pid=2285197) INFO 10-15 13:59:59 [launcher.py:46] Route: /v1/responses/{response_id}/cancel, Methods: POST
(APIServer pid=2285197) INFO 10-15 13:59:59 [launcher.py:46] Route: /v1/chat/completions, Methods: POST
(APIServer pid=2285197) INFO 10-15 13:59:59 [launcher.py:46] Route: /v1/completions, Methods: POST
(APIServer pid=2285197) INFO 10-15 13:59:59 [launcher.py:46] Route: /v1/embeddings, Methods: POST
(APIServer pid=2285197) INFO 10-15 13:59:59 [launcher.py:46] Route: /pooling, Methods: POST
(APIServer pid=2285197) INFO 10-15 13:59:59 [launcher.py:46] Route: /classify, Methods: POST
(APIServer pid=2285197) INFO 10-15 13:59:59 [launcher.py:46] Route: /score, Methods: POST
(APIServer pid=2285197) INFO 10-15 13:59:59 [launcher.py:46] Route: /v1/score, Methods: POST
(APIServer pid=2285197) INFO 10-15 13:59:59 [launcher.py:46] Route: /v1/audio/transcriptions, Methods: POST
(APIServer pid=2285197) INFO 10-15 13:59:59 [launcher.py:46] Route: /v1/audio/translations, Methods: POST
(APIServer pid=2285197) INFO 10-15 13:59:59 [launcher.py:46] Route: /rerank, Methods: POST
(APIServer pid=2285197) INFO 10-15 13:59:59 [launcher.py:46] Route: /v1/rerank, Methods: POST
(APIServer pid=2285197) INFO 10-15 13:59:59 [launcher.py:46] Route: /v2/rerank, Methods: POST
(APIServer pid=2285197) INFO 10-15 13:59:59 [launcher.py:46] Route: /scale_elastic_ep, Methods: POST
(APIServer pid=2285197) INFO 10-15 13:59:59 [launcher.py:46] Route: /is_scaling_elastic_ep, Methods: POST
(APIServer pid=2285197) INFO 10-15 13:59:59 [launcher.py:46] Route: /invocations, Methods: POST
(APIServer pid=2285197) INFO 10-15 13:59:59 [launcher.py:46] Route: /metrics, Methods: GET
(APIServer pid=2285197) INFO: Started server process [2285197]
(APIServer pid=2285197) INFO: Waiting for application startup.
(APIServer pid=2285197) INFO: Application startup complete.
(APIServer pid=2285197) INFO: 127.0.0.1:36700 - "GET /health HTTP/1.1" 200 OK
(APIServer pid=2285197) /home/vllm_25fall/cmq/vllm/cmq/lib/python3.12/site-packages/pydantic/_internal/_generate_schema.py:2249: UnsupportedFieldAttributeWarning: The 'deprecated' attribute with value 'max_tokens is deprecated in favor of the max_completion_tokens field' was provided to the `Field()` function, which has no effect in the context it was used. 'deprecated' is field-specific metadata, and can only be attached to a model field using `Annotated` metadata or by assignment. This may have happened because an `Annotated` type alias using the `type` statement was used, or if the `Field()` function was attached to a single member of a union type.
(APIServer pid=2285197) warnings.warn(
(APIServer pid=2285197) INFO 10-15 13:59:59 [chat_utils.py:545] Detected the chat template content format to be 'string'. You can set `--chat-template-content-format` to override this.
(APIServer pid=2285197) INFO: 127.0.0.1:36716 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-10-15 13:59:59] INFO _client.py:1786: HTTP Request: POST http://localhost:43433/v1/chat/completions "HTTP/1.1 200 OK"
PASSED
tests/entrypoints/openai/test_completion_with_function_calling.py::test_tool_id_kimi_k2[required-False-Qwen/Qwen3-0.6B] (APIServer pid=2285197) INFO: 127.0.0.1:55960 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-10-15 14:00:03] INFO _client.py:1786: HTTP Request: POST http://localhost:43433/v1/chat/completions "HTTP/1.1 200 OK"
PASSED(APIServer pid=2285197) INFO 10-15 14:00:04 [launcher.py:110] Shutting down FastAPI HTTP server.
[rank0]:[W1015 14:00:04.047004909 ProcessGroupNCCL.cpp:1538] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())
nanobind: leaked 2 instances!
- leaked instance 0x7f106100b438 of type "xgrammar.xgrammar_bindings.CompiledGrammar"
- leaked instance 0x7f106100a898 of type "xgrammar.xgrammar_bindings.GrammarMatcher"
nanobind: leaked 5 types!
- leaked type "xgrammar.xgrammar_bindings.CompiledGrammar"
- leaked type "xgrammar.xgrammar_bindings.TokenizerInfo"
- leaked type "xgrammar.xgrammar_bindings.Grammar"
- leaked type "xgrammar.xgrammar_bindings.GrammarMatcher"
- leaked type "xgrammar.xgrammar_bindings.GrammarCompiler"
nanobind: leaked 47 functions!
- leaked function "serialize_json"
- leaked function "reset"
- leaked function "__init__"
- leaked function "union"
- leaked function "serialize_json"
- leaked function "to_string"
- leaked function ""
- leaked function ""
- leaked function "__init__"
- leaked function ""
- leaked function "from_ebnf"
- leaked function ""
- leaked function "from_vocab_and_metadata"
- leaked function "deserialize_json"
- leaked function "concat"
- leaked function "from_json_schema"
- leaked function "compile_regex"
- leaked function ""
- leaked function ""
- leaked function "serialize_json"
- leaked function "deserialize_json"
- leaked function "clear_cache"
- leaked function "_detect_metadata_from_hf"
- leaked function ""
- leaked function "accept_string"
- leaked function "from_regex"
- leaked function "accept_token"
- leaked function "dump_metadata"
- leaked function "_debug_print_internal_state"
- leaked function "compile_grammar"
- leaked function "compile_structural_tag"
- leaked function "is_terminated"
- leaked function ""
- leaked function "__init__"
- leaked function ""
- leaked function ""
- leaked function "compile_json_schema"
- leaked function "rollback"
- leaked function "get_cache_size_bytes"
- leaked function "from_structural_tag"
- leaked function "find_jump_forward_string"
- leaked function ""
- leaked function "builtin_json_grammar"
- leaked function "compile_builtin_json_grammar"
- leaked function "deserialize_json"
- leaked function "fill_next_token_bitmask"
- leaked function ""
nanobind: this is likely caused by a reference counting issue in the binding code.
(APIServer pid=2285197) INFO: Shutting down
(APIServer pid=2285197) INFO: Waiting for application shutdown.
(APIServer pid=2285197) INFO: Application shutdown complete.
================================================================================ warnings summary =================================================================================
<frozen importlib._bootstrap>:488
<frozen importlib._bootstrap>:488: DeprecationWarning: builtin type SwigPyPacked has no __module__ attribute
<frozen importlib._bootstrap>:488
<frozen importlib._bootstrap>:488: DeprecationWarning: builtin type SwigPyObject has no __module__ attribute
cmq/lib/python3.12/site-packages/schemathesis/generation/coverage.py:305
/home/vllm_25fall/cmq/vllm/cmq/lib/python3.12/site-packages/schemathesis/generation/coverage.py:305: DeprecationWarning: jsonschema.exceptions.RefResolutionError is deprecated as of version 4.18.0. If you wish to catch potential reference resolution errors, directly catch referencing.exceptions.Unresolvable.
ref_error: type[Exception] = jsonschema.RefResolutionError,
-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
==================================================================== 2 passed, 3 warnings in 105.28s (0:01:45) ====================================================================
…t backends (vllm-project#26656) Signed-off-by: MengqingCao <cmq0113@163.com> Signed-off-by: bbartels <benjamin@bartels.dev>
…t backends (vllm-project#26656) Signed-off-by: MengqingCao <cmq0113@163.com> Signed-off-by: Alberto Perdomo <aperdomo@redhat.com>
…t backends (vllm-project#26656) Signed-off-by: MengqingCao <cmq0113@163.com>
…t backends (vllm-project#26656) Signed-off-by: MengqingCao <cmq0113@163.com> Signed-off-by: 0xrushi <6279035+0xrushi@users.noreply.github.com>
Purpose
Tiny refactor on DeepSeek to make it reusable for different backends:
* allow MLAAttention to accept variable-length parameter lists
* add `enable_dsa_topk_indices_buffer` in `AttentionBackend` to flexibly determine whether to create the top-k indices buffer
* remove hardcoded CUDA device strings
Test Plan
Test pass with DeepSeek-V3.2-Exp
Test Result
Essential Elements of an Effective PR Description Checklist
`supported_models.md` and `examples` for a new model.