[DSA][MLA] Tiny refactor on DeepSeek to make it reusable for different backends #26656
vllm-bot merged 3 commits into vllm-project:main
Conversation
Code Review
This pull request introduces a small refactor to make the DeepSeek implementation more reusable across different backends. The changes include making MLAAttention accept variable-length parameter lists, adding a feature flag for DSA top-k indices buffer creation, and removing hardcoded CUDA device strings. While the changes are generally positive, I've identified a critical issue related to a potential KV cache layout mismatch and a high-severity issue concerning code duplication that should be addressed.
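The variable-length parameter change mentioned above can be pictured as follows. This is a hedged, self-contained sketch, not vLLM's actual MLAAttention signature; the extra keyword names (`indexer_topk`, `use_sparse`, `device_type`) are purely illustrative.

```python
# Hedged sketch: letting an attention layer accept a variable-length
# parameter list so backend-specific options can ride along without
# changing the core signature. Not vLLM's real MLAAttention code; the
# extra keyword names below are illustrative only.

class MLAAttention:
    def __init__(self, num_heads: int, head_size: int, **extra_impl_args):
        self.num_heads = num_heads
        self.head_size = head_size
        # Backend-specific options are captured and forwarded untouched,
        # so a new backend can add parameters without editing this class.
        self.extra_impl_args = extra_impl_args


# Different backends can pass different extras through the same signature:
attn_a = MLAAttention(8, 64, indexer_topk=2048)
attn_b = MLAAttention(8, 64, use_sparse=True, device_type="npu")
print(attn_a.extra_impl_args)  # {'indexer_topk': 2048}
print(attn_b.extra_impl_args)  # {'use_sparse': True, 'device_type': 'npu'}
```

The point of the design is that the shared layer stays backend-agnostic: only the backend implementation that ultimately consumes `extra_impl_args` needs to know what each option means.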
💡 Codex Review: https://github.com/vllm-project/vllm/blob/e860103d33ced506474fdd9b88610f0d0bc43101/vllm/model_executor/models/deepseek_v2.py#L472-L476
If Codex has suggestions, it will comment; otherwise it will react with 👍.
This pull request has merge conflicts that must be resolved before it can be merged.
…t backends

* allow MLAAttention to accept variable-length parameter lists
* add `enable_dsa_topk_indices_buffer` in `AttentionBackend` to flexibly determine whether to create the top-k indices buffer
* remove hardcoded CUDA device strings

Signed-off-by: MengqingCao <cmq0113@163.com>
Signed-off-by: MengqingCao <cmq0113@163.com>
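The `enable_dsa_topk_indices_buffer` flag described in the commit message can be sketched as a backend capability attribute. This is a hedged illustration, not vLLM's actual code: `SparseMLABackend` is a hypothetical subclass name, and a nested list stands in for a real tensor so the sketch runs anywhere.

```python
# Hedged sketch of a backend capability flag like the PR's
# `enable_dsa_topk_indices_buffer`: the base backend defaults to False,
# a backend that supports DSA opts in, and the runner consults the flag
# before allocating the top-k indices buffer.

class AttentionBackend:
    # Backends override this to request a DSA top-k indices buffer.
    enable_dsa_topk_indices_buffer: bool = False


class SparseMLABackend(AttentionBackend):  # hypothetical opt-in backend
    enable_dsa_topk_indices_buffer = True


def maybe_create_topk_indices_buffer(backend, num_tokens: int, topk: int):
    # Only allocate when the backend declares support for DSA top-k indices;
    # other backends skip the allocation entirely.
    if backend.enable_dsa_topk_indices_buffer:
        return [[0] * topk for _ in range(num_tokens)]  # stand-in for a tensor
    return None


print(maybe_create_topk_indices_buffer(SparseMLABackend, 4, 2))  # 4x2 buffer
print(maybe_create_topk_indices_buffer(AttentionBackend, 4, 2))  # None
```

A class-level flag like this keeps the decision with the backend rather than scattering backend-name checks through the model runner, which is what makes the DeepSeek path reusable for non-CUDA backends.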
Signed-off-by: MengqingCao <cmq0113@163.com>
@DarkLight1337 Could you help review this PR? Hoping it can be included in 0.11.1.
Tried to reproduce the failed CI locally, but everything passes. I think it is not a bug introduced by this PR? @DarkLight1337

VLLM_USE_MODELSCOPE=True pytest -sv tests/entrypoints/openai/test_completion_with_function_calling.py::test_tool_id_kimi_k2
INFO 10-15 13:58:18 [__init__.py:224] Automatically detected platform cuda.
/home/vllm_25fall/cmq/vllm/cmq/lib/python3.12/site-packages/pytest_asyncio/plugin.py:208: PytestDeprecationWarning: The configuration option "asyncio_default_fixture_loop_scope" is unset.
The event loop scope for asynchronous fixtures will default to the fixture caching scope. Future versions of pytest-asyncio will default the loop scope for asynchronous fixtures to function scope. Set the default fixture loop scope explicitly in order to avoid unexpected behavior in the future. Valid fixture loop scopes are: "function", "class", "module", "package", "session"
warnings.warn(PytestDeprecationWarning(_DEFAULT_FIXTURE_LOOP_SCOPE_UNSET))
=============================================================================== test session starts ===============================================================================
platform linux -- Python 3.12.11, pytest-8.3.5, pluggy-1.5.0 -- /home/vllm_25fall/cmq/vllm/cmq/bin/python3
cachedir: .pytest_cache
hypothesis profile 'default' -> database=DirectoryBasedExampleDatabase(PosixPath('/home/vllm_25fall/cmq/vllm/.hypothesis/examples'))
rootdir: /home/vllm_25fall/cmq/vllm
configfile: pyproject.toml
plugins: forked-1.6.0, schemathesis-3.39.15, asyncio-0.24.0, subtests-0.14.1, cov-6.3.0, rerunfailures-14.0, shard-0.1.2, hypothesis-6.131.0, mock-3.14.0, buildkite-test-collector-0.1.9, hydra-core-1.3.2, timeout-2.3.1, anyio-4.6.2.post1
asyncio: mode=Mode.STRICT, default_loop_scope=None
collecting ... [2025-10-15 13:58:21] INFO config.py:54: PyTorch version 2.8.0+cu128 available.
[2025-10-15 13:58:21] INFO config.py:66: Polars version 1.29.0 available.
collected 2 items
Running 2 items in this shard: tests/entrypoints/openai/test_completion_with_function_calling.py::test_tool_id_kimi_k2[required-True-Qwen/Qwen3-0.6B], tests/entrypoints/openai/test_completion_with_function_calling.py::test_tool_id_kimi_k2[required-False-Qwen/Qwen3-0.6B]
tests/entrypoints/openai/test_completion_with_function_calling.py::test_tool_id_kimi_k2[required-True-Qwen/Qwen3-0.6B] Downloading Model from https://www.modelscope.cn to directory: /home/vllm_25fall/.cache/modelscope/hub/models/Qwen/Qwen3-0.6B
2025-10-15 13:58:22,088 - modelscope - INFO - Target directory already exists, skipping creation.
Downloading Model from https://www.modelscope.cn to directory: /home/vllm_25fall/.cache/modelscope/hub/models/Qwen/Qwen3-0.6B
2025-10-15 13:58:22,474 - modelscope - INFO - Target directory already exists, skipping creation.
ERROR 10-15 13:58:22 [registry.py:587] Cached model info for class vllm.model_executor.models.qwen3.Qwen3ForCausalLM error.
ERROR 10-15 13:58:22 [registry.py:587] Traceback (most recent call last):
ERROR 10-15 13:58:22 [registry.py:587] File "/home/vllm_25fall/cmq/vllm/vllm/model_executor/models/registry.py", line 585, in _load_modelinfo_from_cache
ERROR 10-15 13:58:22 [registry.py:587] return _ModelInfo(**mi_dict["modelinfo"])
ERROR 10-15 13:58:22 [registry.py:587] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 10-15 13:58:22 [registry.py:587] TypeError: _ModelInfo.__init__() missing 1 required positional argument: 'supports_v0_only'
INFO 10-15 13:58:28 [model.py:645] Resolved architecture: Qwen3ForCausalLM
`torch_dtype` is deprecated! Use `dtype` instead!
WARNING 10-15 13:58:28 [model.py:1932] Casting torch.bfloat16 to torch.float16.
INFO 10-15 13:58:28 [model.py:1706] Using max model len 40960
Downloading Model from https://www.modelscope.cn to directory: /home/vllm_25fall/.cache/modelscope/hub/models/Qwen/Qwen3-0.6B
2025-10-15 13:58:28,536 - modelscope - INFO - Target directory already exists, skipping creation.
Launching RemoteOpenAIServer with: vllm serve Qwen/Qwen3-0.6B --dtype half --enable-auto-tool-choice --structured-outputs-config.backend xgrammar --tool-call-parser hermes --reasoning-parser qwen3 --gpu-memory-utilization 0.4 --port 43433 --seed 0 --hf-overrides {"model_type": "kimi_k2", "kv_lora_rank": null}
INFO 10-15 13:58:30 [__init__.py:224] Automatically detected platform cuda.
(APIServer pid=2285197) INFO 10-15 13:58:34 [api_server.py:1870] vLLM API server version 0.11.1rc2.dev25+g7cfa420f4
(APIServer pid=2285197) INFO 10-15 13:58:34 [utils.py:239] non-default args: {'model_tag': 'Qwen/Qwen3-0.6B', 'port': 43433, 'enable_auto_tool_choice': True, 'tool_call_parser': 'hermes', 'dtype': 'half', 'seed': 0, 'hf_overrides': {'model_type': 'kimi_k2', 'kv_lora_rank': None}, 'reasoning_parser': 'qwen3', 'gpu_memory_utilization': 0.4, 'structured_outputs_config': StructuredOutputsConfig(backend='xgrammar', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_parser='')}
(APIServer pid=2285197) Downloading Model from https://www.modelscope.cn to directory: /home/vllm_25fall/.cache/modelscope/hub/models/Qwen/Qwen3-0.6B
(APIServer pid=2285197) 2025-10-15 13:58:34,757 - modelscope - INFO - Target directory already exists, skipping creation.
(APIServer pid=2285197) Downloading Model from https://www.modelscope.cn to directory: /home/vllm_25fall/.cache/modelscope/hub/models/Qwen/Qwen3-0.6B
(APIServer pid=2285197) 2025-10-15 13:58:35,148 - modelscope - INFO - Target directory already exists, skipping creation.
(APIServer pid=2285197) INFO 10-15 13:58:35 [model.py:645] Resolved architecture: Qwen3ForCausalLM
(APIServer pid=2285197) `torch_dtype` is deprecated! Use `dtype` instead!
(APIServer pid=2285197) WARNING 10-15 13:58:35 [model.py:1932] Casting torch.bfloat16 to torch.float16.
(APIServer pid=2285197) INFO 10-15 13:58:35 [model.py:1706] Using max model len 40960
(APIServer pid=2285197) Downloading Model from https://www.modelscope.cn to directory: /home/vllm_25fall/.cache/modelscope/hub/models/Qwen/Qwen3-0.6B
(APIServer pid=2285197) 2025-10-15 13:58:35,595 - modelscope - INFO - Target directory already exists, skipping creation.
(APIServer pid=2285197) INFO 10-15 13:58:35 [scheduler.py:225] Chunked prefill is enabled with max_num_batched_tokens=8192.
(APIServer pid=2285197) Downloading Model from https://www.modelscope.cn to directory: /home/vllm_25fall/.cache/modelscope/hub/models/Qwen/Qwen3-0.6B
(APIServer pid=2285197) 2025-10-15 13:58:36,154 - modelscope - INFO - Target directory already exists, skipping creation.
(APIServer pid=2285197) Downloading Model from https://www.modelscope.cn to directory: /home/vllm_25fall/.cache/modelscope/hub/models/Qwen/Qwen3-0.6B
(APIServer pid=2285197) 2025-10-15 13:58:36,844 - modelscope - INFO - Target directory already exists, skipping creation.
INFO 10-15 13:58:39 [__init__.py:224] Automatically detected platform cuda.
(EngineCore_DP0 pid=2287306) INFO 10-15 13:58:42 [core.py:727] Waiting for init message from front-end.
(EngineCore_DP0 pid=2287306) INFO 10-15 13:58:42 [core.py:94] Initializing a V1 LLM engine (v0.11.1rc2.dev25+g7cfa420f4) with config: model='Qwen/Qwen3-0.6B', speculative_config=None, tokenizer='Qwen/Qwen3-0.6B', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=40960, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, data_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, device_config=cuda, structured_outputs_config=StructuredOutputsConfig(backend='xgrammar', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_parser='qwen3'), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None), seed=0, served_model_name=Qwen/Qwen3-0.6B, enable_prefix_caching=True, chunked_prefill_enabled=True, pooler_config=None, compilation_config={'level': 3, 'debug_dump_path': None, 'cache_dir': '', 'backend': '', 'custom_ops': [], 'splitting_ops': ['vllm::unified_attention', 'vllm::unified_attention_with_output', 'vllm::unified_mla_attention', 'vllm::unified_mla_attention_with_output', 'vllm::mamba_mixer2', 'vllm::mamba_mixer', 'vllm::short_conv', 'vllm::linear_attention', 'vllm::plamo2_mamba_mixer', 'vllm::gdn_attention', 'vllm::sparse_attn_indexer'], 'use_inductor': True, 'compile_sizes': [], 'inductor_compile_config': {'enable_auto_functionalized_v2': False}, 'inductor_passes': {}, 'cudagraph_mode': <CUDAGraphMode.FULL_AND_PIECEWISE: (2, 1)>, 'use_cudagraph': True, 'cudagraph_num_of_warmups': 1, 'cudagraph_capture_sizes': [512, 504, 496, 488, 480, 472, 464, 456, 448, 440, 432, 424, 416, 408, 400, 392, 384, 376, 368, 360, 352, 344, 336, 328, 320, 312, 304, 296, 288, 280, 272, 264, 256, 248, 240, 232, 224, 216, 208, 200, 192, 184, 176, 168, 160, 152, 144, 136, 128, 120, 
112, 104, 96, 88, 80, 72, 64, 56, 48, 40, 32, 24, 16, 8, 4, 2, 1], 'cudagraph_copy_inputs': False, 'full_cuda_graph': True, 'use_inductor_graph_partition': False, 'pass_config': {}, 'max_capture_size': 512, 'local_cache_dir': None}
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
(EngineCore_DP0 pid=2287306) INFO 10-15 13:58:45 [parallel_state.py:1325] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, TP rank 0, EP rank 0
(EngineCore_DP0 pid=2287306) INFO 10-15 13:58:45 [topk_topp_sampler.py:58] Using FlashInfer for top-p & top-k sampling.
(EngineCore_DP0 pid=2287306) INFO 10-15 13:58:45 [gpu_model_runner.py:2849] Starting to load model Qwen/Qwen3-0.6B...
(EngineCore_DP0 pid=2287306) INFO 10-15 13:58:45 [gpu_model_runner.py:2879] Loading model from scratch...
(EngineCore_DP0 pid=2287306) INFO 10-15 13:58:45 [cuda.py:405] Using Flash Attention backend on V1 engine.
(EngineCore_DP0 pid=2287306) Downloading Model from https://www.modelscope.cn to directory: /home/vllm_25fall/.cache/modelscope/hub/models/Qwen/Qwen3-0.6B
(EngineCore_DP0 pid=2287306) 2025-10-15 13:58:46,360 - modelscope - INFO - Target directory already exists, skipping creation.
Loading safetensors checkpoint shards: 0% Completed | 0/1 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:00<00:00, 5.41it/s]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:00<00:00, 5.41it/s]
(EngineCore_DP0 pid=2287306)
(EngineCore_DP0 pid=2287306) INFO 10-15 13:58:46 [default_loader.py:314] Loading weights took 0.28 seconds
(EngineCore_DP0 pid=2287306) INFO 10-15 13:58:46 [gpu_model_runner.py:2911] Model loading took 1.1201 GiB and 0.817729 seconds
(EngineCore_DP0 pid=2287306) INFO 10-15 13:58:50 [backends.py:604] Using cache directory: /home/vllm_25fall/.cache/vllm/torch_compile_cache/a7ccd4e3e2/rank_0_0/backbone for vLLM's torch.compile
(EngineCore_DP0 pid=2287306) INFO 10-15 13:58:50 [backends.py:618] Dynamo bytecode transform time: 3.69 s
(EngineCore_DP0 pid=2287306) WARNING 10-15 13:58:50 [partition_rules.py:41] Failed to resolve operator for Inductor partition: vllm::mamba_mixer2
(EngineCore_DP0 pid=2287306) WARNING 10-15 13:58:50 [partition_rules.py:41] Failed to resolve operator for Inductor partition: vllm::mamba_mixer
(EngineCore_DP0 pid=2287306) WARNING 10-15 13:58:50 [partition_rules.py:41] Failed to resolve operator for Inductor partition: vllm::short_conv
(EngineCore_DP0 pid=2287306) WARNING 10-15 13:58:50 [partition_rules.py:41] Failed to resolve operator for Inductor partition: vllm::linear_attention
(EngineCore_DP0 pid=2287306) WARNING 10-15 13:58:50 [partition_rules.py:41] Failed to resolve operator for Inductor partition: vllm::plamo2_mamba_mixer
(EngineCore_DP0 pid=2287306) WARNING 10-15 13:58:50 [partition_rules.py:41] Failed to resolve operator for Inductor partition: vllm::gdn_attention
(EngineCore_DP0 pid=2287306) INFO 10-15 13:58:53 [backends.py:243] Cache the graph for dynamic shape for later use
(EngineCore_DP0 pid=2287306) INFO 10-15 13:59:06 [backends.py:270] Compiling a graph for dynamic shape takes 15.32 s
(EngineCore_DP0 pid=2287306) INFO 10-15 13:59:11 [monitor.py:33] torch.compile takes 19.01 s in total
(EngineCore_DP0 pid=2287306) INFO 10-15 13:59:52 [gpu_worker.py:315] Available KV cache memory: 52.59 GiB
(EngineCore_DP0 pid=2287306) INFO 10-15 13:59:52 [kv_cache_utils.py:1199] GPU KV cache size: 492,320 tokens
(EngineCore_DP0 pid=2287306) INFO 10-15 13:59:52 [kv_cache_utils.py:1204] Maximum concurrency for 40,960 tokens per request: 12.02x
(EngineCore_DP0 pid=2287306) 2025-10-15 13:59:52,920 - INFO - autotuner.py:256 - flashinfer.jit: [Autotuner]: Autotuning process starts ...
(EngineCore_DP0 pid=2287306) 2025-10-15 13:59:52,979 - INFO - autotuner.py:262 - flashinfer.jit: [Autotuner]: Autotuning process ends
Capturing CUDA graphs (mixed prefill-decode, PIECEWISE): 100%|█████████████████████████████████████████████████████████████████████████████████████| 67/67 [00:02<00:00, 32.63it/s]
Capturing CUDA graphs (decode, FULL): 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████| 67/67 [00:01<00:00, 42.39it/s]
(EngineCore_DP0 pid=2287306) INFO 10-15 13:59:57 [gpu_model_runner.py:3801] Graph capturing finished in 4 secs, took 0.59 GiB
(EngineCore_DP0 pid=2287306) INFO 10-15 13:59:57 [core.py:240] init engine (profile, create kv cache, warmup model) took 70.19 seconds
(EngineCore_DP0 pid=2287306) Downloading Model from https://www.modelscope.cn to directory: /home/vllm_25fall/.cache/modelscope/hub/models/Qwen/Qwen3-0.6B
(EngineCore_DP0 pid=2287306) 2025-10-15 13:59:57,462 - modelscope - INFO - Target directory already exists, skipping creation.
(APIServer pid=2285197) INFO 10-15 13:59:58 [loggers.py:173] Engine 000: vllm cache_config_info with initialization after num_gpu_blocks is: 30770
(EngineCore_DP0 pid=2287306) INFO 10-15 13:59:58 [gc_utils.py:40] GC Debug Config. enabled:False,top_objects:-1
(APIServer pid=2285197) INFO 10-15 13:59:58 [api_server.py:1628] Supported tasks: ['generate']
(APIServer pid=2285197) Downloading Model from https://www.modelscope.cn to directory: /home/vllm_25fall/.cache/modelscope/hub/models/Qwen/Qwen3-0.6B
(APIServer pid=2285197) 2025-10-15 13:59:58,425 - modelscope - INFO - Target directory already exists, skipping creation.
(APIServer pid=2285197) WARNING 10-15 13:59:58 [model.py:1573] Default sampling parameters have been overridden by the model's Hugging Face generation config recommended from the model creator. If this is not intended, please relaunch vLLM instance with `--generation-config vllm`.
(APIServer pid=2285197) INFO 10-15 13:59:58 [serving_responses.py:147] Using default chat sampling params from model: {'temperature': 0.6, 'top_k': 20, 'top_p': 0.95}
(APIServer pid=2285197) INFO 10-15 13:59:58 [serving_responses.py:183] "auto" tool choice has been enabled please note that while the parallel_tool_calls client option is preset for compatibility reasons, it will be ignored.
(APIServer pid=2285197) INFO 10-15 13:59:58 [serving_engine.py:285] "auto" tool choice has been enabled please note that while the parallel_tool_calls client option is preset for compatibility reasons, it will be ignored.
(APIServer pid=2285197) Downloading Model from https://www.modelscope.cn to directory: /home/vllm_25fall/.cache/modelscope/hub/models/Qwen/Qwen3-0.6B
(APIServer pid=2285197) 2025-10-15 13:59:58,818 - modelscope - INFO - Target directory already exists, skipping creation.
(APIServer pid=2285197) INFO 10-15 13:59:58 [serving_chat.py:131] Using default chat sampling params from model: {'temperature': 0.6, 'top_k': 20, 'top_p': 0.95}
(APIServer pid=2285197) Downloading Model from https://www.modelscope.cn to directory: /home/vllm_25fall/.cache/modelscope/hub/models/Qwen/Qwen3-0.6B
(APIServer pid=2285197) 2025-10-15 13:59:59,202 - modelscope - INFO - Target directory already exists, skipping creation.
(APIServer pid=2285197) INFO 10-15 13:59:59 [serving_completion.py:67] Using default completion sampling params from model: {'temperature': 0.6, 'top_k': 20, 'top_p': 0.95}
(APIServer pid=2285197) INFO 10-15 13:59:59 [api_server.py:1939] Starting vLLM API server 0 on http://0.0.0.0:43433
(APIServer pid=2285197) INFO 10-15 13:59:59 [launcher.py:38] Available routes are:
(APIServer pid=2285197) INFO 10-15 13:59:59 [launcher.py:46] Route: /openapi.json, Methods: HEAD, GET
(APIServer pid=2285197) INFO 10-15 13:59:59 [launcher.py:46] Route: /docs, Methods: HEAD, GET
(APIServer pid=2285197) INFO 10-15 13:59:59 [launcher.py:46] Route: /docs/oauth2-redirect, Methods: HEAD, GET
(APIServer pid=2285197) INFO 10-15 13:59:59 [launcher.py:46] Route: /redoc, Methods: HEAD, GET
(APIServer pid=2285197) INFO 10-15 13:59:59 [launcher.py:46] Route: /health, Methods: GET
(APIServer pid=2285197) INFO 10-15 13:59:59 [launcher.py:46] Route: /load, Methods: GET
(APIServer pid=2285197) INFO 10-15 13:59:59 [launcher.py:46] Route: /ping, Methods: POST
(APIServer pid=2285197) INFO 10-15 13:59:59 [launcher.py:46] Route: /ping, Methods: GET
(APIServer pid=2285197) INFO 10-15 13:59:59 [launcher.py:46] Route: /tokenize, Methods: POST
(APIServer pid=2285197) INFO 10-15 13:59:59 [launcher.py:46] Route: /detokenize, Methods: POST
(APIServer pid=2285197) INFO 10-15 13:59:59 [launcher.py:46] Route: /v1/models, Methods: GET
(APIServer pid=2285197) INFO 10-15 13:59:59 [launcher.py:46] Route: /version, Methods: GET
(APIServer pid=2285197) INFO 10-15 13:59:59 [launcher.py:46] Route: /v1/responses, Methods: POST
(APIServer pid=2285197) INFO 10-15 13:59:59 [launcher.py:46] Route: /v1/responses/{response_id}, Methods: GET
(APIServer pid=2285197) INFO 10-15 13:59:59 [launcher.py:46] Route: /v1/responses/{response_id}/cancel, Methods: POST
(APIServer pid=2285197) INFO 10-15 13:59:59 [launcher.py:46] Route: /v1/chat/completions, Methods: POST
(APIServer pid=2285197) INFO 10-15 13:59:59 [launcher.py:46] Route: /v1/completions, Methods: POST
(APIServer pid=2285197) INFO 10-15 13:59:59 [launcher.py:46] Route: /v1/embeddings, Methods: POST
(APIServer pid=2285197) INFO 10-15 13:59:59 [launcher.py:46] Route: /pooling, Methods: POST
(APIServer pid=2285197) INFO 10-15 13:59:59 [launcher.py:46] Route: /classify, Methods: POST
(APIServer pid=2285197) INFO 10-15 13:59:59 [launcher.py:46] Route: /score, Methods: POST
(APIServer pid=2285197) INFO 10-15 13:59:59 [launcher.py:46] Route: /v1/score, Methods: POST
(APIServer pid=2285197) INFO 10-15 13:59:59 [launcher.py:46] Route: /v1/audio/transcriptions, Methods: POST
(APIServer pid=2285197) INFO 10-15 13:59:59 [launcher.py:46] Route: /v1/audio/translations, Methods: POST
(APIServer pid=2285197) INFO 10-15 13:59:59 [launcher.py:46] Route: /rerank, Methods: POST
(APIServer pid=2285197) INFO 10-15 13:59:59 [launcher.py:46] Route: /v1/rerank, Methods: POST
(APIServer pid=2285197) INFO 10-15 13:59:59 [launcher.py:46] Route: /v2/rerank, Methods: POST
(APIServer pid=2285197) INFO 10-15 13:59:59 [launcher.py:46] Route: /scale_elastic_ep, Methods: POST
(APIServer pid=2285197) INFO 10-15 13:59:59 [launcher.py:46] Route: /is_scaling_elastic_ep, Methods: POST
(APIServer pid=2285197) INFO 10-15 13:59:59 [launcher.py:46] Route: /invocations, Methods: POST
(APIServer pid=2285197) INFO 10-15 13:59:59 [launcher.py:46] Route: /metrics, Methods: GET
(APIServer pid=2285197) INFO: Started server process [2285197]
(APIServer pid=2285197) INFO: Waiting for application startup.
(APIServer pid=2285197) INFO: Application startup complete.
(APIServer pid=2285197) INFO: 127.0.0.1:36700 - "GET /health HTTP/1.1" 200 OK
(APIServer pid=2285197) /home/vllm_25fall/cmq/vllm/cmq/lib/python3.12/site-packages/pydantic/_internal/_generate_schema.py:2249: UnsupportedFieldAttributeWarning: The 'deprecated' attribute with value 'max_tokens is deprecated in favor of the max_completion_tokens field' was provided to the `Field()` function, which has no effect in the context it was used. 'deprecated' is field-specific metadata, and can only be attached to a model field using `Annotated` metadata or by assignment. This may have happened because an `Annotated` type alias using the `type` statement was used, or if the `Field()` function was attached to a single member of a union type.
(APIServer pid=2285197) warnings.warn(
(APIServer pid=2285197) INFO 10-15 13:59:59 [chat_utils.py:545] Detected the chat template content format to be 'string'. You can set `--chat-template-content-format` to override this.
(APIServer pid=2285197) INFO: 127.0.0.1:36716 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-10-15 13:59:59] INFO _client.py:1786: HTTP Request: POST http://localhost:43433/v1/chat/completions "HTTP/1.1 200 OK"
PASSED
tests/entrypoints/openai/test_completion_with_function_calling.py::test_tool_id_kimi_k2[required-False-Qwen/Qwen3-0.6B] (APIServer pid=2285197) INFO: 127.0.0.1:55960 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-10-15 14:00:03] INFO _client.py:1786: HTTP Request: POST http://localhost:43433/v1/chat/completions "HTTP/1.1 200 OK"
PASSED(APIServer pid=2285197) INFO 10-15 14:00:04 [launcher.py:110] Shutting down FastAPI HTTP server.
[rank0]:[W1015 14:00:04.047004909 ProcessGroupNCCL.cpp:1538] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())
nanobind: leaked 2 instances!
- leaked instance 0x7f106100b438 of type "xgrammar.xgrammar_bindings.CompiledGrammar"
- leaked instance 0x7f106100a898 of type "xgrammar.xgrammar_bindings.GrammarMatcher"
nanobind: leaked 5 types!
- leaked type "xgrammar.xgrammar_bindings.CompiledGrammar"
- leaked type "xgrammar.xgrammar_bindings.TokenizerInfo"
- leaked type "xgrammar.xgrammar_bindings.Grammar"
- leaked type "xgrammar.xgrammar_bindings.GrammarMatcher"
- leaked type "xgrammar.xgrammar_bindings.GrammarCompiler"
nanobind: leaked 47 functions!
- leaked function "serialize_json"
- leaked function "reset"
- leaked function "__init__"
- leaked function "union"
- leaked function "serialize_json"
- leaked function "to_string"
- leaked function ""
- leaked function ""
- leaked function "__init__"
- leaked function ""
- leaked function "from_ebnf"
- leaked function ""
- leaked function "from_vocab_and_metadata"
- leaked function "deserialize_json"
- leaked function "concat"
- leaked function "from_json_schema"
- leaked function "compile_regex"
- leaked function ""
- leaked function ""
- leaked function "serialize_json"
- leaked function "deserialize_json"
- leaked function "clear_cache"
- leaked function "_detect_metadata_from_hf"
- leaked function ""
- leaked function "accept_string"
- leaked function "from_regex"
- leaked function "accept_token"
- leaked function "dump_metadata"
- leaked function "_debug_print_internal_state"
- leaked function "compile_grammar"
- leaked function "compile_structural_tag"
- leaked function "is_terminated"
- leaked function ""
- leaked function "__init__"
- leaked function ""
- leaked function ""
- leaked function "compile_json_schema"
- leaked function "rollback"
- leaked function "get_cache_size_bytes"
- leaked function "from_structural_tag"
- leaked function "find_jump_forward_string"
- leaked function ""
- leaked function "builtin_json_grammar"
- leaked function "compile_builtin_json_grammar"
- leaked function "deserialize_json"
- leaked function "fill_next_token_bitmask"
- leaked function ""
nanobind: this is likely caused by a reference counting issue in the binding code.
(APIServer pid=2285197) INFO: Shutting down
(APIServer pid=2285197) INFO: Waiting for application shutdown.
(APIServer pid=2285197) INFO: Application shutdown complete.
================================================================================ warnings summary =================================================================================
<frozen importlib._bootstrap>:488
<frozen importlib._bootstrap>:488: DeprecationWarning: builtin type SwigPyPacked has no __module__ attribute
<frozen importlib._bootstrap>:488
<frozen importlib._bootstrap>:488: DeprecationWarning: builtin type SwigPyObject has no __module__ attribute
cmq/lib/python3.12/site-packages/schemathesis/generation/coverage.py:305
/home/vllm_25fall/cmq/vllm/cmq/lib/python3.12/site-packages/schemathesis/generation/coverage.py:305: DeprecationWarning: jsonschema.exceptions.RefResolutionError is deprecated as of version 4.18.0. If you wish to catch potential reference resolution errors, directly catch referencing.exceptions.Unresolvable.
ref_error: type[Exception] = jsonschema.RefResolutionError,
-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
==================================================================== 2 passed, 3 warnings in 105.28s (0:01:45) ====================================================================
…t backends (vllm-project#26656) Signed-off-by: MengqingCao <cmq0113@163.com> Signed-off-by: bbartels <benjamin@bartels.dev>
…t backends (vllm-project#26656) Signed-off-by: MengqingCao <cmq0113@163.com> Signed-off-by: Alberto Perdomo <aperdomo@redhat.com>
…t backends (vllm-project#26656) Signed-off-by: MengqingCao <cmq0113@163.com>
…t backends (vllm-project#26656) Signed-off-by: MengqingCao <cmq0113@163.com> Signed-off-by: 0xrushi <6279035+0xrushi@users.noreply.github.com>
Purpose
Tiny refactor on DeepSeek to make it reusable for different backends:
* allow MLAAttention to accept variable-length parameter lists
* add `enable_dsa_topk_indices_buffer` in `AttentionBackend` to flexibly determine whether to create the top-k indices buffer
* remove hardcoded CUDA device strings
Test Plan
Test pass with DeepSeek-V3.2-Exp
Test Result
Essential Elements of an Effective PR Description Checklist
`supported_models.md` and `examples` for a new model.