[Bugfix] Fix KV cache sizing and allocation for hybrid Mamba/attention models #37429

Open

swtb3 wants to merge 4 commits into vllm-project:main from swtb3:fix/hybrid-mamba-compact-allocation
Conversation

@swtb3

@swtb3 swtb3 commented Mar 18, 2026

Summary

  • Fix KV cache block count overestimation for hybrid Mamba/attention models (e.g. Qwen3.5) by detecting mixed groups and sizing each independently
  • Fix token capacity reporting to only count Mamba groups when mamba_cache_mode="all"
  • Add compact Mamba allocation: Mamba layers self-manage a dedicated O(1) block pool instead of sharing the attention pool, eliminating 7x memory waste and OOM
  • Fix compact pool exhaustion causing 4x throughput regression by capping "none" mode allocation at 1+spec blocks per request and making remove_skipped_blocks a no-op for permanent Mamba state
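
A minimal, self-contained sketch of the compact-allocation idea above. Every name in it is hypothetical; this is not the vLLM API, just the sizing rule the bullets describe (Mamba pool sized per request at 1 + spec blocks, attention pool taking the remaining memory):

# Hypothetical sketch of "compact" Mamba allocation. Mamba state is O(1) per
# request, so its pool is sized by concurrent requests (plus spec-decode
# slack), not by token count. Names are illustrative, not vLLM's.
from dataclasses import dataclass


@dataclass
class PoolSizes:
    attention_blocks: int  # scales with cacheable tokens
    mamba_blocks: int      # scales only with concurrent requests


def size_pools(avail_bytes: int, attn_block_bytes: int, mamba_block_bytes: int,
               max_num_seqs: int, num_spec_tokens: int = 0) -> PoolSizes:
    # Cap from the summary: each request needs at most 1 + spec Mamba blocks.
    mamba_blocks = max_num_seqs * (1 + num_spec_tokens)
    leftover = max(avail_bytes - mamba_blocks * mamba_block_bytes, 0)
    return PoolSizes(attention_blocks=leftover // attn_block_bytes,
                     mamba_blocks=mamba_blocks)


print(size_pools(avail_bytes=14 << 30, attn_block_bytes=2 << 20,
                 mamba_block_bytes=8 << 20, max_num_seqs=32,
                 num_spec_tokens=2))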

Supersedes #37124 (closed due to rebase notification issue).

Test plan

  • 5 new TDD tests for compact "none" mode performance fix
  • 8 new tests for hybrid Mamba/attention KV cache sizing
  • Benchmark on Qwen3.5-27B-FP8 with MTP spec decoding to confirm throughput recovery

AI assistance was used (Claude). This is not duplicating an existing PR — it supersedes #37124 with a clean branch.

🤖 Generated with Claude Code

…n models

Fix KV cache overestimation, memory waste, OOM, and throughput regression
  for hybrid Mamba/attention models (e.g. Qwen3.5).

  - Fix KV cache block count overestimation by detecting mixed Mamba/attention
    groups and sizing each independently instead of using worst-case uniform sizing
  - Fix token capacity reporting to only count Mamba groups when
    mamba_cache_mode="all" (prefix caching)
  - Add compact Mamba allocation: Mamba layers self-manage a small dedicated
    block pool (O(1) per request) instead of sharing the attention pool,
    eliminating 7x memory waste and OOM on large models
  - Fix compact pool exhaustion causing 4x throughput regression by capping
    "none" mode allocation at 1+spec blocks per request (matching kernel usage)
    and making remove_skipped_blocks a no-op for permanent Mamba state

  Co-authored-by: Claude <noreply@anthropic.com>

Signed-off-by: swtb3 <135991636+swtb3@users.noreply.github.com>
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request introduces a new 'compact Mamba allocation' strategy to significantly improve memory efficiency and token capacity for hybrid Mamba+attention models like Qwen3.5. The changes involve decoupling Mamba layer memory allocation from attention layers, allowing Mamba (which has O(1) state per request) to use fewer blocks. This is achieved by adding new logic to kv_cache_utils.py for determining KV cache configurations and concurrency estimates for mixed models, including separate block pools for Mamba layers in 'none' and 'align' cache modes. The MambaManager in single_type_kv_cache_manager.py is updated to handle this compact allocation, managing its own private block pool and ensuring blocks are allocated and freed correctly without interfering with the shared attention block pool. Extensive new test cases are added to validate the correctness, efficiency, and concurrency behavior of this new allocation scheme, including regression guards for pure attention/Mamba models and specific tests for the 'none' and 'align' Mamba cache modes.

@mergify
Contributor

mergify Bot commented Mar 18, 2026

Hi @swtb3, the pre-commit checks have failed. Please run:

uv pip install 'pre-commit>=4.5.1'
pre-commit install
pre-commit run --all-files

Then, commit the changes and push to your branch.

For future commits, pre-commit will run automatically on changed files before each commit.

Tip

Is mypy failing?
mypy is run differently in CI. If the failure is related to this check, please use the following command to run it locally:
# For mypy (substitute "3.10" with the failing version if needed)
pre-commit run --hook-stage manual mypy-3.10

…ator wiring

Add mamba_num_blocks field to KVCacheConfig and pass it through
  kv_cache_coordinator to MambaManager. These were missed when
  squashing the compact Mamba allocation commits onto a fresh branch.

  Co-authored-by: Claude <noreply@anthropic.com>

Signed-off-by: swtb3 <135991636+swtb3@users.noreply.github.com>
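
For orientation, a rough sketch of the wiring this commit describes. Only the names mamba_num_blocks, KVCacheConfig, MambaManager, and the coordinator come from the commit message; the structure below is assumed and heavily simplified, not vLLM code:

from dataclasses import dataclass


@dataclass
class KVCacheConfig:
    num_blocks: int            # shared attention pool
    mamba_num_blocks: int = 0  # dedicated compact Mamba pool (the new field)


class MambaManager:
    def __init__(self, num_blocks: int) -> None:
        # Private free list for the compact Mamba pool.
        self.free_blocks = list(range(num_blocks))


class KVCacheCoordinator:
    def __init__(self, cfg: KVCacheConfig) -> None:
        # The commit threads mamba_num_blocks through here to the manager.
        self.mamba_manager = MambaManager(cfg.mamba_num_blocks)


coord = KVCacheCoordinator(KVCacheConfig(num_blocks=4096, mamba_num_blocks=96))
print(len(coord.mamba_manager.free_blocks))  # 96
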
@repne

repne commented Mar 18, 2026

I can still reproduce the TTFT performance degradation. This is what I see in the startup log:

Startup log
(APIServer pid=5653) INFO 03-18 14:23:56 [utils.py:297]
(APIServer pid=5653) INFO 03-18 14:23:56 [utils.py:297]        █     █     █▄   ▄█
(APIServer pid=5653) INFO 03-18 14:23:56 [utils.py:297]  ▄▄ ▄█ █     █     █ ▀▄▀ █  version 0.17.1rc1.dev416+g9dfb5ae0f.d20260318
(APIServer pid=5653) INFO 03-18 14:23:56 [utils.py:297]   █▄█▀ █     █     █     █  model   Qwen/Qwen3.5-27B-FP8
(APIServer pid=5653) INFO 03-18 14:23:56 [utils.py:297]    ▀▀  ▀▀▀▀▀ ▀▀▀▀▀ ▀     ▀
(APIServer pid=5653) INFO 03-18 14:23:56 [utils.py:297]
(APIServer pid=5653) INFO 03-18 14:23:56 [utils.py:233] non-default args: {'model_tag': 'Qwen/Qwen3.5-27B-FP8', 'enable_auto_tool_choice': True, 'tool_call_parser': 'qwen3_coder', 'model': 'Qwen/Qwen3.5-27B-FP8', 'max_model_len': 262144, 'load_format': 'instanttensor', 'attention_backend': 'TRITON_ATTN', 'reasoning_parser': 'qwen3', 'tensor_parallel_size': 2, 'block_size': 32, 'gpu_memory_utilization': 0.94, 'enable_prefix_caching': True, 'language_model_only': True, 'max_num_batched_tokens': 8192, 'max_num_seqs': 32, 'speculative_config': {'method': 'mtp', 'num_speculative_tokens': 2, 'rejection_sample_method': 'probabilistic'}, 'optimization_level': '3'}
(APIServer pid=5653) INFO 03-18 14:23:57 [model.py:533] Resolved architecture: Qwen3_5ForConditionalGeneration
(APIServer pid=5653) INFO 03-18 14:23:57 [model.py:1582] Using max model len 262144
(APIServer pid=5653) INFO 03-18 14:23:58 [model.py:533] Resolved architecture: Qwen3_5MTP
(APIServer pid=5653) INFO 03-18 14:23:58 [model.py:1582] Using max model len 262144
(APIServer pid=5653) WARNING 03-18 14:23:58 [speculative.py:499] Enabling num_speculative_tokens > 1 will run multiple times of forward on same MTP layer,which may result in lower acceptance rate
(APIServer pid=5653) INFO 03-18 14:23:58 [scheduler.py:231] Chunked prefill is enabled with max_num_batched_tokens=8192.
(APIServer pid=5653) WARNING 03-18 14:23:58 [config.py:372] Mamba cache mode is set to 'align' for Qwen3_5ForConditionalGeneration by default when prefix caching is enabled
(APIServer pid=5653) INFO 03-18 14:23:58 [config.py:392] Warning: Prefix caching in Mamba cache 'align' mode is currently enabled. Its support for Mamba layers is experimental. Please report any issues you may observe.
(APIServer pid=5653) INFO 03-18 14:23:58 [config.py:212] Setting attention block size to 800 tokens to ensure that attention page size is >= mamba page size.
(APIServer pid=5653) INFO 03-18 14:23:58 [config.py:243] Padding mamba page size by 0.88% to ensure that mamba page size and attention page size are exactly equal.
(APIServer pid=5653) INFO 03-18 14:23:58 [vllm.py:750] Asynchronous scheduling is enabled.
(APIServer pid=5653) INFO 03-18 14:23:58 [compilation.py:289] Enabled custom fusions: norm_quant, act_quant
(APIServer pid=5653) INFO 03-18 14:23:59 [registry.py:126] All limits of multimodal modalities supported by the model are set to 0, running in text-only mode.
/home/repne/vllm/.venv/lib/python3.12/site-packages/torch/cuda/__init__.py:65: FutureWarning: The pynvml package is deprecated. Please install nvidia-ml-py instead. If you did not install pynvml directly, please report this to the maintainers of the package that installed pynvml for you.
  import pynvml  # type: ignore[import]
(EngineCore pid=5738) INFO 03-18 14:24:03 [core.py:103] Initializing a V1 LLM engine (v0.17.1rc1.dev416+g9dfb5ae0f.d20260318) with config: model='Qwen/Qwen3.5-27B-FP8', speculative_config=SpeculativeConfig(method='mtp', model='Qwen/Qwen3.5-27B-FP8', num_spec_tokens=2), tokenizer='Qwen/Qwen3.5-27B-FP8', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=262144, download_dir=None, load_format=instanttensor, tensor_parallel_size=2, pipeline_parallel_size=1, data_parallel_size=1, decode_context_parallel_size=1, dcp_comm_backend=ag_rs, disable_custom_all_reduce=False, quantization=fp8, enforce_eager=False, enable_return_routed_experts=False, kv_cache_dtype=auto, device_config=cuda, structured_outputs_config=StructuredOutputsConfig(backend='auto', disable_any_whitespace=False, disable_additional_properties=False, reasoning_parser='qwen3', reasoning_parser_plugin='', enable_in_reasoning=False), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None, kv_cache_metrics=False, kv_cache_metrics_sample=0.01, cudagraph_metrics=False, enable_layerwise_nvtx_tracing=False, enable_mfu_metrics=False, enable_mm_processor_stats=False, enable_logging_iteration_details=False), seed=0, served_model_name=Qwen/Qwen3.5-27B-FP8, enable_prefix_caching=True, enable_chunked_prefill=True, pooler_config=None, compilation_config={'mode': <CompilationMode.VLLM_COMPILE: 3>, 'debug_dump_path': None, 'cache_dir': '', 'compile_cache_save_format': 'binary', 'backend': 'inductor', 'custom_ops': ['+quant_fp8', 'none', '+quant_fp8'], 'splitting_ops': ['vllm::unified_attention', 'vllm::unified_attention_with_output', 'vllm::unified_mla_attention', 'vllm::unified_mla_attention_with_output', 'vllm::mamba_mixer2', 'vllm::mamba_mixer', 'vllm::short_conv', 'vllm::linear_attention', 'vllm::plamo2_mamba_mixer', 'vllm::gdn_attention_core', 'vllm::olmo_hybrid_gdn_full_forward', 'vllm::kda_attention', 'vllm::sparse_attn_indexer', 'vllm::rocm_aiter_sparse_attn_indexer', 'vllm::unified_kv_cache_update', 'vllm::unified_mla_kv_cache_update'], 'compile_mm_encoder': False, 'compile_sizes': [], 'compile_ranges_endpoints': [8192], 'inductor_compile_config': {'enable_auto_functionalized_v2': False, 'combo_kernels': True, 'benchmark_combo_kernel': True}, 'inductor_passes': {}, 'cudagraph_mode': <CUDAGraphMode.FULL_AND_PIECEWISE: (2, 1)>, 'cudagraph_num_of_warmups': 1, 'cudagraph_capture_sizes': [1, 2, 4, 8, 16, 24, 32, 40, 48, 56, 64, 72, 80, 88, 96, 104, 112, 120, 128, 136, 144, 152, 160, 168, 176, 184, 192], 'cudagraph_copy_inputs': False, 'cudagraph_specialize_lora': True, 'use_inductor_graph_partition': False, 'pass_config': {'fuse_norm_quant': True, 'fuse_act_quant': True, 'fuse_attn_quant': False, 'enable_sp': False, 'fuse_gemm_comms': False, 'fuse_allreduce_rms': False}, 'max_cudagraph_capture_size': 192, 'dynamic_shapes_config': {'type': <DynamicShapesType.BACKED: 'backed'>, 'evaluate_guards': False, 'assume_32_bit_indexing': False}, 'local_cache_dir': None, 'fast_moe_cold_start': False, 'static_all_moe_layers': []}
(EngineCore pid=5738) WARNING 03-18 14:24:03 [multiproc_executor.py:1001] Reducing Torch parallelism from 16 threads to 1 to avoid unnecessary CPU contention. Set OMP_NUM_THREADS in the external environment to tune this value as needed.
(EngineCore pid=5738) INFO 03-18 14:24:03 [multiproc_executor.py:134] DP group leader: node_rank=0, node_rank_within_dp=0, master_addr=127.0.0.1, mq_connect_ip=192.168.1.182 (local), world_size=2, local_world_size=2
/home/repne/vllm/.venv/lib/python3.12/site-packages/torch/cuda/__init__.py:65: FutureWarning: The pynvml package is deprecated. Please install nvidia-ml-py instead. If you did not install pynvml directly, please report this to the maintainers of the package that installed pynvml for you.
  import pynvml  # type: ignore[import]
/home/repne/vllm/.venv/lib/python3.12/site-packages/torch/cuda/__init__.py:65: FutureWarning: The pynvml package is deprecated. Please install nvidia-ml-py instead. If you did not install pynvml directly, please report this to the maintainers of the package that installed pynvml for you.
  import pynvml  # type: ignore[import]
INFO 03-18 14:24:07 [registry.py:126] All limits of multimodal modalities supported by the model are set to 0, running in text-only mode.
INFO 03-18 14:24:07 [registry.py:126] All limits of multimodal modalities supported by the model are set to 0, running in text-only mode.
(Worker pid=5809) INFO 03-18 14:24:07 [parallel_state.py:1395] world_size=2 rank=0 local_rank=0 distributed_init_method=tcp://127.0.0.1:36145 backend=nccl
(Worker pid=5810) INFO 03-18 14:24:07 [parallel_state.py:1395] world_size=2 rank=1 local_rank=1 distributed_init_method=tcp://127.0.0.1:36145 backend=nccl
(Worker pid=5809) INFO 03-18 14:24:07 [pynccl.py:111] vLLM is using nccl==2.29.7
(Worker pid=5809) WARNING 03-18 14:24:07 [symm_mem.py:67] SymmMemCommunicator: Device capability 12.0 not supported, communicator is not available.
(Worker pid=5810) WARNING 03-18 14:24:07 [symm_mem.py:67] SymmMemCommunicator: Device capability 12.0 not supported, communicator is not available.
(Worker pid=5810) INFO 03-18 14:24:07 [parallel_state.py:1717] rank 1 in world size 2 is assigned as DP rank 0, PP rank 0, PCP rank 0, TP rank 1, EP rank N/A, EPLB rank N/A
(Worker pid=5809) INFO 03-18 14:24:07 [parallel_state.py:1717] rank 0 in world size 2 is assigned as DP rank 0, PP rank 0, PCP rank 0, TP rank 0, EP rank N/A, EPLB rank N/A
(Worker pid=5810) WARNING 03-18 14:24:07 [__init__.py:204] min_p and logit_bias parameters won't work with speculative decoding.
(Worker pid=5809) WARNING 03-18 14:24:07 [__init__.py:204] min_p and logit_bias parameters won't work with speculative decoding.
(Worker_TP0 pid=5809) INFO 03-18 14:24:07 [gpu_model_runner.py:4506] Starting to load model Qwen/Qwen3.5-27B-FP8...
(Worker_TP1 pid=5810) INFO 03-18 14:24:08 [cuda.py:389] Using backend AttentionBackendEnum.FLASH_ATTN for vit attention
(Worker_TP1 pid=5810) INFO 03-18 14:24:08 [mm_encoder_attention.py:230] Using AttentionBackendEnum.FLASH_ATTN for MMEncoderAttention.
(Worker_TP1 pid=5810) INFO 03-18 14:24:08 [qwen3_next.py:202] Using Triton/FLA GDN prefill kernel
(Worker_TP0 pid=5809) INFO 03-18 14:24:08 [cuda.py:389] Using backend AttentionBackendEnum.FLASH_ATTN for vit attention
(Worker_TP0 pid=5809) INFO 03-18 14:24:08 [mm_encoder_attention.py:230] Using AttentionBackendEnum.FLASH_ATTN for MMEncoderAttention.
(Worker_TP0 pid=5809) INFO 03-18 14:24:08 [qwen3_next.py:202] Using Triton/FLA GDN prefill kernel
(Worker_TP1 pid=5810) INFO 03-18 14:24:08 [cuda.py:273] Using AttentionBackendEnum.TRITON_ATTN backend.
(Worker_TP0 pid=5809) INFO 03-18 14:24:08 [cuda.py:273] Using AttentionBackendEnum.TRITON_ATTN backend.
Loading safetensors using InstantTensor loader:   0% Completed | 0/1606 [00:00<?, ?it/s]
Loading safetensors using InstantTensor loader:   0% Completed | 1/1606 [00:01<33:15,  1.24s/it]
Loading safetensors using InstantTensor loader:   0% Completed | 2/1606 [00:02<31:43,  1.19s/it]
Loading safetensors using InstantTensor loader:   2% Completed | 38/1606 [00:03<01:43, 15.13it/s]
Loading safetensors using InstantTensor loader:   4% Completed | 65/1606 [00:04<01:20, 19.25it/s]
Loading safetensors using InstantTensor loader:   8% Completed | 122/1606 [00:05<00:46, 31.88it/s]
Loading safetensors using InstantTensor loader:  11% Completed | 179/1606 [00:06<00:36, 39.60it/s]
Loading safetensors using InstantTensor loader:  15% Completed | 237/1606 [00:07<00:30, 45.14it/s]
Loading safetensors using InstantTensor loader:  18% Completed | 294/1606 [00:08<00:27, 48.13it/s]
Loading safetensors using InstantTensor loader:  22% Completed | 349/1606 [00:09<00:25, 49.39it/s]
Loading safetensors using InstantTensor loader:  25% Completed | 409/1606 [00:10<00:23, 51.68it/s]
Loading safetensors using InstantTensor loader:  30% Completed | 475/1606 [00:11<00:20, 55.91it/s]
Loading safetensors using InstantTensor loader:  34% Completed | 547/1606 [00:12<00:17, 60.50it/s]
Loading safetensors using InstantTensor loader:  99% Completed | 1584/1606 [00:13<00:00, 347.59it/s]
Loading safetensors using InstantTensor loader: 100% Completed | 1606/1606 [00:13<00:00, 116.70it/s]
(Worker_TP0 pid=5809)
(Worker_TP0 pid=5809) INFO 03-18 14:24:23 [default_loader.py:384] Loading weights took 13.86 seconds
(Worker_TP0 pid=5809) INFO 03-18 14:24:23 [gpu_model_runner.py:4530] Loading drafter model...
(Worker_TP1 pid=5810) INFO 03-18 14:24:23 [gpu_model_runner.py:4530] Loading drafter model...
Loading safetensors using InstantTensor loader:   0% Completed | 0/1606 [00:00<?, ?it/s]
Loading safetensors using InstantTensor loader:   0% Completed | 1/1606 [00:01<36:03,  1.35s/it]
Loading safetensors using InstantTensor loader:   0% Completed | 2/1606 [00:02<32:31,  1.22s/it]
Loading safetensors using InstantTensor loader:   2% Completed | 38/1606 [00:03<01:46, 14.68it/s]
Loading safetensors using InstantTensor loader:   6% Completed | 97/1606 [00:04<00:50, 29.94it/s]
Loading safetensors using InstantTensor loader:   8% Completed | 129/1606 [00:05<00:49, 29.68it/s]
Loading safetensors using InstantTensor loader:  11% Completed | 183/1606 [00:06<00:38, 37.15it/s]
Loading safetensors using InstantTensor loader:  15% Completed | 238/1606 [00:07<00:32, 42.53it/s]
Loading safetensors using InstantTensor loader:  18% Completed | 295/1606 [00:08<00:28, 46.62it/s]
Loading safetensors using InstantTensor loader:  22% Completed | 351/1606 [00:09<00:25, 49.43it/s]
Loading safetensors using InstantTensor loader:  26% Completed | 412/1606 [00:10<00:22, 52.53it/s]
Loading safetensors using InstantTensor loader:  29% Completed | 465/1606 [00:11<00:22, 51.59it/s]
Loading safetensors using InstantTensor loader:  34% Completed | 547/1606 [00:12<00:17, 59.92it/s]
Loading safetensors using InstantTensor loader:  43% Completed | 693/1606 [00:13<00:10, 84.62it/s]
Loading safetensors using InstantTensor loader: 100% Completed | 1606/1606 [00:14<00:00, 113.74it/s]
(Worker_TP0 pid=5809)
(Worker_TP0 pid=5809) INFO 03-18 14:24:37 [default_loader.py:384] Loading weights took 14.17 seconds
(Worker_TP0 pid=5809) INFO 03-18 14:24:37 [eagle.py:1365] Detected MTP model. Sharing target model embedding weights with the draft model.
(Worker_TP1 pid=5810) INFO 03-18 14:24:37 [eagle.py:1365] Detected MTP model. Sharing target model embedding weights with the draft model.
(Worker_TP0 pid=5809) INFO 03-18 14:24:37 [eagle.py:1419] Detected MTP model. Sharing target model lm_head weights with the draft model.
(Worker_TP1 pid=5810) INFO 03-18 14:24:37 [eagle.py:1419] Detected MTP model. Sharing target model lm_head weights with the draft model.
(Worker_TP0 pid=5809) INFO 03-18 14:24:38 [gpu_model_runner.py:4591] Model loading took 14.21 GiB memory and 29.308630 seconds
(Worker_TP0 pid=5809) INFO 03-18 14:24:40 [backends.py:988] Using cache directory: /home/repne/.cache/vllm/torch_compile_cache/38024dafbd/rank_0_0/backbone for vLLM's torch.compile
(Worker_TP0 pid=5809) INFO 03-18 14:24:40 [backends.py:1048] Dynamo bytecode transform time: 2.43 s
(Worker_TP0 pid=5809) INFO 03-18 14:24:43 [backends.py:284] Directly load the compiled graph(s) for compile range (1, 8192) from the cache, took 1.666 s
(Worker_TP0 pid=5809) INFO 03-18 14:24:43 [monitor.py:48] torch.compile took 4.74 s in total
(Worker_TP0 pid=5809) INFO 03-18 14:24:43 [decorators.py:296] Directly load AOT compilation from path /home/repne/.cache/vllm/torch_compile_cache/torch_aot_compile/fc097b34cdd61442bf3f92d3863e79e6ab0ff30bffd9327635ce9c91eaa16664/rank_0_0/model
(Worker_TP1 pid=5810) INFO 03-18 14:24:43 [decorators.py:296] Directly load AOT compilation from path /home/repne/.cache/vllm/torch_compile_cache/torch_aot_compile/fc097b34cdd61442bf3f92d3863e79e6ab0ff30bffd9327635ce9c91eaa16664/rank_1_0/model
(Worker_TP0 pid=5809) /home/repne/vllm/vllm/model_executor/layers/fla/ops/utils.py:113: UserWarning: Input tensor shape suggests potential format mismatch: seq_len (16) < num_heads (24). This may indicate the inputs were passed in head-first format [B, H, T, ...] when head_first=False was specified. Please verify your input tensor format matches the expected shape [B, T, H, ...].
(Worker_TP0 pid=5809)   return fn(*contiguous_args, **contiguous_kwargs)
(Worker_TP1 pid=5810) /home/repne/vllm/vllm/model_executor/layers/fla/ops/utils.py:113: UserWarning: Input tensor shape suggests potential format mismatch: seq_len (16) < num_heads (24). This may indicate the inputs were passed in head-first format [B, H, T, ...] when head_first=False was specified. Please verify your input tensor format matches the expected shape [B, T, H, ...].
(Worker_TP1 pid=5810)   return fn(*contiguous_args, **contiguous_kwargs)
(Worker_TP0 pid=5809) INFO 03-18 14:25:03 [monitor.py:76] Initial profiling/warmup run took 20.23 s
(Worker_TP0 pid=5809) INFO 03-18 14:25:03 [backends.py:988] Using cache directory: /home/repne/.cache/vllm/torch_compile_cache/38024dafbd/rank_0_0/eagle_head for vLLM's torch.compile
(Worker_TP0 pid=5809) INFO 03-18 14:25:03 [backends.py:1048] Dynamo bytecode transform time: 0.07 s
(Worker_TP0 pid=5809) INFO 03-18 14:25:03 [backends.py:284] Directly load the compiled graph(s) for compile range (1, 8192) from the cache, took 0.027 s
(Worker_TP0 pid=5809) INFO 03-18 14:25:03 [monitor.py:48] torch.compile took 0.42 s in total
(Worker_TP0 pid=5809) INFO 03-18 14:25:03 [decorators.py:296] Directly load AOT compilation from path /home/repne/.cache/vllm/torch_compile_cache/torch_aot_compile/b9809f09c677bf2d65ce912e42158c0eb911e0a11479b18c36f4e6efe445a31e/rank_0_0/model
(Worker_TP1 pid=5810) INFO 03-18 14:25:03 [decorators.py:296] Directly load AOT compilation from path /home/repne/.cache/vllm/torch_compile_cache/torch_aot_compile/b9809f09c677bf2d65ce912e42158c0eb911e0a11479b18c36f4e6efe445a31e/rank_1_0/model
(Worker_TP0 pid=5809) INFO 03-18 14:25:03 [monitor.py:76] Initial profiling/warmup run took 0.07 s
(Worker_TP0 pid=5809) WARNING 03-18 14:25:04 [kv_cache_utils.py:1109] Add 3 padding layers, may waste at most 6.25% KV cache memory
(Worker_TP0 pid=5809) INFO 03-18 14:25:04 [kv_cache_utils.py:854] Overriding num_gpu_blocks=0 with num_gpu_blocks_override=192
(Worker_TP1 pid=5810) WARNING 03-18 14:25:04 [kv_cache_utils.py:1109] Add 3 padding layers, may waste at most 6.25% KV cache memory
(Worker_TP1 pid=5810) INFO 03-18 14:25:04 [kv_cache_utils.py:854] Overriding num_gpu_blocks=0 with num_gpu_blocks_override=192
(Worker_TP0 pid=5809) INFO 03-18 14:25:04 [gpu_model_runner.py:5632] Profiling CUDA graph memory: PIECEWISE=26 (largest=192), FULL=14 (largest=96)
(Worker_TP1 pid=5810) INFO 03-18 14:25:04 [gpu_model_runner.py:5632] Profiling CUDA graph memory: PIECEWISE=26 (largest=192), FULL=14 (largest=96)
(Worker_TP0 pid=5809) INFO 03-18 14:25:14 [custom_all_reduce.py:216] Registering 522 cuda graph addresses
(Worker_TP1 pid=5810) INFO 03-18 14:25:14 [custom_all_reduce.py:216] Registering 522 cuda graph addresses
(Worker_TP1 pid=5810) INFO 03-18 14:25:15 [gpu_model_runner.py:5711] Estimated CUDA graph memory: 0.47 GiB total
(Worker_TP0 pid=5809) INFO 03-18 14:25:15 [gpu_model_runner.py:5711] Estimated CUDA graph memory: 0.47 GiB total
(Worker_TP1 pid=5810) INFO 03-18 14:25:15 [gpu_worker.py:472] CUDA graph memory profiling is enabled (VLLM_MEMORY_PROFILER_ESTIMATE_CUDAGRAPHS=1). This will become the default in v0.19. The current --gpu-memory-utilization=0.9400 is equivalent to --gpu-memory-utilization=0.9249 without CUDA graph memory profiling. To maintain the same effective KV cache size as before, increase --gpu-memory-utilization to 0.9551.
(Worker_TP0 pid=5809) INFO 03-18 14:25:15 [gpu_worker.py:456] Available KV cache memory: 13.61 GiB
(Worker_TP0 pid=5809) INFO 03-18 14:25:15 [gpu_worker.py:472] CUDA graph memory profiling is enabled (VLLM_MEMORY_PROFILER_ESTIMATE_CUDAGRAPHS=1). This will become the default in v0.19. The current --gpu-memory-utilization=0.9400 is equivalent to --gpu-memory-utilization=0.9249 without CUDA graph memory profiling. To maintain the same effective KV cache size as before, increase --gpu-memory-utilization to 0.9551.
(EngineCore pid=5738) WARNING 03-18 14:25:15 [kv_cache_utils.py:1109] Add 3 padding layers, may waste at most 6.25% KV cache memory
(EngineCore pid=5738) INFO 03-18 14:25:15 [kv_cache_utils.py:1497] GPU KV cache size: 410,400 tokens
(EngineCore pid=5738) INFO 03-18 14:25:15 [kv_cache_utils.py:1502] Maximum concurrency for 262,144 tokens per request: 1.00x
(Worker_TP0 pid=5809) 2026-03-18 14:25:15,668 - INFO - autotuner.py:262 - flashinfer.jit: [Autotuner]: Autotuning process starts ...
(Worker_TP1 pid=5810) 2026-03-18 14:25:15,668 - INFO - autotuner.py:262 - flashinfer.jit: [Autotuner]: Autotuning process starts ...
(Worker_TP0 pid=5809) 2026-03-18 14:25:16,532 - INFO - autotuner.py:268 - flashinfer.jit: [Autotuner]: Autotuning process ends
(Worker_TP1 pid=5810) 2026-03-18 14:25:16,532 - INFO - autotuner.py:268 - flashinfer.jit: [Autotuner]: Autotuning process ends
Capturing CUDA graphs (mixed prefill-decode, PIECEWISE): 100%|█████████████████████████████████████████████████████████████████████████████████████████████| 26/26 [00:01<00:00, 21.39it/s]
Capturing CUDA graphs (decode, FULL):  71%|████████████████████████████████████████████████████████████████████████████████                                | 10/14 [00:03<00:02,  1.87it/s](Worker_TP1 pid=5810) /home/repne/vllm/vllm/model_executor/layers/fla/ops/utils.py:113: UserWarning: Input tensor shape suggests potential format mismatch: seq_len (18) < num_heads (24). This may indicate the inputs were passed in head-first format [B, H, T, ...] when head_first=False was specified. Please verify your input tensor format matches the expected shape [B, T, H, ...].
(Worker_TP1 pid=5810)   return fn(*contiguous_args, **contiguous_kwargs)
/home/repne/vllm/vllm/model_executor/layers/fla/ops/utils.py:113: UserWarning: Input tensor shape suggests potential format mismatch: seq_len (18) < num_heads (24). This may indicate the inputs were passed in head-first format [B, H, T, ...] when head_first=False was specified. Please verify your input tensor format matches the expected shape [B, T, H, ...].
(Worker_TP0 pid=5809)   return fn(*contiguous_args, **contiguous_kwargs)
(Worker_TP0 pid=5809) /home/repne/vllm/vllm/model_executor/layers/fla/ops/utils.py:113: UserWarning: Input tensor shape suggests potential format mismatch: seq_len (9) < num_heads (24). This may indicate the inputs were passed in head-first format [B, H, T, ...] when head_first=False was specified. Please verify your input tensor format matches the expected shape [B, T, H, ...].
(Worker_TP1 pid=5810) /home/repne/vllm/vllm/model_executor/layers/fla/ops/utils.py:113: UserWarning: Input tensor shape suggests potential format mismatch: seq_len (9) < num_heads (24). This may indicate the inputs were passed in head-first format [B, H, T, ...] when head_first=False was specified. Please verify your input tensor format matches the expected shape [B, T, H, ...].
(Worker_TP0 pid=5809)   return fn(*contiguous_args, **contiguous_kwargs)
(Worker_TP1 pid=5810)   return fn(*contiguous_args, **contiguous_kwargs)
Capturing CUDA graphs (decode, FULL):  86%|████████████████████████████████████████████████████████████████████████████████████████████████                | 12/14 [00:06<00:01,  1.21it/s](Worker_TP1 pid=5810) /home/repne/vllm/.venv/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:1181: UserWarning: Input tensor shape suggests potential format mismatch: seq_len (6) < num_heads (8). This may indicate the inputs were passed in head-first format [B, H, T, ...] Please verify your input tensor format matches the expected shape [B, T, H, ...].
(Worker_TP1 pid=5810)   return fn(*args, **kwargs)
(Worker_TP1 pid=5810) /home/repne/vllm/vllm/model_executor/layers/fla/ops/utils.py:113: UserWarning: Input tensor shape suggests potential format mismatch: seq_len (6) < num_heads (24). This may indicate the inputs were passed in head-first format [B, H, T, ...] when head_first=False was specified. Please verify your input tensor format matches the expected shape [B, T, H, ...].
(Worker_TP1 pid=5810)   return fn(*contiguous_args, **contiguous_kwargs)
/home/repne/vllm/.venv/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:1181: UserWarning: Input tensor shape suggests potential format mismatch: seq_len (6) < num_heads (8). This may indicate the inputs were passed in head-first format [B, H, T, ...] Please verify your input tensor format matches the expected shape [B, T, H, ...].
(Worker_TP0 pid=5809)   return fn(*args, **kwargs)
(Worker_TP0 pid=5809) /home/repne/vllm/vllm/model_executor/layers/fla/ops/utils.py:113: UserWarning: Input tensor shape suggests potential format mismatch: seq_len (6) < num_heads (24). This may indicate the inputs were passed in head-first format [B, H, T, ...] when head_first=False was specified. Please verify your input tensor format matches the expected shape [B, T, H, ...].
(Worker_TP0 pid=5809)   return fn(*contiguous_args, **contiguous_kwargs)
(Worker_TP0 pid=5809) /home/repne/vllm/.venv/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:1181: UserWarning: Input tensor shape suggests potential format mismatch: seq_len (3) < num_heads (8). This may indicate the inputs were passed in head-first format [B, H, T, ...] Please verify your input tensor format matches the expected shape [B, T, H, ...].
(Worker_TP0 pid=5809)   return fn(*args, **kwargs)
(Worker_TP1 pid=5810) /home/repne/vllm/.venv/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:1181: UserWarning: Input tensor shape suggests potential format mismatch: seq_len (3) < num_heads (8). This may indicate the inputs were passed in head-first format [B, H, T, ...] Please verify your input tensor format matches the expected shape [B, T, H, ...].
(Worker_TP1 pid=5810)   return fn(*args, **kwargs)
(Worker_TP0 pid=5809) /home/repne/vllm/vllm/model_executor/layers/fla/ops/utils.py:113: UserWarning: Input tensor shape suggests potential format mismatch: seq_len (3) < num_heads (24). This may indicate the inputs were passed in head-first format [B, H, T, ...] when head_first=False was specified. Please verify your input tensor format matches the expected shape [B, T, H, ...].
(Worker_TP0 pid=5809)   return fn(*contiguous_args, **contiguous_kwargs)
(Worker_TP1 pid=5810) /home/repne/vllm/vllm/model_executor/layers/fla/ops/utils.py:113: UserWarning: Input tensor shape suggests potential format mismatch: seq_len (3) < num_heads (24). This may indicate the inputs were passed in head-first format [B, H, T, ...] when head_first=False was specified. Please verify your input tensor format matches the expected shape [B, T, H, ...].
(Worker_TP1 pid=5810)   return fn(*contiguous_args, **contiguous_kwargs)
Capturing CUDA graphs (decode, FULL): 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 14/14 [00:06<00:00,  2.28it/s]
(Worker_TP1 pid=5810) INFO 03-18 14:25:29 [custom_all_reduce.py:216] Registering 5238 cuda graph addresses
(Worker_TP0 pid=5809) INFO 03-18 14:25:29 [custom_all_reduce.py:216] Registering 5238 cuda graph addresses
(Worker_TP1 pid=5810) INFO 03-18 14:25:29 [gpu_worker.py:617] CUDA graph pool memory: 0.5 GiB (actual), 0.47 GiB (estimated), difference: 0.02 GiB (4.7%).
(Worker_TP0 pid=5809) INFO 03-18 14:25:29 [gpu_model_runner.py:5771] Graph capturing finished in 13 secs, took 0.50 GiB
(Worker_TP0 pid=5809) INFO 03-18 14:25:29 [gpu_worker.py:617] CUDA graph pool memory: 0.5 GiB (actual), 0.47 GiB (estimated), difference: 0.02 GiB (4.7%).
(EngineCore pid=5738) INFO 03-18 14:25:29 [core.py:281] init engine (profile, create kv cache, warmup model) took 51.68 seconds
(EngineCore pid=5738) INFO 03-18 14:25:30 [registry.py:126] All limits of multimodal modalities supported by the model are set to 0, running in text-only mode.
(EngineCore pid=5738) INFO 03-18 14:25:30 [vllm.py:750] Asynchronous scheduling is enabled.
(EngineCore pid=5738) INFO 03-18 14:25:30 [compilation.py:289] Enabled custom fusions: norm_quant, act_quant
(APIServer pid=5653) INFO 03-18 14:25:30 [api_server.py:595] Supported tasks: ['generate']
(APIServer pid=5653) INFO 03-18 14:25:30 [parser_manager.py:202] "auto" tool choice has been enabled.
(APIServer pid=5653) WARNING 03-18 14:25:31 [model.py:1376] Default vLLM sampling parameters have been overridden by the model's `generation_config.json`: `{'temperature': 0.6, 'top_k': 20, 'top_p': 0.95}`. If this is not intended, please relaunch vLLM instance with `--generation-config vllm`.
(APIServer pid=5653) INFO 03-18 14:25:31 [parser_manager.py:202] "auto" tool choice has been enabled.
(APIServer pid=5653) INFO 03-18 14:25:31 [parser_manager.py:202] "auto" tool choice has been enabled.
(APIServer pid=5653) INFO 03-18 14:25:33 [hf.py:320] Detected the chat template content format to be 'string'. You can set `--chat-template-content-format` to override this.
(APIServer pid=5653) INFO 03-18 14:25:39 [base.py:216] Multi-modal warmup completed in 5.966s
(APIServer pid=5653) INFO 03-18 14:25:39 [parser_manager.py:202] "auto" tool choice has been enabled.
(APIServer pid=5653) INFO 03-18 14:25:39 [api_server.py:599] Starting vLLM server on http://0.0.0.0:8000
(APIServer pid=5653) INFO 03-18 14:25:39 [launcher.py:37] Available routes are:
(APIServer pid=5653) INFO 03-18 14:25:39 [launcher.py:46] Route: /openapi.json, Methods: HEAD, GET
(APIServer pid=5653) INFO 03-18 14:25:39 [launcher.py:46] Route: /docs, Methods: HEAD, GET
(APIServer pid=5653) INFO 03-18 14:25:39 [launcher.py:46] Route: /docs/oauth2-redirect, Methods: HEAD, GET
(APIServer pid=5653) INFO 03-18 14:25:39 [launcher.py:46] Route: /redoc, Methods: HEAD, GET
(APIServer pid=5653) INFO 03-18 14:25:39 [launcher.py:46] Route: /tokenize, Methods: POST
(APIServer pid=5653) INFO 03-18 14:25:39 [launcher.py:46] Route: /detokenize, Methods: POST
(APIServer pid=5653) INFO 03-18 14:25:39 [launcher.py:46] Route: /load, Methods: GET
(APIServer pid=5653) INFO 03-18 14:25:39 [launcher.py:46] Route: /version, Methods: GET
(APIServer pid=5653) INFO 03-18 14:25:39 [launcher.py:46] Route: /health, Methods: GET
(APIServer pid=5653) INFO 03-18 14:25:39 [launcher.py:46] Route: /metrics, Methods: GET
(APIServer pid=5653) INFO 03-18 14:25:39 [launcher.py:46] Route: /v1/models, Methods: GET
(APIServer pid=5653) INFO 03-18 14:25:39 [launcher.py:46] Route: /ping, Methods: GET
(APIServer pid=5653) INFO 03-18 14:25:39 [launcher.py:46] Route: /ping, Methods: POST
(APIServer pid=5653) INFO 03-18 14:25:39 [launcher.py:46] Route: /invocations, Methods: POST
(APIServer pid=5653) INFO 03-18 14:25:39 [launcher.py:46] Route: /v1/chat/completions, Methods: POST
(APIServer pid=5653) INFO 03-18 14:25:39 [launcher.py:46] Route: /v1/responses, Methods: POST
(APIServer pid=5653) INFO 03-18 14:25:39 [launcher.py:46] Route: /v1/responses/{response_id}, Methods: GET
(APIServer pid=5653) INFO 03-18 14:25:39 [launcher.py:46] Route: /v1/responses/{response_id}/cancel, Methods: POST
(APIServer pid=5653) INFO 03-18 14:25:39 [launcher.py:46] Route: /v1/completions, Methods: POST
(APIServer pid=5653) INFO 03-18 14:25:39 [launcher.py:46] Route: /v1/messages, Methods: POST
(APIServer pid=5653) INFO 03-18 14:25:39 [launcher.py:46] Route: /v1/messages/count_tokens, Methods: POST
(APIServer pid=5653) INFO 03-18 14:25:39 [launcher.py:46] Route: /inference/v1/generate, Methods: POST
(APIServer pid=5653) INFO 03-18 14:25:39 [launcher.py:46] Route: /scale_elastic_ep, Methods: POST
(APIServer pid=5653) INFO 03-18 14:25:39 [launcher.py:46] Route: /is_scaling_elastic_ep, Methods: POST
(APIServer pid=5653) INFO 03-18 14:25:39 [launcher.py:46] Route: /v1/chat/completions/render, Methods: POST
(APIServer pid=5653) INFO 03-18 14:25:39 [launcher.py:46] Route: /v1/completions/render, Methods: POST
(APIServer pid=5653) INFO:     Started server process [5653]
(APIServer pid=5653) INFO:     Waiting for application startup.
(APIServer pid=5653) INFO:     Application startup complete.

swtb3 added 2 commits March 19, 2026 01:39
…id models

Fix two bugs causing a 4x throughput regression on long-context hybrid
  Mamba/attention models (e.g. Qwen3.5 at 262K context):

  1. Concurrency formula used max_model_len as attention cost, giving C=1
     for long contexts. Mamba state is O(1) per request, so concurrency
     should be independent of sequence length. Replace with shared-pool
     cap formula that guarantees attention_blocks >= shared pool equivalent.

  2. Mamba page sizes were padded to match attention even in compact mode
     where Mamba has its own separate tensors. Use real_page_size_bytes
     for Mamba allocation, cost accounting, and tensor reshape (including
     the model runner's stride and block count derivation).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Signed-off-by: swtb3 <135991636+swtb3@users.noreply.github.com>
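
A hedged illustration of point 1 in the commit message above: why pricing Mamba state at max_model_len (like attention) collapses the concurrency estimate at long context. All sizes are made up; this is not the shared-pool cap formula itself, only the scaling argument:

def concurrency(pool_bytes: int, per_request_bytes: int) -> float:
    return pool_bytes / per_request_bytes

mamba_pool = 1 << 30                    # 1 GiB set aside for Mamba state (assumed)
real_state_per_request = 32 << 20       # O(1) Mamba state per request (assumed)
costed_like_attention = 262_144 * 4096  # ~1 GiB if priced per token at 262K context

print(concurrency(mamba_pool, real_state_per_request))  # 32.0
print(concurrency(mamba_pool, costed_like_attention))   # 1.0 -> the "C=1" symptom
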
Collaborator

@NickLucche NickLucche left a comment

Mamba layers self-manage a dedicated O(1) block pool instead of sharing the attention pool, eliminating 7x memory waste and OOM

IIUC, that'd be overriding a lot of work that went into HMA, building around making a single shared pool work.
Any similar change would at least require an RFC to discuss.
Would you care to elaborate more on your suggested changes in that format?

cc @heheda12345

@repne

repne commented Mar 19, 2026

Thank you @swtb3, I just tested this and I no longer see a decrease in performance. However, the previous commit brought the GPU KV cache size from 100k up to 400k tokens, whereas now I am only getting a ~30k increase, which is still decent.

@swtb3
Author

swtb3 commented Mar 19, 2026

Mamba layers self-manage a dedicated O(1) block pool instead of sharing the attention pool, eliminating 7x memory waste and OOM

IIUC, that'd be overriding a lot of work that went into HMA, building around making a single shared pool work. Any similar change would at least require an RFC to discuss. Would you care to elaborate more on your suggested changes in that format?

cc @heheda12345

To be honest folks, this has become a bit of a rabbit hole. It would be good to get some assistance from a maintainer who has more knowledge of vLLM's machinery.

All I know is that hybrid Mamba models are fantastic for keeping KV cache usage low, but vLLM currently ignores that improvement and treats all layers the same.
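
To make that asymmetry concrete with illustrative numbers (not measurements from Qwen3.5): attention KV grows linearly with context length, while an SSM/Mamba layer keeps a fixed-size state per request.

seq_len = 262_144             # tokens per request
attn_kv_per_token = 4096      # assumed bytes of K+V per token, one attention layer
mamba_state_bytes = 32 << 20  # assumed fixed per-request state, one Mamba layer

print(f"attention layer: {seq_len * attn_kv_per_token / 2**30:.2f} GiB per request")  # 1.00 GiB
print(f"mamba layer:     {mamba_state_bytes / 2**30:.2f} GiB per request")            # 0.03 GiB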

@swtb3
Author

swtb3 commented Mar 19, 2026

Thank you @swtb3, I just tested this and I no longer see a decrease in performance. However, the previous commit brought the GPU KV cache size from 100k up to 400k tokens, whereas now I am only getting a ~30k increase, which is still decent.

Could you share the logs/results?

@repne

repne commented Mar 19, 2026

Tested against main @ c63ca2b

Command:

NCCL_P2P_LEVEL=SYS \
NCCL_IB_DISABLE=1 \
NCCL_NET_GDR_LEVEL=SYS \
NCCL_MIN_NCHANNELS=4 \
NCCL_ALLOC_P2P_NET_LL_BUFFERS=1 \
VLLM_ENABLE_FLA_PACKED_RECURRENT_DECODE=1 \
VLLM_MEMORY_PROFILER_ESTIMATE_CUDAGRAPHS=1 \
vllm serve \
    Qwen/Qwen3.5-27B-FP8 \
    --tensor-parallel-size 2 \
    --gpu-memory-utilization 0.94 \
    --max-model-len 262144 \
    --max-num-seqs 32 \
    --max-num-batched-tokens 8192 \
    --block-size 32 \
    --language-model-only \
    -O3 \
    --enable-auto-tool-choice \
    --reasoning-parser qwen3 \
    --tool-call-parser qwen3_coder \
    --attention-backend TRITON_ATTN \
    --enable-prefix-caching \
    --speculative-config.method mtp \
    --speculative-config.num_speculative_tokens 2 \
    --speculative-config.rejection_sample_method probabilistic \
    --load-format instanttensor

Before PR:

INFO 03-19 13:31:21 [kv_cache_utils.py:1316] GPU KV cache size: 104,800 tokens
INFO 03-19 13:31:21 [kv_cache_utils.py:1321] Maximum concurrency for 262,144 tokens per request: 1.54x

After PR:

INFO 03-19 13:37:53 [kv_cache_utils.py:1498] GPU KV cache size: 132,800 tokens
INFO 03-19 13:37:53 [kv_cache_utils.py:1503] Maximum concurrency for 262,144 tokens per request: 0.51x
WARNING 03-19 13:37:53 [compilation.py:1236] Capping cudagraph capture sizes from max 192 to 162 to fit Mamba cache blocks (166 blocks available). This limits the maximum batch size that can use CUDA graphs. To increase this limit, reduce max_num_seqs or increase available GPU memory.
WARNING 03-19 13:37:53 [compilation.py:1236] Capping cudagraph capture sizes from max 192 to 162 to fit Mamba cache blocks (166 blocks available). This limits the maximum batch size that can use CUDA graphs. To increase this limit, reduce max_num_seqs or increase available GPU memory.
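
A hedged arithmetic check on those two concurrency figures, assuming the reported multiplier is simply cache tokens divided by max_model_len: the post-PR 0.51x matches that ratio, while the pre-PR 1.54x does not, so the two builds are evidently computing the figure differently.

print(132_800 / 262_144)  # ~0.507 -> matches the post-PR "0.51x"
print(104_800 / 262_144)  # ~0.400 -> the pre-PR "1.54x" is not this ratio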

Benchmark command:

vllm bench serve \
  --backend vllm \
  --model Qwen/Qwen3.5-27B-FP8 \
  --endpoint /v1/completions \
  --dataset-name sharegpt \
  --dataset-path ~/datasets/ShareGPT_V3_unfiltered_cleaned_split.json \
  --num-warmups 10 \
  --num-prompts 100 \
  --seed 42

Benchmark before PR:

============ Serving Benchmark Result ============
Successful requests:                     100
Failed requests:                         0
Benchmark duration (s):                  91.07
Total input tokens:                      22053
Total generated tokens:                  25660
Request throughput (req/s):              1.10
Output token throughput (tok/s):         281.76
Peak output token throughput (tok/s):    192.00
Peak concurrent requests:                100.00
Total token throughput (tok/s):          523.92
---------------Time to First Token----------------
Mean TTFT (ms):                          23806.17
Median TTFT (ms):                        16324.47
P99 TTFT (ms):                           63565.26
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          107.32
Median TPOT (ms):                        97.28
P99 TPOT (ms):                           188.23
---------------Inter-token Latency----------------
Mean ITL (ms):                           205.54
Median ITL (ms):                         175.36
P99 ITL (ms):                            939.49
---------------Speculative Decoding---------------
Acceptance rate (%):                     69.88
Acceptance length:                       2.40
Drafts:                                  10697
Draft tokens:                            21394
Accepted tokens:                         14950
Per-position acceptance (%):
  Position 0:                            79.86
  Position 1:                            59.90
==================================================

Benchmark after PR:

============ Serving Benchmark Result ============
Successful requests:                     100
Failed requests:                         0
Benchmark duration (s):                  91.17
Total input tokens:                      22053
Total generated tokens:                  25437
Request throughput (req/s):              1.10
Output token throughput (tok/s):         279.02
Peak output token throughput (tok/s):    192.00
Peak concurrent requests:                100.00
Total token throughput (tok/s):          520.91
---------------Time to First Token----------------
Mean TTFT (ms):                          24200.97
Median TTFT (ms):                        17265.91
P99 TTFT (ms):                           59877.40
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          97.41
Median TPOT (ms):                        97.86
P99 TPOT (ms):                           166.74
---------------Inter-token Latency----------------
Mean ITL (ms):                           198.66
Median ITL (ms):                         175.11
P99 ITL (ms):                            888.18
---------------Speculative Decoding---------------
Acceptance rate (%):                     67.93
Acceptance length:                       2.36
Drafts:                                  10778
Draft tokens:                            21556
Accepted tokens:                         14644
Per-position acceptance (%):
  Position 0:                            78.83
  Position 1:                            57.04
==================================================
Full startup log before PR
(APIServer pid=1704) INFO 03-19 13:30:01 [utils.py:297]
(APIServer pid=1704) INFO 03-19 13:30:01 [utils.py:297]        █     █     █▄   ▄█
(APIServer pid=1704) INFO 03-19 13:30:01 [utils.py:297]  ▄▄ ▄█ █     █     █ ▀▄▀ █  version 0.17.1rc1.dev502+gbb9420919.d20260319
(APIServer pid=1704) INFO 03-19 13:30:01 [utils.py:297]   █▄█▀ █     █     █     █  model   Qwen/Qwen3.5-27B-FP8
(APIServer pid=1704) INFO 03-19 13:30:01 [utils.py:297]    ▀▀  ▀▀▀▀▀ ▀▀▀▀▀ ▀     ▀
(APIServer pid=1704) INFO 03-19 13:30:01 [utils.py:297]
(APIServer pid=1704) INFO 03-19 13:30:01 [utils.py:233] non-default args: {'model_tag': 'Qwen/Qwen3.5-27B-FP8', 'enable_auto_tool_choice': True, 'tool_call_parser': 'qwen3_coder', 'model': 'Qwen/Qwen3.5-27B-FP8', 'max_model_len': 262144, 'load_format': 'instanttensor', 'attention_backend': 'TRITON_ATTN', 'reasoning_parser': 'qwen3', 'tensor_parallel_size': 2, 'block_size': 32, 'gpu_memory_utilization': 0.94, 'enable_prefix_caching': True, 'language_model_only': True, 'max_num_batched_tokens': 8192, 'max_num_seqs': 32, 'speculative_config': {'method': 'mtp', 'num_speculative_tokens': 2, 'rejection_sample_method': 'probabilistic'}, 'optimization_level': '3'}
(APIServer pid=1704) INFO 03-19 13:30:01 [model.py:533] Resolved architecture: Qwen3_5ForConditionalGeneration
(APIServer pid=1704) INFO 03-19 13:30:01 [model.py:1582] Using max model len 262144
(APIServer pid=1704) INFO 03-19 13:30:02 [model.py:533] Resolved architecture: Qwen3_5MTP
(APIServer pid=1704) INFO 03-19 13:30:02 [model.py:1582] Using max model len 262144
(APIServer pid=1704) WARNING 03-19 13:30:02 [speculative.py:499] Enabling num_speculative_tokens > 1 will run multiple times of forward on same MTP layer,which may result in lower acceptance rate
(APIServer pid=1704) INFO 03-19 13:30:02 [scheduler.py:231] Chunked prefill is enabled with max_num_batched_tokens=8192.
(APIServer pid=1704) WARNING 03-19 13:30:02 [config.py:372] Mamba cache mode is set to 'align' for Qwen3_5ForConditionalGeneration by default when prefix caching is enabled
(APIServer pid=1704) INFO 03-19 13:30:02 [config.py:392] Warning: Prefix caching in Mamba cache 'align' mode is currently enabled. Its support for Mamba layers is experimental. Please report any issues you may observe.
(APIServer pid=1704) INFO 03-19 13:30:03 [config.py:212] Setting attention block size to 800 tokens to ensure that attention page size is >= mamba page size.
(APIServer pid=1704) INFO 03-19 13:30:03 [config.py:243] Padding mamba page size by 0.88% to ensure that mamba page size and attention page size are exactly equal.
(APIServer pid=1704) INFO 03-19 13:30:03 [vllm.py:750] Asynchronous scheduling is enabled.
(APIServer pid=1704) INFO 03-19 13:30:03 [compilation.py:289] Enabled custom fusions: norm_quant, act_quant
(APIServer pid=1704) INFO 03-19 13:30:04 [registry.py:126] All limits of multimodal modalities supported by the model are set to 0, running in text-only mode.
/home/repne/vllm/.venv/lib/python3.12/site-packages/torch/cuda/__init__.py:65: FutureWarning: The pynvml package is deprecated. Please install nvidia-ml-py instead. If you did not install pynvml directly, please report this to the maintainers of the package that installed pynvml for you.
  import pynvml  # type: ignore[import]
(EngineCore pid=1794) INFO 03-19 13:30:07 [core.py:103] Initializing a V1 LLM engine (v0.17.1rc1.dev502+gbb9420919.d20260319) with config: model='Qwen/Qwen3.5-27B-FP8', speculative_config=SpeculativeConfig(method='mtp', model='Qwen/Qwen3.5-27B-FP8', num_spec_tokens=2), tokenizer='Qwen/Qwen3.5-27B-FP8', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=262144, download_dir=None, load_format=instanttensor, tensor_parallel_size=2, pipeline_parallel_size=1, data_parallel_size=1, decode_context_parallel_size=1, dcp_comm_backend=ag_rs, disable_custom_all_reduce=False, quantization=fp8, enforce_eager=False, enable_return_routed_experts=False, kv_cache_dtype=auto, device_config=cuda, structured_outputs_config=StructuredOutputsConfig(backend='auto', disable_any_whitespace=False, disable_additional_properties=False, reasoning_parser='qwen3', reasoning_parser_plugin='', enable_in_reasoning=False), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None, kv_cache_metrics=False, kv_cache_metrics_sample=0.01, cudagraph_metrics=False, enable_layerwise_nvtx_tracing=False, enable_mfu_metrics=False, enable_mm_processor_stats=False, enable_logging_iteration_details=False), seed=0, served_model_name=Qwen/Qwen3.5-27B-FP8, enable_prefix_caching=True, enable_chunked_prefill=True, pooler_config=None, compilation_config={'mode': <CompilationMode.VLLM_COMPILE: 3>, 'debug_dump_path': None, 'cache_dir': '', 'compile_cache_save_format': 'binary', 'backend': 'inductor', 'custom_ops': ['+quant_fp8', 'none', '+quant_fp8'], 'splitting_ops': ['vllm::unified_attention', 'vllm::unified_attention_with_output', 'vllm::unified_mla_attention', 'vllm::unified_mla_attention_with_output', 'vllm::mamba_mixer2', 'vllm::mamba_mixer', 'vllm::short_conv', 'vllm::linear_attention', 'vllm::plamo2_mamba_mixer', 'vllm::gdn_attention_core', 'vllm::olmo_hybrid_gdn_full_forward', 'vllm::kda_attention', 'vllm::sparse_attn_indexer', 'vllm::rocm_aiter_sparse_attn_indexer', 'vllm::unified_kv_cache_update', 'vllm::unified_mla_kv_cache_update'], 'compile_mm_encoder': False, 'compile_sizes': [], 'compile_ranges_endpoints': [8192], 'inductor_compile_config': {'enable_auto_functionalized_v2': False, 'combo_kernels': True, 'benchmark_combo_kernel': True}, 'inductor_passes': {}, 'cudagraph_mode': <CUDAGraphMode.FULL_AND_PIECEWISE: (2, 1)>, 'cudagraph_num_of_warmups': 1, 'cudagraph_capture_sizes': [1, 2, 4, 8, 16, 24, 32, 40, 48, 56, 64, 72, 80, 88, 96, 104, 112, 120, 128, 136, 144, 152, 160, 168, 176, 184, 192], 'cudagraph_copy_inputs': False, 'cudagraph_specialize_lora': True, 'use_inductor_graph_partition': False, 'pass_config': {'fuse_norm_quant': True, 'fuse_act_quant': True, 'fuse_attn_quant': False, 'enable_sp': False, 'fuse_gemm_comms': False, 'fuse_allreduce_rms': False}, 'max_cudagraph_capture_size': 192, 'dynamic_shapes_config': {'type': <DynamicShapesType.BACKED: 'backed'>, 'evaluate_guards': False, 'assume_32_bit_indexing': False}, 'local_cache_dir': None, 'fast_moe_cold_start': False, 'static_all_moe_layers': []}
(EngineCore pid=1794) WARNING 03-19 13:30:07 [multiproc_executor.py:1013] Reducing Torch parallelism from 16 threads to 1 to avoid unnecessary CPU contention. Set OMP_NUM_THREADS in the external environment to tune this value as needed.
(EngineCore pid=1794) INFO 03-19 13:30:07 [multiproc_executor.py:134] DP group leader: node_rank=0, node_rank_within_dp=0, master_addr=127.0.0.1, mq_connect_ip=192.168.50.176 (local), world_size=2, local_world_size=2
/home/repne/vllm/.venv/lib/python3.12/site-packages/torch/cuda/__init__.py:65: FutureWarning: The pynvml package is deprecated. Please install nvidia-ml-py instead. If you did not install pynvml directly, please report this to the maintainers of the package that installed pynvml for you.
  import pynvml  # type: ignore[import]
/home/repne/vllm/.venv/lib/python3.12/site-packages/torch/cuda/__init__.py:65: FutureWarning: The pynvml package is deprecated. Please install nvidia-ml-py instead. If you did not install pynvml directly, please report this to the maintainers of the package that installed pynvml for you.
  import pynvml  # type: ignore[import]
INFO 03-19 13:30:11 [registry.py:126] All limits of multimodal modalities supported by the model are set to 0, running in text-only mode.
INFO 03-19 13:30:11 [registry.py:126] All limits of multimodal modalities supported by the model are set to 0, running in text-only mode.
(Worker pid=1865) INFO 03-19 13:30:11 [parallel_state.py:1400] world_size=2 rank=0 local_rank=0 distributed_init_method=tcp://127.0.0.1:56877 backend=nccl
(Worker pid=1866) INFO 03-19 13:30:11 [parallel_state.py:1400] world_size=2 rank=1 local_rank=1 distributed_init_method=tcp://127.0.0.1:56877 backend=nccl
(Worker pid=1865) INFO 03-19 13:30:12 [pynccl.py:111] vLLM is using nccl==2.29.7
(Worker pid=1865) WARNING 03-19 13:30:12 [symm_mem.py:67] SymmMemCommunicator: Device capability 12.0 not supported, communicator is not available.
(Worker pid=1866) WARNING 03-19 13:30:12 [symm_mem.py:67] SymmMemCommunicator: Device capability 12.0 not supported, communicator is not available.
(Worker pid=1866) INFO 03-19 13:30:12 [parallel_state.py:1716] rank 1 in world size 2 is assigned as DP rank 0, PP rank 0, PCP rank 0, TP rank 1, EP rank N/A, EPLB rank N/A
(Worker pid=1865) INFO 03-19 13:30:12 [parallel_state.py:1716] rank 0 in world size 2 is assigned as DP rank 0, PP rank 0, PCP rank 0, TP rank 0, EP rank N/A, EPLB rank N/A
(Worker pid=1866) WARNING 03-19 13:30:12 [__init__.py:204] min_p and logit_bias parameters won't work with speculative decoding.
(Worker pid=1865) WARNING 03-19 13:30:12 [__init__.py:204] min_p and logit_bias parameters won't work with speculative decoding.
(Worker_TP0 pid=1865) INFO 03-19 13:30:12 [gpu_model_runner.py:4516] Starting to load model Qwen/Qwen3.5-27B-FP8...
(Worker_TP0 pid=1865) INFO 03-19 13:30:13 [cuda.py:389] Using backend AttentionBackendEnum.FLASH_ATTN for vit attention
(Worker_TP0 pid=1865) INFO 03-19 13:30:13 [mm_encoder_attention.py:230] Using AttentionBackendEnum.FLASH_ATTN for MMEncoderAttention.
(Worker_TP0 pid=1865) INFO 03-19 13:30:13 [qwen3_next.py:203] Using Triton/FLA GDN prefill kernel
(Worker_TP1 pid=1866) INFO 03-19 13:30:13 [cuda.py:273] Using AttentionBackendEnum.TRITON_ATTN backend.
(Worker_TP0 pid=1865) INFO 03-19 13:30:13 [cuda.py:273] Using AttentionBackendEnum.TRITON_ATTN backend.
Loading safetensors using InstantTensor loader:   0% Completed | 0/1606 [00:00<?, ?it/s]
Loading safetensors using InstantTensor loader:   0% Completed | 1/1606 [00:01<30:45,  1.15s/it]
Loading safetensors using InstantTensor loader:   0% Completed | 2/1606 [00:02<29:55,  1.12s/it]
Loading safetensors using InstantTensor loader:   2% Completed | 38/1606 [00:03<01:41, 15.51it/s]
Loading safetensors using InstantTensor loader:   4% Completed | 65/1606 [00:04<01:17, 19.76it/s]
Loading safetensors using InstantTensor loader:   8% Completed | 121/1606 [00:05<00:46, 32.19it/s]
Loading safetensors using InstantTensor loader:  11% Completed | 177/1606 [00:06<00:36, 39.28it/s]
Loading safetensors using InstantTensor loader:  15% Completed | 234/1606 [00:07<00:30, 44.79it/s]
Loading safetensors using InstantTensor loader:  18% Completed | 291/1606 [00:08<00:27, 48.05it/s]
Loading safetensors using InstantTensor loader:  22% Completed | 348/1606 [00:09<00:25, 49.92it/s]
Loading safetensors using InstantTensor loader:  25% Completed | 409/1606 [00:10<00:22, 52.73it/s]
Loading safetensors using InstantTensor loader:  31% Completed | 500/1606 [00:11<00:17, 63.94it/s]
Loading safetensors using InstantTensor loader:  35% Completed | 565/1606 [00:12<00:16, 61.32it/s]
Loading safetensors using InstantTensor loader: 100% Completed | 1606/1606 [00:13<00:00, 117.98it/s]
(Worker_TP0 pid=1865)
(Worker_TP0 pid=1865) INFO 03-19 13:30:27 [default_loader.py:384] Loading weights took 13.98 seconds
(Worker_TP0 pid=1865) INFO 03-19 13:30:27 [gpu_model_runner.py:4540] Loading drafter model...
(Worker_TP1 pid=1866) INFO 03-19 13:30:27 [gpu_model_runner.py:4540] Loading drafter model...
Loading safetensors using InstantTensor loader:   0% Completed | 0/1606 [00:00<?, ?it/s]
Loading safetensors using InstantTensor loader:   0% Completed | 1/1606 [00:01<35:31,  1.33s/it]
Loading safetensors using InstantTensor loader:   0% Completed | 2/1606 [00:02<32:01,  1.20s/it]
Loading safetensors using InstantTensor loader:   2% Completed | 38/1606 [00:03<01:44, 15.03it/s]
Loading safetensors using InstantTensor loader:   6% Completed | 97/1606 [00:04<00:48, 31.04it/s]
Loading safetensors using InstantTensor loader:   8% Completed | 130/1606 [00:05<00:48, 30.57it/s]
Loading safetensors using InstantTensor loader:  11% Completed | 184/1606 [00:06<00:37, 37.60it/s]
Loading safetensors using InstantTensor loader:  15% Completed | 241/1606 [00:07<00:31, 43.55it/s]
Loading safetensors using InstantTensor loader:  18% Completed | 297/1606 [00:08<00:27, 47.09it/s]
Loading safetensors using InstantTensor loader:  22% Completed | 354/1606 [00:09<00:25, 49.05it/s]
Loading safetensors using InstantTensor loader:  26% Completed | 419/1606 [00:10<00:22, 53.10it/s]
Loading safetensors using InstantTensor loader:  32% Completed | 520/1606 [00:11<00:16, 67.16it/s]
Loading safetensors using InstantTensor loader:  41% Completed | 666/1606 [00:12<00:10, 89.18it/s]
Loading safetensors using InstantTensor loader: 100% Completed | 1606/1606 [00:13<00:00, 120.12it/s]
(Worker_TP0 pid=1865)
(Worker_TP1 pid=1866) INFO 03-19 13:30:41 [eagle.py:1365] Detected MTP model. Sharing target model embedding weights with the draft model.
(Worker_TP1 pid=1866) INFO 03-19 13:30:41 [eagle.py:1419] Detected MTP model. Sharing target model lm_head weights with the draft model.
(Worker_TP0 pid=1865) INFO 03-19 13:30:41 [default_loader.py:384] Loading weights took 13.42 seconds
(Worker_TP0 pid=1865) INFO 03-19 13:30:41 [eagle.py:1365] Detected MTP model. Sharing target model embedding weights with the draft model.
(Worker_TP0 pid=1865) INFO 03-19 13:30:41 [eagle.py:1419] Detected MTP model. Sharing target model lm_head weights with the draft model.
(Worker_TP0 pid=1865) INFO 03-19 13:30:42 [gpu_model_runner.py:4601] Model loading took 14.21 GiB memory and 28.452174 seconds
(Worker_TP0 pid=1865) INFO 03-19 13:30:45 [backends.py:990] Using cache directory: /home/repne/.cache/vllm/torch_compile_cache/5c5555a76f/rank_0_0/backbone for vLLM's torch.compile
(Worker_TP0 pid=1865) INFO 03-19 13:30:45 [backends.py:1050] Dynamo bytecode transform time: 2.67 s
(Worker_TP0 pid=1865) INFO 03-19 13:30:47 [backends.py:284] Directly load the compiled graph(s) for compile range (1, 8192) from the cache, took 1.828 s
(Worker_TP0 pid=1865) INFO 03-19 13:30:47 [monitor.py:48] torch.compile took 5.15 s in total
(Worker_TP0 pid=1865) INFO 03-19 13:30:47 [decorators.py:296] Directly load AOT compilation from path /home/repne/.cache/vllm/torch_compile_cache/torch_aot_compile/5a4b27f53e842aad678ca335ae4bc07c2cd0cd62d4bfc216a7b03226576ee4b2/rank_0_0/model
(Worker_TP1 pid=1866) INFO 03-19 13:30:47 [decorators.py:296] Directly load AOT compilation from path /home/repne/.cache/vllm/torch_compile_cache/torch_aot_compile/5a4b27f53e842aad678ca335ae4bc07c2cd0cd62d4bfc216a7b03226576ee4b2/rank_1_0/model
(Worker_TP0 pid=1865) /home/repne/vllm/vllm/model_executor/layers/fla/ops/utils.py:113: UserWarning: Input tensor shape suggests potential format mismatch: seq_len (16) < num_heads (24). This may indicate the inputs were passed in head-first format [B, H, T, ...] when head_first=False was specified. Please verify your input tensor format matches the expected shape [B, T, H, ...].
(Worker_TP0 pid=1865)   return fn(*contiguous_args, **contiguous_kwargs)
(Worker_TP1 pid=1866) /home/repne/vllm/vllm/model_executor/layers/fla/ops/utils.py:113: UserWarning: Input tensor shape suggests potential format mismatch: seq_len (16) < num_heads (24). This may indicate the inputs were passed in head-first format [B, H, T, ...] when head_first=False was specified. Please verify your input tensor format matches the expected shape [B, T, H, ...].
(Worker_TP1 pid=1866)   return fn(*contiguous_args, **contiguous_kwargs)
(Worker_TP0 pid=1865) INFO 03-19 13:31:07 [monitor.py:76] Initial profiling/warmup run took 20.16 s
(Worker_TP1 pid=1866) WARNING 03-19 13:31:07 [decorators.py:304] Compiling model again due to a load failure from /home/repne/.cache/vllm/torch_compile_cache/torch_aot_compile/c6b02e2fc0af6c2bcaf0f132afe44184216fea0cf0b02061224342fd1c536b2b/rank_1_0/model, reason: Source code has changed since the last compilation. Recompiling the model.
(Worker_TP0 pid=1865) WARNING 03-19 13:31:07 [decorators.py:304] Compiling model again due to a load failure from /home/repne/.cache/vllm/torch_compile_cache/torch_aot_compile/c6b02e2fc0af6c2bcaf0f132afe44184216fea0cf0b02061224342fd1c536b2b/rank_0_0/model, reason: Source code has changed since the last compilation. Recompiling the model.
(Worker_TP0 pid=1865) INFO 03-19 13:31:08 [backends.py:990] Using cache directory: /home/repne/.cache/vllm/torch_compile_cache/5c5555a76f/rank_0_0/eagle_head for vLLM's torch.compile
(Worker_TP0 pid=1865) INFO 03-19 13:31:08 [backends.py:1050] Dynamo bytecode transform time: 0.87 s
(Worker_TP0 pid=1865) INFO 03-19 13:31:09 [backends.py:284] Directly load the compiled graph(s) for compile range (1, 8192) from the cache, took 0.045 s
(Worker_TP0 pid=1865) INFO 03-19 13:31:09 [decorators.py:627] saved AOT compiled function to /home/repne/.cache/vllm/torch_compile_cache/torch_aot_compile/c6b02e2fc0af6c2bcaf0f132afe44184216fea0cf0b02061224342fd1c536b2b/rank_0_0/model
(Worker_TP0 pid=1865) INFO 03-19 13:31:09 [monitor.py:48] torch.compile took 1.30 s in total
(Worker_TP0 pid=1865) INFO 03-19 13:31:09 [monitor.py:76] Initial profiling/warmup run took 0.07 s
(Worker_TP0 pid=1865) WARNING 03-19 13:31:09 [kv_cache_utils.py:1056] Add 3 padding layers, may waste at most 6.25% KV cache memory
(Worker_TP0 pid=1865) INFO 03-19 13:31:09 [kv_cache_utils.py:826] Overriding num_gpu_blocks=0 with num_gpu_blocks_override=192
(Worker_TP1 pid=1866) WARNING 03-19 13:31:09 [kv_cache_utils.py:1056] Add 3 padding layers, may waste at most 6.25% KV cache memory
(Worker_TP1 pid=1866) INFO 03-19 13:31:09 [kv_cache_utils.py:826] Overriding num_gpu_blocks=0 with num_gpu_blocks_override=192
(Worker_TP0 pid=1865) INFO 03-19 13:31:09 [gpu_model_runner.py:5646] Profiling CUDA graph memory: PIECEWISE=26 (largest=192), FULL=14 (largest=96)
(Worker_TP1 pid=1866) INFO 03-19 13:31:09 [gpu_model_runner.py:5646] Profiling CUDA graph memory: PIECEWISE=26 (largest=192), FULL=14 (largest=96)
(Worker_TP0 pid=1865) INFO 03-19 13:31:19 [custom_all_reduce.py:216] Registering 522 cuda graph addresses
(Worker_TP1 pid=1866) INFO 03-19 13:31:19 [custom_all_reduce.py:216] Registering 522 cuda graph addresses
(Worker_TP0 pid=1865) INFO 03-19 13:31:20 [gpu_model_runner.py:5725] Estimated CUDA graph memory: 0.47 GiB total
(Worker_TP1 pid=1866) INFO 03-19 13:31:20 [gpu_model_runner.py:5725] Estimated CUDA graph memory: 0.47 GiB total
(Worker_TP0 pid=1865) INFO 03-19 13:31:21 [gpu_worker.py:456] Available KV cache memory: 13.61 GiB
(Worker_TP0 pid=1865) INFO 03-19 13:31:21 [gpu_worker.py:472] CUDA graph memory profiling is enabled (VLLM_MEMORY_PROFILER_ESTIMATE_CUDAGRAPHS=1). This will become the default in v0.19. The current --gpu-memory-utilization=0.9400 is equivalent to --gpu-memory-utilization=0.9249 without CUDA graph memory profiling. To maintain the same effective KV cache size as before, increase --gpu-memory-utilization to 0.9551.
(Worker_TP1 pid=1866) INFO 03-19 13:31:21 [gpu_worker.py:472] CUDA graph memory profiling is enabled (VLLM_MEMORY_PROFILER_ESTIMATE_CUDAGRAPHS=1). This will become the default in v0.19. The current --gpu-memory-utilization=0.9400 is equivalent to --gpu-memory-utilization=0.9249 without CUDA graph memory profiling. To maintain the same effective KV cache size as before, increase --gpu-memory-utilization to 0.9551.
(EngineCore pid=1794) WARNING 03-19 13:31:21 [kv_cache_utils.py:1056] Add 3 padding layers, may waste at most 6.25% KV cache memory
(EngineCore pid=1794) INFO 03-19 13:31:21 [kv_cache_utils.py:1316] GPU KV cache size: 104,800 tokens
(EngineCore pid=1794) INFO 03-19 13:31:21 [kv_cache_utils.py:1321] Maximum concurrency for 262,144 tokens per request: 1.54x
(Worker_TP1 pid=1866) 2026-03-19 13:31:21,205 - INFO - autotuner.py:262 - flashinfer.jit: [Autotuner]: Autotuning process starts ...
(Worker_TP0 pid=1865) 2026-03-19 13:31:21,205 - INFO - autotuner.py:262 - flashinfer.jit: [Autotuner]: Autotuning process starts ...
(Worker_TP0 pid=1865) 2026-03-19 13:31:22,065 - INFO - autotuner.py:268 - flashinfer.jit: [Autotuner]: Autotuning process ends
(Worker_TP1 pid=1866) 2026-03-19 13:31:22,065 - INFO - autotuner.py:268 - flashinfer.jit: [Autotuner]: Autotuning process ends
Capturing CUDA graphs (mixed prefill-decode, PIECEWISE): 100%|███████████████████████████████████████████████████████████████████████████████████████████| 26/26 [00:01<00:00, 21.27it/s]
Capturing CUDA graphs (decode, FULL):  71%|██████████████████████████████████████████████████████████████████████████████▌                               | 10/14 [00:03<00:02,  1.87it/s](Worker_TP1 pid=1866) /home/repne/vllm/vllm/model_executor/layers/fla/ops/utils.py:113: UserWarning: Input tensor shape suggests potential format mismatch: seq_len (18) < num_heads (24). This may indicate the inputs were passed in head-first format [B, H, T, ...] when head_first=False was specified. Please verify your input tensor format matches the expected shape [B, T, H, ...].
(Worker_TP1 pid=1866)   return fn(*contiguous_args, **contiguous_kwargs)
/home/repne/vllm/vllm/model_executor/layers/fla/ops/utils.py:113: UserWarning: Input tensor shape suggests potential format mismatch: seq_len (18) < num_heads (24). This may indicate the inputs were passed in head-first format [B, H, T, ...] when head_first=False was specified. Please verify your input tensor format matches the expected shape [B, T, H, ...].
(Worker_TP0 pid=1865)   return fn(*contiguous_args, **contiguous_kwargs)
(Worker_TP0 pid=1865) /home/repne/vllm/vllm/model_executor/layers/fla/ops/utils.py:113: UserWarning: Input tensor shape suggests potential format mismatch: seq_len (9) < num_heads (24). This may indicate the inputs were passed in head-first format [B, H, T, ...] when head_first=False was specified. Please verify your input tensor format matches the expected shape [B, T, H, ...].
(Worker_TP0 pid=1865)   return fn(*contiguous_args, **contiguous_kwargs)
(Worker_TP1 pid=1866) /home/repne/vllm/vllm/model_executor/layers/fla/ops/utils.py:113: UserWarning: Input tensor shape suggests potential format mismatch: seq_len (9) < num_heads (24). This may indicate the inputs were passed in head-first format [B, H, T, ...] when head_first=False was specified. Please verify your input tensor format matches the expected shape [B, T, H, ...].
(Worker_TP1 pid=1866)   return fn(*contiguous_args, **contiguous_kwargs)
Capturing CUDA graphs (decode, FULL):  86%|██████████████████████████████████████████████████████████████████████████████████████████████▎               | 12/14 [00:06<00:01,  1.21it/s](Worker_TP1 pid=1866) /home/repne/vllm/.venv/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:1181: UserWarning: Input tensor shape suggests potential format mismatch: seq_len (6) < num_heads (8). This may indicate the inputs were passed in head-first format [B, H, T, ...] Please verify your input tensor format matches the expected shape [B, T, H, ...].
(Worker_TP1 pid=1866)   return fn(*args, **kwargs)
(Worker_TP1 pid=1866) /home/repne/vllm/vllm/model_executor/layers/fla/ops/utils.py:113: UserWarning: Input tensor shape suggests potential format mismatch: seq_len (6) < num_heads (24). This may indicate the inputs were passed in head-first format [B, H, T, ...] when head_first=False was specified. Please verify your input tensor format matches the expected shape [B, T, H, ...].
(Worker_TP1 pid=1866)   return fn(*contiguous_args, **contiguous_kwargs)
/home/repne/vllm/.venv/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:1181: UserWarning: Input tensor shape suggests potential format mismatch: seq_len (6) < num_heads (8). This may indicate the inputs were passed in head-first format [B, H, T, ...] Please verify your input tensor format matches the expected shape [B, T, H, ...].
(Worker_TP0 pid=1865)   return fn(*args, **kwargs)
(Worker_TP0 pid=1865) /home/repne/vllm/vllm/model_executor/layers/fla/ops/utils.py:113: UserWarning: Input tensor shape suggests potential format mismatch: seq_len (6) < num_heads (24). This may indicate the inputs were passed in head-first format [B, H, T, ...] when head_first=False was specified. Please verify your input tensor format matches the expected shape [B, T, H, ...].
(Worker_TP0 pid=1865)   return fn(*contiguous_args, **contiguous_kwargs)
(Worker_TP0 pid=1865) /home/repne/vllm/.venv/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:1181: UserWarning: Input tensor shape suggests potential format mismatch: seq_len (3) < num_heads (8). This may indicate the inputs were passed in head-first format [B, H, T, ...] Please verify your input tensor format matches the expected shape [B, T, H, ...].
(Worker_TP0 pid=1865)   return fn(*args, **kwargs)
(Worker_TP1 pid=1866) /home/repne/vllm/.venv/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:1181: UserWarning: Input tensor shape suggests potential format mismatch: seq_len (3) < num_heads (8). This may indicate the inputs were passed in head-first format [B, H, T, ...] Please verify your input tensor format matches the expected shape [B, T, H, ...].
(Worker_TP0 pid=1865) /home/repne/vllm/vllm/model_executor/layers/fla/ops/utils.py:113: UserWarning: Input tensor shape suggests potential format mismatch: seq_len (3) < num_heads (24). This may indicate the inputs were passed in head-first format [B, H, T, ...] when head_first=False was specified. Please verify your input tensor format matches the expected shape [B, T, H, ...].
(Worker_TP1 pid=1866)   return fn(*args, **kwargs)
(Worker_TP0 pid=1865)   return fn(*contiguous_args, **contiguous_kwargs)
(Worker_TP1 pid=1866) /home/repne/vllm/vllm/model_executor/layers/fla/ops/utils.py:113: UserWarning: Input tensor shape suggests potential format mismatch: seq_len (3) < num_heads (24). This may indicate the inputs were passed in head-first format [B, H, T, ...] when head_first=False was specified. Please verify your input tensor format matches the expected shape [B, T, H, ...].
(Worker_TP1 pid=1866)   return fn(*contiguous_args, **contiguous_kwargs)
Capturing CUDA graphs (decode, FULL): 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████| 14/14 [00:06<00:00,  2.27it/s]
(Worker_TP1 pid=1866) INFO 03-19 13:31:34 [custom_all_reduce.py:216] Registering 5238 cuda graph addresses
(Worker_TP0 pid=1865) INFO 03-19 13:31:34 [custom_all_reduce.py:216] Registering 5238 cuda graph addresses
(Worker_TP0 pid=1865) INFO 03-19 13:31:35 [gpu_model_runner.py:5785] Graph capturing finished in 13 secs, took 0.50 GiB
(Worker_TP0 pid=1865) INFO 03-19 13:31:35 [gpu_worker.py:617] CUDA graph pool memory: 0.5 GiB (actual), 0.47 GiB (estimated), difference: 0.02 GiB (4.7%).
(Worker_TP1 pid=1866) INFO 03-19 13:31:35 [gpu_worker.py:617] CUDA graph pool memory: 0.5 GiB (actual), 0.47 GiB (estimated), difference: 0.02 GiB (4.7%).
(EngineCore pid=1794) INFO 03-19 13:31:35 [core.py:281] init engine (profile, create kv cache, warmup model) took 53.06 seconds
(EngineCore pid=1794) INFO 03-19 13:31:36 [registry.py:126] All limits of multimodal modalities supported by the model are set to 0, running in text-only mode.
(EngineCore pid=1794) INFO 03-19 13:31:36 [vllm.py:750] Asynchronous scheduling is enabled.
(EngineCore pid=1794) INFO 03-19 13:31:36 [compilation.py:289] Enabled custom fusions: norm_quant, act_quant
(APIServer pid=1704) INFO 03-19 13:31:36 [api_server.py:595] Supported tasks: ['generate']
(APIServer pid=1704) INFO 03-19 13:31:36 [parser_manager.py:202] "auto" tool choice has been enabled.
(APIServer pid=1704) WARNING 03-19 13:31:36 [model.py:1376] Default vLLM sampling parameters have been overridden by the model's `generation_config.json`: `{'temperature': 0.6, 'top_k': 20, 'top_p': 0.95}`. If this is not intended, please relaunch vLLM instance with `--generation-config vllm`.
(APIServer pid=1704) INFO 03-19 13:31:36 [parser_manager.py:202] "auto" tool choice has been enabled.
(APIServer pid=1704) INFO 03-19 13:31:36 [parser_manager.py:202] "auto" tool choice has been enabled.
(APIServer pid=1704) INFO 03-19 13:31:39 [hf.py:320] Detected the chat template content format to be 'string'. You can set `--chat-template-content-format` to override this.
(APIServer pid=1704) INFO 03-19 13:31:45 [base.py:216] Multi-modal warmup completed in 5.880s
(APIServer pid=1704) INFO 03-19 13:31:45 [parser_manager.py:202] "auto" tool choice has been enabled.
Full startup log after PR
(APIServer pid=2631) INFO 03-19 13:36:35 [utils.py:297]
(APIServer pid=2631) INFO 03-19 13:36:35 [utils.py:297]        █     █     █▄   ▄█
(APIServer pid=2631) INFO 03-19 13:36:35 [utils.py:297]  ▄▄ ▄█ █     █     █ ▀▄▀ █  version 0.17.1rc1.dev502+gbb9420919.d20260319
(APIServer pid=2631) INFO 03-19 13:36:35 [utils.py:297]   █▄█▀ █     █     █     █  model   Qwen/Qwen3.5-27B-FP8
(APIServer pid=2631) INFO 03-19 13:36:35 [utils.py:297]    ▀▀  ▀▀▀▀▀ ▀▀▀▀▀ ▀     ▀
(APIServer pid=2631) INFO 03-19 13:36:35 [utils.py:297]
(APIServer pid=2631) INFO 03-19 13:36:35 [utils.py:233] non-default args: {'model_tag': 'Qwen/Qwen3.5-27B-FP8', 'enable_auto_tool_choice': True, 'tool_call_parser': 'qwen3_coder', 'model': 'Qwen/Qwen3.5-27B-FP8', 'max_model_len': 262144, 'load_format': 'instanttensor', 'attention_backend': 'TRITON_ATTN', 'reasoning_parser': 'qwen3', 'tensor_parallel_size': 2, 'block_size': 32, 'gpu_memory_utilization': 0.94, 'enable_prefix_caching': True, 'language_model_only': True, 'max_num_batched_tokens': 8192, 'max_num_seqs': 32, 'speculative_config': {'method': 'mtp', 'num_speculative_tokens': 2, 'rejection_sample_method': 'probabilistic'}, 'optimization_level': '3'}
(APIServer pid=2631) INFO 03-19 13:36:36 [model.py:533] Resolved architecture: Qwen3_5ForConditionalGeneration
(APIServer pid=2631) INFO 03-19 13:36:36 [model.py:1582] Using max model len 262144
(APIServer pid=2631) INFO 03-19 13:36:36 [model.py:533] Resolved architecture: Qwen3_5MTP
(APIServer pid=2631) INFO 03-19 13:36:36 [model.py:1582] Using max model len 262144
(APIServer pid=2631) WARNING 03-19 13:36:36 [speculative.py:499] Enabling num_speculative_tokens > 1 will run multiple times of forward on same MTP layer,which may result in lower acceptance rate
(APIServer pid=2631) INFO 03-19 13:36:36 [scheduler.py:231] Chunked prefill is enabled with max_num_batched_tokens=8192.
(APIServer pid=2631) WARNING 03-19 13:36:36 [config.py:372] Mamba cache mode is set to 'align' for Qwen3_5ForConditionalGeneration by default when prefix caching is enabled
(APIServer pid=2631) INFO 03-19 13:36:36 [config.py:392] Warning: Prefix caching in Mamba cache 'align' mode is currently enabled. Its support for Mamba layers is experimental. Please report any issues you may observe.
(APIServer pid=2631) INFO 03-19 13:36:36 [config.py:212] Setting attention block size to 800 tokens to ensure that attention page size is >= mamba page size.
(APIServer pid=2631) INFO 03-19 13:36:36 [config.py:243] Padding mamba page size by 0.88% to ensure that mamba page size and attention page size are exactly equal.
(APIServer pid=2631) INFO 03-19 13:36:36 [vllm.py:750] Asynchronous scheduling is enabled.
(APIServer pid=2631) INFO 03-19 13:36:36 [compilation.py:289] Enabled custom fusions: norm_quant, act_quant
(APIServer pid=2631) INFO 03-19 13:36:38 [registry.py:126] All limits of multimodal modalities supported by the model are set to 0, running in text-only mode.
/home/repne/vllm/.venv/lib/python3.12/site-packages/torch/cuda/__init__.py:65: FutureWarning: The pynvml package is deprecated. Please install nvidia-ml-py instead. If you did not install pynvml directly, please report this to the maintainers of the package that installed pynvml for you.
  import pynvml  # type: ignore[import]
(EngineCore pid=2716) INFO 03-19 13:36:41 [core.py:103] Initializing a V1 LLM engine (v0.17.1rc1.dev502+gbb9420919.d20260319) with config: model='Qwen/Qwen3.5-27B-FP8', speculative_config=SpeculativeConfig(method='mtp', model='Qwen/Qwen3.5-27B-FP8', num_spec_tokens=2), tokenizer='Qwen/Qwen3.5-27B-FP8', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=262144, download_dir=None, load_format=instanttensor, tensor_parallel_size=2, pipeline_parallel_size=1, data_parallel_size=1, decode_context_parallel_size=1, dcp_comm_backend=ag_rs, disable_custom_all_reduce=False, quantization=fp8, enforce_eager=False, enable_return_routed_experts=False, kv_cache_dtype=auto, device_config=cuda, structured_outputs_config=StructuredOutputsConfig(backend='auto', disable_any_whitespace=False, disable_additional_properties=False, reasoning_parser='qwen3', reasoning_parser_plugin='', enable_in_reasoning=False), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None, kv_cache_metrics=False, kv_cache_metrics_sample=0.01, cudagraph_metrics=False, enable_layerwise_nvtx_tracing=False, enable_mfu_metrics=False, enable_mm_processor_stats=False, enable_logging_iteration_details=False), seed=0, served_model_name=Qwen/Qwen3.5-27B-FP8, enable_prefix_caching=True, enable_chunked_prefill=True, pooler_config=None, compilation_config={'mode': <CompilationMode.VLLM_COMPILE: 3>, 'debug_dump_path': None, 'cache_dir': '', 'compile_cache_save_format': 'binary', 'backend': 'inductor', 'custom_ops': ['+quant_fp8', 'none', '+quant_fp8'], 'splitting_ops': ['vllm::unified_attention', 'vllm::unified_attention_with_output', 'vllm::unified_mla_attention', 'vllm::unified_mla_attention_with_output', 'vllm::mamba_mixer2', 'vllm::mamba_mixer', 'vllm::short_conv', 'vllm::linear_attention', 'vllm::plamo2_mamba_mixer', 'vllm::gdn_attention_core', 'vllm::olmo_hybrid_gdn_full_forward', 'vllm::kda_attention', 'vllm::sparse_attn_indexer', 'vllm::rocm_aiter_sparse_attn_indexer', 'vllm::unified_kv_cache_update', 'vllm::unified_mla_kv_cache_update'], 'compile_mm_encoder': False, 'compile_sizes': [], 'compile_ranges_endpoints': [8192], 'inductor_compile_config': {'enable_auto_functionalized_v2': False, 'combo_kernels': True, 'benchmark_combo_kernel': True}, 'inductor_passes': {}, 'cudagraph_mode': <CUDAGraphMode.FULL_AND_PIECEWISE: (2, 1)>, 'cudagraph_num_of_warmups': 1, 'cudagraph_capture_sizes': [1, 2, 4, 8, 16, 24, 32, 40, 48, 56, 64, 72, 80, 88, 96, 104, 112, 120, 128, 136, 144, 152, 160, 168, 176, 184, 192], 'cudagraph_copy_inputs': False, 'cudagraph_specialize_lora': True, 'use_inductor_graph_partition': False, 'pass_config': {'fuse_norm_quant': True, 'fuse_act_quant': True, 'fuse_attn_quant': False, 'enable_sp': False, 'fuse_gemm_comms': False, 'fuse_allreduce_rms': False}, 'max_cudagraph_capture_size': 192, 'dynamic_shapes_config': {'type': <DynamicShapesType.BACKED: 'backed'>, 'evaluate_guards': False, 'assume_32_bit_indexing': False}, 'local_cache_dir': None, 'fast_moe_cold_start': False, 'static_all_moe_layers': []}
(EngineCore pid=2716) WARNING 03-19 13:36:41 [multiproc_executor.py:1013] Reducing Torch parallelism from 16 threads to 1 to avoid unnecessary CPU contention. Set OMP_NUM_THREADS in the external environment to tune this value as needed.
(EngineCore pid=2716) INFO 03-19 13:36:41 [multiproc_executor.py:134] DP group leader: node_rank=0, node_rank_within_dp=0, master_addr=127.0.0.1, mq_connect_ip=192.168.50.176 (local), world_size=2, local_world_size=2
/home/repne/vllm/.venv/lib/python3.12/site-packages/torch/cuda/__init__.py:65: FutureWarning: The pynvml package is deprecated. Please install nvidia-ml-py instead. If you did not install pynvml directly, please report this to the maintainers of the package that installed pynvml for you.
  import pynvml  # type: ignore[import]
/home/repne/vllm/.venv/lib/python3.12/site-packages/torch/cuda/__init__.py:65: FutureWarning: The pynvml package is deprecated. Please install nvidia-ml-py instead. If you did not install pynvml directly, please report this to the maintainers of the package that installed pynvml for you.
  import pynvml  # type: ignore[import]
INFO 03-19 13:36:45 [registry.py:126] All limits of multimodal modalities supported by the model are set to 0, running in text-only mode.
(Worker pid=2787) INFO 03-19 13:36:45 [parallel_state.py:1400] world_size=2 rank=0 local_rank=0 distributed_init_method=tcp://127.0.0.1:32895 backend=nccl
INFO 03-19 13:36:45 [registry.py:126] All limits of multimodal modalities supported by the model are set to 0, running in text-only mode.
(Worker pid=2788) INFO 03-19 13:36:45 [parallel_state.py:1400] world_size=2 rank=1 local_rank=1 distributed_init_method=tcp://127.0.0.1:32895 backend=nccl
(Worker pid=2787) INFO 03-19 13:36:45 [pynccl.py:111] vLLM is using nccl==2.29.7
(Worker pid=2787) WARNING 03-19 13:36:46 [symm_mem.py:67] SymmMemCommunicator: Device capability 12.0 not supported, communicator is not available.
(Worker pid=2788) WARNING 03-19 13:36:46 [symm_mem.py:67] SymmMemCommunicator: Device capability 12.0 not supported, communicator is not available.
(Worker pid=2788) INFO 03-19 13:36:46 [parallel_state.py:1716] rank 1 in world size 2 is assigned as DP rank 0, PP rank 0, PCP rank 0, TP rank 1, EP rank N/A, EPLB rank N/A
(Worker pid=2787) INFO 03-19 13:36:46 [parallel_state.py:1716] rank 0 in world size 2 is assigned as DP rank 0, PP rank 0, PCP rank 0, TP rank 0, EP rank N/A, EPLB rank N/A
(Worker pid=2788) WARNING 03-19 13:36:46 [__init__.py:204] min_p and logit_bias parameters won't work with speculative decoding.
(Worker pid=2787) WARNING 03-19 13:36:46 [__init__.py:204] min_p and logit_bias parameters won't work with speculative decoding.
(Worker_TP0 pid=2787) INFO 03-19 13:36:46 [gpu_model_runner.py:4516] Starting to load model Qwen/Qwen3.5-27B-FP8...
(Worker_TP0 pid=2787) INFO 03-19 13:36:46 [cuda.py:389] Using backend AttentionBackendEnum.FLASH_ATTN for vit attention
(Worker_TP0 pid=2787) INFO 03-19 13:36:46 [mm_encoder_attention.py:230] Using AttentionBackendEnum.FLASH_ATTN for MMEncoderAttention.
(Worker_TP0 pid=2787) INFO 03-19 13:36:46 [qwen3_next.py:203] Using Triton/FLA GDN prefill kernel
(Worker_TP1 pid=2788) INFO 03-19 13:36:46 [cuda.py:273] Using AttentionBackendEnum.TRITON_ATTN backend.
(Worker_TP0 pid=2787) INFO 03-19 13:36:46 [cuda.py:273] Using AttentionBackendEnum.TRITON_ATTN backend.
Loading safetensors using InstantTensor loader:   0% Completed | 0/1606 [00:00<?, ?it/s]
Loading safetensors using InstantTensor loader:   0% Completed | 1/1606 [00:01<29:30,  1.10s/it]
Loading safetensors using InstantTensor loader:   0% Completed | 2/1606 [00:02<29:27,  1.10s/it]
Loading safetensors using InstantTensor loader:   2% Completed | 37/1606 [00:03<01:40, 15.63it/s]
Loading safetensors using InstantTensor loader:   4% Completed | 61/1606 [00:04<01:22, 18.66it/s]
Loading safetensors using InstantTensor loader:   7% Completed | 118/1606 [00:05<00:46, 31.70it/s]
Loading safetensors using InstantTensor loader:  11% Completed | 174/1606 [00:06<00:36, 39.24it/s]
Loading safetensors using InstantTensor loader:  14% Completed | 229/1606 [00:07<00:31, 44.05it/s]
Loading safetensors using InstantTensor loader:  18% Completed | 288/1606 [00:08<00:27, 48.00it/s]
Loading safetensors using InstantTensor loader:  21% Completed | 343/1606 [00:09<00:25, 50.03it/s]
Loading safetensors using InstantTensor loader:  25% Completed | 403/1606 [00:10<00:22, 52.75it/s]
Loading safetensors using InstantTensor loader:  28% Completed | 456/1606 [00:11<00:22, 51.00it/s]
Loading safetensors using InstantTensor loader:  34% Completed | 547/1606 [00:12<00:17, 61.63it/s]
Loading safetensors using InstantTensor loader:  99% Completed | 1596/1606 [00:13<00:00, 348.75it/s]
Loading safetensors using InstantTensor loader: 100% Completed | 1606/1606 [00:13<00:00, 118.24it/s]
(Worker_TP0 pid=2787)
(Worker_TP0 pid=2787) INFO 03-19 13:37:01 [default_loader.py:384] Loading weights took 13.85 seconds
(Worker_TP0 pid=2787) INFO 03-19 13:37:01 [gpu_model_runner.py:4540] Loading drafter model...
(Worker_TP1 pid=2788) INFO 03-19 13:37:01 [gpu_model_runner.py:4540] Loading drafter model...
Loading safetensors using InstantTensor loader:   0% Completed | 0/1606 [00:00<?, ?it/s]
Loading safetensors using InstantTensor loader:   0% Completed | 1/1606 [00:01<35:01,  1.31s/it]
Loading safetensors using InstantTensor loader:   0% Completed | 2/1606 [00:02<31:39,  1.18s/it]
Loading safetensors using InstantTensor loader:   2% Completed | 37/1606 [00:03<01:44, 14.99it/s]
Loading safetensors using InstantTensor loader:   4% Completed | 64/1606 [00:04<01:20, 19.22it/s]
Loading safetensors using InstantTensor loader:   8% Completed | 121/1606 [00:05<00:46, 31.77it/s]
Loading safetensors using InstantTensor loader:  11% Completed | 178/1606 [00:06<00:36, 39.50it/s]
Loading safetensors using InstantTensor loader:  15% Completed | 234/1606 [00:07<00:30, 44.47it/s]
Loading safetensors using InstantTensor loader:  18% Completed | 291/1606 [00:08<00:27, 47.74it/s]
Loading safetensors using InstantTensor loader:  22% Completed | 348/1606 [00:09<00:25, 50.31it/s]
Loading safetensors using InstantTensor loader:  25% Completed | 409/1606 [00:10<00:22, 53.08it/s]
Loading safetensors using InstantTensor loader:  31% Completed | 500/1606 [00:11<00:17, 64.36it/s]
Loading safetensors using InstantTensor loader:  35% Completed | 565/1606 [00:12<00:16, 63.27it/s]
Loading safetensors using InstantTensor loader: 100% Completed | 1606/1606 [00:13<00:00, 117.80it/s]
(Worker_TP0 pid=2787)
(Worker_TP0 pid=2787) INFO 03-19 13:37:15 [default_loader.py:384] Loading weights took 13.67 seconds
(Worker_TP0 pid=2787) INFO 03-19 13:37:15 [eagle.py:1365] Detected MTP model. Sharing target model embedding weights with the draft model.
(Worker_TP0 pid=2787) INFO 03-19 13:37:15 [eagle.py:1419] Detected MTP model. Sharing target model lm_head weights with the draft model.
(Worker_TP1 pid=2788) INFO 03-19 13:37:15 [eagle.py:1365] Detected MTP model. Sharing target model embedding weights with the draft model.
(Worker_TP1 pid=2788) INFO 03-19 13:37:15 [eagle.py:1419] Detected MTP model. Sharing target model lm_head weights with the draft model.
(Worker_TP0 pid=2787) INFO 03-19 13:37:15 [gpu_model_runner.py:4601] Model loading took 14.21 GiB memory and 28.644763 seconds
(Worker_TP0 pid=2787) INFO 03-19 13:37:18 [backends.py:990] Using cache directory: /home/repne/.cache/vllm/torch_compile_cache/5c5555a76f/rank_0_0/backbone for vLLM's torch.compile
(Worker_TP0 pid=2787) INFO 03-19 13:37:18 [backends.py:1050] Dynamo bytecode transform time: 2.47 s
(Worker_TP1 pid=2788) INFO 03-19 13:37:20 [decorators.py:296] Directly load AOT compilation from path /home/repne/.cache/vllm/torch_compile_cache/torch_aot_compile/5a4b27f53e842aad678ca335ae4bc07c2cd0cd62d4bfc216a7b03226576ee4b2/rank_1_0/model
(Worker_TP0 pid=2787) INFO 03-19 13:37:20 [backends.py:284] Directly load the compiled graph(s) for compile range (1, 8192) from the cache, took 1.683 s
(Worker_TP0 pid=2787) INFO 03-19 13:37:20 [monitor.py:48] torch.compile took 4.80 s in total
(Worker_TP0 pid=2787) INFO 03-19 13:37:20 [decorators.py:296] Directly load AOT compilation from path /home/repne/.cache/vllm/torch_compile_cache/torch_aot_compile/5a4b27f53e842aad678ca335ae4bc07c2cd0cd62d4bfc216a7b03226576ee4b2/rank_0_0/model
(Worker_TP0 pid=2787) /home/repne/vllm/vllm/model_executor/layers/fla/ops/utils.py:113: UserWarning: Input tensor shape suggests potential format mismatch: seq_len (16) < num_heads (24). This may indicate the inputs were passed in head-first format [B, H, T, ...] when head_first=False was specified. Please verify your input tensor format matches the expected shape [B, T, H, ...].
(Worker_TP0 pid=2787)   return fn(*contiguous_args, **contiguous_kwargs)
(Worker_TP1 pid=2788) /home/repne/vllm/vllm/model_executor/layers/fla/ops/utils.py:113: UserWarning: Input tensor shape suggests potential format mismatch: seq_len (16) < num_heads (24). This may indicate the inputs were passed in head-first format [B, H, T, ...] when head_first=False was specified. Please verify your input tensor format matches the expected shape [B, T, H, ...].
(Worker_TP1 pid=2788)   return fn(*contiguous_args, **contiguous_kwargs)
(Worker_TP0 pid=2787) INFO 03-19 13:37:41 [monitor.py:76] Initial profiling/warmup run took 20.18 s
(Worker_TP0 pid=2787) INFO 03-19 13:37:41 [backends.py:990] Using cache directory: /home/repne/.cache/vllm/torch_compile_cache/5c5555a76f/rank_0_0/eagle_head for vLLM's torch.compile
(Worker_TP0 pid=2787) INFO 03-19 13:37:41 [backends.py:1050] Dynamo bytecode transform time: 0.07 s
(Worker_TP1 pid=2788) INFO 03-19 13:37:41 [decorators.py:296] Directly load AOT compilation from path /home/repne/.cache/vllm/torch_compile_cache/torch_aot_compile/c6b02e2fc0af6c2bcaf0f132afe44184216fea0cf0b02061224342fd1c536b2b/rank_1_0/model
(Worker_TP0 pid=2787) INFO 03-19 13:37:41 [backends.py:284] Directly load the compiled graph(s) for compile range (1, 8192) from the cache, took 0.027 s
(Worker_TP0 pid=2787) INFO 03-19 13:37:41 [monitor.py:48] torch.compile took 0.43 s in total
(Worker_TP0 pid=2787) INFO 03-19 13:37:41 [decorators.py:296] Directly load AOT compilation from path /home/repne/.cache/vllm/torch_compile_cache/torch_aot_compile/c6b02e2fc0af6c2bcaf0f132afe44184216fea0cf0b02061224342fd1c536b2b/rank_0_0/model
(Worker_TP0 pid=2787) INFO 03-19 13:37:41 [monitor.py:76] Initial profiling/warmup run took 0.07 s
(Worker_TP1 pid=2788) WARNING 03-19 13:37:42 [kv_cache_utils.py:1109] Add 3 padding layers, may waste at most 6.25% KV cache memory
(Worker_TP1 pid=2788) INFO 03-19 13:37:42 [kv_cache_utils.py:854] Overriding num_gpu_blocks=0 with num_gpu_blocks_override=192
(Worker_TP1 pid=2788) INFO 03-19 13:37:42 [gpu_model_runner.py:5646] Profiling CUDA graph memory: PIECEWISE=26 (largest=192), FULL=14 (largest=96)
(Worker_TP0 pid=2787) WARNING 03-19 13:37:42 [kv_cache_utils.py:1109] Add 3 padding layers, may waste at most 6.25% KV cache memory
(Worker_TP0 pid=2787) INFO 03-19 13:37:42 [kv_cache_utils.py:854] Overriding num_gpu_blocks=0 with num_gpu_blocks_override=192
(Worker_TP0 pid=2787) INFO 03-19 13:37:42 [gpu_model_runner.py:5646] Profiling CUDA graph memory: PIECEWISE=26 (largest=192), FULL=14 (largest=96)
(Worker_TP0 pid=2787) INFO 03-19 13:37:52 [custom_all_reduce.py:216] Registering 522 cuda graph addresses
(Worker_TP1 pid=2788) INFO 03-19 13:37:52 [custom_all_reduce.py:216] Registering 522 cuda graph addresses
(Worker_TP1 pid=2788) INFO 03-19 13:37:53 [gpu_model_runner.py:5725] Estimated CUDA graph memory: 0.47 GiB total
(Worker_TP0 pid=2787) INFO 03-19 13:37:53 [gpu_model_runner.py:5725] Estimated CUDA graph memory: 0.47 GiB total
(Worker_TP1 pid=2788) INFO 03-19 13:37:53 [gpu_worker.py:472] CUDA graph memory profiling is enabled (VLLM_MEMORY_PROFILER_ESTIMATE_CUDAGRAPHS=1). This will become the default in v0.19. The current --gpu-memory-utilization=0.9400 is equivalent to --gpu-memory-utilization=0.9249 without CUDA graph memory profiling. To maintain the same effective KV cache size as before, increase --gpu-memory-utilization to 0.9551.
(Worker_TP0 pid=2787) INFO 03-19 13:37:53 [gpu_worker.py:456] Available KV cache memory: 13.61 GiB
(Worker_TP0 pid=2787) INFO 03-19 13:37:53 [gpu_worker.py:472] CUDA graph memory profiling is enabled (VLLM_MEMORY_PROFILER_ESTIMATE_CUDAGRAPHS=1). This will become the default in v0.19. The current --gpu-memory-utilization=0.9400 is equivalent to --gpu-memory-utilization=0.9249 without CUDA graph memory profiling. To maintain the same effective KV cache size as before, increase --gpu-memory-utilization to 0.9551.
(EngineCore pid=2716) WARNING 03-19 13:37:53 [kv_cache_utils.py:1109] Add 3 padding layers, may waste at most 6.25% KV cache memory
(EngineCore pid=2716) INFO 03-19 13:37:53 [kv_cache_utils.py:1498] GPU KV cache size: 132,800 tokens
(EngineCore pid=2716) INFO 03-19 13:37:53 [kv_cache_utils.py:1503] Maximum concurrency for 262,144 tokens per request: 0.51x
(Worker_TP1 pid=2788) WARNING 03-19 13:37:53 [compilation.py:1236] Capping cudagraph capture sizes from max 192 to 162 to fit Mamba cache blocks (166 blocks available). This limits the maximum batch size that can use CUDA graphs. To increase this limit, reduce max_num_seqs or increase available GPU memory.
(Worker_TP0 pid=2787) WARNING 03-19 13:37:53 [compilation.py:1236] Capping cudagraph capture sizes from max 192 to 162 to fit Mamba cache blocks (166 blocks available). This limits the maximum batch size that can use CUDA graphs. To increase this limit, reduce max_num_seqs or increase available GPU memory.
(Worker_TP0 pid=2787) 2026-03-19 13:37:53,494 - INFO - autotuner.py:262 - flashinfer.jit: [Autotuner]: Autotuning process starts ...
(Worker_TP1 pid=2788) 2026-03-19 13:37:53,494 - INFO - autotuner.py:262 - flashinfer.jit: [Autotuner]: Autotuning process starts ...
(Worker_TP0 pid=2787) 2026-03-19 13:37:54,349 - INFO - autotuner.py:268 - flashinfer.jit: [Autotuner]: Autotuning process ends
(Worker_TP1 pid=2788) 2026-03-19 13:37:54,349 - INFO - autotuner.py:268 - flashinfer.jit: [Autotuner]: Autotuning process ends
Capturing CUDA graphs (mixed prefill-decode, PIECEWISE): 100%|███████████████████████████████████████████████████████████████████████████████████████████| 22/22 [00:01<00:00, 21.89it/s]
Capturing CUDA graphs (decode, FULL):  71%|██████████████████████████████████████████████████████████████████████████████▌                               | 10/14 [00:03<00:02,  1.87it/s](Worker_TP1 pid=2788) /home/repne/vllm/vllm/model_executor/layers/fla/ops/utils.py:113: UserWarning: Input tensor shape suggests potential format mismatch: seq_len (18) < num_heads (24). This may indicate the inputs were passed in head-first format [B, H, T, ...] when head_first=False was specified. Please verify your input tensor format matches the expected shape [B, T, H, ...].
(Worker_TP1 pid=2788)   return fn(*contiguous_args, **contiguous_kwargs)
/home/repne/vllm/vllm/model_executor/layers/fla/ops/utils.py:113: UserWarning: Input tensor shape suggests potential format mismatch: seq_len (18) < num_heads (24). This may indicate the inputs were passed in head-first format [B, H, T, ...] when head_first=False was specified. Please verify your input tensor format matches the expected shape [B, T, H, ...].
(Worker_TP0 pid=2787)   return fn(*contiguous_args, **contiguous_kwargs)
(Worker_TP1 pid=2788) /home/repne/vllm/vllm/model_executor/layers/fla/ops/utils.py:113: UserWarning: Input tensor shape suggests potential format mismatch: seq_len (9) < num_heads (24). This may indicate the inputs were passed in head-first format [B, H, T, ...] when head_first=False was specified. Please verify your input tensor format matches the expected shape [B, T, H, ...].
(Worker_TP0 pid=2787) /home/repne/vllm/vllm/model_executor/layers/fla/ops/utils.py:113: UserWarning: Input tensor shape suggests potential format mismatch: seq_len (9) < num_heads (24). This may indicate the inputs were passed in head-first format [B, H, T, ...] when head_first=False was specified. Please verify your input tensor format matches the expected shape [B, T, H, ...].
(Worker_TP1 pid=2788)   return fn(*contiguous_args, **contiguous_kwargs)
(Worker_TP0 pid=2787)   return fn(*contiguous_args, **contiguous_kwargs)
Capturing CUDA graphs (decode, FULL):  86%|██████████████████████████████████████████████████████████████████████████████████████████████▎               | 12/14 [00:06<00:01,  1.22it/s](Worker_TP1 pid=2788) /home/repne/vllm/.venv/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:1181: UserWarning: Input tensor shape suggests potential format mismatch: seq_len (6) < num_heads (8). This may indicate the inputs were passed in head-first format [B, H, T, ...] Please verify your input tensor format matches the expected shape [B, T, H, ...].
(Worker_TP1 pid=2788)   return fn(*args, **kwargs)
(Worker_TP1 pid=2788) /home/repne/vllm/vllm/model_executor/layers/fla/ops/utils.py:113: UserWarning: Input tensor shape suggests potential format mismatch: seq_len (6) < num_heads (24). This may indicate the inputs were passed in head-first format [B, H, T, ...] when head_first=False was specified. Please verify your input tensor format matches the expected shape [B, T, H, ...].
(Worker_TP1 pid=2788)   return fn(*contiguous_args, **contiguous_kwargs)
/home/repne/vllm/.venv/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:1181: UserWarning: Input tensor shape suggests potential format mismatch: seq_len (6) < num_heads (8). This may indicate the inputs were passed in head-first format [B, H, T, ...] Please verify your input tensor format matches the expected shape [B, T, H, ...].
(Worker_TP0 pid=2787)   return fn(*args, **kwargs)
(Worker_TP0 pid=2787) /home/repne/vllm/vllm/model_executor/layers/fla/ops/utils.py:113: UserWarning: Input tensor shape suggests potential format mismatch: seq_len (6) < num_heads (24). This may indicate the inputs were passed in head-first format [B, H, T, ...] when head_first=False was specified. Please verify your input tensor format matches the expected shape [B, T, H, ...].
(Worker_TP0 pid=2787)   return fn(*contiguous_args, **contiguous_kwargs)
(Worker_TP0 pid=2787) /home/repne/vllm/.venv/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:1181: UserWarning: Input tensor shape suggests potential format mismatch: seq_len (3) < num_heads (8). This may indicate the inputs were passed in head-first format [B, H, T, ...] Please verify your input tensor format matches the expected shape [B, T, H, ...].
(Worker_TP1 pid=2788) /home/repne/vllm/.venv/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:1181: UserWarning: Input tensor shape suggests potential format mismatch: seq_len (3) < num_heads (8). This may indicate the inputs were passed in head-first format [B, H, T, ...] Please verify your input tensor format matches the expected shape [B, T, H, ...].
(Worker_TP0 pid=2787)   return fn(*args, **kwargs)
(Worker_TP1 pid=2788)   return fn(*args, **kwargs)
(Worker_TP1 pid=2788) /home/repne/vllm/vllm/model_executor/layers/fla/ops/utils.py:113: UserWarning: Input tensor shape suggests potential format mismatch: seq_len (3) < num_heads (24). This may indicate the inputs were passed in head-first format [B, H, T, ...] when head_first=False was specified. Please verify your input tensor format matches the expected shape [B, T, H, ...].
(Worker_TP0 pid=2787) /home/repne/vllm/vllm/model_executor/layers/fla/ops/utils.py:113: UserWarning: Input tensor shape suggests potential format mismatch: seq_len (3) < num_heads (24). This may indicate the inputs were passed in head-first format [B, H, T, ...] when head_first=False was specified. Please verify your input tensor format matches the expected shape [B, T, H, ...].
(Worker_TP1 pid=2788)   return fn(*contiguous_args, **contiguous_kwargs)
(Worker_TP0 pid=2787)   return fn(*contiguous_args, **contiguous_kwargs)
Capturing CUDA graphs (decode, FULL): 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████| 14/14 [00:06<00:00,  2.28it/s]
(Worker_TP1 pid=2788) INFO 03-19 13:38:06 [custom_all_reduce.py:216] Registering 4710 cuda graph addresses
(Worker_TP0 pid=2787) INFO 03-19 13:38:06 [custom_all_reduce.py:216] Registering 4710 cuda graph addresses
(Worker_TP1 pid=2788) INFO 03-19 13:38:07 [gpu_worker.py:617] CUDA graph pool memory: 0.42 GiB (actual), 0.47 GiB (estimated), difference: 0.05 GiB (13.1%).
(Worker_TP0 pid=2787) INFO 03-19 13:38:07 [gpu_model_runner.py:5785] Graph capturing finished in 13 secs, took 0.42 GiB
(Worker_TP0 pid=2787) INFO 03-19 13:38:07 [gpu_worker.py:617] CUDA graph pool memory: 0.42 GiB (actual), 0.47 GiB (estimated), difference: 0.05 GiB (13.1%).
(EngineCore pid=2716) INFO 03-19 13:38:07 [core.py:281] init engine (profile, create kv cache, warmup model) took 51.47 seconds
(EngineCore pid=2716) INFO 03-19 13:38:08 [registry.py:126] All limits of multimodal modalities supported by the model are set to 0, running in text-only mode.
(EngineCore pid=2716) INFO 03-19 13:38:08 [vllm.py:750] Asynchronous scheduling is enabled.
(EngineCore pid=2716) INFO 03-19 13:38:08 [compilation.py:289] Enabled custom fusions: norm_quant, act_quant
(APIServer pid=2631) INFO 03-19 13:38:08 [api_server.py:595] Supported tasks: ['generate']
(APIServer pid=2631) INFO 03-19 13:38:08 [parser_manager.py:202] "auto" tool choice has been enabled.
(APIServer pid=2631) WARNING 03-19 13:38:08 [model.py:1376] Default vLLM sampling parameters have been overridden by the model's `generation_config.json`: `{'temperature': 0.6, 'top_k': 20, 'top_p': 0.95}`. If this is not intended, please relaunch vLLM instance with `--generation-config vllm`.
(APIServer pid=2631) INFO 03-19 13:38:08 [parser_manager.py:202] "auto" tool choice has been enabled.
(APIServer pid=2631) INFO 03-19 13:38:08 [parser_manager.py:202] "auto" tool choice has been enabled.
(APIServer pid=2631) INFO 03-19 13:38:11 [hf.py:320] Detected the chat template content format to be 'string'. You can set `--chat-template-content-format` to override this.
(APIServer pid=2631) INFO 03-19 13:38:17 [base.py:216] Multi-modal warmup completed in 5.907s
(APIServer pid=2631) INFO 03-19 13:38:17 [parser_manager.py:202] "auto" tool choice has been enabled.

@IgniteGo

Thank you for the great PR. I support the idea of separating Mamba and attention cache management.

My understanding of your Mamba change: Mamba layers now allocate O(1) blocks (independent of request length), while attention layers allocate O(n) blocks. Is that correct? If so, I'm concerned about the prefix cache hit ratio (a rough sketch of the difference follows the list below):

  • Original: Mamba state stored in fixed token chunks (e.g., 400 tokens), allowing fine-grained reuse.
  • New: state tied to the whole request; a long request (e.g., ~100K tokens) cannot be partially reused by shorter prefixes.
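
To make the concern concrete, here is a rough back-of-envelope sketch, not vLLM code: the block size of 800 and the `1 + num_spec_tokens` cap are assumptions taken from this PR's description and startup logs, and all names are purely illustrative.

```python
from math import ceil

# Illustrative constants, not vLLM internals: 800 comes from the attention
# block size shown in the startup log above, 2 from the MTP speculative
# decoding config used in the benchmark.
BLOCK_SIZE = 800
NUM_SPEC_TOKENS = 2


def attention_blocks(num_tokens: int) -> int:
    """Attention KV cache grows with sequence length: O(n) blocks per request."""
    return ceil(num_tokens / BLOCK_SIZE)


def mamba_blocks_chunked(num_tokens: int) -> int:
    """Old sizing: Mamba layers sized like attention, one state snapshot per
    block-sized chunk, so also O(n) blocks per request."""
    return ceil(num_tokens / BLOCK_SIZE)


def mamba_blocks_compact() -> int:
    """Compact allocation per this PR: one block for the running state plus
    one per speculative token, independent of request length: O(1) blocks."""
    return 1 + NUM_SPEC_TOKENS


def reusable_prefix_chunked(shared_prefix_tokens: int) -> int:
    """Chunked state: any block-aligned prefix of a cached request can be
    reused, so the hit length rounds down to a block boundary."""
    return (shared_prefix_tokens // BLOCK_SIZE) * BLOCK_SIZE


def reusable_prefix_whole_request(shared_prefix_tokens: int,
                                  cached_request_tokens: int) -> int:
    """Whole-request state: only the end-of-request state exists, so a shorter
    prefix of a long cached request gets no Mamba-side hit."""
    if shared_prefix_tokens >= cached_request_tokens:
        return cached_request_tokens
    return 0


if __name__ == "__main__":
    for n in (8_192, 100_000, 262_144):
        print(f"{n:>7} tokens: attn={attention_blocks(n)} blocks, "
              f"mamba_old={mamba_blocks_chunked(n)}, "
              f"mamba_compact={mamba_blocks_compact()}")
    # A 10K-token prefix of a cached 100K-token request:
    print("reuse (chunked):", reusable_prefix_chunked(10_000))
    print("reuse (whole-request):", reusable_prefix_whole_request(10_000, 100_000))
```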

Was this impact taken into account in the initial design?

In high-reuse scenarios (e.g., multi-turn chat), does recomputation overhead outweigh memory savings?

I would greatly appreciate it if you could provide any relevant theoretical analysis or experimental results on the hit ratio.

Thanks again for your work.

@swtb3
Author

swtb3 commented Apr 23, 2026

> Thank you for the great PR. I support the idea of separating Mamba and attention cache management.
>
> My understanding of your Mamba change: Mamba layers now allocate O(1) blocks (independent of request length), while attention layers allocate O(n) blocks. Is that correct? If so, I'm concerned about the prefix cache hit ratio:
>
> • Original: Mamba state stored in fixed token chunks (e.g., 400 tokens), allowing fine-grained reuse.
> • New: state tied to the whole request; a long request (e.g., ~100K tokens) cannot be partially reused by shorter prefixes.
>
> Was this impact taken into account in the initial design?
>
> In high-reuse scenarios (e.g., multi-turn chat), does recomputation overhead outweigh memory savings?
>
> I would greatly appreciate it if you could provide any relevant theoretical analysis or experimental results on the hit ratio.
>
> Thanks again for your work.

Hello, thanks for the kudos.

Trouble is this ended up becoming a rabbit hole.

At this stage I welcome contributions from people who have a deeper knowledge of the KV cache in vLLM and of the more recent changes to it.

@IgniteGo

> > Thank you for the great PR. I support the idea of separating Mamba and attention cache management.
> >
> > My understanding of your Mamba change: Mamba layers now allocate O(1) blocks (independent of request length), while attention layers allocate O(n) blocks. Is that correct? If so, I'm concerned about the prefix cache hit ratio:
> >
> > • Original: Mamba state stored in fixed token chunks (e.g., 400 tokens), allowing fine-grained reuse.
> > • New: state tied to the whole request; a long request (e.g., ~100K tokens) cannot be partially reused by shorter prefixes.
> >
> > Was this impact taken into account in the initial design?
> > In high-reuse scenarios (e.g., multi-turn chat), does recomputation overhead outweigh memory savings?
> > I would greatly appreciate it if you could provide any relevant theoretical analysis or experimental results on the hit ratio.
> > Thanks again for your work.
>
> Hello, thanks for the kudos.
>
> Trouble is this ended up becoming a rabbit hole.
>
> At this stage I welcome contributions from people who have a deeper knowledge of the KV cache in vLLM and of the more recent changes to it.

Thank you for your timely reply. Looking forward to having more discussions in the future.


Labels: bug, v1
