[Perf] Optimize hidden state extraction logic #37374
benchislett wants to merge 3 commits into vllm-project:main
Conversation
Signed-off-by: Benjamin Chislett <bchislett@nvidia.com>
Signed-off-by: Benjamin Chislett <bchislett@nvidia.com>
Code Review
This pull request introduces significant performance optimizations to the hidden state extraction logic by leveraging asynchronous DtoH copies and disk I/O. The use of a dedicated CUDA stream, pinned memory, and a thread pool for I/O operations is well-implemented. The addition of a benchmark script is also a valuable contribution for measuring the impact of these changes. My review identifies one area for improvement to make the performance tuning more flexible.
```python
# the default stream (model forward). Thread pool for disk writes.
self._copy_stream: torch.cuda.Stream | None = None  # lazy init
self._executor = ThreadPoolExecutor(
    max_workers=8, thread_name_prefix="vllm-hs-save"
```
The number of worker threads in the ThreadPoolExecutor is hardcoded to 8. This may not be optimal for all hardware configurations and could lead to either underutilization of resources or excessive contention, especially in a performance-sensitive context. It would be better to make this value configurable, for instance, through kv_connector_extra_config.
Suggested change:
```diff
-    max_workers=8, thread_name_prefix="vllm-hs-save"
+    max_workers=self._kv_transfer_config.get_from_extra_config("num_writer_threads", 8), thread_name_prefix="vllm-hs-save"
```
Signed-off-by: Benjamin Chislett <bchislett@nvidia.com>
Hey, just a heads up: I attempted to benchmark this using the new speculators offline data generation script (see this PR comment: vllm-project/speculators#353 (comment)), and it threw the error "Found another KV transfer request with same req_id". Maybe related to the async usage (although it also occurred when I limited concurrent requests to 1), or to chunked prefill?

Stacktrace:
❯ CUDA_VISIBLE_DEVICES=0 python scripts/launch_vllm.py Qwen/Qwen3-8B --
Running command:
vllm serve Qwen/Qwen3-8B --speculative_config {"method": "extract_hidden_states", "num_speculative_tokens": 1, "draft_model_config": {"hf_config": {"eagle_aux_hidden_state_layer_ids": [2, 18, 33, 36]}}} --kv_transfer_config {"kv_connector": "ExampleHiddenStatesConnector", "kv_role": "kv_producer", "kv_connector_extra_config": {"shared_storage_path": "/tmp/hidden_states"}}
(APIServer pid=959548) INFO 03-19 14:01:46 [utils.py:297]
(APIServer pid=959548) INFO 03-19 14:01:46 [utils.py:297] █ █ █▄ ▄█
(APIServer pid=959548) INFO 03-19 14:01:46 [utils.py:297] ▄▄ ▄█ █ █ █ ▀▄▀ █ version 0.17.2rc1.dev30+gbe5de50a3
(APIServer pid=959548) INFO 03-19 14:01:46 [utils.py:297] █▄█▀ █ █ █ █ model Qwen/Qwen3-8B
(APIServer pid=959548) INFO 03-19 14:01:46 [utils.py:297] ▀▀ ▀▀▀▀▀ ▀▀▀▀▀ ▀ ▀
(APIServer pid=959548) INFO 03-19 14:01:46 [utils.py:297]
(APIServer pid=959548) INFO 03-19 14:01:46 [utils.py:233] non-default args: {'model_tag': 'Qwen/Qwen3-8B', 'model': 'Qwen/Qwen3-8B', 'speculative_config': {'method': 'extract_hidden_states', 'num_speculative_tokens': 1, 'draft_model_config': {'hf_config': {'eagle_aux_hidden_state_layer_ids': [2, 18, 33, 36]}}}, 'kv_transfer_config': KVTransferConfig(kv_connector='ExampleHiddenStatesConnector', engine_id='67caa35d-4470-4e54-aad2-6f2bc7ae60cd', kv_buffer_device='cuda', kv_buffer_size=1000000000.0, kv_role='kv_producer', kv_rank=None, kv_parallel_size=1, kv_ip='127.0.0.1', kv_port=14579, kv_connector_extra_config={'shared_storage_path': '/tmp/hidden_states'}, kv_connector_module_path=None, enable_permute_local_kv=False, kv_load_failure_policy='fail')}
(APIServer pid=959548) INFO 03-19 14:01:47 [model.py:533] Resolved architecture: Qwen3ForCausalLM
(APIServer pid=959548) INFO 03-19 14:01:47 [model.py:1582] Using max model len 40960
(APIServer pid=959548) INFO 03-19 14:01:47 [scheduler.py:231] Chunked prefill is enabled with max_num_batched_tokens=2048.
(APIServer pid=959548) INFO 03-19 14:01:47 [vllm.py:754] Asynchronous scheduling is enabled.
(APIServer pid=959548) WARNING 03-19 14:01:47 [vllm.py:1168] Turning off hybrid kv cache manager because `--kv-transfer-config` is set. This will reduce the performance of vLLM on LLMs with sliding window attention or Mamba attention. If you are a developer of kv connector, please consider supporting hybrid kv cache manager for your connector by making sure your connector is a subclass of `SupportsHMA` defined in kv_connector/v1/base.py and use --no-disable-hybrid-kv-cache-manager to start vLLM.
(EngineCore pid=963700) INFO 03-19 14:01:54 [core.py:103] Initializing a V1 LLM engine (v0.17.2rc1.dev30+gbe5de50a3) with config: model='Qwen/Qwen3-8B', speculative_config=SpeculativeConfig(method='extract_hidden_states', model=None, num_spec_tokens=1), tokenizer='Qwen/Qwen3-8B', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=40960, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, data_parallel_size=1, decode_context_parallel_size=1, dcp_comm_backend=ag_rs, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, enable_return_routed_experts=False, kv_cache_dtype=auto, device_config=cuda, structured_outputs_config=StructuredOutputsConfig(backend='auto', disable_any_whitespace=False, disable_additional_properties=False, reasoning_parser='', reasoning_parser_plugin='', enable_in_reasoning=False), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None, kv_cache_metrics=False, kv_cache_metrics_sample=0.01, cudagraph_metrics=False, enable_layerwise_nvtx_tracing=False, enable_mfu_metrics=False, enable_mm_processor_stats=False, enable_logging_iteration_details=False), seed=0, served_model_name=Qwen/Qwen3-8B, enable_prefix_caching=True, enable_chunked_prefill=True, pooler_config=None, compilation_config={'mode': <CompilationMode.VLLM_COMPILE: 3>, 'debug_dump_path': None, 'cache_dir': '', 'compile_cache_save_format': 'binary', 'backend': 'inductor', 'custom_ops': ['none'], 'splitting_ops': ['vllm::unified_attention', 'vllm::unified_attention_with_output', 'vllm::unified_mla_attention', 'vllm::unified_mla_attention_with_output', 'vllm::mamba_mixer2', 'vllm::mamba_mixer', 'vllm::short_conv', 'vllm::linear_attention', 'vllm::plamo2_mamba_mixer', 'vllm::gdn_attention_core', 'vllm::olmo_hybrid_gdn_full_forward', 'vllm::kda_attention', 
'vllm::sparse_attn_indexer', 'vllm::rocm_aiter_sparse_attn_indexer', 'vllm::unified_kv_cache_update', 'vllm::unified_mla_kv_cache_update'], 'compile_mm_encoder': False, 'compile_sizes': [], 'compile_ranges_endpoints': [2048], 'inductor_compile_config': {'enable_auto_functionalized_v2': False, 'combo_kernels': True, 'benchmark_combo_kernel': True}, 'inductor_passes': {}, 'cudagraph_mode': <CUDAGraphMode.FULL_AND_PIECEWISE: (2, 1)>, 'cudagraph_num_of_warmups': 1, 'cudagraph_capture_sizes': [1, 2, 4, 8, 16, 24, 32, 40, 48, 56, 64, 72, 80, 88, 96, 104, 112, 120, 128, 136, 144, 152, 160, 168, 176, 184, 192, 200, 208, 216, 224, 232, 240, 248, 256, 272, 288, 304, 320, 336, 352, 368, 384, 400, 416, 432, 448, 464, 480, 496, 512], 'cudagraph_copy_inputs': False, 'cudagraph_specialize_lora': True, 'use_inductor_graph_partition': False, 'pass_config': {'fuse_norm_quant': False, 'fuse_act_quant': False, 'fuse_attn_quant': False, 'enable_sp': False, 'fuse_gemm_comms': False, 'fuse_allreduce_rms': False}, 'max_cudagraph_capture_size': 512, 'dynamic_shapes_config': {'type': <DynamicShapesType.BACKED: 'backed'>, 'evaluate_guards': False, 'assume_32_bit_indexing': False}, 'local_cache_dir': None, 'fast_moe_cold_start': False, 'static_all_moe_layers': []}
(EngineCore pid=963700) INFO 03-19 14:01:55 [parallel_state.py:1395] world_size=1 rank=0 local_rank=0 distributed_init_method=tcp://10.14.217.17:50297 backend=nccl
(EngineCore pid=963700) INFO 03-19 14:01:55 [parallel_state.py:1717] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, PCP rank 0, TP rank 0, EP rank N/A, EPLB rank N/A
(EngineCore pid=963700) WARNING 03-19 14:01:55 [__init__.py:204] min_p and logit_bias parameters won't work with speculative decoding.
(EngineCore pid=963700) INFO 03-19 14:01:55 [gpu_model_runner.py:4506] Starting to load model Qwen/Qwen3-8B...
(EngineCore pid=963700) INFO 03-19 14:01:56 [cuda.py:317] Using FLASH_ATTN attention backend out of potential backends: ['FLASH_ATTN', 'FLASHINFER', 'TRITON_ATTN', 'FLEX_ATTENTION'].
(EngineCore pid=963700) INFO 03-19 14:01:56 [flash_attn.py:598] Using FlashAttention version 2
(EngineCore pid=963700) <frozen importlib._bootstrap_external>:1241: FutureWarning: The cuda.cudart module is deprecated and will be removed in a future release, please switch to use the cuda.bindings.runtime module instead.
(EngineCore pid=963700) <frozen importlib._bootstrap_external>:1241: FutureWarning: The cuda.nvrtc module is deprecated and will be removed in a future release, please switch to use the cuda.bindings.nvrtc module instead.
Loading safetensors checkpoint shards: 0% Completed | 0/5 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 20% Completed | 1/5 [00:00<00:01, 2.20it/s]
Loading safetensors checkpoint shards: 40% Completed | 2/5 [00:01<00:01, 1.89it/s]
Loading safetensors checkpoint shards: 60% Completed | 3/5 [00:01<00:01, 1.84it/s]
Loading safetensors checkpoint shards: 80% Completed | 4/5 [00:02<00:00, 1.93it/s]
Loading safetensors checkpoint shards: 100% Completed | 5/5 [00:02<00:00, 2.41it/s]
Loading safetensors checkpoint shards: 100% Completed | 5/5 [00:02<00:00, 2.17it/s]
(EngineCore pid=963700)
(EngineCore pid=963700) INFO 03-19 14:01:59 [default_loader.py:373] Loading weights took 2.34 seconds
(EngineCore pid=963700) INFO 03-19 14:01:59 [gpu_model_runner.py:4530] Loading drafter model...
(EngineCore pid=963700) WARNING 03-19 14:01:59 [vllm.py:1739] `torch.compile` is turned on, but the model Qwen/Qwen3-8B does not support it. Please open an issue on GitHub if you want it to be supported.
(EngineCore pid=963700) INFO 03-19 14:01:59 [default_loader.py:373] Loading weights took 9125665.46 seconds
(EngineCore pid=963700) INFO 03-19 14:01:59 [gpu_model_runner.py:4568] Using auxiliary layers from speculative config: (2, 18, 33, 36)
(EngineCore pid=963700) INFO 03-19 14:01:59 [gpu_model_runner.py:4591] Model loading took 15.27 GiB memory and 3.270017 seconds
(EngineCore pid=963700) INFO 03-19 14:02:03 [backends.py:988] Using cache directory: /home/fynnsu/.cache/vllm/torch_compile_cache/8ba5af3c3d/rank_0_0/backbone for vLLM's torch.compile
(EngineCore pid=963700) INFO 03-19 14:02:03 [backends.py:1048] Dynamo bytecode transform time: 2.97 s
(EngineCore pid=963700) INFO 03-19 14:02:04 [backends.py:284] Directly load the compiled graph(s) for compile range (1, 2048) from the cache, took 1.245 s
(EngineCore pid=963700) INFO 03-19 14:02:04 [monitor.py:48] torch.compile took 4.51 s in total
(EngineCore pid=963700) INFO 03-19 14:02:04 [decorators.py:296] Directly load AOT compilation from path /home/fynnsu/.cache/vllm/torch_compile_cache/torch_aot_compile/187bf04caf7a27d4a09033238166cc6876bdedd5e08b2cc352a756c85eefdb8e/rank_0_0/model
(EngineCore pid=963700) INFO 03-19 14:02:04 [monitor.py:76] Initial profiling/warmup run took 0.12 s
(EngineCore pid=963700) INFO 03-19 14:02:11 [kv_cache_utils.py:826] Overriding num_gpu_blocks=0 with num_gpu_blocks_override=512
(EngineCore pid=963700) WARNING 03-19 14:02:11 [gpu_model_runner.py:6064] CUDAGraphMode.FULL_AND_PIECEWISE is not supported with CacheOnlyAttentionBackend backend (support: AttentionCGSupport.NEVER); setting cudagraph_mode=PIECEWISE because attention is compiled piecewise
(EngineCore pid=963700) INFO 03-19 14:02:11 [gpu_model_runner.py:5632] Profiling CUDA graph memory: PIECEWISE=51 (largest=512)
(EngineCore pid=963700) INFO 03-19 14:02:13 [gpu_model_runner.py:5711] Estimated CUDA graph memory: 0.40 GiB total
(EngineCore pid=963700) INFO 03-19 14:02:13 [gpu_worker.py:452] Available KV cache memory: 54.75 GiB
(EngineCore pid=963700) INFO 03-19 14:02:13 [gpu_worker.py:486] In v0.19, CUDA graph memory profiling will be enabled by default (VLLM_MEMORY_PROFILER_ESTIMATE_CUDAGRAPHS=1), which more accurately accounts for CUDA graph memory during KV cache allocation. To try it now, set VLLM_MEMORY_PROFILER_ESTIMATE_CUDAGRAPHS=1 and increase --gpu-memory-utilization from 0.9000 to 0.9051 to maintain the same effective KV cache size.
(EngineCore pid=963700) INFO 03-19 14:02:13 [kv_cache_utils.py:1316] GPU KV cache size: 326,192 tokens
(EngineCore pid=963700) INFO 03-19 14:02:13 [kv_cache_utils.py:1321] Maximum concurrency for 40,960 tokens per request: 7.96x
(EngineCore pid=963700) INFO 03-19 14:02:13 [factory.py:64] Creating v1 connector with name: ExampleHiddenStatesConnector and engine_id: 67caa35d-4470-4e54-aad2-6f2bc7ae60cd
(EngineCore pid=963700) WARNING 03-19 14:02:13 [base.py:189] Initializing KVConnectorBase_V1. This API is experimental and subject to change in the future as we iterate the design.
(EngineCore pid=963700) INFO 03-19 14:02:13 [example_hidden_states_connector.py:156] KVTransferConfig(kv_connector='ExampleHiddenStatesConnector', engine_id='67caa35d-4470-4e54-aad2-6f2bc7ae60cd', kv_buffer_device='cuda', kv_buffer_size=1000000000.0, kv_role='kv_producer', kv_rank=None, kv_parallel_size=1, kv_ip='127.0.0.1', kv_port=14579, kv_connector_extra_config={'shared_storage_path': '/tmp/hidden_states'}, kv_connector_module_path=None, enable_permute_local_kv=False, kv_load_failure_policy='fail')
(EngineCore pid=963700) INFO 03-19 14:02:13 [example_hidden_states_connector.py:157] Shared storage path is /tmp/hidden_states
Capturing CUDA graphs (mixed prefill-decode, PIECEWISE): 100%|████████████| 51/51 [00:02<00:00, 23.99it/s]
(EngineCore pid=963700) INFO 03-19 14:02:16 [gpu_model_runner.py:5771] Graph capturing finished in 3 secs, took 0.43 GiB
(EngineCore pid=963700) INFO 03-19 14:02:16 [gpu_worker.py:614] CUDA graph pool memory: 0.43 GiB (actual), 0.4 GiB (estimated), difference: 0.03 GiB (6.0%).
(EngineCore pid=963700) INFO 03-19 14:02:16 [core.py:281] init engine (profile, create kv cache, warmup model) took 16.18 seconds
(EngineCore pid=963700) INFO 03-19 14:02:16 [factory.py:64] Creating v1 connector with name: ExampleHiddenStatesConnector and engine_id: 67caa35d-4470-4e54-aad2-6f2bc7ae60cd
(EngineCore pid=963700) WARNING 03-19 14:02:16 [base.py:189] Initializing KVConnectorBase_V1. This API is experimental and subject to change in the future as we iterate the design.
(EngineCore pid=963700) INFO 03-19 14:02:16 [example_hidden_states_connector.py:156] KVTransferConfig(kv_connector='ExampleHiddenStatesConnector', engine_id='67caa35d-4470-4e54-aad2-6f2bc7ae60cd', kv_buffer_device='cuda', kv_buffer_size=1000000000.0, kv_role='kv_producer', kv_rank=None, kv_parallel_size=1, kv_ip='127.0.0.1', kv_port=14579, kv_connector_extra_config={'shared_storage_path': '/tmp/hidden_states'}, kv_connector_module_path=None, enable_permute_local_kv=False, kv_load_failure_policy='fail')
(EngineCore pid=963700) INFO 03-19 14:02:16 [example_hidden_states_connector.py:157] Shared storage path is /tmp/hidden_states
(EngineCore pid=963700) INFO 03-19 14:02:17 [vllm.py:754] Asynchronous scheduling is enabled.
(APIServer pid=959548) INFO 03-19 14:02:17 [api_server.py:595] Supported tasks: ['generate']
(APIServer pid=959548) WARNING 03-19 14:02:17 [model.py:1376] Default vLLM sampling parameters have been overridden by the model's `generation_config.json`: `{'temperature': 0.6, 'top_k': 20, 'top_p': 0.95}`. If this is not intended, please relaunch vLLM instance with `--generation-config vllm`.
(APIServer pid=959548) INFO 03-19 14:02:18 [hf.py:320] Detected the chat template content format to be 'string'. You can set `--chat-template-content-format` to override this.
(APIServer pid=959548) INFO 03-19 14:02:18 [api_server.py:599] Starting vLLM server on http://0.0.0.0:8000
(APIServer pid=959548) INFO 03-19 14:02:18 [launcher.py:37] Available routes are:
(APIServer pid=959548) INFO 03-19 14:02:18 [launcher.py:46] Route: /openapi.json, Methods: GET, HEAD
(APIServer pid=959548) INFO 03-19 14:02:18 [launcher.py:46] Route: /docs, Methods: GET, HEAD
(APIServer pid=959548) INFO 03-19 14:02:18 [launcher.py:46] Route: /docs/oauth2-redirect, Methods: GET, HEAD
(APIServer pid=959548) INFO 03-19 14:02:18 [launcher.py:46] Route: /redoc, Methods: GET, HEAD
(APIServer pid=959548) INFO 03-19 14:02:18 [launcher.py:46] Route: /tokenize, Methods: POST
(APIServer pid=959548) INFO 03-19 14:02:18 [launcher.py:46] Route: /detokenize, Methods: POST
(APIServer pid=959548) INFO 03-19 14:02:18 [launcher.py:46] Route: /load, Methods: GET
(APIServer pid=959548) INFO 03-19 14:02:18 [launcher.py:46] Route: /version, Methods: GET
(APIServer pid=959548) INFO 03-19 14:02:18 [launcher.py:46] Route: /health, Methods: GET
(APIServer pid=959548) INFO 03-19 14:02:18 [launcher.py:46] Route: /metrics, Methods: GET
(APIServer pid=959548) INFO 03-19 14:02:18 [launcher.py:46] Route: /v1/models, Methods: GET
(APIServer pid=959548) INFO 03-19 14:02:18 [launcher.py:46] Route: /ping, Methods: GET
(APIServer pid=959548) INFO 03-19 14:02:18 [launcher.py:46] Route: /ping, Methods: POST
(APIServer pid=959548) INFO 03-19 14:02:18 [launcher.py:46] Route: /invocations, Methods: POST
(APIServer pid=959548) INFO 03-19 14:02:18 [launcher.py:46] Route: /v1/chat/completions, Methods: POST
(APIServer pid=959548) INFO 03-19 14:02:18 [launcher.py:46] Route: /v1/responses, Methods: POST
(APIServer pid=959548) INFO 03-19 14:02:18 [launcher.py:46] Route: /v1/responses/{response_id}, Methods: GET
(APIServer pid=959548) INFO 03-19 14:02:18 [launcher.py:46] Route: /v1/responses/{response_id}/cancel, Methods: POST
(APIServer pid=959548) INFO 03-19 14:02:18 [launcher.py:46] Route: /v1/completions, Methods: POST
(APIServer pid=959548) INFO 03-19 14:02:18 [launcher.py:46] Route: /v1/messages, Methods: POST
(APIServer pid=959548) INFO 03-19 14:02:18 [launcher.py:46] Route: /v1/messages/count_tokens, Methods: POST
(APIServer pid=959548) INFO 03-19 14:02:18 [launcher.py:46] Route: /inference/v1/generate, Methods: POST
(APIServer pid=959548) INFO 03-19 14:02:18 [launcher.py:46] Route: /scale_elastic_ep, Methods: POST
(APIServer pid=959548) INFO 03-19 14:02:18 [launcher.py:46] Route: /is_scaling_elastic_ep, Methods: POST
(APIServer pid=959548) INFO 03-19 14:02:18 [launcher.py:46] Route: /v1/chat/completions/render, Methods: POST
(APIServer pid=959548) INFO 03-19 14:02:18 [launcher.py:46] Route: /v1/completions/render, Methods: POST
(APIServer pid=959548) INFO: Started server process [959548]
(APIServer pid=959548) INFO: Waiting for application startup.
(APIServer pid=959548) INFO: Application startup complete.
(APIServer pid=959548) INFO: 127.0.0.1:41278 - "GET /v1/models HTTP/1.1" 200 OK
(EngineCore pid=963700) ERROR 03-19 14:03:08 [dump_input.py:72] Dumping input data for V1 LLM engine (v0.17.2rc1.dev30+gbe5de50a3) with config: model='Qwen/Qwen3-8B', speculative_config=SpeculativeConfig(method='extract_hidden_states', model=None, num_spec_tokens=1), tokenizer='Qwen/Qwen3-8B', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=40960, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, data_parallel_size=1, decode_context_parallel_size=1, dcp_comm_backend=ag_rs, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, enable_return_routed_experts=False, kv_cache_dtype=auto, device_config=cuda, structured_outputs_config=StructuredOutputsConfig(backend='auto', disable_any_whitespace=False, disable_additional_properties=False, reasoning_parser='', reasoning_parser_plugin='', enable_in_reasoning=False), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None, kv_cache_metrics=False, kv_cache_metrics_sample=0.01, cudagraph_metrics=False, enable_layerwise_nvtx_tracing=False, enable_mfu_metrics=False, enable_mm_processor_stats=False, enable_logging_iteration_details=False), seed=0, served_model_name=Qwen/Qwen3-8B, enable_prefix_caching=True, enable_chunked_prefill=True, pooler_config=None, compilation_config={'mode': <CompilationMode.VLLM_COMPILE: 3>, 'debug_dump_path': None, 'cache_dir': '/home/fynnsu/.cache/vllm/torch_compile_cache/8ba5af3c3d', 'compile_cache_save_format': 'binary', 'backend': 'inductor', 'custom_ops': ['none'], 'splitting_ops': ['vllm::unified_attention', 'vllm::unified_attention_with_output', 'vllm::unified_mla_attention', 'vllm::unified_mla_attention_with_output', 'vllm::mamba_mixer2', 'vllm::mamba_mixer', 'vllm::short_conv', 'vllm::linear_attention', 'vllm::plamo2_mamba_mixer', 'vllm::gdn_attention_core', 
'vllm::olmo_hybrid_gdn_full_forward', 'vllm::kda_attention', 'vllm::sparse_attn_indexer', 'vllm::rocm_aiter_sparse_attn_indexer', 'vllm::unified_kv_cache_update', 'vllm::unified_mla_kv_cache_update'], 'compile_mm_encoder': False, 'compile_sizes': [], 'compile_ranges_endpoints': [2048], 'inductor_compile_config': {'enable_auto_functionalized_v2': False, 'combo_kernels': True, 'benchmark_combo_kernel': True}, 'inductor_passes': {}, 'cudagraph_mode': <CUDAGraphMode.PIECEWISE: 1>, 'cudagraph_num_of_warmups': 1, 'cudagraph_capture_sizes': [1, 2, 4, 8, 16, 24, 32, 40, 48, 56, 64, 72, 80, 88, 96, 104, 112, 120, 128, 136, 144, 152, 160, 168, 176, 184, 192, 200, 208, 216, 224, 232, 240, 248, 256, 272, 288, 304, 320, 336, 352, 368, 384, 400, 416, 432, 448, 464, 480, 496, 512], 'cudagraph_copy_inputs': False, 'cudagraph_specialize_lora': True, 'use_inductor_graph_partition': False, 'pass_config': {'fuse_norm_quant': False, 'fuse_act_quant': False, 'fuse_attn_quant': False, 'enable_sp': False, 'fuse_gemm_comms': False, 'fuse_allreduce_rms': False}, 'max_cudagraph_capture_size': 512, 'dynamic_shapes_config': {'type': <DynamicShapesType.BACKED: 'backed'>, 'evaluate_guards': False, 'assume_32_bit_indexing': False}, 'local_cache_dir': '/home/fynnsu/.cache/vllm/torch_compile_cache/8ba5af3c3d/rank_0_0/backbone', 'fast_moe_cold_start': False, 'static_all_moe_layers': []},
(EngineCore pid=963700) ERROR 03-19 14:03:08 [dump_input.py:79] Dumping scheduler output for model execution: SchedulerOutput(scheduled_new_reqs=[], scheduled_cached_reqs=CachedRequestData(req_ids=['cmpl-9ed56d3d60a3f0d1-0-9a661b80'],resumed_req_ids=set(),new_token_ids_lens=[],all_token_ids_lens={},new_block_ids=[([129, 130, 131, 132, 133, 134, 135, 136, 137, 138, 139, 140, 141, 142, 143, 144, 145, 146, 147],)],num_computed_tokens=[2048],num_output_tokens=[0]), num_scheduled_tokens={cmpl-9ed56d3d60a3f0d1-0-9a661b80: 296}, total_num_scheduled_tokens=296, scheduled_spec_decode_tokens={}, scheduled_encoder_inputs={}, num_common_prefix_blocks=[147], finished_req_ids=[], free_encoder_mm_hashes=[], preempted_req_ids=[], has_structured_output_requests=false, pending_structured_output_tokens=false, num_invalid_spec_tokens=null, kv_connector_metadata=ExampleHiddenStatesConnectorMetadata(requests=[ReqMeta(req_id='cmpl-9ed56d3d60a3f0d1-0-9a661b80', filename='/tmp/hidden_states/cmpl-9ed56d3d60a3f0d1-0-9a661b80.safetensors', token_ids=Tensor(shape=torch.Size([2344]), device=cpu,dtype=torch.int64), slot_mapping=Tensor(shape=torch.Size([2352]), device=cpu,dtype=torch.int64), new_req=false)]), ec_connector_metadata=null, new_block_ids_to_zero=null)
(EngineCore pid=963700) ERROR 03-19 14:03:08 [dump_input.py:81] Dumping scheduler stats: SchedulerStats(num_running_reqs=1, num_waiting_reqs=0, step_counter=0, current_wave=0, kv_cache_usage=0.007210830962425141, encoder_cache_usage=0.0, prefix_cache_stats=PrefixCacheStats(reset=False, requests=0, queries=0, hits=0, preempted_requests=0, preempted_queries=0, preempted_hits=0), connector_prefix_cache_stats=PrefixCacheStats(reset=False, requests=0, queries=0, hits=0, preempted_requests=0, preempted_queries=0, preempted_hits=0), kv_cache_eviction_events=[], spec_decoding_stats=None, kv_connector_stats=None, waiting_lora_adapters={}, running_lora_adapters={}, cudagraph_stats=None, perf_stats=None)
(EngineCore pid=963700) ERROR 03-19 14:03:08 [core.py:1101] EngineCore encountered a fatal error.
(EngineCore pid=963700) ERROR 03-19 14:03:08 [core.py:1101] Traceback (most recent call last):
(EngineCore pid=963700) ERROR 03-19 14:03:08 [core.py:1101] File "/home/fynnsu/vllm-project/vllm/vllm/v1/engine/core.py", line 1092, in run_engine_core
(EngineCore pid=963700) ERROR 03-19 14:03:08 [core.py:1101] engine_core.run_busy_loop()
(EngineCore pid=963700) ERROR 03-19 14:03:08 [core.py:1101] File "/home/fynnsu/vllm-project/vllm/vllm/v1/engine/core.py", line 1133, in run_busy_loop
(EngineCore pid=963700) ERROR 03-19 14:03:08 [core.py:1101] self._process_engine_step()
(EngineCore pid=963700) ERROR 03-19 14:03:08 [core.py:1101] File "/home/fynnsu/vllm-project/vllm/vllm/v1/engine/core.py", line 1172, in _process_engine_step
(EngineCore pid=963700) ERROR 03-19 14:03:08 [core.py:1101] outputs, model_executed = self.step_fn()
(EngineCore pid=963700) ERROR 03-19 14:03:08 [core.py:1101] ^^^^^^^^^^^^^^
(EngineCore pid=963700) ERROR 03-19 14:03:08 [core.py:1101] File "/home/fynnsu/vllm-project/vllm/vllm/v1/engine/core.py", line 497, in step_with_batch_queue
(EngineCore pid=963700) ERROR 03-19 14:03:08 [core.py:1101] model_output = future.result()
(EngineCore pid=963700) ERROR 03-19 14:03:08 [core.py:1101] ^^^^^^^^^^^^^^^
(EngineCore pid=963700) ERROR 03-19 14:03:08 [core.py:1101] File "/home/fynnsu/.local/share/uv/python/cpython-3.11.14-linux-x86_64-gnu/lib/python3.11/concurrent/futures/_base.py", line 449, in result
(EngineCore pid=963700) ERROR 03-19 14:03:08 [core.py:1101] return self.__get_result()
(EngineCore pid=963700) ERROR 03-19 14:03:08 [core.py:1101] ^^^^^^^^^^^^^^^^^^^
(EngineCore pid=963700) ERROR 03-19 14:03:08 [core.py:1101] File "/home/fynnsu/.local/share/uv/python/cpython-3.11.14-linux-x86_64-gnu/lib/python3.11/concurrent/futures/_base.py", line 401, in __get_result
(EngineCore pid=963700) ERROR 03-19 14:03:08 [core.py:1101] raise self._exception
(EngineCore pid=963700) ERROR 03-19 14:03:08 [core.py:1101] File "/home/fynnsu/vllm-project/vllm/vllm/v1/executor/uniproc_executor.py", line 82, in collective_rpc
(EngineCore pid=963700) ERROR 03-19 14:03:08 [core.py:1101] result = run_method(self.driver_worker, method, args, kwargs)
(EngineCore pid=963700) ERROR 03-19 14:03:08 [core.py:1101] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=963700) ERROR 03-19 14:03:08 [core.py:1101] File "/home/fynnsu/vllm-project/vllm/vllm/v1/serial_utils.py", line 459, in run_method
(EngineCore pid=963700) ERROR 03-19 14:03:08 [core.py:1101] return func(*args, **kwargs)
(EngineCore pid=963700) ERROR 03-19 14:03:08 [core.py:1101] ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=963700) ERROR 03-19 14:03:08 [core.py:1101] File "/home/fynnsu/vllm-project/vllm/.venv/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 124, in decorate_context
(EngineCore pid=963700) ERROR 03-19 14:03:08 [core.py:1101] return func(*args, **kwargs)
(EngineCore pid=963700) ERROR 03-19 14:03:08 [core.py:1101] ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=963700) ERROR 03-19 14:03:08 [core.py:1101] File "/home/fynnsu/vllm-project/vllm/vllm/v1/worker/gpu_worker.py", line 756, in sample_tokens
(EngineCore pid=963700) ERROR 03-19 14:03:08 [core.py:1101] return self.model_runner.sample_tokens(grammar_output)
(EngineCore pid=963700) ERROR 03-19 14:03:08 [core.py:1101] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=963700) ERROR 03-19 14:03:08 [core.py:1101] File "/home/fynnsu/vllm-project/vllm/.venv/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 124, in decorate_context
(EngineCore pid=963700) ERROR 03-19 14:03:08 [core.py:1101] return func(*args, **kwargs)
(EngineCore pid=963700) ERROR 03-19 14:03:08 [core.py:1101] ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=963700) ERROR 03-19 14:03:08 [core.py:1101] File "/home/fynnsu/vllm-project/vllm/vllm/v1/worker/gpu_model_runner.py", line 4072, in sample_tokens
(EngineCore pid=963700) ERROR 03-19 14:03:08 [core.py:1101] self.finalize_kv_connector()
(EngineCore pid=963700) ERROR 03-19 14:03:08 [core.py:1101] File "/home/fynnsu/vllm-project/vllm/vllm/v1/worker/kv_connector_model_runner_mixin.py", line 88, in finalize_kv_connector
(EngineCore pid=963700) ERROR 03-19 14:03:08 [core.py:1101] kv_connector.wait_for_save()
(EngineCore pid=963700) ERROR 03-19 14:03:08 [core.py:1101] File "/home/fynnsu/vllm-project/vllm/vllm/distributed/kv_transfer/kv_connector/v1/example_hidden_states_connector.py", line 219, in wait_for_save
(EngineCore pid=963700) ERROR 03-19 14:03:08 [core.py:1101] assert prior is None, "Found another KV transfer request with same req_id!"
(EngineCore pid=963700) ERROR 03-19 14:03:08 [core.py:1101] ^^^^^^^^^^^^^
(EngineCore pid=963700) ERROR 03-19 14:03:08 [core.py:1101] AssertionError: Found another KV transfer request with same req_id!
(EngineCore pid=963700) Process EngineCore:
(EngineCore pid=963700) Traceback (most recent call last):
(APIServer pid=959548) ERROR 03-19 14:03:08 [async_llm.py:707] AsyncLLM output_handler failed.
(APIServer pid=959548) ERROR 03-19 14:03:08 [async_llm.py:707] Traceback (most recent call last):
(APIServer pid=959548) ERROR 03-19 14:03:08 [async_llm.py:707] File "/home/fynnsu/vllm-project/vllm/vllm/v1/engine/async_llm.py", line 663, in output_handler
(APIServer pid=959548) ERROR 03-19 14:03:08 [async_llm.py:707] outputs = await engine_core.get_output_async()
(APIServer pid=959548) ERROR 03-19 14:03:08 [async_llm.py:707] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=959548) ERROR 03-19 14:03:08 [async_llm.py:707] File "/home/fynnsu/vllm-project/vllm/vllm/v1/engine/core_client.py", line 1022, in get_output_async
(APIServer pid=959548) ERROR 03-19 14:03:08 [async_llm.py:707] raise self._format_exception(outputs) from None
(APIServer pid=959548) ERROR 03-19 14:03:08 [async_llm.py:707] vllm.v1.engine.exceptions.EngineDeadError: EngineCore encountered an issue. See stack trace (above) for the root cause.
(EngineCore pid=963700) File "/home/fynnsu/.local/share/uv/python/cpython-3.11.14-linux-x86_64-gnu/lib/python3.11/multiprocessing/process.py", line 314, in _bootstrap
(EngineCore pid=963700) self.run()
(EngineCore pid=963700) File "/home/fynnsu/.local/share/uv/python/cpython-3.11.14-linux-x86_64-gnu/lib/python3.11/multiprocessing/process.py", line 108, in run
(EngineCore pid=963700) self._target(*self._args, **self._kwargs)
(EngineCore pid=963700) File "/home/fynnsu/vllm-project/vllm/vllm/v1/engine/core.py", line 1103, in run_engine_core
(EngineCore pid=963700) raise e
(EngineCore pid=963700) File "/home/fynnsu/vllm-project/vllm/vllm/v1/engine/core.py", line 1092, in run_engine_core
(EngineCore pid=963700) engine_core.run_busy_loop()
(EngineCore pid=963700) File "/home/fynnsu/vllm-project/vllm/vllm/v1/engine/core.py", line 1133, in run_busy_loop
(EngineCore pid=963700) self._process_engine_step()
(EngineCore pid=963700) File "/home/fynnsu/vllm-project/vllm/vllm/v1/engine/core.py", line 1172, in _process_engine_step
(EngineCore pid=963700) outputs, model_executed = self.step_fn()
(EngineCore pid=963700) ^^^^^^^^^^^^^^
(EngineCore pid=963700) File "/home/fynnsu/vllm-project/vllm/vllm/v1/engine/core.py", line 497, in step_with_batch_queue
(EngineCore pid=963700) model_output = future.result()
(EngineCore pid=963700) ^^^^^^^^^^^^^^^
(EngineCore pid=963700) File "/home/fynnsu/.local/share/uv/python/cpython-3.11.14-linux-x86_64-gnu/lib/python3.11/concurrent/futures/_base.py", line 449, in result
(EngineCore pid=963700) return self.__get_result()
(EngineCore pid=963700) ^^^^^^^^^^^^^^^^^^^
(EngineCore pid=963700) File "/home/fynnsu/.local/share/uv/python/cpython-3.11.14-linux-x86_64-gnu/lib/python3.11/concurrent/futures/_base.py", line 401, in __get_result
(EngineCore pid=963700) raise self._exception
(EngineCore pid=963700) File "/home/fynnsu/vllm-project/vllm/vllm/v1/executor/uniproc_executor.py", line 82, in collective_rpc
(EngineCore pid=963700) result = run_method(self.driver_worker, method, args, kwargs)
(EngineCore pid=963700) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=963700) File "/home/fynnsu/vllm-project/vllm/vllm/v1/serial_utils.py", line 459, in run_method
(EngineCore pid=963700) return func(*args, **kwargs)
(EngineCore pid=963700) ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=963700) File "/home/fynnsu/vllm-project/vllm/.venv/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 124, in decorate_context
(EngineCore pid=963700) return func(*args, **kwargs)
(EngineCore pid=963700) ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=963700) File "/home/fynnsu/vllm-project/vllm/vllm/v1/worker/gpu_worker.py", line 756, in sample_tokens
(EngineCore pid=963700) return self.model_runner.sample_tokens(grammar_output)
(EngineCore pid=963700) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=963700) File "/home/fynnsu/vllm-project/vllm/.venv/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 124, in decorate_context
(EngineCore pid=963700) return func(*args, **kwargs)
(EngineCore pid=963700) ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=963700) File "/home/fynnsu/vllm-project/vllm/vllm/v1/worker/gpu_model_runner.py", line 4072, in sample_tokens
(EngineCore pid=963700) self.finalize_kv_connector()
(EngineCore pid=963700) File "/home/fynnsu/vllm-project/vllm/vllm/v1/worker/kv_connector_model_runner_mixin.py", line 88, in finalize_kv_connector
(EngineCore pid=963700) kv_connector.wait_for_save()
(EngineCore pid=963700) File "/home/fynnsu/vllm-project/vllm/vllm/distributed/kv_transfer/kv_connector/v1/example_hidden_states_connector.py", line 219, in wait_for_save
(EngineCore pid=963700) assert prior is None, "Found another KV transfer request with same req_id!"
(EngineCore pid=963700) ^^^^^^^^^^^^^
(EngineCore pid=963700) AssertionError: Found another KV transfer request with same req_id!
(APIServer pid=959548) INFO: 127.0.0.1:41278 - "POST /v1/completions HTTP/1.1" 500 Internal Server Error
(APIServer pid=959548) INFO: Shutting down
[rank0]:[W319 14:03:08.124439735 ProcessGroupNCCL.cpp:1553] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())
(APIServer pid=959548) INFO: Waiting for application shutdown.
(APIServer pid=959548) INFO: Application shutdown complete.
(APIServer pid=959548) INFO:     Finished server process [959548]
Must disable chunked prefill.
It's possible to just have it wait for the previous copy to finish and overwrite. But in the interest of performance, I think one write per request makes more sense, in which case chunked prefill must be disabled.
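The trade-off above can be sketched in miniature. This is a hypothetical mock of the one-write-per-request guard (the `PendingSaves` class and its method names are illustrative, not the connector's real API); only the assertion message matches the real code. With chunked prefill, the same `req_id` is scheduled across multiple steps, so a second save attempt trips the guard:

```python
# Hedged sketch: a registry that enforces exactly one in-flight save per
# request. A second save for the same req_id (e.g. a later prefill chunk)
# hits the assertion seen in the stacktrace above.
class PendingSaves:
    def __init__(self):
        self._in_flight: dict[str, bytes] = {}

    def submit(self, req_id: str, payload: bytes) -> None:
        prior = self._in_flight.get(req_id)
        assert prior is None, "Found another KV transfer request with same req_id!"
        self._in_flight[req_id] = payload

    def complete(self, req_id: str) -> None:
        # Called once the async copy + disk write have finished.
        self._in_flight.pop(req_id)


saves = PendingSaves()
saves.submit("cmpl-abc-0", b"chunk-0")  # first (and only expected) save: OK
try:
    saves.submit("cmpl-abc-0", b"chunk-1")  # second chunk of the same request
    duplicate_rejected = False
except AssertionError:
    duplicate_rejected = True
```

The alternative mentioned above (wait for the prior copy, then overwrite) would replace the assertion with a blocking wait on `prior`, at the cost of one write per prefill chunk instead of one per request.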
Yep, got this working with
This pull request has merge conflicts that must be resolved before it can be merged.
Purpose
Eliminate unnecessary synchronization and improve the overlap compatibility of hidden state extraction. Some inefficiency remains due to the lack of full overlap, which should be further improved by #32951.
Also adds a benchmark script for testing the performance of the hidden states extractor. In a particular case of interest for GPT-OSS, I get a 1.45x improvement over the baseline, and am able to achieve within 10% of baseline throughput in some cases (with a lot of parameter tuning).
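The overlap pattern the PR relies on (async DtoH copy on a side CUDA stream into pinned memory, then disk writes on a thread pool) can be emulated without a GPU. In this sketch, `copy_then_write` stands in for the real copy-plus-`safetensors`-write pipeline, and all names and paths are illustrative; the point is only that writes proceed in the background while the "model forward" continues, with a single blocking join at finalization (analogous to `wait_for_save`):

```python
# Hedged sketch: thread-pool pipelining of per-request saves, emulating the
# copy-stream + writer-pool structure without CUDA.
from concurrent.futures import ThreadPoolExecutor, wait
import time


def copy_then_write(hidden_states, path):
    staged = list(hidden_states)  # stands in for the DtoH copy into pinned memory
    time.sleep(0.01)              # stands in for the safetensors disk write
    return path, len(staged)


executor = ThreadPoolExecutor(max_workers=8, thread_name_prefix="vllm-hs-save")

# The forward pass keeps running while these proceed in the background;
# only the wait() below blocks, once per step rather than once per tensor.
futures = [
    executor.submit(copy_then_write, range(100 * (i + 1)), f"/tmp/req_{i}.safetensors")
    for i in range(4)
]
wait(futures)
results = dict(f.result() for f in futures)
```

In the real connector the copy must additionally be ordered against the forward pass, which is what the dedicated `torch.cuda.Stream` and CUDA events handle; the thread pool only ever touches the already-staged host buffers.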
Testing
None