[Perf] Optimize hidden state extraction logic #37374
benchislett wants to merge 3 commits into vllm-project:main
Conversation
Signed-off-by: Benjamin Chislett <bchislett@nvidia.com>
Signed-off-by: Benjamin Chislett <bchislett@nvidia.com>
Code Review
This pull request introduces significant performance optimizations to the hidden state extraction logic by leveraging asynchronous DtoH copies and disk I/O. The use of a dedicated CUDA stream, pinned memory, and a thread pool for I/O operations is well-implemented. The addition of a benchmark script is also a valuable contribution for measuring the impact of these changes. My review identifies one area for improvement to make the performance tuning more flexible.
```python
# the default stream (model forward). Thread pool for disk writes.
self._copy_stream: torch.cuda.Stream | None = None  # lazy init
self._executor = ThreadPoolExecutor(
    max_workers=8, thread_name_prefix="vllm-hs-save"
```
The number of worker threads in the ThreadPoolExecutor is hardcoded to 8. This may not be optimal for all hardware configurations and could lead to either underutilization of resources or excessive contention, especially in a performance-sensitive context. It would be better to make this value configurable, for instance, through kv_connector_extra_config.
Suggested change:
```diff
-    max_workers=8, thread_name_prefix="vllm-hs-save"
+    max_workers=self._kv_transfer_config.get_from_extra_config("num_writer_threads", 8), thread_name_prefix="vllm-hs-save"
```
Signed-off-by: Benjamin Chislett <bchislett@nvidia.com>
Hey, just a heads up: I attempted to benchmark this using the new speculators offline data generation script (see this PR comment: vllm-project/speculators#353 (comment)), and it threw the error "Found another KV transfer request with same req_id". Maybe related to the async usage (although it also occurred when I limited concurrent requests to 1), or to chunked prefill?

Stacktrace:
❯ CUDA_VISIBLE_DEVICES=0 python scripts/launch_vllm.py Qwen/Qwen3-8B --
Running command:
vllm serve Qwen/Qwen3-8B --speculative_config {"method": "extract_hidden_states", "num_speculative_tokens": 1, "draft_model_config": {"hf_config": {"eagle_aux_hidden_state_layer_ids": [2, 18, 33, 36]}}} --kv_transfer_config {"kv_connector": "ExampleHiddenStatesConnector", "kv_role": "kv_producer", "kv_connector_extra_config": {"shared_storage_path": "/tmp/hidden_states"}}
(APIServer pid=959548) INFO 03-19 14:01:46 [utils.py:297]
(APIServer pid=959548) INFO 03-19 14:01:46 [utils.py:297] █ █ █▄ ▄█
(APIServer pid=959548) INFO 03-19 14:01:46 [utils.py:297] ▄▄ ▄█ █ █ █ ▀▄▀ █ version 0.17.2rc1.dev30+gbe5de50a3
(APIServer pid=959548) INFO 03-19 14:01:46 [utils.py:297] █▄█▀ █ █ █ █ model Qwen/Qwen3-8B
(APIServer pid=959548) INFO 03-19 14:01:46 [utils.py:297] ▀▀ ▀▀▀▀▀ ▀▀▀▀▀ ▀ ▀
(APIServer pid=959548) INFO 03-19 14:01:46 [utils.py:297]
(APIServer pid=959548) INFO 03-19 14:01:46 [utils.py:233] non-default args: {'model_tag': 'Qwen/Qwen3-8B', 'model': 'Qwen/Qwen3-8B', 'speculative_config': {'method': 'extract_hidden_states', 'num_speculative_tokens': 1, 'draft_model_config': {'hf_config': {'eagle_aux_hidden_state_layer_ids': [2, 18, 33, 36]}}}, 'kv_transfer_config': KVTransferConfig(kv_connector='ExampleHiddenStatesConnector', engine_id='67caa35d-4470-4e54-aad2-6f2bc7ae60cd', kv_buffer_device='cuda', kv_buffer_size=1000000000.0, kv_role='kv_producer', kv_rank=None, kv_parallel_size=1, kv_ip='127.0.0.1', kv_port=14579, kv_connector_extra_config={'shared_storage_path': '/tmp/hidden_states'}, kv_connector_module_path=None, enable_permute_local_kv=False, kv_load_failure_policy='fail')}
(APIServer pid=959548) INFO 03-19 14:01:47 [model.py:533] Resolved architecture: Qwen3ForCausalLM
(APIServer pid=959548) INFO 03-19 14:01:47 [model.py:1582] Using max model len 40960
(APIServer pid=959548) INFO 03-19 14:01:47 [scheduler.py:231] Chunked prefill is enabled with max_num_batched_tokens=2048.
(APIServer pid=959548) INFO 03-19 14:01:47 [vllm.py:754] Asynchronous scheduling is enabled.
(APIServer pid=959548) WARNING 03-19 14:01:47 [vllm.py:1168] Turning off hybrid kv cache manager because `--kv-transfer-config` is set. This will reduce the performance of vLLM on LLMs with sliding window attention or Mamba attention. If you are a developer of kv connector, please consider supporting hybrid kv cache manager for your connector by making sure your connector is a subclass of `SupportsHMA` defined in kv_connector/v1/base.py and use --no-disable-hybrid-kv-cache-manager to start vLLM.
(EngineCore pid=963700) INFO 03-19 14:01:54 [core.py:103] Initializing a V1 LLM engine (v0.17.2rc1.dev30+gbe5de50a3) with config: model='Qwen/Qwen3-8B', speculative_config=SpeculativeConfig(method='extract_hidden_states', model=None, num_spec_tokens=1), tokenizer='Qwen/Qwen3-8B', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=40960, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, data_parallel_size=1, decode_context_parallel_size=1, dcp_comm_backend=ag_rs, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, enable_return_routed_experts=False, kv_cache_dtype=auto, device_config=cuda, structured_outputs_config=StructuredOutputsConfig(backend='auto', disable_any_whitespace=False, disable_additional_properties=False, reasoning_parser='', reasoning_parser_plugin='', enable_in_reasoning=False), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None, kv_cache_metrics=False, kv_cache_metrics_sample=0.01, cudagraph_metrics=False, enable_layerwise_nvtx_tracing=False, enable_mfu_metrics=False, enable_mm_processor_stats=False, enable_logging_iteration_details=False), seed=0, served_model_name=Qwen/Qwen3-8B, enable_prefix_caching=True, enable_chunked_prefill=True, pooler_config=None, compilation_config={'mode': <CompilationMode.VLLM_COMPILE: 3>, 'debug_dump_path': None, 'cache_dir': '', 'compile_cache_save_format': 'binary', 'backend': 'inductor', 'custom_ops': ['none'], 'splitting_ops': ['vllm::unified_attention', 'vllm::unified_attention_with_output', 'vllm::unified_mla_attention', 'vllm::unified_mla_attention_with_output', 'vllm::mamba_mixer2', 'vllm::mamba_mixer', 'vllm::short_conv', 'vllm::linear_attention', 'vllm::plamo2_mamba_mixer', 'vllm::gdn_attention_core', 'vllm::olmo_hybrid_gdn_full_forward', 'vllm::kda_attention', 
'vllm::sparse_attn_indexer', 'vllm::rocm_aiter_sparse_attn_indexer', 'vllm::unified_kv_cache_update', 'vllm::unified_mla_kv_cache_update'], 'compile_mm_encoder': False, 'compile_sizes': [], 'compile_ranges_endpoints': [2048], 'inductor_compile_config': {'enable_auto_functionalized_v2': False, 'combo_kernels': True, 'benchmark_combo_kernel': True}, 'inductor_passes': {}, 'cudagraph_mode': <CUDAGraphMode.FULL_AND_PIECEWISE: (2, 1)>, 'cudagraph_num_of_warmups': 1, 'cudagraph_capture_sizes': [1, 2, 4, 8, 16, 24, 32, 40, 48, 56, 64, 72, 80, 88, 96, 104, 112, 120, 128, 136, 144, 152, 160, 168, 176, 184, 192, 200, 208, 216, 224, 232, 240, 248, 256, 272, 288, 304, 320, 336, 352, 368, 384, 400, 416, 432, 448, 464, 480, 496, 512], 'cudagraph_copy_inputs': False, 'cudagraph_specialize_lora': True, 'use_inductor_graph_partition': False, 'pass_config': {'fuse_norm_quant': False, 'fuse_act_quant': False, 'fuse_attn_quant': False, 'enable_sp': False, 'fuse_gemm_comms': False, 'fuse_allreduce_rms': False}, 'max_cudagraph_capture_size': 512, 'dynamic_shapes_config': {'type': <DynamicShapesType.BACKED: 'backed'>, 'evaluate_guards': False, 'assume_32_bit_indexing': False}, 'local_cache_dir': None, 'fast_moe_cold_start': False, 'static_all_moe_layers': []}
(EngineCore pid=963700) INFO 03-19 14:01:55 [parallel_state.py:1395] world_size=1 rank=0 local_rank=0 distributed_init_method=tcp://10.14.217.17:50297 backend=nccl
(EngineCore pid=963700) INFO 03-19 14:01:55 [parallel_state.py:1717] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, PCP rank 0, TP rank 0, EP rank N/A, EPLB rank N/A
(EngineCore pid=963700) WARNING 03-19 14:01:55 [__init__.py:204] min_p and logit_bias parameters won't work with speculative decoding.
(EngineCore pid=963700) INFO 03-19 14:01:55 [gpu_model_runner.py:4506] Starting to load model Qwen/Qwen3-8B...
(EngineCore pid=963700) INFO 03-19 14:01:56 [cuda.py:317] Using FLASH_ATTN attention backend out of potential backends: ['FLASH_ATTN', 'FLASHINFER', 'TRITON_ATTN', 'FLEX_ATTENTION'].
(EngineCore pid=963700) INFO 03-19 14:01:56 [flash_attn.py:598] Using FlashAttention version 2
(EngineCore pid=963700) <frozen importlib._bootstrap_external>:1241: FutureWarning: The cuda.cudart module is deprecated and will be removed in a future release, please switch to use the cuda.bindings.runtime module instead.
(EngineCore pid=963700) <frozen importlib._bootstrap_external>:1241: FutureWarning: The cuda.nvrtc module is deprecated and will be removed in a future release, please switch to use the cuda.bindings.nvrtc module instead.
Loading safetensors checkpoint shards: 0% Completed | 0/5 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 20% Completed | 1/5 [00:00<00:01, 2.20it/s]
Loading safetensors checkpoint shards: 40% Completed | 2/5 [00:01<00:01, 1.89it/s]
Loading safetensors checkpoint shards: 60% Completed | 3/5 [00:01<00:01, 1.84it/s]
Loading safetensors checkpoint shards: 80% Completed | 4/5 [00:02<00:00, 1.93it/s]
Loading safetensors checkpoint shards: 100% Completed | 5/5 [00:02<00:00, 2.41it/s]
Loading safetensors checkpoint shards: 100% Completed | 5/5 [00:02<00:00, 2.17it/s]
(EngineCore pid=963700)
(EngineCore pid=963700) INFO 03-19 14:01:59 [default_loader.py:373] Loading weights took 2.34 seconds
(EngineCore pid=963700) INFO 03-19 14:01:59 [gpu_model_runner.py:4530] Loading drafter model...
(EngineCore pid=963700) WARNING 03-19 14:01:59 [vllm.py:1739] `torch.compile` is turned on, but the model Qwen/Qwen3-8B does not support it. Please open an issue on GitHub if you want it to be supported.
(EngineCore pid=963700) INFO 03-19 14:01:59 [default_loader.py:373] Loading weights took 9125665.46 seconds
(EngineCore pid=963700) INFO 03-19 14:01:59 [gpu_model_runner.py:4568] Using auxiliary layers from speculative config: (2, 18, 33, 36)
(EngineCore pid=963700) INFO 03-19 14:01:59 [gpu_model_runner.py:4591] Model loading took 15.27 GiB memory and 3.270017 seconds
(EngineCore pid=963700) INFO 03-19 14:02:03 [backends.py:988] Using cache directory: /home/fynnsu/.cache/vllm/torch_compile_cache/8ba5af3c3d/rank_0_0/backbone for vLLM's torch.compile
(EngineCore pid=963700) INFO 03-19 14:02:03 [backends.py:1048] Dynamo bytecode transform time: 2.97 s
(EngineCore pid=963700) INFO 03-19 14:02:04 [backends.py:284] Directly load the compiled graph(s) for compile range (1, 2048) from the cache, took 1.245 s
(EngineCore pid=963700) INFO 03-19 14:02:04 [monitor.py:48] torch.compile took 4.51 s in total
(EngineCore pid=963700) INFO 03-19 14:02:04 [decorators.py:296] Directly load AOT compilation from path /home/fynnsu/.cache/vllm/torch_compile_cache/torch_aot_compile/187bf04caf7a27d4a09033238166cc6876bdedd5e08b2cc352a756c85eefdb8e/rank_0_0/model
(EngineCore pid=963700) INFO 03-19 14:02:04 [monitor.py:76] Initial profiling/warmup run took 0.12 s
(EngineCore pid=963700) INFO 03-19 14:02:11 [kv_cache_utils.py:826] Overriding num_gpu_blocks=0 with num_gpu_blocks_override=512
(EngineCore pid=963700) WARNING 03-19 14:02:11 [gpu_model_runner.py:6064] CUDAGraphMode.FULL_AND_PIECEWISE is not supported with CacheOnlyAttentionBackend backend (support: AttentionCGSupport.NEVER); setting cudagraph_mode=PIECEWISE because attention is compiled piecewise
(EngineCore pid=963700) INFO 03-19 14:02:11 [gpu_model_runner.py:5632] Profiling CUDA graph memory: PIECEWISE=51 (largest=512)
(EngineCore pid=963700) INFO 03-19 14:02:13 [gpu_model_runner.py:5711] Estimated CUDA graph memory: 0.40 GiB total
(EngineCore pid=963700) INFO 03-19 14:02:13 [gpu_worker.py:452] Available KV cache memory: 54.75 GiB
(EngineCore pid=963700) INFO 03-19 14:02:13 [gpu_worker.py:486] In v0.19, CUDA graph memory profiling will be enabled by default (VLLM_MEMORY_PROFILER_ESTIMATE_CUDAGRAPHS=1), which more accurately accounts for CUDA graph memory during KV cache allocation. To try it now, set VLLM_MEMORY_PROFILER_ESTIMATE_CUDAGRAPHS=1 and increase --gpu-memory-utilization from 0.9000 to 0.9051 to maintain the same effective KV cache size.
(EngineCore pid=963700) INFO 03-19 14:02:13 [kv_cache_utils.py:1316] GPU KV cache size: 326,192 tokens
(EngineCore pid=963700) INFO 03-19 14:02:13 [kv_cache_utils.py:1321] Maximum concurrency for 40,960 tokens per request: 7.96x
(EngineCore pid=963700) INFO 03-19 14:02:13 [factory.py:64] Creating v1 connector with name: ExampleHiddenStatesConnector and engine_id: 67caa35d-4470-4e54-aad2-6f2bc7ae60cd
(EngineCore pid=963700) WARNING 03-19 14:02:13 [base.py:189] Initializing KVConnectorBase_V1. This API is experimental and subject to change in the future as we iterate the design.
(EngineCore pid=963700) INFO 03-19 14:02:13 [example_hidden_states_connector.py:156] KVTransferConfig(kv_connector='ExampleHiddenStatesConnector', engine_id='67caa35d-4470-4e54-aad2-6f2bc7ae60cd', kv_buffer_device='cuda', kv_buffer_size=1000000000.0, kv_role='kv_producer', kv_rank=None, kv_parallel_size=1, kv_ip='127.0.0.1', kv_port=14579, kv_connector_extra_config={'shared_storage_path': '/tmp/hidden_states'}, kv_connector_module_path=None, enable_permute_local_kv=False, kv_load_failure_policy='fail')
(EngineCore pid=963700) INFO 03-19 14:02:13 [example_hidden_states_connector.py:157] Shared storage path is /tmp/hidden_states
Capturing CUDA graphs (mixed prefill-decode, PIECEWISE): 100%|████████████| 51/51 [00:02<00:00, 23.99it/s]
(EngineCore pid=963700) INFO 03-19 14:02:16 [gpu_model_runner.py:5771] Graph capturing finished in 3 secs, took 0.43 GiB
(EngineCore pid=963700) INFO 03-19 14:02:16 [gpu_worker.py:614] CUDA graph pool memory: 0.43 GiB (actual), 0.4 GiB (estimated), difference: 0.03 GiB (6.0%).
(EngineCore pid=963700) INFO 03-19 14:02:16 [core.py:281] init engine (profile, create kv cache, warmup model) took 16.18 seconds
(EngineCore pid=963700) INFO 03-19 14:02:16 [factory.py:64] Creating v1 connector with name: ExampleHiddenStatesConnector and engine_id: 67caa35d-4470-4e54-aad2-6f2bc7ae60cd
(EngineCore pid=963700) WARNING 03-19 14:02:16 [base.py:189] Initializing KVConnectorBase_V1. This API is experimental and subject to change in the future as we iterate the design.
(EngineCore pid=963700) INFO 03-19 14:02:16 [example_hidden_states_connector.py:156] KVTransferConfig(kv_connector='ExampleHiddenStatesConnector', engine_id='67caa35d-4470-4e54-aad2-6f2bc7ae60cd', kv_buffer_device='cuda', kv_buffer_size=1000000000.0, kv_role='kv_producer', kv_rank=None, kv_parallel_size=1, kv_ip='127.0.0.1', kv_port=14579, kv_connector_extra_config={'shared_storage_path': '/tmp/hidden_states'}, kv_connector_module_path=None, enable_permute_local_kv=False, kv_load_failure_policy='fail')
(EngineCore pid=963700) INFO 03-19 14:02:16 [example_hidden_states_connector.py:157] Shared storage path is /tmp/hidden_states
(EngineCore pid=963700) INFO 03-19 14:02:17 [vllm.py:754] Asynchronous scheduling is enabled.
(APIServer pid=959548) INFO 03-19 14:02:17 [api_server.py:595] Supported tasks: ['generate']
(APIServer pid=959548) WARNING 03-19 14:02:17 [model.py:1376] Default vLLM sampling parameters have been overridden by the model's `generation_config.json`: `{'temperature': 0.6, 'top_k': 20, 'top_p': 0.95}`. If this is not intended, please relaunch vLLM instance with `--generation-config vllm`.
(APIServer pid=959548) INFO 03-19 14:02:18 [hf.py:320] Detected the chat template content format to be 'string'. You can set `--chat-template-content-format` to override this.
(APIServer pid=959548) INFO 03-19 14:02:18 [api_server.py:599] Starting vLLM server on http://0.0.0.0:8000
(APIServer pid=959548) INFO 03-19 14:02:18 [launcher.py:37] Available routes are:
(APIServer pid=959548) INFO 03-19 14:02:18 [launcher.py:46] Route: /openapi.json, Methods: GET, HEAD
(APIServer pid=959548) INFO 03-19 14:02:18 [launcher.py:46] Route: /docs, Methods: GET, HEAD
(APIServer pid=959548) INFO 03-19 14:02:18 [launcher.py:46] Route: /docs/oauth2-redirect, Methods: GET, HEAD
(APIServer pid=959548) INFO 03-19 14:02:18 [launcher.py:46] Route: /redoc, Methods: GET, HEAD
(APIServer pid=959548) INFO 03-19 14:02:18 [launcher.py:46] Route: /tokenize, Methods: POST
(APIServer pid=959548) INFO 03-19 14:02:18 [launcher.py:46] Route: /detokenize, Methods: POST
(APIServer pid=959548) INFO 03-19 14:02:18 [launcher.py:46] Route: /load, Methods: GET
(APIServer pid=959548) INFO 03-19 14:02:18 [launcher.py:46] Route: /version, Methods: GET
(APIServer pid=959548) INFO 03-19 14:02:18 [launcher.py:46] Route: /health, Methods: GET
(APIServer pid=959548) INFO 03-19 14:02:18 [launcher.py:46] Route: /metrics, Methods: GET
(APIServer pid=959548) INFO 03-19 14:02:18 [launcher.py:46] Route: /v1/models, Methods: GET
(APIServer pid=959548) INFO 03-19 14:02:18 [launcher.py:46] Route: /ping, Methods: GET
(APIServer pid=959548) INFO 03-19 14:02:18 [launcher.py:46] Route: /ping, Methods: POST
(APIServer pid=959548) INFO 03-19 14:02:18 [launcher.py:46] Route: /invocations, Methods: POST
(APIServer pid=959548) INFO 03-19 14:02:18 [launcher.py:46] Route: /v1/chat/completions, Methods: POST
(APIServer pid=959548) INFO 03-19 14:02:18 [launcher.py:46] Route: /v1/responses, Methods: POST
(APIServer pid=959548) INFO 03-19 14:02:18 [launcher.py:46] Route: /v1/responses/{response_id}, Methods: GET
(APIServer pid=959548) INFO 03-19 14:02:18 [launcher.py:46] Route: /v1/responses/{response_id}/cancel, Methods: POST
(APIServer pid=959548) INFO 03-19 14:02:18 [launcher.py:46] Route: /v1/completions, Methods: POST
(APIServer pid=959548) INFO 03-19 14:02:18 [launcher.py:46] Route: /v1/messages, Methods: POST
(APIServer pid=959548) INFO 03-19 14:02:18 [launcher.py:46] Route: /v1/messages/count_tokens, Methods: POST
(APIServer pid=959548) INFO 03-19 14:02:18 [launcher.py:46] Route: /inference/v1/generate, Methods: POST
(APIServer pid=959548) INFO 03-19 14:02:18 [launcher.py:46] Route: /scale_elastic_ep, Methods: POST
(APIServer pid=959548) INFO 03-19 14:02:18 [launcher.py:46] Route: /is_scaling_elastic_ep, Methods: POST
(APIServer pid=959548) INFO 03-19 14:02:18 [launcher.py:46] Route: /v1/chat/completions/render, Methods: POST
(APIServer pid=959548) INFO 03-19 14:02:18 [launcher.py:46] Route: /v1/completions/render, Methods: POST
(APIServer pid=959548) INFO: Started server process [959548]
(APIServer pid=959548) INFO: Waiting for application startup.
(APIServer pid=959548) INFO: Application startup complete.
(APIServer pid=959548) INFO: 127.0.0.1:41278 - "GET /v1/models HTTP/1.1" 200 OK
(EngineCore pid=963700) ERROR 03-19 14:03:08 [dump_input.py:72] Dumping input data for V1 LLM engine (v0.17.2rc1.dev30+gbe5de50a3) with config: model='Qwen/Qwen3-8B', speculative_config=SpeculativeConfig(method='extract_hidden_states', model=None, num_spec_tokens=1), tokenizer='Qwen/Qwen3-8B', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=40960, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, data_parallel_size=1, decode_context_parallel_size=1, dcp_comm_backend=ag_rs, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, enable_return_routed_experts=False, kv_cache_dtype=auto, device_config=cuda, structured_outputs_config=StructuredOutputsConfig(backend='auto', disable_any_whitespace=False, disable_additional_properties=False, reasoning_parser='', reasoning_parser_plugin='', enable_in_reasoning=False), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None, kv_cache_metrics=False, kv_cache_metrics_sample=0.01, cudagraph_metrics=False, enable_layerwise_nvtx_tracing=False, enable_mfu_metrics=False, enable_mm_processor_stats=False, enable_logging_iteration_details=False), seed=0, served_model_name=Qwen/Qwen3-8B, enable_prefix_caching=True, enable_chunked_prefill=True, pooler_config=None, compilation_config={'mode': <CompilationMode.VLLM_COMPILE: 3>, 'debug_dump_path': None, 'cache_dir': '/home/fynnsu/.cache/vllm/torch_compile_cache/8ba5af3c3d', 'compile_cache_save_format': 'binary', 'backend': 'inductor', 'custom_ops': ['none'], 'splitting_ops': ['vllm::unified_attention', 'vllm::unified_attention_with_output', 'vllm::unified_mla_attention', 'vllm::unified_mla_attention_with_output', 'vllm::mamba_mixer2', 'vllm::mamba_mixer', 'vllm::short_conv', 'vllm::linear_attention', 'vllm::plamo2_mamba_mixer', 'vllm::gdn_attention_core', 
'vllm::olmo_hybrid_gdn_full_forward', 'vllm::kda_attention', 'vllm::sparse_attn_indexer', 'vllm::rocm_aiter_sparse_attn_indexer', 'vllm::unified_kv_cache_update', 'vllm::unified_mla_kv_cache_update'], 'compile_mm_encoder': False, 'compile_sizes': [], 'compile_ranges_endpoints': [2048], 'inductor_compile_config': {'enable_auto_functionalized_v2': False, 'combo_kernels': True, 'benchmark_combo_kernel': True}, 'inductor_passes': {}, 'cudagraph_mode': <CUDAGraphMode.PIECEWISE: 1>, 'cudagraph_num_of_warmups': 1, 'cudagraph_capture_sizes': [1, 2, 4, 8, 16, 24, 32, 40, 48, 56, 64, 72, 80, 88, 96, 104, 112, 120, 128, 136, 144, 152, 160, 168, 176, 184, 192, 200, 208, 216, 224, 232, 240, 248, 256, 272, 288, 304, 320, 336, 352, 368, 384, 400, 416, 432, 448, 464, 480, 496, 512], 'cudagraph_copy_inputs': False, 'cudagraph_specialize_lora': True, 'use_inductor_graph_partition': False, 'pass_config': {'fuse_norm_quant': False, 'fuse_act_quant': False, 'fuse_attn_quant': False, 'enable_sp': False, 'fuse_gemm_comms': False, 'fuse_allreduce_rms': False}, 'max_cudagraph_capture_size': 512, 'dynamic_shapes_config': {'type': <DynamicShapesType.BACKED: 'backed'>, 'evaluate_guards': False, 'assume_32_bit_indexing': False}, 'local_cache_dir': '/home/fynnsu/.cache/vllm/torch_compile_cache/8ba5af3c3d/rank_0_0/backbone', 'fast_moe_cold_start': False, 'static_all_moe_layers': []},
(EngineCore pid=963700) ERROR 03-19 14:03:08 [dump_input.py:79] Dumping scheduler output for model execution: SchedulerOutput(scheduled_new_reqs=[], scheduled_cached_reqs=CachedRequestData(req_ids=['cmpl-9ed56d3d60a3f0d1-0-9a661b80'],resumed_req_ids=set(),new_token_ids_lens=[],all_token_ids_lens={},new_block_ids=[([129, 130, 131, 132, 133, 134, 135, 136, 137, 138, 139, 140, 141, 142, 143, 144, 145, 146, 147],)],num_computed_tokens=[2048],num_output_tokens=[0]), num_scheduled_tokens={cmpl-9ed56d3d60a3f0d1-0-9a661b80: 296}, total_num_scheduled_tokens=296, scheduled_spec_decode_tokens={}, scheduled_encoder_inputs={}, num_common_prefix_blocks=[147], finished_req_ids=[], free_encoder_mm_hashes=[], preempted_req_ids=[], has_structured_output_requests=false, pending_structured_output_tokens=false, num_invalid_spec_tokens=null, kv_connector_metadata=ExampleHiddenStatesConnectorMetadata(requests=[ReqMeta(req_id='cmpl-9ed56d3d60a3f0d1-0-9a661b80', filename='/tmp/hidden_states/cmpl-9ed56d3d60a3f0d1-0-9a661b80.safetensors', token_ids=Tensor(shape=torch.Size([2344]), device=cpu,dtype=torch.int64), slot_mapping=Tensor(shape=torch.Size([2352]), device=cpu,dtype=torch.int64), new_req=false)]), ec_connector_metadata=null, new_block_ids_to_zero=null)
(EngineCore pid=963700) ERROR 03-19 14:03:08 [dump_input.py:81] Dumping scheduler stats: SchedulerStats(num_running_reqs=1, num_waiting_reqs=0, step_counter=0, current_wave=0, kv_cache_usage=0.007210830962425141, encoder_cache_usage=0.0, prefix_cache_stats=PrefixCacheStats(reset=False, requests=0, queries=0, hits=0, preempted_requests=0, preempted_queries=0, preempted_hits=0), connector_prefix_cache_stats=PrefixCacheStats(reset=False, requests=0, queries=0, hits=0, preempted_requests=0, preempted_queries=0, preempted_hits=0), kv_cache_eviction_events=[], spec_decoding_stats=None, kv_connector_stats=None, waiting_lora_adapters={}, running_lora_adapters={}, cudagraph_stats=None, perf_stats=None)
(EngineCore pid=963700) ERROR 03-19 14:03:08 [core.py:1101] EngineCore encountered a fatal error.
(EngineCore pid=963700) ERROR 03-19 14:03:08 [core.py:1101] Traceback (most recent call last):
(EngineCore pid=963700) ERROR 03-19 14:03:08 [core.py:1101] File "/home/fynnsu/vllm-project/vllm/vllm/v1/engine/core.py", line 1092, in run_engine_core
(EngineCore pid=963700) ERROR 03-19 14:03:08 [core.py:1101] engine_core.run_busy_loop()
(EngineCore pid=963700) ERROR 03-19 14:03:08 [core.py:1101] File "/home/fynnsu/vllm-project/vllm/vllm/v1/engine/core.py", line 1133, in run_busy_loop
(EngineCore pid=963700) ERROR 03-19 14:03:08 [core.py:1101] self._process_engine_step()
(EngineCore pid=963700) ERROR 03-19 14:03:08 [core.py:1101] File "/home/fynnsu/vllm-project/vllm/vllm/v1/engine/core.py", line 1172, in _process_engine_step
(EngineCore pid=963700) ERROR 03-19 14:03:08 [core.py:1101] outputs, model_executed = self.step_fn()
(EngineCore pid=963700) ERROR 03-19 14:03:08 [core.py:1101] ^^^^^^^^^^^^^^
(EngineCore pid=963700) ERROR 03-19 14:03:08 [core.py:1101] File "/home/fynnsu/vllm-project/vllm/vllm/v1/engine/core.py", line 497, in step_with_batch_queue
(EngineCore pid=963700) ERROR 03-19 14:03:08 [core.py:1101] model_output = future.result()
(EngineCore pid=963700) ERROR 03-19 14:03:08 [core.py:1101] ^^^^^^^^^^^^^^^
(EngineCore pid=963700) ERROR 03-19 14:03:08 [core.py:1101] File "/home/fynnsu/.local/share/uv/python/cpython-3.11.14-linux-x86_64-gnu/lib/python3.11/concurrent/futures/_base.py", line 449, in result
(EngineCore pid=963700) ERROR 03-19 14:03:08 [core.py:1101] return self.__get_result()
(EngineCore pid=963700) ERROR 03-19 14:03:08 [core.py:1101] ^^^^^^^^^^^^^^^^^^^
(EngineCore pid=963700) ERROR 03-19 14:03:08 [core.py:1101] File "/home/fynnsu/.local/share/uv/python/cpython-3.11.14-linux-x86_64-gnu/lib/python3.11/concurrent/futures/_base.py", line 401, in __get_result
(EngineCore pid=963700) ERROR 03-19 14:03:08 [core.py:1101] raise self._exception
(EngineCore pid=963700) ERROR 03-19 14:03:08 [core.py:1101] File "/home/fynnsu/vllm-project/vllm/vllm/v1/executor/uniproc_executor.py", line 82, in collective_rpc
(EngineCore pid=963700) ERROR 03-19 14:03:08 [core.py:1101] result = run_method(self.driver_worker, method, args, kwargs)
(EngineCore pid=963700) ERROR 03-19 14:03:08 [core.py:1101] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=963700) ERROR 03-19 14:03:08 [core.py:1101] File "/home/fynnsu/vllm-project/vllm/vllm/v1/serial_utils.py", line 459, in run_method
(EngineCore pid=963700) ERROR 03-19 14:03:08 [core.py:1101] return func(*args, **kwargs)
(EngineCore pid=963700) ERROR 03-19 14:03:08 [core.py:1101] ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=963700) ERROR 03-19 14:03:08 [core.py:1101] File "/home/fynnsu/vllm-project/vllm/.venv/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 124, in decorate_context
(EngineCore pid=963700) ERROR 03-19 14:03:08 [core.py:1101] return func(*args, **kwargs)
(EngineCore pid=963700) ERROR 03-19 14:03:08 [core.py:1101] ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=963700) ERROR 03-19 14:03:08 [core.py:1101] File "/home/fynnsu/vllm-project/vllm/vllm/v1/worker/gpu_worker.py", line 756, in sample_tokens
(EngineCore pid=963700) ERROR 03-19 14:03:08 [core.py:1101] return self.model_runner.sample_tokens(grammar_output)
(EngineCore pid=963700) ERROR 03-19 14:03:08 [core.py:1101] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=963700) ERROR 03-19 14:03:08 [core.py:1101] File "/home/fynnsu/vllm-project/vllm/.venv/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 124, in decorate_context
(EngineCore pid=963700) ERROR 03-19 14:03:08 [core.py:1101] return func(*args, **kwargs)
(EngineCore pid=963700) ERROR 03-19 14:03:08 [core.py:1101] ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=963700) ERROR 03-19 14:03:08 [core.py:1101] File "/home/fynnsu/vllm-project/vllm/vllm/v1/worker/gpu_model_runner.py", line 4072, in sample_tokens
(EngineCore pid=963700) ERROR 03-19 14:03:08 [core.py:1101] self.finalize_kv_connector()
(EngineCore pid=963700) ERROR 03-19 14:03:08 [core.py:1101] File "/home/fynnsu/vllm-project/vllm/vllm/v1/worker/kv_connector_model_runner_mixin.py", line 88, in finalize_kv_connector
(EngineCore pid=963700) ERROR 03-19 14:03:08 [core.py:1101] kv_connector.wait_for_save()
(EngineCore pid=963700) ERROR 03-19 14:03:08 [core.py:1101] File "/home/fynnsu/vllm-project/vllm/vllm/distributed/kv_transfer/kv_connector/v1/example_hidden_states_connector.py", line 219, in wait_for_save
(EngineCore pid=963700) ERROR 03-19 14:03:08 [core.py:1101] assert prior is None, "Found another KV transfer request with same req_id!"
(EngineCore pid=963700) ERROR 03-19 14:03:08 [core.py:1101] ^^^^^^^^^^^^^
(EngineCore pid=963700) ERROR 03-19 14:03:08 [core.py:1101] AssertionError: Found another KV transfer request with same req_id!
(EngineCore pid=963700) Process EngineCore:
(EngineCore pid=963700) Traceback (most recent call last):
(APIServer pid=959548) ERROR 03-19 14:03:08 [async_llm.py:707] AsyncLLM output_handler failed.
(APIServer pid=959548) ERROR 03-19 14:03:08 [async_llm.py:707] Traceback (most recent call last):
(APIServer pid=959548) ERROR 03-19 14:03:08 [async_llm.py:707] File "/home/fynnsu/vllm-project/vllm/vllm/v1/engine/async_llm.py", line 663, in output_handler
(APIServer pid=959548) ERROR 03-19 14:03:08 [async_llm.py:707] outputs = await engine_core.get_output_async()
(APIServer pid=959548) ERROR 03-19 14:03:08 [async_llm.py:707] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=959548) ERROR 03-19 14:03:08 [async_llm.py:707] File "/home/fynnsu/vllm-project/vllm/vllm/v1/engine/core_client.py", line 1022, in get_output_async
(APIServer pid=959548) ERROR 03-19 14:03:08 [async_llm.py:707] raise self._format_exception(outputs) from None
(APIServer pid=959548) ERROR 03-19 14:03:08 [async_llm.py:707] vllm.v1.engine.exceptions.EngineDeadError: EngineCore encountered an issue. See stack trace (above) for the root cause.
(EngineCore pid=963700) File "/home/fynnsu/.local/share/uv/python/cpython-3.11.14-linux-x86_64-gnu/lib/python3.11/multiprocessing/process.py", line 314, in _bootstrap
(EngineCore pid=963700) self.run()
(EngineCore pid=963700) File "/home/fynnsu/.local/share/uv/python/cpython-3.11.14-linux-x86_64-gnu/lib/python3.11/multiprocessing/process.py", line 108, in run
(EngineCore pid=963700) self._target(*self._args, **self._kwargs)
(EngineCore pid=963700) File "/home/fynnsu/vllm-project/vllm/vllm/v1/engine/core.py", line 1103, in run_engine_core
(EngineCore pid=963700) raise e
(EngineCore pid=963700) File "/home/fynnsu/vllm-project/vllm/vllm/v1/engine/core.py", line 1092, in run_engine_core
(EngineCore pid=963700) engine_core.run_busy_loop()
(EngineCore pid=963700) File "/home/fynnsu/vllm-project/vllm/vllm/v1/engine/core.py", line 1133, in run_busy_loop
(EngineCore pid=963700) self._process_engine_step()
(EngineCore pid=963700) File "/home/fynnsu/vllm-project/vllm/vllm/v1/engine/core.py", line 1172, in _process_engine_step
(EngineCore pid=963700) outputs, model_executed = self.step_fn()
(EngineCore pid=963700) ^^^^^^^^^^^^^^
(EngineCore pid=963700) File "/home/fynnsu/vllm-project/vllm/vllm/v1/engine/core.py", line 497, in step_with_batch_queue
(EngineCore pid=963700) model_output = future.result()
(EngineCore pid=963700) ^^^^^^^^^^^^^^^
(EngineCore pid=963700) File "/home/fynnsu/.local/share/uv/python/cpython-3.11.14-linux-x86_64-gnu/lib/python3.11/concurrent/futures/_base.py", line 449, in result
(EngineCore pid=963700) return self.__get_result()
(EngineCore pid=963700) ^^^^^^^^^^^^^^^^^^^
(EngineCore pid=963700) File "/home/fynnsu/.local/share/uv/python/cpython-3.11.14-linux-x86_64-gnu/lib/python3.11/concurrent/futures/_base.py", line 401, in __get_result
(EngineCore pid=963700) raise self._exception
(EngineCore pid=963700) File "/home/fynnsu/vllm-project/vllm/vllm/v1/executor/uniproc_executor.py", line 82, in collective_rpc
(EngineCore pid=963700) result = run_method(self.driver_worker, method, args, kwargs)
(EngineCore pid=963700) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=963700) File "/home/fynnsu/vllm-project/vllm/vllm/v1/serial_utils.py", line 459, in run_method
(EngineCore pid=963700) return func(*args, **kwargs)
(EngineCore pid=963700) ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=963700) File "/home/fynnsu/vllm-project/vllm/.venv/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 124, in decorate_context
(EngineCore pid=963700) return func(*args, **kwargs)
(EngineCore pid=963700) ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=963700) File "/home/fynnsu/vllm-project/vllm/vllm/v1/worker/gpu_worker.py", line 756, in sample_tokens
(EngineCore pid=963700) return self.model_runner.sample_tokens(grammar_output)
(EngineCore pid=963700) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=963700) File "/home/fynnsu/vllm-project/vllm/.venv/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 124, in decorate_context
(EngineCore pid=963700) return func(*args, **kwargs)
(EngineCore pid=963700) ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=963700) File "/home/fynnsu/vllm-project/vllm/vllm/v1/worker/gpu_model_runner.py", line 4072, in sample_tokens
(EngineCore pid=963700) self.finalize_kv_connector()
(EngineCore pid=963700) File "/home/fynnsu/vllm-project/vllm/vllm/v1/worker/kv_connector_model_runner_mixin.py", line 88, in finalize_kv_connector
(EngineCore pid=963700) kv_connector.wait_for_save()
(EngineCore pid=963700) File "/home/fynnsu/vllm-project/vllm/vllm/distributed/kv_transfer/kv_connector/v1/example_hidden_states_connector.py", line 219, in wait_for_save
(EngineCore pid=963700) assert prior is None, "Found another KV transfer request with same req_id!"
(EngineCore pid=963700) ^^^^^^^^^^^^^
(EngineCore pid=963700) AssertionError: Found another KV transfer request with same req_id!
(APIServer pid=959548) INFO: 127.0.0.1:41278 - "POST /v1/completions HTTP/1.1" 500 Internal Server Error
(APIServer pid=959548) INFO: Shutting down
[rank0]:[W319 14:03:08.124439735 ProcessGroupNCCL.cpp:1553] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())
(APIServer pid=959548) INFO: Waiting for application shutdown.
(APIServer pid=959548) INFO: Application shutdown complete.
(APIServer pid=959548) INFO:     Finished server process [959548]
Must disable chunked prefill.
It's possible to just have it wait for the previous copy to finish and overwrite. But in the interest of performance, I think one write per request makes more sense, in which case chunked prefill must be disabled.
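The trade-off above can be sketched in miniature. This is a hypothetical mock of the one-write-per-request guard (the `PendingSaves` class and its method names are illustrative, not the connector's real API); only the assertion message matches the real code. With chunked prefill, the same `req_id` is scheduled across multiple steps, so a second save attempt trips the guard:

```python
# Hedged sketch: a registry that enforces exactly one in-flight save per
# request. A second save for the same req_id (e.g. a later prefill chunk)
# hits the assertion seen in the stacktrace above.
class PendingSaves:
    def __init__(self):
        self._in_flight: dict[str, bytes] = {}

    def submit(self, req_id: str, payload: bytes) -> None:
        prior = self._in_flight.get(req_id)
        assert prior is None, "Found another KV transfer request with same req_id!"
        self._in_flight[req_id] = payload

    def complete(self, req_id: str) -> None:
        # Called once the async copy + disk write have finished.
        self._in_flight.pop(req_id)


saves = PendingSaves()
saves.submit("cmpl-abc-0", b"chunk-0")  # first (and only expected) save: OK
try:
    saves.submit("cmpl-abc-0", b"chunk-1")  # second chunk of the same request
    duplicate_rejected = False
except AssertionError:
    duplicate_rejected = True
```

The alternative mentioned above (wait for the prior copy, then overwrite) would replace the assertion with a blocking wait on `prior`, at the cost of one write per prefill chunk instead of one per request.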
Yep, got this working with
This pull request has merge conflicts that must be resolved before it can be merged.
Purpose
Eliminate unnecessary synchronization and improve the overlap compatibility of hidden state extraction. Some inefficiency remains due to the lack of full overlap, which should be further improved by #32951.
Also adds a benchmark script for testing the performance of the hidden states extractor. In a particular case of interest for GPT-OSS, I get a 1.45x improvement over the baseline, and am able to achieve within 10% of baseline throughput in some cases (with a lot of parameter tuning).
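The overlap pattern the PR relies on (async DtoH copy on a side CUDA stream into pinned memory, then disk writes on a thread pool) can be emulated without a GPU. In this sketch, `copy_then_write` stands in for the real copy-plus-`safetensors`-write pipeline, and all names and paths are illustrative; the point is only that writes proceed in the background while the "model forward" continues, with a single blocking join at finalization (analogous to `wait_for_save`):

```python
# Hedged sketch: thread-pool pipelining of per-request saves, emulating the
# copy-stream + writer-pool structure without CUDA.
from concurrent.futures import ThreadPoolExecutor, wait
import time


def copy_then_write(hidden_states, path):
    staged = list(hidden_states)  # stands in for the DtoH copy into pinned memory
    time.sleep(0.01)              # stands in for the safetensors disk write
    return path, len(staged)


executor = ThreadPoolExecutor(max_workers=8, thread_name_prefix="vllm-hs-save")

# The forward pass keeps running while these proceed in the background;
# only the wait() below blocks, once per step rather than once per tensor.
futures = [
    executor.submit(copy_then_write, range(100 * (i + 1)), f"/tmp/req_{i}.safetensors")
    for i in range(4)
]
wait(futures)
results = dict(f.result() for f in futures)
```

In the real connector the copy must additionally be ordered against the forward pass, which is what the dedicated `torch.cuda.Stream` and CUDA events handle; the thread pool only ever touches the already-staged host buffers.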
Testing
None