[Build] Switch default CUDA to 13.0, update CUDA architecture lists, clean up stale build-args by Harry-Chen · Pull Request #39878 · vllm-project/vllm

Harry-Chen · 2026-04-15T07:34:16Z

Summary

Switch default CUDA version from 12.9 to 13.0. CUDA 12.9 becomes the cu129 variant; CUDA 13.0 wheels and Docker images are now the unsuffixed default (pip install vllm). This aligns with the torch 2.11 release.
Update CUDA architecture lists following PyTorch RELEASE.md, and define them as pipeline-level env vars to reduce repetition.
Remove stale FLASHINFER_AOT_COMPILE build-arg references left over after [CI]: remove unused FLASHINFER_AOT_COMPILE build argument #32627 removed the ARG from the Dockerfile.

We need to announce this change to users after it is merged.

Details

Default CUDA version swap (12.9 → 13.0)

Component	Before	After
`VLLM_MAIN_CUDA_VERSION`	`12.9`	`13.0`
Default wheel suffix	(none, implies cu129)	(none, implies cu130)
Variant wheel suffix	`+cu130`	`+cu129`
Default Docker tag	`vllm-openai:latest` = CUDA 12.9	`vllm-openai:latest` = CUDA 13.0
Variant Docker tag	`vllm-openai:latest-cu130`	`vllm-openai:latest-cu129`
Nightly index default	`cu129`	`cu130`

Files: vllm/envs.py, release-pipeline.yaml, annotate-release.sh, generate-and-upload-nightly-index.sh, check-ray-compatibility.sh
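
The naming rule in the table above can be sketched in a few lines. This is only an illustration of the default/variant convention, not the release pipeline's actual logic: `cuda_tag`, `wheel_local_suffix`, and `docker_tag` are hypothetical helpers, and only `VLLM_MAIN_CUDA_VERSION` and the `cuNNN` naming come from this PR.

```python
# Hypothetical helpers illustrating the default/variant naming; not vLLM code.
VLLM_MAIN_CUDA_VERSION = "13.0"  # new default from this PR (was "12.9")


def cuda_tag(cuda_version: str) -> str:
    """Turn a version such as '12.9' into the tag 'cu129'."""
    major, minor = cuda_version.split(".")
    return f"cu{major}{minor}"


def wheel_local_suffix(cuda_version: str, main: str = VLLM_MAIN_CUDA_VERSION) -> str:
    """Builds for the main CUDA version are unsuffixed; variants get '+cuNNN'."""
    return "" if cuda_version == main else f"+{cuda_tag(cuda_version)}"


def docker_tag(cuda_version: str, base: str = "latest",
               main: str = VLLM_MAIN_CUDA_VERSION) -> str:
    """Builds for the main CUDA version use the bare tag; variants get '-cuNNN'."""
    return base if cuda_version == main else f"{base}-{cuda_tag(cuda_version)}"
```

Passing `main="12.9"` reproduces the "Before" column, where CUDA 13.0 was the suffixed variant.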

CUDA architecture lists

New arch lists, following PyTorch RELEASE.md:

	CUDA 13.0 (default)	CUDA 12.9 (variant)
x86_64	`7.5 8.0 8.6 8.9 9.0 10.0 12.0+PTX`	`7.5 8.0 8.6 8.9 9.0 10.0 12.0`
aarch64	`8.0 8.7 8.9 9.0 10.0 11.0 12.0+PTX`	`8.0 8.7 8.9 9.0 10.0 12.0`

Volta (SM70) is removed since it is no longer supported by PyTorch 2.11 with newer CUDA. SM75 (Turing) remains in the x86_64 lists above.

Notable inclusions beyond upstream PyTorch defaults:

SM86: broader Ampere coverage (e.g. GA106/GA107: RTX 3060, RTX 3070)
SM89: required for marlin fp8 kernels (QMMA.16832.F32.E4M3.E4M3 SASS instruction, only supported on SM89 and SM120+)

CUDA 12.9 builds omit +PTX since they are not the forward-compatible default.
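
To see what these lists mean for a given GPU, here is a rough sketch of a coverage check. It assumes the usual CUDA compatibility rules (native code built for X.y runs on devices with the same major version and a minor version of at least y; PTX can be JIT-compiled on any newer device) and ignores special cases that real toolchains exclude. The helper names are made up for illustration and are not part of vLLM or PyTorch.

```python
def parse_arch_list(arch_list: str) -> list[tuple[int, int, bool]]:
    """Parse a TORCH_CUDA_ARCH_LIST-style string into (major, minor, has_ptx)."""
    entries = []
    for item in arch_list.split():
        has_ptx = item.endswith("+PTX")
        major, minor = item.removesuffix("+PTX").split(".")
        entries.append((int(major), int(minor), has_ptx))
    return entries


def is_covered(device_cc: tuple[int, int], arch_list: str) -> bool:
    """Simplified check: can a device with this compute capability run the build?"""
    for major, minor, has_ptx in parse_arch_list(arch_list):
        # Native code: same major version, device minor at least the built minor.
        if device_cc[0] == major and device_cc[1] >= minor:
            return True
        # PTX: forward compatible with any device at or above the built version.
        if has_ptx and device_cc >= (major, minor):
            return True
    return False
```

Under these assumptions, a compute capability 7.0 device is not covered by either x86_64 list, and a hypothetical future 13.0 device would be covered only by the list ending in `12.0+PTX`.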

Arch lists are defined as top-level env: variables in release-pipeline.yaml (CUDA_ARCH_X86, CUDA_ARCH_AARCH64, CUDA_ARCH_X86_CU129, CUDA_ARCH_AARCH64_CU129) and referenced via ${...} in all build commands, eliminating 12 hardcoded duplicates.
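
The idea of defining each list once and expanding `${...}` references can be mimicked in Python. This is a sketch only: the four variable names and values come from this PR, while `arch_env_name`, `render`, and the example command string are assumptions made for illustration and do not reproduce the real pipeline commands.

```python
from string import Template

# Values copied from the arch-list table in this PR.
PIPELINE_ENV = {
    "CUDA_ARCH_X86": "7.5 8.0 8.6 8.9 9.0 10.0 12.0+PTX",
    "CUDA_ARCH_AARCH64": "8.0 8.7 8.9 9.0 10.0 11.0 12.0+PTX",
    "CUDA_ARCH_X86_CU129": "7.5 8.0 8.6 8.9 9.0 10.0 12.0",
    "CUDA_ARCH_AARCH64_CU129": "8.0 8.7 8.9 9.0 10.0 12.0",
}


def arch_env_name(machine: str, cuda_version: str, main: str = "13.0") -> str:
    """Hypothetical helper: pick the variable name for a platform and CUDA version."""
    arch = {"x86_64": "X86", "aarch64": "AARCH64"}[machine]
    suffix = "" if cuda_version == main else "_CU" + cuda_version.replace(".", "")
    return f"CUDA_ARCH_{arch}{suffix}"


def render(command_template: str, env: dict[str, str] = PIPELINE_ENV) -> str:
    """Expand ${NAME} references the way a shell or CI runner would."""
    return Template(command_template).substitute(env)
```

For example, `render("--build-arg torch_cuda_arch_list='${CUDA_ARCH_X86_CU129}'")` expands the reference in place, so changing a list means editing one dictionary entry rather than every command.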

Files: release-pipeline.yaml, docker/Dockerfile, docker/versions.json, image_build_torch_nightly.sh, .github/workflows/scripts/build.sh

FLASHINFER_AOT_COMPILE cleanup

The FLASHINFER_AOT_COMPILE Dockerfile ARG was removed in #32627, but references to it remained in the release pipeline and docker-bake.hcl. This PR removes them.

Test Plan

Will be tested by CI and release pipeline.

Test Result

CI pipeline: https://buildkite.com/vllm/ci/builds/61430/. This PR does not introduce new regressions.
Release pipeline: https://buildkite.com/vllm/release-v2/builds/665/. All passed.

Make CUDA 13.0 the default build variant and CUDA 12.9 the optional variant. This affects:
- VLLM_MAIN_CUDA_VERSION: 12.9 -> 13.0 (envs.py)
- Default wheel variant alias: cu129 -> cu130 (nightly index)
- Release pipeline: CUDA 13.0 images/wheels are now the default (no suffix), CUDA 12.9 becomes the cu129 variant
- Release annotations: updated wheel filenames and Docker tags
- Ray compatibility: PyTorch index URL updated to cu130
Signed-off-by: Shengqi Chen <harry-chen@outlook.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Update CUDA architecture lists following PyTorch RELEASE.md (https://github.com/pytorch/pytorch/blob/main/RELEASE.md).
Default (CUDA 13.0):
x86_64: 7.5 8.0 8.6 8.9 9.0 10.0 12.0+PTX
aarch64: 8.0 8.7 8.9 9.0 10.0 11.0 12.0+PTX
CUDA 12.9 variant: same arches without +PTX forward compat.
Notable arch inclusions beyond PyTorch defaults:
- SM86 for broader Ampere coverage (e.g. RTX 3060/3070)
- SM89 for marlin fp8 support (QMMA SASS instruction)
Define CUDA_ARCH_{X86,AARCH64}{,_CU129} env vars in the release pipeline to reduce repetition across wheel and Docker image builds. Also update Dockerfile default and CI build scripts to match.
Signed-off-by: Shengqi Chen <harry-chen@outlook.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

The FLASHINFER_AOT_COMPILE ARG was removed from the Dockerfile in PR vllm-project#32627, but stale references remained in the release pipeline and docker-bake.hcl. Clean them up.
Signed-off-by: Shengqi Chen <harry-chen@outlook.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Harry-Chen · 2026-04-15T07:35:49Z

@claude review

gemini-code-assist

Code Review

This pull request transitions the build and release pipelines to CUDA 13.0 as the primary version, updating environment variables, Docker build arguments, and script defaults accordingly. Feedback indicates that the updated CUDA architecture lists omit support for SM70 (Volta/V100) and SM121 (DGX Spark), which may cause regressions for users on that hardware. Furthermore, a missing manylinux version argument was noted in the nightly wheel upload script for CUDA 12.9 builds.

Copilot

Pull request overview

Updates vLLM’s build/release configuration to make CUDA 13.0 the default (CUDA 12.9 becomes the cu129 variant), refreshes CUDA arch lists, and removes stale build-arg usage related to FlashInfer.

Changes:

Switch default CUDA version from 12.9 → 13.0 across env/config and Buildkite scripts.
Update TORCH_CUDA_ARCH_LIST / torch_cuda_arch_list defaults and pipeline arch lists (including +PTX for the default variant).
Remove stale FLASHINFER_AOT_COMPILE build-arg references from Docker bake targets and Buildkite build commands.

Reviewed changes

Copilot reviewed 10 out of 10 changed files in this pull request and generated 1 comment.

File	Description
`vllm/envs.py`	Bumps `VLLM_MAIN_CUDA_VERSION` default to `13.0`.
`docker/versions.json`	Updates default `TORCH_CUDA_ARCH_LIST` to the new arch set and `+PTX`.
`docker/docker-bake.hcl`	Removes `FLASHINFER_AOT_COMPILE` build arg from targets.
`docker/Dockerfile`	Updates `torch_cuda_arch_list` default used to set `TORCH_CUDA_ARCH_LIST` in build stages.
`.github/workflows/scripts/build.sh`	Refreshes wheel build `TORCH_CUDA_ARCH_LIST` used in GHA builds.
`.buildkite/scripts/generate-and-upload-nightly-index.sh`	Changes default nightly index alias from `cu129` → `cu130`.
`.buildkite/scripts/check-ray-compatibility.sh`	Updates PyTorch CUDA index URL default to `cu130`.
`.buildkite/scripts/annotate-release.sh`	Adjusts annotated wheel/image commands to reflect new default/variant CUDA mapping.
`.buildkite/release-pipeline.yaml`	Reworks release pipeline to default to CUDA 13.0; introduces top-level arch env vars; updates tags/manifests/variant naming.
`.buildkite/image_build/image_build_torch_nightly.sh`	Updates nightly build `torch_cuda_arch_list` used for the torch-nightly test image.

The upload-nightly-wheels.sh script defaults to manylinux_2_31, so the aarch64 CUDA 12.9 wheel was already getting the right tag. But x86_64 CUDA 12.9 had the arg explicitly while aarch64 did not, so make it explicit for consistency.
Signed-off-by: Shengqi Chen <harry-chen@outlook.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Signed-off-by: Shengqi Chen <harry-chen@outlook.com>

ehfd · 2026-04-18T16:01:12Z

Are we completely abandoning Volta, even for CUDA 12?

(Worker pid=369) ERROR 04-18 15:45:48 [multiproc_executor.py:870] torch.AcceleratorError: CUDA error: no kernel image is available for execution on the device
(Worker pid=369) ERROR 04-18 15:45:48 [multiproc_executor.py:870] Search for `cudaErrorNoKernelImageForDevice' in https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__TYPES.html for more information.

Details

(APIServer pid=1) INFO 04-18 15:44:41 [utils.py:299] 
(APIServer pid=1) INFO 04-18 15:44:41 [utils.py:299]        █     █     █▄   ▄█
(APIServer pid=1) INFO 04-18 15:44:41 [utils.py:299]  ▄▄ ▄█ █     █     █ ▀▄▀ █  version 0.19.1rc1.dev391+g80b18230e
(APIServer pid=1) INFO 04-18 15:44:41 [utils.py:299]   █▄█▀ █     █     █     █  model   Qwen/Qwen3-VL-Embedding-8B
(APIServer pid=1) INFO 04-18 15:44:41 [utils.py:299]    ▀▀  ▀▀▀▀▀ ▀▀▀▀▀ ▀     ▀
(APIServer pid=1) INFO 04-18 15:44:41 [utils.py:299] 
(APIServer pid=1) INFO 04-18 15:44:41 [utils.py:233] non-default args: {'api_server_count': 4, 'host': '0.0.0.0', 'port': 5000, 'model': 'Qwen/Qwen3-VL-Embedding-8B', 'runner': 'pooling', 'convert': 'embed', 'trust_remote_code': True, 'max_model_len': 262144, 'download_dir': '/workspace/.cache/huggingface/hub', 'tensor_parallel_size': 2, 'gpu_memory_utilization': 0.95, 'mm_processor_cache_gb': 8.0, 'mm_processor_cache_type': 'shm'}
(APIServer pid=1) INFO 04-18 15:44:43 [config.py:947] Found sentence-transformers tokenize configuration.
(APIServer pid=1) INFO 04-18 15:44:58 [model.py:554] Resolved architecture: Qwen3VLForConditionalGeneration
(APIServer pid=1) INFO 04-18 15:44:59 [config.py:835] Found sentence-transformers modules configuration.
(APIServer pid=1) WARNING 04-18 15:44:59 [model.py:1970] Your device 'Tesla V100-SXM2-32GB' (with compute capability 7.0) doesn't support torch.bfloat16. Falling back to torch.float16 for compatibility.
(APIServer pid=1) WARNING 04-18 15:44:59 [model.py:2023] Casting torch.bfloat16 to torch.float16.
(APIServer pid=1) INFO 04-18 15:44:59 [model.py:1685] Using max model len 262144
(APIServer pid=1) INFO 04-18 15:44:59 [vllm.py:834] Asynchronous scheduling is disabled.
(APIServer pid=1) INFO 04-18 15:44:59 [kernel.py:199] Final IR op priority after setting platform defaults: IrOpPriorityConfig(rms_norm=['native'])
(APIServer pid=1) WARNING 04-18 15:44:59 [vllm.py:1033] Pooling models do not support full cudagraphs. Overriding cudagraph_mode to PIECEWISE.
(APIServer pid=1) `Qwen2VLImageProcessorFast` is deprecated. The `Fast` suffix for image processors has been removed; use `Qwen2VLImageProcessor` instead.
(APIServer pid=1) The `use_fast` parameter is deprecated and will be removed in a future version. Use `backend="torchvision"` instead of `use_fast=True`, or `backend="pil"` instead of `use_fast=False`.
(EngineCore pid=279) INFO 04-18 15:45:26 [core.py:107] Initializing a V1 LLM engine (v0.19.1rc1.dev391+g80b18230e) with config: model='Qwen/Qwen3-VL-Embedding-8B', speculative_config=None, tokenizer='Qwen/Qwen3-VL-Embedding-8B', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.float16, max_seq_len=262144, download_dir='/workspace/.cache/huggingface/hub', load_format=auto, tensor_parallel_size=2, pipeline_parallel_size=1, data_parallel_size=1, decode_context_parallel_size=1, dcp_comm_backend=ag_rs, disable_custom_all_reduce=False, quantization=None, quantization_config=None, enforce_eager=False, enable_return_routed_experts=False, kv_cache_dtype=auto, device_config=cuda, structured_outputs_config=StructuredOutputsConfig(backend='auto', disable_any_whitespace=False, disable_additional_properties=False, reasoning_parser='', reasoning_parser_plugin='', enable_in_reasoning=False), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None, kv_cache_metrics=False, kv_cache_metrics_sample=0.01, cudagraph_metrics=False, enable_layerwise_nvtx_tracing=False, enable_mfu_metrics=False, enable_mm_processor_stats=False, enable_logging_iteration_details=False), seed=0, served_model_name=Qwen/Qwen3-VL-Embedding-8B, enable_prefix_caching=True, enable_chunked_prefill=True, pooler_config=PoolerConfig(task=None, pooling_type=None, seq_pooling_type='LAST', tok_pooling_type='ALL', use_activation=None, dimensions=None, enable_chunked_processing=False, max_embed_len=None, logit_mean=None, logit_sigma=None, logit_bias=None, logit_scale=None, step_tag_id=None, returned_token_ids=None), compilation_config={'mode': <CompilationMode.VLLM_COMPILE: 3>, 'debug_dump_path': None, 'cache_dir': '', 'compile_cache_save_format': 'binary', 'backend': 'inductor', 'custom_ops': ['none'], 'ir_enable_torch_wrap': True, 'splitting_ops': ['vllm::unified_attention_with_output', 'vllm::unified_mla_attention_with_output', 'vllm::mamba_mixer2', 'vllm::mamba_mixer', 'vllm::short_conv', 'vllm::linear_attention', 'vllm::plamo2_mamba_mixer', 'vllm::gdn_attention_core', 'vllm::olmo_hybrid_gdn_full_forward', 'vllm::kda_attention', 'vllm::sparse_attn_indexer', 'vllm::rocm_aiter_sparse_attn_indexer', 'vllm::unified_kv_cache_update', 'vllm::unified_mla_kv_cache_update'], 'compile_mm_encoder': False, 'cudagraph_mm_encoder': False, 'encoder_cudagraph_token_budgets': [], 'encoder_cudagraph_max_vision_items_per_batch': 0, 'encoder_cudagraph_max_frames_per_batch': 0, 'compile_sizes': [], 'compile_ranges_endpoints': [2048], 'inductor_compile_config': {'enable_auto_functionalized_v2': False, 'size_asserts': False, 'alignment_asserts': False, 'scalar_asserts': False, 'combo_kernels': True, 'benchmark_combo_kernel': True}, 'inductor_passes': {}, 'cudagraph_mode': <CUDAGraphMode.PIECEWISE: 1>, 'cudagraph_num_of_warmups': 1, 'cudagraph_capture_sizes': [1, 2, 4, 8, 16, 24, 32, 40, 48, 56, 64, 72, 80, 88, 96, 104, 112, 120, 128, 136, 144, 152, 160, 168, 176, 184, 192, 200, 208, 216, 224, 232, 240, 248, 256, 272, 288, 304, 320, 336, 352, 368, 384, 400, 416, 432, 448, 464, 480, 496, 512], 'cudagraph_copy_inputs': False, 'cudagraph_specialize_lora': True, 'use_inductor_graph_partition': False, 'pass_config': {'fuse_norm_quant': False, 'fuse_act_quant': False, 'fuse_attn_quant': False, 'enable_sp': False, 'fuse_gemm_comms': False, 'fuse_allreduce_rms': False}, 'max_cudagraph_capture_size': 512, 'dynamic_shapes_config': {'type': <DynamicShapesType.BACKED: 'backed'>, 'evaluate_guards': False, 'assume_32_bit_indexing': False}, 'local_cache_dir': None, 'fast_moe_cold_start': False, 'static_all_moe_layers': []}, kernel_config=KernelConfig(ir_op_priority=IrOpPriorityConfig(rms_norm=['native']), enable_flashinfer_autotune=True, moe_backend='auto')
(EngineCore pid=279) WARNING 04-18 15:45:26 [multiproc_executor.py:1029] Reducing Torch parallelism from 39 threads to 1 to avoid unnecessary CPU contention. Set OMP_NUM_THREADS in the external environment to tune this value as needed.
(EngineCore pid=279) INFO 04-18 15:45:26 [multiproc_executor.py:139] DP group leader: node_rank=0, node_rank_within_dp=0, master_addr=127.0.0.1, mq_connect_ip=10.244.112.43 (local), world_size=2, local_world_size=2
`Qwen2VLImageProcessorFast` is deprecated. The `Fast` suffix for image processors has been removed; use `Qwen2VLImageProcessor` instead.
INFO 04-18 15:45:38 [config.py:947] Found sentence-transformers tokenize configuration.
(Worker pid=364) /usr/local/lib/python3.12/dist-packages/torch/cuda/__init__.py:371: UserWarning: Found GPU0 Tesla V100-SXM2-32GB which is of compute capability (CC) 7.0.
(Worker pid=364) The following list shows the CCs this version of PyTorch was built for and the hardware CCs it supports:
(Worker pid=364) - 7.5 which supports hardware CC >=7.5,<8.0
(Worker pid=364) - 8.0 which supports hardware CC >=8.0,<9.0 except {8.7}
(Worker pid=364) - 8.6 which supports hardware CC >=8.6,<9.0 except {8.7}
(Worker pid=364) - 9.0 which supports hardware CC >=9.0,<10.0
(Worker pid=364) - 10.0 which supports hardware CC >=10.0,<11.0 except {10.1}
(Worker pid=364) - 12.0 which supports hardware CC >=12.0,<13.0
(Worker pid=364) - 12.0 which supports hardware CC >=12.0,<13.0
(Worker pid=364) Please follow the instructions at https://pytorch.org/get-started/locally/ to install a PyTorch release that supports one of these CUDA versions: 12.6, 12.8
(Worker pid=364)   _warn_unsupported_code(d, device_cc, code_ccs)
(Worker pid=364) /usr/local/lib/python3.12/dist-packages/torch/cuda/__init__.py:371: UserWarning: Found GPU1 Tesla V100-SXM2-32GB which is of compute capability (CC) 7.0.
(Worker pid=364) The following list shows the CCs this version of PyTorch was built for and the hardware CCs it supports:
(Worker pid=364) - 7.5 which supports hardware CC >=7.5,<8.0
(Worker pid=364) - 8.0 which supports hardware CC >=8.0,<9.0 except {8.7}
(Worker pid=364) - 8.6 which supports hardware CC >=8.6,<9.0 except {8.7}
(Worker pid=364) - 9.0 which supports hardware CC >=9.0,<10.0
(Worker pid=364) - 10.0 which supports hardware CC >=10.0,<11.0 except {10.1}
(Worker pid=364) - 12.0 which supports hardware CC >=12.0,<13.0
(Worker pid=364) - 12.0 which supports hardware CC >=12.0,<13.0
(Worker pid=364) Please follow the instructions at https://pytorch.org/get-started/locally/ to install a PyTorch release that supports one of these CUDA versions: 12.6, 12.8
(Worker pid=364)   _warn_unsupported_code(d, device_cc, code_ccs)
(Worker pid=364) INFO 04-18 15:45:39 [parallel_state.py:1400] world_size=2 rank=0 local_rank=0 distributed_init_method=tcp://127.0.0.1:58837 backend=nccl
`Qwen2VLImageProcessorFast` is deprecated. The `Fast` suffix for image processors has been removed; use `Qwen2VLImageProcessor` instead.
INFO 04-18 15:45:46 [config.py:947] Found sentence-transformers tokenize configuration.
(Worker pid=369) /usr/local/lib/python3.12/dist-packages/torch/cuda/__init__.py:371: UserWarning: Found GPU0 Tesla V100-SXM2-32GB which is of compute capability (CC) 7.0.
(Worker pid=369) The following list shows the CCs this version of PyTorch was built for and the hardware CCs it supports:
(Worker pid=369) - 7.5 which supports hardware CC >=7.5,<8.0
(Worker pid=369) - 8.0 which supports hardware CC >=8.0,<9.0 except {8.7}
(Worker pid=369) - 8.6 which supports hardware CC >=8.6,<9.0 except {8.7}
(Worker pid=369) - 9.0 which supports hardware CC >=9.0,<10.0
(Worker pid=369) - 10.0 which supports hardware CC >=10.0,<11.0 except {10.1}
(Worker pid=369) - 12.0 which supports hardware CC >=12.0,<13.0
(Worker pid=369) - 12.0 which supports hardware CC >=12.0,<13.0
(Worker pid=369) Please follow the instructions at https://pytorch.org/get-started/locally/ to install a PyTorch release that supports one of these CUDA versions: 12.6, 12.8
(Worker pid=369)   _warn_unsupported_code(d, device_cc, code_ccs)
(Worker pid=369) /usr/local/lib/python3.12/dist-packages/torch/cuda/__init__.py:371: UserWarning: Found GPU1 Tesla V100-SXM2-32GB which is of compute capability (CC) 7.0.
(Worker pid=369) The following list shows the CCs this version of PyTorch was built for and the hardware CCs it supports:
(Worker pid=369) - 7.5 which supports hardware CC >=7.5,<8.0
(Worker pid=369) - 8.0 which supports hardware CC >=8.0,<9.0 except {8.7}
(Worker pid=369) - 8.6 which supports hardware CC >=8.6,<9.0 except {8.7}
(Worker pid=369) - 9.0 which supports hardware CC >=9.0,<10.0
(Worker pid=369) - 10.0 which supports hardware CC >=10.0,<11.0 except {10.1}
(Worker pid=369) - 12.0 which supports hardware CC >=12.0,<13.0
(Worker pid=369) - 12.0 which supports hardware CC >=12.0,<13.0
(Worker pid=369) Please follow the instructions at https://pytorch.org/get-started/locally/ to install a PyTorch release that supports one of these CUDA versions: 12.6, 12.8
(Worker pid=369)   _warn_unsupported_code(d, device_cc, code_ccs)
(Worker pid=369) INFO 04-18 15:45:46 [parallel_state.py:1400] world_size=2 rank=1 local_rank=1 distributed_init_method=tcp://127.0.0.1:58837 backend=nccl
(Worker pid=364) <frozen importlib._bootstrap_external>:1301: FutureWarning: The cuda.cudart module is deprecated and will be removed in a future release, please switch to use the cuda.bindings.runtime module instead.
(Worker pid=369) <frozen importlib._bootstrap_external>:1301: FutureWarning: The cuda.cudart module is deprecated and will be removed in a future release, please switch to use the cuda.bindings.runtime module instead.
(Worker pid=364) <frozen importlib._bootstrap_external>:1301: FutureWarning: The cuda.nvrtc module is deprecated and will be removed in a future release, please switch to use the cuda.bindings.nvrtc module instead.
(Worker pid=369) <frozen importlib._bootstrap_external>:1301: FutureWarning: The cuda.nvrtc module is deprecated and will be removed in a future release, please switch to use the cuda.bindings.nvrtc module instead.
(Worker pid=364) INFO 04-18 15:45:47 [pynccl.py:111] vLLM is using nccl==2.28.9
(Worker pid=369) ERROR 04-18 15:45:48 [multiproc_executor.py:870] WorkerProc failed to start.
(Worker pid=369) ERROR 04-18 15:45:48 [multiproc_executor.py:870] Traceback (most recent call last):
(Worker pid=369) ERROR 04-18 15:45:48 [multiproc_executor.py:870]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/multiproc_executor.py", line 837, in worker_main
(Worker pid=369) ERROR 04-18 15:45:48 [multiproc_executor.py:870]     worker = WorkerProc(*args, **kwargs)
(Worker pid=369) ERROR 04-18 15:45:48 [multiproc_executor.py:870]              ^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker pid=369) ERROR 04-18 15:45:48 [multiproc_executor.py:870]   File "/usr/local/lib/python3.12/dist-packages/vllm/tracing/otel.py", line 178, in sync_wrapper
(Worker pid=369) ERROR 04-18 15:45:48 [multiproc_executor.py:870]     return func(*args, **kwargs)
(Worker pid=369) ERROR 04-18 15:45:48 [multiproc_executor.py:870]            ^^^^^^^^^^^^^^^^^^^^^
(Worker pid=369) ERROR 04-18 15:45:48 [multiproc_executor.py:870]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/multiproc_executor.py", line 611, in __init__
(Worker pid=369) ERROR 04-18 15:45:48 [multiproc_executor.py:870]     self.worker.init_device()
(Worker pid=369) ERROR 04-18 15:45:48 [multiproc_executor.py:870]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/worker_base.py", line 317, in init_device
(Worker pid=369) ERROR 04-18 15:45:48 [multiproc_executor.py:870]     self.worker.init_device()  # type: ignore
(Worker pid=369) ERROR 04-18 15:45:48 [multiproc_executor.py:870]     ^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker pid=369) ERROR 04-18 15:45:48 [multiproc_executor.py:870]   File "/usr/local/lib/python3.12/dist-packages/vllm/tracing/otel.py", line 178, in sync_wrapper
(Worker pid=369) ERROR 04-18 15:45:48 [multiproc_executor.py:870]     return func(*args, **kwargs)
(Worker pid=369) ERROR 04-18 15:45:48 [multiproc_executor.py:870]            ^^^^^^^^^^^^^^^^^^^^^
(Worker pid=369) ERROR 04-18 15:45:48 [multiproc_executor.py:870]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_worker.py", line 263, in init_device
(Worker pid=369) ERROR 04-18 15:45:48 [multiproc_executor.py:870]     init_worker_distributed_environment(
(Worker pid=369) ERROR 04-18 15:45:48 [multiproc_executor.py:870]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_worker.py", line 1051, in init_worker_distributed_environment
(Worker pid=369) ERROR 04-18 15:45:48 [multiproc_executor.py:870]     ensure_model_parallel_initialized(
(Worker pid=369) ERROR 04-18 15:45:48 [multiproc_executor.py:870]   File "/usr/local/lib/python3.12/dist-packages/vllm/distributed/parallel_state.py", line 1745, in ensure_model_parallel_initialized
(Worker pid=369) ERROR 04-18 15:45:48 [multiproc_executor.py:870]     initialize_model_parallel(
(Worker pid=369) ERROR 04-18 15:45:48 [multiproc_executor.py:870]   File "/usr/local/lib/python3.12/dist-packages/vllm/distributed/parallel_state.py", line 1576, in initialize_model_parallel
(Worker pid=369) ERROR 04-18 15:45:48 [multiproc_executor.py:870]     _TP = init_model_parallel_group(
(Worker pid=369) ERROR 04-18 15:45:48 [multiproc_executor.py:870]           ^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker pid=369) ERROR 04-18 15:45:48 [multiproc_executor.py:870]   File "/usr/local/lib/python3.12/dist-packages/vllm/distributed/parallel_state.py", line 1157, in init_model_parallel_group
(Worker pid=369) ERROR 04-18 15:45:48 [multiproc_executor.py:870]     return GroupCoordinator(
(Worker pid=369) ERROR 04-18 15:45:48 [multiproc_executor.py:870]            ^^^^^^^^^^^^^^^^^
(Worker pid=369) ERROR 04-18 15:45:48 [multiproc_executor.py:870]   File "/usr/local/lib/python3.12/dist-packages/vllm/distributed/parallel_state.py", line 376, in __init__
(Worker pid=369) ERROR 04-18 15:45:48 [multiproc_executor.py:870]     self.device_communicator = device_comm_cls(
(Worker pid=369) ERROR 04-18 15:45:48 [multiproc_executor.py:870]                                ^^^^^^^^^^^^^^^^
(Worker pid=369) ERROR 04-18 15:45:48 [multiproc_executor.py:870]   File "/usr/local/lib/python3.12/dist-packages/vllm/distributed/device_communicators/cuda_communicator.py", line 75, in __init__
(Worker pid=369) ERROR 04-18 15:45:48 [multiproc_executor.py:870]     self.pynccl_comm = PyNcclCommunicator(
(Worker pid=369) ERROR 04-18 15:45:48 [multiproc_executor.py:870]                        ^^^^^^^^^^^^^^^^^^^
(Worker pid=369) ERROR 04-18 15:45:48 [multiproc_executor.py:870]   File "/usr/local/lib/python3.12/dist-packages/vllm/distributed/device_communicators/pynccl.py", line 143, in __init__
(Worker pid=369) ERROR 04-18 15:45:48 [multiproc_executor.py:870]     data = torch.zeros(1, device=device)
(Worker pid=369) ERROR 04-18 15:45:48 [multiproc_executor.py:870]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker pid=369) ERROR 04-18 15:45:48 [multiproc_executor.py:870] torch.AcceleratorError: CUDA error: no kernel image is available for execution on the device
(Worker pid=369) ERROR 04-18 15:45:48 [multiproc_executor.py:870] Search for `cudaErrorNoKernelImageForDevice' in https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__TYPES.html for more information.
(Worker pid=369) ERROR 04-18 15:45:48 [multiproc_executor.py:870] CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
(Worker pid=369) ERROR 04-18 15:45:48 [multiproc_executor.py:870] For debugging consider passing CUDA_LAUNCH_BLOCKING=1
(Worker pid=369) ERROR 04-18 15:45:48 [multiproc_executor.py:870] Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
(Worker pid=369) ERROR 04-18 15:45:48 [multiproc_executor.py:870] 
(Worker pid=364) ERROR 04-18 15:45:48 [multiproc_executor.py:870] WorkerProc failed to start.
(Worker pid=364) ERROR 04-18 15:45:48 [multiproc_executor.py:870] Traceback (most recent call last):
(Worker pid=364) ERROR 04-18 15:45:48 [multiproc_executor.py:870]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/multiproc_executor.py", line 837, in worker_main
(Worker pid=364) ERROR 04-18 15:45:48 [multiproc_executor.py:870]     worker = WorkerProc(*args, **kwargs)
(Worker pid=364) ERROR 04-18 15:45:48 [multiproc_executor.py:870]              ^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker pid=364) ERROR 04-18 15:45:48 [multiproc_executor.py:870]   File "/usr/local/lib/python3.12/dist-packages/vllm/tracing/otel.py", line 178, in sync_wrapper
(Worker pid=364) ERROR 04-18 15:45:48 [multiproc_executor.py:870]     return func(*args, **kwargs)
(Worker pid=364) ERROR 04-18 15:45:48 [multiproc_executor.py:870]            ^^^^^^^^^^^^^^^^^^^^^
(Worker pid=364) ERROR 04-18 15:45:48 [multiproc_executor.py:870]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/multiproc_executor.py", line 611, in __init__
(Worker pid=364) ERROR 04-18 15:45:48 [multiproc_executor.py:870]     self.worker.init_device()
(Worker pid=364) ERROR 04-18 15:45:48 [multiproc_executor.py:870]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/worker_base.py", line 317, in init_device
(Worker pid=364) ERROR 04-18 15:45:48 [multiproc_executor.py:870]     self.worker.init_device()  # type: ignore
(Worker pid=364) ERROR 04-18 15:45:48 [multiproc_executor.py:870]     ^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker pid=364) ERROR 04-18 15:45:48 [multiproc_executor.py:870]   File "/usr/local/lib/python3.12/dist-packages/vllm/tracing/otel.py", line 178, in sync_wrapper
(Worker pid=364) ERROR 04-18 15:45:48 [multiproc_executor.py:870]     return func(*args, **kwargs)
(Worker pid=364) ERROR 04-18 15:45:48 [multiproc_executor.py:870]            ^^^^^^^^^^^^^^^^^^^^^
(Worker pid=364) ERROR 04-18 15:45:48 [multiproc_executor.py:870]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_worker.py", line 263, in init_device
(Worker pid=364) ERROR 04-18 15:45:48 [multiproc_executor.py:870]     init_worker_distributed_environment(
(Worker pid=364) ERROR 04-18 15:45:48 [multiproc_executor.py:870]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_worker.py", line 1051, in init_worker_distributed_environment
(Worker pid=364) ERROR 04-18 15:45:48 [multiproc_executor.py:870]     ensure_model_parallel_initialized(
(Worker pid=364) ERROR 04-18 15:45:48 [multiproc_executor.py:870]   File "/usr/local/lib/python3.12/dist-packages/vllm/distributed/parallel_state.py", line 1745, in ensure_model_parallel_initialized
(Worker pid=364) ERROR 04-18 15:45:48 [multiproc_executor.py:870]     initialize_model_parallel(
(Worker pid=364) ERROR 04-18 15:45:48 [multiproc_executor.py:870]   File "/usr/local/lib/python3.12/dist-packages/vllm/distributed/parallel_state.py", line 1576, in initialize_model_parallel
(Worker pid=364) ERROR 04-18 15:45:48 [multiproc_executor.py:870]     _TP = init_model_parallel_group(
(Worker pid=364) ERROR 04-18 15:45:48 [multiproc_executor.py:870]           ^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker pid=364) ERROR 04-18 15:45:48 [multiproc_executor.py:870]   File "/usr/local/lib/python3.12/dist-packages/vllm/distributed/parallel_state.py", line 1157, in init_model_parallel_group
(Worker pid=364) ERROR 04-18 15:45:48 [multiproc_executor.py:870]     return GroupCoordinator(
(Worker pid=364) ERROR 04-18 15:45:48 [multiproc_executor.py:870]            ^^^^^^^^^^^^^^^^^
(Worker pid=364) ERROR 04-18 15:45:48 [multiproc_executor.py:870]   File "/usr/local/lib/python3.12/dist-packages/vllm/distributed/parallel_state.py", line 376, in __init__
(Worker pid=364) ERROR 04-18 15:45:48 [multiproc_executor.py:870]     self.device_communicator = device_comm_cls(
(Worker pid=364) ERROR 04-18 15:45:48 [multiproc_executor.py:870]                                ^^^^^^^^^^^^^^^^
(Worker pid=364) ERROR 04-18 15:45:48 [multiproc_executor.py:870]   File "/usr/local/lib/python3.12/dist-packages/vllm/distributed/device_communicators/cuda_communicator.py", line 75, in __init__
(Worker pid=364) ERROR 04-18 15:45:48 [multiproc_executor.py:870]     self.pynccl_comm = PyNcclCommunicator(
(Worker pid=364) ERROR 04-18 15:45:48 [multiproc_executor.py:870]                        ^^^^^^^^^^^^^^^^^^^
(Worker pid=364) ERROR 04-18 15:45:48 [multiproc_executor.py:870]   File "/usr/local/lib/python3.12/dist-packages/vllm/distributed/device_communicators/pynccl.py", line 143, in __init__
(Worker pid=364) ERROR 04-18 15:45:48 [multiproc_executor.py:870]     data = torch.zeros(1, device=device)
(Worker pid=364) ERROR 04-18 15:45:48 [multiproc_executor.py:870]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker pid=364) ERROR 04-18 15:45:48 [multiproc_executor.py:870] torch.AcceleratorError: CUDA error: no kernel image is available for execution on the device
(Worker pid=364) ERROR 04-18 15:45:48 [multiproc_executor.py:870] Search for `cudaErrorNoKernelImageForDevice' in https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__TYPES.html for more information.
(Worker pid=364) ERROR 04-18 15:45:48 [multiproc_executor.py:870] CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
(Worker pid=364) ERROR 04-18 15:45:48 [multiproc_executor.py:870] For debugging consider passing CUDA_LAUNCH_BLOCKING=1
(Worker pid=364) ERROR 04-18 15:45:48 [multiproc_executor.py:870] Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
(Worker pid=364) ERROR 04-18 15:45:48 [multiproc_executor.py:870] 
[rank0]:[W418 15:45:49.697881747 ProcessGroupNCCL.cpp:1575] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())
(EngineCore pid=279) ERROR 04-18 15:45:50 [core.py:1132] EngineCore failed to start.
(EngineCore pid=279) ERROR 04-18 15:45:50 [core.py:1132] Traceback (most recent call last):
(EngineCore pid=279) ERROR 04-18 15:45:50 [core.py:1132]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 1106, in run_engine_core
(EngineCore pid=279) ERROR 04-18 15:45:50 [core.py:1132]     engine_core = EngineCoreProc(*args, engine_index=dp_rank, **kwargs)
(EngineCore pid=279) ERROR 04-18 15:45:50 [core.py:1132]                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=279) ERROR 04-18 15:45:50 [core.py:1132]   File "/usr/local/lib/python3.12/dist-packages/vllm/tracing/otel.py", line 178, in sync_wrapper
(EngineCore pid=279) ERROR 04-18 15:45:50 [core.py:1132]     return func(*args, **kwargs)
(EngineCore pid=279) ERROR 04-18 15:45:50 [core.py:1132]            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=279) ERROR 04-18 15:45:50 [core.py:1132]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 872, in __init__
(EngineCore pid=279) ERROR 04-18 15:45:50 [core.py:1132]     super().__init__(
(EngineCore pid=279) ERROR 04-18 15:45:50 [core.py:1132]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 116, in __init__
(EngineCore pid=279) ERROR 04-18 15:45:50 [core.py:1132]     self.model_executor = executor_class(vllm_config)
(EngineCore pid=279) ERROR 04-18 15:45:50 [core.py:1132]                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=279) ERROR 04-18 15:45:50 [core.py:1132]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/multiproc_executor.py", line 107, in __init__
(EngineCore pid=279) ERROR 04-18 15:45:50 [core.py:1132]     super().__init__(vllm_config)
(EngineCore pid=279) ERROR 04-18 15:45:50 [core.py:1132]   File "/usr/local/lib/python3.12/dist-packages/vllm/tracing/otel.py", line 178, in sync_wrapper
(EngineCore pid=279) ERROR 04-18 15:45:50 [core.py:1132]     return func(*args, **kwargs)
(EngineCore pid=279) ERROR 04-18 15:45:50 [core.py:1132]            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=279) ERROR 04-18 15:45:50 [core.py:1132]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/abstract.py", line 109, in __init__
(EngineCore pid=279) ERROR 04-18 15:45:50 [core.py:1132]     self._init_executor()
(EngineCore pid=279) ERROR 04-18 15:45:50 [core.py:1132]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/multiproc_executor.py", line 200, in _init_executor
(EngineCore pid=279) ERROR 04-18 15:45:50 [core.py:1132]     self.workers = WorkerProc.wait_for_ready(unready_workers)
(EngineCore pid=279) Process EngineCore:
(EngineCore pid=279) ERROR 04-18 15:45:50 [core.py:1132]                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=279) ERROR 04-18 15:45:50 [core.py:1132]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/multiproc_executor.py", line 747, in wait_for_ready
(EngineCore pid=279) ERROR 04-18 15:45:50 [core.py:1132]     raise e from None
(EngineCore pid=279) ERROR 04-18 15:45:50 [core.py:1132] Exception: WorkerProc initialization failed due to an exception in a background process. See stack trace for root cause.
(EngineCore pid=279) Traceback (most recent call last):
(EngineCore pid=279)   File "/usr/lib/python3.12/multiprocessing/process.py", line 314, in _bootstrap
(EngineCore pid=279)     self.run()
(EngineCore pid=279)   File "/usr/lib/python3.12/multiprocessing/process.py", line 108, in run
(EngineCore pid=279)     self._target(*self._args, **self._kwargs)
(EngineCore pid=279)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 1136, in run_engine_core
(EngineCore pid=279)     raise e
(EngineCore pid=279)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 1106, in run_engine_core
(EngineCore pid=279)     engine_core = EngineCoreProc(*args, engine_index=dp_rank, **kwargs)
(EngineCore pid=279)                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=279)   File "/usr/local/lib/python3.12/dist-packages/vllm/tracing/otel.py", line 178, in sync_wrapper
(EngineCore pid=279)     return func(*args, **kwargs)
(EngineCore pid=279)            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=279)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 872, in __init__
(EngineCore pid=279)     super().__init__(
(EngineCore pid=279)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 116, in __init__
(EngineCore pid=279)     self.model_executor = executor_class(vllm_config)
(EngineCore pid=279)                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=279)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/multiproc_executor.py", line 107, in __init__
(EngineCore pid=279)     super().__init__(vllm_config)
(EngineCore pid=279)   File "/usr/local/lib/python3.12/dist-packages/vllm/tracing/otel.py", line 178, in sync_wrapper
(EngineCore pid=279)     return func(*args, **kwargs)
(EngineCore pid=279)            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=279)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/abstract.py", line 109, in __init__
(EngineCore pid=279)     self._init_executor()
(EngineCore pid=279)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/multiproc_executor.py", line 200, in _init_executor
(EngineCore pid=279)     self.workers = WorkerProc.wait_for_ready(unready_workers)
(EngineCore pid=279)                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=279)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/multiproc_executor.py", line 747, in wait_for_ready
(EngineCore pid=279)     raise e from None
(EngineCore pid=279) Exception: WorkerProc initialization failed due to an exception in a background process. See stack trace for root cause.
(APIServer pid=1) Traceback (most recent call last):
(APIServer pid=1)   File "<frozen runpy>", line 198, in _run_module_as_main
(APIServer pid=1)   File "<frozen runpy>", line 88, in _run_code
(APIServer pid=1)   File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 718, in <module>
(APIServer pid=1)     uvloop.run(run_server(args))
(APIServer pid=1)   File "/usr/local/lib/python3.12/dist-packages/uvloop/__init__.py", line 96, in run
(APIServer pid=1)     return __asyncio.run(
(APIServer pid=1)            ^^^^^^^^^^^^^^
(APIServer pid=1)   File "/usr/lib/python3.12/asyncio/runners.py", line 195, in run
(APIServer pid=1)     return runner.run(main)
(APIServer pid=1)            ^^^^^^^^^^^^^^^^
(APIServer pid=1)   File "/usr/lib/python3.12/asyncio/runners.py", line 118, in run
(APIServer pid=1)     return self._loop.run_until_complete(task)
(APIServer pid=1)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1)   File "uvloop/loop.pyx", line 1518, in uvloop.loop.Loop.run_until_complete
(APIServer pid=1)   File "/usr/local/lib/python3.12/dist-packages/uvloop/__init__.py", line 48, in wrapper
(APIServer pid=1)     return await main
(APIServer pid=1)            ^^^^^^^^^^
(APIServer pid=1)   File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 678, in run_server
(APIServer pid=1)     await run_server_worker(listen_address, sock, args, **uvicorn_kwargs)
(APIServer pid=1)   File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 692, in run_server_worker
(APIServer pid=1)     async with build_async_engine_client(
(APIServer pid=1)                ^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1)   File "/usr/lib/python3.12/contextlib.py", line 210, in __aenter__
(APIServer pid=1)     return await anext(self.gen)
(APIServer pid=1)            ^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1)   File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 100, in build_async_engine_client
(APIServer pid=1)     async with build_async_engine_client_from_engine_args(
(APIServer pid=1)                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1)   File "/usr/lib/python3.12/contextlib.py", line 210, in __aenter__
(APIServer pid=1)     return await anext(self.gen)
(APIServer pid=1)            ^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1)   File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 136, in build_async_engine_client_from_engine_args
(APIServer pid=1)     async_llm = AsyncLLM.from_vllm_config(
(APIServer pid=1)                 ^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/async_llm.py", line 219, in from_vllm_config
(APIServer pid=1)     return cls(
(APIServer pid=1)            ^^^^
(APIServer pid=1)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/async_llm.py", line 148, in __init__
(APIServer pid=1)     self.engine_core = EngineCoreClient.make_async_mp_client(
(APIServer pid=1)                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1)   File "/usr/local/lib/python3.12/dist-packages/vllm/tracing/otel.py", line 178, in sync_wrapper
(APIServer pid=1)     return func(*args, **kwargs)
(APIServer pid=1)            ^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core_client.py", line 130, in make_async_mp_client
(APIServer pid=1)     return AsyncMPClient(*client_args)
(APIServer pid=1)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1)   File "/usr/local/lib/python3.12/dist-packages/vllm/tracing/otel.py", line 178, in sync_wrapper
(APIServer pid=1)     return func(*args, **kwargs)
(APIServer pid=1)            ^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core_client.py", line 900, in __init__
(APIServer pid=1)     super().__init__(
(APIServer pid=1)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core_client.py", line 535, in __init__
(APIServer pid=1)     with launch_core_engines(
(APIServer pid=1)          ^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1)   File "/usr/lib/python3.12/contextlib.py", line 144, in __exit__
(APIServer pid=1)     next(self.gen)
(APIServer pid=1)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/utils.py", line 1094, in launch_core_engines
(APIServer pid=1)     wait_for_engine_startup(
(APIServer pid=1)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/utils.py", line 1153, in wait_for_engine_startup
(APIServer pid=1)     raise RuntimeError(
(APIServer pid=1) RuntimeError: Engine core initialization failed. See root cause above. Failed core proc(s): {}

Harry-Chen · 2026-04-18T16:04:06Z

@ehfd PyTorch has dropped Volta support from its CUDA 12.8 builds: pytorch/pytorch#172351

dmitry-tokarev-nv · 2026-04-23T06:13:24Z

+  CUDA_ARCH_X86: "7.5 8.0 8.6 8.9 9.0 10.0 12.0+PTX"
+  # aarch64 only architectures: 8.7 for Orin, 11.0 for Thor (since CUDA 13)
+  CUDA_ARCH_AARCH64: "8.0 8.7 8.9 9.0 10.0 11.0 12.0+PTX"
+  CUDA_ARCH_X86_CU129: "7.5 8.0 8.6 8.9 9.0 10.0 12.0"
+  CUDA_ARCH_AARCH64_CU129: "8.0 8.7 8.9 9.0 10.0 12.0"


Suggested change

-  CUDA_ARCH_X86: "7.5 8.0 8.6 8.9 9.0 10.0 12.0+PTX"
-  # aarch64 only architectures: 8.7 for Orin, 11.0 for Thor (since CUDA 13)
-  CUDA_ARCH_AARCH64: "8.0 8.7 8.9 9.0 10.0 11.0 12.0+PTX"
-  CUDA_ARCH_X86_CU129: "7.5 8.0 8.6 8.9 9.0 10.0 12.0"
-  CUDA_ARCH_AARCH64_CU129: "8.0 8.7 8.9 9.0 10.0 12.0"
+  CUDA_ARCH_X86: "7.5 8.0 8.6 8.9 9.0 10.0 10.3 12.0+PTX"
+  # aarch64 only architectures: 8.7 for Orin, 11.0 for Thor (since CUDA 13)
+  CUDA_ARCH_AARCH64: "8.0 8.6 8.7 8.9 9.0 10.0 10.3 11.0 12.0+PTX"
+  CUDA_ARCH_X86_CU129: "7.5 8.0 8.6 8.9 9.0 10.0 12.0"
+  CUDA_ARCH_AARCH64_CU129: "8.0 8.6 8.7 8.9 9.0 10.0 12.0"

Both 8.6 and 10.3 apply to x86 and aarch64; see https://developer.nvidia.com/cuda/gpus
8.6 - covers a long list of PCIe GPUs
10.3 - B300 (works with x86) and GB300 (ARM-based Grace CPU).

Thanks!

We have never compiled for sm_86 on aarch64, and IIRC those devices can use the cubin from sm_80.

For CUDA 13+, we use the family specifier 10.0f, so 10.3 is covered in kernels that require tcgen05; for other kernels, 10.0 is enough to be compatible, I think.
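The compatibility argument in this thread can be sketched in a few lines of Python. This is an illustrative model only, not code from vLLM or PyTorch: the helper names `parse_arch_list` and `covering_entry` are hypothetical, and the model assumes the general CUDA rule that a cubin built for X.y runs on devices X.z with z >= y, and that a `+PTX` entry can be JIT-compiled for any newer device. It does not model family-specific (`10.0f`) or architecture-specific (`a`) targets, which follow stricter rules.

```python
def parse_arch_list(arch_list: str):
    """Split a TORCH_CUDA_ARCH_LIST-style string into (major, minor, has_ptx) tuples."""
    entries = []
    for token in arch_list.split():
        has_ptx = token.endswith("+PTX")
        major, minor = token.removesuffix("+PTX").split(".")
        entries.append((int(major), int(minor), has_ptx))
    return entries


def covering_entry(arch_list: str, device_cc: str):
    """Return which list entry would serve a device, or None if nothing does.

    Assumed rules (general CUDA compatibility, simplified):
      1. A cubin for X.y runs on a device X.z when z >= y (same major version).
      2. A +PTX entry can be JIT-compiled for any device with an equal or
         higher compute capability.
    """
    dev_major, dev_minor = (int(part) for part in device_cc.split("."))
    entries = parse_arch_list(arch_list)

    # Rule 1: prefer the closest cubin within the same major version.
    cubins = [(ma, mi) for ma, mi, _ in entries
              if ma == dev_major and mi <= dev_minor]
    if cubins:
        ma, mi = max(cubins)
        return f"cubin {ma}.{mi}"

    # Rule 2: fall back to PTX from any older-or-equal entry that ships it.
    ptx = [(ma, mi) for ma, mi, has_ptx in entries
           if has_ptx and (ma, mi) <= (dev_major, dev_minor)]
    if ptx:
        ma, mi = max(ptx)
        return f"ptx {ma}.{mi}"

    return None  # corresponds to "no kernel image is available"


# Lists copied from the diff in this review thread.
CUDA_ARCH_X86 = "7.5 8.0 8.6 8.9 9.0 10.0 12.0+PTX"
CUDA_ARCH_AARCH64_CU129 = "8.0 8.7 8.9 9.0 10.0 12.0"

print(covering_entry(CUDA_ARCH_X86, "10.3"))           # cubin 10.0
print(covering_entry(CUDA_ARCH_AARCH64_CU129, "8.6"))  # cubin 8.0
print(covering_entry(CUDA_ARCH_X86, "7.0"))            # None
```

Under these assumptions, a 10.3 device is served by the 10.0 cubin and an 8.6 device on aarch64 by the 8.0 cubin, which matches the reply above. A 7.0 (Volta) device has no covering entry, which is consistent with the `no kernel image is available` error in the traceback earlier in the conversation.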

dmitry-tokarev-nv · 2026-04-23T06:17:09Z

 export MAX_JOBS=1
 # Make sure release wheels are built for the following architectures
-export TORCH_CUDA_ARCH_LIST="7.0 7.5 8.0 8.6 8.9 9.0+PTX"
+export TORCH_CUDA_ARCH_LIST="7.5 8.0 8.6 8.9 9.0 10.0 12.0+PTX"


Suggested change

-export TORCH_CUDA_ARCH_LIST="7.5 8.0 8.6 8.9 9.0 10.0 12.0+PTX"
+export TORCH_CUDA_ARCH_LIST="7.5 8.0 8.6 8.9 9.0 10.0 10.3 12.0+PTX"

see https://developer.nvidia.com/cuda/gpus
10.3 - B300 (works with x86) and GB300 (ARM-based Grace CPU).

This list is no longer used, since we do not run GitHub Actions. It is kept for reference only.

dmitry-tokarev-nv · 2026-04-23T06:18:53Z


 # install kv_connectors if requested
-ARG torch_cuda_arch_list='7.0 7.5 8.0 8.9 9.0 10.0 12.0'
+ARG torch_cuda_arch_list='7.5 8.0 8.6 8.9 9.0 10.0 12.0+PTX'


Suggested change

-ARG torch_cuda_arch_list='7.5 8.0 8.6 8.9 9.0 10.0 12.0+PTX'
+ARG torch_cuda_arch_list='7.5 8.0 8.6 8.9 9.0 10.0 10.3 12.0+PTX'

see https://developer.nvidia.com/cuda/gpus
10.3 - B300 (works with x86) and GB300 (ARM-based Grace CPU).

Same as above -- 10.3 is absorbed by 10.0f family on CUDA 13 in supported kernels, and generally can reuse the 10.0 cubin.

dmitry-tokarev-nv · 2026-04-23T06:19:06Z

 # See https://github.com/pytorch/pytorch/pull/123243
 # From versions.json: .torch.cuda_arch_list
-ARG torch_cuda_arch_list='7.0 7.5 8.0 8.9 9.0 10.0 12.0'
+ARG torch_cuda_arch_list='7.5 8.0 8.6 8.9 9.0 10.0 12.0+PTX'


Suggested change

-ARG torch_cuda_arch_list='7.5 8.0 8.6 8.9 9.0 10.0 12.0+PTX'
+ARG torch_cuda_arch_list='7.5 8.0 8.6 8.9 9.0 10.0 10.3 12.0+PTX'

see https://developer.nvidia.com/cuda/gpus
10.3 - B300 (works with x86) and GB300 (ARM-based Grace CPU).

Same as above -- 10.3 is absorbed by 10.0f family on CUDA 13 in supported kernels, and generally can reuse the 10.0 cubin.

dmitry-tokarev-nv · 2026-04-23T06:19:28Z

    },
    "TORCH_CUDA_ARCH_LIST": {
-      "default": "7.0 7.5 8.0 8.9 9.0 10.0 12.0"
+      "default": "7.5 8.0 8.6 8.9 9.0 10.0 12.0+PTX"


Suggested change

-      "default": "7.5 8.0 8.6 8.9 9.0 10.0 12.0+PTX"
+      "default": "7.5 8.0 8.6 8.9 9.0 10.0 10.3 12.0+PTX"

see https://developer.nvidia.com/cuda/gpus
10.3 - B300 (works with x86) and GB300 (ARM-based Grace CPU).

Same as above -- 10.3 is absorbed by 10.0f family on CUDA 13 in supported kernels, and generally can reuse the 10.0 cubin.

dmitry-tokarev-nv · 2026-04-23T06:26:27Z

+  set(CUDA_SUPPORTED_ARCHS "7.5;8.0;8.6;8.7;8.9;9.0;10.0;11.0;12.0")
 elseif(DEFINED CMAKE_CUDA_COMPILER_VERSION AND
   CMAKE_CUDA_COMPILER_VERSION VERSION_GREATER_EQUAL 12.8)
-  set(CUDA_SUPPORTED_ARCHS "7.0;7.2;7.5;8.0;8.6;8.7;8.9;9.0;10.0;10.1;12.0;12.1")
+  set(CUDA_SUPPORTED_ARCHS "7.5;8.0;8.6;8.7;8.9;9.0;10.0;10.1;10.3;12.0;12.1")


Suggested change

-  set(CUDA_SUPPORTED_ARCHS "7.5;8.0;8.6;8.7;8.9;9.0;10.0;11.0;12.0")
- elseif(DEFINED CMAKE_CUDA_COMPILER_VERSION AND
-   CMAKE_CUDA_COMPILER_VERSION VERSION_GREATER_EQUAL 12.8)
-  set(CUDA_SUPPORTED_ARCHS "7.0;7.2;7.5;8.0;8.6;8.7;8.9;9.0;10.0;10.1;12.0;12.1")
-  set(CUDA_SUPPORTED_ARCHS "7.5;8.0;8.6;8.7;8.9;9.0;10.0;10.1;10.3;12.0;12.1")
+  set(CUDA_SUPPORTED_ARCHS "7.5;8.0;8.6;8.7;8.9;9.0;10.0;10.1;10.3;12.0;12.1")
+ elseif(DEFINED CMAKE_CUDA_COMPILER_VERSION AND
+   CMAKE_CUDA_COMPILER_VERSION VERSION_GREATER_EQUAL 12.8)
+  set(CUDA_SUPPORTED_ARCHS "7.5;8.0;8.6;8.7;8.9;9.0;10.0;11.0;12.0")

I think the lists need to be flipped.
CUDA 12.8 doesn't support 10.3, 11.0, or 12.1.
See https://docs.nvidia.com/cuda/archive/12.8.2/cuda-compiler-driver-nvcc/index.html#virtual-architecture-feature-list

We are only using CUDA 12.9, so it should be safe to use those architectures. Please also see the comments above, which explain why we have not specified 10.3 and 12.1 since CUDA 13.0.

Ack. It could mislead users that the condition says CUDA 12.8 when in reality the branch is for CUDA 12.9.
BTW, CUDA 12.9 doesn't support CC 11.0; see https://docs.nvidia.com/cuda/archive/12.9.1/cuda-compiler-driver-nvcc/index.html#virtual-architecture-feature-list
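The version concerns raised in this thread can be checked mechanically. The sketch below is hypothetical and not part of the vLLM build: `MIN_TOOLKIT` and `unsupported_archs` are illustrative names, and the table encodes only what the reviewers state here (10.3 and 12.1 are unavailable in CUDA 12.8 but usable with 12.9; 11.0 is unavailable before CUDA 13). Architectures missing from the table are assumed to be supported by every toolkit under discussion.

```python
# Earliest toolkit (major, minor) for each architecture, as stated in this
# thread only. Anything not listed is assumed to be supported everywhere.
MIN_TOOLKIT = {
    "10.3": (12, 9),
    "12.1": (12, 9),
    "11.0": (13, 0),
}


def unsupported_archs(cmake_arch_list: str, toolkit: tuple[int, int]) -> list[str]:
    """Return entries of a ';'-separated arch list that the toolkit cannot target."""
    return [
        arch
        for arch in cmake_arch_list.split(";")
        if MIN_TOOLKIT.get(arch, (0, 0)) > toolkit
    ]


# The list guarded by VERSION_GREATER_EQUAL 12.8 in the diff above.
CU12_BRANCH = "7.5;8.0;8.6;8.7;8.9;9.0;10.0;10.1;10.3;12.0;12.1"

print(unsupported_archs(CU12_BRANCH, (12, 8)))  # ['10.3', '12.1']
print(unsupported_archs(CU12_BRANCH, (12, 9)))  # []
```

With these assumptions, the list is only fully valid from CUDA 12.9 onward, which is the mismatch the reviewer points out between the `12.8` condition and the toolkit actually in use.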

dmitry-tokarev-nv

posted comments with suggestions

…clean up stale build-args (#39878) Signed-off-by: Shengqi Chen <harry-chen@outlook.com> Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com> (cherry picked from commit 3ed5231)

…clean up stale build-args (vllm-project#39878) Signed-off-by: Shengqi Chen <harry-chen@outlook.com> Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com> Signed-off-by: Avinash Singh <avinashsingh.rcoem@gmail.com>

…clean up stale build-args (vllm-project#39878) Signed-off-by: Shengqi Chen <harry-chen@outlook.com> Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com> Signed-off-by: Adrian <info@zzit.ch>

…clean up stale build-args (vllm-project#39878) Signed-off-by: Shengqi Chen <harry-chen@outlook.com> Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com> Co-authored-by: hongbolv <33214277+hongbolv@users.noreply.github.com>

…clean up stale build-args (vllm-project#39878) Signed-off-by: Shengqi Chen <harry-chen@outlook.com> Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Harry-Chen and others added 3 commits April 15, 2026 15:11

Copilot AI review requested due to automatic review settings April 15, 2026 07:34

Copilot started reviewing on behalf of Harry-Chen April 15, 2026 07:34

mergify Bot added ci/build nvidia labels Apr 15, 2026

github-project-automation Bot added this to NVIDIA Apr 15, 2026

gemini-code-assist Bot reviewed Apr 15, 2026

Comment thread .buildkite/release-pipeline.yaml

Comment thread .buildkite/release-pipeline.yaml Outdated

Comment thread .buildkite/release-pipeline.yaml Outdated

Comment thread docker/Dockerfile

Copilot AI reviewed Apr 15, 2026

Comment thread .buildkite/scripts/generate-and-upload-nightly-index.sh

Harry-Chen and others added 2 commits April 15, 2026 15:42

[Build] Clean up more unused CUDA architectures

96b73dc

Signed-off-by: Shengqi Chen <harry-chen@outlook.com>

Harry-Chen requested review from LucasWilkinson and tlrmchlsmth as code owners April 15, 2026 13:28

Harry-Chen mentioned this pull request Apr 17, 2026

[CI] Add sm_110 to aarch64 CUDA 13.0 builds #31544

Open

Harry-Chen mentioned this pull request Apr 23, 2026

[Build] Bump CUDA to 13.0.2 to match PyTorch 2.11.0 #40669

Merged

5 tasks

dmitry-tokarev-nv reviewed Apr 23, 2026

dmitry-tokarev-nv suggested changes Apr 23, 2026

github-project-automation Bot moved this to In review in NVIDIA Apr 23, 2026

youkaichao merged commit 3ed5231 into vllm-project:main Apr 23, 2026
143 of 146 checks passed

github-project-automation Bot moved this from In review to Done in NVIDIA Apr 23, 2026

ai-bond mentioned this pull request Apr 25, 2026

How to enable Flash Attention on V100s with vLLM? Any version constraints? I'm trying to run Qwen3.6-35B-A3B-AWQ. ai-bond/flash-attention-v100#27

Closed

Harry-Chen deleted the default-cu130 branch April 27, 2026 14:12

Defilan mentioned this pull request Apr 28, 2026

test + integrate vLLM v0.20.0 (TurboQuant 2-bit KV, DeepSeek V4, FA4 default) defilantech/LLMKube#354

Closed

5 tasks

AlessandroPomponio mentioned this pull request May 4, 2026

build(deps): update dependencies IBM/ado#933

Closed
