[Build] Switch default CUDA to 13.0, update CUDA architecture lists, clean up stale build-args #39878

Merged: youkaichao merged 5 commits into vllm-project:main from Harry-Chen:default-cu130 on Apr 23, 2026

@Harry-Chen Harry-Chen commented Apr 15, 2026

Summary

  • Switch default CUDA version from 12.9 to 13.0. CUDA 12.9 becomes the cu129 variant; CUDA 13.0 wheels and Docker images are now the unsuffixed default (pip install vllm). This aligns with the torch 2.11 release.
  • Update CUDA architecture lists following PyTorch RELEASE.md, and define them as pipeline-level env vars to reduce repetition.
  • Remove stale FLASHINFER_AOT_COMPILE build-arg references left over after #32627 ([CI]: remove unused FLASHINFER_AOT_COMPILE build argument) removed the ARG from the Dockerfile.

We need to announce this change to users after it is merged.

Details

Default CUDA version swap (12.9 → 13.0)

| Component | Before | After |
|---|---|---|
| VLLM_MAIN_CUDA_VERSION | 12.9 | 13.0 |
| Default wheel suffix | (none, implies cu129) | (none, implies cu130) |
| Variant wheel suffix | +cu130 | +cu129 |
| Default Docker tag | vllm-openai:latest = CUDA 12.9 | vllm-openai:latest = CUDA 13.0 |
| Variant Docker tag | vllm-openai:latest-cu130 | vllm-openai:latest-cu129 |
| Nightly index default | cu129 | cu130 |

Files: vllm/envs.py, release-pipeline.yaml, annotate-release.sh, generate-and-upload-nightly-index.sh, check-ray-compatibility.sh
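
The default/variant naming in the table above can be sketched as a small helper. This is an illustration only, not code from the PR: `cuda_tag`, `wheel_local_suffix`, and `docker_tag` are hypothetical names, and only the `VLLM_MAIN_CUDA_VERSION` value and the suffix/tag patterns are taken from the table.

```python
# Hypothetical helpers illustrating the default/variant naming in the table.
# Only the naming patterns come from the PR description; the function names
# and signatures are assumptions made for this sketch.

MAIN_CUDA_VERSION = "13.0"  # value of VLLM_MAIN_CUDA_VERSION after this PR


def cuda_tag(cuda_version: str) -> str:
    """Turn '13.0' into 'cu130' and '12.9' into 'cu129'."""
    major, minor = cuda_version.split(".")
    return f"cu{major}{minor}"


def wheel_local_suffix(cuda_version: str, main: str = MAIN_CUDA_VERSION) -> str:
    """The default CUDA build carries no suffix; variants get '+cuXYZ'."""
    return "" if cuda_version == main else f"+{cuda_tag(cuda_version)}"


def docker_tag(cuda_version: str, main: str = MAIN_CUDA_VERSION) -> str:
    """The default CUDA build is 'latest'; variants get 'latest-cuXYZ'."""
    base = "vllm-openai:latest"
    return base if cuda_version == main else f"{base}-{cuda_tag(cuda_version)}"


if __name__ == "__main__":
    for version in ("13.0", "12.9"):
        print(version, repr(wheel_local_suffix(version)), docker_tag(version))
```

Passing `main="12.9"` reproduces the "Before" column, which makes the swap easy to compare side by side.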

CUDA architecture lists

New arch lists, following PyTorch RELEASE.md:

| Platform | CUDA 13.0 (default) | CUDA 12.9 (variant) |
|---|---|---|
| x86_64 | 7.5 8.0 8.6 8.9 9.0 10.0 12.0+PTX | 7.5 8.0 8.6 8.9 9.0 10.0 12.0 |
| aarch64 | 8.0 8.7 8.9 9.0 10.0 11.0 12.0+PTX | 8.0 8.7 8.9 9.0 10.0 12.0 |

Volta (SM70) is removed since it is no longer supported by PyTorch 2.11 with newer CUDA. SM75 is Turing, not Volta, and remains in the x86_64 lists above.

Notable inclusions beyond upstream PyTorch defaults:

  • SM86: broader Ampere coverage (e.g. GA10x GPUs such as RTX 3060 and RTX 3070)
  • SM89: required for marlin fp8 kernels (QMMA.16832.F32.E4M3.E4M3 SASS instruction, only supported on SM89 and SM120+)

CUDA 12.9 builds omit +PTX since they are not the forward-compatible default.
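
To make the effect of `+PTX` concrete, here is a minimal sketch of which devices an arch list covers. It rests on a simplified rule that is an assumption of this example, not something stated in the PR: native code for X.Y runs on devices with the same major version and a minor version of at least Y, while `+PTX` additionally lets the driver JIT-compile for any newer device. The function names are hypothetical, and real toolchains have per-device exceptions.

```python
# Simplified model of which GPUs a TORCH_CUDA_ARCH_LIST-style string covers.
# Assumptions (not from the PR): native code for X.Y runs on devices with the
# same major version and minor >= Y; "+PTX" additionally embeds PTX that the
# driver can JIT-compile for any newer device. Per-device exceptions are
# ignored here.

def parse_arch_list(arch_list: str) -> list[tuple[int, int, bool]]:
    """Parse '8.0 9.0 12.0+PTX' into (major, minor, has_ptx) tuples."""
    entries = []
    # Accept ';' separators as well as spaces.
    for token in arch_list.replace(";", " ").split():
        has_ptx = token.endswith("+PTX")
        major, minor = token.removesuffix("+PTX").split(".")
        entries.append((int(major), int(minor), has_ptx))
    return entries


def is_covered(capability: tuple[int, int], arch_list: str) -> bool:
    """Return True if a device with this compute capability can run the build."""
    dev_major, dev_minor = capability
    for major, minor, has_ptx in parse_arch_list(arch_list):
        if dev_major == major and dev_minor >= minor:
            return True  # binary-compatible native code
        if has_ptx and (dev_major, dev_minor) >= (major, minor):
            return True  # forward compatibility through PTX JIT
    return False


X86_CU130 = "7.5 8.0 8.6 8.9 9.0 10.0 12.0+PTX"
X86_CU129 = "7.5 8.0 8.6 8.9 9.0 10.0 12.0"

if __name__ == "__main__":
    print(is_covered((7, 0), X86_CU130))   # V100 (SM70)
    print(is_covered((13, 0), X86_CU130))  # hypothetical future major version
    print(is_covered((13, 0), X86_CU129))  # same device, build without PTX
```

Under these assumptions a compute capability 7.0 device is covered by neither list, and only the `+PTX` list covers a device newer than 12.x.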

Arch lists are defined as top-level env: variables in release-pipeline.yaml (CUDA_ARCH_X86, CUDA_ARCH_AARCH64, CUDA_ARCH_X86_CU129, CUDA_ARCH_AARCH64_CU129) and referenced via ${...} in all build commands, eliminating 12 hardcoded duplicates.

Files: release-pipeline.yaml, docker/Dockerfile, docker/versions.json, image_build_torch_nightly.sh, .github/workflows/scripts/build.sh
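
As a rough illustration of why four pipeline-level variables are enough, the sketch below picks a variable by CPU architecture and CUDA version and renders a build argument from it. The variable names, the arch lists, and the `torch_cuda_arch_list` build-arg name come from this PR; the helper functions and the fallback dictionary are hypothetical and do not reflect how release-pipeline.yaml is actually written.

```python
import os

# Arch lists from the table in this PR description, used as fallbacks when
# the pipeline-level variables are not set in the environment.
DEFAULTS = {
    "CUDA_ARCH_X86": "7.5 8.0 8.6 8.9 9.0 10.0 12.0+PTX",
    "CUDA_ARCH_AARCH64": "8.0 8.7 8.9 9.0 10.0 11.0 12.0+PTX",
    "CUDA_ARCH_X86_CU129": "7.5 8.0 8.6 8.9 9.0 10.0 12.0",
    "CUDA_ARCH_AARCH64_CU129": "8.0 8.7 8.9 9.0 10.0 12.0",
}


def arch_env_name(machine: str, cuda_version: str) -> str:
    """Pick the variable name for a (CPU architecture, CUDA version) pair."""
    cpu = {"x86_64": "X86", "aarch64": "AARCH64"}[machine]
    suffix = "_CU129" if cuda_version == "12.9" else ""
    return f"CUDA_ARCH_{cpu}{suffix}"


def build_arg(machine: str, cuda_version: str, env=os.environ) -> str:
    """Render the Docker build-arg for one build step (hypothetical helper)."""
    name = arch_env_name(machine, cuda_version)
    value = env.get(name, DEFAULTS[name])
    return f"--build-arg torch_cuda_arch_list='{value}'"


if __name__ == "__main__":
    for machine in ("x86_64", "aarch64"):
        for cuda in ("13.0", "12.9"):
            print(machine, cuda, build_arg(machine, cuda))
```

Keeping each list in one place means a future arch change touches a single definition instead of every build command.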

FLASHINFER_AOT_COMPILE cleanup

The FLASHINFER_AOT_COMPILE Dockerfile ARG was removed in #32627, but stale references remained in the release pipeline and docker-bake.hcl. This PR removes them.

Test Plan

Will be tested by CI and the release pipeline.

Test Result

CI pipeline: https://buildkite.com/vllm/ci/builds/61430/. This PR does not introduce new regressions.
Release pipeline: https://buildkite.com/vllm/release-v2/builds/665/. All passed.


Harry-Chen and others added 3 commits April 15, 2026 15:11
Make CUDA 13.0 the default build variant and CUDA 12.9 the optional
variant. This affects:

- VLLM_MAIN_CUDA_VERSION: 12.9 -> 13.0 (envs.py)
- Default wheel variant alias: cu129 -> cu130 (nightly index)
- Release pipeline: CUDA 13.0 images/wheels are now the default
  (no suffix), CUDA 12.9 becomes the cu129 variant
- Release annotations: updated wheel filenames and Docker tags
- Ray compatibility: PyTorch index URL updated to cu130

Signed-off-by: Shengqi Chen <harry-chen@outlook.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Update CUDA architecture lists following PyTorch RELEASE.md
(https://github.com/pytorch/pytorch/blob/main/RELEASE.md).

Default (CUDA 13.0):
  x86_64:  7.5 8.0 8.6 8.9 9.0 10.0 12.0+PTX
  aarch64: 8.0 8.7 8.9 9.0 10.0 11.0 12.0+PTX

CUDA 12.9 variant: same arches without +PTX forward compat.

Notable arch inclusions beyond PyTorch defaults:
- SM86 for broader Ampere coverage (e.g. RTX 3060/3070)
- SM89 for marlin fp8 support (QMMA SASS instruction)

Define CUDA_ARCH_{X86,AARCH64}{,_CU129} env vars in the release
pipeline to reduce repetition across wheel and Docker image builds.
Also update Dockerfile default and CI build scripts to match.

Signed-off-by: Shengqi Chen <harry-chen@outlook.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

The FLASHINFER_AOT_COMPILE ARG was removed from the Dockerfile in
PR vllm-project#32627, but stale references remained in the release pipeline and
docker-bake.hcl. Clean them up.

Signed-off-by: Shengqi Chen <harry-chen@outlook.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Copilot AI review requested due to automatic review settings April 15, 2026 07:34
@Harry-Chen

@claude review

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request transitions the build and release pipelines to CUDA 13.0 as the primary version, updating environment variables, Docker build arguments, and script defaults accordingly. Feedback indicates that the updated CUDA architecture lists omit support for SM70 (Volta/V100) and SM121 (DGX Spark), which may cause regressions for users on that hardware. Furthermore, a missing manylinux version argument was noted in the nightly wheel upload script for CUDA 12.9 builds.

Copilot AI left a comment

Pull request overview

Updates vLLM’s build/release configuration to make CUDA 13.0 the default (CUDA 12.9 becomes the cu129 variant), refreshes CUDA arch lists, and removes stale build-arg usage related to FlashInfer.

Changes:

  • Switch default CUDA version from 12.9 → 13.0 across env/config and Buildkite scripts.
  • Update TORCH_CUDA_ARCH_LIST / torch_cuda_arch_list defaults and pipeline arch lists (including +PTX for the default variant).
  • Remove stale FLASHINFER_AOT_COMPILE build-arg references from Docker bake targets and Buildkite build commands.

Reviewed changes

Copilot reviewed 10 out of 10 changed files in this pull request and generated 1 comment.

| File | Description |
|---|---|
| vllm/envs.py | Bumps VLLM_MAIN_CUDA_VERSION default to 13.0. |
| docker/versions.json | Updates default TORCH_CUDA_ARCH_LIST to the new arch set and +PTX. |
| docker/docker-bake.hcl | Removes FLASHINFER_AOT_COMPILE build arg from targets. |
| docker/Dockerfile | Updates torch_cuda_arch_list default used to set TORCH_CUDA_ARCH_LIST in build stages. |
| .github/workflows/scripts/build.sh | Refreshes wheel build TORCH_CUDA_ARCH_LIST used in GHA builds. |
| .buildkite/scripts/generate-and-upload-nightly-index.sh | Changes default nightly index alias from cu129 to cu130. |
| .buildkite/scripts/check-ray-compatibility.sh | Updates PyTorch CUDA index URL default to cu130. |
| .buildkite/scripts/annotate-release.sh | Adjusts annotated wheel/image commands to reflect new default/variant CUDA mapping. |
| .buildkite/release-pipeline.yaml | Reworks release pipeline to default to CUDA 13.0; introduces top-level arch env vars; updates tags/manifests/variant naming. |
| .buildkite/image_build/image_build_torch_nightly.sh | Updates nightly build torch_cuda_arch_list used for the torch-nightly test image. |

Harry-Chen and others added 2 commits April 15, 2026 15:42
The upload-nightly-wheels.sh script defaults to manylinux_2_31, so
the aarch64 CUDA 12.9 wheel was already getting the right tag. But
x86_64 CUDA 12.9 had the arg explicitly while aarch64 did not —
make it explicit for consistency.

Signed-off-by: Shengqi Chen <harry-chen@outlook.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Shengqi Chen <harry-chen@outlook.com>

ehfd commented Apr 18, 2026

Are we completely abandoning Volta, even for CUDA 12?

(Worker pid=369) ERROR 04-18 15:45:48 [multiproc_executor.py:870] torch.AcceleratorError: CUDA error: no kernel image is available for execution on the device
(Worker pid=369) ERROR 04-18 15:45:48 [multiproc_executor.py:870] Search for `cudaErrorNoKernelImageForDevice' in https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__TYPES.html for more information.
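
When triaging a report like this one, it can help to pull the device and build compute capabilities out of the PyTorch `UserWarning` in the log instead of reading it by eye. The sketch below is a hypothetical helper, not part of vLLM or PyTorch; its regular expressions assume the exact warning wording quoted in this comment.

```python
import re

# Patterns follow the wording of the PyTorch UserWarning quoted in this
# comment; other PyTorch versions may phrase the warning differently.
FOUND_RE = re.compile(
    r"Found GPU(\d+) (.+?) which is of compute capability \(CC\) (\d+)\.(\d+)"
)
BUILT_RE = re.compile(r"- (\d+)\.(\d+) which supports hardware CC")


def summarize_warning(log_text: str):
    """Return ({gpu_index: (name, (major, minor))}, sorted built CCs)."""
    devices = {}
    built = set()
    for line in log_text.splitlines():
        found = FOUND_RE.search(line)
        if found:
            index, name, major, minor = found.groups()
            devices[int(index)] = (name, (int(major), int(minor)))
            continue
        listed = BUILT_RE.search(line)
        if listed:
            built.add((int(listed.group(1)), int(listed.group(2))))
    return devices, sorted(built)


def below_minimum(devices, built):
    """List GPU indices whose CC is lower than every CC the build targets."""
    if not built:
        return []
    lowest = built[0]
    return sorted(i for i, (_, cc) in devices.items() if cc < lowest)
```

Feeding it the worker log from this comment should flag both V100 devices, since their compute capability 7.0 is below the lowest listed target of 7.5.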
Details
(APIServer pid=1) INFO 04-18 15:44:41 [utils.py:299] 
(APIServer pid=1) INFO 04-18 15:44:41 [utils.py:299]        █     █     █▄   ▄█
(APIServer pid=1) INFO 04-18 15:44:41 [utils.py:299]  ▄▄ ▄█ █     █     █ ▀▄▀ █  version 0.19.1rc1.dev391+g80b18230e
(APIServer pid=1) INFO 04-18 15:44:41 [utils.py:299]   █▄█▀ █     █     █     █  model   Qwen/Qwen3-VL-Embedding-8B
(APIServer pid=1) INFO 04-18 15:44:41 [utils.py:299]    ▀▀  ▀▀▀▀▀ ▀▀▀▀▀ ▀     ▀
(APIServer pid=1) INFO 04-18 15:44:41 [utils.py:299] 
(APIServer pid=1) INFO 04-18 15:44:41 [utils.py:233] non-default args: {'api_server_count': 4, 'host': '0.0.0.0', 'port': 5000, 'model': 'Qwen/Qwen3-VL-Embedding-8B', 'runner': 'pooling', 'convert': 'embed', 'trust_remote_code': True, 'max_model_len': 262144, 'download_dir': '/workspace/.cache/huggingface/hub', 'tensor_parallel_size': 2, 'gpu_memory_utilization': 0.95, 'mm_processor_cache_gb': 8.0, 'mm_processor_cache_type': 'shm'}
(APIServer pid=1) INFO 04-18 15:44:43 [config.py:947] Found sentence-transformers tokenize configuration.
(APIServer pid=1) INFO 04-18 15:44:58 [model.py:554] Resolved architecture: Qwen3VLForConditionalGeneration
(APIServer pid=1) INFO 04-18 15:44:59 [config.py:835] Found sentence-transformers modules configuration.
(APIServer pid=1) WARNING 04-18 15:44:59 [model.py:1970] Your device 'Tesla V100-SXM2-32GB' (with compute capability 7.0) doesn't support torch.bfloat16. Falling back to torch.float16 for compatibility.
(APIServer pid=1) WARNING 04-18 15:44:59 [model.py:2023] Casting torch.bfloat16 to torch.float16.
(APIServer pid=1) INFO 04-18 15:44:59 [model.py:1685] Using max model len 262144
(APIServer pid=1) INFO 04-18 15:44:59 [vllm.py:834] Asynchronous scheduling is disabled.
(APIServer pid=1) INFO 04-18 15:44:59 [kernel.py:199] Final IR op priority after setting platform defaults: IrOpPriorityConfig(rms_norm=['native'])
(APIServer pid=1) WARNING 04-18 15:44:59 [vllm.py:1033] Pooling models do not support full cudagraphs. Overriding cudagraph_mode to PIECEWISE.
(APIServer pid=1) `Qwen2VLImageProcessorFast` is deprecated. The `Fast` suffix for image processors has been removed; use `Qwen2VLImageProcessor` instead.
(APIServer pid=1) The `use_fast` parameter is deprecated and will be removed in a future version. Use `backend="torchvision"` instead of `use_fast=True`, or `backend="pil"` instead of `use_fast=False`.
(EngineCore pid=279) INFO 04-18 15:45:26 [core.py:107] Initializing a V1 LLM engine (v0.19.1rc1.dev391+g80b18230e) with config: model='Qwen/Qwen3-VL-Embedding-8B', speculative_config=None, tokenizer='Qwen/Qwen3-VL-Embedding-8B', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.float16, max_seq_len=262144, download_dir='/workspace/.cache/huggingface/hub', load_format=auto, tensor_parallel_size=2, pipeline_parallel_size=1, data_parallel_size=1, decode_context_parallel_size=1, dcp_comm_backend=ag_rs, disable_custom_all_reduce=False, quantization=None, quantization_config=None, enforce_eager=False, enable_return_routed_experts=False, kv_cache_dtype=auto, device_config=cuda, structured_outputs_config=StructuredOutputsConfig(backend='auto', disable_any_whitespace=False, disable_additional_properties=False, reasoning_parser='', reasoning_parser_plugin='', enable_in_reasoning=False), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None, kv_cache_metrics=False, kv_cache_metrics_sample=0.01, cudagraph_metrics=False, enable_layerwise_nvtx_tracing=False, enable_mfu_metrics=False, enable_mm_processor_stats=False, enable_logging_iteration_details=False), seed=0, served_model_name=Qwen/Qwen3-VL-Embedding-8B, enable_prefix_caching=True, enable_chunked_prefill=True, pooler_config=PoolerConfig(task=None, pooling_type=None, seq_pooling_type='LAST', tok_pooling_type='ALL', use_activation=None, dimensions=None, enable_chunked_processing=False, max_embed_len=None, logit_mean=None, logit_sigma=None, logit_bias=None, logit_scale=None, step_tag_id=None, returned_token_ids=None), compilation_config={'mode': <CompilationMode.VLLM_COMPILE: 3>, 'debug_dump_path': None, 'cache_dir': '', 'compile_cache_save_format': 'binary', 'backend': 'inductor', 'custom_ops': ['none'], 'ir_enable_torch_wrap': True, 'splitting_ops': ['vllm::unified_attention_with_output', 'vllm::unified_mla_attention_with_output', 'vllm::mamba_mixer2', 'vllm::mamba_mixer', 'vllm::short_conv', 'vllm::linear_attention', 'vllm::plamo2_mamba_mixer', 'vllm::gdn_attention_core', 'vllm::olmo_hybrid_gdn_full_forward', 'vllm::kda_attention', 'vllm::sparse_attn_indexer', 'vllm::rocm_aiter_sparse_attn_indexer', 'vllm::unified_kv_cache_update', 'vllm::unified_mla_kv_cache_update'], 'compile_mm_encoder': False, 'cudagraph_mm_encoder': False, 'encoder_cudagraph_token_budgets': [], 'encoder_cudagraph_max_vision_items_per_batch': 0, 'encoder_cudagraph_max_frames_per_batch': 0, 'compile_sizes': [], 'compile_ranges_endpoints': [2048], 'inductor_compile_config': {'enable_auto_functionalized_v2': False, 'size_asserts': False, 'alignment_asserts': False, 'scalar_asserts': False, 'combo_kernels': True, 'benchmark_combo_kernel': True}, 'inductor_passes': {}, 'cudagraph_mode': <CUDAGraphMode.PIECEWISE: 1>, 'cudagraph_num_of_warmups': 1, 'cudagraph_capture_sizes': [1, 2, 4, 8, 16, 24, 32, 40, 48, 56, 64, 72, 80, 88, 96, 104, 112, 120, 128, 136, 144, 152, 160, 168, 176, 184, 192, 200, 208, 216, 224, 232, 240, 248, 256, 272, 288, 304, 320, 336, 352, 368, 384, 400, 416, 432, 448, 464, 480, 496, 512], 'cudagraph_copy_inputs': False, 'cudagraph_specialize_lora': True, 'use_inductor_graph_partition': False, 'pass_config': {'fuse_norm_quant': False, 'fuse_act_quant': False, 'fuse_attn_quant': False, 'enable_sp': False, 'fuse_gemm_comms': False, 'fuse_allreduce_rms': False}, 'max_cudagraph_capture_size': 512, 'dynamic_shapes_config': {'type': <DynamicShapesType.BACKED: 'backed'>, 'evaluate_guards': False, 'assume_32_bit_indexing': False}, 'local_cache_dir': None, 'fast_moe_cold_start': False, 'static_all_moe_layers': []}, kernel_config=KernelConfig(ir_op_priority=IrOpPriorityConfig(rms_norm=['native']), enable_flashinfer_autotune=True, moe_backend='auto')
(EngineCore pid=279) WARNING 04-18 15:45:26 [multiproc_executor.py:1029] Reducing Torch parallelism from 39 threads to 1 to avoid unnecessary CPU contention. Set OMP_NUM_THREADS in the external environment to tune this value as needed.
(EngineCore pid=279) INFO 04-18 15:45:26 [multiproc_executor.py:139] DP group leader: node_rank=0, node_rank_within_dp=0, master_addr=127.0.0.1, mq_connect_ip=10.244.112.43 (local), world_size=2, local_world_size=2
`Qwen2VLImageProcessorFast` is deprecated. The `Fast` suffix for image processors has been removed; use `Qwen2VLImageProcessor` instead.
INFO 04-18 15:45:38 [config.py:947] Found sentence-transformers tokenize configuration.
(Worker pid=364) /usr/local/lib/python3.12/dist-packages/torch/cuda/__init__.py:371: UserWarning: Found GPU0 Tesla V100-SXM2-32GB which is of compute capability (CC) 7.0.
(Worker pid=364) The following list shows the CCs this version of PyTorch was built for and the hardware CCs it supports:
(Worker pid=364) - 7.5 which supports hardware CC >=7.5,<8.0
(Worker pid=364) - 8.0 which supports hardware CC >=8.0,<9.0 except {8.7}
(Worker pid=364) - 8.6 which supports hardware CC >=8.6,<9.0 except {8.7}
(Worker pid=364) - 9.0 which supports hardware CC >=9.0,<10.0
(Worker pid=364) - 10.0 which supports hardware CC >=10.0,<11.0 except {10.1}
(Worker pid=364) - 12.0 which supports hardware CC >=12.0,<13.0
(Worker pid=364) - 12.0 which supports hardware CC >=12.0,<13.0
(Worker pid=364) Please follow the instructions at https://pytorch.org/get-started/locally/ to install a PyTorch release that supports one of these CUDA versions: 12.6, 12.8
(Worker pid=364)   _warn_unsupported_code(d, device_cc, code_ccs)
(Worker pid=364) /usr/local/lib/python3.12/dist-packages/torch/cuda/__init__.py:371: UserWarning: Found GPU1 Tesla V100-SXM2-32GB which is of compute capability (CC) 7.0.
(Worker pid=364) The following list shows the CCs this version of PyTorch was built for and the hardware CCs it supports:
(Worker pid=364) - 7.5 which supports hardware CC >=7.5,<8.0
(Worker pid=364) - 8.0 which supports hardware CC >=8.0,<9.0 except {8.7}
(Worker pid=364) - 8.6 which supports hardware CC >=8.6,<9.0 except {8.7}
(Worker pid=364) - 9.0 which supports hardware CC >=9.0,<10.0
(Worker pid=364) - 10.0 which supports hardware CC >=10.0,<11.0 except {10.1}
(Worker pid=364) - 12.0 which supports hardware CC >=12.0,<13.0
(Worker pid=364) - 12.0 which supports hardware CC >=12.0,<13.0
(Worker pid=364) Please follow the instructions at https://pytorch.org/get-started/locally/ to install a PyTorch release that supports one of these CUDA versions: 12.6, 12.8
(Worker pid=364)   _warn_unsupported_code(d, device_cc, code_ccs)
(Worker pid=364) INFO 04-18 15:45:39 [parallel_state.py:1400] world_size=2 rank=0 local_rank=0 distributed_init_method=tcp://127.0.0.1:58837 backend=nccl
`Qwen2VLImageProcessorFast` is deprecated. The `Fast` suffix for image processors has been removed; use `Qwen2VLImageProcessor` instead.
INFO 04-18 15:45:46 [config.py:947] Found sentence-transformers tokenize configuration.
(Worker pid=369) /usr/local/lib/python3.12/dist-packages/torch/cuda/__init__.py:371: UserWarning: Found GPU0 Tesla V100-SXM2-32GB which is of compute capability (CC) 7.0.
(Worker pid=369) The following list shows the CCs this version of PyTorch was built for and the hardware CCs it supports:
(Worker pid=369) - 7.5 which supports hardware CC >=7.5,<8.0
(Worker pid=369) - 8.0 which supports hardware CC >=8.0,<9.0 except {8.7}
(Worker pid=369) - 8.6 which supports hardware CC >=8.6,<9.0 except {8.7}
(Worker pid=369) - 9.0 which supports hardware CC >=9.0,<10.0
(Worker pid=369) - 10.0 which supports hardware CC >=10.0,<11.0 except {10.1}
(Worker pid=369) - 12.0 which supports hardware CC >=12.0,<13.0
(Worker pid=369) - 12.0 which supports hardware CC >=12.0,<13.0
(Worker pid=369) Please follow the instructions at https://pytorch.org/get-started/locally/ to install a PyTorch release that supports one of these CUDA versions: 12.6, 12.8
(Worker pid=369)   _warn_unsupported_code(d, device_cc, code_ccs)
(Worker pid=369) /usr/local/lib/python3.12/dist-packages/torch/cuda/__init__.py:371: UserWarning: Found GPU1 Tesla V100-SXM2-32GB which is of compute capability (CC) 7.0.
(Worker pid=369) The following list shows the CCs this version of PyTorch was built for and the hardware CCs it supports:
(Worker pid=369) - 7.5 which supports hardware CC >=7.5,<8.0
(Worker pid=369) - 8.0 which supports hardware CC >=8.0,<9.0 except {8.7}
(Worker pid=369) - 8.6 which supports hardware CC >=8.6,<9.0 except {8.7}
(Worker pid=369) - 9.0 which supports hardware CC >=9.0,<10.0
(Worker pid=369) - 10.0 which supports hardware CC >=10.0,<11.0 except {10.1}
(Worker pid=369) - 12.0 which supports hardware CC >=12.0,<13.0
(Worker pid=369) - 12.0 which supports hardware CC >=12.0,<13.0
(Worker pid=369) Please follow the instructions at https://pytorch.org/get-started/locally/ to install a PyTorch release that supports one of these CUDA versions: 12.6, 12.8
(Worker pid=369)   _warn_unsupported_code(d, device_cc, code_ccs)
(Worker pid=369) INFO 04-18 15:45:46 [parallel_state.py:1400] world_size=2 rank=1 local_rank=1 distributed_init_method=tcp://127.0.0.1:58837 backend=nccl
(Worker pid=364) <frozen importlib._bootstrap_external>:1301: FutureWarning: The cuda.cudart module is deprecated and will be removed in a future release, please switch to use the cuda.bindings.runtime module instead.
(Worker pid=369) <frozen importlib._bootstrap_external>:1301: FutureWarning: The cuda.cudart module is deprecated and will be removed in a future release, please switch to use the cuda.bindings.runtime module instead.
(Worker pid=364) <frozen importlib._bootstrap_external>:1301: FutureWarning: The cuda.nvrtc module is deprecated and will be removed in a future release, please switch to use the cuda.bindings.nvrtc module instead.
(Worker pid=369) <frozen importlib._bootstrap_external>:1301: FutureWarning: The cuda.nvrtc module is deprecated and will be removed in a future release, please switch to use the cuda.bindings.nvrtc module instead.
(Worker pid=364) INFO 04-18 15:45:47 [pynccl.py:111] vLLM is using nccl==2.28.9
(Worker pid=369) ERROR 04-18 15:45:48 [multiproc_executor.py:870] WorkerProc failed to start.
(Worker pid=369) ERROR 04-18 15:45:48 [multiproc_executor.py:870] Traceback (most recent call last):
(Worker pid=369) ERROR 04-18 15:45:48 [multiproc_executor.py:870]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/multiproc_executor.py", line 837, in worker_main
(Worker pid=369) ERROR 04-18 15:45:48 [multiproc_executor.py:870]     worker = WorkerProc(*args, **kwargs)
(Worker pid=369) ERROR 04-18 15:45:48 [multiproc_executor.py:870]              ^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker pid=369) ERROR 04-18 15:45:48 [multiproc_executor.py:870]   File "/usr/local/lib/python3.12/dist-packages/vllm/tracing/otel.py", line 178, in sync_wrapper
(Worker pid=369) ERROR 04-18 15:45:48 [multiproc_executor.py:870]     return func(*args, **kwargs)
(Worker pid=369) ERROR 04-18 15:45:48 [multiproc_executor.py:870]            ^^^^^^^^^^^^^^^^^^^^^
(Worker pid=369) ERROR 04-18 15:45:48 [multiproc_executor.py:870]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/multiproc_executor.py", line 611, in __init__
(Worker pid=369) ERROR 04-18 15:45:48 [multiproc_executor.py:870]     self.worker.init_device()
(Worker pid=369) ERROR 04-18 15:45:48 [multiproc_executor.py:870]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/worker_base.py", line 317, in init_device
(Worker pid=369) ERROR 04-18 15:45:48 [multiproc_executor.py:870]     self.worker.init_device()  # type: ignore
(Worker pid=369) ERROR 04-18 15:45:48 [multiproc_executor.py:870]     ^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker pid=369) ERROR 04-18 15:45:48 [multiproc_executor.py:870]   File "/usr/local/lib/python3.12/dist-packages/vllm/tracing/otel.py", line 178, in sync_wrapper
(Worker pid=369) ERROR 04-18 15:45:48 [multiproc_executor.py:870]     return func(*args, **kwargs)
(Worker pid=369) ERROR 04-18 15:45:48 [multiproc_executor.py:870]            ^^^^^^^^^^^^^^^^^^^^^
(Worker pid=369) ERROR 04-18 15:45:48 [multiproc_executor.py:870]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_worker.py", line 263, in init_device
(Worker pid=369) ERROR 04-18 15:45:48 [multiproc_executor.py:870]     init_worker_distributed_environment(
(Worker pid=369) ERROR 04-18 15:45:48 [multiproc_executor.py:870]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_worker.py", line 1051, in init_worker_distributed_environment
(Worker pid=369) ERROR 04-18 15:45:48 [multiproc_executor.py:870]     ensure_model_parallel_initialized(
(Worker pid=369) ERROR 04-18 15:45:48 [multiproc_executor.py:870]   File "/usr/local/lib/python3.12/dist-packages/vllm/distributed/parallel_state.py", line 1745, in ensure_model_parallel_initialized
(Worker pid=369) ERROR 04-18 15:45:48 [multiproc_executor.py:870]     initialize_model_parallel(
(Worker pid=369) ERROR 04-18 15:45:48 [multiproc_executor.py:870]   File "/usr/local/lib/python3.12/dist-packages/vllm/distributed/parallel_state.py", line 1576, in initialize_model_parallel
(Worker pid=369) ERROR 04-18 15:45:48 [multiproc_executor.py:870]     _TP = init_model_parallel_group(
(Worker pid=369) ERROR 04-18 15:45:48 [multiproc_executor.py:870]           ^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker pid=369) ERROR 04-18 15:45:48 [multiproc_executor.py:870]   File "/usr/local/lib/python3.12/dist-packages/vllm/distributed/parallel_state.py", line 1157, in init_model_parallel_group
(Worker pid=369) ERROR 04-18 15:45:48 [multiproc_executor.py:870]     return GroupCoordinator(
(Worker pid=369) ERROR 04-18 15:45:48 [multiproc_executor.py:870]            ^^^^^^^^^^^^^^^^^
(Worker pid=369) ERROR 04-18 15:45:48 [multiproc_executor.py:870]   File "/usr/local/lib/python3.12/dist-packages/vllm/distributed/parallel_state.py", line 376, in __init__
(Worker pid=369) ERROR 04-18 15:45:48 [multiproc_executor.py:870]     self.device_communicator = device_comm_cls(
(Worker pid=369) ERROR 04-18 15:45:48 [multiproc_executor.py:870]                                ^^^^^^^^^^^^^^^^
(Worker pid=369) ERROR 04-18 15:45:48 [multiproc_executor.py:870]   File "/usr/local/lib/python3.12/dist-packages/vllm/distributed/device_communicators/cuda_communicator.py", line 75, in __init__
(Worker pid=369) ERROR 04-18 15:45:48 [multiproc_executor.py:870]     self.pynccl_comm = PyNcclCommunicator(
(Worker pid=369) ERROR 04-18 15:45:48 [multiproc_executor.py:870]                        ^^^^^^^^^^^^^^^^^^^
(Worker pid=369) ERROR 04-18 15:45:48 [multiproc_executor.py:870]   File "/usr/local/lib/python3.12/dist-packages/vllm/distributed/device_communicators/pynccl.py", line 143, in __init__
(Worker pid=369) ERROR 04-18 15:45:48 [multiproc_executor.py:870]     data = torch.zeros(1, device=device)
(Worker pid=369) ERROR 04-18 15:45:48 [multiproc_executor.py:870]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker pid=369) ERROR 04-18 15:45:48 [multiproc_executor.py:870] torch.AcceleratorError: CUDA error: no kernel image is available for execution on the device
(Worker pid=369) ERROR 04-18 15:45:48 [multiproc_executor.py:870] Search for `cudaErrorNoKernelImageForDevice' in https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__TYPES.html for more information.
(Worker pid=369) ERROR 04-18 15:45:48 [multiproc_executor.py:870] CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
(Worker pid=369) ERROR 04-18 15:45:48 [multiproc_executor.py:870] For debugging consider passing CUDA_LAUNCH_BLOCKING=1
(Worker pid=369) ERROR 04-18 15:45:48 [multiproc_executor.py:870] Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
(Worker pid=369) ERROR 04-18 15:45:48 [multiproc_executor.py:870] 
(Worker pid=364) ERROR 04-18 15:45:48 [multiproc_executor.py:870] WorkerProc failed to start.
(Worker pid=364) ERROR 04-18 15:45:48 [multiproc_executor.py:870] Traceback (most recent call last):
(Worker pid=364) ERROR 04-18 15:45:48 [multiproc_executor.py:870]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/multiproc_executor.py", line 837, in worker_main
(Worker pid=364) ERROR 04-18 15:45:48 [multiproc_executor.py:870]     worker = WorkerProc(*args, **kwargs)
(Worker pid=364) ERROR 04-18 15:45:48 [multiproc_executor.py:870]              ^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker pid=364) ERROR 04-18 15:45:48 [multiproc_executor.py:870]   File "/usr/local/lib/python3.12/dist-packages/vllm/tracing/otel.py", line 178, in sync_wrapper
(Worker pid=364) ERROR 04-18 15:45:48 [multiproc_executor.py:870]     return func(*args, **kwargs)
(Worker pid=364) ERROR 04-18 15:45:48 [multiproc_executor.py:870]            ^^^^^^^^^^^^^^^^^^^^^
(Worker pid=364) ERROR 04-18 15:45:48 [multiproc_executor.py:870]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/multiproc_executor.py", line 611, in __init__
(Worker pid=364) ERROR 04-18 15:45:48 [multiproc_executor.py:870]     self.worker.init_device()
(Worker pid=364) ERROR 04-18 15:45:48 [multiproc_executor.py:870]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/worker_base.py", line 317, in init_device
(Worker pid=364) ERROR 04-18 15:45:48 [multiproc_executor.py:870]     self.worker.init_device()  # type: ignore
(Worker pid=364) ERROR 04-18 15:45:48 [multiproc_executor.py:870]     ^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker pid=364) ERROR 04-18 15:45:48 [multiproc_executor.py:870]   File "/usr/local/lib/python3.12/dist-packages/vllm/tracing/otel.py", line 178, in sync_wrapper
(Worker pid=364) ERROR 04-18 15:45:48 [multiproc_executor.py:870]     return func(*args, **kwargs)
(Worker pid=364) ERROR 04-18 15:45:48 [multiproc_executor.py:870]            ^^^^^^^^^^^^^^^^^^^^^
(Worker pid=364) ERROR 04-18 15:45:48 [multiproc_executor.py:870]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_worker.py", line 263, in init_device
(Worker pid=364) ERROR 04-18 15:45:48 [multiproc_executor.py:870]     init_worker_distributed_environment(
(Worker pid=364) ERROR 04-18 15:45:48 [multiproc_executor.py:870]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_worker.py", line 1051, in init_worker_distributed_environment
(Worker pid=364) ERROR 04-18 15:45:48 [multiproc_executor.py:870]     ensure_model_parallel_initialized(
(Worker pid=364) ERROR 04-18 15:45:48 [multiproc_executor.py:870]   File "/usr/local/lib/python3.12/dist-packages/vllm/distributed/parallel_state.py", line 1745, in ensure_model_parallel_initialized
(Worker pid=364) ERROR 04-18 15:45:48 [multiproc_executor.py:870]     initialize_model_parallel(
(Worker pid=364) ERROR 04-18 15:45:48 [multiproc_executor.py:870]   File "/usr/local/lib/python3.12/dist-packages/vllm/distributed/parallel_state.py", line 1576, in initialize_model_parallel
(Worker pid=364) ERROR 04-18 15:45:48 [multiproc_executor.py:870]     _TP = init_model_parallel_group(
(Worker pid=364) ERROR 04-18 15:45:48 [multiproc_executor.py:870]           ^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker pid=364) ERROR 04-18 15:45:48 [multiproc_executor.py:870]   File "/usr/local/lib/python3.12/dist-packages/vllm/distributed/parallel_state.py", line 1157, in init_model_parallel_group
(Worker pid=364) ERROR 04-18 15:45:48 [multiproc_executor.py:870]     return GroupCoordinator(
(Worker pid=364) ERROR 04-18 15:45:48 [multiproc_executor.py:870]            ^^^^^^^^^^^^^^^^^
(Worker pid=364) ERROR 04-18 15:45:48 [multiproc_executor.py:870]   File "/usr/local/lib/python3.12/dist-packages/vllm/distributed/parallel_state.py", line 376, in __init__
(Worker pid=364) ERROR 04-18 15:45:48 [multiproc_executor.py:870]     self.device_communicator = device_comm_cls(
(Worker pid=364) ERROR 04-18 15:45:48 [multiproc_executor.py:870]                                ^^^^^^^^^^^^^^^^
(Worker pid=364) ERROR 04-18 15:45:48 [multiproc_executor.py:870]   File "/usr/local/lib/python3.12/dist-packages/vllm/distributed/device_communicators/cuda_communicator.py", line 75, in __init__
(Worker pid=364) ERROR 04-18 15:45:48 [multiproc_executor.py:870]     self.pynccl_comm = PyNcclCommunicator(
(Worker pid=364) ERROR 04-18 15:45:48 [multiproc_executor.py:870]                        ^^^^^^^^^^^^^^^^^^^
(Worker pid=364) ERROR 04-18 15:45:48 [multiproc_executor.py:870]   File "/usr/local/lib/python3.12/dist-packages/vllm/distributed/device_communicators/pynccl.py", line 143, in __init__
(Worker pid=364) ERROR 04-18 15:45:48 [multiproc_executor.py:870]     data = torch.zeros(1, device=device)
(Worker pid=364) ERROR 04-18 15:45:48 [multiproc_executor.py:870]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker pid=364) ERROR 04-18 15:45:48 [multiproc_executor.py:870] torch.AcceleratorError: CUDA error: no kernel image is available for execution on the device
(Worker pid=364) ERROR 04-18 15:45:48 [multiproc_executor.py:870] Search for `cudaErrorNoKernelImageForDevice' in https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__TYPES.html for more information.
(Worker pid=364) ERROR 04-18 15:45:48 [multiproc_executor.py:870] CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
(Worker pid=364) ERROR 04-18 15:45:48 [multiproc_executor.py:870] For debugging consider passing CUDA_LAUNCH_BLOCKING=1
(Worker pid=364) ERROR 04-18 15:45:48 [multiproc_executor.py:870] Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
(Worker pid=364) ERROR 04-18 15:45:48 [multiproc_executor.py:870] 
[rank0]:[W418 15:45:49.697881747 ProcessGroupNCCL.cpp:1575] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())
(EngineCore pid=279) ERROR 04-18 15:45:50 [core.py:1132] EngineCore failed to start.
(EngineCore pid=279) ERROR 04-18 15:45:50 [core.py:1132] Traceback (most recent call last):
(EngineCore pid=279) ERROR 04-18 15:45:50 [core.py:1132]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 1106, in run_engine_core
(EngineCore pid=279) ERROR 04-18 15:45:50 [core.py:1132]     engine_core = EngineCoreProc(*args, engine_index=dp_rank, **kwargs)
(EngineCore pid=279) ERROR 04-18 15:45:50 [core.py:1132]                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=279) ERROR 04-18 15:45:50 [core.py:1132]   File "/usr/local/lib/python3.12/dist-packages/vllm/tracing/otel.py", line 178, in sync_wrapper
(EngineCore pid=279) ERROR 04-18 15:45:50 [core.py:1132]     return func(*args, **kwargs)
(EngineCore pid=279) ERROR 04-18 15:45:50 [core.py:1132]            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=279) ERROR 04-18 15:45:50 [core.py:1132]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 872, in __init__
(EngineCore pid=279) ERROR 04-18 15:45:50 [core.py:1132]     super().__init__(
(EngineCore pid=279) ERROR 04-18 15:45:50 [core.py:1132]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 116, in __init__
(EngineCore pid=279) ERROR 04-18 15:45:50 [core.py:1132]     self.model_executor = executor_class(vllm_config)
(EngineCore pid=279) ERROR 04-18 15:45:50 [core.py:1132]                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=279) ERROR 04-18 15:45:50 [core.py:1132]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/multiproc_executor.py", line 107, in __init__
(EngineCore pid=279) ERROR 04-18 15:45:50 [core.py:1132]     super().__init__(vllm_config)
(EngineCore pid=279) ERROR 04-18 15:45:50 [core.py:1132]   File "/usr/local/lib/python3.12/dist-packages/vllm/tracing/otel.py", line 178, in sync_wrapper
(EngineCore pid=279) ERROR 04-18 15:45:50 [core.py:1132]     return func(*args, **kwargs)
(EngineCore pid=279) ERROR 04-18 15:45:50 [core.py:1132]            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=279) ERROR 04-18 15:45:50 [core.py:1132]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/abstract.py", line 109, in __init__
(EngineCore pid=279) ERROR 04-18 15:45:50 [core.py:1132]     self._init_executor()
(EngineCore pid=279) ERROR 04-18 15:45:50 [core.py:1132]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/multiproc_executor.py", line 200, in _init_executor
(EngineCore pid=279) ERROR 04-18 15:45:50 [core.py:1132]     self.workers = WorkerProc.wait_for_ready(unready_workers)
(EngineCore pid=279) Process EngineCore:
(EngineCore pid=279) ERROR 04-18 15:45:50 [core.py:1132]                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=279) ERROR 04-18 15:45:50 [core.py:1132]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/multiproc_executor.py", line 747, in wait_for_ready
(EngineCore pid=279) ERROR 04-18 15:45:50 [core.py:1132]     raise e from None
(EngineCore pid=279) ERROR 04-18 15:45:50 [core.py:1132] Exception: WorkerProc initialization failed due to an exception in a background process. See stack trace for root cause.
(EngineCore pid=279) Traceback (most recent call last):
(EngineCore pid=279)   File "/usr/lib/python3.12/multiprocessing/process.py", line 314, in _bootstrap
(EngineCore pid=279)     self.run()
(EngineCore pid=279)   File "/usr/lib/python3.12/multiprocessing/process.py", line 108, in run
(EngineCore pid=279)     self._target(*self._args, **self._kwargs)
(EngineCore pid=279)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 1136, in run_engine_core
(EngineCore pid=279)     raise e
(EngineCore pid=279)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 1106, in run_engine_core
(EngineCore pid=279)     engine_core = EngineCoreProc(*args, engine_index=dp_rank, **kwargs)
(EngineCore pid=279)                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=279)   File "/usr/local/lib/python3.12/dist-packages/vllm/tracing/otel.py", line 178, in sync_wrapper
(EngineCore pid=279)     return func(*args, **kwargs)
(EngineCore pid=279)            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=279)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 872, in __init__
(EngineCore pid=279)     super().__init__(
(EngineCore pid=279)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 116, in __init__
(EngineCore pid=279)     self.model_executor = executor_class(vllm_config)
(EngineCore pid=279)                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=279)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/multiproc_executor.py", line 107, in __init__
(EngineCore pid=279)     super().__init__(vllm_config)
(EngineCore pid=279)   File "/usr/local/lib/python3.12/dist-packages/vllm/tracing/otel.py", line 178, in sync_wrapper
(EngineCore pid=279)     return func(*args, **kwargs)
(EngineCore pid=279)            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=279)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/abstract.py", line 109, in __init__
(EngineCore pid=279)     self._init_executor()
(EngineCore pid=279)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/multiproc_executor.py", line 200, in _init_executor
(EngineCore pid=279)     self.workers = WorkerProc.wait_for_ready(unready_workers)
(EngineCore pid=279)                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=279)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/multiproc_executor.py", line 747, in wait_for_ready
(EngineCore pid=279)     raise e from None
(EngineCore pid=279) Exception: WorkerProc initialization failed due to an exception in a background process. See stack trace for root cause.
(APIServer pid=1) Traceback (most recent call last):
(APIServer pid=1)   File "<frozen runpy>", line 198, in _run_module_as_main
(APIServer pid=1)   File "<frozen runpy>", line 88, in _run_code
(APIServer pid=1)   File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 718, in <module>
(APIServer pid=1)     uvloop.run(run_server(args))
(APIServer pid=1)   File "/usr/local/lib/python3.12/dist-packages/uvloop/__init__.py", line 96, in run
(APIServer pid=1)     return __asyncio.run(
(APIServer pid=1)            ^^^^^^^^^^^^^^
(APIServer pid=1)   File "/usr/lib/python3.12/asyncio/runners.py", line 195, in run
(APIServer pid=1)     return runner.run(main)
(APIServer pid=1)            ^^^^^^^^^^^^^^^^
(APIServer pid=1)   File "/usr/lib/python3.12/asyncio/runners.py", line 118, in run
(APIServer pid=1)     return self._loop.run_until_complete(task)
(APIServer pid=1)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1)   File "uvloop/loop.pyx", line 1518, in uvloop.loop.Loop.run_until_complete
(APIServer pid=1)   File "/usr/local/lib/python3.12/dist-packages/uvloop/__init__.py", line 48, in wrapper
(APIServer pid=1)     return await main
(APIServer pid=1)            ^^^^^^^^^^
(APIServer pid=1)   File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 678, in run_server
(APIServer pid=1)     await run_server_worker(listen_address, sock, args, **uvicorn_kwargs)
(APIServer pid=1)   File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 692, in run_server_worker
(APIServer pid=1)     async with build_async_engine_client(
(APIServer pid=1)                ^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1)   File "/usr/lib/python3.12/contextlib.py", line 210, in __aenter__
(APIServer pid=1)     return await anext(self.gen)
(APIServer pid=1)            ^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1)   File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 100, in build_async_engine_client
(APIServer pid=1)     async with build_async_engine_client_from_engine_args(
(APIServer pid=1)                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1)   File "/usr/lib/python3.12/contextlib.py", line 210, in __aenter__
(APIServer pid=1)     return await anext(self.gen)
(APIServer pid=1)            ^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1)   File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 136, in build_async_engine_client_from_engine_args
(APIServer pid=1)     async_llm = AsyncLLM.from_vllm_config(
(APIServer pid=1)                 ^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/async_llm.py", line 219, in from_vllm_config
(APIServer pid=1)     return cls(
(APIServer pid=1)            ^^^^
(APIServer pid=1)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/async_llm.py", line 148, in __init__
(APIServer pid=1)     self.engine_core = EngineCoreClient.make_async_mp_client(
(APIServer pid=1)                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1)   File "/usr/local/lib/python3.12/dist-packages/vllm/tracing/otel.py", line 178, in sync_wrapper
(APIServer pid=1)     return func(*args, **kwargs)
(APIServer pid=1)            ^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core_client.py", line 130, in make_async_mp_client
(APIServer pid=1)     return AsyncMPClient(*client_args)
(APIServer pid=1)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1)   File "/usr/local/lib/python3.12/dist-packages/vllm/tracing/otel.py", line 178, in sync_wrapper
(APIServer pid=1)     return func(*args, **kwargs)
(APIServer pid=1)            ^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core_client.py", line 900, in __init__
(APIServer pid=1)     super().__init__(
(APIServer pid=1)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core_client.py", line 535, in __init__
(APIServer pid=1)     with launch_core_engines(
(APIServer pid=1)          ^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1)   File "/usr/lib/python3.12/contextlib.py", line 144, in __exit__
(APIServer pid=1)     next(self.gen)
(APIServer pid=1)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/utils.py", line 1094, in launch_core_engines
(APIServer pid=1)     wait_for_engine_startup(
(APIServer pid=1)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/utils.py", line 1153, in wait_for_engine_startup
(APIServer pid=1)     raise RuntimeError(
(APIServer pid=1) RuntimeError: Engine core initialization failed. See root cause above. Failed core proc(s): {}

@Harry-Chen
Member Author

@ehfd PyTorch has dropped Volta support from its CUDA 12.8 builds: pytorch/pytorch#172351
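
For readers hitting the same `no kernel image is available` error, here is a minimal sketch of the check behind it. It assumes the failing device was a Volta GPU with compute capability 7.0 (the log itself does not say so) and uses a hypothetical flag list shaped like the output of `torch.cuda.get_arch_list()`; the real list depends on the installed PyTorch build. The helper names `parse_flag` and `has_kernel_image` are made up for illustration.

```python
from typing import Optional


def parse_flag(flag: str) -> Optional[tuple[str, tuple[int, int]]]:
    """Split 'sm_75' or 'compute_120' into (kind, (major, minor)).

    Returns None for flags with a letter suffix (for example 'sm_90a'),
    which this sketch deliberately does not model.
    """
    kind, _, digits = flag.partition("_")
    if not digits.isdigit() or len(digits) < 2:
        return None
    # The last digit is the minor version; everything before it is the major.
    return kind, (int(digits[:-1]), int(digits[-1]))


def has_kernel_image(arch_flags: list[str], capability: tuple[int, int]) -> bool:
    """Return True if some compiled target can run on a device of `capability`."""
    for flag in arch_flags:
        parsed = parse_flag(flag)
        if parsed is None:
            continue
        kind, (major, minor) = parsed
        if kind == "sm":
            # A cubin runs on the same major version with an equal or higher minor.
            if major == capability[0] and minor <= capability[1]:
                return True
        elif kind == "compute":
            # PTX can be JIT-compiled for any device at or above its target.
            if (major, minor) <= capability:
                return True
    return False


# Hypothetical flags mirroring the x86_64 list "7.5 8.0 8.6 8.9 9.0 10.0 12.0+PTX".
X86_FLAGS = ["sm_75", "sm_80", "sm_86", "sm_89", "sm_90",
             "sm_100", "sm_120", "compute_120"]

print(has_kernel_image(X86_FLAGS, (7, 0)))  # False: nothing targets Volta
print(has_kernel_image(X86_FLAGS, (8, 6)))  # True: an sm_86 cubin is present
```

On a real machine the two inputs would come from `torch.cuda.get_arch_list()` and `torch.cuda.get_device_capability()`.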

Comment on lines +5 to +9
CUDA_ARCH_X86: "7.5 8.0 8.6 8.9 9.0 10.0 12.0+PTX"
# aarch64 only architectures: 8.7 for Orin, 11.0 for Thor (since CUDA 13)
CUDA_ARCH_AARCH64: "8.0 8.7 8.9 9.0 10.0 11.0 12.0+PTX"
CUDA_ARCH_X86_CU129: "7.5 8.0 8.6 8.9 9.0 10.0 12.0"
CUDA_ARCH_AARCH64_CU129: "8.0 8.7 8.9 9.0 10.0 12.0"
Contributor

Suggested change
- CUDA_ARCH_X86: "7.5 8.0 8.6 8.9 9.0 10.0 12.0+PTX"
- # aarch64 only architectures: 8.7 for Orin, 11.0 for Thor (since CUDA 13)
- CUDA_ARCH_AARCH64: "8.0 8.7 8.9 9.0 10.0 11.0 12.0+PTX"
- CUDA_ARCH_X86_CU129: "7.5 8.0 8.6 8.9 9.0 10.0 12.0"
- CUDA_ARCH_AARCH64_CU129: "8.0 8.7 8.9 9.0 10.0 12.0"
+ CUDA_ARCH_X86: "7.5 8.0 8.6 8.9 9.0 10.0 10.3 12.0+PTX"
+ # aarch64 only architectures: 8.7 for Orin, 11.0 for Thor (since CUDA 13)
+ CUDA_ARCH_AARCH64: "8.0 8.6 8.7 8.9 9.0 10.0 10.3 11.0 12.0+PTX"
+ CUDA_ARCH_X86_CU129: "7.5 8.0 8.6 8.9 9.0 10.0 12.0"
+ CUDA_ARCH_AARCH64_CU129: "8.0 8.6 8.7 8.9 9.0 10.0 12.0"

Both 8.6 and 10.3 apply to x86 and aarch64;
see https://developer.nvidia.com/cuda/gpus
8.6 - covers a long list of PCIe GPUs
10.3 - B300 (works with x86) and GB300 (ARM-based Grace CPU).

Member Author

Thanks!

  • We have never compiled for sm_86 on aarch64, and IIRC it can use the cubin from sm_80.
  • For CUDA 13+, we use the family specifier 10.0f, so 10.3 is covered in kernels that require tcgen05; for other kernels, 10.0 should be enough for compatibility, I think.
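
The cubin-reuse argument in the two points above can be sketched as a small lookup. This is only an illustration of the general rule that a cubin built for one compute capability runs on devices with the same major version and an equal or higher minor version; it does not model the `10.0f` family specifier or which kernels need tcgen05. The function name `covering_entry` is hypothetical.

```python
from typing import Optional


def covering_entry(arch_list: str, proposed: str) -> Optional[str]:
    """Return the entry of `arch_list` whose cubin a `proposed` device could reuse.

    Picks the closest entry with the same major version and a minor version
    that does not exceed the device's. Returns None if there is no such entry.
    """
    p_major, p_minor = (int(x) for x in proposed.split("."))
    best: Optional[tuple[int, str]] = None
    for entry in arch_list.split():
        base = entry.split("+")[0]  # drop a trailing "+PTX" marker
        major, minor = (int(x) for x in base.split("."))
        if major == p_major and minor <= p_minor:
            if best is None or minor > best[0]:
                best = (minor, base)
    return best[1] if best else None


# aarch64 list for CUDA 13.0 as defined in this PR.
AARCH64_CU130 = "8.0 8.7 8.9 9.0 10.0 11.0 12.0+PTX"

for cc in ("8.6", "10.3"):
    print(cc, "->", covering_entry(AARCH64_CU130, cc))
# 8.6 -> 8.0
# 10.3 -> 10.0
```

Reuse of this kind keeps the device working but does not give it code tuned for its own architecture, which is the trade-off the reviewer's suggestion was weighing.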

export MAX_JOBS=1
# Make sure release wheels are built for the following architectures
- export TORCH_CUDA_ARCH_LIST="7.0 7.5 8.0 8.6 8.9 9.0+PTX"
+ export TORCH_CUDA_ARCH_LIST="7.5 8.0 8.6 8.9 9.0 10.0 12.0+PTX"
Contributor

Suggested change
- export TORCH_CUDA_ARCH_LIST="7.5 8.0 8.6 8.9 9.0 10.0 12.0+PTX"
+ export TORCH_CUDA_ARCH_LIST="7.5 8.0 8.6 8.9 9.0 10.0 10.3 12.0+PTX"

see https://developer.nvidia.com/cuda/gpus
10.3 - B300 (works with x86) and GB300 (ARM-based Grace CPU).

Member Author

This list is no longer used since we do not run GitHub Actions. It is kept for reference only.

Comment thread docker/Dockerfile

# install kv_connectors if requested
- ARG torch_cuda_arch_list='7.0 7.5 8.0 8.9 9.0 10.0 12.0'
+ ARG torch_cuda_arch_list='7.5 8.0 8.6 8.9 9.0 10.0 12.0+PTX'
Contributor

Suggested change
- ARG torch_cuda_arch_list='7.5 8.0 8.6 8.9 9.0 10.0 12.0+PTX'
+ ARG torch_cuda_arch_list='7.5 8.0 8.6 8.9 9.0 10.0 10.3 12.0+PTX'

see https://developer.nvidia.com/cuda/gpus
10.3 - B300 (works with x86) and GB300 (ARM-based Grace CPU).

Member Author

Same as above: on CUDA 13, 10.3 is absorbed by the 10.0f family in the kernels that support it, and it can generally reuse the 10.0 cubin.

Comment thread docker/Dockerfile
# See https://github.com/pytorch/pytorch/pull/123243
# From versions.json: .torch.cuda_arch_list
- ARG torch_cuda_arch_list='7.0 7.5 8.0 8.9 9.0 10.0 12.0'
+ ARG torch_cuda_arch_list='7.5 8.0 8.6 8.9 9.0 10.0 12.0+PTX'
Contributor

Suggested change
- ARG torch_cuda_arch_list='7.5 8.0 8.6 8.9 9.0 10.0 12.0+PTX'
+ ARG torch_cuda_arch_list='7.5 8.0 8.6 8.9 9.0 10.0 10.3 12.0+PTX'

see https://developer.nvidia.com/cuda/gpus
10.3 - B300 (works with x86) and GB300 (ARM-based Grace CPU).

Member Author

Same as above: on CUDA 13, 10.3 is absorbed by the 10.0f family in the kernels that support it, and it can generally reuse the 10.0 cubin.

Comment thread docker/versions.json
},
"TORCH_CUDA_ARCH_LIST": {
- "default": "7.0 7.5 8.0 8.9 9.0 10.0 12.0"
+ "default": "7.5 8.0 8.6 8.9 9.0 10.0 12.0+PTX"
Contributor

Suggested change
- "default": "7.5 8.0 8.6 8.9 9.0 10.0 12.0+PTX"
+ "default": "7.5 8.0 8.6 8.9 9.0 10.0 10.3 12.0+PTX"

see https://developer.nvidia.com/cuda/gpus
10.3 - B300 (works with x86) and GB300 (ARM-based Grace CPU).

Member Author

Same as above: on CUDA 13, 10.3 is absorbed by the 10.0f family in the kernels that support it, and it can generally reuse the 10.0 cubin.

Comment thread CMakeLists.txt
Comment on lines +100 to +103
set(CUDA_SUPPORTED_ARCHS "7.5;8.0;8.6;8.7;8.9;9.0;10.0;11.0;12.0")
elseif(DEFINED CMAKE_CUDA_COMPILER_VERSION AND
CMAKE_CUDA_COMPILER_VERSION VERSION_GREATER_EQUAL 12.8)
- set(CUDA_SUPPORTED_ARCHS "7.0;7.2;7.5;8.0;8.6;8.7;8.9;9.0;10.0;10.1;12.0;12.1")
+ set(CUDA_SUPPORTED_ARCHS "7.5;8.0;8.6;8.7;8.9;9.0;10.0;10.1;10.3;12.0;12.1")
Contributor

Suggested change
- set(CUDA_SUPPORTED_ARCHS "7.5;8.0;8.6;8.7;8.9;9.0;10.0;11.0;12.0")
- elseif(DEFINED CMAKE_CUDA_COMPILER_VERSION AND
- CMAKE_CUDA_COMPILER_VERSION VERSION_GREATER_EQUAL 12.8)
- set(CUDA_SUPPORTED_ARCHS "7.0;7.2;7.5;8.0;8.6;8.7;8.9;9.0;10.0;10.1;12.0;12.1")
- set(CUDA_SUPPORTED_ARCHS "7.5;8.0;8.6;8.7;8.9;9.0;10.0;10.1;10.3;12.0;12.1")
+ set(CUDA_SUPPORTED_ARCHS "7.5;8.0;8.6;8.7;8.9;9.0;10.0;10.1;10.3;12.0;12.1")
+ elseif(DEFINED CMAKE_CUDA_COMPILER_VERSION AND
+ CMAKE_CUDA_COMPILER_VERSION VERSION_GREATER_EQUAL 12.8)
+ set(CUDA_SUPPORTED_ARCHS "7.5;8.0;8.6;8.7;8.9;9.0;10.0;11.0;12.0")

I think the lists need to be flipped.
CUDA 12.8 doesn't support 10.3, 11.0, or 12.1.
see https://docs.nvidia.com/cuda/archive/12.8.2/cuda-compiler-driver-nvcc/index.html#virtual-architecture-feature-list

Member Author

We are only using CUDA 12.9, so it should be safe to use those architectures. Please also see the comments above, which explain why we no longer specify 10.3 and 12.1 as of CUDA 13.0.

Contributor

Ack. It could mislead users that the condition checks for CUDA 12.8 when in practice it is only used with CUDA 12.9.
BTW, 12.9 doesn't support CC 11.0; see https://docs.nvidia.com/cuda/archive/12.9.1/cuda-compiler-driver-nvcc/index.html#virtual-architecture-feature-list
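
The disagreement in this thread comes down to which architectures each toolkit can actually target. A rough sketch of that check follows, using only the exclusions stated in the comments above (they are not an exhaustive reading of the nvcc documentation); `UNSUPPORTED` and `rejected_archs` are hypothetical names.

```python
# Architectures the reviewer reports as unsupported by each nvcc release.
# Taken from this thread only; treat it as an assumption, not a full table.
UNSUPPORTED = {
    "12.8": {"10.3", "11.0", "12.1"},
    "12.9": {"11.0"},
}


def rejected_archs(cuda_version: str, supported_archs: str) -> list[str]:
    """Return entries of a semicolon-separated CMake list that the toolkit cannot target."""
    blocked = UNSUPPORTED.get(cuda_version, set())
    return [arch for arch in supported_archs.split(";") if arch in blocked]


# The two CUDA_SUPPORTED_ARCHS lists from the CMakeLists.txt hunk under review.
LIST_WITH_11_0 = "7.5;8.0;8.6;8.7;8.9;9.0;10.0;11.0;12.0"
LIST_WITH_10_3 = "7.5;8.0;8.6;8.7;8.9;9.0;10.0;10.1;10.3;12.0;12.1"

for version in ("12.8", "12.9"):
    print(version,
          rejected_archs(version, LIST_WITH_11_0),
          rejected_archs(version, LIST_WITH_10_3))
# 12.8 ['11.0'] ['10.3', '12.1']
# 12.9 ['11.0'] []
```

Under these assumptions, the list containing 10.3 and 12.1 is clean for CUDA 12.9 but not for 12.8, which matches both the author's reply and the reviewer's note about the `VERSION_GREATER_EQUAL 12.8` condition.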

Contributor

@dmitry-tokarev-nv dmitry-tokarev-nv left a comment

Posted comments with suggestions.

@github-project-automation github-project-automation Bot moved this to In review in NVIDIA Apr 23, 2026
@youkaichao youkaichao merged commit 3ed5231 into vllm-project:main Apr 23, 2026
143 of 146 checks passed
@github-project-automation github-project-automation Bot moved this from In review to Done in NVIDIA Apr 23, 2026
khluu pushed a commit that referenced this pull request Apr 23, 2026
…clean up stale build-args (#39878)

Signed-off-by: Shengqi Chen <harry-chen@outlook.com>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
(cherry picked from commit 3ed5231)
@Harry-Chen Harry-Chen deleted the default-cu130 branch April 27, 2026 14:12
avinashsingh77 pushed a commit to avinashsingh77/vllm that referenced this pull request Apr 27, 2026
…clean up stale build-args (vllm-project#39878)

Signed-off-by: Shengqi Chen <harry-chen@outlook.com>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Avinash Singh <avinashsingh.rcoem@gmail.com>
Lafunamor pushed a commit to Lafunamor/vllm that referenced this pull request May 1, 2026
…clean up stale build-args (vllm-project#39878)

Signed-off-by: Shengqi Chen <harry-chen@outlook.com>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Adrian <info@zzit.ch>
Copilot AI pushed a commit to hongbolv/vllm that referenced this pull request May 7, 2026
…clean up stale build-args (vllm-project#39878)

Signed-off-by: Shengqi Chen <harry-chen@outlook.com>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-authored-by: hongbolv <33214277+hongbolv@users.noreply.github.com>
weifang231 pushed a commit to weifang231/eb-vllm that referenced this pull request May 13, 2026
…clean up stale build-args (vllm-project#39878)

Signed-off-by: Shengqi Chen <harry-chen@outlook.com>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
my-other-github-account pushed a commit to my-other-github-account/vllm that referenced this pull request May 15, 2026
…clean up stale build-args (vllm-project#39878)

Signed-off-by: Shengqi Chen <harry-chen@outlook.com>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
mfylcek pushed a commit to mfylcek/vllm that referenced this pull request May 19, 2026
…clean up stale build-args (vllm-project#39878)

Signed-off-by: Shengqi Chen <harry-chen@outlook.com>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
jhu960213 pushed a commit to jhu960213/vllm that referenced this pull request May 20, 2026
…clean up stale build-args (vllm-project#39878)

Signed-off-by: Shengqi Chen <harry-chen@outlook.com>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Projects

Status: Done

5 participants