[WIP][Core] Update PyTorch to 2.12.0, torchvision to 0.27.0, triton to 3.7.0 #40077
atalman wants to merge 7 commits into
Code Review
This pull request updates the project's PyTorch dependency from version 2.11.0 to 2.12.0 across multiple configuration files, Dockerfiles, and requirement specifications, while also switching to the PyTorch 'test' index URLs. Feedback highlights critical version mismatches where torchaudio remains at 2.11.0, which could cause ABI compatibility issues. Additionally, the ROCm index URL may need to be updated to the 'test' channel to ensure the new packages are found.
 # Common dependencies
 -r common.txt

 numba == 0.61.2 # Required for N-gram speculative decoding

 # Dependencies for NVIDIA GPUs
-torch==2.11.0
+torch==2.12.0
The torch version is updated to 2.12.0, but torchaudio (line 9) remains at 2.11.0. PyTorch ecosystem packages (torch, torchvision, torchaudio) are tightly coupled and typically require matching minor versions. A mismatch here will likely lead to dependency resolution failures or runtime ABI compatibility issues. Please ensure torchaudio is also updated to 2.12.0 to match the rest of the ecosystem.
No, this is expected: torchaudio stays on 2.11.
@@ -3,10 +3,10 @@
 --extra-index-url https://download.pytorch.org/whl/rocm7.1
-torch==2.11.0
-torchvision==0.26.0
+torch==2.12.0
There are two potential issues here:
- Version Mismatch: `torch` is updated to `2.12.0`, but `torchaudio` (line 8) remains at `2.11.0`. These should be synchronized to avoid compatibility issues.
- Index URL: The index URL on line 5 points to the standard ROCm channel. If `torch 2.12.0` is only available on the `test` channel (as indicated in the PR description for CUDA/CPU), the ROCm build will fail to find the package. Consider updating the index to `https://download.pytorch.org/whl/test/rocm7.1` if the package is not in the main channel.
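The difference between the two channels discussed in this comment comes down to one path segment. A minimal sketch, assuming a hypothetical helper `pytorch_index_url` and illustrative channel labels `"release"` and `"test"` (neither is part of vLLM or PyTorch tooling):

```python
def pytorch_index_url(backend: str, channel: str = "release") -> str:
    """Build a download.pytorch.org wheel index URL.

    `backend` is a suffix such as "rocm7.1", "cu130", or "cpu".
    `channel` is "release" (the standard index) or "test" (the
    pre-release index). Both labels are illustrative assumptions.
    """
    base = "https://download.pytorch.org/whl"
    if channel == "test":
        return f"{base}/test/{backend}"
    if channel == "release":
        return f"{base}/{backend}"
    raise ValueError(f"unknown channel: {channel!r}")


if __name__ == "__main__":
    print(pytorch_index_url("rocm7.1"))
    print(pytorch_index_url("rocm7.1", channel="test"))
```

Whether a given wheel is actually present on either index still has to be checked against the index itself; the helper only builds the URL.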
@@ -1219,7 +1219,7 @@ tomli==2.2.1
 # via schemathesis
 tomli-w==1.2.0
 # via schemathesis
-torch==2.11.0+cu130
+torch==2.12.0+cu130
No, this is expected: torchaudio stays on 2.11.
d78656a to 9ecbbd3
ea825d8 to e2a9354
e2a9354 to b3b475f
Thanks for pushing this, @atalman. I can confirm the AGX Thor triton issues were fixed with 3.7.0:
uv pip install -U vllm --torch-backend=auto --extra-index-url https://wheels.vllm.ai/nightly/cu130
uv pip install --prerelease=allow --force-reinstall triton --index-url https://download.pytorch.org/whl/test/cu132
## Summary

Adds a Claude Code skill that automates end-to-end triage of failing vLLM Buildkite CI runs for PyTorch version-bump PRs. Derived from the multi-week triage of vLLM PR vllm-project/vllm#40077 (torch 2.12.0 + triton 3.7.0), which produced umbrella issue pytorch/pytorch#180899 with 25+ tracked sub-issues over a series of daily runs.

## What the skill does

- Pulls the failing build's job list from Buildkite REST API and filters to **true hard failures** (excluding `soft_failed=True`, `waiting_failed`, and infra-aborted jobs).
- Compares each failing job against recent main `Full CI run - nightly/daily` builds to drop **pre-existing failures**, with the caveat that infra-killed main jobs are not a valid baseline (must be retried first).
- Pulls and ANSI/timestamp-strips logs for the survivors and matches them against a curated set of **root-cause signature regexes** (Inductor MetaProxy, triton PassManager, AOT cache pickling, custom-op fake-kernel stride mismatch, GPU contention, FP8 / quantized accuracy drift, etc.).
- Routes each root cause to the right repo: pytorch/pytorch (torch / triton / Inductor / Dynamo / AOTAutograd) vs. vllm-project/vllm (multimodal model assertions, custom-op fake-kernel bugs, response APIs).
- Drafts upstream issues with reproducibility tables, environment blocks, and tracebacks, with a strict **draft→confirm→post** protocol, and links them under the umbrella.
- Manages umbrella checklist hygiene (mark closed, reopen on regressions, retract on false positives like the recent #182549 retraction).

## Notable lessons baked in

- `state=failed` + `soft_failed=True` is non-blocking; always filter both.
- `Engine core initialization failed. See root cause above.` is a red herring; the actual exception is several lines up in the EngineCore worker output.
- Custom-op `assert_size_stride` failures on `torch.ops.vllm.<X>.default` are almost always **vLLM-side fake-kernel bugs**, not torch regressions; inspect the `direct_register_custom_op(... fake_impl=...)` registration first.
- Bulk B200 `exit_status=125` + `nvidia-container-cli: device error / driver rpc error: timed out` is agent infra, not a regression; recommend rerun.
- When the same B200 infra cluster wipes out *both* the test-PR and the main-build coverage of a job, the comparison is **inconclusive**; ALWAYS request a main rerun before filing. Filing without that baseline produced a wrongful issue (#182549, retracted).
- Buildkite REST API rate-limit is 400/min. Token must be in a shell variable before parallel curl in `while read` (inline `$(cat …)` silently produces 0-byte log files).
- Title convention: `[vllm] [<sub-area tag>] <concise root cause>`. Always include `[vllm]` from the start (post-hoc edits are noisy).

## Test plan

This skill is invoked manually by Claude Code when the user points at a failing Buildkite build. Validation has been the actual triage of vLLM #40077 over 16+ daily runs since 2026-04-20:

- 25+ pytorch/pytorch issues filed under umbrella #180899, with reproducibility tables and environment blocks.
- Caught real regressions: AsyncTP correctness (#182124), Fullgraph Smoke Test (#182125), Batch Invariance B200 (#181248, fixed by Lucas Kabela's PR), MetaProxy in FP8 fusion (#180906), aten::bmm double-registration (#180905), and others.
- Caught the gpt-oss MoE custom-op stride mismatch as a vLLM-side bug (vllm-project/vllm#41645 / #41646), correctly routing it away from pytorch.
- Caught and retracted a false positive (#182549) once the main nightly was retried, demonstrating the infra-baseline lesson.

No automated test harness exists for Claude skills today; the skill is exercised through the live triage workflow.

## File added

- `.claude/skills/vllm-pytorch-ci-triage/SKILL.md` (379 lines)

This follows the same convention as the four existing skills under `.claude/skills/` (each is a single-file `SKILL.md` with YAML frontmatter).
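The hard-failure filter and the baseline comparison described in this summary can be sketched roughly as follows. This is an illustration under stated assumptions, not the skill's actual code: jobs are assumed to be dicts carrying `name`, `state`, `soft_failed`, and `exit_status`; `is_hard_failure` and `classify_failures` are hypothetical names; and treating `exit_status == 125` as the only infra-abort marker is a simplification of the B200 lesson.

```python
# Assumption: exit status 125 marks an infra-aborted job (from the B200 lesson).
INFRA_EXIT_STATUSES = {125}


def is_hard_failure(job: dict) -> bool:
    """True only for blocking, non-infra failures."""
    if job.get("state") != "failed":
        return False  # excludes waiting_failed, passed, canceled, ...
    if job.get("soft_failed"):
        return False  # soft failures are non-blocking
    if job.get("exit_status") in INFRA_EXIT_STATUSES:
        return False  # agent/infra abort, recommend a rerun instead
    return True


def classify_failures(pr_jobs: list, main_jobs: list) -> dict:
    """Bucket hard failures on the PR build against a main-branch baseline."""
    main_by_name = {job["name"]: job for job in main_jobs}
    buckets = {"new": [], "pre_existing": [], "inconclusive": []}
    for job in pr_jobs:
        if not is_hard_failure(job):
            continue
        baseline = main_by_name.get(job["name"])
        if baseline is None or baseline.get("exit_status") in INFRA_EXIT_STATUSES:
            # No valid baseline: main must be rerun before filing anything.
            buckets["inconclusive"].append(job["name"])
        elif is_hard_failure(baseline):
            buckets["pre_existing"].append(job["name"])
        else:
            buckets["new"].append(job["name"])
    return buckets
```

Only the `new` bucket would proceed to log analysis and issue drafting; the `inconclusive` bucket corresponds to the case that led to the retracted issue.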
Update PyTorch ecosystem versions:
- torch: 2.11.0 → 2.12.0
- torchvision: 0.26.0 → 0.27.0
- triton: 3.6.0 → 3.7.0
- torchaudio: stays at 2.11.0

Use PyTorch test index (download.pytorch.org/whl/test/) for CUDA and CPU packages since torch 2.12.0 is currently on the test channel.

Co-authored-by: Claude
Signed-off-by: atalman <atalman@meta.com>
Update nvidia-cudnn-cu13 (9.19.0.56 -> 9.20.0.48), nvidia-cusparselt-cu13 (0.8.0 -> 0.8.1), and nvidia-nccl-cu13 (2.28.9 -> 2.29.7) to resolve the dependency conflict with torch==2.12.0+cu130.

Co-authored-by: Claude <noreply@anthropic.com>
The `vllm-test-deps` stage copies requirements/test/cuda.in to cpu.in, but cuda.in's top-line `--extra-index-url` points at `https://download.pytorch.org/whl/test/cu130`, which only has CUDA 13 wheels. The old command also passed `--torch-backend cpu`, which pins torch lookups to the stable CPU channel (`whl/cpu`). Neither index has torch==2.12.0 yet, so uv pip compile fails with "No solution found... no version of torch==2.12.0".

Rewrite the extra-index-url in the seeded cpu.in to `whl/test/cpu` (which has torch-2.12.0+cpu wheels) and drop `--torch-backend cpu` so uv uses that index directly.

Co-authored-by: Claude <noreply@anthropic.com>
Signed-off-by: atalman <atalman@meta.com>
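The index rewrite described in this commit message can be illustrated with a small text transformation. This is a sketch only: the real change lives in the Docker build stage and may well use a different tool, and `seed_cpu_requirements` is a hypothetical name.

```python
import re

# URL taken from the commit message above.
CPU_TEST_INDEX = "https://download.pytorch.org/whl/test/cpu"


def seed_cpu_requirements(cuda_in_text: str, cpu_index: str = CPU_TEST_INDEX) -> str:
    """Return the cuda.in text with every --extra-index-url pointed at cpu_index."""
    pattern = re.compile(r"^(--extra-index-url\s+)\S+", flags=re.MULTILINE)
    # A function replacement avoids any backslash or group-reference
    # interpretation of the URL string.
    return pattern.sub(lambda match: match.group(1) + cpu_index, cuda_in_text)


if __name__ == "__main__":
    cuda_in = (
        "--extra-index-url https://download.pytorch.org/whl/test/cu130\n"
        "torch==2.12.0\n"
    )
    print(seed_cpu_requirements(cuda_in), end="")
```

Dropping `--torch-backend cpu` from the compile command is a separate step that this sketch does not cover.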
…DE compat tests

Fixes CPU-Compatibility Tests on torch 2.12.

The SDE-emulation tests previously set TORCH_COMPILE_DISABLE=1 to skip torch.compile (slow under SDE). On torch 2.11 this turned every torch.compile call site into a silent no-op. On torch 2.12, call sites that pass fullgraph=True now raise:

RuntimeError: Worker failed with error 'torch.compile with fullgraph=True found no compiled frames. The frame was likely skipped (...).'

Engine init goes through vLLM's piecewise-compile path, which uses fullgraph=True, so init crashes inside determine_available_memory.

Use vLLM's canonical --enforce-eager engine flag instead, which never constructs a torch.compile wrapper at all. Same speedup, no contract violation, works on both torch 2.11 and 2.12.

Tracked upstream as pytorch/pytorch#181247 (under umbrella pytorch/pytorch#180899).

Co-authored-by: Claude <noreply@anthropic.com>
Signed-off-by: atalman <atalman@meta.com>
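The switch from the environment variable to the engine flag might look like this in a test launcher. A minimal sketch, assuming a hypothetical helper `sde_test_invocation` and a `vllm serve` invocation; the actual SDE test script may start vLLM differently.

```python
import os


def sde_test_invocation(model: str, base_env=None):
    """Build (argv, env) for an eager-mode vLLM run.

    The environment is copied, and TORCH_COMPILE_DISABLE is dropped so the
    run relies on --enforce-eager alone rather than on disabling torch.compile.
    """
    env = dict(os.environ if base_env is None else base_env)
    env.pop("TORCH_COMPILE_DISABLE", None)
    argv = ["vllm", "serve", model, "--enforce-eager"]
    return argv, env


if __name__ == "__main__":
    # "some-org/some-model" is a placeholder, not a model named in this PR.
    argv, env = sde_test_invocation("some-org/some-model")
    print(" ".join(argv))
```

The returned pair can be passed to `subprocess.run(argv, env=env)` by whatever harness wraps the SDE run.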
Signed-off-by: Lucas Kabela <lucaskabela@meta.com>
…d-safe Hugging Face fast-tokenizer wrappers (vllm-project#41181)" This reverts commit 20dcd98.
…ad-safe Hugging Face fast-tokenizer wrappers (vllm-project#41181)" This reverts commit 4ee7407.
Update PyTorch ecosystem versions:
- torch: 2.11.0 -> 2.12.0
- torchvision: 0.26.0 -> 0.27.0
- triton: 3.6.0 -> 3.7.0
- torchaudio: stays at 2.11.0

Bump CUDA 13 deps to match torch 2.12.0+cu130:
- nvidia-cudnn-cu13: 9.19.0.56 -> 9.20.0.48
- nvidia-cusparselt-cu13: 0.8.0 -> 0.8.1
- nvidia-nccl-cu13: 2.28.9 -> 2.29.7

Use --enforce-eager instead of TORCH_COMPILE_DISABLE=1 in the CPU SDE compat test. On torch 2.11 TORCH_COMPILE_DISABLE turned torch.compile call sites into silent no-ops; on torch 2.12 sites that pass fullgraph=True now raise "found no compiled frames", which crashes engine init via vLLM's piecewise-compile path. --enforce-eager skips the wrapper entirely on both versions.

Supersedes vllm-project#40077 (release wheels are now published, so the download.pytorch.org/whl/test/ indexes are no longer needed).

Co-authored-by: Claude <noreply@anthropic.com>
Signed-off-by: atalman <atalman@meta.com>
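A bump like this touches many requirement files, so a quick scan for pins that were missed can help. The following is a sketch under stated assumptions: `find_stale_pins` is a hypothetical helper, the expected versions are copied from the commit message above, and only simple `name==version` lines are handled.

```python
# Target pins taken from the commit message above.
EXPECTED_PINS = {
    "torch": "2.12.0",
    "torchvision": "0.27.0",
    "triton": "3.7.0",
    "torchaudio": "2.11.0",  # intentionally stays on 2.11.0
    "nvidia-cudnn-cu13": "9.20.0.48",
    "nvidia-cusparselt-cu13": "0.8.1",
    "nvidia-nccl-cu13": "2.29.7",
}


def find_stale_pins(requirements_text: str, expected=EXPECTED_PINS) -> dict:
    """Map package name -> pinned version for pins that differ from `expected`."""
    stale = {}
    for raw_line in requirements_text.splitlines():
        line = raw_line.split("#", 1)[0].strip()  # drop trailing comments
        if "==" not in line:
            continue
        name, _, version = line.partition("==")
        name = name.strip().lower()
        version = version.split(";", 1)[0].strip()  # drop environment markers
        version = version.split("+", 1)[0]  # drop local tags such as +cu130
        if name in expected and version != expected[name]:
            stale[name] = version
    return stale


if __name__ == "__main__":
    sample = "torch==2.11.0\ntorchvision==0.27.0\ntorchaudio==2.11.0\n"
    print(find_stale_pins(sample))
```

Keeping `torchaudio` in the expected table at 2.11.0 encodes the decision stated in the commit message, so the scan does not flag it as a mismatch.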
This pull request has merge conflicts that must be resolved before it can be

Superseded by #42848.
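Update PyTorch ecosystem versions:
- torch: 2.11.0 → 2.12.0
- torchvision: 0.26.0 → 0.27.0
- triton: 3.6.0 → 3.7.0
- torchaudio: stays at 2.11.0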
Use PyTorch test index (download.pytorch.org/whl/test/) for CUDA and CPU packages since torch 2.12.0 is currently on the test channel.
Co-authored-by: Claude