Add vllm-pytorch-ci-triage skill by atalman · Pull Request #8045 · pytorch/test-infra

atalman · 2026-05-05T21:36:26Z

Summary

Adds a Claude Code skill that automates end-to-end triage of failing vLLM Buildkite CI runs for PyTorch version-bump PRs.

Derived from the multi-week triage of vLLM PR vllm-project/vllm#40077 (torch 2.12.0 + triton 3.7.0), which produced umbrella issue pytorch/pytorch#180899 with 25+ tracked sub-issues over a series of daily runs.

What the skill does

Pulls the failing build's job list from the Buildkite REST API and filters to true hard failures (excluding soft_failed=True, waiting_failed, and infra-aborted jobs).
Compares each failing job against recent main "Full CI run - nightly/daily" builds to drop pre-existing failures, with the caveat that an infra-killed main job is not a valid baseline (it must be retried first).
Pulls logs for the surviving jobs, strips ANSI codes and timestamps, and matches them against a curated set of root-cause signature regexes (Inductor MetaProxy, triton PassManager, AOT cache pickling, custom-op fake-kernel stride mismatch, GPU contention, FP8 / quantized accuracy drift, etc.).
Routes each root cause to the right repo: pytorch/pytorch (torch / triton / Inductor / Dynamo / AOTAutograd) vs. vllm-project/vllm (multimodal model assertions, custom-op fake-kernel bugs, response APIs).
Drafts upstream issues with reproducibility tables, environment blocks, and tracebacks, with a strict draft→confirm→post protocol, and links them under the umbrella.
Manages umbrella checklist hygiene (mark closed, reopen on regressions, retract on false positives like the recent #182549 retraction).
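
The first two steps above (hard-failure filtering and baseline comparison) can be sketched roughly as follows. This is a minimal illustration, not the skill's actual code: it assumes each job is a dict with the name, state, soft_failed, and exit_status fields of a Buildkite job object, and it assumes exit status 125 is a good enough marker for an infra-aborted job. The helper names (is_infra_aborted, hard_failures, compare_to_main) are hypothetical.

```python
# Assumption: exit status 125 marks an infra-aborted job (see the B200 lesson).
INFRA_EXIT_STATUSES = {125}


def is_infra_aborted(job: dict) -> bool:
    """Heuristic: True when the job looks killed by agent infrastructure."""
    status = job.get("exit_status")
    try:
        return status is not None and int(status) in INFRA_EXIT_STATUSES
    except (TypeError, ValueError):
        return False


def is_hard_failure(job: dict) -> bool:
    """A blocking failure: state is 'failed', not soft-failed, not infra-aborted."""
    if job.get("state") != "failed":  # also drops waiting_failed, passed, ...
        return False
    if job.get("soft_failed"):
        return False
    return not is_infra_aborted(job)


def hard_failures(jobs: list) -> list:
    """Keep only the jobs that count as true hard failures."""
    return [job for job in jobs if is_hard_failure(job)]


def compare_to_main(pr_failures: list, main_jobs: list) -> dict:
    """Label each PR failure as 'new', 'pre-existing', or 'inconclusive'."""
    main_by_name = {job["name"]: job for job in main_jobs}
    verdicts = {}
    for job in pr_failures:
        main = main_by_name.get(job["name"])
        if main is None or is_infra_aborted(main):
            # No usable baseline: main must be retried before filing anything.
            verdicts[job["name"]] = "inconclusive"
        elif is_hard_failure(main):
            verdicts[job["name"]] = "pre-existing"
        elif main.get("state") == "passed":
            verdicts[job["name"]] = "new"
        else:
            verdicts[job["name"]] = "inconclusive"
    return verdicts
```

Only jobs labeled "new" would move on to log analysis; "inconclusive" jobs would trigger a request to rerun main rather than an issue draft.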

Notable lessons baked in

A job with state=failed and soft_failed=True is non-blocking; always check both fields when filtering.
"Engine core initialization failed. See root cause above." is a red herring: the actual exception is several lines up in the EngineCore worker output.
Custom-op assert_size_stride failures on torch.ops.vllm.<X>.default are almost always vLLM-side fake-kernel bugs, not torch regressions — inspect the direct_register_custom_op(... fake_impl=...) registration first.
Bulk B200 failures with exit_status=125 and "nvidia-container-cli: device error" or "driver rpc error: timed out" in the log are agent infra problems, not regressions; recommend a rerun.
When the same B200 infra cluster wipes out both the test-PR and the main-build coverage of a job, the comparison is inconclusive — ALWAYS request a main rerun before filing. Filing without that baseline produced a wrongful issue (#182549, retracted).
The Buildkite REST API rate limit is 400 requests/min. Read the token into a shell variable before running parallel curl inside a while read loop; an inline $(cat …) silently produces 0-byte log files.
Title convention: [vllm] [<sub-area tag>] <concise root cause>. Always include [vllm] from the start (post-hoc edits are noisy).
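
The "red herring" lesson above can be illustrated with a short sketch that cleans a raw log and then walks upward from the marker line to the nearest exception-looking line. Everything here is an assumption made for illustration: the helper names are hypothetical, the timestamp patterns (a Buildkite-style ESC _bk;t=<ms> BEL sequence and an ISO-8601 line prefix) are guesses at the log format, and the exception regex is a rough heuristic rather than the skill's curated signatures.

```python
import re
from typing import Optional

# ANSI CSI sequences, plus an assumed Buildkite timestamp escape (ESC _bk;t=<ms> BEL).
ANSI_RE = re.compile(r"\x1b_bk;t=\d+\x07|\x1b\[[0-9;?]*[ -/]*[@-~]")
# Assumed line prefix such as "[2026-04-20T12:00:01Z] ".
TIMESTAMP_RE = re.compile(
    r"^\[?\d{4}-\d{2}-\d{2}[T ]\d{2}:\d{2}:\d{2}(?:\.\d+)?Z?\]?\s*"
)
MARKER = "Engine core initialization failed"
# Heuristic: a name ending in Error/Exception, optionally followed by ": message".
EXCEPTION_RE = re.compile(r"\b[A-Za-z_][\w.]*(?:Error|Exception)\b(?::.*)?$")


def clean_log(raw: str) -> list:
    """Strip escape sequences and timestamp prefixes from every line."""
    cleaned = []
    for line in raw.splitlines():
        line = ANSI_RE.sub("", line)
        line = TIMESTAMP_RE.sub("", line)
        cleaned.append(line.rstrip())
    return cleaned


def find_root_cause(lines: list) -> Optional[str]:
    """Return the nearest exception line above the first marker line, if any."""
    for idx, line in enumerate(lines):
        if MARKER not in line:
            continue
        for prev in reversed(lines[:idx]):
            match = EXCEPTION_RE.search(prev)
            if match and MARKER not in prev:
                return match.group(0)
        return None
    return None
```

If no marker is present, or nothing exception-like precedes it, the function returns None and the log would need manual inspection.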

Test plan

This skill is invoked manually by Claude Code when the user points at a failing Buildkite build. Validation has been the actual triage of vLLM #40077 over 16+ daily runs since 2026-04-20:

16+ pytorch/pytorch issues filed under umbrella #180899, with reproducibility tables and environment blocks.
Caught real regressions: AsyncTP correctness (#182124), Fullgraph Smoke Test (#182125), Batch Invariance B200 (#181248, fixed by Lucas Kabela's PR), MetaProxy in FP8 fusion (#180906), aten::bmm double-registration (#180905), and others.
Caught the gpt-oss MoE custom-op stride mismatch as a vLLM-side bug (vllm-project/vllm#41645, "[Bug]: gpt-oss MoE moe_forward fake-kernel shape mismatch breaks torch.compile + TP > 1 on Blackwell", and #41646), correctly routing it away from pytorch.
Caught and retracted a false positive (#182549) once the main nightly was retried, demonstrating the infra-baseline lesson.

No automated test harness exists for Claude skills today; the skill is exercised through the live triage workflow.

File added

.claude/skills/vllm-pytorch-ci-triage/SKILL.md (379 lines)

This follows the same convention as the four existing skills under .claude/skills/ (each is a single-file SKILL.md with YAML frontmatter).

Adds a Claude Code skill that automates triage of failing vLLM Buildkite CI runs for PyTorch version-bump PRs (e.g. vLLM #40077 for torch 2.12.0 + triton 3.7.0).

The workflow:
- pulls the failing build's job list and filters to true hard failures (excluding soft_failed and waiting_failed cascades),
- compares each failing job against recent main "Full CI run - nightly/daily" builds to drop pre-existing failures,
- classifies remaining failures by root-cause signature (Inductor MetaProxy, triton PassManager, AOT cache pickling, CUDA OOM / contention, custom-op fake-kernel stride mismatch, accuracy drift, etc.) and routes them to pytorch/pytorch vs vllm-project/vllm,
- drafts upstream issues with reproducibility tables, environment blocks, and tracebacks, then posts them under an umbrella issue.

Derived from the multi-week triage that produced umbrella pytorch/pytorch#180899.

Signed-off-by: atalman <atalman@meta.com>
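
The classification and routing step in the workflow above might look something like the sketch below. The signature table is illustrative only: the regexes are simple guesses built from strings mentioned in this PR description, not the curated set in SKILL.md, and the names Signature, SIGNATURES, and classify are hypothetical.

```python
import re
from typing import NamedTuple, Optional


class Signature(NamedTuple):
    label: str
    route: str        # target repo, or "rerun" for infra noise
    patterns: tuple   # every pattern must appear somewhere in the log


# Illustrative signatures only; order matters, first match wins.
SIGNATURES = (
    Signature(
        "agent infra",
        "rerun",
        (r"nvidia-container-cli: device error|driver rpc error: timed out",),
    ),
    Signature(
        "custom-op fake-kernel stride mismatch",
        "vllm-project/vllm",
        (r"assert_size_stride", r"torch\.ops\.vllm\."),
    ),
    Signature("Inductor MetaProxy", "pytorch/pytorch", (r"MetaProxy",)),
    Signature("triton PassManager", "pytorch/pytorch", (r"PassManager",)),
)


def classify(log_text: str) -> Optional[Signature]:
    """Return the first signature whose patterns all match, else None."""
    for sig in SIGNATURES:
        if all(re.search(pattern, log_text) for pattern in sig.patterns):
            return sig
    return None
```

Infra noise is checked first so that a wiped-out agent is never mistaken for a regression, and an unmatched log returns None so it can be triaged by hand instead of being forced into a category.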

vercel · 2026-05-05T21:36:30Z

@atalman is attempting to deploy a commit to the Meta Open Source Team on Vercel.

A member of the Team first needs to authorize it.

atalman and others added 3 commits July 11, 2022 13:58

Migrating to CUDA 11.7

797a2f3

CUDA versioning switch case

3d7a662

meta-cla Bot added the CLA Signed label May 5, 2026

atalman closed this May 5, 2026

atalman deleted the add-vllm-pytorch-ci-triage-skill branch May 5, 2026 21:39

atalman wants to merge 3 commits into pytorch:main from atalman:add-vllm-pytorch-ci-triage-skill
