Add vllm-pytorch-ci-triage skill#8045

Closed
atalman wants to merge 3 commits into pytorch:main from atalman:add-vllm-pytorch-ci-triage-skill

@atalman
Contributor

@atalman atalman commented May 5, 2026

Summary

Adds a Claude Code skill that automates end-to-end triage of failing vLLM Buildkite CI runs for PyTorch version-bump PRs.

Derived from the multi-week triage of vLLM PR vllm-project/vllm#40077 (torch 2.12.0 + triton 3.7.0), which produced umbrella issue pytorch/pytorch#180899 with 25+ tracked sub-issues over a series of daily runs.

What the skill does

  • Pulls the failing build's job list from the Buildkite REST API and filters to true hard failures (excluding soft_failed=True, waiting_failed, and infra-aborted jobs).
  • Compares each failing job against recent main "Full CI run - nightly/daily" builds to drop pre-existing failures, with the caveat that infra-killed main jobs are not a valid baseline (they must be retried first).
  • Pulls and ANSI/timestamp-strips logs for the survivors and matches them against a curated set of root-cause signature regexes (Inductor MetaProxy, triton PassManager, AOT cache pickling, custom-op fake-kernel stride mismatch, GPU contention, FP8 / quantized accuracy drift, etc.).
  • Routes each root cause to the right repo: pytorch/pytorch (torch / triton / Inductor / Dynamo / AOTAutograd) vs. vllm-project/vllm (multimodal model assertions, custom-op fake-kernel bugs, response APIs).
  • Drafts upstream issues with reproducibility tables, environment blocks, and tracebacks, with a strict draft→confirm→post protocol, and links them under the umbrella.
  • Manages umbrella checklist hygiene (mark closed, reopen on regressions, retract on false positives like the recent #182549 retraction).
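The first two steps above (hard-failure filtering and baseline comparison) can be sketched roughly as follows. This is a minimal illustration, not the skill's actual code: it assumes each job is a dict shaped like a Buildkite REST API job object with `name`, `state`, `soft_failed`, and `exit_status` keys, and it treats `exit_status == 125` alone as an infra abort, which is a simplification. The function names `hard_failures` and `triage_against_baseline` are hypothetical.

```python
# Assumption: exit codes that indicate an infra abort rather than a test failure.
INFRA_EXIT_STATUSES = {125}


def is_infra_abort(job):
    """Simplified infra check; a real check would also inspect the log."""
    return job.get("exit_status") in INFRA_EXIT_STATUSES


def hard_failures(jobs):
    """Return the names of jobs that failed and actually block the build."""
    names = set()
    for job in jobs:
        if job.get("state") != "failed":
            continue  # drops waiting_failed, passed, canceled, ...
        if job.get("soft_failed"):
            continue  # soft failures are non-blocking
        if is_infra_abort(job):
            continue  # infra aborts are rerun candidates, not regressions
        names.add(job["name"])
    return names


def triage_against_baseline(pr_jobs, main_jobs):
    """Split PR hard failures into new, pre-existing, and inconclusive."""
    main_by_name = {job["name"]: job for job in main_jobs}
    result = {"new": [], "pre_existing": [], "inconclusive": []}
    for name in sorted(hard_failures(pr_jobs)):
        base = main_by_name.get(name)
        if base is None or is_infra_abort(base):
            # No valid baseline: main must be rerun before filing anything.
            result["inconclusive"].append(name)
        elif base.get("state") == "failed":
            result["pre_existing"].append(name)
        elif base.get("state") == "passed":
            result["new"].append(name)
        else:
            result["inconclusive"].append(name)
    return result
```

Only the jobs under `"new"` would move on to log analysis; anything under `"inconclusive"` calls for a main rerun rather than an issue.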

Notable lessons baked in

  • state=failed + soft_failed=True is non-blocking — always filter both.
  • "Engine core initialization failed. See root cause above." is a red herring: the actual exception is several lines up in the EngineCore worker output.
  • Custom-op assert_size_stride failures on torch.ops.vllm.<X>.default are almost always vLLM-side fake-kernel bugs, not torch regressions — inspect the direct_register_custom_op(... fake_impl=...) registration first.
  • Bulk B200 exit_status=125 + nvidia-container-cli: device error / driver rpc error: timed out is agent infra, not a regression — recommend rerun.
  • When the same B200 infra cluster wipes out both the test-PR and the main-build coverage of a job, the comparison is inconclusive — ALWAYS request a main rerun before filing. Filing without that baseline produced a wrongful issue (#182549, retracted).
  • The Buildkite REST API rate limit is 400/min. Read the token into a shell variable before running parallel curl calls inside a while read loop; an inline $(cat …) silently produces 0-byte log files.
  • Title convention: [vllm] [<sub-area tag>] <concise root cause>. Always include [vllm] from the start (post-hoc edits are noisy).
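A few of these lessons can be expressed as a small log classifier. The sketch below is illustrative only: the regexes are simplified stand-ins for the curated signatures in SKILL.md, the helper names `clean_log` and `classify` are hypothetical, and the timestamp pattern assumes Buildkite logs embed `ESC _bk;t=<millis> BEL` markers.

```python
import re

# CSI escape sequences such as "\x1b[31m" (colors, cursor movement).
ANSI_RE = re.compile(r"\x1b\[[0-9;?]*[A-Za-z]")
# Assumption: Buildkite timestamp markers look like ESC _bk;t=<millis> BEL.
BK_TIMESTAMP_RE = re.compile(r"\x1b_bk;t=\d+\x07")
# This line only points at the real error further up, so it is dropped.
RED_HERRING = "Engine core initialization failed"

# Each entry: (label, patterns that must ALL match, where to route it).
SIGNATURES = [
    (
        "infra: container runtime failure",
        [re.compile(r"nvidia-container-cli: device error|driver rpc error: timed out")],
        "rerun",
    ),
    (
        "custom-op fake-kernel stride mismatch",
        [
            re.compile(r"assert_size_stride"),
            re.compile(r"torch\.ops\.vllm\.\w+\.default"),
        ],
        "vllm-project/vllm",
    ),
]


def clean_log(raw):
    """Strip timestamps and ANSI codes, and drop the red-herring line."""
    text = BK_TIMESTAMP_RE.sub("", raw)
    text = ANSI_RE.sub("", text)
    lines = [line for line in text.splitlines() if RED_HERRING not in line]
    return "\n".join(lines)


def classify(raw):
    """Return (label, route) for the first signature whose patterns all match."""
    text = clean_log(raw)
    for label, patterns, route in SIGNATURES:
        if all(pattern.search(text) for pattern in patterns):
            return label, route
    return "unclassified", "manual review"
```

Checking the infra signature first means a job killed by the container runtime is never misread as a regression, whatever else appears in its log.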

Test plan

This skill is invoked manually by Claude Code when the user points at a failing Buildkite build. Validation has been the actual triage of vLLM #40077 over 16+ daily runs since 2026-04-20:

  • 16+ pytorch/pytorch issues filed under umbrella #180899, with reproducibility tables and environment blocks.
  • Caught real regressions: AsyncTP correctness (#182124), Fullgraph Smoke Test (#182125), Batch Invariance B200 (#181248, fixed by Lucas Kabela's PR), MetaProxy in FP8 fusion (#180906), aten::bmm double-registration (#180905), and others.
  • Caught the gpt-oss MoE custom-op stride mismatch as a vLLM-side bug ("[Bug]: gpt-oss MoE moe_forward fake-kernel shape mismatch breaks torch.compile + TP > 1 on Blackwell", vllm-project/vllm#41645 / #41646), correctly routing it away from pytorch.
  • Caught and retracted a false positive (#182549) once the main nightly was retried, demonstrating the infra-baseline lesson.

No automated test harness exists for Claude skills today; the skill is exercised through the live triage workflow.

File added

  • .claude/skills/vllm-pytorch-ci-triage/SKILL.md (379 lines)

This follows the same convention as the four existing skills under .claude/skills/ (each is a single-file SKILL.md with YAML frontmatter).

atalman and others added 3 commits July 11, 2022 13:58
Adds a Claude Code skill that automates triage of failing vLLM
Buildkite CI runs for PyTorch version-bump PRs (e.g. vLLM #40077 for
torch 2.12.0 + triton 3.7.0). The workflow:

- pulls the failing build's job list and filters to true hard failures
  (excluding soft_failed and waiting_failed cascades),
- compares each failing job against recent main "Full CI run -
  nightly/daily" builds to drop pre-existing failures,
- classifies remaining failures by root-cause signature (Inductor
  MetaProxy, triton PassManager, AOT cache pickling, CUDA OOM /
  contention, custom-op fake-kernel stride mismatch, accuracy drift,
  etc.) and routes them to pytorch/pytorch vs vllm-project/vllm,
- drafts upstream issues with reproducibility tables, environment
  blocks, and tracebacks, then posts them under an umbrella issue.

Derived from the multi-week triage that produced umbrella
pytorch/pytorch#180899.

Signed-off-by: atalman <atalman@meta.com>
@vercel

vercel Bot commented May 5, 2026

@atalman is attempting to deploy a commit to the Meta Open Source Team on Vercel.

A member of the Team first needs to authorize it.

@meta-cla meta-cla Bot added the CLA Signed label May 5, 2026
@atalman atalman closed this May 5, 2026
@atalman atalman deleted the add-vllm-pytorch-ci-triage-skill branch May 5, 2026 21:39
