From 24846a7d79ac71ce629a0733547b3e68a4ce41ba Mon Sep 17 00:00:00 2001
From: Kangyan Zhou <zky314343421@gmail.com>
Date: Sun, 1 Mar 2026 20:09:54 -0800
Subject: [PATCH 1/3] Add Claude Code skill for bisecting CI regressions

Adds a reusable skill that systematically investigates failing CI tests
through temporal bisection, runner/hardware analysis, and optional remote
reproduction on GPU servers.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
---
 .../sglang-bisect-ci-regression/SKILL.md      | 205 ++++++++++++++++++
 1 file changed, 205 insertions(+)
 create mode 100644 .claude/skills/sglang-bisect-ci-regression/SKILL.md

diff --git a/.claude/skills/sglang-bisect-ci-regression/SKILL.md b/.claude/skills/sglang-bisect-ci-regression/SKILL.md
new file mode 100644
index 000000000000..8ddbe3a09d27
--- /dev/null
+++ b/.claude/skills/sglang-bisect-ci-regression/SKILL.md
@@ -0,0 +1,205 @@
+# SGLang Bisect CI Regression
+
+Investigate a consistently failing CI test to find the root cause - whether it's a code regression from a specific PR, a hardware/runner-specific issue, or an environment change. Optionally reproduce the failure on a remote GPU server.
+
+## Slash Command
+
+`/sglang-bisect-ci-regression <test_name_or_ci_url> [ssh_target] [docker_container]`
+
+## When to Use This Skill
+
+- A CI test is failing consistently on main (scheduled runs)
+- You need to find which PR introduced a regression
+- You suspect a runner-specific or GPU-specific issue
+- You want to reproduce a CI failure on a remote server
+
+## Arguments
+
+- **First argument (required)**: Test file name (e.g. `test_lora_tp.py`) or a GitHub Actions job URL
+- **Second argument (optional)**: SSH target for remote reproduction (e.g. `user@host`)
+- **Third argument (optional)**: Docker container name on the SSH target (e.g. `sglang_dev`)
+
+If SSH target and docker container are not provided, the skill will only perform the CI log analysis and bisection, without remote reproduction. **Ask the user** for these if reproduction is needed and they weren't provided.
+
+## Workflow
+
+### Phase 1: Extract the Failure Signature
+
+1. **Get the failing test details from CI logs.** If given a URL, fetch logs directly. If given a test name, find recent schedule runs that failed:
+
+```bash
+# List recent scheduled runs
+gh run list --workflow="pr-test.yml" --event schedule --branch main --limit 20 --json databaseId,conclusion,createdAt,headSha
+
+# Find the job containing the test
+gh run view {RUN_ID} --json jobs --jq '.jobs[] | select(.conclusion == "failure") | {name, conclusion, databaseId}'
+
+# Get the failure details
+gh run view {RUN_ID} --job {JOB_ID} --log 2>&1 | grep -B 5 -A 30 "AssertionError\|FAIL\|Error\|{TEST_NAME}"
+```
+
+2. **Record the failure signature:**
+   - Exact error message and assertion
+   - Affected test method name
+   - Model/config involved
+   - Numeric values (e.g., tolerance diffs, scores)
+   - Whether the failure is deterministic (same values across runs)
+
+### Phase 2: Temporal Bisection
+
+3. **Find the boundary between passing and failing runs.** Check the schedule run history to identify:
+   - Last known PASSING run (sha + date)
+   - First known FAILING run (sha + date)
+
+```bash
+# For each run, check the specific partition/job status
+gh run view {RUN_ID} --json jobs --jq '.jobs[] | select(.name == "{JOB_NAME}") | {conclusion, databaseId}'
+
+# Verify a specific test passed or failed in a run
+gh run view {RUN_ID} --job {JOB_ID} --log 2>&1 | grep -E "{TEST_NAME}|PASSED|FAILED|logprobs mismatch" | head -10
+```
+
+4. **List commits between the boundary:**
+
+```bash
+git log --oneline {LAST_PASS_SHA}..{FIRST_FAIL_SHA}
+```
+
+5. **Filter for relevant commits** that touch files related to the failing test (model layers, kernels, test utilities, etc.):
+
+```bash
+git log --oneline {LAST_PASS_SHA}..{FIRST_FAIL_SHA} -- {relevant_paths}
+```
+
+### Phase 3: Runner/Hardware Analysis
+
+6. **Check if the failure is runner-specific.** Extract the runner identity from each failing and passing run:
+
+```bash
+# Get runner name and machine
+gh run view {RUN_ID} --job {JOB_ID} --log 2>&1 | grep -E "Runner name|Machine name" | head -5
+
+# Get GPU/driver info
+gh run view {RUN_ID} --job {JOB_ID} --log 2>&1 | grep -i "NVIDIA-SMI\|Driver Version\|CUDA Version" | head -5
+
+# Get package versions
+gh run view {RUN_ID} --job {JOB_ID} --log 2>&1 | grep -E "sgl.kernel.*==|flashinfer.*==" | head -5
+```
+
+7. **Correlate runners with pass/fail outcomes.** Build a table:
+
+| Run ID | Date | Runner | GPU Type | Driver | Result |
+|--------|------|--------|----------|--------|--------|
+
+If all failures map to a specific runner type/GPU and all passes map to another, the issue is **hardware-specific**, not a code regression.
+
+### Phase 4: Code Analysis
+
+8. **If a code regression is suspected** (failures not runner-specific), examine the candidate commits:
+   - Read the changed files
+   - Understand how the changes could affect the failing test
+   - Look for prefill-vs-decode differences, TP-specific paths, kernel changes
+
+9. **If a hardware issue is suspected**, analyze:
+   - Kernel compatibility (CUDA compute capability)
+   - Driver version differences
+   - All-reduce / NCCL behavior differences
+   - CUDA graph capture differences across GPU architectures
+
+### Phase 5: Remote Reproduction (Optional)
+
+Only if SSH target and docker container were provided.
+
+10. **Verify the remote environment:**
+
+```bash
+ssh {SSH_TARGET} "docker exec {CONTAINER} nvidia-smi --query-gpu=name,driver_version --format=csv"
+ssh {SSH_TARGET} "docker exec {CONTAINER} pip show sgl-kernel sglang flashinfer-python 2>&1 | grep -E 'Name:|Version:'"
+```
+
+11. **Ensure latest code is installed.** If the container is stale, update:
+
+```bash
+# Try fetching latest main
+ssh {SSH_TARGET} "docker exec {CONTAINER} bash -c 'cd /path/to/sglang && git fetch origin main && git checkout origin/main'"
+# Or download tarball if git auth fails
+ssh {SSH_TARGET} "docker exec {CONTAINER} bash -c 'curl -L https://github.com/sgl-project/sglang/archive/refs/heads/main.tar.gz -o /tmp/sglang-main.tar.gz && cd /tmp && tar xzf sglang-main.tar.gz'"
+# Reinstall
+ssh {SSH_TARGET} "docker exec {CONTAINER} bash -c 'cd /path/to/sglang && pip install -e \"python[all]\"'"
+# Install test dependencies if needed
+ssh {SSH_TARGET} "docker exec {CONTAINER} pip install peft rouge-score"
+```
+
+12. **Create a minimal reproduction script** that:
+    - Uses `if __name__ == '__main__'` with `mp.set_start_method("spawn")`
+    - Runs the specific failing test configuration
+    - Prints key metrics (diffs, scores, outputs)
+    - Exits with code 1 on failure
+
+13. **Copy and run the reproduction script:**
+
+```bash
+scp /tmp/repro_script.py {SSH_TARGET}:/tmp/
+ssh {SSH_TARGET} "docker cp /tmp/repro_script.py {CONTAINER}:/tmp/"
+ssh {SSH_TARGET} "docker exec -e CUDA_VISIBLE_DEVICES=0,1 {CONTAINER} python3 /tmp/repro_script.py"
+```
+
+14. **Run control experiments** to isolate the variable:
+    - If suspecting TP issue: run with TP=1 as control
+    - If suspecting GPU issue: compare same code on different GPU
+    - If suspecting a specific commit: test before/after that commit
+
+### Phase 6: Report
+
+15. **Produce a structured report:**
+
+```markdown
+## CI Regression Bisection Report
+
+### Failure Signature
+- **Test**: {test_file}::{test_method}
+- **Error**: {exact error message}
+- **Key metrics**: {numeric values}
+- **Deterministic**: Yes/No
+
+### Root Cause Classification
+One of:
+- **Code Regression**: PR #{number} introduced the bug
+- **Hardware-Specific**: Fails on {GPU_TYPE}, passes on others
+- **Environment Change**: New runner/driver/package version
+- **Pre-existing Flakiness**: Intermittent, not a new regression
+
+### Evidence
+| Condition | Result |
+|-----------|--------|
+| {condition1} | PASS/FAIL |
+| {condition2} | PASS/FAIL |
+
+### Timeline
+- {date}: Last known pass ({sha}, {runner})
+- {date}: First known fail ({sha}, {runner})
+- {date}: Confirmed reproduction on {server}
+
+### Recommended Fix
+- **Short-term**: {workaround}
+- **Long-term**: {proper fix}
+```
+
+## Key Patterns to Recognize
+
+| Pattern | Diagnosis |
+|---------|-----------|
+| Same SHA passes on runner A, fails on runner B | Hardware/runner-specific |
+| All runners fail after commit X | Code regression from commit X |
+| Intermittent - same runner sometimes passes/fails | Flaky test or race condition |
+| Prefill OK but decode fails | TP/all-reduce issue in decode path |
+| Works with TP=1, fails with TP>1 | Tensor parallelism bug |
+| Exact same numeric diff every time | Deterministic bug, not flakiness |
+
+## Important Notes
+
+- **Always check runner identity** before concluding it's a code regression. Many "consistent" failures are actually runner-specific.
+- **Test partition assignments change over time** as tests are added/removed. A test may move between partitions, landing on different runner types.
+- **H200 runners** use `/root/actions-runner/` path and machine names like `gpu-h200-worker-*`. Non-H200 runners use `/public_sglang_ci/runner-*` paths.
+- When running remote reproduction, use `run_in_background` for long-running tests and check output with `TaskOutput`.
+- Container environments may be stale - always verify package versions match CI before drawing conclusions.

From 36acef2c05e36a095b4e7b27609851c8576c1234 Mon Sep 17 00:00:00 2001
From: Kangyan Zhou <zky314343421@gmail.com>
Date: Sun, 1 Mar 2026 20:14:06 -0800
Subject: [PATCH 2/3] Add scheduled CI run context to bisect skill

Document the pr-test.yml scheduled runs on main as the primary data
source for regression bisection, with dashboard link and --repo flags.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
---
 .../sglang-bisect-ci-regression/SKILL.md      | 38 +++++++++++++------
 1 file changed, 26 insertions(+), 12 deletions(-)

diff --git a/.claude/skills/sglang-bisect-ci-regression/SKILL.md b/.claude/skills/sglang-bisect-ci-regression/SKILL.md
index 8ddbe3a09d27..4ba74164fa32 100644
--- a/.claude/skills/sglang-bisect-ci-regression/SKILL.md
+++ b/.claude/skills/sglang-bisect-ci-regression/SKILL.md
@@ -21,21 +21,35 @@ Investigate a consistently failing CI test to find the root cause - whether it's
 
 If SSH target and docker container are not provided, the skill will only perform the CI log analysis and bisection, without remote reproduction. **Ask the user** for these if reproduction is needed and they weren't provided.
 
+## Background: Scheduled CI Runs
+
+SGLang uses the `pr-test.yml` workflow with **scheduled runs** (cron-triggered) to periodically test the `main` branch. These runs are the primary data source for detecting regressions:
+
+- **Workflow**: `pr-test.yml` with `event: schedule`
+- **Branch**: `main`
+- **Dashboard**: https://github.com/sgl-project/sglang/actions/workflows/pr-test.yml?query=event%3Aschedule
+- **Frequency**: Runs multiple times daily, each pinned to the HEAD of `main` at trigger time
+- **Purpose**: Catches regressions that slip through PR-level CI (e.g., interaction bugs between merged PRs, hardware-specific issues)
+
+Always use these scheduled runs (not PR-triggered runs) when bisecting regressions on `main`. The `--event schedule` filter in `gh run list` ensures you only see these periodic main-branch runs.
+
 ## Workflow
 
 ### Phase 1: Extract the Failure Signature
 
-1. **Get the failing test details from CI logs.** If given a URL, fetch logs directly. If given a test name, find recent schedule runs that failed:
+1. **Get the failing test details from CI logs.** If given a URL, fetch logs directly. If given a test name, find recent scheduled runs of `pr-test.yml` on `main` that failed:
 
 ```bash
-# List recent scheduled runs
-gh run list --workflow="pr-test.yml" --event schedule --branch main --limit 20 --json databaseId,conclusion,createdAt,headSha
+# List recent scheduled runs targeting main (the primary source of truth for regressions)
+# These are cron-triggered runs visible at:
+# https://github.com/sgl-project/sglang/actions/workflows/pr-test.yml?query=event%3Aschedule
+gh run list --repo sgl-project/sglang --workflow="pr-test.yml" --event schedule --branch main --limit 20 --json databaseId,conclusion,createdAt,headSha
 
 # Find the job containing the test
-gh run view {RUN_ID} --json jobs --jq '.jobs[] | select(.conclusion == "failure") | {name, conclusion, databaseId}'
+gh run view {RUN_ID} --repo sgl-project/sglang --json jobs --jq '.jobs[] | select(.conclusion == "failure") | {name, conclusion, databaseId}'
 
 # Get the failure details
-gh run view {RUN_ID} --job {JOB_ID} --log 2>&1 | grep -B 5 -A 30 "AssertionError\|FAIL\|Error\|{TEST_NAME}"
+gh run view {RUN_ID} --repo sgl-project/sglang --job {JOB_ID} --log 2>&1 | grep -B 5 -A 30 "AssertionError\|FAIL\|Error\|{TEST_NAME}"
 ```
 
 2. **Record the failure signature:**
@@ -47,16 +61,16 @@ gh run view {RUN_ID} --job {JOB_ID} --log 2>&1 | grep -B 5 -A 30 "AssertionError
 
 ### Phase 2: Temporal Bisection
 
-3. **Find the boundary between passing and failing runs.** Check the schedule run history to identify:
+3. **Find the boundary between passing and failing runs.** Walk through the scheduled run history (from the `pr-test.yml` schedule runs on `main`) to identify:
    - Last known PASSING run (sha + date)
    - First known FAILING run (sha + date)
 
 ```bash
-# For each run, check the specific partition/job status
-gh run view {RUN_ID} --json jobs --jq '.jobs[] | select(.name == "{JOB_NAME}") | {conclusion, databaseId}'
+# For each scheduled run, check the specific partition/job status
+gh run view {RUN_ID} --repo sgl-project/sglang --json jobs --jq '.jobs[] | select(.name == "{JOB_NAME}") | {conclusion, databaseId}'
 
 # Verify a specific test passed or failed in a run
-gh run view {RUN_ID} --job {JOB_ID} --log 2>&1 | grep -E "{TEST_NAME}|PASSED|FAILED|logprobs mismatch" | head -10
+gh run view {RUN_ID} --repo sgl-project/sglang --job {JOB_ID} --log 2>&1 | grep -E "{TEST_NAME}|PASSED|FAILED|logprobs mismatch" | head -10
 ```
 
 4. **List commits between the boundary:**
@@ -77,13 +91,13 @@ git log --oneline {LAST_PASS_SHA}..{FIRST_FAIL_SHA} -- {relevant_paths}
 
 ```bash
 # Get runner name and machine
-gh run view {RUN_ID} --job {JOB_ID} --log 2>&1 | grep -E "Runner name|Machine name" | head -5
+gh run view {RUN_ID} --repo sgl-project/sglang --job {JOB_ID} --log 2>&1 | grep -E "Runner name|Machine name" | head -5
 
 # Get GPU/driver info
-gh run view {RUN_ID} --job {JOB_ID} --log 2>&1 | grep -i "NVIDIA-SMI\|Driver Version\|CUDA Version" | head -5
+gh run view {RUN_ID} --repo sgl-project/sglang --job {JOB_ID} --log 2>&1 | grep -i "NVIDIA-SMI\|Driver Version\|CUDA Version" | head -5
 
 # Get package versions
-gh run view {RUN_ID} --job {JOB_ID} --log 2>&1 | grep -E "sgl.kernel.*==|flashinfer.*==" | head -5
+gh run view {RUN_ID} --repo sgl-project/sglang --job {JOB_ID} --log 2>&1 | grep -E "sgl.kernel.*==|flashinfer.*==" | head -5
 ```
 
 7. **Correlate runners with pass/fail outcomes.** Build a table:

From 499ebac4da6772e163f32678a386e772959b4218 Mon Sep 17 00:00:00 2001
From: Kangyan Zhou <zky314343421@gmail.com>
Date: Sun, 1 Mar 2026 20:51:40 -0800
Subject: [PATCH 3/3] Address review comments: use grep -E for portable regex

- Use `grep -E` with `|` instead of `\|` for alternation (3 places)
- Clarify tarball vs git fetch reinstall instructions to avoid ambiguity

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
---
 .claude/skills/sglang-bisect-ci-regression/SKILL.md | 10 +++++-----
 1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/.claude/skills/sglang-bisect-ci-regression/SKILL.md b/.claude/skills/sglang-bisect-ci-regression/SKILL.md
index 4ba74164fa32..4eb39227c16e 100644
--- a/.claude/skills/sglang-bisect-ci-regression/SKILL.md
+++ b/.claude/skills/sglang-bisect-ci-regression/SKILL.md
@@ -49,7 +49,7 @@ gh run list --repo sgl-project/sglang --workflow="pr-test.yml" --event schedule
 gh run view {RUN_ID} --repo sgl-project/sglang --json jobs --jq '.jobs[] | select(.conclusion == "failure") | {name, conclusion, databaseId}'
 
 # Get the failure details
-gh run view {RUN_ID} --repo sgl-project/sglang --job {JOB_ID} --log 2>&1 | grep -B 5 -A 30 "AssertionError\|FAIL\|Error\|{TEST_NAME}"
+gh run view {RUN_ID} --repo sgl-project/sglang --job {JOB_ID} --log 2>&1 | grep -E -B 5 -A 30 "AssertionError|FAIL|Error|{TEST_NAME}"
 ```
 
 2. **Record the failure signature:**
@@ -94,7 +94,7 @@ git log --oneline {LAST_PASS_SHA}..{FIRST_FAIL_SHA} -- {relevant_paths}
 gh run view {RUN_ID} --repo sgl-project/sglang --job {JOB_ID} --log 2>&1 | grep -E "Runner name|Machine name" | head -5
 
 # Get GPU/driver info
-gh run view {RUN_ID} --repo sgl-project/sglang --job {JOB_ID} --log 2>&1 | grep -i "NVIDIA-SMI\|Driver Version\|CUDA Version" | head -5
+gh run view {RUN_ID} --repo sgl-project/sglang --job {JOB_ID} --log 2>&1 | grep -i -E "NVIDIA-SMI|Driver Version|CUDA Version" | head -5
 
 # Get package versions
 gh run view {RUN_ID} --repo sgl-project/sglang --job {JOB_ID} --log 2>&1 | grep -E "sgl.kernel.*==|flashinfer.*==" | head -5
@@ -136,9 +136,9 @@ ssh {SSH_TARGET} "docker exec {CONTAINER} pip show sgl-kernel sglang flashinfer-
 ```bash
 # Try fetching latest main
 ssh {SSH_TARGET} "docker exec {CONTAINER} bash -c 'cd /path/to/sglang && git fetch origin main && git checkout origin/main'"
-# Or download tarball if git auth fails
-ssh {SSH_TARGET} "docker exec {CONTAINER} bash -c 'curl -L https://github.com/sgl-project/sglang/archive/refs/heads/main.tar.gz -o /tmp/sglang-main.tar.gz && cd /tmp && tar xzf sglang-main.tar.gz'"
-# Reinstall
+# Or download and install from tarball if git auth fails
+ssh {SSH_TARGET} "docker exec {CONTAINER} bash -c 'cd /tmp && curl -L https://github.com/sgl-project/sglang/archive/refs/heads/main.tar.gz | tar xz && cd sglang-main && pip install -e \"python[all]\"'"
+# Reinstall (after git fetch)
 ssh {SSH_TARGET} "docker exec {CONTAINER} bash -c 'cd /path/to/sglang && pip install -e \"python[all]\"'"
 # Install test dependencies if needed
 ssh {SSH_TARGET} "docker exec {CONTAINER} pip install peft rouge-score"