
[AMD] CI - Fix AMD CI (multimodal test, move flaky test to non-deterministic group)#20815

Merged
HaiShaw merged 4 commits into main from fix_amd_ci_multimodaltest
Mar 19, 2026

Conversation

@yctseng0211
Collaborator

@yctseng0211 yctseng0211 commented Mar 18, 2026

Motivation

Modifications

Accuracy Tests

[PASSED] https://github.com/sgl-project/sglang/actions/runs/23234956586

Benchmarking and Profiling

Checklist

Review Process

  1. Ping Merge Oncalls to start the PR flow. See the PR Merge Process.
  2. Get approvals from CODEOWNERS and other reviewers.
  3. Trigger CI tests with comments or contact authorized users to do so.
    • /tag-run-ci-label, /rerun-failed-ci, /tag-and-rerun-ci
  4. After green CI and required approvals, ask Merge Oncalls to merge.

@gemini-code-assist
Contributor

Warning

You have reached your daily quota limit. Please wait up to 24 hours and I will start processing your requests again!

@github-actions github-actions bot added the amd label Mar 18, 2026
@yctseng0211 yctseng0211 marked this pull request as ready for review March 19, 2026 00:09

@yctseng0211
Collaborator Author

@amd-bot ci-status

@yctseng0211 yctseng0211 changed the title [AMD] CI - Fix multimodal test [AMD] CI - Fix AMD CI Mar 19, 2026
@yctseng0211 yctseng0211 changed the title [AMD] CI - Fix AMD CI [AMD] CI - Fix AMD CI (multimodal test, move flaky test to non-deterministic group) Mar 19, 2026
@github-actions github-actions bot added the lora and Multi-modal labels Mar 19, 2026
@yctseng0211
Collaborator Author

@amd-bot ci-status

@bingxche
Collaborator

@amd-bot review

@bingxche
Collaborator

@yctseng0211 requested CI status check

CI Status for PR #20815

Total: 81 checks

  • Passed: 29
  • Failed: 4
  • Pending: 18

Failed Checks:


Detailed Analysis

Method: Progressive step-by-step analysis (all steps examined per job)

Job: stage-b-test-large-1-gpu (0)

CI Failure Analysis: stage-b-test-large-1-gpu (0)

1. Root Cause Analysis

Port 11000 was already bound by a zombie/orphan process when test_eagle_infer_b.py attempted to start the sglang server. The test harness launched the server 3 times in succession; each time the model loaded successfully and CUDA graphs were captured, but the HTTP server could not bind to 127.0.0.1:11000. After 300 seconds the watchdog killed the server, the test retried, and ultimately the 1800-second per-file timeout was exhausted — preventing the remaining 2 test files from ever executing.

The most likely source of the stale process on port 11000 is insufficient cleanup from a prior CI run or a prior test file within the same runner session. The shared runner h100-radixark-host1-gpu-3 runs inside a persistent Docker container (8e1df5b34166), so processes from previous jobs can survive if they aren't explicitly killed. The git clean -ffdx in step 10 only cleans the filesystem — it does not kill orphan server processes.

2. Failure Details

Error Messages (repeated 3 times)

ERROR: [Errno 98] error while attempting to bind on address ('127.0.0.1', 11000): address already in use
TokenizerManager watchdog timeout (self.watchdog_timeout=300, self.soft=True)
py-spy: Permission denied (os error 13)

Test Results

Test File | Result | Duration
test_eagle_infer_b.py | TIMEOUT | 1800s (budget exhausted)
test_lora_overlap_loading.py | NOT RUN | —
test_utils_update_weights.py | NOT RUN | —
test_update_weights_from_disk.py | SKIPPED (see #14021) | —

Summary: 0/3 passed

Three Server Launch Attempts (all failed identically)

  1. 08:41 — max_running_requests=8, default KV cache (86756 tokens) → port 11000 in use → watchdog timeout 300s
  2. 08:51 — Same config (retry) → same failure
  3. 09:01 — max_running_requests=64, max_total_tokens=4500, chunked_prefill_size=128 → same failure

3. Suggested Fixes

Immediate / Short-term

  1. Kill stale processes before test execution — Add a pre-test cleanup step in the CI workflow or in run_suite.py:

    # Kill any process holding port 11000 (and other common sglang ports)
    for port in 11000 11001 11002 30000 30001; do
      fuser -k ${port}/tcp 2>/dev/null || true
    done
    # Also kill any orphan sglang/python server processes
    pkill -f "python.*launch_server" 2>/dev/null || true
    pkill -f "sglang.launch_server" 2>/dev/null || true
  2. Use dynamic/random ports in tests — Instead of hardcoding port 11000, let the OS assign a free port:

    # In test fixtures / server launch helpers
    import socket
    def get_free_port():
        with socket.socket() as s:
            s.bind(('', 0))
            return s.getsockname()[1]

    This eliminates port conflicts entirely between concurrent or sequential runs.

  3. Add port-availability check with fast-fail — In the server startup path (launch_server.py or the test harness), detect EADDRINUSE immediately and either pick another port or fail fast with a clear message instead of waiting 300s for the watchdog:

    # Before launching the server
    import socket
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        if s.connect_ex(('127.0.0.1', port)) == 0:
            raise RuntimeError(f"Port {port} already in use — aborting immediately")

Medium-term

  1. Fix py-spy permissions — The watchdog's diagnostic dump fails with Permission denied (os error 13). Either:

    • Run CI with SYS_PTRACE capability: docker run --cap-add SYS_PTRACE ...
    • Or set kernel.yama.ptrace_scope=0 in the runner container
    • This won't fix the port issue but will provide better diagnostics when watchdog timeouts occur.
  2. Isolate test runs with process namespaces — Run each test file inside its own PID namespace or network namespace to guarantee no port leakage:

    unshare --net --map-root-user -- python -m pytest test_eagle_infer_b.py
  3. Add post-test cleanup in the test harness — run_suite.py should kill all child processes after each test file completes (or times out), not just at the end of the suite.
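A minimal sketch of such per-file cleanup using POSIX process groups (the helper name and arguments are hypothetical illustrations, not the actual run_suite.py interface):

```python
import os
import signal
import subprocess

def run_test_file_isolated(cmd, timeout):
    """Run one test file in its own process group, then tear the whole
    group down so no orphan server survives into the next file.

    `cmd` and `timeout` are hypothetical parameters for illustration.
    """
    proc = subprocess.Popen(cmd, start_new_session=True)  # new process group
    try:
        return proc.wait(timeout=timeout)
    finally:
        try:
            # Kill children too (e.g. a spawned server still holding a port),
            # not just the direct test process.
            os.killpg(proc.pid, signal.SIGKILL)
        except ProcessLookupError:
            pass  # everything already exited
```

Launching with start_new_session=True puts the test file, its pytest workers, and any server it spawns into one process group that can be torn down atomically, even after a timeout.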

Workflow-level fix (.github/workflows/)

- name: Kill stale server processes
  run: |
    fuser -k 11000/tcp 2>/dev/null || true
    pkill -9 -f "sglang" 2>/dev/null || true
    sleep 2
  if: always()

Add this before the test run step and also as a post-test cleanup with if: always().

4. Priority

High — This is a flaky infrastructure failure that:

  • Caused 0/3 tests to pass and blocked 2 tests from running at all
  • Is non-deterministic (depends on runner state from prior jobs)
  • Will recur on the same runner unless addressed
  • Masks real test results for EAGLE speculative decoding and LoRA overlap loading

5. Environment Context

Component | Value
Runner | h100-radixark-host1-gpu-3 (persistent Docker container 8e1df5b34166)
GPU | NVIDIA H100 80GB HBM3
Driver | NVIDIA 580.126.09, CUDA 13.0
PyTorch | 2.9.1 (CUDA 12.8)
sglang | 0.0.0.dev1+g5252bd422 (commit 5252bd4)
sglang-kernel | 0.4.0 (pre-built)
flashinfer | 0.6.6
transformers | 5.3.0
Python | 3.10.12
Conflicting port | 127.0.0.1:11000 (EADDRINUSE)
Model | meta-llama/Llama-2-7b-chat-hf + lmsys/sglang-EAGLE-llama2-chat-7B
Test timeout | 1800s per file, 300s watchdog

Job: stage-b-test-large-1-gpu (2)

CI Failure Analysis: stage-b-test-large-1-gpu (2)

1. Root Cause Analysis

Two MiniCPM vision-language models (MiniCPM-V-4 and MiniCPM-o-2_6) fail to launch their sglang server process, crashing with exit code 1 during setUpClass. The server fails first in offline mode (HF_HUB_OFFLINE=1), and then again on the online retry.

The most likely root cause is a model loading incompatibility caused by the combination of:

  • transformers 5.3.0 — a very recent/bleeding-edge version that may have changed model class registrations, config schemas, or trust_remote_code handling for MiniCPM architectures.
  • Missing trust_remote_code files in HF cache — the HF cache validation in Step 19 explicitly flagged failures for MiniCPM models (among others) with missing files, which are required for custom model architectures like MiniCPM-V-4 and MiniCPM-o-2_6.
  • No CUDA coredumps were generated, confirming this is an application-level/Python crash (model loading or config parsing), not a GPU driver issue.

The fact that other VLMs (DeepSeek-OCR, Gemma-3, Qwen2.5-VL, Qwen2-VL, Qwen2-Audio, Qwen3-Omni, Qwen3-VL) all serve and pass successfully strongly points to a MiniCPM-specific model loading regression.

2. Failure Details

Failed Tests:

Test Class | Model | Error
TestMiniCPMV4Server | openbmb/MiniCPM-V-4 | Exception: Server process exited with code 1 in setUpClass
TestMiniCPMo26Server | openbmb/MiniCPM-o-2_6 | Exception: Server process exited with code 1 in setUpClass

Test Summary: 51 tests ran in 535.2s — 2 errors, 6 skipped. Exit code 255.

Failure Pattern:

  1. Test harness attempts to launch sglang server with HF_HUB_OFFLINE=1 → server process exits with code 1
  2. Retries without offline mode → server process exits with code 1 again
  3. setUpClass raises exception, all tests in the class are marked as errors

HF Cache Validation Failures (from Step 19):
The cache integrity check showed 32 FAILs including MiniCPM models missing trust_remote_code files (custom Python modules needed to load these architectures). These models use custom architectures not natively in transformers and rely on downloaded Python files from the HuggingFace Hub.

3. Suggested Fixes

Immediate / Short-term

  1. Investigate MiniCPM server crash logs: The current test output only shows exit code 1. Add or surface the actual server stderr/stdout to diagnose the exact crash point:

    # In the test harness, capture and print server logs on failure
  2. Fix HF cache for MiniCPM models: Re-download the model repos with trust_remote_code files:

    huggingface-cli download openbmb/MiniCPM-V-4 --local-dir-use-symlinks False
    huggingface-cli download openbmb/MiniCPM-o-2_6 --local-dir-use-symlinks False

    Ensure the runner's HF cache includes the custom .py files (not just weights/config).

  3. Pin or test transformers version compatibility: transformers 5.3.0 is very new. Test whether downgrading resolves the issue:

    pip install transformers==4.51.0

    MiniCPM custom code may reference internal transformers APIs that changed in v5.x.

Medium-term

  1. Add per-model cache validation to CI: Before running tests, validate that each tested model has all required files (including *.py for trust_remote_code models):

    # Validate trust_remote_code models have their custom code files
    import glob
    import os

    import huggingface_hub

    for model_id in ["openbmb/MiniCPM-V-4", "openbmb/MiniCPM-o-2_6"]:
        snapshot = huggingface_hub.snapshot_download(model_id, local_files_only=True)
        assert glob.glob(os.path.join(snapshot, "*.py")), f"Missing custom code for {model_id}"
  2. Consider skipping MiniCPM tests when trust_remote_code files are unavailable, with a clear skip message rather than a hard crash.

4. Priority

High

  • The failure blocks the entire test suite partition (0/4 files pass; remaining 3 test files never execute)
  • Two actively supported VLM models are broken in CI
  • The root cause (likely transformers 5.x incompatibility or incomplete HF cache) could affect other trust_remote_code models as well
  • Not Critical because: other model tests pass, and this is partition 2/14 — other partitions are likely unaffected

5. Environment Context

Component | Value
GPU | NVIDIA H100 80GB HBM3 (sm_90)
Driver | 580.126.09
CUDA (PyTorch) | 12.8
PyTorch | 2.9.1
transformers | 5.3.0 ⚠️
tokenizers | 0.22.2
sglang | 0.0.0.dev1+g5252bd422
flash-attn-4 | 4.0.0b4
flashinfer | 0.6.6
Python | 3.10.12
Commit | 5252bd4222d72a32e9c14e5f393c9ed0dac239fb
Runner | h100-radixark-host1-gpu-6
HF Cache | 32 FAIL (incomplete models including MiniCPM)

Job: wait-for-stage-b

CI Failure Analysis: wait-for-stage-b — Job stage-b-test-large-1-gpu (2) Failed

1. Root Cause Analysis

The wait-for-stage-b gating job detected that stage-b-test-large-1-gpu (2) completed with conclusion=failure. This polling job itself functioned correctly — it did exactly what it was designed to do: detect the failure and fail-fast.

The root cause is not determinable from the logs of this job alone. The wait-for-stage-b job is a monitoring/gating job that only polls the GitHub Actions API for job statuses. It does not contain any test output, error messages, stack traces, or build logs from the failing job. The actual failure details (test errors, OOM, timeout, compilation failure, etc.) exist exclusively in the logs of stage-b-test-large-1-gpu (2) itself.

What we can infer:

  • The failure occurred early — stage-b-test-large-1-gpu (2) completed by polling attempt 8 (~14 minutes into the run), while 24/27 jobs were still queued/in-progress. This suggests the failure was not a timeout but rather a fast, hard failure (crash, assertion error, compilation error, or early test failure).
  • Only 3/27 jobs had completed at termination: stage-b-test-small-1-gpu (0) ✅, stage-b-test-4-gpu-b200 ✅, and stage-b-test-large-1-gpu (2) ❌.
  • The PR under test is PR [AMD] CI - Fix AMD CI (multimodal test, move flaky test to non-deterministic group) #20815 (merge commit 5252bd4, merging 27142c2 into cb8105f).

2. Failure Details

Field | Value
Failed job | stage-b-test-large-1-gpu (2)
Status | completed
Conclusion | failure
Error message | ##[error]stage-b jobs failed: stage-b-test-large-1-gpu (2)
Specific test failures | Not available — must inspect the failed job's own logs
Stack traces | Not available — must inspect the failed job's own logs
Time to failure | ~14 minutes or less (detected at attempt 8, polling every 120s)

Direct link to the failing job's logs:
👉 https://github.com/sgl-project/sglang/actions/runs/23285539283 — look for stage-b-test-large-1-gpu (2) in the job list.

3. Suggested Fixes

Since the actual failure details are not in this job's logs, the immediate actions are:

Immediate (Diagnostic)

  1. Inspect the actual failing job logs: Navigate to stage-b-test-large-1-gpu (2) in this workflow run and examine:
    • The test execution step for pytest output / stack traces
    • Server launch logs for OOM or GPU errors
    • Any SGLANG_CUDA_COREDUMP artifacts if a GPU crash occurred
  2. Check PR [AMD] CI - Fix AMD CI (multimodal test, move flaky test to non-deterministic group) #20815 diff: Review what 27142c2 changed — particularly any modifications to large model tests, serving configs, or GPU memory handling that would affect single-GPU large model tests.
  3. Check if flaky: Search for prior failures of stage-b-test-large-1-gpu (2) on main to determine if this is a pre-existing flake or a regression introduced by PR [AMD] CI - Fix AMD CI (multimodal test, move flaky test to non-deterministic group) #20815.

If the failure is a test regression in PR #20815

  • Fix the code in the PR branch and re-push

If the failure is a flaky test

  • Re-run the failed job
  • If it recurs, open a separate issue to track the flaky test with [Flaky] label

Housekeeping (Non-blocking)

  • Node.js 20 deprecation: Update actions/checkout and actions/github-script to versions using Node.js 24 before June 2, 2026 (warning detected in cleanup step)

4. Priority

High — This blocks the entire stage-b gate (fail-fast stopped evaluation of 24 remaining jobs) and likely blocks PR #20815 from merging. However, priority should be reassessed after inspecting the actual failure: if it's a known flake, it drops to Medium.

5. Environment Context

Component | Value
Runner OS | Ubuntu 24.04.3 LTS
Runner image | ubuntu-24.04 v20260309.50.1
Git version | 2.53.0
Region | northcentralus (Azure)
PR | #20815 (merge commit 5252bd4)
CI env vars | SGLANG_IS_IN_CI=true, SGLANG_CUDA_COREDUMP=1, SGLANG_JIT_DEEPGEMM_FAST_WARMUP=true
GPU context | Large-1-GPU test shard (shard index 2 of 14) — likely AMD GPU based on project scope
Stage-b total | 27 jobs expected (8+14+4+1)

⚠️ Key action required: The root cause can only be determined by examining the logs of stage-b-test-large-1-gpu (2). This analysis confirms the gating infrastructure worked correctly; the problem is in the test job itself.


Job: build-test (all)

CI Failure Analysis: build-test (all) — CPU BMM FP8 Test Failure

1. Root Cause Analysis

The sgl_kernel CPU BMM kernel (torch.ops.sgl_kernel.bmm_cpu) does not implement FP8 weight support, but test_bmm.py unconditionally tests FP8 BMM paths on the CPU platform. This is a test–kernel feature mismatch: the test was written (or modified in PR #20815, branch fix_amd_ci_multimodaltest) without gating the FP8 BMM subtests behind a CPU capability check.

This is not an environment, dependency, or infrastructure issue. The kernel explicitly raises:

RuntimeError: bmm: do not support fp8 weight for now.

This is a deliberate guard in the C++/kernel code — the feature simply hasn't been implemented for the CPU backend yet.

2. Failure Details

Field | Value
Failed test file | test/srt/cpu/test_bmm.py
Failed test method | TestBmm.test_bmm (parameterized)
Number of errors | 96 errors across all parameter combinations
Exit code | 1 (test file), 255 (suite runner)
Suite behavior | continue_on_error=False → aborted after first file failure; 18 of 21 tests never ran

Error Message & Stack Trace

RuntimeError: bmm: do not support fp8 weight for now.

Call chain:

test_bmm.py:91  →  _fp8_bmm():67  →  torch.ops.sgl_kernel.bmm_cpu(mat3, mat1, mat2_q_t, False, mat2_s)

Parameter Space (all 96 combinations failed)

  • B ∈ {1, 16, 17}
  • M ∈ {1, 2, 11, 111}
  • N ∈ {160, 512}
  • K ∈ {160, 544}
  • chunk ∈ {True, False}

Tests Blocked (never executed)

test_causal_conv1d.py, test_cpu_graph.py, test_decode.py, test_extend.py, test_flash_attn.py, test_gemm.py, test_intel_amx_attention_backend_{a,b,c}.py, test_mamba.py, test_mla.py, test_moe.py, test_norm.py, test_qkv_proj_with_rope.py, test_qwen3.py, test_rope.py, test_shared_expert.py, test_topk.py

3. Suggested Fixes

Option A: Skip FP8 subtests on CPU (Recommended — quickest fix)

In test/srt/cpu/test_bmm.py, gate FP8 test paths:

import unittest

import torch

# In the test method or at parameterization level:
@unittest.skipIf(
    not torch.cuda.is_available(),  # or check SGLANG_USE_CPU_ENGINE
    "FP8 BMM not supported on CPU kernel"
)
def test_bmm_fp8(self, ...):
    ...

Or more precisely, inside _fp8_bmm() or at test dispatch:

import os

if os.environ.get("SGLANG_USE_CPU_ENGINE") == "1":
    self.skipTest("bmm_cpu does not support fp8 weights")

Option B: Implement FP8 support in the CPU BMM kernel

In sgl-kernel/src/sgl-kernel/cpu/bmm.cpp (or equivalent), add an FP8 code path. This is a larger effort and likely not the intent of PR #20815.

Option C: Separate test parameterization for CPU vs GPU

Refactor test_bmm.py to have distinct parameter sets:

  • GPU: includes FP8 weight tests
  • CPU: excludes FP8 weight tests (or tests only BF16/FP32)
import os

CPU_DTYPES = ["bf16", "fp32"]
GPU_DTYPES = ["bf16", "fp32", "fp8"]

dtypes = CPU_DTYPES if os.environ.get("SGLANG_USE_CPU_ENGINE") == "1" else GPU_DTYPES

Additional: Enable continue_on_error or fix test ordering

Consider setting continue_on_error=True in the suite runner so that one early failure doesn't block 18 other tests from providing signal. This won't fix the root cause but improves CI observability.

4. Priority

High

5. Environment Context

Component | Value
Platform | Intel Xeon (CPU-only, SGLANG_USE_CPU_ENGINE=1)
CPU Feature | Intel AMX confirmed available
Docker image | sglang_xeon from docker/xeon.Dockerfile (Ubuntu 24.04)
Python | 3.12.13 (via uv venv)
PyTorch | 2.9.0+cpu
sglang | 0.5.6.post3.dev3008+g27142c2ee (branch fix_amd_ci_multimodaltest)
sglang-kernel-cpu | 0.4.0 (built from source)
Triton | 3.5.0 (not functional on CPU — expected)
Runner | gnr88001599-1 / sdp
PR | #20815

Cross-Job Analysis

Unified CI Failure Analysis — PR #20815 (fix_amd_ci_multimodaltest)

1. Common Patterns

Pattern | Jobs Affected | Shared?
Port conflict / stale processes on persistent runners | stage-b-test-large-1-gpu (0) | No — infra-specific
MiniCPM model loading crash (likely transformers 5.3.0 or HF cache) | stage-b-test-large-1-gpu (2) | No — model-specific
FP8 BMM not implemented on CPU | build-test (all) | No — test-code bug
Gating job propagating upstream failure | wait-for-stage-b | Derivative only
Single failure blocking remaining tests in suite | All 3 real failures | ✅ Yes — continue_on_error=False amplifies every failure

These are three independent root causes. There is no single broken dependency, shared infra issue, or common code change that unifies them. The only shared pattern is that each failure is amplified by the test runner's continue_on_error=False behavior, turning one broken test file into a full suite wipeout.

2. Cross-Job Dependencies

stage-b-test-large-1-gpu (0)  ──[independent]──  stage-b-test-large-1-gpu (2)
                                                           │
                                                           ▼
                                                   wait-for-stage-b  (derivative — gating job)
                                                           │
                                                           ▼
                                                   All remaining stage-b jobs killed (24 jobs lost signal)

build-test (all)  ──[independent, separate pipeline]
  • wait-for-stage-b is purely derivative — it failed because it correctly detected stage-b-test-large-1-gpu (2) failing. Fixing job (2) fixes the gate.
  • stage-b-test-large-1-gpu (0) and (2) run on different runners with different models. No causal link.
  • build-test (all) is a CPU-only build pipeline on an Intel Xeon runner — completely isolated from the H100 GPU jobs.

3. Unified Root Cause Assessment

There is no single unified root cause. The three real failures are:

# | Category | Root Cause | Introduced By
1 | Infra | Zombie process holding port 11000 on persistent Docker runner | Pre-existing runner state (not PR code)
2 | Compatibility | MiniCPM models fail to load — likely transformers 5.3.0 regression or incomplete HF cache for trust_remote_code models | Environment / dependency drift
3 | Test bug | test_bmm.py calls FP8 BMM on CPU, which the kernel explicitly doesn't support | PR #20815 code or pre-existing test gap

The PR branch name (fix_amd_ci_multimodaltest) suggests the author was fixing AMD CI for multimodal tests — the CPU BMM FP8 failure (#3) may be a pre-existing issue newly exposed by this PR's test scope changes, or a test file added/modified without CPU gating.

4. Priority Ranking

Priority | Job | Rationale | Fix Effort
🔴 P1 | build-test (all) — FP8 BMM | Blocks PR merge. Directly fixable in PR code. 96 errors, 18 tests blocked. Simple skip/gate fix. | 🟢 ~10 min
🟠 P2 | stage-b-test-large-1-gpu (2) — MiniCPM | Blocks stage-b gate (kills 24 jobs via wait-for-stage-b). Needs investigation — either fix HF cache, pin transformers, or skip MiniCPM tests. | 🟡 ~1–2 hrs
🟡 P3 | stage-b-test-large-1-gpu (0) — Port conflict | Flaky infra issue, not caused by PR code. Add fuser -k cleanup step. Won't recur on a different runner. | 🟢 ~15 min (workflow change)
⚪ P4 | wait-for-stage-b | No action needed — automatically resolves when P2 is fixed. | N/A

5. Overall Recommendation

PR #20815 has three independent failures that should be addressed in order: (1) add a CPU/FP8 skip guard in test_bmm.py — this is a ~10-minute code fix that unblocks the entire CPU test suite and is almost certainly within the PR author's scope; (2) investigate the MiniCPM server crash on stage-b-test-large-1-gpu (2) by examining its server stderr logs, then either re-download the HF cache with trust_remote_code files, pin transformers<5.0, or temporarily skip MiniCPM tests — this is the highest-impact fix since the fail-fast gate killed 24 downstream jobs; (3) add a fuser -k <port>/tcp pre-cleanup step to the CI workflow to prevent port conflicts on persistent runners, which is a one-line workflow fix that prevents future flakes. None of these failures share a common root cause, so they must be addressed independently — but the good news is that each has a straightforward, well-scoped fix.


Automated CI analysis by amd-bot — progressive step analysis

@bingxche
Collaborator

@bingxche requested a review

Claude Code Review

PR #20815: [AMD] CI - Fix AMD CI (multimodal test, move flaky test to non-deterministic group)
Reviewed at 2026-03-19 09:22 UTC

Code Review: [AMD] CI - Fix AMD CI (multimodal test, move flaky test to non-deterministic group)

1. Summary

This PR makes three changes to fix AMD CI stability:

  1. Increases the number of partitions for multimodal/diffusion server tests from 2 to 4 (reducing tests per partition to avoid timeouts or resource exhaustion).
  2. Moves test_vlm_models.py from the deterministic AMD CI suite to the nondeterministic suite (acknowledging flaky behavior).
  3. Moves test_multi_lora_backend.py from the nondeterministic AMD CI suite back to the deterministic suite.

2. Code Quality

Bug / Inconsistency

  • .github/workflows/pr-test-amd.yml, line 542: The comment was not updated after changing the partition count:
    part: [0, 1, 2, 3]  # 2 partitions: 11 tests ÷ 2 = ~5-6 tests each
    This is misleading — it now has 4 partitions, not 2. The comment should read something like # 4 partitions: 11 tests ÷ 4 = ~2-3 tests each. The ROCm 7.2 workflow file correctly removed the stale comment.

Timeout Adjustment

  • .github/workflows/pr-test-amd.yml, line 624: The timeout was increased from 70 to 90 minutes. With 4 partitions (fewer tests per job) and max-parallel: 1, each partition should run fewer tests. A 90-minute timeout for ~2-3 tests seems generous but safe. Worth confirming this is intentional and not masking a deeper performance issue.

  • .github/workflows/pr-test-amd-rocm720.yml: The timeout was not similarly increased. This could be intentional (ROCm 7.2 may be faster) or an oversight. Worth confirming consistency.

Test Suite Moves

  • Moving test_vlm_models.py to nondeterministic and test_multi_lora_backend.py to deterministic appears reasonable given the linked CI results showing passing tests. However, there's no documentation or comment explaining why VLM tests are flaky on AMD — this context would help future maintainers.

3. Performance

  • Increasing partitions from 2 to 4 with max-parallel: 1 means the total wall-clock time for the diffusion test job will increase (4 sequential jobs instead of 2). This is a tradeoff of CI reliability vs. CI speed. Given the motivation (avoiding resource exhaustion and evictions), this seems acceptable.
  • No impact on serving/inference performance.

4. Security

No security concerns.

5. Testing

6. Suggestions

  1. Fix the stale comment in pr-test-amd.yml:

    # Before (incorrect):
    part: [0, 1, 2, 3]  # 2 partitions: 11 tests ÷ 2 = ~5-6 tests each
    
    # After (correct):
    part: [0, 1, 2, 3]  # 4 partitions: 11 tests ÷ 4 = ~2-3 tests each
  2. Consider aligning timeout between the two workflow files. If pr-test-amd.yml needs 90 minutes, confirm pr-test-amd-rocm720.yml doesn't also need it (currently unchanged at its existing value — appears to be 70 minutes based on the diff context).

  3. Add a brief comment explaining VLM flakiness on AMD in test_vlm_models.py:

    # VLM tests are nondeterministic on AMD due to [brief reason, e.g., numerical precision differences]
    register_amd_ci(est_time=850, suite="stage-b-test-small-1-gpu-amd-nondeterministic")
  4. Minor: Consider whether max-parallel: 1 is still strictly necessary with 4 smaller partitions — if each partition uses fewer resources, it might be possible to run 2 in parallel, reducing overall CI time. (This is a follow-up consideration, not blocking.)

7. Overall Assessment

Approve (with minor nit on stale comment)

The changes are straightforward CI fixes: splitting tests into more partitions to reduce per-job resource pressure, adjusting a timeout, and correctly categorizing flaky vs. stable tests. The linked CI run validates the changes. The only actionable issue is the stale comment in pr-test-amd.yml which should be fixed before merge.


Automated review by amd-bot using Claude. This is an AI-generated review — please use your judgment.

@bingxche
Collaborator

@bingxche requested a review

Claude Code Review

PR #20815: [AMD] CI - Fix AMD CI (multimodal test, move flaky test to non-deterministic group)
Reviewed at 2026-03-19 09:23 UTC

Code Review: [AMD] CI - Fix AMD CI (multimodal test, move flaky test to non-deterministic group)

1. Summary

This PR makes three targeted changes to fix AMD CI stability:

  1. Increases the number of partitions for multimodal/diffusion server tests from 2 to 4 (reducing tests per partition), likely to avoid OOM or timeout issues on AMD MI325 GPUs.
  2. Moves test_vlm_models.py from the deterministic AMD CI suite to the non-deterministic suite (acknowledging flaky behavior).
  3. Moves test_multi_lora_backend.py from the non-deterministic suite back to the deterministic suite (indicating it's now stable).

2. Code Quality

Minor Issue: Stale Comment

File: .github/workflows/pr-test-amd.yml, line 542

part: [0, 1, 2, 3]  # 2 partitions: 11 tests ÷ 2 = ~5-6 tests each

The comment still says "2 partitions: 11 tests ÷ 2 = ~5-6 tests each" but now there are 4 partitions. This should be updated to something like # 4 partitions: 11 tests ÷ 4 = ~2-3 tests each. The ROCm 7.2.0 workflow correctly removed the old comment.

Consistency

The ROCm 7.2.0 workflow (pr-test-amd-rocm720.yml) does not have a timeout increase, while pr-test-amd.yml increases the timeout from 70 to 90 minutes. This seems intentional (different hardware/software performance profiles), but worth confirming.

Logic

  • Increasing partitions from 2→4 while keeping max-parallel: 1 means the total wall-clock time for the diffusion test job increases (4 sequential runs instead of 2), but each individual run processes fewer tests, reducing memory pressure and per-partition timeout risk. This is a reasonable trade-off for CI stability.

3. Performance

No impact on serving/inference performance. This is purely a CI configuration change. The max-parallel: 1 constraint means the 4 partitions run sequentially, which will increase total CI time but improve reliability.

4. Security

No security concerns. CI workflow changes don't affect the runtime codebase.

5. Testing

6. Suggestions

  1. Fix stale comment in pr-test-amd.yml:

    # Before (incorrect):
    part: [0, 1, 2, 3]  # 2 partitions: 11 tests ÷ 2 = ~5-6 tests each
    
    # After (correct):
    part: [0, 1, 2, 3]  # 4 partitions: 11 tests ÷ 4 = ~2-3 tests each
  2. Consider adding the timeout increase to pr-test-amd-rocm720.yml as well if the same timeout pressure applies there, or add a comment explaining why it's not needed.

  3. Consider adding a brief comment in test_vlm_models.py explaining why it's in the nondeterministic suite for AMD, to help future contributors understand:

    # VLM tests are non-deterministic on AMD due to [brief reason, e.g., "output variance on MI325"]
    register_amd_ci(est_time=850, suite="stage-b-test-small-1-gpu-amd-nondeterministic")

7. Overall Assessment

Approve — with a minor nit on the stale comment.

This is a straightforward, well-scoped CI fix. The changes are logical: splitting partitions to reduce per-run load, increasing timeout for the larger workload, and correctly categorizing flaky vs. stable tests. The linked CI run demonstrates the fix works. The only actionable item is fixing the outdated comment in pr-test-amd.yml.


Automated review by amd-bot using Claude. This is an AI-generated review — please use your judgment.

@bingxche
Collaborator

@yctseng0211 requested CI status check

CI Status for PR #20815

Total: 81 checks

  • Passed: 30
  • Failed: 4
  • Pending: 17

Failed Checks:


Detailed Analysis

Method: Progressive step-by-step analysis (all steps examined per job)

Job: stage-b-test-large-1-gpu (0)

CI Failure Analysis: stage-b-test-large-1-gpu (0)

1. Root Cause Analysis

The test registered/spec/eagle/test_eagle_infer_b.py failed because the EAGLE speculative decoding server could not bind to port 11000 — the port was already in use by another process. This happened identically across all 3 retry attempts, indicating a persistent stale process or another service occupying the port on the runner.

Why the port was occupied: The most likely scenario is that a previous CI job (or a previous test within a prior run) on the same runner (h100-radixark-host1-gpu-3) left behind a zombie/orphan server process still bound to port 11000. The runner reuses the same Docker container (8e1df5b34166), and the cleanup between jobs did not kill lingering processes listening on that port. Each retry attempt within the test also compounded the problem — after the first attempt's server started and hit the bind error, the server process itself may not have been fully killed before the next attempt launched.

Contributing factor: py-spy (used for debugging stuck processes) failed with Permission denied (os error 13), preventing the watchdog from collecting useful diagnostic stack traces of the process holding the port.

2. Failure Details

Failed Test

  • File: registered/spec/eagle/test_eagle_infer_b.py
  • Failure mode: TIMEOUT (1800s hard limit exceeded)

Error Message (repeated 3 times)

ERROR: [Errno 98] error while attempting to bind on address ('127.0.0.1', 11000): address already in use

Failure Timeline (all 3 attempts identical pattern)

| Attempt | Server Start | Bind Error | Watchdog Timeout |
|---|---|---|---|
| 1 | 08:41:03 | 08:41:33 (~30s later) | 08:46:04 (300s soft) |
| 2 | 08:51:03 | 08:51:25 (~22s later) | 08:56:05 (300s soft) |
| 3 | 09:01:04 | 09:01:29 (~25s later) | 09:06:05 (300s soft) |

Cascading Impact

The 1800s timeout on the first test (3 × 300s watchdog + startup overhead) prevented the remaining 2 tests from executing:

  • registered/lora/test_lora_overlap_loading.py — SKIPPED (never ran)
  • registered/model_loading/test_utils_update_weights.py — SKIPPED (never ran)

Secondary Error

py-spy: Permission denied (os error 13)

py-spy could not attach to the stuck process to gather diagnostics, likely due to ptrace restrictions in the container (SYS_PTRACE capability not granted or kernel.yama.ptrace_scope > 0).

3. Suggested Fixes

Immediate / Short-term

  1. Kill stale processes before test execution — Add a pre-test cleanup step in ci_install_dependency.sh or at the start of run_suite.py:

    # Kill any process listening on common test ports
    for port in 11000 11001 11002 11003 30000 30001; do
      fuser -k ${port}/tcp 2>/dev/null || true
    done
  2. Use dynamic/random ports in tests — Modify test_eagle_infer_b.py (and similar tests) to use port=0 or a randomly selected free port instead of hardcoded 11000:

    import socket
    def get_free_port():
        with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
            s.bind(('', 0))
            return s.getsockname()[1]
  3. Add port cleanup between retry attempts — In the test's retry logic, explicitly kill any process on the target port before retrying:

    import subprocess
    import time

    subprocess.run(["fuser", "-k", f"{port}/tcp"], capture_output=True)
    time.sleep(2)  # Allow socket TIME_WAIT to clear
  4. Enable SO_REUSEADDR / SO_REUSEPORT — If using uvicorn/uvloop, ensure the server socket sets SO_REUSEADDR:

    # In server launch config
    uvicorn.run(..., host="127.0.0.1", port=port)
    # uvicorn sets SO_REUSEADDR by default, but verify it's not overridden
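A quick way to see why fix 4 alone cannot resolve this: on Linux, SO_REUSEADDR only permits rebinding over a socket lingering in TIME_WAIT, not over a live listener, so a stale server process still triggers EADDRINUSE. A self-contained sketch:

```python
import socket

def can_bind(port: int) -> bool:
    """Try to bind a fresh SO_REUSEADDR socket to `port` on localhost."""
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    try:
        s.bind(("127.0.0.1", port))
        return True
    except OSError:  # EADDRINUSE
        return False
    finally:
        s.close()

# Simulate the stale server: hold a LISTEN socket on an ephemeral port.
holder = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
holder.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
holder.bind(("127.0.0.1", 0))
holder.listen(1)
port = holder.getsockname()[1]

blocked = not can_bind(port)
print(blocked)  # True: SO_REUSEADDR does not override an active LISTEN socket
holder.close()
```

This is why the cleanup in fixes 1 and 3 (killing the stale process) remains necessary regardless of socket options.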

Medium-term

  1. Grant SYS_PTRACE capability to the CI container so py-spy can attach and provide useful diagnostics:

    # In the runner's Docker configuration
    docker run --cap-add SYS_PTRACE ...

    Or set the sysctl:

    echo 0 > /proc/sys/kernel/yama/ptrace_scope
  2. Add a pre-test port availability check in the test harness (run_suite.py):

    import socket
    import subprocess

    def assert_port_free(port, host="127.0.0.1"):
        with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
            result = s.connect_ex((host, port))
            if result == 0:
                # Port is in use — find and log the offending process
                subprocess.run(["lsof", "-i", f":{port}"])
                raise RuntimeError(f"Port {port} is already in use")
  3. Restart the Docker container between CI jobs on h100-radixark-host1-gpu-3 to ensure a clean process namespace, or use --rm flag so containers are ephemeral.

Long-term

  1. Refactor test infrastructure to use ephemeral port allocation project-wide, eliminating hardcoded ports as a class of CI failures.

4. Priority

High 🔴

  • 3 out of 3 tests failed in this partition (1 timeout + 2 never ran)
  • This is an infrastructure/environment issue, not a code bug — it will likely recur on the same runner
  • The port conflict pattern is deterministic and will block all EAGLE speculative decoding tests until resolved
  • No useful diagnostics can be collected due to py-spy permission issue

5. Environment Context

| Detail | Value |
|---|---|
| Runner | h100-radixark-host1-gpu-3 |
| Container | 8e1df5b34166 (persistent, not ephemeral) |
| GPU | NVIDIA H100 80GB HBM3 (sm_90) |
| CUDA Driver | 580.126.09 |
| PyTorch | 2.9.1 (CUDA 12.8) |
| sglang | 0.0.0.dev1+g5252bd422 |
| FlashInfer | 0.6.6 |
| Python | 3.10.12 |
| OS | Ubuntu 22.04 |
| Commit | 5252bd4222d72a32e9c14e5f393c9ed0dac239fb |
| Conflicting Port | 127.0.0.1:11000 (EADDRINUSE across all 3 attempts) |
| Test Suite | stage-b-test-large-1-gpu, partition 0/14 |
| Timeout | 1800s per file, 300s watchdog per server launch attempt |

Job: stage-b-test-large-1-gpu (2)

CI Failure Analysis: stage-b-test-large-1-gpu (2)

1. Root Cause Analysis

Two VLM models — openbmb/MiniCPM-V-4 and openbmb/MiniCPM-o-2_6 — fail to start their inference servers (exit code 1) during test setup.

The most likely root cause is incompatibility between these MiniCPM models and transformers==5.3.0. Several clues support this:

  • HuggingFace cache validation in Step 19 flagged 32 FAILs, including models missing trust_remote_code files — MiniCPM models are known to rely heavily on custom modeling code via trust_remote_code=True.
  • Transformers 5.3.0 is very new (bleeding-edge) and introduced RoPE compatibility warnings across all models tested. MiniCPM's custom code may reference internal transformers APIs that changed or were removed in v5.x.
  • The easydict module was reported missing (for DeepSeek-OCR), indicating the environment may be missing optional dependencies that custom model code requires. MiniCPM models similarly depend on custom tokenizer/processor code that may import packages not installed in this environment.
  • Both failures are Python-level (exit code 1, no CUDA coredumps generated), consistent with an import error or model initialization crash rather than a GPU issue.
  • The CI retried with HF_HUB_OFFLINE=0 (re-downloading model files), which also failed — ruling out a simple cache corruption issue and pointing to a code-level incompatibility.

2. Failure Details

Failed Tests

| Test Class | Model | Error |
|---|---|---|
| TestMiniCPMV4Server | openbmb/MiniCPM-V-4 | setUpClass — Server process exited with code 1 |
| TestMiniCPMo26Server | openbmb/MiniCPM-o-2_6 | setUpClass — Server process exited with code 1 |

Error Messages

ERROR: setUpClass (__main__.TestMiniCPMV4Server)
Exception: Server process exited with code 1. Check server logs for errors.

ERROR: setUpClass (__main__.TestMiniCPMo26Server)
Exception: Server process exited with code 1. Check server logs for errors.

Test File Results

  • File: test/srt/models/test_vision_openai_server_a.py
  • 51 tests ran in 535.2s: 43 passed, 2 errors, 6 skipped
  • Exit code: 1 (file-level failure)
  • Suite exit code: 255 (0/4 test files passed — remaining 3 files likely didn't execute due to early abort or are not shown)

Key Warnings

  • Missing MoE kernel configs for E=64,N=896 and E=128,N=768 on H100 (triton 3.5.1)
  • easydict module not found (DeepSeek-OCR)
  • Transformers 5.3.0 RoPE compatibility warnings across all models
  • Multimodal embedding cache full warnings for video inputs

3. Suggested Fixes

Immediate (unblock CI)

  1. Retrieve server logs for MiniCPM models — The current error message says "Check server logs for errors" but the actual server stderr/stdout is not captured in CI output. Modify the test harness or CI script to dump server logs on failure:

    # In test setup, capture and print server stderr on crash.
    # (Requires launching the server with stderr captured, e.g.
    #  process = subprocess.Popen(cmd, stderr=subprocess.PIPE, text=True))
    if process.returncode != 0:
        print(f"SERVER STDERR:\n{process.stderr.read()}")
  2. Pin transformers to a known-good version until MiniCPM compatibility is confirmed:

    pip install transformers==4.51.0
  3. Install missing optional dependencies that MiniCPM custom code may require:

    pip install easydict timm
  4. Skip MiniCPM tests temporarily if blocking other CI work (add to skip list in run_suite.py or the test file):

    @unittest.skip("MiniCPM-V-4 incompatible with transformers 5.3.0 — see #XXXX")
    class TestMiniCPMV4Server(...):
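If pinning (fix 2) is undesirable, fixes 2 and 4 can be combined into a version-gated skip. A sketch, assuming the 5.x major version is the breaking boundary (unconfirmed) and using a hypothetical helper:

```python
import unittest

def transformers_major(version: str) -> int:
    """Parse the major component of a transformers version string."""
    return int(version.split(".")[0])

def skip_if_incompatible(version: str):
    """Return a skip decorator when transformers is 5.x or newer (the
    assumed boundary for the MiniCPM trust_remote_code breakage)."""
    return unittest.skipIf(
        transformers_major(version) >= 5,
        f"MiniCPM custom code untested against transformers {version}",
    )

print(transformers_major("5.3.0"))   # 5
print(transformers_major("4.51.0"))  # 4
```

The decorator would then be applied as @skip_if_incompatible(transformers.__version__) on the MiniCPM test classes, keeping the rest of the VLM suite running.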

Medium-term

  1. Add transformers version compatibility matrix — Test MiniCPM models against transformers 4.x vs 5.x to identify the breaking version boundary.

  2. Improve server crash diagnostics — The test framework should always capture and surface server process logs when exit code != 0, not just say "check server logs."

  3. Fix the suite runner — The suite reports 0/4 files passed, suggesting the 3 other test files (test_anthropic_tool_use.py, test_session_latency.py, test_mrope.py) may not have executed at all after the first file failed. Verify the runner continues to subsequent files on failure.

4. Priority

High

  • The failures are reproducible (both offline and online retries failed)
  • They block the entire partition (exit code 255, 0/4 files passed)
  • MiniCPM is a supported model family in sglang
  • However, this is not a regression in sglang core — it's likely a dependency version issue, so not Critical

5. Environment Context

| Component | Value |
|---|---|
| GPU | NVIDIA H100 80GB HBM3 (sm_90) |
| Driver | 580.126.09 |
| CUDA | 13.0 (torch built for 12.8) |
| Python | 3.10.12 |
| PyTorch | 2.9.1+cu128 |
| transformers | 5.3.0 ⚠️ (bleeding edge) |
| sglang | 0.0.0.dev1+g5252bd422 |
| sglang-kernel | 0.4.0 |
| FlashInfer | 0.6.6 |
| flash-attn-4 | 4.0.0b4 |
| Triton | 3.5.1 |
| torchao | 0.9.0 |
| Runner | h100-radixark-host1-gpu-6 |
| Commit | 5252bd4222d72a32e9c14e5f393c9ed0dac239fb |
| Date | 2026-03-19 |

Key observation: transformers==5.3.0 is an unusually new version. The RoPE warnings across all models and the MiniCPM crashes strongly suggest this is the primary environmental factor causing the failures.


Job: wait-for-stage-b

CI Failure Analysis: wait-for-stage-b — Job stage-b-test-large-1-gpu (2) Failed

1. Root Cause Analysis

The wait-for-stage-b job is a gate/polling job — it did not fail itself. It correctly detected that a downstream job, stage-b-test-large-1-gpu (2), completed with conclusion=failure and triggered its fail-fast mechanism.

The root cause is NOT in this job's logs. This job only monitors other jobs via the GitHub API. The actual failure occurred in:

stage-b-test-large-1-gpu (2) — matrix index 2 of the stage-b-test-large-1-gpu job group (14 total jobs)

The logs from this polling job provide no information about what failed inside stage-b-test-large-1-gpu (2) — no error messages, stack traces, or test output are surfaced here. The failure could be:

  • A test failure in one of the large-1-GPU test suites
  • A model loading / OOM issue on the GPU runner
  • A server startup timeout
  • An infrastructure issue (container crash, GPU fault, network timeout)
  • A flaky test

Without the logs from stage-b-test-large-1-gpu (2), the specific root cause cannot be determined from this job alone.

2. Failure Details

| Field | Value |
|---|---|
| Failed job | stage-b-test-large-1-gpu (2) |
| Detection method | Fail-fast in wait-for-stage-b polling script |
| Error message | ##[error]stage-b jobs failed: stage-b-test-large-1-gpu (2) |
| Polling attempt | Attempt 8 (~14 min into monitoring) |
| Jobs completed at failure | 3 / 27 |
| Other completed jobs | stage-b-test-small-1-gpu (0) ✅, stage-b-test-4-gpu-b200 |
| Remaining jobs | 24 jobs not yet completed (skipped due to fail-fast) |
| Specific error/stack trace | Not available in this job's logs |

3. Suggested Fixes

Immediate Actions

  1. Inspect the actual failed job's logs:

    • Navigate to: Workflow Run #23285539283
    • Find and expand stage-b-test-large-1-gpu (2)
    • Look at the test execution step for error messages, stack traces, and which specific test(s) failed
  2. Check if this is a flaky failure:

    • Search recent CI runs for other failures in stage-b-test-large-1-gpu (2) specifically
    • If the same matrix index fails intermittently, identify the specific test file assigned to index (2) in the matrix configuration and check for known flaky tests
  3. Re-run the failed job:

Investigation Path (in the failed job's logs)

Look for:
- pytest output with FAILED/ERROR markers
- CUDA OOM errors ("OutOfMemoryError")
- Server startup timeouts ("Timeout waiting for server")
- CUDA coredumps (SGLANG_CUDA_COREDUMP=1 is enabled)
- Container/process crashes (exit code != 0)

If This Is a PR Regression (PR #20815)

  • Compare the test results against the main branch (cb8105f)
  • Identify what changed in commit 27142c2 that could affect large-1-GPU test scenarios
  • The merge commit is 5252bd4

4. Priority

High — This blocks the entire stage-b gate for PR #20815. The fail-fast behavior means 24 of 27 jobs were effectively abandoned. However, until the actual failure in stage-b-test-large-1-gpu (2) is examined, it's unclear whether this is a real regression or a transient/infrastructure issue.

5. Environment Context

| Variable | Value |
|---|---|
| Runner OS | Ubuntu 24.04.3 LTS |
| Runner Image | ubuntu-24.04 / 20260309.50.1 |
| Git | 2.53.0 |
| PR | #20815 (merge commit 5252bd4) |
| PR head | 27142c2eec85718a0927303c4ee9eb86382d2b7e |
| Base | cb8105fe282fc373b5baed63d5df38682418a373 |
| SGLANG_IS_IN_CI | true |
| SGLANG_CUDA_COREDUMP | 1 (coredumps enabled — check for dumps in failed job) |
| SGLANG_JIT_DEEPGEMM_FAST_WARMUP | true |
| Node.js deprecation | actions/checkout@v4 and actions/github-script@v7 using Node.js 20 (EOL June 2, 2026) — not related to failure |

⚠️ Action Required: The actual root cause can only be determined by examining the logs of stage-b-test-large-1-gpu (2). This analysis confirms the wait-for-stage-b gate job operated correctly — it is purely a messenger of the downstream failure.


Job: build-test (all)

CI Failure Analysis: build-test (all) — CPU FP8 BMM Kernel Not Implemented

1. Root Cause Analysis

The test suite per-commit-cpu fails because test_bmm.py exercises FP8 (8-bit floating point) batch matrix multiplication on the CPU kernel, which explicitly does not support FP8 weights. The CPU implementation of sgl_kernel.bmm_cpu raises a hard error when called with FP8-quantized inputs. This is a known, intentional limitation in the CPU kernel ("bmm: do not support fp8 weight for now."), but the test suite does not skip or guard against this unsupported code path on CPU.

Because the suite runs with continue_on_error=False and enable_retry=False, the first test file failure (test_bmm.py) causes an immediate abort — 19 of 21 test files were never executed, making the failure impact appear worse than the underlying issue.

This is not a regression from PR #20815 (fix_amd_ci_multimodaltest). It is a pre-existing gap between test coverage expectations and CPU kernel capabilities.

2. Failure Details

Failed Test

| File | Tests | Errors | Exit Code |
|---|---|---|---|
| cpu/test_bmm.py | 1 test (96 parameterized subtests) | 96 | 1 |

Error Message

RuntimeError: bmm: do not support fp8 weight for now.

Stack Trace

File "test_bmm.py", line 67, in _fp8_bmm
    torch.ops.sgl_kernel.bmm_cpu(mat3, mat1, mat2_q_t, False, mat2_s)
RuntimeError: bmm: do not support fp8 weight for now.

Parameter Space (all 96 combinations failed identically)

  • B (batch): {1, 16, 17}
  • M: {1, 2, 11, 111}
  • N: {160, 512}
  • K: {160, 544}
  • chunk: {True, False}

Suite Execution Summary

Passed:  2/21  (cpu/test_activation.py, cpu/test_binding.py)
Failed:  1/21  (cpu/test_bmm.py)
Skipped: 19/21 (never reached — early abort)
Exit code: 255

Non-Fatal Warnings

  • Triton is not supported on current platform, roll back to CPU
  • Only CUDA, HIP and XPU support AWQ currently
  • Only CUDA and MUSA support GGUF quantization currently
  • numa_migrate_pages failed / get_mempolicy: Operation not permitted (Docker NUMA policy restriction)

3. Suggested Fixes

Option A: Skip FP8 tests on CPU (quick fix, recommended)

In test/srt/cpu/test_bmm.py, guard the FP8 test path:

import unittest
import torch

# At the test method or class level:
@unittest.skipUnless(
    torch.cuda.is_available(),
    "FP8 BMM is not supported on CPU kernels"
)
def test_fp8_bmm(self):
    ...

Or, if the test is parameterized and mixes FP8/non-FP8 paths, add an inline skip:

def _fp8_bmm(self, B, M, N, K, chunk):
    if not torch.cuda.is_available():
        self.skipTest("bmm: FP8 weight not supported on CPU")
    ...

Option B: Enable continue_on_error for the CPU suite

In run_suite.py or the suite configuration for per-commit-cpu, set continue_on_error=True so that subsequent test files still execute even when one file fails. This doesn't fix the root issue but prevents 19 tests from being silently skipped.
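The behavioral difference is easy to state precisely. A small sketch (run_suite.py's real internals are assumed; run_one is a stand-in for executing one test file):

```python
def run_files(files, run_one, continue_on_error=True):
    """Run each test file via run_one (returns an exit code) and collect
    failures. With continue_on_error=False, abort at the first failing
    file, which is the per-commit-cpu behavior that left 19/21 files
    unexecuted."""
    failed = []
    for f in files:
        if run_one(f) != 0:
            failed.append(f)
            if not continue_on_error:
                break
    return failed

# Stub exit codes: b.py and c.py both fail.
codes = {"a.py": 0, "b.py": 1, "c.py": 1}
print(run_files(list(codes), codes.get, continue_on_error=False))  # ['b.py']
print(run_files(list(codes), codes.get, continue_on_error=True))   # ['b.py', 'c.py']
```

With continue_on_error=True the suite still fails, but every file runs and the report shows all failures rather than only the first.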

Option C: Implement FP8 BMM support in the CPU kernel (longer-term)

The error string "do not support fp8 weight for now" suggests this is a planned feature. Track implementation in a separate issue for the sgl_kernel_cpu package.

Option D: Remove test_bmm.py from the per-commit-cpu suite

If FP8 BMM on CPU is not planned near-term, exclude the file from the suite definition until the kernel supports it.

4. Priority

Medium

  • The failure is deterministic and blocks the entire CPU test suite (19/21 tests never run), masking potential real regressions.
  • However, it is not a regression from this PR — it's a pre-existing test/kernel mismatch.
  • The fix is trivial (Option A: add a skipTest guard).

5. Environment Context

| Component | Value |
|---|---|
| Platform | CPU (Xeon), no GPU |
| CPU Feature | AMX tile instructions confirmed available |
| Docker Image | sglang_xeon (Ubuntu 24.04) |
| Python | 3.12.13 (via uv 0.10.11) |
| PyTorch | 2.9.0+cpu |
| sglang | 0.5.6.post3.dev3008+g27142c2ee |
| sglang-kernel-cpu | 0.4.0 (built from source) |
| Triton | 3.5.0 (not functional on CPU — falls back) |
| PR Branch | fix_amd_ci_multimodaltest (#20815) |
| Runner | gnr88001599-1 / sdp |
| Suite Config | per-commit-cpu, continue_on_error=False, enable_retry=False, timeout 1500s/file |

Cross-Job Analysis

Unified CI Failure Analysis — PR #20815 (fix_amd_ci_multimodaltest)

1. Common Patterns

| Pattern | Jobs Affected | Nature |
|---|---|---|
| Environment/infra issue, not code regression | All 4 | None of the failures trace back to code changes in PR #20815 |
| Pre-existing issues exposed by CI environment | Jobs 0, 2, build-test | Stale processes, dependency skew, missing test guards |
| Cascading abort hiding true test coverage | Jobs 0, build-test | Early failures prevent remaining tests from running (19/21 skipped in CPU; 2/3 skipped in Job 0) |
| wait-for-stage-b is purely derivative | wait-for-stage-b | Gate job correctly propagated Job 2's failure — not an independent issue |

The failures are NOT related to each other. They share no common root cause. Each is an independent issue:

| Job | Category | Root Cause |
|---|---|---|
| Job 0 | Runner infra | Stale process holding port 11000 |
| Job 2 | Dependency skew | transformers==5.3.0 breaks MiniCPM custom model code |
| wait-for-stage-b | Derivative | Mirrors Job 2's failure |
| build-test (all) | Test gap | FP8 BMM test not guarded for CPU-only execution |

2. Cross-Job Dependencies

stage-b-test-large-1-gpu (2) ──FAILED──► wait-for-stage-b ──FAILED──► 24 jobs ABANDONED
  • wait-for-stage-b is a direct consequence of Job 2's failure. Fixing Job 2 eliminates this failure entirely.
  • Job 0 and Job 2 are independent matrix partitions on different runners (gpu-3 vs gpu-6) — no causal relationship.
  • build-test (all) runs on a CPU-only runner (gnr88001599-1) and is completely independent of the GPU jobs.
  • The fail-fast in wait-for-stage-b caused 24 of 27 stage-b jobs to be abandoned, massively amplifying the blast radius of Job 2's failure.

3. Unified Root Cause

There is no single unified root cause. These are three independent failures:

| # | Cause | Scope |
|---|---|---|
| 1 | Port 11000 occupied by orphan process on h100-radixark-host1-gpu-3 | Runner-specific infra |
| 2 | transformers==5.3.0 incompatible with MiniCPM custom modeling code | Environment-wide dependency |
| 3 | test_bmm.py calls FP8 kernel on CPU without skip guard | Test code gap |

If forced to identify a theme: the CI environment is fragile — hardcoded ports, bleeding-edge unpinned dependencies, and missing test guards all independently cause failures that are not caught by the test framework before cascading into suite-level aborts.

4. Priority Ranking

| Priority | Job | Fix Effort | Blast Radius | Action |
|---|---|---|---|---|
| 🔴 P0 | Job 2 (MiniCPM / transformers 5.3.0) | Medium | Critical — kills wait-for-stage-b → abandons 24 jobs | Pin transformers<5.0 or patch MiniCPM model code for v5 compatibility |
| 🟠 P1 | Job 0 (port 11000 conflict) | Low | High — 3 tests blocked, will recur on same runner | Add fuser -k 11000/tcp pre-test cleanup; migrate to dynamic ports |
| 🟡 P2 | build-test (FP8 BMM on CPU) | Trivial | Medium — 19/21 CPU tests never run | Add skipUnless(torch.cuda.is_available()) to FP8 test path |
| P3 | wait-for-stage-b | None | N/A — derivative | Automatically resolves when Job 2 is fixed |

5. Overall Recommendation

None of these failures are caused by PR #20815's code changes (fix_amd_ci_multimodaltest) — they are three independent pre-existing CI environment issues. The highest-impact fix is pinning transformers to a stable release (e.g., transformers>=4.45,<5.0) since the MiniCPM crash in Job 2 triggers the fail-fast gate and abandons 24 downstream jobs, making it the single largest contributor to CI instability. In parallel, add a fuser -k port cleanup step before EAGLE tests to resolve the persistent port conflict on Job 0, and add a one-line skipTest guard in test_bmm.py to unblock the CPU suite. Once these three targeted fixes are applied, the PR should be re-run and is expected to pass cleanly. No changes to the PR's own code are needed.


Automated CI analysis by amd-bot — progressive step analysis

@HaiShaw HaiShaw merged commit 9e629d3 into main Mar 19, 2026
93 of 106 checks passed
@HaiShaw HaiShaw deleted the fix_amd_ci_multimodaltest branch March 19, 2026 18:50
Wangzheee pushed a commit to Wangzheee/sglang that referenced this pull request Mar 21, 2026
0-693 pushed a commit to 0-693/sglang that referenced this pull request Mar 25, 2026

Labels

amd lora Multi-modal multi-modal language model run-ci

3 participants