[AMD] CI - Fix AMD CI (multimodal test, move flaky test to non-deterministic group)#20815
Conversation
Warning You have reached your daily quota limit. Please wait up to 24 hours and I will start processing your requests again!

@amd-bot ci-status

@amd-bot review
CI Status for PR #20815

Total: 81 checks
Failed Checks:
Detailed Analysis

Method: Progressive step-by-step analysis (all steps examined per job)

Job:
| Test File | Result | Duration |
|---|---|---|
| test_eagle_infer_b.py | TIMEOUT | 1800s (budget exhausted) |
| test_lora_overlap_loading.py | NOT RUN | — |
| test_utils_update_weights.py | NOT RUN | — |
| test_update_weights_from_disk.py | SKIPPED | — (see #14021) |

Summary: 0/3 passed
Three Server Launch Attempts (all failed identically)
- 08:41 — max_running_requests=8, default KV cache (86756 tokens) → port 11000 in use → watchdog timeout 300s
- 08:51 — Same config (retry) → same failure
- 09:01 — max_running_requests=64, max_total_tokens=4500, chunked_prefill_size=128 → same failure
3. Suggested Fixes
Immediate / Short-term
- Kill stale processes before test execution — add a pre-test cleanup step in the CI workflow or in run_suite.py:

```shell
# Kill any process holding port 11000 (and other common sglang ports)
for port in 11000 11001 11002 30000 30001; do
  fuser -k ${port}/tcp 2>/dev/null || true
done
# Also kill any orphan sglang/python server processes
pkill -f "python.*launch_server" 2>/dev/null || true
pkill -f "sglang.launch_server" 2>/dev/null || true
```
- Use dynamic/random ports in tests — instead of hardcoding port 11000, let the OS assign a free port:

```python
# In test fixtures / server launch helpers
import socket

def get_free_port():
    with socket.socket() as s:
        s.bind(('', 0))
        return s.getsockname()[1]
```
This largely eliminates port conflicts between concurrent or sequential runs (a small race window remains between picking the port and binding it).
- Add port-availability check with fast-fail — in the server startup path (launch_server.py or the test harness), detect EADDRINUSE immediately and either pick another port or fail fast with a clear message instead of waiting 300s for the watchdog:

```python
# Before launching the server
import socket

with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
    if s.connect_ex(('127.0.0.1', port)) == 0:
        raise RuntimeError(f"Port {port} already in use — aborting immediately")
```
Medium-term
- Fix py-spy permissions — the watchdog's diagnostic dump fails with Permission denied (os error 13). Either:
  - Run CI with the SYS_PTRACE capability: docker run --cap-add SYS_PTRACE ...
  - Or set kernel.yama.ptrace_scope=0 in the runner container

  This won't fix the port issue but will provide better diagnostics when watchdog timeouts occur.
- Isolate test runs with process namespaces — run each test file inside its own PID namespace or network namespace to guarantee no port leakage:

```shell
unshare --net --map-root-user -- python -m pytest test_eagle_infer_b.py
```
- Add post-test cleanup in the test harness — run_suite.py should kill all child processes after each test file completes (or times out), not just at the end of the suite.
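One way to implement that per-file cleanup — a sketch only, assuming run_suite.py launches each test file as a subprocess on a POSIX runner; the helper names are hypothetical — is to start each test file in its own process group and kill the whole group when the file finishes or times out:

```python
import os
import signal
import subprocess

def launch_in_group(cmd):
    # start_new_session=True puts the child in its own process group (POSIX),
    # so any server processes it spawns can all be killed together
    return subprocess.Popen(cmd, start_new_session=True)

def kill_process_group(proc, timeout=10):
    # Terminate every process spawned by the test file, not just the direct child
    try:
        pgid = os.getpgid(proc.pid)
    except ProcessLookupError:
        return  # child already exited and was reaped
    try:
        os.killpg(pgid, signal.SIGTERM)
        proc.wait(timeout=timeout)
    except ProcessLookupError:
        pass  # group already empty
    except subprocess.TimeoutExpired:
        os.killpg(pgid, signal.SIGKILL)
        proc.wait()
```

Calling kill_process_group after every test file (pass or fail) would also clear the port-11000 zombies that caused this failure.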
Workflow-level fix (.github/workflows/)
```yaml
- name: Kill stale server processes
  if: always()
  run: |
    fuser -k 11000/tcp 2>/dev/null || true
    pkill -9 -f "sglang" 2>/dev/null || true
    sleep 2
```

Add this before the test run step and also as a post-test cleanup with if: always().
4. Priority
High — This is a flaky infrastructure failure that:
- Caused 0/3 tests to pass and blocked 2 tests from running at all
- Is non-deterministic (depends on runner state from prior jobs)
- Will recur on the same runner unless addressed
- Masks real test results for EAGLE speculative decoding and LoRA overlap loading
5. Environment Context
| Component | Value |
|---|---|
| Runner | h100-radixark-host1-gpu-3 (persistent Docker container 8e1df5b34166) |
| GPU | NVIDIA H100 80GB HBM3 |
| Driver | NVIDIA 580.126.09, CUDA 13.0 |
| PyTorch | 2.9.1 (CUDA 12.8) |
| sglang | 0.0.0.dev1+g5252bd422 (commit 5252bd4) |
| sglang-kernel | 0.4.0 (pre-built) |
| flashinfer | 0.6.6 |
| transformers | 5.3.0 |
| Python | 3.10.12 |
| Conflicting port | 127.0.0.1:11000 (EADDRINUSE) |
| Model | meta-llama/Llama-2-7b-chat-hf + lmsys/sglang-EAGLE-llama2-chat-7B |
| Test timeout | 1800s per file, 300s watchdog |
Job: stage-b-test-large-1-gpu (2)
- Link: stage-b-test-large-1-gpu (2)
- Failed Steps: (unknown)
CI Failure Analysis: stage-b-test-large-1-gpu (2)
1. Root Cause Analysis
Two MiniCPM vision-language models (MiniCPM-V-4 and MiniCPM-o-2_6) fail to launch their sglang server process, crashing with exit code 1 during setUpClass. The server fails first in offline mode (HF_HUB_OFFLINE=1), and then again on the online retry.
The most likely root cause is a model loading incompatibility caused by the combination of:
- transformers 5.3.0 — a very recent/bleeding-edge version that may have changed model class registrations, config schemas, or trust_remote_code handling for MiniCPM architectures.
- Missing trust_remote_code files in HF cache — the HF cache validation in Step 19 explicitly flagged failures for MiniCPM models (among others) with missing files, which are required for custom model architectures like MiniCPM-V-4 and MiniCPM-o-2_6.
- No CUDA coredumps were generated, confirming this is an application-level/Python crash (model loading or config parsing), not a GPU driver issue.
The fact that other VLMs (DeepSeek-OCR, Gemma-3, Qwen2.5-VL, Qwen2-VL, Qwen2-Audio, Qwen3-Omni, Qwen3-VL) all serve and pass successfully strongly points to a MiniCPM-specific model loading regression.
2. Failure Details
Failed Tests:
| Test Class | Model | Error |
|---|---|---|
| TestMiniCPMV4Server | openbmb/MiniCPM-V-4 | Exception: Server process exited with code 1 in setUpClass |
| TestMiniCPMo26Server | openbmb/MiniCPM-o-2_6 | Exception: Server process exited with code 1 in setUpClass |
Test Summary: 51 tests ran in 535.2s — 2 errors, 6 skipped. Exit code 255.
Failure Pattern:
- Test harness attempts to launch sglang server with HF_HUB_OFFLINE=1 → server process exits with code 1
- Retries without offline mode → server process exits with code 1 again
- setUpClass raises exception, all tests in the class are marked as errors
HF Cache Validation Failures (from Step 19):
The cache integrity check showed 32 FAILs including MiniCPM models missing trust_remote_code files (custom Python modules needed to load these architectures). These models use custom architectures not natively in transformers and rely on downloaded Python files from the HuggingFace Hub.
3. Suggested Fixes
Immediate / Short-term
- Investigate MiniCPM server crash logs: the current test output only shows exit code 1. Add or surface the actual server stderr/stdout to diagnose the exact crash point:

```python
# In the test harness, capture and print server logs on failure
```

- Fix HF cache for MiniCPM models: re-download the model repos with trust_remote_code files:

```shell
huggingface-cli download openbmb/MiniCPM-V-4 --local-dir-use-symlinks False
huggingface-cli download openbmb/MiniCPM-o-2_6 --local-dir-use-symlinks False
```

  Ensure the runner's HF cache includes the custom .py files (not just weights/config).

- Pin or test transformers version compatibility: transformers 5.3.0 is very new. Test whether downgrading resolves the issue:

```shell
pip install transformers==4.51.0
```

  MiniCPM custom code may reference internal transformers APIs that changed in v5.x.
Medium-term
- Add per-model cache validation to CI: before running tests, validate that each tested model has all required files (including *.py for trust_remote_code models):

```python
# Validate trust_remote_code models have their custom code files
import glob
import os
import huggingface_hub

for model_id in ["openbmb/MiniCPM-V-4", "openbmb/MiniCPM-o-2_6"]:
    snapshot = huggingface_hub.snapshot_download(model_id, local_files_only=True)
    assert glob.glob(os.path.join(snapshot, "*.py")), f"Missing custom code for {model_id}"
```

- Consider skipping MiniCPM tests when trust_remote_code files are unavailable, with a clear skip message rather than a hard crash.
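A hedged sketch of that skip guard — the helper name has_remote_code is hypothetical; it assumes huggingface_hub is importable and treats any cache-lookup failure as "files unavailable":

```python
import glob
import os

try:
    from huggingface_hub import snapshot_download
except ImportError:  # hub client not installed — treat as unavailable
    snapshot_download = None

def has_remote_code(model_id):
    """True if the locally cached snapshot includes the custom *.py modules
    that trust_remote_code models need to load."""
    if snapshot_download is None:
        return False
    try:
        # local_files_only=True raises if the model is not fully cached
        snapshot = snapshot_download(model_id, local_files_only=True)
    except Exception:
        return False
    return bool(glob.glob(os.path.join(snapshot, "*.py")))
```

A test class could then call self.skipTest(...) in setUpClass when this returns False, giving a clear skip message instead of the opaque exit-code-1 crash.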
4. Priority
High
- The failure blocks the entire test suite partition (0/4 files pass; remaining 3 test files never execute)
- Two actively supported VLM models are broken in CI
- The root cause (likely transformers 5.x incompatibility or incomplete HF cache) could affect other trust_remote_code models as well
- Not Critical because: other model tests pass, and this is partition 2/14 — other partitions are likely unaffected
5. Environment Context
| Component | Value |
|---|---|
| GPU | NVIDIA H100 80GB HBM3 (sm_90) |
| Driver | 580.126.09 |
| CUDA (PyTorch) | 12.8 |
| PyTorch | 2.9.1 |
| transformers | 5.3.0 |
| tokenizers | 0.22.2 |
| sglang | 0.0.0.dev1+g5252bd422 |
| flash-attn-4 | 4.0.0b4 |
| flashinfer | 0.6.6 |
| Python | 3.10.12 |
| Commit | 5252bd4222d72a32e9c14e5f393c9ed0dac239fb |
| Runner | h100-radixark-host1-gpu-6 |
| HF Cache | 32 FAIL (incomplete models including MiniCPM) |
Job: wait-for-stage-b
- Link: wait-for-stage-b
- Failed Steps: (unknown)
CI Failure Analysis: wait-for-stage-b — Job stage-b-test-large-1-gpu (2) Failed
1. Root Cause Analysis
The wait-for-stage-b gating job detected that stage-b-test-large-1-gpu (2) completed with conclusion=failure. This polling job itself functioned correctly — it did exactly what it was designed to do: detect the failure and fail-fast.
The root cause is not determinable from the logs of this job alone. The wait-for-stage-b job is a monitoring/gating job that only polls the GitHub Actions API for job statuses. It does not contain any test output, error messages, stack traces, or build logs from the failing job. The actual failure details (test errors, OOM, timeout, compilation failure, etc.) exist exclusively in the logs of stage-b-test-large-1-gpu (2) itself.
What we can infer:
- The failure occurred early — stage-b-test-large-1-gpu (2) completed by polling attempt 8 (~14 minutes into the run), while 24/27 jobs were still queued/in-progress. This suggests the failure was not a timeout but rather a fast, hard failure (crash, assertion error, compilation error, or early test failure).
- Only 3/27 jobs had completed at termination: stage-b-test-small-1-gpu (0) ✅, stage-b-test-4-gpu-b200 ✅, and stage-b-test-large-1-gpu (2) ❌.
- The PR under test is PR #20815 ([AMD] CI - Fix AMD CI (multimodal test, move flaky test to non-deterministic group)) — merge commit 5252bd4, merging 27142c2 into cb8105f.
2. Failure Details
| Field | Value |
|---|---|
| Failed job | stage-b-test-large-1-gpu (2) |
| Status | completed |
| Conclusion | failure |
| Error message | ##[error]stage-b jobs failed: stage-b-test-large-1-gpu (2) |
| Specific test failures | ❌ Not available — must inspect the failed job's own logs |
| Stack traces | ❌ Not available — must inspect the failed job's own logs |
| Time to failure | ~14 minutes or less (detected at attempt 8, polling every 120s) |
Direct link to the failing job's logs:
👉 https://github.com/sgl-project/sglang/actions/runs/23285539283 — look for stage-b-test-large-1-gpu (2) in the job list.
3. Suggested Fixes
Since the actual failure details are not in this job's logs, the immediate actions are:
Immediate (Diagnostic)
- Inspect the actual failing job logs: navigate to stage-b-test-large-1-gpu (2) in this workflow run and examine:
  - The test execution step for pytest output / stack traces
  - Server launch logs for OOM or GPU errors
  - Any SGLANG_CUDA_COREDUMP artifacts if a GPU crash occurred
- Check the PR #20815 diff: review what 27142c2 changed — particularly any modifications to large model tests, serving configs, or GPU memory handling that would affect single-GPU large model tests.
- Check if flaky: search for prior failures of stage-b-test-large-1-gpu (2) on main to determine if this is a pre-existing flake or a regression introduced by PR #20815.
If the failure is a test regression in PR #20815
- Fix the code in the PR branch and re-push
If the failure is a flaky test
- Re-run the failed job
- If it recurs, open a separate issue to track the flaky test with a [Flaky] label
Housekeeping (Non-blocking)
- Node.js 20 deprecation: update actions/checkout and actions/github-script to versions using Node.js 24 before June 2, 2026 (warning detected in cleanup step)
4. Priority
High — This blocks the entire stage-b gate (fail-fast stopped evaluation of 24 remaining jobs) and likely blocks PR #20815 from merging. However, priority should be reassessed after inspecting the actual failure: if it's a known flake, it drops to Medium.
5. Environment Context
| Component | Value |
|---|---|
| Runner OS | Ubuntu 24.04.3 LTS |
| Runner image | ubuntu-24.04 v20260309.50.1 |
| Git version | 2.53.0 |
| Region | northcentralus (Azure) |
| PR | #20815 (merge commit 5252bd4) |
| CI env vars | SGLANG_IS_IN_CI=true, SGLANG_CUDA_COREDUMP=1, SGLANG_JIT_DEEPGEMM_FAST_WARMUP=true |
| GPU context | Large-1-GPU test shard (shard index 2 of 14) — likely AMD GPU based on project scope |
| Stage-b total | 27 jobs expected (8+14+4+1) |
⚠️ Key action required: the root cause can only be determined by examining the logs of stage-b-test-large-1-gpu (2). This analysis confirms the gating infrastructure worked correctly; the problem is in the test job itself.
Job: build-test (all)
- Link: build-test (all)
- Failed Steps: (unknown)
CI Failure Analysis: build-test (all) — CPU BMM FP8 Test Failure
1. Root Cause Analysis
The sgl_kernel CPU BMM kernel (torch.ops.sgl_kernel.bmm_cpu) does not implement FP8 weight support, but test_bmm.py unconditionally tests FP8 BMM paths on the CPU platform. This is a test–kernel feature mismatch: the test was written (or modified in PR #20815, branch fix_amd_ci_multimodaltest) without gating the FP8 BMM subtests behind a CPU capability check.
This is not an environment, dependency, or infrastructure issue. The kernel explicitly raises:
RuntimeError: bmm: do not support fp8 weight for now.
This is a deliberate guard in the C++/kernel code — the feature simply hasn't been implemented for the CPU backend yet.
2. Failure Details
| Field | Value |
|---|---|
| Failed test file | test/srt/cpu/test_bmm.py |
| Failed test method | TestBmm.test_bmm (parameterized) |
| Number of errors | 96 errors across all parameter combinations |
| Exit code | 1 (test file), 255 (suite runner) |
| Suite behavior | continue_on_error=False → aborted after first file failure; 18 of 21 tests never ran |
Error Message & Stack Trace
RuntimeError: bmm: do not support fp8 weight for now.
Call chain:
test_bmm.py:91 → _fp8_bmm():67 → torch.ops.sgl_kernel.bmm_cpu(mat3, mat1, mat2_q_t, False, mat2_s)
Parameter Space (all 96 combinations failed)
- B ∈ {1, 16, 17}
- M ∈ {1, 2, 11, 111}
- N ∈ {160, 512}
- K ∈ {160, 544}
- chunk ∈ {True, False}
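As a sanity check, the 96-error count matches the full cross product of the parameters listed above:

```python
from itertools import product

# Parameter grid reported for TestBmm.test_bmm
B = [1, 16, 17]
M = [1, 2, 11, 111]
N = [160, 512]
K = [160, 544]
chunk = [True, False]

combos = list(product(B, M, N, K, chunk))
print(len(combos))  # 3 * 4 * 2 * 2 * 2 = 96
```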
Tests Blocked (never executed)
test_causal_conv1d.py, test_cpu_graph.py, test_decode.py, test_extend.py, test_flash_attn.py, test_gemm.py, test_intel_amx_attention_backend_{a,b,c}.py, test_mamba.py, test_mla.py, test_moe.py, test_norm.py, test_qkv_proj_with_rope.py, test_qwen3.py, test_rope.py, test_shared_expert.py, test_topk.py
3. Suggested Fixes
Option A: Skip FP8 subtests on CPU (Recommended — quickest fix)
In test/srt/cpu/test_bmm.py, gate FP8 test paths:

```python
import unittest

# In the test method or at parameterization level:
@unittest.skipIf(
    not torch.cuda.is_available(),  # or check SGLANG_USE_CPU_ENGINE
    "FP8 BMM not supported on CPU kernel"
)
def test_bmm_fp8(self, ...):
    ...
```

Or more precisely, inside _fp8_bmm() or at test dispatch:

```python
import os

if os.environ.get("SGLANG_USE_CPU_ENGINE") == "1":
    self.skipTest("bmm_cpu does not support fp8 weights")
```

Option B: Implement FP8 support in the CPU BMM kernel

In sgl-kernel/src/sgl-kernel/cpu/bmm.cpp (or equivalent), add an FP8 code path. This is a larger effort and likely not the intent of PR #20815.
Option C: Separate test parameterization for CPU vs GPU
Refactor test_bmm.py to have distinct parameter sets:
- GPU: includes FP8 weight tests
- CPU: excludes FP8 weight tests (or tests only BF16/FP32)
```python
CPU_DTYPES = ["bf16", "fp32"]
GPU_DTYPES = ["bf16", "fp32", "fp8"]
dtypes = CPU_DTYPES if os.environ.get("SGLANG_USE_CPU_ENGINE") == "1" else GPU_DTYPES
```

Additional: Enable continue_on_error or fix test ordering
Consider setting continue_on_error=True in the suite runner so that one early failure doesn't block 18 other tests from providing signal. This won't fix the root cause but improves CI observability.
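A minimal sketch of that behavior in a suite runner — run_files and run_one are hypothetical names, not the actual run_suite.py API:

```python
def run_files(files, run_one, continue_on_error=True):
    """Run each test file via run_one(path) -> exit code.

    With continue_on_error=True every file runs and all failures are
    reported; with False (the current behavior), the first failure hides
    the signal from everything after it.
    """
    failures = []
    for path in files:
        if run_one(path) != 0:
            failures.append(path)
            if not continue_on_error:
                break  # abort: remaining files never produce signal
    return failures
```

The suite would still exit nonzero when failures is non-empty, but only after every file has produced signal.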
4. Priority
High
- The failure blocks 18 out of 21 CPU tests from running, providing zero signal on the rest of the CPU test suite.
- The fix is straightforward (skip/gate FP8 tests on CPU).
- PR [AMD] CI - Fix AMD CI (multimodal test, move flaky test to non-deterministic group) #20815 (
fix_amd_ci_multimodaltest) cannot merge with this failure.
5. Environment Context
| Component | Value |
|---|---|
| Platform | Intel Xeon (CPU-only, SGLANG_USE_CPU_ENGINE=1) |
| CPU Feature | Intel AMX confirmed available |
| Docker image | sglang_xeon from docker/xeon.Dockerfile (Ubuntu 24.04) |
| Python | 3.12.13 (via uv venv) |
| PyTorch | 2.9.0+cpu |
| sglang | 0.5.6.post3.dev3008+g27142c2ee (branch fix_amd_ci_multimodaltest) |
| sglang-kernel-cpu | 0.4.0 (built from source) |
| Triton | 3.5.0 (not functional on CPU — expected) |
| Runner | gnr88001599-1 / sdp |
| PR | #20815 |
Cross-Job Analysis
Unified CI Failure Analysis — PR #20815 (fix_amd_ci_multimodaltest)
1. Common Patterns
| Pattern | Jobs Affected | Shared? |
|---|---|---|
| Port conflict / stale processes on persistent runners | stage-b-test-large-1-gpu (0) | No — infra-specific |
| MiniCPM model loading crash (likely transformers 5.3.0 or HF cache) | stage-b-test-large-1-gpu (2) | No — model-specific |
| FP8 BMM not implemented on CPU | build-test (all) | No — test-code bug |
| Gating job propagating upstream failure | wait-for-stage-b | Derivative only |
| Single failure blocking remaining tests in suite | All 3 real failures | ✅ Yes — continue_on_error=False amplifies every failure |
These are three independent root causes. There is no single broken dependency, shared infra issue, or common code change that unifies them. The only shared pattern is that each failure is amplified by the test runner's continue_on_error=False behavior, turning one broken test file into a full suite wipeout.
2. Cross-Job Dependencies
stage-b-test-large-1-gpu (0) ──[independent]── stage-b-test-large-1-gpu (2)
│
▼
wait-for-stage-b (derivative — gating job)
│
▼
All remaining stage-b jobs killed (24 jobs lost signal)
build-test (all) ──[independent, separate pipeline]
- wait-for-stage-b is purely derivative — it failed because it correctly detected stage-b-test-large-1-gpu (2) failing. Fixing job (2) fixes the gate.
- stage-b-test-large-1-gpu (0) and (2) run on different runners with different models. No causal link.
- build-test (all) is a CPU-only build pipeline on an Intel Xeon runner — completely isolated from the H100 GPU jobs.
3. Unified Root Cause Assessment
There is no single unified root cause. The three real failures are:
| # | Category | Root Cause | Introduced By |
|---|---|---|---|
| 1 | Infra | Zombie process holding port 11000 on persistent Docker runner | Pre-existing runner state (not PR code) |
| 2 | Compatibility | MiniCPM models fail to load — likely transformers 5.3.0 regression or incomplete HF cache for trust_remote_code models | Environment / dependency drift |
| 3 | Test bug | test_bmm.py calls FP8 BMM on CPU, which the kernel explicitly doesn't support | PR #20815 code or pre-existing test gap |
The PR branch name (fix_amd_ci_multimodaltest) suggests the author was fixing AMD CI for multimodal tests — the CPU BMM FP8 failure (#3) may be a pre-existing issue newly exposed by this PR's test scope changes, or a test file added/modified without CPU gating.
4. Priority Ranking
| Priority | Job | Rationale | Fix Effort |
|---|---|---|---|
| 🔴 P1 | build-test (all) — FP8 BMM | Blocks PR merge. Directly fixable in PR code. 96 errors, 18 tests blocked. Simple skip/gate fix. | 🟢 ~10 min |
| 🟠 P2 | stage-b-test-large-1-gpu (2) — MiniCPM | Blocks stage-b gate (kills 24 jobs via wait-for-stage-b). Needs investigation — either fix HF cache, pin transformers, or skip MiniCPM tests. | 🟡 ~1–2 hrs |
| 🟡 P3 | stage-b-test-large-1-gpu (0) — Port conflict | Flaky infra issue, not caused by PR code. Add fuser -k cleanup step. Won't recur on a different runner. | 🟢 ~15 min (workflow change) |
| ⚪ P4 | wait-for-stage-b | No action needed — automatically resolves when P2 is fixed. | N/A |
5. Overall Recommendation
PR #20815 has three independent failures that should be addressed in order: (1) add a CPU/FP8 skip guard in test_bmm.py — this is a ~10-minute code fix that unblocks the entire CPU test suite and is almost certainly within the PR author's scope; (2) investigate the MiniCPM server crash on stage-b-test-large-1-gpu (2) by examining its server stderr logs, then either re-download the HF cache with trust_remote_code files, pin transformers<5.0, or temporarily skip MiniCPM tests — this is the highest-impact fix since the fail-fast gate killed 24 downstream jobs; (3) add a fuser -k <port>/tcp pre-cleanup step to the CI workflow to prevent port conflicts on persistent runners, which is a one-line workflow fix that prevents future flakes. None of these failures share a common root cause, so they must be addressed independently — but the good news is that each has a straightforward, well-scoped fix.
Automated CI analysis by amd-bot — progressive step analysis
Claude Code Review
Code Review: [AMD] CI - Fix AMD CI (multimodal test, move flaky test to non-deterministic group)

1. Summary

This PR makes three changes to fix AMD CI stability:

2. Code Quality

Bug / Inconsistency
Timeout Adjustment
Test Suite Moves
3. Performance
4. Security

No security concerns.

5. Testing
6. Suggestions
7. Overall Assessment

Approve (with minor nit on stale comment). The changes are straightforward CI fixes: splitting tests into more partitions to reduce per-job resource pressure, adjusting a timeout, and correctly categorizing flaky vs. stable tests. The linked CI run validates the changes. The only actionable issue is the stale comment.

Automated review by amd-bot using Claude. This is an AI-generated review — please use your judgment.
Claude Code Review
Code Review: [AMD] CI - Fix AMD CI (multimodal test, move flaky test to non-deterministic group)

1. Summary

This PR makes three targeted changes to fix AMD CI stability:

2. Code Quality

Minor Issue: Stale Comment

```yaml
part: [0, 1, 2, 3]  # 2 partitions: 11 tests ÷ 2 = ~5-6 tests each
```

The comment still says "2 partitions: 11 tests ÷ 2 = ~5-6 tests each" but now there are 4 partitions, so it should be updated.

Consistency

The ROCm 7.20 workflow (…)

Logic

3. Performance

No impact on serving/inference performance. This is purely a CI configuration change.

4. Security

No security concerns. CI workflow changes don't affect the runtime codebase.

5. Testing
6. Suggestions
7. Overall Assessment

Approve — with a minor nit on the stale comment. This is a straightforward, well-scoped CI fix. The changes are logical: splitting partitions to reduce per-run load, increasing timeout for the larger workload, and correctly categorizing flaky vs. stable tests. The linked CI run demonstrates the fix works. The only actionable item is fixing the outdated comment.

Automated review by amd-bot using Claude. This is an AI-generated review — please use your judgment.
CI Status for PR #20815

Total: 81 checks
Failed Checks:
Detailed Analysis

Method: Progressive step-by-step analysis (all steps examined per job)

Job:
| Attempt | Server Start | Bind Error | Watchdog Timeout |
|---|---|---|---|
| 1 | 08:41:03 | 08:41:33 (~30s later) | 08:46:04 (300s soft) |
| 2 | 08:51:03 | 08:51:25 (~22s later) | 08:56:05 (300s soft) |
| 3 | 09:01:04 | 09:01:29 (~25s later) | 09:06:05 (300s soft) |
Cascading Impact
The 1800s timeout on the first test (3 × 300s watchdog + startup overhead) prevented the remaining 2 tests from executing:
- registered/lora/test_lora_overlap_loading.py — SKIPPED (never ran)
- registered/model_loading/test_utils_update_weights.py — SKIPPED (never ran)
Secondary Error
py-spy: Permission denied (os error 13)
py-spy could not attach to the stuck process to gather diagnostics, likely due to ptrace restrictions in the container (SYS_PTRACE capability not granted or kernel.yama.ptrace_scope > 0).
3. Suggested Fixes
Immediate / Short-term
- Kill stale processes before test execution — add a pre-test cleanup step in ci_install_dependency.sh or at the start of run_suite.py:

```shell
# Kill any process listening on common test ports
for port in 11000 11001 11002 11003 30000 30001; do
  fuser -k ${port}/tcp 2>/dev/null || true
done
```
- Use dynamic/random ports in tests — modify test_eagle_infer_b.py (and similar tests) to use port=0 or a randomly selected free port instead of hardcoded 11000:

```python
import socket

def get_free_port():
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.bind(('', 0))
        return s.getsockname()[1]
```
- Add port cleanup between retry attempts — in the test's retry logic, explicitly kill any process on the target port before retrying:

```python
import subprocess
import time

subprocess.run(["fuser", "-k", f"{port}/tcp"], capture_output=True)
time.sleep(2)  # Allow socket TIME_WAIT to clear
```
- Enable SO_REUSEADDR / SO_REUSEPORT — if using uvicorn/uvloop, ensure the server socket sets SO_REUSEADDR:

```python
# In server launch config
uvicorn.run(..., host="127.0.0.1", port=port)
# uvicorn sets SO_REUSEADDR by default, but verify it's not overridden
```
Medium-term
- Grant the SYS_PTRACE capability to the CI container so py-spy can attach and provide useful diagnostics:

```shell
# In the runner's Docker configuration
docker run --cap-add SYS_PTRACE ...
```

  Or set the sysctl:

```shell
echo 0 > /proc/sys/kernel/yama/ptrace_scope
```
- Add a pre-test port availability check in the test harness (run_suite.py):

```python
import socket
import subprocess

def assert_port_free(port, host="127.0.0.1"):
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        result = s.connect_ex((host, port))
    if result == 0:
        # Port is in use — find and log the offending process
        subprocess.run(["lsof", "-i", f":{port}"])
        raise RuntimeError(f"Port {port} is already in use")
```
- Restart the Docker container between CI jobs on h100-radixark-host1-gpu-3 to ensure a clean process namespace, or use the --rm flag so containers are ephemeral.
Long-term
- Refactor test infrastructure to use ephemeral port allocation project-wide, eliminating hardcoded ports as a class of CI failures.
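One way to do that without the check-then-bind race is to bind the ephemeral port first and hand the bound socket to the server. This is a sketch; whether sglang's launch path can accept a pre-bound socket is an assumption and may require a small change there:

```python
import socket

def bind_ephemeral(host="127.0.0.1"):
    """Bind port 0 so the OS picks a free ephemeral port.

    Returning the bound, listening socket (not just the port number) lets
    the server inherit it directly, closing the race window between
    choosing a port and binding it."""
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    s.bind((host, 0))
    s.listen(128)
    return s, s.getsockname()[1]
```

Tests would then report the returned port to clients instead of assuming a hardcoded 11000.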
4. Priority
High 🔴
- 3 out of 3 tests failed in this partition (1 timeout + 2 never ran)
- This is an infrastructure/environment issue, not a code bug — it will likely recur on the same runner
- The port conflict pattern is deterministic and will block all EAGLE speculative decoding tests until resolved
- No useful diagnostics can be collected due to the py-spy permission issue
5. Environment Context
| Detail | Value |
|---|---|
| Runner | h100-radixark-host1-gpu-3 |
| Container | 8e1df5b34166 (persistent, not ephemeral) |
| GPU | NVIDIA H100 80GB HBM3 (sm_90) |
| CUDA Driver | 580.126.09 |
| PyTorch | 2.9.1 (CUDA 12.8) |
| sglang | 0.0.0.dev1+g5252bd422 |
| FlashInfer | 0.6.6 |
| Python | 3.10.12 |
| OS | Ubuntu 22.04 |
| Commit | 5252bd4222d72a32e9c14e5f393c9ed0dac239fb |
| Conflicting Port | 127.0.0.1:11000 (EADDRINUSE across all 3 attempts) |
| Test Suite | stage-b-test-large-1-gpu, partition 0/14 |
| Timeout | 1800s per file, 300s watchdog per server launch attempt |
Job: stage-b-test-large-1-gpu (2)
- Link: stage-b-test-large-1-gpu (2)
- Failed Steps: (unknown)
CI Failure Analysis: stage-b-test-large-1-gpu (2)
1. Root Cause Analysis
Two VLM models — openbmb/MiniCPM-V-4 and openbmb/MiniCPM-o-2_6 — fail to start their inference servers (exit code 1) during test setup.
The most likely root cause is incompatibility between these MiniCPM models and transformers==5.3.0. Several clues support this:
- HuggingFace cache validation in Step 19 flagged 32 FAILs, including models missing trust_remote_code files — MiniCPM models are known to rely heavily on custom modeling code via trust_remote_code=True.
- Transformers 5.3.0 is very new (bleeding-edge) and introduced RoPE compatibility warnings across all models tested. MiniCPM's custom code may reference internal transformers APIs that changed or were removed in v5.x.
- The easydict module was reported missing (for DeepSeek-OCR), indicating the environment may be missing optional dependencies that custom model code requires. MiniCPM models similarly depend on custom tokenizer/processor code that may import packages not installed in this environment.
- Both failures are Python-level (exit code 1, no CUDA coredumps generated), consistent with an import error or model initialization crash rather than a GPU issue.
- The CI retried with HF_HUB_OFFLINE=0 (re-downloading model files), which also failed — ruling out a simple cache corruption issue and pointing to a code-level incompatibility.
2. Failure Details
Failed Tests
| Test Class | Model | Error |
|---|---|---|
| TestMiniCPMV4Server | openbmb/MiniCPM-V-4 | setUpClass — Server process exited with code 1 |
| TestMiniCPMo26Server | openbmb/MiniCPM-o-2_6 | setUpClass — Server process exited with code 1 |
Error Messages
ERROR: setUpClass (__main__.TestMiniCPMV4Server)
Exception: Server process exited with code 1. Check server logs for errors.
ERROR: setUpClass (__main__.TestMiniCPMo26Server)
Exception: Server process exited with code 1. Check server logs for errors.
Test File Results
- File: test/srt/models/test_vision_openai_server_a.py
- Exit code: 1 (file-level failure)
- Suite exit code: 255 (0/4 test files passed — remaining 3 files likely didn't execute due to early abort or are not shown)
Key Warnings
- Missing MoE kernel configs for E=64,N=896 and E=128,N=768 on H100 (triton 3.5.1)
- easydict module not found (DeepSeek-OCR)
- Transformers 5.3.0 RoPE compatibility warnings across all models
- Multimodal embedding cache full warnings for video inputs
3. Suggested Fixes
Immediate (unblock CI)
- Retrieve server logs for MiniCPM models — the current error message says "Check server logs for errors" but the actual server stderr/stdout is not captured in CI output. Modify the test harness or CI script to dump server logs on failure:

```python
# In test setup, capture and print server stderr on crash
if process.returncode != 0:
    print(f"SERVER STDERR:\n{process.stderr.read()}")
```
- Pin transformers to a known-good version until MiniCPM compatibility is confirmed:

```shell
pip install transformers==4.51.0
```
- Install missing optional dependencies that MiniCPM custom code may require:

```shell
pip install easydict timm
```
- Skip MiniCPM tests temporarily if blocking other CI work (add to skip list in run_suite.py or the test file):

```python
@unittest.skip("MiniCPM-V-4 incompatible with transformers 5.3.0 — see #XXXX")
class TestMiniCPMV4Server(...):
    ...
```
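The log-capture idea above could look roughly like the following — a sketch with hypothetical names, assuming the crashing server exits on its own (as in these failures) rather than hanging:

```python
import subprocess

def run_and_dump_on_failure(cmd, log_path, timeout=300):
    """Run a command with stdout/stderr redirected to a log file; if it
    exits nonzero, print the log tail so the CI output shows the real
    error instead of just 'exited with code 1'."""
    with open(log_path, "w") as log:
        proc = subprocess.Popen(cmd, stdout=log, stderr=subprocess.STDOUT)
        rc = proc.wait(timeout=timeout)  # raises TimeoutExpired if it hangs
    if rc != 0:
        with open(log_path) as log:
            tail = log.readlines()[-50:]
        print("SERVER LOG TAIL:\n" + "".join(tail))
    return rc
```

Wiring something like this into setUpClass would have surfaced the MiniCPM stack trace directly in the CI log.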
Medium-term
- Add a transformers version compatibility matrix — test MiniCPM models against transformers 4.x vs 5.x to identify the breaking version boundary.
- Improve server crash diagnostics — the test framework should always capture and surface server process logs when exit code != 0, not just say "check server logs."
- Fix the suite runner — the suite reports 0/4 files passed, suggesting the 3 other test files (test_anthropic_tool_use.py, test_session_latency.py, test_mrope.py) may not have executed at all after the first file failed. Verify the runner continues to subsequent files on failure.
4. Priority
High
- The failures are reproducible (both offline and online retries failed)
- They block the entire partition (exit code 255, 0/4 files passed)
- MiniCPM is a supported model family in sglang
- However, this is not a regression in sglang core — it's likely a dependency version issue, so not Critical
5. Environment Context
| Component | Value |
|---|---|
| GPU | NVIDIA H100 80GB HBM3 (sm_90) |
| Driver | 580.126.09 |
| CUDA | 13.0 (torch built for 12.8) |
| Python | 3.10.12 |
| PyTorch | 2.9.1+cu128 |
| transformers | 5.3.0 |
| sglang | 0.0.0.dev1+g5252bd422 |
| sglang-kernel | 0.4.0 |
| FlashInfer | 0.6.6 |
| flash-attn-4 | 4.0.0b4 |
| Triton | 3.5.1 |
| torchao | 0.9.0 |
| Runner | h100-radixark-host1-gpu-6 |
| Commit | 5252bd4222d72a32e9c14e5f393c9ed0dac239fb |
| Date | 2026-03-19 |
Key observation: transformers==5.3.0 is an unusually new version. The RoPE warnings across all models and the MiniCPM crashes strongly suggest this is the primary environmental factor causing the failures.
Job: wait-for-stage-b
- Link: wait-for-stage-b
- Failed Steps: (unknown)
CI Failure Analysis: wait-for-stage-b — Job stage-b-test-large-1-gpu (2) Failed
1. Root Cause Analysis
The wait-for-stage-b job is a gate/polling job — it did not fail itself. It correctly detected that a downstream job, stage-b-test-large-1-gpu (2), completed with conclusion=failure and triggered its fail-fast mechanism.
The root cause is NOT in this job's logs. This job only monitors other jobs via the GitHub API. The actual failure occurred in:
- `stage-b-test-large-1-gpu (2)` — matrix index 2 of the `stage-b-test-large-1-gpu` job group (14 total jobs)
The logs from this polling job provide no information about what failed inside stage-b-test-large-1-gpu (2) — no error messages, stack traces, or test output are surfaced here. The failure could be:
- A test failure in one of the large-1-GPU test suites
- A model loading / OOM issue on the GPU runner
- A server startup timeout
- An infrastructure issue (container crash, GPU fault, network timeout)
- A flaky test
Without the logs from stage-b-test-large-1-gpu (2), the specific root cause cannot be determined from this job alone.
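The fail-fast gating described above can be sketched as a small state function. This is illustrative only — the real workflow script polls the GitHub API on a timer, and the job dicts here are stand-ins for API responses:

```python
# Sketch of the gate logic a job like wait-for-stage-b implements: fail fast
# on the first downstream failure, pass only when every monitored job succeeds.
def gate_status(jobs):
    """jobs: list of {"name": str, "conclusion": "success"|"failure"|None}."""
    failed = [j["name"] for j in jobs if j.get("conclusion") == "failure"]
    if failed:
        return "failed", failed          # fail fast; remaining jobs abandoned
    if all(j.get("conclusion") == "success" for j in jobs):
        return "passed", []
    return "pending", []                 # keep polling
```

Note that the gate can only report *which* job failed, never *why* — surfacing the downstream logs requires visiting the failed job itself.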
2. Failure Details
| Field | Value |
|---|---|
| Failed job | stage-b-test-large-1-gpu (2) |
| Detection method | Fail-fast in wait-for-stage-b polling script |
| Error message | ##[error]stage-b jobs failed: stage-b-test-large-1-gpu (2) |
| Polling attempt | Attempt 8 (~14 min into monitoring) |
| Jobs completed at failure | 3 / 27 |
| Other completed jobs | stage-b-test-small-1-gpu (0) ✅, stage-b-test-4-gpu-b200 ✅ |
| Remaining jobs | 24 jobs not yet completed (skipped due to fail-fast) |
| Specific error/stack trace | ❌ Not available in this job's logs |
3. Suggested Fixes
Immediate Actions
- Inspect the actual failed job's logs:
  - Navigate to: Workflow Run #23285539283
  - Find and expand `stage-b-test-large-1-gpu (2)`
  - Look at the test execution step for error messages, stack traces, and which specific test(s) failed
- Check if this is a flaky failure:
  - Search recent CI runs for other failures in `stage-b-test-large-1-gpu (2)` specifically
  - If the same matrix index fails intermittently, identify the specific test file assigned to index `(2)` in the matrix configuration and check for known flaky tests
- Re-run the failed job:
  - If the failure appears transient (GPU fault, timeout, network), re-run only the failed job from the workflow run page
  - If it fails again, it's likely a real regression introduced by PR #20815
Investigation Path (in the failed job's logs)
Look for:
- pytest output with FAILED/ERROR markers
- CUDA OOM errors ("OutOfMemoryError")
- Server startup timeouts ("Timeout waiting for server")
- CUDA coredumps (SGLANG_CUDA_COREDUMP=1 is enabled)
- Container/process crashes (exit code != 0)
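A quick way to scan a saved job log for these markers — the `job.log` name and sample lines below are illustrative; a real log would be downloaded from the failed job:

```shell
# Create a stand-in log file for illustration; in practice this would be the
# downloaded log of stage-b-test-large-1-gpu (2).
cat > job.log <<'EOF'
test_example.py::test_x FAILED
torch.OutOfMemoryError: CUDA out of memory
Timeout waiting for server
EOF

# Scan for the failure markers listed above, with line numbers
grep -En "FAILED|OutOfMemoryError|Timeout waiting for server|core dumped" job.log
```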
If This Is a PR Regression (PR #20815)
- Compare the test results against the `main` branch (cb8105f)
- Identify what changed in commit 27142c2 that could affect large-1-GPU test scenarios
- The merge commit is 5252bd4
4. Priority
High — This blocks the entire stage-b gate for PR #20815. The fail-fast behavior means 24 of 27 jobs were effectively abandoned. However, until the actual failure in stage-b-test-large-1-gpu (2) is examined, it's unclear whether this is a real regression or a transient/infrastructure issue.
5. Environment Context
| Variable | Value |
|---|---|
| Runner OS | Ubuntu 24.04.3 LTS |
| Runner Image | ubuntu-24.04 / 20260309.50.1 |
| Git | 2.53.0 |
| PR | #20815 (merge commit 5252bd4) |
| PR head | 27142c2eec85718a0927303c4ee9eb86382d2b7e |
| Base | cb8105fe282fc373b5baed63d5df38682418a373 |
| SGLANG_IS_IN_CI | true |
| SGLANG_CUDA_COREDUMP | 1 (coredumps enabled — check for dumps in failed job) |
| SGLANG_JIT_DEEPGEMM_FAST_WARMUP | true |
| Node.js deprecation | actions/checkout@v4 and actions/github-script@v7 using Node.js 20 (EOL June 2, 2026) — not related to failure |
⚠️ Action Required: The actual root cause can only be determined by examining the logs of `stage-b-test-large-1-gpu (2)`. This analysis confirms the `wait-for-stage-b` gate job operated correctly — it is purely a messenger of the downstream failure.
Job: build-test (all)
- Link: build-test (all)
- Failed Steps: (unknown)
CI Failure Analysis: build-test (all) — CPU FP8 BMM Kernel Not Implemented
1. Root Cause Analysis
The test suite per-commit-cpu fails because test_bmm.py exercises FP8 (8-bit floating point) batch matrix multiplication on the CPU kernel, which explicitly does not support FP8 weights. The CPU implementation of sgl_kernel.bmm_cpu raises a hard error when called with FP8-quantized inputs. This is a known, intentional limitation in the CPU kernel ("bmm: do not support fp8 weight for now."), but the test suite does not skip or guard against this unsupported code path on CPU.
Because the suite runs with continue_on_error=False and enable_retry=False, the first test file failure (test_bmm.py) causes an immediate abort — 19 of 21 test files were never executed, making the failure impact appear worse than the underlying issue.
This is not a regression from PR #20815 (fix_amd_ci_multimodaltest). It is a pre-existing gap between test coverage expectations and CPU kernel capabilities.
2. Failure Details
Failed Test
| File | Tests | Errors | Exit Code |
|---|---|---|---|
| cpu/test_bmm.py | 1 test (96 parameterized subtests) | 96 | 1 |
Error Message
RuntimeError: bmm: do not support fp8 weight for now.
Stack Trace
File "test_bmm.py", line 67, in _fp8_bmm
torch.ops.sgl_kernel.bmm_cpu(mat3, mat1, mat2_q_t, False, mat2_s)
RuntimeError: bmm: do not support fp8 weight for now.
Parameter Space (all 96 combinations failed identically)
- B (batch): {1, 16, 17}
- M: {1, 2, 11, 111}
- N: {160, 512}
- K: {160, 544}
- chunk: {True, False}
Suite Execution Summary
Passed: 2/21 (cpu/test_activation.py, cpu/test_binding.py)
Failed: 1/21 (cpu/test_bmm.py)
Skipped: 19/21 (never reached — early abort)
Exit code: 255
Non-Fatal Warnings
- Triton is not supported on current platform, roll back to CPU
- Only CUDA, HIP and XPU support AWQ currently
- Only CUDA and MUSA support GGUF quantization currently
- numa_migrate_pages failed / get_mempolicy: Operation not permitted (Docker NUMA policy restriction)
3. Suggested Fixes
Option A: Skip FP8 tests on CPU (quick fix, recommended)
In test/srt/cpu/test_bmm.py, guard the FP8 test path:
```python
import unittest
import torch

# At the test method or class level:
@unittest.skipUnless(
    torch.cuda.is_available(),
    "FP8 BMM is not supported on CPU kernels"
)
def test_fp8_bmm(self):
    ...
```

Or, if the test is parameterized and mixes FP8/non-FP8 paths, add an inline skip:

```python
def _fp8_bmm(self, B, M, N, K, chunk):
    if not torch.cuda.is_available():
        self.skipTest("bmm: FP8 weight not supported on CPU")
    ...
```

Option B: Enable continue_on_error for the CPU suite
In run_suite.py or the suite configuration for per-commit-cpu, set continue_on_error=True so that subsequent test files still execute even when one file fails. This doesn't fix the root issue but prevents 19 tests from being silently skipped.
Option C: Implement FP8 BMM support in the CPU kernel (longer-term)
The error string "do not support fp8 weight for now" suggests this is a planned feature. Track implementation in a separate issue for the sgl_kernel_cpu package.
Option D: Remove test_bmm.py from the per-commit-cpu suite
If FP8 BMM on CPU is not planned near-term, exclude the file from the suite definition until the kernel supports it.
4. Priority
Medium
- The failure is deterministic and blocks the entire CPU test suite (19/21 tests never run), masking potential real regressions.
- However, it is not a regression from this PR — it's a pre-existing test/kernel mismatch.
- The fix is trivial (Option A: add a `skipTest` guard).
5. Environment Context
| Component | Value |
|---|---|
| Platform | CPU (Xeon), no GPU |
| CPU Feature | AMX tile instructions confirmed available |
| Docker Image | sglang_xeon (Ubuntu 24.04) |
| Python | 3.12.13 (via uv 0.10.11) |
| PyTorch | 2.9.0+cpu |
| sglang | 0.5.6.post3.dev3008+g27142c2ee |
| sglang-kernel-cpu | 0.4.0 (built from source) |
| Triton | 3.5.0 (not functional on CPU — falls back) |
| PR Branch | fix_amd_ci_multimodaltest (#20815) |
| Runner | gnr88001599-1 / sdp |
| Suite Config | per-commit-cpu, continue_on_error=False, enable_retry=False, timeout 1500s/file |
Cross-Job Analysis
Unified CI Failure Analysis — PR #20815 (fix_amd_ci_multimodaltest)
1. Common Patterns
| Pattern | Jobs Affected | Nature |
|---|---|---|
| Environment/infra issue, not code regression | All 4 | None of the failures trace back to code changes in PR #20815 |
| Pre-existing issues exposed by CI environment | Jobs 0, 2, build-test | Stale processes, dependency skew, missing test guards |
| Cascading abort hiding true test coverage | Jobs 0, build-test | Early failures prevent remaining tests from running (19/21 skipped in CPU; 2/3 skipped in Job 0) |
| wait-for-stage-b is purely derivative | wait-for-stage-b | Gate job correctly propagated Job 2's failure — not an independent issue |
The failures are NOT related to each other. They share no common root cause. Each is an independent issue:
| Job | Root Cause Category |
|---|---|
| Job 0 | Runner infra — stale process holding port 11000 |
| Job 2 | Dependency skew — transformers==5.3.0 breaks MiniCPM custom model code |
| wait-for-stage-b | Derivative — mirrors Job 2's failure |
| build-test (all) | Test gap — FP8 BMM test not guarded for CPU-only execution |
2. Cross-Job Dependencies
stage-b-test-large-1-gpu (2) ──FAILED──► wait-for-stage-b ──FAILED──► 24 jobs ABANDONED
- `wait-for-stage-b` is a direct consequence of Job 2's failure. Fixing Job 2 eliminates this failure entirely.
- Job 0 and Job 2 are independent matrix partitions on different runners (`gpu-3` vs `gpu-6`) — no causal relationship.
- `build-test (all)` runs on a CPU-only runner (`gnr88001599-1`) and is completely independent of the GPU jobs.
- The fail-fast in `wait-for-stage-b` caused 24 of 27 stage-b jobs to be abandoned, massively amplifying the blast radius of Job 2's failure.
3. Unified Root Cause
There is no single unified root cause. These are three independent failures:
| # | Cause | Scope |
|---|---|---|
| 1 | Port 11000 occupied by orphan process on h100-radixark-host1-gpu-3 | Runner-specific infra |
| 2 | transformers==5.3.0 incompatible with MiniCPM custom modeling code | Environment-wide dependency |
| 3 | test_bmm.py calls FP8 kernel on CPU without skip guard | Test code gap |
If forced to identify a theme: the CI environment is fragile — hardcoded ports, bleeding-edge unpinned dependencies, and missing test guards all independently cause failures that are not caught by the test framework before cascading into suite-level aborts.
4. Priority Ranking
| Priority | Job | Fix Effort | Blast Radius | Action |
|---|---|---|---|---|
| 🔴 P0 | Job 2 (MiniCPM / transformers 5.3.0) | Medium | Critical — kills wait-for-stage-b → abandons 24 jobs | Pin transformers<5.0 or patch MiniCPM model code for v5 compatibility |
| 🟠 P1 | Job 0 (port 11000 conflict) | Low | High — 3 tests blocked, will recur on same runner | Add fuser -k 11000/tcp pre-test cleanup; migrate to dynamic ports |
| 🟡 P2 | build-test (FP8 BMM on CPU) | Trivial | Medium — 19/21 CPU tests never run | Add skipUnless(torch.cuda.is_available()) to FP8 test path |
| ⚪ P3 | wait-for-stage-b | None | N/A — derivative | Automatically resolves when Job 2 is fixed |
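The dynamic-port migration suggested for Job 0 can be sketched in a few lines. This is an illustrative helper, not sglang's existing test fixture — binding to port 0 asks the OS for any free ephemeral port:

```python
# Sketch: prefer a fixed port when it is free, otherwise fall back to an
# OS-assigned ephemeral port instead of failing on a stale listener.
import socket

def port_is_free(port, host="127.0.0.1"):
    with socket.socket() as s:
        try:
            s.bind((host, port))
            return True
        except OSError:
            return False  # in use (or not permitted) -> treat as unavailable

def pick_port(preferred=11000):
    """Return `preferred` if bindable, else an OS-assigned free port."""
    if port_is_free(preferred):
        return preferred
    with socket.socket() as s:
        s.bind(("", 0))          # port 0 = let the OS choose
        return s.getsockname()[1]
```

Using such a helper in the server-launch fixture would make the EAGLE tests immune to an orphan process squatting on port 11000.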
5. Overall Recommendation
None of these failures are caused by PR #20815's code changes (fix_amd_ci_multimodaltest) — they are three independent pre-existing CI environment issues. The highest-impact fix is pinning transformers to a stable release (e.g., transformers>=4.45,<5.0) since the MiniCPM crash in Job 2 triggers the fail-fast gate and abandons 24 downstream jobs, making it the single largest contributor to CI instability. In parallel, add a fuser -k port cleanup step before EAGLE tests to resolve the persistent port conflict on Job 0, and add a one-line skipTest guard in test_bmm.py to unblock the CPU suite. Once these three targeted fixes are applied, the PR should be re-run and is expected to pass cleanly. No changes to the PR's own code are needed.
Automated CI analysis by amd-bot — progressive step analysis
Motivation
Modifications
Accuracy Tests
[PASSED] https://github.com/sgl-project/sglang/actions/runs/23234956586
Benchmarking and Profiling
Checklist
Review Process
/tag-run-ci-label, /rerun-failed-ci, /tag-and-rerun-ci