
[AMD] CI - Fix AMD CI (multimodal test, move flaky test to non-deterministic group)#20815

Merged
HaiShaw merged 4 commits into main from fix_amd_ci_multimodaltest
Mar 19, 2026

Conversation

@yctseng0211
Collaborator

@yctseng0211 yctseng0211 commented Mar 18, 2026

Motivation

Modifications

Accuracy Tests

[PASSED] https://github.com/sgl-project/sglang/actions/runs/23234956586

Benchmarking and Profiling

Checklist

Review Process

  1. Ping Merge Oncalls to start the PR flow. See the PR Merge Process.
  2. Get approvals from CODEOWNERS and other reviewers.
  3. Trigger CI tests with comments or contact authorized users to do so.
    • /tag-run-ci-label, /rerun-failed-ci, /tag-and-rerun-ci
  4. After green CI and required approvals, ask Merge Oncalls to merge.

@gemini-code-assist
Contributor

Warning

You have reached your daily quota limit. Please wait up to 24 hours and I will start processing your requests again!

@github-actions github-actions bot added the amd label Mar 18, 2026
@yctseng0211 yctseng0211 marked this pull request as ready for review March 19, 2026 00:09

@yctseng0211
Collaborator Author

@amd-bot ci-status

@yctseng0211 yctseng0211 changed the title [AMD] CI - Fix multimodal test [AMD] CI - Fix AMD CI Mar 19, 2026
@yctseng0211 yctseng0211 changed the title [AMD] CI - Fix AMD CI [AMD] CI - Fix AMD CI (multimodal test, move flaky test to non-deterministic group) Mar 19, 2026
@github-actions github-actions bot added the lora and Multi-modal labels Mar 19, 2026
@yctseng0211
Collaborator Author

@amd-bot ci-status

@bingxche
Collaborator

@amd-bot review

@bingxche
Collaborator

@yctseng0211 requested CI status check

CI Status for PR #20815

Total: 81 checks

  • Passed: 29
  • Failed: 4
  • Pending: 18

Failed Checks:


Detailed Analysis

Method: Progressive step-by-step analysis (all steps examined per job)

Job: stage-b-test-large-1-gpu (0)

CI Failure Analysis: stage-b-test-large-1-gpu (0)

1. Root Cause Analysis

Port 11000 was already bound by a zombie/orphan process when test_eagle_infer_b.py attempted to start the sglang server. The test harness launched the server 3 times in succession; each time the model loaded successfully and CUDA graphs were captured, but the HTTP server could not bind to 127.0.0.1:11000. After 300 seconds the watchdog killed the server, the test retried, and ultimately the 1800-second per-file timeout was exhausted — preventing the remaining 2 test files from ever executing.

The most likely source of the stale process on port 11000 is insufficient cleanup from a prior CI run or a prior test file within the same runner session. The shared runner h100-radixark-host1-gpu-3 runs inside a persistent Docker container (8e1df5b34166), so processes from previous jobs can survive if they aren't explicitly killed. The git clean -ffdx in step 10 only cleans the filesystem — it does not kill orphan server processes.

2. Failure Details

Error Messages (repeated 3 times)

ERROR: [Errno 98] error while attempting to bind on address ('127.0.0.1', 11000): address already in use
TokenizerManager watchdog timeout (self.watchdog_timeout=300, self.soft=True)
py-spy: Permission denied (os error 13)

Test Results

Test File | Result | Duration
test_eagle_infer_b.py | TIMEOUT | 1800s (budget exhausted)
test_lora_overlap_loading.py | NOT RUN | —
test_utils_update_weights.py | NOT RUN | —
test_update_weights_from_disk.py | SKIPPED (see #14021) | —

Summary: 0/3 passed

Three Server Launch Attempts (all failed identically)

  1. 08:41 — max_running_requests=8, default KV cache (86756 tokens) → port 11000 in use → watchdog timeout 300s
  2. 08:51 — Same config (retry) → same failure
  3. 09:01 — max_running_requests=64, max_total_tokens=4500, chunked_prefill_size=128 → same failure

3. Suggested Fixes

Immediate / Short-term

  1. Kill stale processes before test execution — Add a pre-test cleanup step in the CI workflow or in run_suite.py:

    # Kill any process holding port 11000 (and other common sglang ports)
    for port in 11000 11001 11002 30000 30001; do
      fuser -k ${port}/tcp 2>/dev/null || true
    done
    # Also kill any orphan sglang/python server processes
    pkill -f "python.*launch_server" 2>/dev/null || true
    pkill -f "sglang.launch_server" 2>/dev/null || true
  2. Use dynamic/random ports in tests — Instead of hardcoding port 11000, let the OS assign a free port:

    # In test fixtures / server launch helpers
    import socket
    def get_free_port():
        with socket.socket() as s:
            s.bind(('', 0))
            return s.getsockname()[1]

    This eliminates port conflicts entirely between concurrent or sequential runs.

  3. Add port-availability check with fast-fail — In the server startup path (launch_server.py or the test harness), detect EADDRINUSE immediately and either pick another port or fail fast with a clear message instead of waiting 300s for the watchdog:

    # Before launching the server
    import socket
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        if s.connect_ex(('127.0.0.1', port)) == 0:
            raise RuntimeError(f"Port {port} already in use — aborting immediately")

Medium-term

  1. Fix py-spy permissions — The watchdog's diagnostic dump fails with Permission denied (os error 13). Either:

    • Run CI with SYS_PTRACE capability: docker run --cap-add SYS_PTRACE ...
    • Or set kernel.yama.ptrace_scope=0 in the runner container
    • This won't fix the port issue but will provide better diagnostics when watchdog timeouts occur.
  2. Isolate test runs with process namespaces — Run each test file inside its own PID namespace or network namespace to guarantee no port leakage:

    unshare --net --map-root-user -- python -m pytest test_eagle_infer_b.py
  3. Add post-test cleanup in the test harness — run_suite.py should kill all child processes after each test file completes (or times out), not just at the end of the suite.
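A minimal sketch of such per-file cleanup using POSIX process groups (the helper name and arguments are hypothetical illustrations, not the actual run_suite.py interface):

```python
import os
import signal
import subprocess

def run_test_file_isolated(cmd, timeout):
    """Run one test file in its own process group, then tear the whole
    group down so no orphan server survives into the next file.

    `cmd` and `timeout` are hypothetical parameters for illustration.
    """
    proc = subprocess.Popen(cmd, start_new_session=True)  # new process group
    try:
        return proc.wait(timeout=timeout)
    finally:
        try:
            # Kill children too (e.g. a spawned server still holding a port),
            # not just the direct test process.
            os.killpg(proc.pid, signal.SIGKILL)
        except ProcessLookupError:
            pass  # everything already exited
```

Launching with start_new_session=True puts the test file, its pytest workers, and any server it spawns into one process group that can be torn down atomically, even after a timeout.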

Workflow-level fix (.github/workflows/)

- name: Kill stale server processes
  run: |
    fuser -k 11000/tcp 2>/dev/null || true
    pkill -9 -f "sglang" 2>/dev/null || true
    sleep 2
  if: always()

Add this before the test run step and also as a post-test cleanup with if: always().

4. Priority

High — This is a flaky infrastructure failure that:

  • Caused 0/3 tests to pass and blocked 2 tests from running at all
  • Is non-deterministic (depends on runner state from prior jobs)
  • Will recur on the same runner unless addressed
  • Masks real test results for EAGLE speculative decoding and LoRA overlap loading

5. Environment Context

Component | Value
Runner | h100-radixark-host1-gpu-3 (persistent Docker container 8e1df5b34166)
GPU | NVIDIA H100 80GB HBM3
Driver | NVIDIA 580.126.09, CUDA 13.0
PyTorch | 2.9.1 (CUDA 12.8)
sglang | 0.0.0.dev1+g5252bd422 (commit 5252bd4)
sglang-kernel | 0.4.0 (pre-built)
flashinfer | 0.6.6
transformers | 5.3.0
Python | 3.10.12
Conflicting port | 127.0.0.1:11000 (EADDRINUSE)
Model | meta-llama/Llama-2-7b-chat-hf + lmsys/sglang-EAGLE-llama2-chat-7B
Test timeout | 1800s per file, 300s watchdog

Job: stage-b-test-large-1-gpu (2)

CI Failure Analysis: stage-b-test-large-1-gpu (2)

1. Root Cause Analysis

Two MiniCPM vision-language models (MiniCPM-V-4 and MiniCPM-o-2_6) fail to launch their sglang server process, crashing with exit code 1 during setUpClass. The server fails first in offline mode (HF_HUB_OFFLINE=1), and then again on the online retry.

The most likely root cause is a model loading incompatibility caused by the combination of:

  • transformers 5.3.0 — a very recent/bleeding-edge version that may have changed model class registrations, config schemas, or trust_remote_code handling for MiniCPM architectures.
  • Missing trust_remote_code files in HF cache — the HF cache validation in Step 19 explicitly flagged failures for MiniCPM models (among others) with missing files, which are required for custom model architectures like MiniCPM-V-4 and MiniCPM-o-2_6.
  • No CUDA coredumps were generated, confirming this is an application-level/Python crash (model loading or config parsing), not a GPU driver issue.

The fact that other VLMs (DeepSeek-OCR, Gemma-3, Qwen2.5-VL, Qwen2-VL, Qwen2-Audio, Qwen3-Omni, Qwen3-VL) all serve and pass successfully strongly points to a MiniCPM-specific model loading regression.

2. Failure Details

Failed Tests:

Test Class | Model | Error
TestMiniCPMV4Server | openbmb/MiniCPM-V-4 | Exception: Server process exited with code 1 in setUpClass
TestMiniCPMo26Server | openbmb/MiniCPM-o-2_6 | Exception: Server process exited with code 1 in setUpClass

Test Summary: 51 tests ran in 535.2s — 2 errors, 6 skipped. Exit code 255.

Failure Pattern:

  1. Test harness attempts to launch sglang server with HF_HUB_OFFLINE=1 → server process exits with code 1
  2. Retries without offline mode → server process exits with code 1 again
  3. setUpClass raises exception, all tests in the class are marked as errors

HF Cache Validation Failures (from Step 19):
The cache integrity check showed 32 FAILs including MiniCPM models missing trust_remote_code files (custom Python modules needed to load these architectures). These models use custom architectures not natively in transformers and rely on downloaded Python files from the HuggingFace Hub.

3. Suggested Fixes

Immediate / Short-term

  1. Investigate MiniCPM server crash logs: The current test output only shows exit code 1. Add or surface the actual server stderr/stdout to diagnose the exact crash point:

    # In the test harness, capture and print server logs on failure
  2. Fix HF cache for MiniCPM models: Re-download the model repos with trust_remote_code files:

    huggingface-cli download openbmb/MiniCPM-V-4 --local-dir-use-symlinks False
    huggingface-cli download openbmb/MiniCPM-o-2_6 --local-dir-use-symlinks False

    Ensure the runner's HF cache includes the custom .py files (not just weights/config).

  3. Pin or test transformers version compatibility: transformers 5.3.0 is very new. Test whether downgrading resolves the issue:

    pip install transformers==4.51.0

    MiniCPM custom code may reference internal transformers APIs that changed in v5.x.

Medium-term

  1. Add per-model cache validation to CI: Before running tests, validate that each tested model has all required files (including *.py for trust_remote_code models):

    # Validate trust_remote_code models have their custom code files
    import glob
    import os

    import huggingface_hub

    for model_id in ["openbmb/MiniCPM-V-4", "openbmb/MiniCPM-o-2_6"]:
        snapshot = huggingface_hub.snapshot_download(model_id, local_files_only=True)
        assert glob.glob(os.path.join(snapshot, "*.py")), f"Missing custom code for {model_id}"
  2. Consider skipping MiniCPM tests when trust_remote_code files are unavailable, with a clear skip message rather than a hard crash.

4. Priority

High

  • The failure blocks the entire test suite partition (0/4 files pass; remaining 3 test files never execute)
  • Two actively supported VLM models are broken in CI
  • The root cause (likely transformers 5.x incompatibility or incomplete HF cache) could affect other trust_remote_code models as well
  • Not Critical because: other model tests pass, and this is partition 2/14 — other partitions are likely unaffected

5. Environment Context

Component | Value
GPU | NVIDIA H100 80GB HBM3 (sm_90)
Driver | 580.126.09
CUDA (PyTorch) | 12.8
PyTorch | 2.9.1
transformers | 5.3.0 ⚠️
tokenizers | 0.22.2
sglang | 0.0.0.dev1+g5252bd422
flash-attn-4 | 4.0.0b4
flashinfer | 0.6.6
Python | 3.10.12
Commit | 5252bd4222d72a32e9c14e5f393c9ed0dac239fb
Runner | h100-radixark-host1-gpu-6
HF Cache | 32 FAIL (incomplete models including MiniCPM)

Job: wait-for-stage-b

CI Failure Analysis: wait-for-stage-b — Job stage-b-test-large-1-gpu (2) Failed

1. Root Cause Analysis

The wait-for-stage-b gating job detected that stage-b-test-large-1-gpu (2) completed with conclusion=failure. This polling job itself functioned correctly — it did exactly what it was designed to do: detect the failure and fail-fast.

The root cause is not determinable from the logs of this job alone. The wait-for-stage-b job is a monitoring/gating job that only polls the GitHub Actions API for job statuses. It does not contain any test output, error messages, stack traces, or build logs from the failing job. The actual failure details (test errors, OOM, timeout, compilation failure, etc.) exist exclusively in the logs of stage-b-test-large-1-gpu (2) itself.

What we can infer:

  • The failure occurred early — stage-b-test-large-1-gpu (2) completed by polling attempt 8 (~14 minutes into the run), while 24/27 jobs were still queued/in-progress. This suggests the failure was not a timeout but rather a fast, hard failure (crash, assertion error, compilation error, or early test failure).
  • Only 3/27 jobs had completed at termination: stage-b-test-small-1-gpu (0) ✅, stage-b-test-4-gpu-b200 ✅, and stage-b-test-large-1-gpu (2) ❌.
  • The PR under test is PR [AMD] CI - Fix AMD CI (multimodal test, move flaky test to non-deterministic group) #20815 (merge commit 5252bd4, merging 27142c2 into cb8105f).

2. Failure Details

Field | Value
Failed job | stage-b-test-large-1-gpu (2)
Status | completed
Conclusion | failure
Error message | ##[error]stage-b jobs failed: stage-b-test-large-1-gpu (2)
Specific test failures | Not available — must inspect the failed job's own logs
Stack traces | Not available — must inspect the failed job's own logs
Time to failure | ~14 minutes or less (detected at attempt 8, polling every 120s)

Direct link to the failing job's logs:
👉 https://github.com/sgl-project/sglang/actions/runs/23285539283 — look for stage-b-test-large-1-gpu (2) in the job list.

3. Suggested Fixes

Since the actual failure details are not in this job's logs, the immediate actions are:

Immediate (Diagnostic)

  1. Inspect the actual failing job logs: Navigate to stage-b-test-large-1-gpu (2) in this workflow run and examine:
    • The test execution step for pytest output / stack traces
    • Server launch logs for OOM or GPU errors
    • Any SGLANG_CUDA_COREDUMP artifacts if a GPU crash occurred
  2. Check PR [AMD] CI - Fix AMD CI (multimodal test, move flaky test to non-deterministic group) #20815 diff: Review what 27142c2 changed — particularly any modifications to large model tests, serving configs, or GPU memory handling that would affect single-GPU large model tests.
  3. Check if flaky: Search for prior failures of stage-b-test-large-1-gpu (2) on main to determine if this is a pre-existing flake or a regression introduced by PR [AMD] CI - Fix AMD CI (multimodal test, move flaky test to non-deterministic group) #20815.

If the failure is a test regression in PR #20815

  • Fix the code in the PR branch and re-push

If the failure is a flaky test

  • Re-run the failed job
  • If it recurs, open a separate issue to track the flaky test with [Flaky] label

Housekeeping (Non-blocking)

  • Node.js 20 deprecation: Update actions/checkout and actions/github-script to versions using Node.js 24 before June 2, 2026 (warning detected in cleanup step)

4. Priority

High — This blocks the entire stage-b gate (fail-fast stopped evaluation of 24 remaining jobs) and likely blocks PR #20815 from merging. However, priority should be reassessed after inspecting the actual failure: if it's a known flake, it drops to Medium.

5. Environment Context

Component | Value
Runner OS | Ubuntu 24.04.3 LTS
Runner image | ubuntu-24.04 v20260309.50.1
Git version | 2.53.0
Region | northcentralus (Azure)
PR | #20815 (merge commit 5252bd4)
CI env vars | SGLANG_IS_IN_CI=true, SGLANG_CUDA_COREDUMP=1, SGLANG_JIT_DEEPGEMM_FAST_WARMUP=true
GPU context | Large-1-GPU test shard (shard index 2 of 14) — likely AMD GPU based on project scope
Stage-b total | 27 jobs expected (8+14+4+1)

⚠️ Key action required: The root cause can only be determined by examining the logs of stage-b-test-large-1-gpu (2). This analysis confirms the gating infrastructure worked correctly; the problem is in the test job itself.


Job: build-test (all)

CI Failure Analysis: build-test (all) — CPU BMM FP8 Test Failure

1. Root Cause Analysis

The sgl_kernel CPU BMM kernel (torch.ops.sgl_kernel.bmm_cpu) does not implement FP8 weight support, but test_bmm.py unconditionally tests FP8 BMM paths on the CPU platform. This is a test–kernel feature mismatch: the test was written (or modified in PR #20815, branch fix_amd_ci_multimodaltest) without gating the FP8 BMM subtests behind a CPU capability check.

This is not an environment, dependency, or infrastructure issue. The kernel explicitly raises:

RuntimeError: bmm: do not support fp8 weight for now.

This is a deliberate guard in the C++/kernel code — the feature simply hasn't been implemented for the CPU backend yet.

2. Failure Details

Field | Value
Failed test file | test/srt/cpu/test_bmm.py
Failed test method | TestBmm.test_bmm (parameterized)
Number of errors | 96 errors across all parameter combinations
Exit code | 1 (test file), 255 (suite runner)
Suite behavior | continue_on_error=False → aborted after first file failure; 18 of 21 tests never ran

Error Message & Stack Trace

RuntimeError: bmm: do not support fp8 weight for now.

Call chain:

test_bmm.py:91  →  _fp8_bmm():67  →  torch.ops.sgl_kernel.bmm_cpu(mat3, mat1, mat2_q_t, False, mat2_s)

Parameter Space (all 96 combinations failed)

  • B ∈ {1, 16, 17}
  • M ∈ {1, 2, 11, 111}
  • N ∈ {160, 512}
  • K ∈ {160, 544}
  • chunk ∈ {True, False}

Tests Blocked (never executed)

test_causal_conv1d.py, test_cpu_graph.py, test_decode.py, test_extend.py, test_flash_attn.py, test_gemm.py, test_intel_amx_attention_backend_{a,b,c}.py, test_mamba.py, test_mla.py, test_moe.py, test_norm.py, test_qkv_proj_with_rope.py, test_qwen3.py, test_rope.py, test_shared_expert.py, test_topk.py

3. Suggested Fixes

Option A: Skip FP8 subtests on CPU (Recommended — quickest fix)

In test/srt/cpu/test_bmm.py, gate FP8 test paths:

import unittest

import torch

# In the test method or at parameterization level:
@unittest.skipIf(
    not torch.cuda.is_available(),  # or check SGLANG_USE_CPU_ENGINE
    "FP8 BMM not supported on CPU kernel"
)
def test_bmm_fp8(self, ...):
    ...

Or more precisely, inside _fp8_bmm() or at test dispatch:

import os

if os.environ.get("SGLANG_USE_CPU_ENGINE") == "1":
    self.skipTest("bmm_cpu does not support fp8 weights")

Option B: Implement FP8 support in the CPU BMM kernel

In sgl-kernel/src/sgl-kernel/cpu/bmm.cpp (or equivalent), add an FP8 code path. This is a larger effort and likely not the intent of PR #20815.

Option C: Separate test parameterization for CPU vs GPU

Refactor test_bmm.py to have distinct parameter sets:

  • GPU: includes FP8 weight tests
  • CPU: excludes FP8 weight tests (or tests only BF16/FP32)
import os

CPU_DTYPES = ["bf16", "fp32"]
GPU_DTYPES = ["bf16", "fp32", "fp8"]

dtypes = CPU_DTYPES if os.environ.get("SGLANG_USE_CPU_ENGINE") == "1" else GPU_DTYPES

Additional: Enable continue_on_error or fix test ordering

Consider setting continue_on_error=True in the suite runner so that one early failure doesn't block 18 other tests from providing signal. This won't fix the root cause but improves CI observability.

4. Priority

High

5. Environment Context

Component | Value
Platform | Intel Xeon (CPU-only, SGLANG_USE_CPU_ENGINE=1)
CPU Feature | Intel AMX confirmed available
Docker image | sglang_xeon from docker/xeon.Dockerfile (Ubuntu 24.04)
Python | 3.12.13 (via uv venv)
PyTorch | 2.9.0+cpu
sglang | 0.5.6.post3.dev3008+g27142c2ee (branch fix_amd_ci_multimodaltest)
sglang-kernel-cpu | 0.4.0 (built from source)
Triton | 3.5.0 (not functional on CPU — expected)
Runner | gnr88001599-1 / sdp
PR | #20815

Cross-Job Analysis

Unified CI Failure Analysis — PR #20815 (fix_amd_ci_multimodaltest)

1. Common Patterns

Pattern | Jobs Affected | Shared?
Port conflict / stale processes on persistent runners | stage-b-test-large-1-gpu (0) | No — infra-specific
MiniCPM model loading crash (likely transformers 5.3.0 or HF cache) | stage-b-test-large-1-gpu (2) | No — model-specific
FP8 BMM not implemented on CPU | build-test (all) | No — test-code bug
Gating job propagating upstream failure | wait-for-stage-b | Derivative only
Single failure blocking remaining tests in suite | All 3 real failures | ✅ Yes — continue_on_error=False amplifies every failure

These are three independent root causes. There is no single broken dependency, shared infra issue, or common code change that unifies them. The only shared pattern is that each failure is amplified by the test runner's continue_on_error=False behavior, turning one broken test file into a full suite wipeout.

2. Cross-Job Dependencies

stage-b-test-large-1-gpu (0)  ──[independent]──  stage-b-test-large-1-gpu (2)
                                                           │
                                                           ▼
                                                   wait-for-stage-b  (derivative — gating job)
                                                           │
                                                           ▼
                                                   All remaining stage-b jobs killed (24 jobs lost signal)

build-test (all)  ──[independent, separate pipeline]
  • wait-for-stage-b is purely derivative — it failed because it correctly detected stage-b-test-large-1-gpu (2) failing. Fixing job (2) fixes the gate.
  • stage-b-test-large-1-gpu (0) and (2) run on different runners with different models. No causal link.
  • build-test (all) is a CPU-only build pipeline on an Intel Xeon runner — completely isolated from the H100 GPU jobs.

3. Unified Root Cause Assessment

There is no single unified root cause. The three real failures are:

# | Category | Root Cause | Introduced By
1 | Infra | Zombie process holding port 11000 on persistent Docker runner | Pre-existing runner state (not PR code)
2 | Compatibility | MiniCPM models fail to load — likely transformers 5.3.0 regression or incomplete HF cache for trust_remote_code models | Environment / dependency drift
3 | Test bug | test_bmm.py calls FP8 BMM on CPU, which the kernel explicitly doesn't support | PR #20815 code or pre-existing test gap

The PR branch name (fix_amd_ci_multimodaltest) suggests the author was fixing AMD CI for multimodal tests — the CPU BMM FP8 failure (#3) may be a pre-existing issue newly exposed by this PR's test scope changes, or a test file added/modified without CPU gating.

4. Priority Ranking

Priority | Job | Rationale | Fix Effort
🔴 P1 | build-test (all) — FP8 BMM | Blocks PR merge. Directly fixable in PR code. 96 errors, 18 tests blocked. Simple skip/gate fix. | 🟢 ~10 min
🟠 P2 | stage-b-test-large-1-gpu (2) — MiniCPM | Blocks stage-b gate (kills 24 jobs via wait-for-stage-b). Needs investigation — either fix HF cache, pin transformers, or skip MiniCPM tests. | 🟡 ~1–2 hrs
🟡 P3 | stage-b-test-large-1-gpu (0) — Port conflict | Flaky infra issue, not caused by PR code. Add fuser -k cleanup step. Won't recur on a different runner. | 🟢 ~15 min (workflow change)
⚪ P4 | wait-for-stage-b | No action needed — automatically resolves when P2 is fixed. | N/A

5. Overall Recommendation

PR #20815 has three independent failures that should be addressed in order: (1) add a CPU/FP8 skip guard in test_bmm.py — this is a ~10-minute code fix that unblocks the entire CPU test suite and is almost certainly within the PR author's scope; (2) investigate the MiniCPM server crash on stage-b-test-large-1-gpu (2) by examining its server stderr logs, then either re-download the HF cache with trust_remote_code files, pin transformers<5.0, or temporarily skip MiniCPM tests — this is the highest-impact fix since the fail-fast gate killed 24 downstream jobs; (3) add a fuser -k <port>/tcp pre-cleanup step to the CI workflow to prevent port conflicts on persistent runners, which is a one-line workflow fix that prevents future flakes. None of these failures share a common root cause, so they must be addressed independently — but the good news is that each has a straightforward, well-scoped fix.


Automated CI analysis by amd-bot — progressive step analysis

@bingxche
Collaborator

@bingxche requested a review

Claude Code Review

PR #20815: [AMD] CI - Fix AMD CI (multimodal test, move flaky test to non-deterministic group)
Reviewed at 2026-03-19 09:22 UTC

Code Review: [AMD] CI - Fix AMD CI (multimodal test, move flaky test to non-deterministic group)

1. Summary

This PR makes three changes to fix AMD CI stability:

  1. Increases the number of partitions for multimodal/diffusion server tests from 2 to 4 (reducing tests per partition to avoid timeouts or resource exhaustion).
  2. Moves test_vlm_models.py from the deterministic AMD CI suite to the nondeterministic suite (acknowledging flaky behavior).
  3. Moves test_multi_lora_backend.py from the nondeterministic AMD CI suite back to the deterministic suite.

2. Code Quality

Bug / Inconsistency

  • .github/workflows/pr-test-amd.yml, line 542: The comment was not updated after changing the partition count:
    part: [0, 1, 2, 3]  # 2 partitions: 11 tests ÷ 2 = ~5-6 tests each
    This is misleading — it now has 4 partitions, not 2. The comment should read something like # 4 partitions: 11 tests ÷ 4 = ~2-3 tests each. The ROCm 7.2 workflow file correctly removed the stale comment.

Timeout Adjustment

  • .github/workflows/pr-test-amd.yml, line 624: The timeout was increased from 70 to 90 minutes. With 4 partitions (fewer tests per job) and max-parallel: 1, each partition should run fewer tests. A 90-minute timeout for ~2-3 tests seems generous but safe. Worth confirming this is intentional and not masking a deeper performance issue.

  • .github/workflows/pr-test-amd-rocm720.yml: The timeout was not similarly increased. This could be intentional (ROCm 7.2 may be faster) or an oversight. Worth confirming consistency.

Test Suite Moves

  • Moving test_vlm_models.py to nondeterministic and test_multi_lora_backend.py to deterministic appears reasonable given the linked CI results showing passing tests. However, there's no documentation or comment explaining why VLM tests are flaky on AMD — this context would help future maintainers.

3. Performance

  • Increasing partitions from 2 to 4 with max-parallel: 1 means the total wall-clock time for the diffusion test job will increase (4 sequential jobs instead of 2). This is a tradeoff of CI reliability vs. CI speed. Given the motivation (avoiding resource exhaustion and evictions), this seems acceptable.
  • No impact on serving/inference performance.

4. Security

No security concerns.

5. Testing

6. Suggestions

  1. Fix the stale comment in pr-test-amd.yml:

    # Before (incorrect):
    part: [0, 1, 2, 3]  # 2 partitions: 11 tests ÷ 2 = ~5-6 tests each
    
    # After (correct):
    part: [0, 1, 2, 3]  # 4 partitions: 11 tests ÷ 4 = ~2-3 tests each
  2. Consider aligning timeout between the two workflow files. If pr-test-amd.yml needs 90 minutes, confirm pr-test-amd-rocm720.yml doesn't also need it (currently unchanged at its existing value — appears to be 70 minutes based on the diff context).

  3. Add a brief comment explaining VLM flakiness on AMD in test_vlm_models.py:

    # VLM tests are nondeterministic on AMD due to [brief reason, e.g., numerical precision differences]
    register_amd_ci(est_time=850, suite="stage-b-test-small-1-gpu-amd-nondeterministic")
  4. Minor: Consider whether max-parallel: 1 is still strictly necessary with 4 smaller partitions — if each partition uses fewer resources, it might be possible to run 2 in parallel, reducing overall CI time. (This is a follow-up consideration, not blocking.)

7. Overall Assessment

Approve (with minor nit on stale comment)

The changes are straightforward CI fixes: splitting tests into more partitions to reduce per-job resource pressure, adjusting a timeout, and correctly categorizing flaky vs. stable tests. The linked CI run validates the changes. The only actionable issue is the stale comment in pr-test-amd.yml which should be fixed before merge.


Automated review by amd-bot using Claude. This is an AI-generated review — please use your judgment.

@bingxche
Collaborator

@bingxche requested a review

Claude Code Review

PR #20815: [AMD] CI - Fix AMD CI (multimodal test, move flaky test to non-deterministic group)
Reviewed at 2026-03-19 09:23 UTC

Code Review: [AMD] CI - Fix AMD CI (multimodal test, move flaky test to non-deterministic group)

1. Summary

This PR makes three targeted changes to fix AMD CI stability:

  1. Increases the number of partitions for multimodal/diffusion server tests from 2 to 4 (reducing tests per partition), likely to avoid OOM or timeout issues on AMD MI325 GPUs.
  2. Moves test_vlm_models.py from the deterministic AMD CI suite to the non-deterministic suite (acknowledging flaky behavior).
  3. Moves test_multi_lora_backend.py from the non-deterministic suite back to the deterministic suite (indicating it's now stable).

2. Code Quality

Minor Issue: Stale Comment

File: .github/workflows/pr-test-amd.yml, line 542

part: [0, 1, 2, 3]  # 2 partitions: 11 tests ÷ 2 = ~5-6 tests each

The comment still says "2 partitions: 11 tests ÷ 2 = ~5-6 tests each" but now there are 4 partitions. This should be updated to something like # 4 partitions: 11 tests ÷ 4 = ~2-3 tests each. The ROCm 7.2.0 workflow correctly removed the old comment.

Consistency

The ROCm 7.2.0 workflow (pr-test-amd-rocm720.yml) does not have a timeout increase, while pr-test-amd.yml increases the timeout from 70 to 90 minutes. This seems intentional (different hardware/software performance profiles), but worth confirming.

Logic

  • Increasing partitions from 2→4 while keeping max-parallel: 1 means the total wall-clock time for the diffusion test job increases (4 sequential runs instead of 2), but each individual run processes fewer tests, reducing memory pressure and per-partition timeout risk. This is a reasonable trade-off for CI stability.

3. Performance

No impact on serving/inference performance. This is purely a CI configuration change. The max-parallel: 1 constraint means the 4 partitions run sequentially, which will increase total CI time but improve reliability.

4. Security

No security concerns. CI workflow changes don't affect the runtime codebase.

5. Testing

6. Suggestions

  1. Fix stale comment in pr-test-amd.yml:

    # Before (incorrect):
    part: [0, 1, 2, 3]  # 2 partitions: 11 tests ÷ 2 = ~5-6 tests each
    
    # After (correct):
    part: [0, 1, 2, 3]  # 4 partitions: 11 tests ÷ 4 = ~2-3 tests each
  2. Consider adding the timeout increase to pr-test-amd-rocm720.yml as well if the same timeout pressure applies there, or add a comment explaining why it's not needed.

  3. Consider adding a brief comment in test_vlm_models.py explaining why it's in the nondeterministic suite for AMD, to help future contributors understand:

    # VLM tests are non-deterministic on AMD due to [brief reason, e.g., "output variance on MI325"]
    register_amd_ci(est_time=850, suite="stage-b-test-small-1-gpu-amd-nondeterministic")

7. Overall Assessment

Approve — with a minor nit on the stale comment.

This is a straightforward, well-scoped CI fix. The changes are logical: splitting partitions to reduce per-run load, increasing timeout for the larger workload, and correctly categorizing flaky vs. stable tests. The linked CI run demonstrates the fix works. The only actionable item is fixing the outdated comment in pr-test-amd.yml.


Automated review by amd-bot using Claude. This is an AI-generated review — please use your judgment.

@bingxche
Collaborator

@yctseng0211 requested CI status check

CI Status for PR #20815

Total: 81 checks

  • Passed: 30
  • Failed: 4
  • Pending: 17

Failed Checks:


Detailed Analysis

Method: Progressive step-by-step analysis (all steps examined per job)

Job: stage-b-test-large-1-gpu (0)

CI Failure Analysis: stage-b-test-large-1-gpu (0)

1. Root Cause Analysis

The test registered/spec/eagle/test_eagle_infer_b.py failed because the EAGLE speculative decoding server could not bind to port 11000 — the port was already in use by another process. This happened identically across all 3 retry attempts, indicating a persistent stale process or another service occupying the port on the runner.

Why the port was occupied: The most likely scenario is that a previous CI job (or a previous test within a prior run) on the same runner (h100-radixark-host1-gpu-3) left behind a zombie/orphan server process still bound to port 11000. The runner reuses the same Docker container (8e1df5b34166), and the cleanup between jobs did not kill lingering processes listening on that port. Each retry attempt within the test also compounded the problem — after the first attempt's server started and hit the bind error, the server process itself may not have been fully killed before the next attempt launched.

Contributing factor: py-spy (used for debugging stuck processes) failed with Permission denied (os error 13), preventing the watchdog from collecting useful diagnostic stack traces of the process holding the port.

2. Failure Details

Failed Test

  • File: registered/spec/eagle/test_eagle_infer_b.py
  • Failure mode: TIMEOUT (1800s hard limit exceeded)

Error Message (repeated 3 times)

ERROR: [Errno 98] error while attempting to bind on address ('127.0.0.1', 11000): address already in use

Failure Timeline (all 3 attempts identical pattern)

| Attempt | Server Start | Bind Error | Watchdog Timeout |
|---|---|---|---|
| 1 | 08:41:03 | 08:41:33 (~30s later) | 08:46:04 (300s soft) |
| 2 | 08:51:03 | 08:51:25 (~22s later) | 08:56:05 (300s soft) |
| 3 | 09:01:04 | 09:01:29 (~25s later) | 09:06:05 (300s soft) |

Cascading Impact

The 1800s timeout on the first test (3 × 300s watchdog + startup overhead) prevented the remaining 2 tests from executing:

  • registered/lora/test_lora_overlap_loading.py — SKIPPED (never ran)
  • registered/model_loading/test_utils_update_weights.py — SKIPPED (never ran)

Secondary Error

py-spy: Permission denied (os error 13)

py-spy could not attach to the stuck process to gather diagnostics, likely due to ptrace restrictions in the container (SYS_PTRACE capability not granted or kernel.yama.ptrace_scope > 0).

3. Suggested Fixes

Immediate / Short-term

  1. Kill stale processes before test execution — Add a pre-test cleanup step in ci_install_dependency.sh or at the start of run_suite.py:

    # Kill any process listening on common test ports
    for port in 11000 11001 11002 11003 30000 30001; do
      fuser -k ${port}/tcp 2>/dev/null || true
    done
  2. Use dynamic/random ports in tests — Modify test_eagle_infer_b.py (and similar tests) to use port=0 or a randomly selected free port instead of hardcoded 11000:

    import socket
    def get_free_port():
        with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
            s.bind(('', 0))
            return s.getsockname()[1]
  3. Add port cleanup between retry attempts — In the test's retry logic, explicitly kill any process on the target port before retrying:

    import subprocess
    import time

    subprocess.run(["fuser", "-k", f"{port}/tcp"], capture_output=True)
    time.sleep(2)  # Allow socket TIME_WAIT to clear
  4. Enable SO_REUSEADDR / SO_REUSEPORT — If using uvicorn/uvloop, ensure the server socket sets SO_REUSEADDR:

    # In server launch config
    uvicorn.run(..., host="127.0.0.1", port=port)
    # uvicorn sets SO_REUSEADDR by default, but verify it's not overridden
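A quick way to see why fix 4 alone cannot resolve this: on Linux, SO_REUSEADDR only permits rebinding over a socket lingering in TIME_WAIT, not over a live listener, so a stale server process still triggers EADDRINUSE. A self-contained sketch:

```python
import socket

def can_bind(port: int) -> bool:
    """Try to bind a fresh SO_REUSEADDR socket to `port` on localhost."""
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    try:
        s.bind(("127.0.0.1", port))
        return True
    except OSError:  # EADDRINUSE
        return False
    finally:
        s.close()

# Simulate the stale server: hold a LISTEN socket on an ephemeral port.
holder = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
holder.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
holder.bind(("127.0.0.1", 0))
holder.listen(1)
port = holder.getsockname()[1]

blocked = not can_bind(port)
print(blocked)  # True: SO_REUSEADDR does not override an active LISTEN socket
holder.close()
```

This is why the cleanup in fixes 1 and 3 (killing the stale process) remains necessary regardless of socket options.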

Medium-term

  1. Grant SYS_PTRACE capability to the CI container so py-spy can attach and provide useful diagnostics:

    # In the runner's Docker configuration
    docker run --cap-add SYS_PTRACE ...

    Or set the sysctl:

    echo 0 > /proc/sys/kernel/yama/ptrace_scope
  2. Add a pre-test port availability check in the test harness (run_suite.py):

    import socket
    import subprocess

    def assert_port_free(port, host="127.0.0.1"):
        with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
            result = s.connect_ex((host, port))
            if result == 0:
                # Port is in use — find and log the offending process
                subprocess.run(["lsof", "-i", f":{port}"])
                raise RuntimeError(f"Port {port} is already in use")
  3. Restart the Docker container between CI jobs on h100-radixark-host1-gpu-3 to ensure a clean process namespace, or use --rm flag so containers are ephemeral.

Long-term

  1. Refactor test infrastructure to use ephemeral port allocation project-wide, eliminating hardcoded ports as a class of CI failures.

4. Priority

High 🔴

  • 3 out of 3 tests failed in this partition (1 timeout + 2 never ran)
  • This is an infrastructure/environment issue, not a code bug — it will likely recur on the same runner
  • The port conflict pattern is deterministic and will block all EAGLE speculative decoding tests until resolved
  • No useful diagnostics can be collected due to py-spy permission issue

5. Environment Context

| Detail | Value |
|---|---|
| Runner | h100-radixark-host1-gpu-3 |
| Container | 8e1df5b34166 (persistent, not ephemeral) |
| GPU | NVIDIA H100 80GB HBM3 (sm_90) |
| CUDA Driver | 580.126.09 |
| PyTorch | 2.9.1 (CUDA 12.8) |
| sglang | 0.0.0.dev1+g5252bd422 |
| FlashInfer | 0.6.6 |
| Python | 3.10.12 |
| OS | Ubuntu 22.04 |
| Commit | 5252bd4222d72a32e9c14e5f393c9ed0dac239fb |
| Conflicting Port | 127.0.0.1:11000 (EADDRINUSE across all 3 attempts) |
| Test Suite | stage-b-test-large-1-gpu, partition 0/14 |
| Timeout | 1800s per file, 300s watchdog per server launch attempt |

Job: stage-b-test-large-1-gpu (2)

CI Failure Analysis: stage-b-test-large-1-gpu (2)

1. Root Cause Analysis

Two VLM models — openbmb/MiniCPM-V-4 and openbmb/MiniCPM-o-2_6 — fail to start their inference servers (exit code 1) during test setup.

The most likely root cause is incompatibility between these MiniCPM models and transformers==5.3.0. Several clues support this:

  • HuggingFace cache validation in Step 19 flagged 32 FAILs, including models missing trust_remote_code files — MiniCPM models are known to rely heavily on custom modeling code via trust_remote_code=True.
  • Transformers 5.3.0 is very new (bleeding-edge) and introduced RoPE compatibility warnings across all models tested. MiniCPM's custom code may reference internal transformers APIs that changed or were removed in v5.x.
  • The easydict module was reported missing (for DeepSeek-OCR), indicating the environment may be missing optional dependencies that custom model code requires. MiniCPM models similarly depend on custom tokenizer/processor code that may import packages not installed in this environment.
  • Both failures are Python-level (exit code 1, no CUDA coredumps generated), consistent with an import error or model initialization crash rather than a GPU issue.
  • The CI retried with HF_HUB_OFFLINE=0 (re-downloading model files), which also failed — ruling out a simple cache corruption issue and pointing to a code-level incompatibility.

2. Failure Details

Failed Tests

| Test Class | Model | Error |
|---|---|---|
| TestMiniCPMV4Server | openbmb/MiniCPM-V-4 | setUpClass — Server process exited with code 1 |
| TestMiniCPMo26Server | openbmb/MiniCPM-o-2_6 | setUpClass — Server process exited with code 1 |

Error Messages

ERROR: setUpClass (__main__.TestMiniCPMV4Server)
Exception: Server process exited with code 1. Check server logs for errors.

ERROR: setUpClass (__main__.TestMiniCPMo26Server)
Exception: Server process exited with code 1. Check server logs for errors.

Test File Results

  • File: test/srt/models/test_vision_openai_server_a.py
  • 51 tests ran in 535.2s: 43 passed, 2 errors, 6 skipped
  • Exit code: 1 (file-level failure)
  • Suite exit code: 255 (0/4 test files passed — remaining 3 files likely didn't execute due to early abort or are not shown)

Key Warnings

  • Missing MoE kernel configs for E=64,N=896 and E=128,N=768 on H100 (triton 3.5.1)
  • easydict module not found (DeepSeek-OCR)
  • Transformers 5.3.0 RoPE compatibility warnings across all models
  • Multimodal embedding cache full warnings for video inputs

3. Suggested Fixes

Immediate (unblock CI)

  1. Retrieve server logs for MiniCPM models — The current error message says "Check server logs for errors" but the actual server stderr/stdout is not captured in CI output. Modify the test harness or CI script to dump server logs on failure:

    # In test setup, capture and print server stderr on crash.
    # (Requires launching the server with stderr captured, e.g.
    #  process = subprocess.Popen(cmd, stderr=subprocess.PIPE, text=True))
    if process.returncode != 0:
        print(f"SERVER STDERR:\n{process.stderr.read()}")
  2. Pin transformers to a known-good version until MiniCPM compatibility is confirmed:

    pip install transformers==4.51.0
  3. Install missing optional dependencies that MiniCPM custom code may require:

    pip install easydict timm
  4. Skip MiniCPM tests temporarily if blocking other CI work (add to skip list in run_suite.py or the test file):

    @unittest.skip("MiniCPM-V-4 incompatible with transformers 5.3.0 — see #XXXX")
    class TestMiniCPMV4Server(...):
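If pinning (fix 2) is undesirable, fixes 2 and 4 can be combined into a version-gated skip. A sketch, assuming the 5.x major version is the breaking boundary (unconfirmed) and using a hypothetical helper:

```python
import unittest

def transformers_major(version: str) -> int:
    """Parse the major component of a transformers version string."""
    return int(version.split(".")[0])

def skip_if_incompatible(version: str):
    """Return a skip decorator when transformers is 5.x or newer (the
    assumed boundary for the MiniCPM trust_remote_code breakage)."""
    return unittest.skipIf(
        transformers_major(version) >= 5,
        f"MiniCPM custom code untested against transformers {version}",
    )

print(transformers_major("5.3.0"))   # 5
print(transformers_major("4.51.0"))  # 4
```

The decorator would then be applied as @skip_if_incompatible(transformers.__version__) on the MiniCPM test classes, keeping the rest of the VLM suite running.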

Medium-term

  1. Add transformers version compatibility matrix — Test MiniCPM models against transformers 4.x vs 5.x to identify the breaking version boundary.

  2. Improve server crash diagnostics — The test framework should always capture and surface server process logs when exit code != 0, not just say "check server logs."

  3. Fix the suite runner — The suite reports 0/4 files passed, suggesting the 3 other test files (test_anthropic_tool_use.py, test_session_latency.py, test_mrope.py) may not have executed at all after the first file failed. Verify the runner continues to subsequent files on failure.

4. Priority

High

  • The failures are reproducible (both offline and online retries failed)
  • They block the entire partition (exit code 255, 0/4 files passed)
  • MiniCPM is a supported model family in sglang
  • However, this is not a regression in sglang core — it's likely a dependency version issue, so not Critical

5. Environment Context

| Component | Value |
|---|---|
| GPU | NVIDIA H100 80GB HBM3 (sm_90) |
| Driver | 580.126.09 |
| CUDA | 13.0 (torch built for 12.8) |
| Python | 3.10.12 |
| PyTorch | 2.9.1+cu128 |
| transformers | 5.3.0 ⚠️ (bleeding edge) |
| sglang | 0.0.0.dev1+g5252bd422 |
| sglang-kernel | 0.4.0 |
| FlashInfer | 0.6.6 |
| flash-attn-4 | 4.0.0b4 |
| Triton | 3.5.1 |
| torchao | 0.9.0 |
| Runner | h100-radixark-host1-gpu-6 |
| Commit | 5252bd4222d72a32e9c14e5f393c9ed0dac239fb |
| Date | 2026-03-19 |

Key observation: transformers==5.3.0 is an unusually new version. The RoPE warnings across all models and the MiniCPM crashes strongly suggest this is the primary environmental factor causing the failures.


Job: wait-for-stage-b

CI Failure Analysis: wait-for-stage-b — Job stage-b-test-large-1-gpu (2) Failed

1. Root Cause Analysis

The wait-for-stage-b job is a gate/polling job — it did not fail itself. It correctly detected that a downstream job, stage-b-test-large-1-gpu (2), completed with conclusion=failure and triggered its fail-fast mechanism.

The root cause is NOT in this job's logs. This job only monitors other jobs via the GitHub API. The actual failure occurred in:

stage-b-test-large-1-gpu (2) — matrix index 2 of the stage-b-test-large-1-gpu job group (14 total jobs)

The logs from this polling job provide no information about what failed inside stage-b-test-large-1-gpu (2) — no error messages, stack traces, or test output are surfaced here. The failure could be:

  • A test failure in one of the large-1-GPU test suites
  • A model loading / OOM issue on the GPU runner
  • A server startup timeout
  • An infrastructure issue (container crash, GPU fault, network timeout)
  • A flaky test

Without the logs from stage-b-test-large-1-gpu (2), the specific root cause cannot be determined from this job alone.

2. Failure Details

| Field | Value |
|---|---|
| Failed job | stage-b-test-large-1-gpu (2) |
| Detection method | Fail-fast in wait-for-stage-b polling script |
| Error message | ##[error]stage-b jobs failed: stage-b-test-large-1-gpu (2) |
| Polling attempt | Attempt 8 (~14 min into monitoring) |
| Jobs completed at failure | 3 / 27 |
| Other completed jobs | stage-b-test-small-1-gpu (0) ✅, stage-b-test-4-gpu-b200 |
| Remaining jobs | 24 jobs not yet completed (skipped due to fail-fast) |
| Specific error/stack trace | Not available in this job's logs |

3. Suggested Fixes

Immediate Actions

  1. Inspect the actual failed job's logs:

    • Navigate to: Workflow Run #23285539283
    • Find and expand stage-b-test-large-1-gpu (2)
    • Look at the test execution step for error messages, stack traces, and which specific test(s) failed
  2. Check if this is a flaky failure:

    • Search recent CI runs for other failures in stage-b-test-large-1-gpu (2) specifically
    • If the same matrix index fails intermittently, identify the specific test file assigned to index (2) in the matrix configuration and check for known flaky tests
  3. Re-run the failed job:

Investigation Path (in the failed job's logs)

Look for:
- pytest output with FAILED/ERROR markers
- CUDA OOM errors ("OutOfMemoryError")
- Server startup timeouts ("Timeout waiting for server")
- CUDA coredumps (SGLANG_CUDA_COREDUMP=1 is enabled)
- Container/process crashes (exit code != 0)

If This Is a PR Regression (PR #20815)

  • Compare the test results against the main branch (cb8105f)
  • Identify what changed in commit 27142c2 that could affect large-1-GPU test scenarios
  • The merge commit is 5252bd4

4. Priority

High — This blocks the entire stage-b gate for PR #20815. The fail-fast behavior means 24 of 27 jobs were effectively abandoned. However, until the actual failure in stage-b-test-large-1-gpu (2) is examined, it's unclear whether this is a real regression or a transient/infrastructure issue.

5. Environment Context

| Variable | Value |
|---|---|
| Runner OS | Ubuntu 24.04.3 LTS |
| Runner Image | ubuntu-24.04 / 20260309.50.1 |
| Git | 2.53.0 |
| PR | #20815 (merge commit 5252bd4) |
| PR head | 27142c2eec85718a0927303c4ee9eb86382d2b7e |
| Base | cb8105fe282fc373b5baed63d5df38682418a373 |
| SGLANG_IS_IN_CI | true |
| SGLANG_CUDA_COREDUMP | 1 (coredumps enabled — check for dumps in failed job) |
| SGLANG_JIT_DEEPGEMM_FAST_WARMUP | true |
| Node.js deprecation | actions/checkout@v4 and actions/github-script@v7 using Node.js 20 (EOL June 2, 2026) — not related to failure |

⚠️ Action Required: The actual root cause can only be determined by examining the logs of stage-b-test-large-1-gpu (2). This analysis confirms the wait-for-stage-b gate job operated correctly — it is purely a messenger of the downstream failure.


Job: build-test (all)

CI Failure Analysis: build-test (all) — CPU FP8 BMM Kernel Not Implemented

1. Root Cause Analysis

The test suite per-commit-cpu fails because test_bmm.py exercises FP8 (8-bit floating point) batch matrix multiplication on the CPU kernel, which explicitly does not support FP8 weights. The CPU implementation of sgl_kernel.bmm_cpu raises a hard error when called with FP8-quantized inputs. This is a known, intentional limitation in the CPU kernel ("bmm: do not support fp8 weight for now."), but the test suite does not skip or guard against this unsupported code path on CPU.

Because the suite runs with continue_on_error=False and enable_retry=False, the first test file failure (test_bmm.py) causes an immediate abort — 19 of 21 test files were never executed, making the failure impact appear worse than the underlying issue.

This is not a regression from PR #20815 (fix_amd_ci_multimodaltest). It is a pre-existing gap between test coverage expectations and CPU kernel capabilities.

2. Failure Details

Failed Test

| File | Tests | Errors | Exit Code |
|---|---|---|---|
| cpu/test_bmm.py | 1 test (96 parameterized subtests) | 96 | 1 |

Error Message

RuntimeError: bmm: do not support fp8 weight for now.

Stack Trace

File "test_bmm.py", line 67, in _fp8_bmm
    torch.ops.sgl_kernel.bmm_cpu(mat3, mat1, mat2_q_t, False, mat2_s)
RuntimeError: bmm: do not support fp8 weight for now.

Parameter Space (all 96 combinations failed identically)

  • B (batch): {1, 16, 17}
  • M: {1, 2, 11, 111}
  • N: {160, 512}
  • K: {160, 544}
  • chunk: {True, False}

Suite Execution Summary

Passed:  2/21  (cpu/test_activation.py, cpu/test_binding.py)
Failed:  1/21  (cpu/test_bmm.py)
Skipped: 19/21 (never reached — early abort)
Exit code: 255

Non-Fatal Warnings

  • Triton is not supported on current platform, roll back to CPU
  • Only CUDA, HIP and XPU support AWQ currently
  • Only CUDA and MUSA support GGUF quantization currently
  • numa_migrate_pages failed / get_mempolicy: Operation not permitted (Docker NUMA policy restriction)

3. Suggested Fixes

Option A: Skip FP8 tests on CPU (quick fix, recommended)

In test/srt/cpu/test_bmm.py, guard the FP8 test path:

import unittest
import torch

# At the test method or class level:
@unittest.skipUnless(
    torch.cuda.is_available(),
    "FP8 BMM is not supported on CPU kernels"
)
def test_fp8_bmm(self):
    ...

Or, if the test is parameterized and mixes FP8/non-FP8 paths, add an inline skip:

def _fp8_bmm(self, B, M, N, K, chunk):
    if not torch.cuda.is_available():
        self.skipTest("bmm: FP8 weight not supported on CPU")
    ...

Option B: Enable continue_on_error for the CPU suite

In run_suite.py or the suite configuration for per-commit-cpu, set continue_on_error=True so that subsequent test files still execute even when one file fails. This doesn't fix the root issue but prevents 19 tests from being silently skipped.
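The behavioral difference is easy to state precisely. A small sketch (run_suite.py's real internals are assumed; run_one is a stand-in for executing one test file):

```python
def run_files(files, run_one, continue_on_error=True):
    """Run each test file via run_one (returns an exit code) and collect
    failures. With continue_on_error=False, abort at the first failing
    file, which is the per-commit-cpu behavior that left 19/21 files
    unexecuted."""
    failed = []
    for f in files:
        if run_one(f) != 0:
            failed.append(f)
            if not continue_on_error:
                break
    return failed

# Stub exit codes: b.py and c.py both fail.
codes = {"a.py": 0, "b.py": 1, "c.py": 1}
print(run_files(list(codes), codes.get, continue_on_error=False))  # ['b.py']
print(run_files(list(codes), codes.get, continue_on_error=True))   # ['b.py', 'c.py']
```

With continue_on_error=True the suite still fails, but every file runs and the report shows all failures rather than only the first.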

Option C: Implement FP8 BMM support in the CPU kernel (longer-term)

The error string "do not support fp8 weight for now" suggests this is a planned feature. Track implementation in a separate issue for the sgl_kernel_cpu package.

Option D: Remove test_bmm.py from the per-commit-cpu suite

If FP8 BMM on CPU is not planned near-term, exclude the file from the suite definition until the kernel supports it.

4. Priority

Medium

  • The failure is deterministic and blocks the entire CPU test suite (19/21 tests never run), masking potential real regressions.
  • However, it is not a regression from this PR — it's a pre-existing test/kernel mismatch.
  • The fix is trivial (Option A: add a skipTest guard).

5. Environment Context

| Component | Value |
|---|---|
| Platform | CPU (Xeon), no GPU |
| CPU Feature | AMX tile instructions confirmed available |
| Docker Image | sglang_xeon (Ubuntu 24.04) |
| Python | 3.12.13 (via uv 0.10.11) |
| PyTorch | 2.9.0+cpu |
| sglang | 0.5.6.post3.dev3008+g27142c2ee |
| sglang-kernel-cpu | 0.4.0 (built from source) |
| Triton | 3.5.0 (not functional on CPU — falls back) |
| PR Branch | fix_amd_ci_multimodaltest (#20815) |
| Runner | gnr88001599-1 / sdp |
| Suite Config | per-commit-cpu, continue_on_error=False, enable_retry=False, timeout 1500s/file |

Cross-Job Analysis

Unified CI Failure Analysis — PR #20815 (fix_amd_ci_multimodaltest)

1. Common Patterns

| Pattern | Jobs Affected | Nature |
|---|---|---|
| Environment/infra issue, not code regression | All 4 | None of the failures trace back to code changes in PR #20815 |
| Pre-existing issues exposed by CI environment | Jobs 0, 2, build-test | Stale processes, dependency skew, missing test guards |
| Cascading abort hiding true test coverage | Jobs 0, build-test | Early failures prevent remaining tests from running (19/21 skipped in CPU; 2/3 skipped in Job 0) |
| wait-for-stage-b is purely derivative | wait-for-stage-b | Gate job correctly propagated Job 2's failure — not an independent issue |

The failures are NOT related to each other. They share no common root cause. Each is an independent issue:

| Job | Category | Root Cause |
|---|---|---|
| Job 0 | Runner infra | Stale process holding port 11000 |
| Job 2 | Dependency skew | transformers==5.3.0 breaks MiniCPM custom model code |
| wait-for-stage-b | Derivative | Mirrors Job 2's failure |
| build-test (all) | Test gap | FP8 BMM test not guarded for CPU-only execution |

2. Cross-Job Dependencies

stage-b-test-large-1-gpu (2) ──FAILED──► wait-for-stage-b ──FAILED──► 24 jobs ABANDONED
  • wait-for-stage-b is a direct consequence of Job 2's failure. Fixing Job 2 eliminates this failure entirely.
  • Job 0 and Job 2 are independent matrix partitions on different runners (gpu-3 vs gpu-6) — no causal relationship.
  • build-test (all) runs on a CPU-only runner (gnr88001599-1) and is completely independent of the GPU jobs.
  • The fail-fast in wait-for-stage-b caused 24 of 27 stage-b jobs to be abandoned, massively amplifying the blast radius of Job 2's failure.

3. Unified Root Cause

There is no single unified root cause. These are three independent failures:

| # | Cause | Scope |
|---|---|---|
| 1 | Port 11000 occupied by orphan process on h100-radixark-host1-gpu-3 | Runner-specific infra |
| 2 | transformers==5.3.0 incompatible with MiniCPM custom modeling code | Environment-wide dependency |
| 3 | test_bmm.py calls FP8 kernel on CPU without skip guard | Test code gap |

If forced to identify a theme: the CI environment is fragile — hardcoded ports, bleeding-edge unpinned dependencies, and missing test guards all independently cause failures that are not caught by the test framework before cascading into suite-level aborts.

4. Priority Ranking

| Priority | Job | Fix Effort | Blast Radius | Action |
|---|---|---|---|---|
| 🔴 P0 | Job 2 (MiniCPM / transformers 5.3.0) | Medium | Critical — kills wait-for-stage-b → abandons 24 jobs | Pin transformers<5.0 or patch MiniCPM model code for v5 compatibility |
| 🟠 P1 | Job 0 (port 11000 conflict) | Low | High — 3 tests blocked, will recur on same runner | Add fuser -k 11000/tcp pre-test cleanup; migrate to dynamic ports |
| 🟡 P2 | build-test (FP8 BMM on CPU) | Trivial | Medium — 19/21 CPU tests never run | Add skipUnless(torch.cuda.is_available()) to FP8 test path |
| P3 | wait-for-stage-b | None | N/A — derivative | Automatically resolves when Job 2 is fixed |

5. Overall Recommendation

None of these failures are caused by PR #20815's code changes (fix_amd_ci_multimodaltest) — they are three independent pre-existing CI environment issues. The highest-impact fix is pinning transformers to a stable release (e.g., transformers>=4.45,<5.0) since the MiniCPM crash in Job 2 triggers the fail-fast gate and abandons 24 downstream jobs, making it the single largest contributor to CI instability. In parallel, add a fuser -k port cleanup step before EAGLE tests to resolve the persistent port conflict on Job 0, and add a one-line skipTest guard in test_bmm.py to unblock the CPU suite. Once these three targeted fixes are applied, the PR should be re-run and is expected to pass cleanly. No changes to the PR's own code are needed.


Automated CI analysis by amd-bot — progressive step analysis

@HaiShaw HaiShaw merged commit 9e629d3 into main Mar 19, 2026
93 of 106 checks passed
@HaiShaw HaiShaw deleted the fix_amd_ci_multimodaltest branch March 19, 2026 18:50
Wangzheee pushed a commit to Wangzheee/sglang that referenced this pull request Mar 21, 2026
0-693 pushed a commit to 0-693/sglang that referenced this pull request Mar 25, 2026

Labels

amd lora Multi-modal multi-modal language model run-ci

3 participants