[AMD] DSv4 nightly hotfix + schedule-aware --continue-on-error in AMD CI by bingxche · Pull Request #24825 · sgl-project/sglang

bingxche · 2026-05-09T10:29:13Z

Motivation

Three issues hitting AMD CI since 2026-05-08, all fixed here.

nightly-8-gpu-mi35x-deepseek-v4-{flash,pro}-rocm720 red since 2026-05-08. PR Deepseek V4 #23882 renamed compressed → dsv4 (attention backend) and SGLANG_REASONING_EFFORT → SGLANG_DSV4_REASONING_EFFORT (env var) on main, including in the four AMD test files. The DSv4 image (publish_dsv4 in release-docker-amd-rocm720-nightly.yml builds from ref: amd/deepseek_v4) hasn't picked up that rename or its alias logic, so --attention-backend dsv4 is rejected by the image's argparse (exit 2, the visible failure) and SGLANG_DSV4_REASONING_EFFORT is silently ignored at runtime (the invisible one — accuracy would skew without crashing).
AMD nightlies fail-fast on cron. run_suite.py's --continue-on-error was passed via ${{ inputs.continue_on_error && '--continue-on-error' || '' }}, which only fires on workflow_dispatch/workflow_call. On schedule, inputs is null → empty string → fail-fast (run_unittest_files breaks at the first failed file, python/sglang/test/ci/ci_utils.py:260). With R81, fp4 failed first and fp8 was skipped — Test Summary: 0/2 passed undercounted what actually ran.
Same fail-fast-on-cron bug in pr-test-amd-rocm720.yml. That workflow has no pull_request trigger and uses the same conditional. The other AMD-touching PR test workflows (pr-test.yml, pr-test-amd.yml) already gate via a centralized set-continue-on-error step keyed on github.event_name == 'schedule', so they're unaffected.
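The fail-fast behaviour described above can be illustrated with a minimal sketch. This is not the real `run_unittest_files` code: `run_files`, the file names, and the pass/fail outcomes are hypothetical stand-ins, chosen only to show why a first-file failure hides the result of the second file.

```python
from typing import Callable, List, Tuple


def run_files(
    files: List[str],
    run_one: Callable[[str], bool],
    continue_on_error: bool,
) -> Tuple[int, int]:
    """Run each file in order and return (passed, attempted).

    Hypothetical model of a suite runner: without continue_on_error the
    loop breaks at the first failing file, so later files never run.
    """
    passed = 0
    attempted = 0
    for name in files:
        attempted += 1
        if run_one(name):
            passed += 1
        elif not continue_on_error:
            break  # fail-fast: remaining files are skipped
    return passed, attempted


# Assumed outcomes for illustration only: fp4 fails, fp8 would pass.
outcomes = {"fp4": False, "fp8": True}
suite = ["fp4", "fp8"]

fail_fast = run_files(suite, outcomes.__getitem__, continue_on_error=False)
full_run = run_files(suite, outcomes.__getitem__, continue_on_error=True)

print("fail-fast (passed, attempted):", fail_fast)
print("continue-on-error (passed, attempted):", full_run)
```

In the fail-fast case only one file is attempted, so nothing is learned about fp8; with continue-on-error both files run and the summary reflects independent results.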

Modifications

Test files (`test/registered/amd/`)

Revert PR #23882's two-string rename in the four DSv4 test files (8 lines, 4 files):

--attention-backend dsv4 → --attention-backend compressed
"SGLANG_DSV4_REASONING_EFFORT": "max" → "SGLANG_REASONING_EFFORT": "max"

Both old strings still work on main (deprecation aliases), so this is safe regardless of which sglang the test runs against. Temporary — drop the revert once amd/deepseek_v4 rebases past #23882.
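The reason the old strings are safe on main is the deprecation-alias pattern. A minimal sketch of that pattern follows, assuming a simple "map old name to new name and warn" design; `migrate_backend` and `migrate_env` are hypothetical helpers, not the actual `ServerArgs.__post_init__` or `_print_deprecated_env` implementations.

```python
import warnings
from typing import Dict

OLD_BACKEND, NEW_BACKEND = "compressed", "dsv4"
OLD_ENV, NEW_ENV = "SGLANG_REASONING_EFFORT", "SGLANG_DSV4_REASONING_EFFORT"


def migrate_backend(value: str) -> str:
    """Map the deprecated backend name to the new one, warning once per call."""
    if value == OLD_BACKEND:
        warnings.warn(
            f"'{OLD_BACKEND}' is deprecated, use '{NEW_BACKEND}'",
            DeprecationWarning,
            stacklevel=2,
        )
        return NEW_BACKEND
    return value


def migrate_env(env: Dict[str, str]) -> Dict[str, str]:
    """Copy the deprecated env var to the new name unless the new one is set."""
    out = dict(env)
    if OLD_ENV in out and NEW_ENV not in out:
        warnings.warn(
            f"{OLD_ENV} is deprecated, use {NEW_ENV}",
            DeprecationWarning,
            stacklevel=2,
        )
        out[NEW_ENV] = out[OLD_ENV]
    return out


print(migrate_backend("compressed"))
print(migrate_env({"SGLANG_REASONING_EFFORT": "max"}))
```

A build without this alias logic, such as the fork image described above, would see only the names it already knows, which is why the tests must use the old strings until the rebase.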

Also rename the Flash variants to match the Pro / NV-B200 / NV-H200 naming convention (suite registration is in-file, no run_suite.py change needed):

test_deepseek_v4_fp4.py → test_deepseek_v4_flash_fp4.py
test_deepseek_v4_fp8.py → test_deepseek_v4_flash_fp8.py

CI workflows (`.github/workflows/`)

Replace the buggy conditional with a schedule-aware one in every run_suite.py call across nightly-test-amd.yml, nightly-test-amd-rocm720.yml, and pr-test-amd-rocm720.yml:

${{ (github.event_name == 'schedule' || inputs.continue_on_error) && '--continue-on-error' || '' }}

| Trigger | `inputs.continue_on_error` | Flag passed? |
| --- | --- | --- |
| `schedule` (cron) | n/a (null) | yes (auto) |
| `workflow_dispatch` (default `true`) | `true` | yes |
| `workflow_dispatch` operator override | `false` | no (opt-in fail-fast for debugging) |
| `workflow_call` (default `true`) | `true` | yes |
| `push` (only fires on `version.py` for nightlies) | n/a | no |
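The table can be reproduced with a small model of the two expressions. This sketch assumes that Python's `and`/`or` stand in for the workflow expression operators `&&`/`||` (both return one of their operands, and `None`, `False`, and `""` are falsy), and that a missing `inputs.continue_on_error` behaves like `None`. The function names are hypothetical.

```python
from typing import Optional

FLAG = "--continue-on-error"


def old_expr(input_value: Optional[bool]) -> str:
    """Model of: inputs.continue_on_error && '--continue-on-error' || ''"""
    return (input_value and FLAG) or ""


def new_expr(event_name: str, input_value: Optional[bool]) -> str:
    """Model of: (github.event_name == 'schedule' || inputs.continue_on_error)
    && '--continue-on-error' || ''"""
    return ((event_name == "schedule" or input_value) and FLAG) or ""


rows = [
    ("schedule", None),
    ("workflow_dispatch", True),
    ("workflow_dispatch", False),
    ("workflow_call", True),
    ("push", None),
]
for event, value in rows:
    print(f"{event:18} input={value!s:5} old={old_expr(value)!r} new={new_expr(event, value)!r}")
```

The only row where the two expressions differ is `schedule`, which is the cron case this PR targets.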

Also brings the three previously-grandfathered hardcoded suites (nightly-amd-4-gpu, nightly-amd-accuracy-8-gpu-{,mi35x-}qwen35) under the same policy. The pytest test_zimage_turbo.py lines (|| true flavor) are intentionally left alone: that construct swallows the exit code, so hardcoding it would mark cron failures green.
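To see why the `|| true` construct is treated differently, the sketch below runs a deliberately failing command with and without the suffix. It assumes a POSIX shell is available; the `exit 3` command is a placeholder for a failing pytest invocation, not the real step.

```python
import subprocess

# Placeholder for a failing test command (assumption: POSIX sh is available).
failing_command = "exit 3"

# Without the suffix, the step's exit code reflects the failure.
plain = subprocess.run(failing_command, shell=True).returncode

# With '|| true', the shell runs 'true' after the failure and reports success.
swallowed = subprocess.run(f"( {failing_command} ) || true", shell=True).returncode

print("plain exit code:", plain)
print("with '|| true':", swallowed)
```

Unlike `--continue-on-error`, where run_suite.py still exits non-zero when any file fails, the `|| true` form discards the failure signal entirely.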

Branch / commits

Branch: cursor/dsv4-amd-nightly-hotfix-1358

Commits 2–3 were sequenced as "hardcode → improve to schedule-aware" during review iteration; the net diff vs main only reflects the final form. Squash-merge will collapse them automatically.

Validation

Manually dispatched run 25598883766 on this branch with job_filter=nightly-8-gpu-mi35x-deepseek-v4-{flash,pro}-rocm720. Expecting setUpClass to launch sglang serve ... --attention-backend compressed ... (not dsv4) and Test Summary: X/2 passed to come from both fp8 and fp4 actually running, regardless of one failing.

Checklist

Format your code according to the Format code with pre-commit.
Add unit tests according to the Run and add unit tests.
Update documentation according to Write documentations.
Provide accuracy and speed benchmark results according to Test the accuracy and Benchmark the speed.
Follow the SGLang code style guidance.

PR #23882 (Deepseek V4) renamed both the attention-backend value ('compressed' -> 'dsv4') and the env var name ('SGLANG_REASONING_EFFORT' -> 'SGLANG_DSV4_REASONING_EFFORT') in main, with deprecated aliases for both. The renames also touched the four AMD-registered DSv4 test files.

The DSv4 image used by 'nightly-8-gpu-mi35x-deepseek-v4-{flash,pro}-rocm720' is built from the amd/deepseek_v4 fork branch (release-docker-amd-rocm720-nightly.yml's publish_dsv4 job, daily 12:00 UTC, 'ref: amd/deepseek_v4'), which has not picked up the rename or the alias. So:

- server_args.ATTENTION_BACKEND_CHOICES on the image only knows 'compressed', and rejects '--attention-backend dsv4' with argparse exit 2 (this is R81, first observed in the 2026-05-08 17:54 UTC schedule run after #23882 merged 2026-05-07 18:32 PDT).
- environ.py on the image only reads SGLANG_REASONING_EFFORT and silently ignores SGLANG_DSV4_REASONING_EFFORT (no alias / no auto-migration), so reasoning_effort would silently fall back to default and skew accuracy results even if argparse passed.

Until amd/deepseek_v4 is rebased onto a main commit that includes #23882's alias logic and a fresh DSv4 image is published, revert these four test files to use the pre-rename strings. Both names are still accepted on main: 'compressed' is auto-migrated to 'dsv4' by ServerArgs.__post_init__ (deprecation warning only) and SGLANG_REASONING_EFFORT is copied to SGLANG_DSV4_REASONING_EFFORT by _print_deprecated_env, so this change is safe on both the fork image and main.

Co-authored-by: Bingxu Chen <Bingxu.Chen@amd.com>

… nightlies

The conditional pattern '${{ inputs.continue_on_error && '--continue-on-error' || '' }}' only fires when an 'inputs.continue_on_error' is set, which only happens on workflow_dispatch / workflow_call (where the input default is 'true'). On 'schedule' (cron) and 'push' triggers there is no such input, so the expression evaluates to an empty string and run_suite.py runs in its default fail-fast mode ('break' on first failed file in run_unittest_files at python/sglang/test/ci/ci_utils.py:260).

Net effect: when the nightly cron at '30 17 * * *' (rocm720) / '0 18 * * *' (amd) hit a multi-file suite (e.g. nightly-amd-8-gpu-mi35x-deepseek-v4-flash, which contains both fp8 and fp4), one file failing caused the rest of the suite to be skipped, so we lost visibility into independent failures and the 'X/Y passed' summary undercounted what actually ran.

Hardcode '--continue-on-error' on every run_suite.py invocation in the two AMD nightly workflows. This matches what nightly-test-nvidia.yml, nightly-test-npu.yml, weekly-test-nvidia.yml, and a handful of AMD jobs (nightly-amd-4-gpu, qwen35 accuracy) already do. The job is still marked failed when any file fails because run_suite.py exits non-zero in that case; we just don't short-circuit the remaining files.

Also fix two run_suite.py invocations that were missing the flag entirely (qwen35 accuracy on amd.yml lines 658 and 1284).

Left alone: the pytest invocation for test_zimage_turbo.py at line 788/786 uses a different '|| true' construct that swallows the exit code; hardcoding that would mark the step green on real failures, the opposite of what we want.

Co-authored-by: Bingxu Chen <Bingxu.Chen@amd.com>

…user-controlled

The previous commit hardcoded --continue-on-error on every run_suite.py invocation, which is too aggressive: it removes the operator's ability to opt out of fail-fast on workflow_dispatch / workflow_call (where the 'continue_on_error' input is intentionally exposed with default=true so it can be flipped to false for targeted debugging).

Switch to a unified conditional that:

- On 'schedule' (cron nightly): always passes --continue-on-error so we get full visibility into independent failures across the suite's test files. This is the original motivation for this change (R81 dropped fp8 results when fp4 failed first).
- On 'workflow_dispatch' / 'workflow_call': respects the 'continue_on_error' input (default true, can be set false to opt into fail-fast for debugging).
- On 'push' (only fires on python/sglang/version.py changes): falls through to fail-fast, matching pre-existing behavior on that trigger.

Expression used:

${{ (github.event_name == 'schedule' || inputs.continue_on_error) && '--continue-on-error' || '' }}

This also brings the previously-grandfathered hardcoded suites (nightly-amd-4-gpu, nightly-amd-accuracy-8-gpu-{,mi35x-}qwen35) under the same policy: they used to ignore the input entirely and always pass --continue-on-error, which violated the same operator-control principle.

Co-authored-by: Bingxu Chen <Bingxu.Chen@amd.com>

…-rocm720

pr-test-amd-rocm720.yml has the same fail-fast-on-cron bug as the AMD nightlies fixed in the previous commit: every run_suite.py invocation uses '${{ inputs.continue_on_error && '--continue-on-error' || '' }}' which only fires on workflow_dispatch / workflow_call (where inputs.continue_on_error defaults to true). On 'schedule' the inputs context is null and the flag is silently dropped, so the cron run runs in fail-fast mode.

Apply the same expression as the nightlies on all 10 run_suite.py call sites (9 single-line, 1 multi-line), so:

- schedule (cron) -> auto-on --continue-on-error
- workflow_dispatch / workflow_call (default true) -> --continue-on-error
- workflow_dispatch with operator override false -> fail-fast

This file has no 'pull_request' trigger, so PR fail-fast semantics are unaffected. The two other AMD-related PR test workflows (pr-test-amd.yml and pr-test.yml) already use a centralized 'set-continue-on-error' step that derives from github.event_name == 'schedule', so they are correct without changes.

Co-authored-by: Bingxu Chen <Bingxu.Chen@amd.com>

cursor · 2026-05-09T10:29:14Z

Cursor Agent can help with this pull request. Just @cursor in comments and I'll start working on changes in this branch.

gemini-code-assist · 2026-05-09T10:29:17Z

Warning

You have reached your daily quota limit. Please wait up to 24 hours and I will start processing your requests again!

Match the naming convention already used by the Pro variant (test_deepseek_v4_pro_{fp4,fp8}.py) and the NV B200/H200 Flash variants (test_deepseek_v4_flash_fp4_{b200,h200}.py).

test_deepseek_v4_fp4.py -> test_deepseek_v4_flash_fp4.py
test_deepseek_v4_fp8.py -> test_deepseek_v4_flash_fp8.py

The suite registration ('nightly-amd-8-gpu-mi35x-deepseek-v4-flash') is declared inside each file via register_amd_ci(...), so the suite name and the workflow's --suite argument do not change. No edits to run_suite.py or to .github/workflows/ are needed.

Co-authored-by: Bingxu Chen <Bingxu.Chen@amd.com>
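Why a file rename needs no workflow change can be shown with a toy registry. This is a sketch under stated assumptions: `register_ci_file` and `files_for_suite` are hypothetical stand-ins, and the real `register_amd_ci(...)` signature and discovery mechanism are not shown in this PR. The point is only that lookup is keyed by suite name, not by file name.

```python
from collections import defaultdict
from typing import DefaultDict, List

# Hypothetical registry: suite name -> files that declared themselves in it.
_SUITES: DefaultDict[str, List[str]] = defaultdict(list)


def register_ci_file(filename: str, suite: str) -> None:
    """Stand-in for an in-file registration call; records file under suite."""
    _SUITES[suite].append(filename)


def files_for_suite(suite: str) -> List[str]:
    """Stand-in for what a runner would resolve from a --suite argument."""
    return sorted(_SUITES[suite])


SUITE = "nightly-amd-8-gpu-mi35x-deepseek-v4-flash"

# After the rename, the files register under the same suite name as before.
register_ci_file("test_deepseek_v4_flash_fp4.py", SUITE)
register_ci_file("test_deepseek_v4_flash_fp8.py", SUITE)

selected = files_for_suite(SUITE)
print(selected)
```

Because the workflow passes only the suite name, the renamed files are picked up without touching the caller.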

cursoragent and others added 4 commits May 9, 2026 09:55

github-actions Bot added amd deepseek labels May 9, 2026

bingxche wants to merge 5 commits into main from cursor/dsv4-amd-nightly-hotfix-1358
