[AMD] DSv4 nightly hotfix + schedule-aware --continue-on-error in AMD CI #24825
Draft
PR #23882 (Deepseek V4) renamed both the attention-backend value
('compressed' -> 'dsv4') and the env var name ('SGLANG_REASONING_EFFORT' ->
'SGLANG_DSV4_REASONING_EFFORT') in main, with deprecated aliases for both.
The renames also touched the four AMD-registered DSv4 test files.
The DSv4 image used by 'nightly-8-gpu-mi35x-deepseek-v4-{flash,pro}-rocm720'
is built from the amd/deepseek_v4 fork branch
(release-docker-amd-rocm720-nightly.yml's publish_dsv4 job, daily 12:00 UTC,
'ref: amd/deepseek_v4'), which has not picked up the rename or the alias. So:
- server_args.ATTENTION_BACKEND_CHOICES on the image only knows 'compressed',
and rejects '--attention-backend dsv4' with argparse exit 2 (this is R81,
first observed in the 2026-05-08 17:54 UTC schedule run after #23882 merged
2026-05-07 18:32 PDT).
- environ.py on the image only reads SGLANG_REASONING_EFFORT and silently
ignores SGLANG_DSV4_REASONING_EFFORT (no alias / no auto-migration), so
reasoning_effort would silently fall back to default and skew accuracy
results even if argparse passed.
Until amd/deepseek_v4 is rebased onto a main commit that includes #23882's
alias logic and a fresh DSv4 image is published, revert these four test files
to use the pre-rename strings.
Both names are still accepted on main: 'compressed' is auto-migrated to
'dsv4' by ServerArgs.__post_init__ (deprecation warning only) and
SGLANG_REASONING_EFFORT is copied to SGLANG_DSV4_REASONING_EFFORT by
_print_deprecated_env, so this change is safe on both the fork image and main.
Co-authored-by: Bingxu Chen <Bingxu.Chen@amd.com>
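To illustrate why the pre-rename strings work on both sides, here is a minimal sketch of the alias behavior this commit message describes. The function and table names are hypothetical and are not the actual sglang code in ServerArgs.__post_init__ or _print_deprecated_env; only the old and new strings come from the text above.

```python
import warnings

# Hypothetical tables for illustration; only the string values come from the PR.
DEPRECATED_BACKENDS = {"compressed": "dsv4"}
DEPRECATED_ENV_VARS = {"SGLANG_REASONING_EFFORT": "SGLANG_DSV4_REASONING_EFFORT"}


def migrate_backend(value: str) -> str:
    """Map a deprecated attention-backend value to its new name, with a warning."""
    if value in DEPRECATED_BACKENDS:
        new = DEPRECATED_BACKENDS[value]
        warnings.warn(f"'{value}' is deprecated; use '{new}'", DeprecationWarning)
        return new
    return value


def migrate_env(env: dict) -> dict:
    """Copy deprecated env vars to their new names without overriding new ones."""
    out = dict(env)
    for old, new in DEPRECATED_ENV_VARS.items():
        if old in out and new not in out:
            out[new] = out[old]
    return out
```

A build with this kind of alias layer accepts either spelling. A build without it (the fork image, per the description above) only understands the old one, which is why the old strings are the common denominator.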
… nightlies
The conditional pattern '${{ inputs.continue_on_error && '--continue-on-error' || '' }}'
only fires when an 'inputs.continue_on_error' is set, which only happens on
workflow_dispatch / workflow_call (where the input default is 'true'). On
'schedule' (cron) and 'push' triggers there is no such input, so the expression
evaluates to an empty string and run_suite.py runs in its default fail-fast mode
('break' on first failed file in run_unittest_files at python/sglang/test/ci/ci_utils.py:260).
Net effect: when the nightly cron at '30 17 * * *' (rocm720) / '0 18 * * *' (amd)
hit a multi-file suite (e.g. nightly-amd-8-gpu-mi35x-deepseek-v4-flash, which
contains both fp8 and fp4), one file failing caused the rest of the suite to be
skipped, so we lost visibility into independent failures and the 'X/Y passed'
summary undercounted what actually ran.
Hardcode '--continue-on-error' on every run_suite.py invocation in the two AMD
nightly workflows. This matches what nightly-test-nvidia.yml, nightly-test-npu.yml,
weekly-test-nvidia.yml, and a handful of AMD jobs (nightly-amd-4-gpu, qwen35
accuracy) already do. The job is still marked failed when any file fails because
run_suite.py exits non-zero in that case; we just don't short-circuit the
remaining files.
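The difference between the two modes can be sketched as follows. This is a simplified model, not the real run_unittest_files; the function name and signature are assumptions made for illustration.

```python
from typing import Callable, List, Tuple


def run_files(
    files: List[str],
    run_one: Callable[[str], bool],
    continue_on_error: bool,
) -> Tuple[List[str], int]:
    """Run each file; return the files that actually ran and an exit code."""
    ran: List[str] = []
    failed = False
    for name in files:
        ran.append(name)
        if not run_one(name):
            failed = True
            if not continue_on_error:
                break  # fail-fast: remaining files are skipped
    # Non-zero exit whenever any file failed, in both modes.
    return ran, (1 if failed else 0)
```

With files ["fp4", "fp8"] and a runner where fp4 fails, fail-fast runs only fp4, while continue-on-error runs both. The exit code is 1 in either case, so the job is still marked failed.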
Also fix two run_suite.py invocations that were missing the flag entirely
(qwen35 accuracy on amd.yml lines 658 and 1284).
Left alone: the pytest invocation for test_zimage_turbo.py at line 788/786 uses a
different '|| true' construct that swallows the exit code — hardcoding that would
mark the step green on real failures, the opposite of what we want.
Co-authored-by: Bingxu Chen <Bingxu.Chen@amd.com>
…user-controlled
The previous commit hardcoded --continue-on-error on every run_suite.py
invocation, which is too aggressive: it removes the operator's ability to
opt into fail-fast on workflow_dispatch / workflow_call (where the
'continue_on_error' input is intentionally exposed with default=true so it
can be flipped to false for targeted debugging).
Switch to a unified conditional that:
- On 'schedule' (cron nightly): always passes --continue-on-error so we
get full visibility into independent failures across the suite's test
files. This is the original motivation for this change (R81 dropped
fp8 results when fp4 failed first).
- On 'workflow_dispatch' / 'workflow_call': respects the
'continue_on_error' input (default true, can be set false to opt into
fail-fast for debugging).
- On 'push' (only fires on python/sglang/version.py changes): falls
through to fail-fast, matching pre-existing behavior on that trigger.
Expression used:
${{ (github.event_name == 'schedule' || inputs.continue_on_error) && '--continue-on-error' || '' }}
This also brings the previously-grandfathered hardcoded suites
(nightly-amd-4-gpu, nightly-amd-accuracy-8-gpu-{,mi35x-}qwen35) under the
same policy — they used to ignore the input entirely and always pass
--continue-on-error, which violated the same operator-control principle.
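The old and new conditionals can be compared with a small Python model. It assumes that inputs.continue_on_error is a boolean on workflow_dispatch / workflow_call and is modeled as None on triggers that define no inputs. The function names are hypothetical.

```python
from typing import Optional

FLAG = "--continue-on-error"


def old_flag(event_name: str, continue_on_error: Optional[bool]) -> str:
    """Model of: inputs.continue_on_error && FLAG || ''"""
    return FLAG if continue_on_error else ""


def new_flag(event_name: str, continue_on_error: Optional[bool]) -> str:
    """Model of: (event_name == 'schedule' || inputs.continue_on_error) && FLAG || ''"""
    return FLAG if (event_name == "schedule" or continue_on_error) else ""


if __name__ == "__main__":
    cases = [
        ("schedule", None),
        ("workflow_dispatch", True),
        ("workflow_dispatch", False),
        ("workflow_call", True),
        ("push", None),
    ]
    for event, value in cases:
        print(event, value, repr(old_flag(event, value)), repr(new_flag(event, value)))
```

The only case where the two differ is schedule: the old form yields an empty string there, the new form yields the flag.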
Co-authored-by: Bingxu Chen <Bingxu.Chen@amd.com>
…-rocm720
pr-test-amd-rocm720.yml has the same fail-fast-on-cron bug as the AMD
nightlies fixed in the previous commit: every run_suite.py invocation uses
'${{ inputs.continue_on_error && '--continue-on-error' || '' }}' which only
fires on workflow_dispatch / workflow_call (where inputs.continue_on_error
defaults to true). On 'schedule' the inputs context is null and the flag is
silently dropped, so the cron run runs in fail-fast mode.
Apply the same expression as the nightlies on all 10 run_suite.py call sites
(9 single-line, 1 multi-line), so:
- schedule (cron) -> auto-on --continue-on-error
- workflow_dispatch / workflow_call (default true) -> --continue-on-error
- workflow_dispatch with operator override false -> fail-fast
This file has no 'pull_request' trigger, so PR fail-fast semantics are
unaffected. The two other AMD-related PR test workflows (pr-test-amd.yml
and pr-test.yml) already use a centralized 'set-continue-on-error' step
that derives from github.event_name == 'schedule', so they are correct
without changes.
Co-authored-by: Bingxu Chen <Bingxu.Chen@amd.com>
Match the naming convention already used by the Pro variant
(test_deepseek_v4_pro_{fp4,fp8}.py) and the NV B200/H200 Flash variants
(test_deepseek_v4_flash_fp4_{b200,h200}.py).
test_deepseek_v4_fp4.py -> test_deepseek_v4_flash_fp4.py
test_deepseek_v4_fp8.py -> test_deepseek_v4_flash_fp8.py
The suite registration ('nightly-amd-8-gpu-mi35x-deepseek-v4-flash') is
declared inside each file via register_amd_ci(...), so the suite name and
the workflow's --suite argument do not change. No edits to run_suite.py
or to .github/workflows/ are needed.
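A rough sketch of why the rename is invisible to the suite name: if membership is discovered from a registration call inside each file, the file name is only a value in the mapping, never a key. The scanner below is hypothetical, and the 'suite=' keyword shape of register_amd_ci is an assumption, not its confirmed signature.

```python
import re
from typing import Dict, List

# Assumed call shape: register_amd_ci(..., suite="<name>", ...)
PATTERN = re.compile(r'''register_amd_ci\([^)]*?suite\s*=\s*["']([^"']+)["']''')


def collect_suites(sources: Dict[str, str]) -> Dict[str, List[str]]:
    """Map each suite name to the files whose source registers it."""
    suites: Dict[str, List[str]] = {}
    for filename, text in sorted(sources.items()):
        for suite in PATTERN.findall(text):
            suites.setdefault(suite, []).append(filename)
    return suites
```

Renaming test_deepseek_v4_fp4.py to test_deepseek_v4_flash_fp4.py changes the file list under 'nightly-amd-8-gpu-mi35x-deepseek-v4-flash' but not the suite key that the workflow passes via --suite.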
Co-authored-by: Bingxu Chen <Bingxu.Chen@amd.com>
Motivation
Three issues hitting AMD CI since 2026-05-08, all fixed here.
1. 'nightly-8-gpu-mi35x-deepseek-v4-{flash,pro}-rocm720' red since 2026-05-08. PR #23882 (Deepseek V4) renamed 'compressed' → 'dsv4' (attention backend) and 'SGLANG_REASONING_EFFORT' → 'SGLANG_DSV4_REASONING_EFFORT' (env var) on main, including in the four AMD test files. The DSv4 image ('publish_dsv4' in release-docker-amd-rocm720-nightly.yml builds from 'ref: amd/deepseek_v4') hasn't picked up that rename or its alias logic, so '--attention-backend dsv4' is rejected by the image's argparse (exit 2, the visible failure) and SGLANG_DSV4_REASONING_EFFORT is silently ignored at runtime (the invisible failure: accuracy would skew without crashing).
2. run_suite.py's --continue-on-error was passed via '${{ inputs.continue_on_error && '--continue-on-error' || '' }}', which only fires on workflow_dispatch / workflow_call. On schedule, inputs is null → empty string → fail-fast (run_unittest_files breaks at the first failed file, python/sglang/test/ci/ci_utils.py:260). With R81, fp4 failed first and fp8 was skipped, so 'Test Summary: 0/2 passed' undercounted what actually ran.
3. pr-test-amd-rocm720.yml. That workflow has no pull_request trigger and uses the same conditional. The other AMD-touching PR test workflows (pr-test.yml, pr-test-amd.yml) already gate via a centralized 'set-continue-on-error' step keyed on github.event_name == 'schedule', so they're unaffected.
Modifications
Test files (test/registered/amd/)
Revert PR #23882's two-string rename in the four DSv4 test files (8 lines, 4 files):
- '--attention-backend dsv4' → '--attention-backend compressed'
- "SGLANG_DSV4_REASONING_EFFORT": "max" → "SGLANG_REASONING_EFFORT": "max"
Both old strings still work on main (deprecation aliases), so this is safe regardless of which sglang the test runs against. This is temporary: drop the revert once amd/deepseek_v4 rebases past #23882.
Also rename the Flash variants to match the Pro / NV-B200 / NV-H200 naming convention (suite registration is in-file, no run_suite.py change needed):
- test_deepseek_v4_fp4.py → test_deepseek_v4_flash_fp4.py
- test_deepseek_v4_fp8.py → test_deepseek_v4_flash_fp8.py
CI workflows (.github/workflows/)
Replace the buggy conditional with a schedule-aware one in every run_suite.py call across nightly-test-amd.yml, nightly-test-amd-rocm720.yml, and pr-test-amd-rocm720.yml:
${{ (github.event_name == 'schedule' || inputs.continue_on_error) && '--continue-on-error' || '' }}
| Trigger | inputs.continue_on_error | Result |
| --- | --- | --- |
| schedule (cron) | (not set) | --continue-on-error |
| workflow_dispatch (default true) | true | --continue-on-error |
| workflow_dispatch operator override | false | fail-fast |
| workflow_call (default true) | true | --continue-on-error |
| push (only fires on version.py for nightlies) | (not set) | fail-fast |
Also brings the three previously-grandfathered hardcoded suites (nightly-amd-4-gpu, nightly-amd-accuracy-8-gpu-{,mi35x-}qwen35) under the same policy. The pytest test_zimage_turbo.py lines ('|| true' flavor) are intentionally left alone: that construct swallows the exit code, so hardcoding it would mark cron failures green.
Branch / commits
Branch: cursor/dsv4-amd-nightly-hotfix-1358
Commits 2–3 were sequenced as "hardcode → improve to schedule-aware" during review iteration; the net diff vs main only reflects the final form. Squash-merge will collapse them automatically.
Validation
Manually dispatched run 25598883766 on this branch with job_filter=nightly-8-gpu-mi35x-deepseek-v4-{flash,pro}-rocm720. Expecting setUpClass to launch 'sglang serve ... --attention-backend compressed ...' (not 'dsv4') and 'Test Summary: X/2 passed' to come from both fp8 and fp4 actually running, regardless of one failing.
Checklist