
[AMD] DSv4 nightly hotfix + schedule-aware --continue-on-error in AMD CI#24825

Draft
bingxche wants to merge 5 commits into main from
cursor/dsv4-amd-nightly-hotfix-1358


@bingxche bingxche commented May 9, 2026

Motivation

Three issues hitting AMD CI since 2026-05-08, all fixed here.

  1. nightly-8-gpu-mi35x-deepseek-v4-{flash,pro}-rocm720 has been red since 2026-05-08. PR #23882 (Deepseek V4) renamed compressed → dsv4 (attention backend) and SGLANG_REASONING_EFFORT → SGLANG_DSV4_REASONING_EFFORT (env var) on main, including in the four AMD test files. The DSv4 image (publish_dsv4 in release-docker-amd-rocm720-nightly.yml builds from ref: amd/deepseek_v4) hasn't picked up that rename or its alias logic, so --attention-backend dsv4 is rejected by the image's argparse (exit 2, the visible failure) and SGLANG_DSV4_REASONING_EFFORT is silently ignored at runtime (the invisible failure: accuracy would skew without crashing).
  2. AMD nightlies fail-fast on cron. run_suite.py's --continue-on-error was passed via ${{ inputs.continue_on_error && '--continue-on-error' || '' }}, which only fires on workflow_dispatch/workflow_call. On schedule, inputs is null → empty string → fail-fast (run_unittest_files breaks at the first failed file, python/sglang/test/ci/ci_utils.py:260). In R81, fp4 failed first and fp8 was skipped, so Test Summary: 0/2 passed did not reflect what actually ran.
  3. Same fail-fast-on-cron bug in pr-test-amd-rocm720.yml. That workflow has no pull_request trigger and uses the same conditional. The other AMD-touching PR test workflows (pr-test.yml, pr-test-amd.yml) already gate via a centralized set-continue-on-error step keyed on github.event_name == 'schedule', so they're unaffected.
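
The fail-fast behavior described in item 2 can be illustrated with a minimal sketch. This is not the real run_unittest_files from ci_utils.py: the function name run_files, its parameters, the returned tuple, and the file names are all assumptions made for illustration.

```python
from typing import Callable, List, Tuple


def run_files(
    files: List[str],
    run_one: Callable[[str], bool],
    continue_on_error: bool = False,
) -> Tuple[int, int, List[str]]:
    """Hypothetical suite runner: returns (passed, total, never_ran)."""
    passed = 0
    never_ran: List[str] = []
    for i, name in enumerate(files):
        if run_one(name):
            passed += 1
        elif not continue_on_error:
            # Fail-fast: stop at the first failure, the rest never run.
            never_ran = files[i + 1:]
            break
    return passed, len(files), never_ran


# Hypothetical two-file suite in which only the fp4 file fails.
suite = ["test_fp4.py", "test_fp8.py"]
fp4_fails = lambda name: name != "test_fp4.py"

fail_fast = run_files(suite, fp4_fails)
full_run = run_files(suite, fp4_fails, continue_on_error=True)
print(fail_fast)  # (0, 2, ['test_fp8.py'])
print(full_run)   # (1, 2, [])
```

In the fail-fast case the summary reads 0/2 even though only one file ran, which is the reporting gap the motivation describes.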

Modifications

Test files (test/registered/amd/)

Revert PR #23882's two-string rename in the four DSv4 test files (8 lines, 4 files):

  • --attention-backend dsv4 → --attention-backend compressed
  • "SGLANG_DSV4_REASONING_EFFORT": "max" → "SGLANG_REASONING_EFFORT": "max"

Both old strings still work on main (deprecation aliases), so this is safe regardless of which sglang the test runs against. Temporary — drop the revert once amd/deepseek_v4 rebases past #23882.

Also rename the Flash variants to match the Pro / NV-B200 / NV-H200 naming convention (suite registration is in-file, no run_suite.py change needed):

  • test_deepseek_v4_fp4.py → test_deepseek_v4_flash_fp4.py
  • test_deepseek_v4_fp8.py → test_deepseek_v4_flash_fp8.py

CI workflows (.github/workflows/)

Replace the buggy conditional with a schedule-aware one in every run_suite.py call across nightly-test-amd.yml, nightly-test-amd-rocm720.yml, and pr-test-amd-rocm720.yml:

${{ (github.event_name == 'schedule' || inputs.continue_on_error) && '--continue-on-error' || '' }}

| Trigger | inputs.continue_on_error | Flag passed? |
|---|---|---|
| schedule (cron) | n/a (null) | yes (auto) |
| workflow_dispatch (default true) | true | yes |
| workflow_dispatch operator override | false | no (opt-in fail-fast for debugging) |
| workflow_call (default true) | true | yes |
| push (only fires on version.py for nightlies) | n/a | no |
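
The trigger behavior can be checked against a small Python model of the two expressions. This is a sketch, not the workflow evaluator: it assumes Python's short-circuiting and/or behave like the expression's && and || for these values, and it models a missing inputs.continue_on_error as None. The function names old_expr and new_expr are made up for illustration.

```python
from typing import Optional

FLAG = "--continue-on-error"


def old_expr(event_name: str, continue_on_error: Optional[bool]) -> str:
    # Models: inputs.continue_on_error && '--continue-on-error' || ''
    return continue_on_error and FLAG or ""


def new_expr(event_name: str, continue_on_error: Optional[bool]) -> str:
    # Models: (github.event_name == 'schedule' || inputs.continue_on_error)
    #         && '--continue-on-error' || ''
    return (event_name == "schedule" or continue_on_error) and FLAG or ""


# (trigger, modeled input value); None stands for "no such input".
cases = [
    ("schedule", None),
    ("workflow_dispatch", True),
    ("workflow_dispatch", False),
    ("workflow_call", True),
    ("push", None),
]
for event, value in cases:
    print(f"{event:18} old={old_expr(event, value)!r:22} new={new_expr(event, value)!r}")
```

Under these assumptions the two expressions differ only on schedule, where the old form yields an empty string and the new form yields the flag.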

Also brings the three previously-grandfathered hardcoded suites (nightly-amd-4-gpu, nightly-amd-accuracy-8-gpu-{,mi35x-}qwen35) under the same policy. The pytest test_zimage_turbo.py lines (|| true flavor) are intentionally left alone: that construct swallows the exit code, so enabling it unconditionally would mark cron failures green.

Branch / commits

Branch: cursor/dsv4-amd-nightly-hotfix-1358

Commits 2–3 were sequenced as "hardcode → improve to schedule-aware" during review iteration; the net diff vs main only reflects the final form. Squash-merge will collapse them automatically.

Validation

Manually dispatched run 25598883766 on this branch with job_filter=nightly-8-gpu-mi35x-deepseek-v4-{flash,pro}-rocm720. Expecting setUpClass to launch sglang serve ... --attention-backend compressed ... (not dsv4) and Test Summary: X/2 passed to come from both fp8 and fp4 actually running, regardless of one failing.

Checklist


cursoragent and others added 4 commits May 9, 2026 09:55
PR #23882 (Deepseek V4) renamed both the attention-backend value
('compressed' -> 'dsv4') and the env var name ('SGLANG_REASONING_EFFORT'
-> 'SGLANG_DSV4_REASONING_EFFORT') in main, with deprecated aliases for
both. The renames also touched the four AMD-registered DSv4 test files.

The DSv4 image used by 'nightly-8-gpu-mi35x-deepseek-v4-{flash,pro}-rocm720'
is built from the amd/deepseek_v4 fork branch (release-docker-amd-rocm720-nightly.yml's publish_dsv4 job, daily 12:00 UTC, 'ref: amd/deepseek_v4'),
which has not picked up the rename or the alias. So:

  - server_args.ATTENTION_BACKEND_CHOICES on the image only knows
    'compressed', and rejects '--attention-backend dsv4' with argparse
    exit 2 (this is R81, first observed in the 2026-05-08 17:54 UTC
    schedule run after #23882 merged 2026-05-07 18:32 PDT).
  - environ.py on the image only reads SGLANG_REASONING_EFFORT and
    silently ignores SGLANG_DSV4_REASONING_EFFORT (no alias / no
    auto-migration), so reasoning_effort would silently fall back to
    default and skew accuracy results even if argparse passed.
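
Both failure modes can be reproduced in isolation with the standard library. The sketch below is not the image's server_args or environ.py: the parser, the choices list, the read_effort helper, and its "<default>" placeholder are stand-ins assumed for illustration. Only the argparse behavior (an invalid choice exits with status 2) is standard.

```python
import argparse
from typing import Dict, List


def exit_code(choices: List[str], argv: List[str]) -> int:
    """Return the exit status argparse would produce for argv."""
    parser = argparse.ArgumentParser(prog="demo")
    parser.add_argument("--attention-backend", choices=choices)
    try:
        parser.parse_args(argv)
    except SystemExit as exc:  # argparse exits with status 2 on a bad choice
        return exc.code
    return 0


def read_effort(env: Dict[str, str]) -> str:
    # Hypothetical reader that only knows the pre-rename variable name.
    return env.get("SGLANG_REASONING_EFFORT", "<default>")


old_image_choices = ["compressed"]  # assumed: the image predates the rename
visible = exit_code(old_image_choices, ["--attention-backend", "dsv4"])
accepted = exit_code(old_image_choices, ["--attention-backend", "compressed"])
invisible = read_effort({"SGLANG_DSV4_REASONING_EFFORT": "max"})
print(visible, accepted, invisible)
```

The first case fails loudly with status 2, while the second kind of mismatch returns the fallback value without any error, which is why it would not show up as a crash.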

Until amd/deepseek_v4 is rebased onto a main commit that includes
#23882's alias logic and a fresh DSv4 image is published, revert these
four test files to use the pre-rename strings. Both names are still
accepted on main: 'compressed' is auto-migrated to 'dsv4' by
ServerArgs.__post_init__ (deprecation warning only) and
SGLANG_REASONING_EFFORT is copied to SGLANG_DSV4_REASONING_EFFORT by
_print_deprecated_env, so this change is safe on both the fork image
and main.
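
A rough sketch of what such alias handling could look like follows. It is not the code from main: DemoArgs, migrate_env, the warning text, and the rule that an already-set new variable wins are all assumptions made for illustration.

```python
import warnings
from dataclasses import dataclass
from typing import Dict


@dataclass
class DemoArgs:
    """Hypothetical stand-in for a server-args dataclass."""

    attention_backend: str = "dsv4"

    def __post_init__(self) -> None:
        # Assumed behavior: accept the old value, warn, and rewrite it.
        if self.attention_backend == "compressed":
            warnings.warn(
                "'compressed' is deprecated; use 'dsv4'", DeprecationWarning
            )
            self.attention_backend = "dsv4"


def migrate_env(
    env: Dict[str, str],
    old: str = "SGLANG_REASONING_EFFORT",
    new: str = "SGLANG_DSV4_REASONING_EFFORT",
) -> Dict[str, str]:
    # Assumed behavior: copy the old variable only if the new one is unset.
    if old in env and new not in env:
        env[new] = env[old]
    return env


with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    args = DemoArgs(attention_backend="compressed")

env = migrate_env({"SGLANG_REASONING_EFFORT": "max"})
print(args.attention_backend, len(caught), env["SGLANG_DSV4_REASONING_EFFORT"])
```

With aliases of this shape in place, the pre-rename strings resolve to the new names, while a build without them only understands the old ones.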

Co-authored-by: Bingxu Chen <Bingxu.Chen@amd.com>
… nightlies

The conditional pattern '${{ inputs.continue_on_error && '--continue-on-error' || '' }}'
only fires when an 'inputs.continue_on_error' is set, which only happens on
workflow_dispatch / workflow_call (where the input default is 'true'). On
'schedule' (cron) and 'push' triggers there is no such input, so the expression
evaluates to an empty string and run_suite.py runs in its default fail-fast mode
('break' on first failed file in run_unittest_files at python/sglang/test/ci/ci_utils.py:260).

Net effect: when the nightly cron at '30 17 * * *' (rocm720) / '0 18 * * *' (amd)
hit a multi-file suite (e.g. nightly-amd-8-gpu-mi35x-deepseek-v4-flash, which
contains both fp8 and fp4), one file failing caused the rest of the suite to be
skipped, so we lost visibility into independent failures and the 'X/Y passed'
summary undercounted what actually ran.

Hardcode '--continue-on-error' on every run_suite.py invocation in the two AMD
nightly workflows. This matches what nightly-test-nvidia.yml, nightly-test-npu.yml,
weekly-test-nvidia.yml, and a handful of AMD jobs (nightly-amd-4-gpu, qwen35
accuracy) already do. The job is still marked failed when any file fails because
run_suite.py exits non-zero in that case; we just don't short-circuit the
remaining files.

Also fix two run_suite.py invocations that were missing the flag entirely
(qwen35 accuracy on amd.yml lines 658 and 1284).

Left alone: the pytest invocation for test_zimage_turbo.py at line 788/786 uses a
different '|| true' construct that swallows the exit code — hardcoding that would
mark the step green on real failures, the opposite of what we want.
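
The effect of '|| true' on the exit status can be seen with a short check. This sketch assumes a POSIX shell and uses the standard false utility as a stand-in for a failing test command; it does not run the actual pytest invocation.

```python
import subprocess

# 'false' always fails; it stands in for a failing test command.
plain = subprocess.run("false", shell=True).returncode
swallowed = subprocess.run("false || true", shell=True).returncode

print(plain, swallowed)
```

The second status is 0 even though the command failed, so a step built this way reports success regardless of the test outcome.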

Co-authored-by: Bingxu Chen <Bingxu.Chen@amd.com>
…user-controlled

The previous commit hardcoded --continue-on-error on every run_suite.py
invocation, which is too aggressive: it removes the operator's ability to
opt into fail-fast on workflow_dispatch / workflow_call (where the
'continue_on_error' input is intentionally exposed with default=true so it
can be flipped to false for targeted debugging).

Switch to a unified conditional that:

  - On 'schedule' (cron nightly): always passes --continue-on-error so we
    get full visibility into independent failures across the suite's test
    files. This is the original motivation for this change (R81 dropped
    fp8 results when fp4 failed first).
  - On 'workflow_dispatch' / 'workflow_call': respects the
    'continue_on_error' input (default true, can be set false to opt into
    fail-fast for debugging).
  - On 'push' (only fires on python/sglang/version.py changes): falls
    through to fail-fast, matching pre-existing behavior on that trigger.

Expression used:

  ${{ (github.event_name == 'schedule' || inputs.continue_on_error) && '--continue-on-error' || '' }}

This also brings the previously-grandfathered hardcoded suites
(nightly-amd-4-gpu, nightly-amd-accuracy-8-gpu-{,mi35x-}qwen35) under the
same policy — they used to ignore the input entirely and always pass
--continue-on-error, which violated the same operator-control principle.

Co-authored-by: Bingxu Chen <Bingxu.Chen@amd.com>
…-rocm720

pr-test-amd-rocm720.yml has the same fail-fast-on-cron bug as the AMD
nightlies fixed in the previous commit: every run_suite.py invocation uses
'${{ inputs.continue_on_error && '--continue-on-error' || '' }}' which only
fires on workflow_dispatch / workflow_call (where inputs.continue_on_error
defaults to true). On 'schedule' the inputs context is null and the flag is
silently dropped, so the cron run runs in fail-fast mode.

Apply the same expression as the nightlies on all 10 run_suite.py call sites
(9 single-line, 1 multi-line), so:

  - schedule (cron) -> auto-on --continue-on-error
  - workflow_dispatch / workflow_call (default true) -> --continue-on-error
  - workflow_dispatch with operator override false -> fail-fast

This file has no 'pull_request' trigger, so PR fail-fast semantics are
unaffected. The two other AMD-related PR test workflows (pr-test-amd.yml
and pr-test.yml) already use a centralized 'set-continue-on-error' step
that derives from github.event_name == 'schedule', so they are correct
without changes.

Co-authored-by: Bingxu Chen <Bingxu.Chen@amd.com>


Match the naming convention already used by the Pro variant
(test_deepseek_v4_pro_{fp4,fp8}.py) and the NV B200/H200 Flash variants
(test_deepseek_v4_flash_fp4_{b200,h200}.py).

  test_deepseek_v4_fp4.py -> test_deepseek_v4_flash_fp4.py
  test_deepseek_v4_fp8.py -> test_deepseek_v4_flash_fp8.py

The suite registration ('nightly-amd-8-gpu-mi35x-deepseek-v4-flash') is
declared inside each file via register_amd_ci(...), so the suite name and
the workflow's --suite argument do not change. No edits to run_suite.py
or to .github/workflows/ are needed.
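
Why a file rename leaves the suite untouched can be sketched with a toy registry. The real register_amd_ci signature is not shown in this PR, so collect_suites and its input format (file name mapped to the suite declared inside that file) are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List


def collect_suites(declared: Dict[str, str]) -> Dict[str, List[str]]:
    """Group files by the suite each one declares, ignoring file names."""
    suites: Dict[str, List[str]] = defaultdict(list)
    for file_name, suite in sorted(declared.items()):
        suites[suite].append(file_name)
    return dict(suites)


SUITE = "nightly-amd-8-gpu-mi35x-deepseek-v4-flash"

before = collect_suites(
    {"test_deepseek_v4_fp4.py": SUITE, "test_deepseek_v4_fp8.py": SUITE}
)
after = collect_suites(
    {"test_deepseek_v4_flash_fp4.py": SUITE, "test_deepseek_v4_flash_fp8.py": SUITE}
)
print(list(before) == list(after))  # True: same suite name before and after
```

Because the suite key comes from the declaration rather than the path, the value passed to --suite selects the same two tests after the rename.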

Co-authored-by: Bingxu Chen <Bingxu.Chen@amd.com>