ci: extract cuda stage actions + runner_config mapping #25138

Merged
hnyls2002 merged 22 commits into main from lsyin/ci-extract-stage-actions on May 14, 2026
hnyls2002 (Collaborator) commented on May 13, 2026

Stacked on top of #25197 (already merged). Pulls every CUDA stage job in pr-test.yml into three composite actions and routes install-script selection through a single runner_config lookup key.

Composite actions

  • .github/actions/setup-cuda-test-stage: checkout, stage-health / maintenance gates, sgl-kernel wheel download, install. The install script is now looked up by runner_config (no longer a caller input).
  • .github/actions/run-test-cuda-suite: run_suite.py invocation with suite name, partition flags, optional --timeout-per-file, and a continue-on-error flag.
  • .github/actions/teardown-cuda-test-stage: coredump upload + venv cleanup.
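
As a rough illustration of what run-test-cuda-suite has to assemble, here is a minimal Python sketch that turns the action's inputs into a run_suite.py command line. Only --timeout-per-file is named in this description; the flag names --suite, --auto-partition-id, --auto-partition-size and --continue-on-error, the helper name build_run_suite_cmd, and the suite name in the usage line are assumptions made for illustration and may not match the real action.

```python
import shlex
from typing import List, Optional


def build_run_suite_cmd(
    suite_name: str,
    partition_id: Optional[int] = None,
    partition_size: Optional[int] = None,
    timeout_per_file: Optional[int] = None,
    continue_on_error: bool = False,
) -> List[str]:
    """Assemble an argv list for run_suite.py (flag names are assumed)."""
    # Partition flags only make sense as a pair.
    if (partition_id is None) != (partition_size is None):
        raise ValueError("partition_id and partition_size must be set together")

    cmd = ["python3", "run_suite.py", "--suite", suite_name]
    if partition_id is not None:
        cmd += [
            "--auto-partition-id", str(partition_id),
            "--auto-partition-size", str(partition_size),
        ]
    # Optional per-file timeout; omitted entirely when not requested.
    if timeout_per_file is not None:
        cmd += ["--timeout-per-file", str(timeout_per_file)]
    if continue_on_error:
        cmd.append("--continue-on-error")
    return cmd


if __name__ == "__main__":
    # "example-suite" is a placeholder suite name.
    print(shlex.join(build_run_suite_cmd("example-suite", 0, 2, timeout_per_file=1800)))
```

Building an argv list rather than a shell string keeps the optional flags easy to omit and avoids quoting problems when inputs are empty.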

Mapping

scripts/ci/runner_configs.py is the single source of truth for runner_config → install_script. The composite action shells out to it at runtime. Sanity check in CI: every runner_config referenced by a workflow job must exist in the map.
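
The lookup and the CI sanity check could look roughly like the sketch below. It is not the contents of scripts/ci/runner_configs.py: the three keys are runner configs named in this description, but the install-script paths are placeholders, and the function names and the regex-based scan of workflow text are assumptions made for illustration.

```python
import re
from typing import Dict, Iterable, List

# Keys come from this PR description; the script paths are placeholders.
RUNNER_CONFIG_TO_INSTALL_SCRIPT: Dict[str, str] = {
    "8-gpu-h200": "scripts/ci/install_placeholder_default.sh",
    "deepep-4-gpu-h100": "scripts/ci/install_placeholder_deepep.sh",
    "deepep-8-gpu-h200": "scripts/ci/install_placeholder_deepep.sh",
}

# Matches literal `runner_config: <name>` lines; values written as
# ${{ ... }} expressions are deliberately not matched.
_RUNNER_CONFIG_RE = re.compile(
    r"^\s*runner_config:\s*['\"]?([\w.-]+)['\"]?\s*$", re.MULTILINE
)


def install_script_for(runner_config: str) -> str:
    """Return the install script for a runner_config, failing loudly if unknown."""
    try:
        return RUNNER_CONFIG_TO_INSTALL_SCRIPT[runner_config]
    except KeyError:
        known = ", ".join(sorted(RUNNER_CONFIG_TO_INSTALL_SCRIPT))
        raise KeyError(f"unknown runner_config {runner_config!r}; known: {known}")


def referenced_runner_configs(workflow_text: str) -> List[str]:
    """Collect every literal runner_config value found in a workflow file."""
    return _RUNNER_CONFIG_RE.findall(workflow_text)


def missing_runner_configs(workflow_texts: Iterable[str]) -> List[str]:
    """Sanity check: referenced runner_configs that are absent from the map."""
    referenced = {
        name for text in workflow_texts for name in referenced_runner_configs(text)
    }
    return sorted(referenced - set(RUNNER_CONFIG_TO_INSTALL_SCRIPT))
```

A CI step would read the workflow files, call missing_runner_configs on their contents, and fail the job if the returned list is non-empty.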

Caller stub now ~20 lines per stage

stage-c-test-deepep-4-gpu-h100:
  needs: [...]
  if: |
    ...
  runs-on: 4-gpu-h100
  strategy: { ... matrix from partitions JSON ... }
  steps:
    - uses: ./.github/actions/setup-cuda-test-stage
      with:
        runner_config: deepep-4-gpu-h100
        sgl_kernel: ${{ needs.check-changes.outputs.sgl_kernel }}
        pr_head_sha: ${{ inputs.pr_head_sha || '' }}
        git_ref: ${{ inputs.git_ref || '' }}
    - name: Warmup ...   # stage-specific, kept inline
    - uses: ./.github/actions/run-test-cuda-suite
      with: { suite_name, partition_id, partition_size, continue_on_error_flag }
    - uses: ./.github/actions/teardown-cuda-test-stage
      with:
        artifact_suffix: ${{ matrix.partition }}

Warmup commands (8-gpu-h200, deepep-4-gpu-h100, deepep-8-gpu-h200) stay inline in the caller since each has different model URLs. Can be folded into the mapping later if it earns its keep.

Scope

CUDA stage jobs only (13 jobs). stage-a-test-cpu keeps its own bespoke setup (uv pip + protoc + rust cache) since the install path is fundamentally different.

AMD / NPU per-commit + nightly suites unchanged — they still use the legacy register_*_ci(suite=...) path and pr-test-amd.yml / pr-test-npu.yml have their own dispatch.

Equivalence verification (all 14 CUDA stage jobs)

A Python script renders each stage's effective job spec from both forms (the inline form on main, and the PR-C caller stub plus the _pr-test-stage.yml reusable workflow) and compares them after GHA expression simplification (literal ==, ternary && ||, format(), && identity) and cosmetic normalization. All 14 stages PASS semantic equivalence; 8 KNOWN intentional changes are flagged separately:

  • upload-cuda-coredumps artifact-suffix: ${{ matrix.partition }} is now applied to all stages (some stages were missing it on main)
  • Run FA4 jit_kernel tests (SM100+) renamed to generic Run extra pytest, gated by new extra_pytest_path input (same script body)

https://gist.github.com/hnyls2002/4cbb2fe077b9a6db710b5e5eeb025725
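
For readers who do not open the gist, the comparison step can be pictured with the small sketch below. It is not the script from the gist: it covers only cosmetic normalization and the split between known and unexpected differences, it does not implement any GHA expression simplification, and all function names and the path format (for example steps[0].name) are assumptions made for illustration.

```python
import re
from typing import Any, Dict, List, Set

_EXPR_RE = re.compile(r"\$\{\{(.*?)\}\}", re.DOTALL)


def normalize(value: Any) -> Any:
    """Cosmetic normalization only; no expression semantics are evaluated."""
    if isinstance(value, dict):
        return {k: normalize(v) for k, v in value.items()}
    if isinstance(value, list):
        return [normalize(v) for v in value]
    if isinstance(value, str):
        # Canonicalize spacing inside ${{ ... }}.
        value = _EXPR_RE.sub(
            lambda m: "${{ " + " ".join(m.group(1).split()) + " }}", value
        )
        # Drop leading/trailing blank space and per-line trailing whitespace.
        return "\n".join(line.rstrip() for line in value.strip().splitlines())
    return value


def diff_paths(a: Any, b: Any, path: str = "") -> List[str]:
    """Return the paths at which two already-normalized job specs differ."""
    if isinstance(a, dict) and isinstance(b, dict):
        out: List[str] = []
        for key in sorted(set(a) | set(b)):
            sub = f"{path}.{key}" if path else key
            if key not in a or key not in b:
                out.append(sub)
            else:
                out += diff_paths(a[key], b[key], sub)
        return out
    if isinstance(a, list) and isinstance(b, list) and len(a) == len(b):
        out = []
        for i, (x, y) in enumerate(zip(a, b)):
            out += diff_paths(x, y, f"{path}[{i}]")
        return out
    return [] if a == b else [path]


def compare_jobs(old: Any, new: Any, known_changes: Set[str]) -> Dict[str, List[str]]:
    """Split differences into known intentional changes and unexpected ones."""
    diffs = diff_paths(normalize(old), normalize(new))
    return {
        "known": [p for p in diffs if p in known_changes],
        "unexpected": [p for p in diffs if p not in known_changes],
    }
```

Under this scheme a stage passes when the "unexpected" list is empty, while renamed steps or newly standardized inputs show up under "known" instead of failing the check.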

gemini-code-assist (Contributor) commented:

Warning

You have reached your daily quota limit. Please wait up to 24 hours and I will start processing your requests again!

hnyls2002 changed the title from "ci: extract per-stage setup/teardown into composite actions" to "ci: extract per-stage setup/run/teardown into composite actions" on May 13, 2026

hnyls2002 (Collaborator, Author) commented:

/tag-and-rerun-ci

hnyls2002 force-pushed the lsyin/ci-extract-stage-actions branch from 1a5e48f to ab0b1e5 on May 14, 2026 00:38
hnyls2002 changed the title from "ci: extract per-stage setup/run/teardown into composite actions" to "ci: extract cuda stage actions + runner_config mapping" on May 14, 2026
hnyls2002 force-pushed the lsyin/ci-extract-stage-actions branch from a3c1d1a to 3ba6cae on May 14, 2026 02:28
hnyls2002 merged commit 85d9c77 into main on May 14, 2026
80 of 101 checks passed
hnyls2002 deleted the lsyin/ci-extract-stage-actions branch on May 14, 2026 04:16