[CI] Fix test suite names and add suite validation #21937

Merged
ispobock merged 13 commits into main from suite-check
Apr 3, 2026

Collaborator

@ispobock ispobock commented Apr 2, 2026

Motivation

Modifications

Accuracy Tests

Speed Tests and Profiling

Checklist

Review and Merge Process

  1. Ping Merge Oncalls to start the process. See the PR Merge Process.
  2. Get approvals from CODEOWNERS and other reviewers.
  3. Trigger CI tests with comments or contact authorized users to do so.
    • Common commands include /tag-and-rerun-ci, /tag-run-ci-label, /rerun-failed-ci
  4. After green CI and required approvals, ask Merge Oncalls or people with Write permission to merge the PR.

@ispobock ispobock changed the title Fix test suite names and add suite validation [CI] Fix test suite names and add suite validation Apr 2, 2026
Collaborator Author

ispobock commented Apr 2, 2026

/tag-and-rerun-ci

@github-actions github-actions bot added the run-ci label Apr 2, 2026
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request renames the CPU test suite from 'stage-a-cpu-only' to 'stage-a-test-cpu' across several unit test files and introduces a validation mechanism in test/run_suite.py to ensure tests are registered to valid suites. Feedback was provided regarding the validation logic, noting that it currently uses a flat set of suite names which could allow a test to be incorrectly registered to a suite intended for a different hardware backend. It is recommended to refactor the validation to be backend-aware and to include AMD and NPU backends in the check.
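
The backend-aware check recommended in this review could look roughly like the sketch below. It is not the code in test/run_suite.py: the `VALID_SUITES` mapping, the `Registration` record, the backend keys, and the `validate_registrations` helper are all hypothetical names chosen for illustration, and only the suite names that appear in this thread are listed.

```python
from dataclasses import dataclass

# Hypothetical per-backend registry. Only suite names mentioned in this PR
# thread are listed; the real set of suites lives in the repository.
VALID_SUITES = {
    "cpu": {"stage-a-test-cpu"},
    "cuda": {"stage-b-test-1-gpu-small"},
    "amd": set(),  # placeholder: the review asks for AMD to be covered too
    "npu": set(),  # placeholder: likewise for NPU
}


@dataclass(frozen=True)
class Registration:
    """One test file registered to a suite for a given backend (illustrative)."""
    filename: str
    backend: str
    suite: str


def validate_registrations(registrations):
    """Return one error message per registration whose suite is not valid
    for its own backend. An empty list means every registration is fine."""
    errors = []
    for reg in registrations:
        allowed = VALID_SUITES.get(reg.backend)
        if allowed is None:
            errors.append(f"{reg.filename}: unknown backend '{reg.backend}'")
        elif reg.suite not in allowed:
            errors.append(
                f"{reg.filename}: suite '{reg.suite}' is not valid for "
                f"backend '{reg.backend}'"
            )
    return errors


if __name__ == "__main__":
    problems = validate_registrations([
        Registration("test_ok.py", "cpu", "stage-a-test-cpu"),
        # Old name from before the rename: rejected.
        Registration("test_stale.py", "cpu", "stage-a-cpu-only"),
        # A real suite, but for another backend: a flat set would accept this.
        Registration("test_wrong_backend.py", "cpu", "stage-b-test-1-gpu-small"),
    ])
    for message in problems:
        print(message)
```

Keying the lookup by backend is what catches the third case: the suite name exists, so a flat set of names would pass it, but it is not a CPU suite.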

@github-actions github-actions bot added the blackwell SM100/SM120 label Apr 2, 2026
Collaborator Author

ispobock commented Apr 3, 2026

https://github.com/sgl-project/sglang/actions/runs/23933687537/job/69813982375?pr=21937#step:7:13701
The test_fused_temperature_softmax.py test cannot pass, so it is disabled for now.
@Godmook @BBuf @DarkSharpness could you take a look?

Collaborator Author

ispobock commented Apr 3, 2026

/rerun-stage stage-b-test-1-gpu-small

Contributor

github-actions bot commented Apr 3, 2026

✅ Triggered stage-b-test-1-gpu-small to run independently (skipping dependencies). View workflow run

@ispobock ispobock merged commit 47f4fd2 into main Apr 3, 2026
86 of 126 checks passed
@ispobock ispobock deleted the suite-check branch April 3, 2026 15:47
Contributor

Godmook commented Apr 3, 2026

https://github.com/sgl-project/sglang/actions/runs/23933687537/job/69813982375?pr=21937#step:7:13701 test_fused_temperature_softmax.py test cannot pass, disable it first. @Godmook @BBuf @DarkSharpness may have a check

@ispobock @BBuf @DarkSharpness
The tolerance difference comes from intermediate precision: the reference path (div_ on bf16 tensor) truncates to bf16 before softmax, while the Triton kernel keeps fp32 throughout. This naturally produces ~2e-4 absolute / ~1e-2 relative difference on bf16 inputs.

Bumped to atol=5e-4, rtol=2e-2, which is above the observed max error and still tighter than most fused kernel tests in the repo (typically 1e-2 or looser). Does this range look reasonable to you?
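
To illustrate the precision argument above, here is a small pure-Python sketch that rounds the temperature-scaled logits to bf16 on one path and keeps them in higher precision on the other. It is a toy model, not the actual test or the Triton kernel: `to_bf16`, `reference_path`, `fused_path`, and `all_close` are illustrative helpers, Python floats stand in for fp32, the input values are made up, and the comparison assumes the usual `atol + rtol * |expected|` form. The error it prints is for this toy input only and is not the figure observed in CI.

```python
import math
import struct


def to_bf16(x: float) -> float:
    """Round x to the nearest bfloat16 value (ties to even).

    Works on finite values of ordinary magnitude; NaN/inf are not handled.
    """
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    # bf16 is the top 16 bits of the float32 pattern; bias, then truncate.
    bits += 0x7FFF + ((bits >> 16) & 1)
    bits &= 0xFFFF0000
    return struct.unpack(">f", struct.pack(">I", bits))[0]


def softmax(xs):
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def reference_path(logits, temperature):
    # Models an in-place divide on a bf16 tensor: each quotient is
    # rounded back to bf16 before softmax sees it.
    return softmax([to_bf16(x / temperature) for x in logits])


def fused_path(logits, temperature):
    # Models a kernel that keeps the scaled logits in higher precision.
    return softmax([x / temperature for x in logits])


def all_close(actual, expected, atol, rtol):
    return all(abs(a - e) <= atol + rtol * abs(e)
               for a, e in zip(actual, expected))


# Made-up inputs; the logits themselves are exactly representable in bf16.
logits = [to_bf16(v) for v in (1.0, 2.0, 3.0, 0.5)]
temperature = 0.7

ref = reference_path(logits, temperature)
fused = fused_path(logits, temperature)
max_abs_err = max(abs(a - b) for a, b in zip(fused, ref))

print(f"max abs error: {max_abs_err:.2e}")
print("within atol=5e-4, rtol=2e-2:",
      all_close(fused, ref, atol=5e-4, rtol=2e-2))
```

Because bf16 spacing grows with magnitude, the rounding error on the scaled logits grows with their size, so a relative tolerance is the term that absorbs most of the mismatch on the larger probabilities.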
