
ci: Enable GB200 runners #2017

Merged
terrykong merged 49 commits into main from chtruong/gb200
Mar 9, 2026


@chtruong814
Contributor

@chtruong814 chtruong814 commented Feb 24, 2026

What does this PR do ?

  • Enable GB200 runners. PRs from external contributors fall back to A100 runners, because we cannot yet run CI for external contributors on the GB200 cluster; enabling that requires additional work.
  • The fp8 vLLM generation tests fail on GB200, so they are skipped for now. We will open an issue to track this.
  • The eval functional tests fail on GB200 because the resulting score differs from the expected one. The expected score is widened to a broader range.
  • The image name is now set by a GitHub Actions variable. It is currently "megatron-bridge" because the "rl" Docker repo has not yet been created in the cluster environment. We will update it before merging this PR.
  • Going forward, CI must be kicked off by commenting /ok to test <commit_sha>. CI still respects the labels applied. This matches how CI is kicked off in other repos.
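
As a rough illustration of the runner fallback and the comment trigger described above, here is a minimal Python sketch. It is not the workflow code from this PR: the runner labels, function names, and the regular expression are hypothetical, and the optional SHA is an assumption based on the bare `/ok to test` comments that appear later in this thread.

```python
import re
from typing import Optional, Tuple

# Hypothetical runner labels; the real label names are not given in this PR.
GB200_RUNNER = "gb200-runner"
A100_RUNNER = "a100-runner"

# Matches "/ok to test" optionally followed by a 7-40 character hex commit SHA.
_OK_TO_TEST = re.compile(r"^/ok to test(?:\s+([0-9a-f]{7,40}))?\s*$")


def select_runner(is_external_contributor: bool) -> str:
    """Pick the A100 fallback for external contributors, GB200 otherwise."""
    return A100_RUNNER if is_external_contributor else GB200_RUNNER


def parse_ok_to_test(comment: str) -> Tuple[bool, Optional[str]]:
    """Return (triggered, sha); sha is None when the comment omits it."""
    match = _OK_TO_TEST.match(comment.strip())
    if match is None:
        return (False, None)
    return (True, match.group(1))
```

For example, `parse_ok_to_test("/ok to test 73e70e8")` reports a trigger for commit `73e70e8`, while an ordinary review comment is not treated as a trigger.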

Issues

List issues that this PR closes (syntax):

Before your PR is "Ready for review"

Pre checks:

  • Make sure you read and followed Contributor guidelines
  • Did you write any new necessary tests?
  • Did you run the unit tests and functional tests locally? Visit our Testing Guide for how to run tests
  • Did you add or update any necessary documentation? Visit our Document Development Guide for how to write, build and test the docs.

Summary by CodeRabbit

  • Chores

    • Updated CI/CD infrastructure with configurable registry and test data path parameters; simplified test setup by removing cloud-specific dependencies
    • Modified PR bot auto-sync configuration
  • Bug Fixes

    • Added hardware compatibility checks to skip FP8 tests on unsupported devices
  • Tests

    • Disabled four GPU functional tests (eval and grpo variants)
  • Dependencies

    • Replaced decord with decord2
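
The FP8 hardware compatibility check mentioned in the summary could look roughly like the sketch below. This is an assumption-laden illustration, not the check added in this PR: the function name, the minimum capability threshold, and the set of capabilities treated as failing are all hypothetical, and the (10, 0) entry assumes GB200 reports CUDA compute capability 10.0.

```python
from typing import FrozenSet, Tuple

Capability = Tuple[int, int]

# FP8 tensor cores first appear on Ada (8.9) and Hopper (9.0) GPUs.
MIN_FP8_CAPABILITY: Capability = (8, 9)

# Assumption: capabilities where the FP8 tests are currently known to fail.
# (10, 0) stands in for GB200 here; the PR does not list the actual devices.
KNOWN_FAILING: FrozenSet[Capability] = frozenset({(10, 0)})


def should_skip_fp8(
    capability: Capability,
    known_failing: FrozenSet[Capability] = KNOWN_FAILING,
) -> bool:
    """Skip when the device lacks FP8 support or is on the known-failing list."""
    return capability < MIN_FP8_CAPABILITY or capability in known_failing
```

In a pytest suite, the result could feed `pytest.mark.skipif`, with the capability read from `torch.cuda.get_device_capability()`.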

Signed-off-by: Charlie Truong <chtruong@nvidia.com>
@chtruong814 chtruong814 added the CI:docs Run doctest label Feb 24, 2026
@github-actions github-actions bot added CI Relating to CI and removed CI:docs Run doctest labels Feb 24, 2026
Signed-off-by: Charlie Truong <chtruong@nvidia.com>
Signed-off-by: Charlie Truong <chtruong@nvidia.com>
@chtruong814 chtruong814 added CI:docs Run doctest and removed CI:docs Run doctest labels Feb 26, 2026
@chtruong814 chtruong814 added CI:docs Run doctest and removed CI:docs Run doctest labels Feb 26, 2026
Signed-off-by: Charlie Truong <chtruong@nvidia.com>
@chtruong814 chtruong814 added CI:L1 Run doctests, unit tests, and functional tests and removed CI:docs Run doctest labels Feb 28, 2026
Signed-off-by: Charlie Truong <chtruong@nvidia.com>
@chtruong814 chtruong814 added CI:Lfast Runs a fast test suite and re-use nightly `main` container (but sync dependencies to PRs version) and removed CI:L1 Run doctests, unit tests, and functional tests labels Feb 28, 2026
Signed-off-by: Charlie Truong <chtruong@nvidia.com>
@chtruong814 chtruong814 added CI:Lfast Runs a fast test suite and re-use nightly `main` container (but sync dependencies to PRs version) and removed CI:Lfast Runs a fast test suite and re-use nightly `main` container (but sync dependencies to PRs version) labels Feb 28, 2026
@copy-pr-bot

copy-pr-bot bot commented Mar 4, 2026

Auto-sync is disabled for ready for review pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

@chtruong814
Contributor Author

/ok to test 73e70e8

Signed-off-by: Charlie Truong <chtruong@nvidia.com>
@chtruong814
Contributor Author

/ok to test 836c8cb

Signed-off-by: Charlie Truong <chtruong@nvidia.com>
@chtruong814
Contributor Author

/ok to test 7166bce

kajalj22
kajalj22 previously approved these changes Mar 6, 2026
Signed-off-by: Charlie Truong <chtruong@nvidia.com>
Signed-off-by: Charlie Truong <chtruong@nvidia.com>
Signed-off-by: Charlie Truong <chtruong@nvidia.com>
terrykong
terrykong previously approved these changes Mar 6, 2026
Signed-off-by: Charlie Truong <chtruong@nvidia.com>
@chtruong814
Contributor Author

/ok to test

Signed-off-by: Charlie Truong <chtruong@nvidia.com>
Signed-off-by: Charlie Truong <chtruong@nvidia.com>
@chtruong814
Contributor Author

/ok to test

Signed-off-by: Charlie Truong <chtruong@nvidia.com>
@chtruong814
Contributor Author

/ok to test

Signed-off-by: Charlie Truong <chtruong@nvidia.com>
@chtruong814
Contributor Author

/ok to test

Signed-off-by: Charlie Truong <chtruong@nvidia.com>
@chtruong814
Contributor Author

/ok to test


Labels

CI:L1 Run doctests, unit tests, and functional tests CI Relating to CI


5 participants