
[MoE Refactor] Add Temporary Integration Tests - H100/B200 #31759

Merged
mgoin merged 17 commits into main from add-more-ci-for-moe-refactor on Jan 6, 2026

@robertgshaw2-redhat (Collaborator) commented Jan 6, 2026

Purpose

  • Add a CI job to validate the MoE refactor.
  • This job is very compute-intensive and duplicative, so it is temporary; I will run it on my own PRs.


Signed-off-by: Robert Shaw <robshaw@redhat.com>
Signed-off-by: Robert Shaw <robshaw@redhat.com>
@gemini-code-assist (bot, Contributor) left a comment

Code Review

This pull request adds H100 integration tests for MoE refactoring. My review found several critical and high-severity issues, primarily related to test configuration.

  • The new Buildkite pipeline step is misconfigured: it points to the wrong test script and configuration file, and it uses an incorrect tensor parallelism size.
  • Several YAML configuration files contain errors, such as typos in environment variable names, invalid syntax, conflicting settings (e.g., enabling both DeepGEMM and FlashInfer), and mismatches between filenames and their content (e.g., a 'cutlass' test enabling 'marlin').
  • The main test list file (config-h100.txt) is incomplete and omits several of the newly added test configurations, meaning they would not be executed.

These issues need to be addressed to ensure the new integration tests run correctly and validate the intended configurations.
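
The completeness problem described above (YAML configs that exist on disk but are missing from the list file) can be caught mechanically. The following is a minimal sketch, not the repository's actual tooling. It assumes the list file holds one YAML filename per line, that blank lines and lines starting with `#` can be skipped, and that every `*.yaml` file in the directory is meant to appear in that single list. The helper names `read_config_list` and `unlisted_configs` are hypothetical.

```python
from pathlib import Path


def read_config_list(list_file: Path) -> set[str]:
    """Return the config names in a list file.

    Assumption: one filename per line; blank lines and '#' comments are skipped.
    """
    names = set()
    for raw in list_file.read_text().splitlines():
        line = raw.strip()
        if line and not line.startswith("#"):
            names.add(line)
    return names


def unlisted_configs(config_dir: Path, list_file: Path) -> list[str]:
    """Return YAML files in config_dir that the list file never mentions."""
    listed = read_config_list(list_file)
    on_disk = sorted(p.name for p in config_dir.glob("*.yaml"))
    return [name for name in on_disk if name not in listed]
```

If a directory serves more than one list file (for example one per GPU type), the check would need to compare against the union of the lists rather than a single file.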

Signed-off-by: Robert Shaw <robshaw@redhat.com>
@robertgshaw2-redhat changed the title from "add h100 integration tests" to "[MoE Refactor] Add Temporary Integration Tests" Jan 6, 2026
@robertgshaw2-redhat marked this pull request as ready for review January 6, 2026 01:36
@robertgshaw2-redhat added the "ready" label (ONLY add when PR is ready to merge/full CI is needed) Jan 6, 2026
@robertgshaw2-redhat changed the title from "[MoE Refactor] Add Temporary Integration Tests" to "[MoE Refactor] Add Temporary Integration Tests - H100" Jan 6, 2026
Robert Shaw added 3 commits January 6, 2026 02:04
Signed-off-by: Robert Shaw <robshaw@redhat.com>
Signed-off-by: Robert Shaw <robshaw@redhat.com>
Signed-off-by: Robert Shaw <robshaw@redhat.com>
@robertgshaw2-redhat (Collaborator, Author) commented:

I unblocked the job

Signed-off-by: Robert Shaw <robshaw@redhat.com>
mergify bot mentioned this pull request Jan 6, 2026
5 tasks
@robertgshaw2-redhat changed the title from "[MoE Refactor] Add Temporary Integration Tests - H100" to "[MoE Refactor] Add Temporary Integration Tests - H100/B200" Jan 6, 2026
Robert Shaw added 4 commits January 5, 2026 23:07
nit
Signed-off-by: Robert Shaw <robshaw@redhat.com>
Signed-off-by: Robert Shaw <robshaw@redhat.com>
Signed-off-by: Robert Shaw <robshaw@redhat.com>
Signed-off-by: Robert Shaw <robshaw@redhat.com>
@robertgshaw2-redhat (Collaborator, Author) commented:

/gemini review

@gemini-code-assist (bot, Contributor) left a comment

Code Review

This pull request introduces temporary integration tests for the MoE refactor, targeting H100 and B200 GPUs. It adds new Buildkite pipeline steps and a comprehensive set of YAML configuration files for various test scenarios. My review identified a critical syntax error in one of the YAML configuration files that would cause a CI job to fail, and a misconfiguration in another CI job where an incorrect test suite was specified. Addressing these issues will ensure the new temporary tests run correctly.

num_questions: 1319
num_fewshot: 5
server_args: "--enforce-eager --max-model-len 8192 --tensor-parallel-size 2"


critical

This line contains only indented whitespace, which is invalid YAML syntax in this context. This will cause the YAML parser to fail and crash the test run. Please remove this line to fix the syntax.
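
Whether a whitespace-only line actually breaks parsing depends on the YAML parser and on where the line sits, so this is worth verifying rather than assuming. A minimal sketch of a lint-style check that reports such lines before the file ever reaches a parser is shown here. The function name `whitespace_only_lines` is hypothetical, and the check is deliberately parser-independent: it only flags lines made up entirely of spaces or tabs.

```python
def whitespace_only_lines(text: str) -> list[int]:
    """Return 1-based numbers of lines that contain only spaces or tabs.

    Truly empty lines are not reported; only lines with stray indentation are.
    """
    return [
        number
        for number, line in enumerate(text.splitlines(), start=1)
        if line != "" and line.strip(" \t") == ""
    ]


# Example input modelled on the snippet under review; the stray
# indented line is the third line of the text.
sample = 'num_questions: 1319\nnum_fewshot: 5\n    \nserver_args: "--enforce-eager"\n'
print(whitespace_only_lines(sample))
```

Running a check like this over every config file in the test directory would surface the offending line numbers without depending on any particular YAML library.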

optional: true
num_gpus: 2
commands:
- pytest -s -v evals/gsm8k/test_gsm8k_correctness.py --config-list-file=evals/gsm8k/configs/moe-refactor/config-h100.txt

high

The B200 integration test is configured to use the config-h100.txt file. This appears to be a copy-paste error and will cause the H100 test suite to run on B200 hardware, instead of the intended B200-specific tests defined in config-b200.txt.

    - pytest -s -v evals/gsm8k/test_gsm8k_correctness.py --config-list-file=evals/gsm8k/configs/moe-refactor/config-b200.txt
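
A copy-paste mismatch like this one can be guarded against with a small consistency check between the hardware a step targets and the list file its command names. The sketch below assumes the naming convention visible in this PR, where the list file ends in `config-<hardware>.txt`, and assumes the path is passed as `--config-list-file=<path>`. The helper names `config_list_file` and `targets_hardware` are hypothetical and are not part of the pipeline.

```python
import shlex
from typing import Optional


def config_list_file(command: str) -> Optional[str]:
    """Extract the value of --config-list-file=... from a shell command."""
    for token in shlex.split(command):
        if token.startswith("--config-list-file="):
            return token.split("=", 1)[1]
    return None


def targets_hardware(command: str, hardware: str) -> bool:
    """Check that the command's list file follows config-<hardware>.txt."""
    path = config_list_file(command)
    return path is not None and path.endswith(f"config-{hardware}.txt")


base = "pytest -s -v evals/gsm8k/test_gsm8k_correctness.py --config-list-file="
flagged = base + "evals/gsm8k/configs/moe-refactor/config-h100.txt"
suggested = base + "evals/gsm8k/configs/moe-refactor/config-b200.txt"

print(targets_hardware(flagged, "b200"))    # False
print(targets_hardware(suggested, "b200"))  # True
```

The check relies entirely on the filename convention, so it would need adjusting if list files were ever named differently.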

mergify bot commented Jan 6, 2026

Hi @robertgshaw2-redhat, the pre-commit checks have failed. Please run:

uv pip install pre-commit
pre-commit install
pre-commit run --all-files

Then, commit the changes and push to your branch.

For future commits, pre-commit will run automatically on changed files before each commit.

Tip

Is mypy or markdownlint failing?
mypy and markdownlint are run differently in CI. If the failure is related to either of these checks, please use the following commands to run them locally:
# For mypy (substitute "3.10" with the failing version if needed)
pre-commit run --hook-stage manual mypy-3.10
# For markdownlint
pre-commit run --hook-stage manual markdownlint

Signed-off-by: Robert Shaw <robshaw@redhat.com>
Robert Shaw and others added 2 commits January 6, 2026 09:03
Signed-off-by: Robert Shaw <robshaw@redhat.com>
github-project-automation bot moved this to Backlog in MoE Refactor Jan 6, 2026
@robertgshaw2-redhat moved this from Backlog to In progress in MoE Refactor Jan 6, 2026
@robertgshaw2-redhat moved this from Done to In review in MoE Refactor Jan 6, 2026
@robertgshaw2-redhat moved this from In progress to Done in MoE Refactor Jan 6, 2026
@mgoin (Member) left a comment

nice work!

@mgoin merged commit d3e477c into main Jan 6, 2026
21 checks passed
@mgoin deleted the add-more-ci-for-moe-refactor branch January 6, 2026 15:34
github-project-automation bot moved this from In review to Done in MoE Refactor Jan 6, 2026
LucasWilkinson pushed a commit to neuralmagic/vllm that referenced this pull request Jan 6, 2026
…ect#31759)

Signed-off-by: Robert Shaw <robshaw@redhat.com>
Co-authored-by: Robert Shaw <robshaw@redhat.com>
yugong333 pushed a commit to yugong333/vllm that referenced this pull request Jan 9, 2026
…ect#31759)

Signed-off-by: Robert Shaw <robshaw@redhat.com>
Co-authored-by: Robert Shaw <robshaw@redhat.com>
akh64bit pushed a commit to akh64bit/vllm that referenced this pull request Jan 16, 2026
…ect#31759)

Signed-off-by: Robert Shaw <robshaw@redhat.com>
Co-authored-by: Robert Shaw <robshaw@redhat.com>
dsuhinin pushed a commit to dsuhinin/vllm that referenced this pull request Jan 21, 2026
…ect#31759)

Signed-off-by: Robert Shaw <robshaw@redhat.com>
Co-authored-by: Robert Shaw <robshaw@redhat.com>
Signed-off-by: dsuhinin <suhinin.dmitriy@gmail.com>
ItzDEXX pushed a commit to ItzDEXX/vllm that referenced this pull request Feb 19, 2026
…ect#31759)

Signed-off-by: Robert Shaw <robshaw@redhat.com>
Co-authored-by: Robert Shaw <robshaw@redhat.com>

Labels

ci/build, nvidia, ready (ONLY add when PR is ready to merge/full CI is needed)

Projects

Status: Done


2 participants