[MoE Refactor] Add Temporary Integration Tests - H100/B200#31759
Signed-off-by: Robert Shaw <robshaw@redhat.com>
Signed-off-by: Robert Shaw <robshaw@redhat.com>
Code Review
This pull request adds H100 integration tests for MoE refactoring. My review found several critical and high-severity issues, primarily related to test configuration.
- The new Buildkite pipeline step is misconfigured: it points to the wrong test script and configuration file, and it uses an incorrect tensor parallelism size.
- Several YAML configuration files contain errors, such as typos in environment variable names, invalid syntax, conflicting settings (e.g., enabling both DeepGEMM and FlashInfer), and mismatches between filenames and their content (e.g., a 'cutlass' test enabling 'marlin').
- The main test list file (config-h100.txt) is incomplete and omits several of the newly added test configurations, meaning they would not be executed.
These issues need to be addressed to ensure the new integration tests run correctly and validate the intended configurations.
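One way to catch the incomplete list file described above is a small coverage check. The following is a minimal sketch, not part of this PR: the helper name `find_unlisted_configs` is hypothetical, and it assumes the list file names one YAML config per line, with blank lines and `#` comments ignored.

```python
from pathlib import Path


def find_unlisted_configs(config_dir: Path, list_file: Path) -> list[str]:
    """Return YAML file names in config_dir that list_file does not mention.

    Assumption (not confirmed by the PR): the list file holds one config
    name or path per line; blank lines and '#' comments are skipped.
    """
    listed = set()
    for raw in list_file.read_text().splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        # Compare by file name only, so relative paths in the list still match.
        listed.add(Path(line).name)

    present = {p.name for p in config_dir.glob("*.yaml")}
    return sorted(present - listed)
```

Run against the moe-refactor config directory and config-h100.txt, a non-empty result would name the configurations that exist on disk but would never be executed.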
tests/evals/gsm8k/configs/moe-refactor/Llama-4-Scout-Fp8-ModelOpt-fi-trtllm.yaml
tests/evals/gsm8k/configs/moe-refactor/Qwen3-30B-A3B-Fp8-CT-Channel-marlin.yaml
tests/evals/gsm8k/configs/moe-refactor/Qwen3-30B-A3B-Fp8-CT-Channel-vllm-cutlass.yaml
tests/evals/gsm8k/configs/moe-refactor/Llama-4-Scout-Fp8-ModelOpt-triton.yaml
tests/evals/gsm8k/configs/moe-refactor/Qwen3-30B-A3B-Fp8-CT-Block-fi-cutlass.yaml
Signed-off-by: Robert Shaw <robshaw@redhat.com>
…m-project/vllm into add-more-ci-for-moe-refactor
Signed-off-by: Robert Shaw <robshaw@redhat.com>
Signed-off-by: Robert Shaw <robshaw@redhat.com>
I unblocked the job
Signed-off-by: Robert Shaw <robshaw@redhat.com>
Signed-off-by: Robert Shaw <robshaw@redhat.com>
Signed-off-by: Robert Shaw <robshaw@redhat.com>
Signed-off-by: Robert Shaw <robshaw@redhat.com>
/gemini review
Code Review
This pull request introduces temporary integration tests for the MoE refactor, targeting H100 and B200 GPUs. It adds new Buildkite pipeline steps and a comprehensive set of YAML configuration files for various test scenarios. My review identified a critical syntax error in one of the YAML configuration files that would cause a CI job to fail, and a misconfiguration in another CI job where an incorrect test suite was specified. Addressing these issues will ensure the new temporary tests run correctly.
num_questions: 1319
num_fewshot: 5
server_args: "--enforce-eager --max-model-len 8192 --tensor-parallel-size 2"
.buildkite/test-pipeline.yaml
optional: true
num_gpus: 2
commands:
- pytest -s -v evals/gsm8k/test_gsm8k_correctness.py --config-list-file=evals/gsm8k/configs/moe-refactor/config-h100.txt
The B200 integration test is configured to use the config-h100.txt file. This appears to be a copy-paste error and will cause the H100 test suite to run on B200 hardware, instead of the intended B200-specific tests defined in config-b200.txt.
- pytest -s -v evals/gsm8k/test_gsm8k_correctness.py --config-list-file=evals/gsm8k/configs/moe-refactor/config-b200.txt
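A copy-paste mismatch like this one could be flagged automatically. The sketch below is illustrative only: the function name `mismatched_config_lists` and the step dictionaries (with `label` and `commands` keys) are hypothetical stand-ins for parsed pipeline steps, and it assumes each step's label mentions the GPU type that its `config-<gpu>.txt` file targets.

```python
import re

# Matches e.g. "--config-list-file=evals/.../config-h100.txt" and captures "h100".
CONFIG_LIST_RE = re.compile(r"--config-list-file=\S*/config-(\w+)\.txt")


def mismatched_config_lists(steps: list[dict]) -> list[tuple[str, str]]:
    """Return (step label, config file) pairs whose hardware names disagree.

    Assumption: a step's label contains the GPU name (e.g. "B200") that the
    referenced config-<gpu>.txt file is meant for.
    """
    problems = []
    for step in steps:
        label = step["label"].lower()
        for command in step["commands"]:
            match = CONFIG_LIST_RE.search(command)
            if match and match.group(1).lower() not in label:
                problems.append((step["label"], f"config-{match.group(1)}.txt"))
    return problems
```

For a step labeled for B200 whose command still passes config-h100.txt, the function returns that step's label together with the offending file name; a correctly paired step produces no entry.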
Hi @robertgshaw2-redhat, the pre-commit checks have failed. Please run:
uv pip install pre-commit
pre-commit install
pre-commit run --all-files
Then, commit the changes and push to your branch.
Signed-off-by: Robert Shaw <robshaw@redhat.com>
Signed-off-by: Robert Shaw <robshaw@redhat.com>
…ect#31759) Signed-off-by: Robert Shaw <robshaw@redhat.com> Co-authored-by: Robert Shaw <robshaw@redhat.com>
…ect#31759) Signed-off-by: Robert Shaw <robshaw@redhat.com> Co-authored-by: Robert Shaw <robshaw@redhat.com> Signed-off-by: dsuhinin <suhinin.dmitriy@gmail.com>