16 changes: 16 additions & 0 deletions .buildkite/test-pipeline.yaml
@@ -1406,3 +1406,19 @@ steps:
working_dir: "/vllm-workspace"
commands:
- bash .buildkite/scripts/scheduled_integration_test/qwen30b_a3b_fp8_block_ep_eplb.sh 0.8 200 8020 2 1

##### MoE Refactor (Temporary) Tests #####

- label: MoE Refactor Integration Test (H100 - TEMPORARY) # optional
gpu: h100
optional: true
num_gpus: 2
commands:
- pytest -s -v evals/gsm8k/test_gsm8k_correctness.py --config-list-file=evals/gsm8k/configs/moe-refactor/config-h100.txt

- label: MoE Refactor Integration Test (B200 - TEMPORARY) # optional
gpu: b200
optional: true
num_gpus: 2
commands:
- pytest -s -v evals/gsm8k/test_gsm8k_correctness.py --config-list-file=evals/gsm8k/configs/moe-refactor/config-h100.txt
critical

The B200 integration test is incorrectly using the configuration file for H100 (config-h100.txt). This will cause the wrong set of tests to be executed on the B200 hardware. It should be using config-b200.txt to run the tests intended for B200.

    - pytest -s -v evals/gsm8k/test_gsm8k_correctness.py --config-list-file=evals/gsm8k/configs/moe-refactor/config-b200.txt
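This kind of hardware/config mismatch is also easy to catch mechanically before merge. A minimal sketch of such a guard — the step shape and the `config-<gpu>.txt` naming convention are assumptions for illustration, not part of this PR:

```python
import re

def find_mismatches(steps):
    """Flag steps whose gpu label disagrees with the hardware suffix of the
    config-list file their commands reference (steps: list of dicts with
    'label', 'gpu', and 'commands' keys -- an assumed shape)."""
    mismatches = []
    for step in steps:
        gpu = step.get("gpu", "")
        for cmd in step.get("commands", []):
            m = re.search(r"config-(\w+)\.txt", cmd)
            if m and m.group(1) != gpu:
                mismatches.append((step.get("label"), gpu, m.group(1)))
    return mismatches

steps = [
    {"label": "MoE Refactor Integration Test (B200 - TEMPORARY)", "gpu": "b200",
     "commands": ["pytest ... --config-list-file=evals/gsm8k/configs/moe-refactor/config-h100.txt"]},
]
print(find_mismatches(steps))  # flags the b200 step pointing at config-h100.txt
```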

@@ -0,0 +1,8 @@
model_name: "nvidia/Llama-4-Scout-17B-16E-Instruct-FP8"
accuracy_threshold: 0.92
num_questions: 1319
num_fewshot: 5
server_args: "--enforce-eager --max-model-len 8192 --tensor-parallel-size 2"
env:
VLLM_USE_FLASHINFER_MOE_FP8: "1"
VLLM_FLASHINFER_MOE_BACKEND: "throughput"
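For context on how these fields are consumed: a harness would load each config, apply the `env` block as process environment overrides, and start the server with `server_args`. A stdlib-only sketch of a loader for this flat format — the actual `test_gsm8k_correctness.py` implementation presumably uses a YAML library and may differ:

```python
def parse_config(text):
    """Minimal parser for the flat key/value + optional indented env block
    used by these config files (sketch only)."""
    cfg, env = {}, {}
    in_env = False
    for line in text.splitlines():
        if not line.strip():
            continue
        indented = line.startswith(" ")
        key, _, value = line.strip().partition(":")
        value = value.strip().strip('"')
        if key == "env" and not value:
            in_env = True
            continue
        if in_env and indented:
            env[key] = value      # e.g. VLLM_USE_FLASHINFER_MOE_FP8 -> "1"
        else:
            in_env = False
            cfg[key] = value
    if env:
        cfg["env"] = env
    return cfg
```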
@@ -0,0 +1,8 @@
model_name: "nvidia/Llama-4-Scout-17B-16E-Instruct-FP8"
accuracy_threshold: 0.92
num_questions: 1319
num_fewshot: 5
server_args: "--enforce-eager --max-model-len 8192 --tensor-parallel-size 2"
env:
VLLM_USE_FLASHINFER_MOE_FP8: "1"
VLLM_FLASHINFER_MaOE_BACKEND: "latency"
high

There is a typo in the environment variable name VLLM_FLASHINFER_MaOE_BACKEND. It should be VLLM_FLASHINFER_MOE_BACKEND. This typo will prevent the correct backend from being configured, causing the test to not run as intended.

  VLLM_FLASHINFER_MOE_BACKEND: "latency"
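Typos like `MaOE` can also be caught mechanically by checking env keys against a known-good set. A hedged sketch — the allowlist below is illustrative, not vLLM's actual env-var registry:

```python
# Illustrative allowlist of the env vars used across these configs.
KNOWN_VARS = {
    "VLLM_USE_FLASHINFER_MOE_FP8",
    "VLLM_USE_FLASHINFER_MOE_FP4",
    "VLLM_FLASHINFER_MOE_BACKEND",
    "VLLM_USE_DEEP_GEMM",
    "VLLM_USE_DEEP_GEMM_MOE",
    "VLLM_TEST_FORCE_FP8_MARLIN",
}

def unknown_env_keys(env):
    """Return env keys that don't appear in the allowlist (likely typos)."""
    return sorted(k for k in env if k not in KNOWN_VARS)

env = {"VLLM_USE_FLASHINFER_MOE_FP8": "1",
       "VLLM_FLASHINFER_MaOE_BACKEND": "latency"}
print(unknown_env_keys(env))  # ['VLLM_FLASHINFER_MaOE_BACKEND']
```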

@@ -0,0 +1,7 @@
model_name: "nvidia/Llama-4-Scout-17B-16E-Instruct-FP8"
accuracy_threshold: 0.92
num_questions: 1319
num_fewshot: 5
server_args: "--enforce-eager --max-model-len 8192 --tensor-parallel-size 2"
env:
VLLM_TEST_FORCE_FP8_MARLIN: "1"
@@ -0,0 +1,6 @@
model_name: "nvidia/Llama-4-Scout-17B-16E-Instruct-FP8"
accuracy_threshold: 0.92
num_questions: 1319
num_fewshot: 5
server_args: "--enforce-eager --max-model-len 8192 --tensor-parallel-size 2"

high

This line contains only indentation. A whitespace-only trailing line is at best noise and, depending on the parser, can cause issues when the test configuration is loaded. Please remove it to keep the file clean.
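Whichever way a given parser treats it, whitespace-only lines are easy to screen for before commit. A minimal sketch of such a check:

```python
def whitespace_only_lines(text):
    """Return 1-based indices of lines that are non-empty but contain only
    whitespace (spaces/tabs) -- blank-looking lines that linters flag."""
    return [i for i, line in enumerate(text.splitlines(), start=1)
            if line and not line.strip()]

sample = 'model_name: "m"\nserver_args: "--enforce-eager"\n  \n'
print(whitespace_only_lines(sample))  # [3]
```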

@@ -0,0 +1,9 @@
# TODO(rob): enable
# model_name: "amd/Mixtral-8x7B-Instruct-v0.1-FP8-KV"
# accuracy_threshold: 0.62
# num_questions: 1319
# num_fewshot: 5
# server_args: "--enforce-eager --max-model-len 8192 --tensor-parallel-size 2"
# env:
# VLLM_USE_FLASHINFER_MOE_FP8: "1"
# VLLM_FLASHINFER_MOE_BACKEND: "throughput"
@@ -0,0 +1,5 @@
model_name: "amd/Mixtral-8x7B-Instruct-v0.1-FP8-KV"
accuracy_threshold: 0.62
num_questions: 1319
num_fewshot: 5
server_args: "--enforce-eager --max-model-len 8192 --tensor-parallel-size 2"
@@ -0,0 +1,8 @@
model_name: "Qwen/Qwen3-Coder-30B-A3B-Instruct-FP8"
accuracy_threshold: 0.88
num_questions: 1319
num_fewshot: 5
server_args: "--enforce-eager --max-model-len 8192 --tensor-parallel-size 2"
env:
VLLM_USE_DEEP_GEMM: "1"
VLLM_USE_DEEP_GEMM_MOE: "1"
@@ -0,0 +1,10 @@
model_name: "Qwen/Qwen3-Coder-30B-A3B-Instruct-FP8"
accuracy_threshold: 0.88
num_questions: 1319
num_fewshot: 5
server_args: "--enforce-eager --max-model-len 8192 --tensor-parallel-size 2"
env:
VLLM_USE_DEEP_GEMM: "0"
VLLM_USE_DEEP_GEMM_MOE: "0"
VLLM_USE_FLASHINFER_MOE_FP8: "1"
VLLM_FLASHINFER_MOE_BACKEND: "throughput"
@@ -0,0 +1,10 @@
model_name: "Qwen/Qwen3-Coder-30B-A3B-Instruct-FP8"
accuracy_threshold: 0.88
num_questions: 1319
num_fewshot: 5
server_args: "--enforce-eager --max-model-len 8192 --tensor-parallel-size 2"
env:
VLLM_USE_DEEP_GEMM: "0"
VLLM_USE_DEEP_GEMM_MOE: "0"
VLLM_USE_FLASHINFER_MOE_FP8: "1"
VLLM_FLASHINFER_MOE_BACKEND: "latency"
@@ -0,0 +1,9 @@
model_name: "Qwen/Qwen3-Coder-30B-A3B-Instruct-FP8"
accuracy_threshold: 0.88
num_questions: 1319
num_fewshot: 5
server_args: "--enforce-eager --max-model-len 8192 --tensor-parallel-size 2"
env:
VLLM_USE_DEEP_GEMM: "0"
VLLM_USE_DEEP_GEMM_MOE: "0"
VLLM_TEST_FORCE_FP8_MARLIN: "1"
@@ -0,0 +1,8 @@
model_name: "Qwen/Qwen3-Coder-30B-A3B-Instruct-FP8"
accuracy_threshold: 0.88
num_questions: 1319
num_fewshot: 5
server_args: "--enforce-eager --max-model-len 8192 --tensor-parallel-size 2"
env:
VLLM_USE_DEEP_GEMM: "0"
VLLM_USE_DEEP_GEMM_MOE: "0"
@@ -0,0 +1,8 @@
model_name: "RedHatAI/Qwen3-30B-A3B-FP8-block"
accuracy_threshold: 0.85
num_questions: 1319
num_fewshot: 5
server_args: "--enforce-eager --max-model-len 8192 --tensor-parallel-size 2"
env:
VLLM_USE_DEEP_GEMM: "1"
VLLM_USE_DEEP_GEMM_MOE: "1"
@@ -0,0 +1,10 @@
model_name: "RedHatAI/Qwen3-30B-A3B-FP8-block"
accuracy_threshold: 0.85
num_questions: 1319
num_fewshot: 5
server_args: "--enforce-eager --max-model-len 8192 --tensor-parallel-size 2"
env:
VLLM_USE_DEEP_GEMM: "1"
VLLM_USE_DEEP_GEMM_MOE: "1"
VLLM_USE_FLASHINFER_MOE_FP8: "1"
VLLM_FLASHINFER_MOE_BACKEND: "throughput"
@@ -0,0 +1,9 @@
model_name: "RedHatAI/Qwen3-30B-A3B-FP8-block"
accuracy_threshold: 0.85
num_questions: 1319
num_fewshot: 5
server_args: "--enforce-eager --max-model-len 8192 --tensor-parallel-size 2"
env:
VLLM_USE_DEEP_GEMM: "0"
VLLM_USE_DEEP_GEMM_MOE: "0"
VLLM_TEST_FORCE_FP8_MARLIN: "1"
@@ -0,0 +1,5 @@
model_name: "RedHatAI/Qwen3-30B-A3B-FP8-block"
accuracy_threshold: 0.85
num_questions: 1319
num_fewshot: 5
server_args: "--enforce-eager --max-model-len 8192 --tensor-parallel-size 2"
@@ -0,0 +1,5 @@
model_name: "RedHatAI/Qwen3-30B-A3B-FP8-dynamic"
accuracy_threshold: 0.85
num_questions: 1319
num_fewshot: 5
server_args: "--enforce-eager --max-model-len 8192 --tensor-parallel-size 2"
@@ -0,0 +1,7 @@
model_name: "RedHatAI/Qwen3-30B-A3B-FP8-dynamic"
accuracy_threshold: 0.85
num_questions: 1319
num_fewshot: 5
server_args: "--enforce-eager --max-model-len 8192 --tensor-parallel-size 2"
env:
VLLM_TEST_FORCE_FP8_MARLIN: "1"
Comment on lines +6 to +7

high

The filename suggests a vllm-cutlass configuration, but the environment variable VLLM_TEST_FORCE_FP8_MARLIN is set, which forces the marlin kernel. This is inconsistent and will not test the intended vllm-cutlass kernel. Based on other vllm-cutlass configurations, this env block should be removed.

@@ -0,0 +1,8 @@
model_name: "RedHatAI/Qwen3-30B-A3B-NVFP4"
accuracy_threshold: 0.88
num_questions: 1319
num_fewshot: 5
server_args: "--enforce-eager --max-model-len 8192 --data-parallel-size 2 --enable-expert-parallel"
env:
VLLM_USE_FLASHINFER_MOE_FP4: "1"
VLLM_FLASHINFER_MOE_BACKEND: "throughput"
@@ -0,0 +1,8 @@
model_name: "RedHatAI/Qwen3-30B-A3B-NVFP4"
accuracy_threshold: 0.88
num_questions: 1319
num_fewshot: 5
server_args: "--enforce-eager --max-model-len 8192 --tensor-parallel-size 2"
env:
VLLM_USE_FLASHINFER_MOE_FP4: "1"
VLLM_FLASHINFER_MOE_BACKEND: "throughput"
@@ -0,0 +1,8 @@
model_name: "RedHatAI/Qwen3-30B-A3B-NVFP4"
accuracy_threshold: 0.88
num_questions: 1319
num_fewshot: 5
server_args: "--enforce-eager --max-model-len 8192 --tensor-parallel-size 2"
env:
VLLM_USE_FLASHINFER_MOE_FP4: "1"
VLLM_FLASHINFER_MOE_BACKEND: "latency"
@@ -0,0 +1,8 @@
model_name: "RedHatAI/Qwen3-30B-A3B-NVFP4"
accuracy_threshold: 0.88
num_questions: 1319
num_fewshot: 5
server_args: "--enforce-eager --max-model-len 8192 --tensor-parallel-size 2"
env:
VLLM_USE_FLASHINFER_MOE_FP4: "1"
VLLM_FLASHINFER_MOE_BACKEND: "throughput"
Comment on lines +6 to +8

high

The filename indicates a marlin test configuration, but the environment variables are set for flashinfer. This is inconsistent and will not test the marlin kernel. Please update the environment variables to be consistent with a marlin test for this model type.

env:
  VLLM_USE_DEEP_GEMM: "0"
  VLLM_USE_DEEP_GEMM_MOE: "0"
  VLLM_TEST_FORCE_FP8_MARLIN: "1"

@@ -0,0 +1,7 @@
model_name: "RedHatAI/Qwen3-30B-A3B-NVFP4"
accuracy_threshold: 0.88
num_questions: 1319
num_fewshot: 5
server_args: "--enforce-eager --max-model-len 8192 --tensor-parallel-size 2"
env:
VLLM_TEST_FORCE_FP8_MARLIN: "1"
Comment on lines +6 to +7

high

The filename suggests a vllm-cutlass configuration, but the environment variable VLLM_TEST_FORCE_FP8_MARLIN is set, which forces the marlin kernel. This is inconsistent and will not test the intended vllm-cutlass kernel. The env block should be removed to allow the default kernel to be used.

@@ -0,0 +1,8 @@
model_name: "nvidia/Qwen3-30B-A3B-NVFP4"
accuracy_threshold: 0.88
num_questions: 1319
num_fewshot: 5
server_args: "--enforce-eager --max-model-len 8192 --data-parallel-size 2 --enable-expert-parallel"
env:
VLLM_USE_FLASHINFER_MOE_FP4: "1"
VLLM_FLASHINFER_MOE_BACKEND: "throughput"
@@ -0,0 +1,8 @@
model_name: "nvidia/Qwen3-30B-A3B-NVFP4"
accuracy_threshold: 0.88
num_questions: 1319
num_fewshot: 5
server_args: "--enforce-eager --max-model-len 8192 --tensor-parallel-size 2"
env:
VLLM_USE_FLASHINFER_MOE_FP4: "1"
VLLM_FLASHINFER_MOE_BACKEND: "throughput"
@@ -0,0 +1,8 @@
model_name: "nvidia/Qwen3-30B-A3B-NVFP4"
accuracy_threshold: 0.88
num_questions: 1319
num_fewshot: 5
server_args: "--enforce-eager --max-model-len 8192 --tensor-parallel-size 2"
env:
VLLM_USE_FLASHINFER_MOE_FP4: "1"
VLLM_FLASHINFER_MOE_BACKEND: "latency"
@@ -0,0 +1,8 @@
model_name: "nvidia/Qwen3-30B-A3B-NVFP4"
accuracy_threshold: 0.88
num_questions: 1319
num_fewshot: 5
server_args: "--enforce-eager --max-model-len 8192 --tensor-parallel-size 2"
env:
VLLM_USE_FLASHINFER_MOE_FP4: "1"
VLLM_FLASHINFER_MOE_BACKEND: "throughput"
Comment on lines +6 to +8

high

The filename indicates a marlin test configuration, but the environment variables are set for flashinfer. This is inconsistent and will not test the marlin kernel. Please update the environment variables to force the marlin kernel.

env:
  VLLM_TEST_FORCE_FP8_MARLIN: "1"

@@ -0,0 +1,7 @@
model_name: "nvidia/Qwen3-30B-A3B-NVFP4"
accuracy_threshold: 0.88
num_questions: 1319
num_fewshot: 5
server_args: "--enforce-eager --max-model-len 8192 --tensor-parallel-size 2"
env:
VLLM_TEST_FORCE_FP8_MARLIN: "1"
Comment on lines +6 to +7

high

The filename suggests a vllm-cutlass configuration, but the environment variable VLLM_TEST_FORCE_FP8_MARLIN is set, which forces the marlin kernel. This is inconsistent and will not test the intended vllm-cutlass kernel. The env block should be removed.

12 changes: 12 additions & 0 deletions tests/evals/gsm8k/configs/moe-refactor/config-b200.txt
@@ -0,0 +1,12 @@
Llama-4-Scout-Fp8-ModelOpt-fi-trtllm.yaml
Qwen3-30B-A3B-Fp8-AutoFp8-fi-trtllm.yaml

high

This line has trailing whitespace, which could cause parsing issues for scripts that consume this file, potentially leading to test failures. Please remove the trailing spaces.

Qwen3-30B-A3B-Fp8-AutoFp8-fi-trtllm.yaml

Qwen3-30B-A3B-NvFp4-CT-vllm-cutlass.yaml
Qwen3-30B-A3B-NvFp4-CT-marlin.yaml
Qwen3-30B-A3B-NvFp4-CT-fi-trtllm.yaml
Qwen3-30B-A3B-NvFp4-CT-fi-cutlass.yaml
Qwen3-30B-A3B-NvFp4-CT-fi-cutlass-dp-ep.yaml
Qwen3-30B-A3B-NvFp4-ModelOpt-vllm-cutlass.yaml
Qwen3-30B-A3B-NvFp4-ModelOpt-marlin.yaml
Qwen3-30B-A3B-NvFp4-ModelOpt-fi-trtllm.yaml
Qwen3-30B-A3B-NvFp4-ModelOpt-fi-cutlass.yaml
Qwen3-30B-A3B-NvFp4-ModelOpt-fi-cutlass-dp-ep.yaml
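As an alternative to stripping the trailing whitespace flagged above, the consumer of these list files can be made robust to it. A sketch of a defensive reader — the function name and the `#`-comment convention are illustrative assumptions:

```python
def read_config_list(text):
    """Parse a config-list file: one YAML filename per line, tolerating
    trailing whitespace, blank lines, and '#' comment lines."""
    names = []
    for raw in text.splitlines():
        line = raw.strip()
        if line and not line.startswith("#"):
            names.append(line)
    return names

sample = "a.yaml\nb.yaml  \n\n# skipped\nc.yaml\n"
print(read_config_list(sample))  # ['a.yaml', 'b.yaml', 'c.yaml']
```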
13 changes: 13 additions & 0 deletions tests/evals/gsm8k/configs/moe-refactor/config-h100.txt
@@ -0,0 +1,13 @@
Mixtral-8x7B-Fp8-AutoFp8-triton.yaml
Qwen3-30B-A3B-Fp8-AutoFp8-deepgemm.yaml
Qwen3-30B-A3B-Fp8-AutoFp8-fi-cutlass.yaml
Qwen3-30B-A3B-Fp8-AutoFp8-marlin.yaml
Qwen3-30B-A3B-Fp8-AutoFp8-triton.yaml
Qwen3-30B-A3B-Fp8-CT-Block-deepgemm.yaml
Qwen3-30B-A3B-Fp8-CT-Block-marlin.yaml
Qwen3-30B-A3B-Fp8-CT-Block-vllm-cutlass.yaml
Qwen3-30B-A3B-Fp8-CT-Channel-marlin.yaml
Qwen3-30B-A3B-Fp8-CT-Channel-vllm-cutlass.yaml
Llama-4-Scout-Fp8-ModelOpt-fi-cutlass.yaml
Llama-4-Scout-Fp8-ModelOpt-marlin.yaml
Llama-4-Scout-Fp8-ModelOpt-triton.yaml