Add more ci for moe refactor b200 #31769
Changes from all commits: dd90b6d, acaa9a5, ab6a497, e16b007, 546b42a, dfa7775, d68cff5, b18f148, 4386dd8, 1fb843b
New file (`@@ -0,0 +1,8 @@`):

```yaml
model_name: "nvidia/Llama-4-Scout-17B-16E-Instruct-FP8"
accuracy_threshold: 0.92
num_questions: 1319
num_fewshot: 5
server_args: "--enforce-eager --max-model-len 8192 --tensor-parallel-size 2"
env:
  VLLM_USE_FLASHINFER_MOE_FP8: "1"
  VLLM_FLASHINFER_MOE_BACKEND: "throughput"
```
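To make the shape of these configs concrete, here is a minimal sketch of how an eval harness might turn one of them into a server launch command plus environment. The `build_launch` helper is hypothetical (not the harness's actual API); the config values are copied from the file above.

```python
import os
import shlex

# Config dict as it would come out of the YAML file above.
config = {
    "model_name": "nvidia/Llama-4-Scout-17B-16E-Instruct-FP8",
    "accuracy_threshold": 0.92,
    "num_questions": 1319,
    "num_fewshot": 5,
    "server_args": "--enforce-eager --max-model-len 8192 --tensor-parallel-size 2",
    "env": {
        "VLLM_USE_FLASHINFER_MOE_FP8": "1",
        "VLLM_FLASHINFER_MOE_BACKEND": "throughput",
    },
}

def build_launch(cfg):
    """Hypothetical helper: compose `vllm serve <model> <server_args>`
    with the config's env vars layered over the current environment."""
    cmd = ["vllm", "serve", cfg["model_name"], *shlex.split(cfg["server_args"])]
    env = {**os.environ, **cfg.get("env", {})}
    return cmd, env

cmd, env = build_launch(config)
```

`shlex.split` keeps the quoted `server_args` string faithful to how a shell would tokenize it, so flags with values (`--max-model-len 8192`) arrive as separate argv entries.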
New file (`@@ -0,0 +1,8 @@`):

```yaml
model_name: "nvidia/Llama-4-Scout-17B-16E-Instruct-FP8"
accuracy_threshold: 0.92
num_questions: 1319
num_fewshot: 5
server_args: "--enforce-eager --max-model-len 8192 --tensor-parallel-size 2"
env:
  VLLM_USE_FLASHINFER_MOE_FP8: "1"
  VLLM_FLASHINFER_MOE_BACKEND: "latency"
```
New file (`@@ -0,0 +1,7 @@`):

```yaml
model_name: "nvidia/Llama-4-Scout-17B-16E-Instruct-FP8"
accuracy_threshold: 0.92
num_questions: 1319
num_fewshot: 5
server_args: "--enforce-eager --max-model-len 8192 --tensor-parallel-size 2"
env:
  VLLM_TEST_FORCE_FP8_MARLIN: "1"
```
New file (`@@ -0,0 +1,6 @@`):

```yaml
model_name: "nvidia/Llama-4-Scout-17B-16E-Instruct-FP8"
accuracy_threshold: 0.92
num_questions: 1319
num_fewshot: 5
server_args: "--enforce-eager --max-model-len 8192 --tensor-parallel-size 2"
```
New file (`@@ -0,0 +1,9 @@`):

```yaml
# TODO(rob): enable
# model_name: "amd/Mixtral-8x7B-Instruct-v0.1-FP8-KV"
# accuracy_threshold: 0.62
# num_questions: 1319
# num_fewshot: 5
# server_args: "--enforce-eager --max-model-len 8192 --tensor-parallel-size 2"
# env:
#   VLLM_USE_FLASHINFER_MOE_FP8: "1"
#   VLLM_FLASHINFER_MOE_BACKEND: "throughput"
```
New file (`@@ -0,0 +1,5 @@`):

```yaml
model_name: "amd/Mixtral-8x7B-Instruct-v0.1-FP8-KV"
accuracy_threshold: 0.62
num_questions: 1319
num_fewshot: 5
server_args: "--enforce-eager --max-model-len 8192 --tensor-parallel-size 2"
```
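The `accuracy_threshold` field in each of these configs encodes the pass/fail criterion: the CI run fails if the measured GSM8K accuracy drops below the threshold. A one-line sketch of that check (the `passes` helper is illustrative, not the harness's actual function name):

```python
# Hypothetical pass/fail criterion encoded by accuracy_threshold:
# the eval fails CI if measured accuracy falls below the configured bar.
def passes(measured_accuracy: float, threshold: float) -> bool:
    return measured_accuracy >= threshold
```

For example, a Mixtral run scoring 0.65 clears its 0.62 bar, while 0.60 would fail the job.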
New file (`@@ -0,0 +1,8 @@`):

```yaml
model_name: "Qwen/Qwen3-Coder-30B-A3B-Instruct-FP8"
accuracy_threshold: 0.88
num_questions: 1319
num_fewshot: 5
server_args: "--enforce-eager --max-model-len 8192 --tensor-parallel-size 2"
env:
  VLLM_USE_DEEP_GEMM: "1"
  VLLM_USE_DEEP_GEMM_MOE: "1"
```
New file (`@@ -0,0 +1,10 @@`):

```yaml
model_name: "Qwen/Qwen3-Coder-30B-A3B-Instruct-FP8"
accuracy_threshold: 0.88
num_questions: 1319
num_fewshot: 5
server_args: "--enforce-eager --max-model-len 8192 --tensor-parallel-size 2"
env:
  VLLM_USE_DEEP_GEMM: "0"
  VLLM_USE_DEEP_GEMM_MOE: "0"
  VLLM_USE_FLASHINFER_MOE_FP8: "1"
  VLLM_FLASHINFER_MOE_BACKEND: "throughput"
```
New file (`@@ -0,0 +1,10 @@`):

```yaml
model_name: "Qwen/Qwen3-Coder-30B-A3B-Instruct-FP8"
accuracy_threshold: 0.88
num_questions: 1319
num_fewshot: 5
server_args: "--enforce-eager --max-model-len 8192 --tensor-parallel-size 2"
env:
  VLLM_USE_DEEP_GEMM: "0"
  VLLM_USE_DEEP_GEMM_MOE: "0"
  VLLM_USE_FLASHINFER_MOE_FP8: "1"
  VLLM_FLASHINFER_MOE_BACKEND: "latency"
```
New file (`@@ -0,0 +1,9 @@`):

```yaml
model_name: "Qwen/Qwen3-Coder-30B-A3B-Instruct-FP8"
accuracy_threshold: 0.88
num_questions: 1319
num_fewshot: 5
server_args: "--enforce-eager --max-model-len 8192 --tensor-parallel-size 2"
env:
  VLLM_USE_DEEP_GEMM: "0"
  VLLM_USE_DEEP_GEMM_MOE: "0"
  VLLM_TEST_FORCE_FP8_MARLIN: "1"
```
New file (`@@ -0,0 +1,8 @@`):

```yaml
model_name: "Qwen/Qwen3-Coder-30B-A3B-Instruct-FP8"
accuracy_threshold: 0.88
num_questions: 1319
num_fewshot: 5
server_args: "--enforce-eager --max-model-len 8192 --tensor-parallel-size 2"
env:
  VLLM_USE_DEEP_GEMM: "0"
  VLLM_USE_DEEP_GEMM_MOE: "0"
```
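Taken together, the env blocks across these configs sweep the fused-MoE kernel backends: DeepGEMM, FlashInfer (throughput/latency), Marlin, and the default Triton path. The following is an illustration only of how such flags could map to a backend name; it is not vLLM's actual dispatch logic, and `resolve_moe_backend` is a hypothetical function:

```python
# Illustrative only: a possible mapping from the env flags used in these
# configs to a fused-MoE backend label. Not vLLM's real dispatch code.
def resolve_moe_backend(env: dict) -> str:
    if env.get("VLLM_USE_FLASHINFER_MOE_FP8") == "1":
        # FlashInfer splits into "throughput" and "latency" variants.
        return "flashinfer-" + env.get("VLLM_FLASHINFER_MOE_BACKEND", "throughput")
    if env.get("VLLM_TEST_FORCE_FP8_MARLIN") == "1":
        return "marlin"
    if env.get("VLLM_USE_DEEP_GEMM_MOE") == "1":
        return "deepgemm"
    return "triton"  # default path when every override is off
```

Under this sketch, the config above (both DeepGEMM flags set to "0", nothing else) exercises the default Triton path.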
New file (`@@ -0,0 +1,8 @@`):

```yaml
model_name: "RedHatAI/Qwen3-30B-A3B-FP8-block"
accuracy_threshold: 0.85
num_questions: 1319
num_fewshot: 5
server_args: "--enforce-eager --max-model-len 8192 --tensor-parallel-size 2"
env:
  VLLM_USE_DEEP_GEMM: "1"
  VLLM_USE_DEEP_GEMM_MOE: "1"
```
New file (`@@ -0,0 +1,10 @@`):

```yaml
model_name: "RedHatAI/Qwen3-30B-A3B-FP8-block"
accuracy_threshold: 0.85
num_questions: 1319
num_fewshot: 5
server_args: "--enforce-eager --max-model-len 8192 --tensor-parallel-size 2"
env:
  VLLM_USE_DEEP_GEMM: "1"
  VLLM_USE_DEEP_GEMM_MOE: "1"
  VLLM_USE_FLASHINFER_MOE_FP8: "1"
  VLLM_FLASHINFER_MOE_BACKEND: "throughput"
```
New file (`@@ -0,0 +1,9 @@`):

```yaml
model_name: "RedHatAI/Qwen3-30B-A3B-FP8-block"
accuracy_threshold: 0.85
num_questions: 1319
num_fewshot: 5
server_args: "--enforce-eager --max-model-len 8192 --tensor-parallel-size 2"
env:
  VLLM_USE_DEEP_GEMM: "0"
  VLLM_USE_DEEP_GEMM_MOE: "0"
  VLLM_TEST_FORCE_FP8_MARLIN: "1"
```
New file (`@@ -0,0 +1,5 @@`):

```yaml
model_name: "RedHatAI/Qwen3-30B-A3B-FP8-block"
accuracy_threshold: 0.85
num_questions: 1319
num_fewshot: 5
server_args: "--enforce-eager --max-model-len 8192 --tensor-parallel-size 2"
```
New file (`@@ -0,0 +1,5 @@`):

```yaml
model_name: "RedHatAI/Qwen3-30B-A3B-FP8-dynamic"
accuracy_threshold: 0.85
num_questions: 1319
num_fewshot: 5
server_args: "--enforce-eager --max-model-len 8192 --tensor-parallel-size 2"
```
New file (`@@ -0,0 +1,7 @@`):

```yaml
model_name: "RedHatAI/Qwen3-30B-A3B-FP8-dynamic"
accuracy_threshold: 0.85
num_questions: 1319
num_fewshot: 5
server_args: "--enforce-eager --max-model-len 8192 --tensor-parallel-size 2"
env:
  VLLM_TEST_FORCE_FP8_MARLIN: "1"
```
New file (`@@ -0,0 +1,8 @@`):

```yaml
model_name: "RedHatAI/Qwen3-30B-A3B-NVFP4"
accuracy_threshold: 0.88
num_questions: 1319
num_fewshot: 5
server_args: "--enforce-eager --max-model-len 8192 --data-parallel-size 2 --enable-expert-parallel"
env:
  VLLM_USE_FLASHINFER_MOE_FP4: "1"
  VLLM_FLASHINFER_MOE_BACKEND: "throughput"
```
New file (`@@ -0,0 +1,8 @@`):

```yaml
model_name: "RedHatAI/Qwen3-30B-A3B-NVFP4"
accuracy_threshold: 0.88
num_questions: 1319
num_fewshot: 5
server_args: "--enforce-eager --max-model-len 8192 --tensor-parallel-size 2"
env:
  VLLM_USE_FLASHINFER_MOE_FP4: "1"
  VLLM_FLASHINFER_MOE_BACKEND: "throughput"
```
New file (`@@ -0,0 +1,8 @@`):

```yaml
model_name: "RedHatAI/Qwen3-30B-A3B-NVFP4"
accuracy_threshold: 0.88
num_questions: 1319
num_fewshot: 5
server_args: "--enforce-eager --max-model-len 8192 --tensor-parallel-size 2"
env:
  VLLM_USE_FLASHINFER_MOE_FP4: "1"
  VLLM_FLASHINFER_MOE_BACKEND: "latency"
```
New file (`@@ -0,0 +1,8 @@`):

```yaml
model_name: "RedHatAI/Qwen3-30B-A3B-NVFP4"
accuracy_threshold: 0.88
num_questions: 1319
num_fewshot: 5
server_args: "--enforce-eager --max-model-len 8192 --tensor-parallel-size 2"
env:
  VLLM_USE_FLASHINFER_MOE_FP4: "1"
  VLLM_FLASHINFER_MOE_BACKEND: "throughput"
```
Contributor comment on lines +6 to +8: The filename indicates a Marlin config, but the env block selects the FlashInfer backend. The suggested env is:

```yaml
env:
  VLLM_USE_DEEP_GEMM: "0"
  VLLM_USE_DEEP_GEMM_MOE: "0"
  VLLM_TEST_FORCE_FP8_MARLIN: "1"
```
New file (`@@ -0,0 +1,7 @@`):

```yaml
model_name: "RedHatAI/Qwen3-30B-A3B-NVFP4"
accuracy_threshold: 0.88
num_questions: 1319
num_fewshot: 5
server_args: "--enforce-eager --max-model-len 8192 --tensor-parallel-size 2"
env:
  VLLM_TEST_FORCE_FP8_MARLIN: "1"
```
New file (`@@ -0,0 +1,8 @@`):

```yaml
model_name: "nvidia/Qwen3-30B-A3B-NVFP4"
accuracy_threshold: 0.88
num_questions: 1319
num_fewshot: 5
server_args: "--enforce-eager --max-model-len 8192 --data-parallel-size 2 --enable-expert-parallel"
env:
  VLLM_USE_FLASHINFER_MOE_FP4: "1"
  VLLM_FLASHINFER_MOE_BACKEND: "throughput"
```
New file (`@@ -0,0 +1,8 @@`):

```yaml
model_name: "nvidia/Qwen3-30B-A3B-NVFP4"
accuracy_threshold: 0.88
num_questions: 1319
num_fewshot: 5
server_args: "--enforce-eager --max-model-len 8192 --tensor-parallel-size 2"
env:
  VLLM_USE_FLASHINFER_MOE_FP4: "1"
  VLLM_FLASHINFER_MOE_BACKEND: "throughput"
```
New file (`@@ -0,0 +1,8 @@`):

```yaml
model_name: "nvidia/Qwen3-30B-A3B-NVFP4"
accuracy_threshold: 0.88
num_questions: 1319
num_fewshot: 5
server_args: "--enforce-eager --max-model-len 8192 --tensor-parallel-size 2"
env:
  VLLM_USE_FLASHINFER_MOE_FP4: "1"
  VLLM_FLASHINFER_MOE_BACKEND: "latency"
```
New file (`@@ -0,0 +1,8 @@`):

```yaml
model_name: "nvidia/Qwen3-30B-A3B-NVFP4"
accuracy_threshold: 0.88
num_questions: 1319
num_fewshot: 5
server_args: "--enforce-eager --max-model-len 8192 --tensor-parallel-size 2"
env:
  VLLM_USE_FLASHINFER_MOE_FP4: "1"
  VLLM_FLASHINFER_MOE_BACKEND: "throughput"
```
New file (`@@ -0,0 +1,7 @@`):

```yaml
model_name: "nvidia/Qwen3-30B-A3B-NVFP4"
accuracy_threshold: 0.88
num_questions: 1319
num_fewshot: 5
server_args: "--enforce-eager --max-model-len 8192 --tensor-parallel-size 2"
env:
  VLLM_TEST_FORCE_FP8_MARLIN: "1"
```
New file (`@@ -0,0 +1,12 @@`):

```text
Llama-4-Scout-Fp8-ModelOpt-fi-trtllm.yaml
Qwen3-30B-A3B-Fp8-AutoFp8-fi-trtllm.yaml
Qwen3-30B-A3B-NvFp4-CT-vllm-cutlass.yaml
Qwen3-30B-A3B-NvFp4-CT-marlin.yaml
Qwen3-30B-A3B-NvFp4-CT-fi-trtllm.yaml
Qwen3-30B-A3B-NvFp4-CT-fi-cutlass.yaml
Qwen3-30B-A3B-NvFp4-CT-fi-cutlass-dp-ep.yaml
Qwen3-30B-A3B-NvFp4-ModelOpt-vllm-cutlass.yaml
Qwen3-30B-A3B-NvFp4-ModelOpt-marlin.yaml
Qwen3-30B-A3B-NvFp4-ModelOpt-fi-trtllm.yaml
Qwen3-30B-A3B-NvFp4-ModelOpt-fi-cutlass.yaml
Qwen3-30B-A3B-NvFp4-ModelOpt-fi-cutlass-dp-ep.yaml
```
New file (`@@ -0,0 +1,13 @@`):

```text
Mixtral-8x7B-Fp8-AutoFp8-triton.yaml
Qwen3-30B-A3B-Fp8-AutoFp8-deepgemm.yaml
Qwen3-30B-A3B-Fp8-AutoFp8-fi-cutlass.yaml
Qwen3-30B-A3B-Fp8-AutoFp8-marlin.yaml
Qwen3-30B-A3B-Fp8-AutoFp8-triton.yaml
Qwen3-30B-A3B-Fp8-CT-Block-deepgemm.yaml
Qwen3-30B-A3B-Fp8-CT-Block-marlin.yaml
Qwen3-30B-A3B-Fp8-CT-Block-vllm-cutlass.yaml
Qwen3-30B-A3B-Fp8-CT-Channel-marlin.yaml
Qwen3-30B-A3B-Fp8-CT-Channel-vllm-cutlass.yaml
Llama-4-Scout-Fp8-ModelOpt-fi-cutlass.yaml
Llama-4-Scout-Fp8-ModelOpt-marlin.yaml
Llama-4-Scout-Fp8-ModelOpt-triton.yaml
```
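The two files above are plain-text config lists: one YAML filename per line, which the test runner consumes via `--config-list-file`. A minimal sketch of such a reader (the `read_config_list` helper is hypothetical, assuming blank lines and `#` comments should be skipped):

```python
# Hypothetical reader for a config-list file such as config-h100.txt:
# one YAML filename per line; blank lines and '#' comments are ignored.
def read_config_list(text: str) -> list[str]:
    names = []
    for line in text.splitlines():
        line = line.strip()
        if line and not line.startswith("#"):
            names.append(line)
    return names

sample = """Mixtral-8x7B-Fp8-AutoFp8-triton.yaml

# temporarily disabled:
Qwen3-30B-A3B-Fp8-AutoFp8-deepgemm.yaml
"""
configs = read_config_list(sample)
```

Keeping the list in a separate `.txt` per GPU generation lets the same pytest invocation select hardware-appropriate configs just by swapping the list file.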
Contributor comment: The B200 integration test is incorrectly using the configuration file for H100 (`config-h100.txt`). This will cause the wrong set of tests to be executed on the B200 hardware. It should be using `config-b200.txt` to run the tests intended for B200:

```yaml
- pytest -s -v evals/gsm8k/test_gsm8k_correctness.py --config-list-file=evals/gsm8k/configs/moe-refactor/config-b200.txt
```