[MoE Refactor][15/N] Apply Refactor to Fp8#31415
Conversation
Signed-off-by: Robert Shaw <robshaw@redhat.com>
num_questions: 1319
num_fewshot: 5
- server_args: "--enforce-eager --max-model-len 8192 --tensor-parallel-size 2"
+ server_args: "--enforce-eager --max-model-len 8192 --tensor-parallel-size 2 --disable-uvicorn-access-log"
What is this? Do you just want to add it to the eval script here:
vllm/tests/evals/gsm8k/test_gsm8k_correctness.py (lines 60 to 65 in c907d22)
Yeah, that's a good idea. It makes the logs much easier to read by not logging /completions on every request.
# Delayed import is required since the oracle is imported
# by CPU backends which cannot import all of these experts.
# TODO: update the experts to make this not happen.
from vllm.model_executor.layers.fused_moe import (
    TritonExperts,
    TritonOrDeepGemmExperts,
)
from vllm.model_executor.layers.fused_moe.cutlass_moe import (
    CutlassExpertsFp8,
)
from vllm.model_executor.layers.fused_moe.flashinfer_cutlass_moe import (
    FlashInferExperts,
)
from vllm.model_executor.layers.fused_moe.fused_marlin_moe import (
    MarlinExperts,
)
from vllm.model_executor.layers.fused_moe.prepare_finalize import (
    MoEPrepareAndFinalizeNoEP,
)
from vllm.model_executor.layers.fused_moe.rocm_aiter_fused_moe import (
    AiterExperts,
)
I think we should just put each import within its conditional, rather than importing all of them up front.
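A minimal sketch of the suggested pattern, with stdlib modules standing in for the heavy fused-MoE kernel modules (nothing here is vllm's real API): each implementation is imported inside the branch that selects it, so backends that cannot import a given kernel never touch it.

```python
import sys


def make_experts(use_cutlass: bool):
    """Select an experts implementation, importing it lazily.

    `json` and `csv` are hypothetical stand-ins for the real kernel
    modules (e.g. CutlassExpertsFp8, TritonExperts); only the module
    for the chosen branch is ever imported.
    """
    if use_cutlass:
        import json as kernel_mod  # imported only on this path
    else:
        import csv as kernel_mod  # never touched when use_cutlass=True
    return kernel_mod


# Drop csv in case an earlier import pulled it in, then show that the
# unused branch's module stays un-imported.
sys.modules.pop("csv", None)
experts = make_experts(use_cutlass=True)
print(experts.__name__)
print("csv" in sys.modules)
```

With this shape, a CPU backend that only ever takes one branch never pays the import cost (or import failure) of the others.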
zyongye left a comment:
The overall structure LGTM. But we will need to revisit the fallback experts abstraction in the future as discussed.
assert w2_input_scale is not None

rotate_weights_for_fi_trtllm_fp8_per_tensor_moe(w13, w2)
register_scales_for_trtllm_fp8_per_tensor_moe(
Should we register this inside the FusedMoEMethod instead of in the utils?
There is some work underway to make TRTLLM a modular kernel. Once that is done, we can revisit this.
Great work!
Signed-off-by: dsuhinin <suhinin.dmitriy@gmail.com>
Purpose
SUMMARY:
Test Plan
python3 ./benchmark_grouped_gemm_cutlass.py --model "deepseek-ai/DeepSeek-V2-Lite" --tp-sizes 4 --batch-sizes 2 4

python benchmark_cutlass_moe_fp8.py \
  --model "Llama-4-Maverick-17B-128E-Instruct-FP8" \
  --tp-sizes 8 \
  --batch-size 2 4 8 \
  --per-act-token-opts false \
  --per-out-ch-opts false

Test Result
Essential Elements of an Effective PR Description Checklist
supported_models.md and examples for a new model.