[Kernel] DP + EP : GPTOSS + DeepEP-HighThroughput#22907
varun-sundar-rabindranath wants to merge 8 commits into vllm-project:main from
Conversation
Signed-off-by: Varun Sundar Rabindranath <vsundarr@redhat.com>
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels. Just a reminder: PRs do not trigger a full CI run by default. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging. To run CI, PR reviewers can either: Add 🚀
Code Review
This pull request integrates GPT-OSS with DeepEPHT, introducing a new path for Mixture-of-Experts (MoE) layers using flashinfer kernels, specifically for data and expert parallelism. It adds support for mxfp8 quantization and a new trtllm_moe layer. The changes are quite extensive. My review has identified some leftover debugging code and comments in vllm/model_executor/layers/quantization/mxfp4.py which should be removed before this pull request is merged. Given that the pull request description indicates debugging is still in progress, these findings serve as a reminder for cleanup.
```python
if False:
    # TODO(varun) : remove before landing
    return self._route_and_experts_example(
        layer, x, router_logits, top_k, renormalize, use_grouped_topk,
        topk_group, num_expert_group, global_num_experts, expert_map,
        custom_routing_function, scoring_func, e_score_correction_bias,
        apply_router_weight_on_input, activation, enable_eplb,
        expert_load_view, logical_to_physical_map,
        logical_replica_count)
else:
    #pass
```
```python
if (envs.VLLM_USE_FLASHINFER_MOE_MXFP4_MXFP8
        or envs.VLLM_USE_FLASHINFER_MOE_MXFP4_BF16):
    # B200 code ??
    # Quant config shouldn't be None !!
    return TrtLlmGenExperts(moe)
else:
    # H100 code ??
    # you use matmul_ogs kernel here!
    raise NotImplementedError(
        "Mxfp4 does not support non-batched experts format for EP")
```
This else block contains leftover debugging comments and a commented-out `pass` statement. These should be removed from production code to improve clarity and maintainability.
```python
else:
    if (envs.VLLM_USE_FLASHINFER_MOE_MXFP4_MXFP8
            or envs.VLLM_USE_FLASHINFER_MOE_MXFP4_BF16):
        return TrtLlmGenExperts(moe)
    else:
        raise NotImplementedError(
            "Mxfp4 does not support non-batched experts format for EP")
```
Also a side note, please use the evaluation strategy in the recipe instead of
This pull request has merge conflicts that must be resolved before it can be merged.
```python
    logical_replica_count: Optional[torch.Tensor] = None
) -> torch.Tensor:

    topk_weights, topk_ids = FusedMoE.select_experts(
```
`topk_ids` and `topk_weights` need to be local, and non-local experts' IDs should be -1.
Or use global `topk_ids` and `topk_weights`, and provide `local_expert_offset` and `local_num_experts`.
The `topk_ids` and `topk_weights` get processed into the second form — global `topk_ids` and `topk_weights`, with `local_expert_offset` and `local_num_experts` provided — in the all2alls. I verified that this is correct.
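For context, the first option the reviewer describes — keeping only the experts owned by this EP rank and masking non-local expert IDs with -1 — can be sketched as follows. This is a hypothetical standalone illustration (the function name and plain-list types are mine, not vLLM's API):

```python
def mask_to_local_experts(topk_ids, local_expert_offset, local_num_experts):
    """Map global expert IDs to local IDs; mark non-local experts as -1.

    topk_ids: list of per-token lists of global expert IDs.
    """
    local_ids = []
    for row in topk_ids:
        local_row = []
        for eid in row:
            local = eid - local_expert_offset
            # Experts owned by another EP rank are masked out with -1.
            local_row.append(local if 0 <= local < local_num_experts else -1)
        local_ids.append(local_row)
    return local_ids

# A rank owning global experts [4, 8):
# mask_to_local_experts([[1, 5], [7, 12]], 4, 4) -> [[-1, 1], [3, -1]]
```

The second option avoids this masking entirely by handing the kernel the global IDs plus the (offset, count) pair describing this rank's expert slice.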
```python
    None,
    "tile_tokens_dim":
    self._get_tile_tokens_dim(x_quant, topk, local_num_experts),
    "routing_method_type":
```
`routing_method_type` is hardcoded to renormalize. Maybe add an assertion above to make sure it's not using a different routing method.
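The guard the reviewer suggests could look roughly like this — a hypothetical helper (name and signature are mine), assuming the `renormalize` flag and `custom_routing_function` arguments visible in the diff above are in scope at the call site:

```python
def check_renormalize_routing(renormalize, custom_routing_function):
    # routing_method_type is hardcoded to Renormalize on this path, so
    # reject any configuration that would select a different method.
    if not renormalize or custom_routing_function is not None:
        raise ValueError(
            "This MoE path hardcodes routing_method_type=Renormalize; "
            f"got renormalize={renormalize}, "
            f"custom_routing_function={custom_routing_function}")
```

Failing loudly here is cheaper than debugging silently wrong routing downstream in the fused kernel.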
Update on debugging:
```python
        "MX-FP8 quantization. Please install it with" \
        "`pip install flashinfer`") from err

    return mxfp8_quantize(x, is_sf_swizzled_layout=False)
```
@IwakuraRein is this the right way to quantize bf16 activations to fp8? Thanks.
Yes, when you are using the mxfp4 x mxfp8 path.
If torch.compile is using garbage values to initialize
Purpose
Integrate GPT-OSS with `DeepEPHTPrepareFinalize`.
Commands:

server command:

```shell
VLLM_ALL2ALL_BACKEND="deepep_high_throughput" vllm serve openai/gpt-oss-20b --port 9010 --data-parallel-size 2 --enable-expert-parallel --no-enable-prefix-caching
```

lm_eval command:

```shell
lm_eval --model local-completions --tasks gsm8k --model_args model=openai/gpt-oss-20b,base_url=http://127.0.0.1:9010/v1/completions,num_concurrent=30,max_retries=3 --limit 100
```

Issue: The server sometimes hangs / reports an IMA. When the server runs through, the lm_eval outputs are good. They look like
and match main TP.
Debugging:
This PR uses `trtllm_fp4_block_scale_routed_moe` from flashinfer. I narrowed the issue down to the flashinfer kernel: `mPtrExpertCounts` is not big enough. I am still debugging this.
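To illustrate the class of bug being chased (this is a hypothetical host-side sketch, not the actual CUDA kernel): if a per-expert counts buffer is sized for fewer experts than the routing IDs can reference, indexing it is an out-of-bounds write — on the GPU that surfaces as a hang or an IMA. The function and names below are mine; only `mPtrExpertCounts` comes from the kernel:

```python
def count_tokens_per_expert(topk_ids, num_experts):
    # Fixed-size counts buffer (analogous to mPtrExpertCounts). An expert
    # ID >= num_experts would overflow it in a CUDA kernel; here we raise
    # instead to make the failure visible.
    counts = [0] * num_experts
    for row in topk_ids:
        for eid in row:
            if eid < 0:
                continue  # masked / non-local expert
            if eid >= num_experts:
                raise IndexError(
                    f"expert id {eid} out of range for counts buffer of "
                    f"size {num_experts}")
            counts[eid] += 1
    return counts
```

An undersized buffer silently corrupts adjacent memory on device, which is consistent with the intermittent hang/IMA symptom described above.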
Test Plan
Test Result
(Optional) Documentation Update
Essential Elements of an Effective PR Description Checklist
`supported_models.md` and `examples` for a new model.