Conversation
Code Review
This pull request introduces a proof-of-concept for a naive dispatch/combine mechanism for expert parallelism and refactors the dispatch method across various device communicators to use topk_weights and topk_ids. My review has identified a few critical issues in the new MoEPrepareAndFinalizeNaiveEP implementation that could lead to runtime errors or incorrect behavior. Additionally, there are some debugging log statements that should be removed before merging.
```python
a1, _, extra_tensors = get_ep_group().dispatch(
    a1,
    self.dummy_tensor,  # router logits
    is_sequence_parallel=False,  # TODO?
    extra_tensors=extra_tensors,
)
```
The call to get_ep_group().dispatch appears to use an outdated signature. The dispatch methods in vllm/distributed/device_communicators have been updated to accept topk_weights and topk_ids directly. However, the calling signature from GroupCoordinator in parallel_state.py (which is what get_ep_group() returns) seems to be unchanged, leading to a signature mismatch where a boolean is_sequence_parallel is passed as topk_ids (a tensor). This will likely cause a TypeError at runtime. This is a critical issue that needs to be addressed, likely by updating vllm/distributed/parallel_state.py and then adjusting this call accordingly.
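A minimal sketch of how this kind of mismatch surfaces at runtime. The function names and simplified string arguments below are hypothetical stand-ins, not the actual vLLM code; the real signatures take tensors and live under `vllm/distributed/`:

```python
# Hypothetical, simplified stand-ins for the two mismatched signatures.

def dispatch_updated(hidden_states, topk_weights, topk_ids):
    """Device-communicator dispatch after the refactor: expects
    topk_weights and topk_ids, no is_sequence_parallel parameter."""
    return (hidden_states, topk_weights, topk_ids)


def old_style_call():
    """Call shaped like the snippet above: passes router logits plus
    is_sequence_parallel, which the updated signature no longer accepts."""
    try:
        dispatch_updated("a1", "router_logits", is_sequence_parallel=False)
    except TypeError as exc:
        return str(exc)
    return "no error"


msg = old_style_call()
print(msg)  # the TypeError message names the rejected keyword argument
```

This is why the review flags a `TypeError`: once the callee's parameters change, the unchanged `GroupCoordinator` call site either feeds values into the wrong slots or passes keywords the callee no longer defines.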
```python
logger.info_once(f"{router_logits.shape=}")
topk_weights, topk_ids = layer.select_experts(
    hidden_states=x,
    router_logits=router_logits,
)
logger.info_once(f"{router_logits.shape=}")
logger.info_once(f"{topk_weights.shape=}")
```
This pull request has merge conflicts that must be resolved before it can be merged.
```python
def topk_indices_dtype(self) -> torch.dtype | None:
    return None

def num_dispatchers(self) -> int:
```
what should this be?
For now, this can be the all2all_manager world size. If DP+TP is supported, then it is divided by the TP world size.
why do we need to know num_dispatcher?
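A hedged sketch of the rule described in the reply above. The function and parameter names here mirror the discussion (`all2all_manager` world size, TP world size) but are illustrative assumptions, not the actual vLLM attributes:

```python
def num_dispatchers(all2all_world_size: int,
                    tp_world_size: int = 1,
                    dp_tp_supported: bool = False) -> int:
    """Illustrative only: number of ranks that dispatch tokens.

    Per the discussion, this starts as the all2all manager's world
    size; if DP+TP were supported, ranks in the same TP group would
    share one dispatcher, so we divide by the TP world size.
    """
    if dp_tp_supported:
        assert all2all_world_size % tp_world_size == 0
        return all2all_world_size // tp_world_size
    return all2all_world_size


print(num_dispatchers(8))           # -> 8
print(num_dispatchers(8, 2, True))  # -> 4
```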
Signed-off-by: Robert Shaw <114415538+robertgshaw2-redhat@users.noreply.github.com>
Signed-off-by: Robert Shaw <robshaw@redhat.com>
… cutlass Signed-off-by: Robert Shaw <robshaw@redhat.com>
```python
from vllm.platforms import current_platform

# The torch ops do not support fp8, so use an int8 view.
```
which ops are missing? too late for this PR, but we can improve fp8 coverage in PyTorch to make things better in the future
float8_e4m3fn
i.e. the usual one for nvidia gpus
Nice, this seems much simpler than I expected.
Signed-off-by: Robert Shaw <114415538+robertgshaw2-redhat@users.noreply.github.com>
replaced by #32567
Purpose
Test Plan
Test Result
Essential Elements of an Effective PR Description Checklist
`supported_models.md` and `examples` for a new model.