[MoE Refactor] DefaultMoERunner simplification #33049
robertgshaw2-redhat merged 5 commits into vllm-project:main
Conversation
Documentation preview: https://vllm--33049.org.readthedocs.build/en/33049/
Code Review
This pull request introduces a significant and beneficial refactoring of the MoE implementation. By moving the forward pass logic from the FusedMoE layer into a new DefaultMoERunner class, the code is now cleaner, more modular, and easier to maintain. The dynamic registration of custom ops per layer is also a great improvement. I've found a couple of issues in the new implementation that need to be addressed, one of which is critical. Otherwise, this is a solid simplification.
vllm/model_executor/layers/fused_moe/runner/default_moe_runner.py (422-470)
The prepare_dp_allgather_tensor method is being called with incorrect arguments. It expects a layer object as its first argument, but it's receiving self (the DefaultMoERunner instance) instead. This will cause a runtime error when post_quant_allgather is true.
This suggestion fixes the issue by passing the layer object to _maybe_dispatch and using it correctly. You will also need to update the call to _maybe_dispatch in forward_impl to pass the layer object:
# in forward_impl
hidden_states, router_logits, extra_tensor = self._maybe_dispatch(
    layer, hidden_states, router_logits
)

def _maybe_dispatch(
    self,
    layer: torch.nn.Module,
    hidden_states: torch.Tensor,
    router_logits: torch.Tensor,
) -> tuple[torch.Tensor, torch.Tensor, torch.Tensor | None]:
    extra_tensor: torch.Tensor | None = None
    if self.do_naive_dispatch_combine:
        post_quant_allgather = (
            self.moe_config.dp_size > 1
            and self.moe_config.use_ep
            and getattr(self.quant_method, "do_post_quant_allgather", False)
        )
        if post_quant_allgather:
            hidden_states_to_dispatch, extra_tensors = (
                self.quant_method.prepare_dp_allgather_tensor(
                    layer, hidden_states, router_logits
                )
            )
        else:
            hidden_states_to_dispatch = hidden_states
            extra_tensors = None
        hidden_states, router_logits, extra_tensors_dispatched = (
            get_ep_group().dispatch(
                hidden_states_to_dispatch,
                router_logits,
                self.moe_config.is_sequence_parallel,
                extra_tensors=extra_tensors,
            )
        )
        if extra_tensors_dispatched is not None:
            assert len(extra_tensors_dispatched) == 1
            extra_tensor = extra_tensors_dispatched[0]
    # NOTE: Similar to DP, PCP also needs dispatch and combine. For
    # simplicity, AgRsAll2All was added separately for PCP here. Maybe
    # we should modify the All2AllManager abstraction to better support PCP.
    if self.moe_config.pcp_size > 1:
        hidden_states = get_pcp_group().all_gather(
            hidden_states,
            dim=0,
        )
        router_logits = get_pcp_group().all_gather(
            router_logits,
            dim=0,
        )
    return hidden_states, router_logits, extra_tensor

vllm/model_executor/layers/fused_moe/shared_fused_moe.py (51-53)
Removing this property changes the behavior of SharedFusedMoE. Previously, the internal gate was disabled when self.use_overlapped was false. Now, the gate from the parent FusedMoE class will always be used if it's provided, regardless of use_overlapped.
The use_overlapped flag can be disabled for correctness reasons in certain configurations (e.g., with EPLB). This change might re-introduce correctness issues that the conditional gate was intended to prevent. The previous behavior of disabling the gate when not using overlap should be preserved to avoid potential regressions.
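The removed conditional-gate behavior described above can be sketched as follows. This is a minimal illustrative model, not vLLM's actual `SharedFusedMoE` class; the class and attribute names here are placeholders:

```python
# Hedged sketch of the previous SharedFusedMoE gate behavior: the internal
# gate is only exposed when overlapped execution is enabled. Names are
# illustrative, not vLLM's real API.
class SharedFusedMoESketch:
    def __init__(self, gate=None, use_overlapped=True):
        self._gate = gate
        self.use_overlapped = use_overlapped

    @property
    def gate(self):
        # Previous behavior: disable the internal gate when overlap is off,
        # so configurations that turn off overlap (e.g. for EPLB
        # correctness) never route through it.
        return self._gate if self.use_overlapped else None


layer = SharedFusedMoESketch(gate=object(), use_overlapped=False)
print(layer.gate)  # gate is disabled when use_overlapped is False
```

Preserving a property like this would keep the gate inert whenever `use_overlapped` is false, avoiding the potential regression noted above.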
This pull request has merge conflicts that must be resolved before it can be merged.
# TODO: Once the OOM issue for the TPU backend is resolved, we will
# switch to using the moe_forward custom op.
# Note: CPU doesn't require wrapped forward_impl.
if current_platform.is_tpu() or current_platform.is_cpu():
note, we should be able to remove the TPU stuff soon
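The platform check quoted above can be sketched in isolation like this; `FakePlatform` and `select_forward` are stand-ins for illustration, not vLLM's real `current_platform` object or dispatch code:

```python
# Hedged sketch of the per-platform forward selection discussed above:
# TPU (until its OOM issue is fixed) and CPU call forward_impl directly,
# while other platforms go through the moe_forward custom op.
class FakePlatform:
    def __init__(self, name):
        self.name = name

    def is_tpu(self):
        return self.name == "tpu"

    def is_cpu(self):
        return self.name == "cpu"


def select_forward(platform):
    # Mirrors the `if current_platform.is_tpu() or current_platform.is_cpu()`
    # branch quoted from the diff.
    if platform.is_tpu() or platform.is_cpu():
        return "forward_impl"
    return "moe_forward_custom_op"


print(select_forward(FakePlatform("cuda")))
```

Once the TPU OOM issue is resolved, this branch collapses to a CPU-only special case, which matches the reviewer's note.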
Hi @bnellnm, the pre-commit checks have failed. Please run:

uv pip install pre-commit
pre-commit install
pre-commit run --all-files

Then, commit the changes and push to your branch.
Signed-off-by: Bill Nell <bnell@redhat.com>
LGTM, pending fixing the overlap.
Signed-off-by: Bill Nell <bnell@redhat.com>
I ran some LL performance tests (B200, NVFP4 DeepSeek, EP=4):
============ Serving Benchmark Result ============
Successful requests: 10
Failed requests: 0
Maximum request concurrency: 1
Benchmark duration (s): 12.48
Total input tokens: 10
Total generated tokens: 1000
Request throughput (req/s): 0.80
Output token throughput (tok/s): 80.11
Peak output token throughput (tok/s): 130.00
Peak concurrent requests: 3.00
Total token throughput (tok/s): 80.91
---------------Time to First Token----------------
Mean TTFT (ms): 495.57
Median TTFT (ms): 18.25
P99 TTFT (ms): 4359.99
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 7.60
Median TPOT (ms): 7.60
P99 TPOT (ms): 7.61
---------------Inter-token Latency----------------
Mean ITL (ms): 7.60
Median ITL (ms): 7.60
P99 ITL (ms): 8.02
==================================================
vllm bench serve --port 7890 --model nvidia/DeepSeek-R1-NVFP4 --dataset-name random --input-len 2 --output-len 100 --max-concurrency 4 --num-prompts 40 --seed $(date +%s) --temperature 0.0
============ Serving Benchmark Result ============
Successful requests: 40
Failed requests: 0
Maximum request concurrency: 4
Benchmark duration (s): 9.66
Total input tokens: 40
Total generated tokens: 4000
Request throughput (req/s): 4.14
Output token throughput (tok/s): 414.13
Peak output token throughput (tok/s): 416.00
Peak concurrent requests: 8.00
Total token throughput (tok/s): 418.27
---------------Time to First Token----------------
Mean TTFT (ms): 32.90
Median TTFT (ms): 31.46
P99 TTFT (ms): 54.51
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 9.41
Median TPOT (ms): 9.39
P99 TPOT (ms): 9.48
---------------Inter-token Latency----------------
Mean ITL (ms): 9.41
Median ITL (ms): 9.38
P99 ITL (ms): 9.98
==================================================
vllm bench serve --port 7890 --model nvidia/DeepSeek-R1-NVFP4 --dataset-name random --input-len 2 --output-len 100 --max-concurrency 8 --num-prompts 80 --seed $(date +%s) --temperature 0.0
============ Serving Benchmark Result ============
Successful requests: 80
Failed requests: 0
Maximum request concurrency: 8
Benchmark duration (s): 11.35
Total input tokens: 80
Total generated tokens: 8000
Request throughput (req/s): 7.05
Output token throughput (tok/s): 704.90
Peak output token throughput (tok/s): 728.00
Peak concurrent requests: 16.00
Total token throughput (tok/s): 711.95
---------------Time to First Token----------------
Mean TTFT (ms): 35.80
Median TTFT (ms): 36.65
P99 TTFT (ms): 61.96
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 11.09
Median TPOT (ms): 11.08
P99 TPOT (ms): 11.22
---------------Inter-token Latency----------------
Mean ITL (ms): 11.09
Median ITL (ms): 11.06
P99 ITL (ms): 11.69
==================================================
vllm bench serve --port 7890 --model nvidia/DeepSeek-R1-NVFP4 --dataset-name random --input-len 2 --output-len 100 --max-concurrency 16 --num-prompts 160 --seed $(date +%s) --temperature 0.0
============ Serving Benchmark Result ============
Successful requests: 160
Failed requests: 0
Maximum request concurrency: 16
Benchmark duration (s): 14.13
Total input tokens: 160
Total generated tokens: 16000
Request throughput (req/s): 11.32
Output token throughput (tok/s): 1132.04
Peak output token throughput (tok/s): 1168.00
Peak concurrent requests: 32.00
Total token throughput (tok/s): 1143.36
---------------Time to First Token----------------
Mean TTFT (ms): 51.19
Median TTFT (ms): 45.91
P99 TTFT (ms): 88.35
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 13.74
Median TPOT (ms): 13.73
P99 TPOT (ms): 13.96
---------------Inter-token Latency----------------
Mean ITL (ms): 13.74
Median ITL (ms): 13.68
P99 ITL (ms): 18.07
==================================================
Results look identical.

Going to merge this. The docs build is stuck.
Merged commit 9279c59 into vllm-project:main
Signed-off-by: Bill Nell <bnell@redhat.com>
### What this PR does / why we need it?
Main2main: upgrade the vLLM commit to 0320 17:00.
1. Fix: vLLM refactored `_moe_forward` to call `runner.forward_impl_chunked()` when `runner.use_dp_chunking` is True. vLLM PR: "[MoE Refactor] DefaultMoERunner simplification" [#33049](vllm-project/vllm#33049)
2. Fix: vLLM moved the call to `self._set_compile_ranges()` in `VllmConfig.__post_init__` from **before** `check_and_update_config()` to **after** it (to allow platforms to lower `max_num_batched_tokens` first). vLLM PR: "fix(xpu): Re-compute compile ranges after platform-specific config updates" [#37523](vllm-project/vllm#37523)
### Does this PR introduce _any_ user-facing change?
NA
### How was this patch tested?
NA
- vLLM version: v0.17.0
- vLLM main: vllm-project/vllm@8b63257
---------
Signed-off-by: leo-pony <nengjunma@outlook.com>
Co-authored-by: Claude Code <noreply@anthropic.com>
Purpose

Simplify the `forward` methods in DefaultMoERunner by moving bits of functionality to helper methods.

Test Plan

Run all MoE integration tests + all kernel MoE tests (including fp8 + fp4, etc.)

Test Result
cc @yzong-rh
Essential Elements of an Effective PR Description Checklist

- Update `supported_models.md` and `examples` for a new model.