[MoE Refactor] DefaultMoERunner simplification #33049
robertgshaw2-redhat merged 5 commits into vllm-project:main
Conversation
Documentation preview: https://vllm--33049.org.readthedocs.build/en/33049/
Code Review
This pull request introduces a significant and beneficial refactoring of the MoE implementation. By moving the forward pass logic from the FusedMoE layer into a new DefaultMoERunner class, the code is now cleaner, more modular, and easier to maintain. The dynamic registration of custom ops per layer is also a great improvement. I've found a couple of issues in the new implementation that need to be addressed, one of which is critical. Otherwise, this is a solid simplification.
vllm/model_executor/layers/fused_moe/runner/default_moe_runner.py (422-470)
The prepare_dp_allgather_tensor method is being called with incorrect arguments. It expects a layer object as its first argument, but it's receiving self (the DefaultMoERunner instance) instead. This will cause a runtime error when post_quant_allgather is true.
This suggestion fixes the issue by passing the layer object to _maybe_dispatch and using it correctly. You will also need to update the call to _maybe_dispatch in forward_impl to pass the layer object:
# in forward_impl
hidden_states, router_logits, extra_tensor = self._maybe_dispatch(
    layer, hidden_states, router_logits
)

def _maybe_dispatch(
    self,
    layer: torch.nn.Module,
    hidden_states: torch.Tensor,
    router_logits: torch.Tensor,
) -> tuple[torch.Tensor, torch.Tensor, torch.Tensor | None]:
    extra_tensor: torch.Tensor | None = None
    if self.do_naive_dispatch_combine:
        post_quant_allgather = (
            self.moe_config.dp_size > 1
            and self.moe_config.use_ep
            and getattr(self.quant_method, "do_post_quant_allgather", False)
        )
        if post_quant_allgather:
            hidden_states_to_dispatch, extra_tensors = (
                self.quant_method.prepare_dp_allgather_tensor(
                    layer, hidden_states, router_logits
                )
            )
        else:
            hidden_states_to_dispatch = hidden_states
            extra_tensors = None
        hidden_states, router_logits, extra_tensors_dispatched = (
            get_ep_group().dispatch(
                hidden_states_to_dispatch,
                router_logits,
                self.moe_config.is_sequence_parallel,
                extra_tensors=extra_tensors,
            )
        )
        if extra_tensors_dispatched is not None:
            assert len(extra_tensors_dispatched) == 1
            extra_tensor = extra_tensors_dispatched[0]
    # NOTE: Similar to DP, PCP also needs dispatch and combine. For
    # simplicity, AgRsAll2All was added separately for PCP here. Maybe
    # we should modify the All2AllManager abstraction to better support PCP.
    if self.moe_config.pcp_size > 1:
        hidden_states = get_pcp_group().all_gather(
            hidden_states,
            dim=0,
        )
        router_logits = get_pcp_group().all_gather(
            router_logits,
            dim=0,
        )
    return hidden_states, router_logits, extra_tensor

vllm/model_executor/layers/fused_moe/shared_fused_moe.py (51-53)
Removing this property changes the behavior of SharedFusedMoE. Previously, the internal gate was disabled when self.use_overlapped was false. Now, the gate from the parent FusedMoE class will always be used if it's provided, regardless of use_overlapped.
The use_overlapped flag can be disabled for correctness reasons in certain configurations (e.g., with EPLB). This change might re-introduce correctness issues that the conditional gate was intended to prevent. The previous behavior of disabling the gate when not using overlap should be preserved to avoid potential regressions.
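The removed conditional-gate behavior described above can be sketched as follows. This is a minimal illustrative model, not vLLM's actual `SharedFusedMoE` class; the class and attribute names here are placeholders:

```python
# Hedged sketch of the previous SharedFusedMoE gate behavior: the internal
# gate is only exposed when overlapped execution is enabled. Names are
# illustrative, not vLLM's real API.
class SharedFusedMoESketch:
    def __init__(self, gate=None, use_overlapped=True):
        self._gate = gate
        self.use_overlapped = use_overlapped

    @property
    def gate(self):
        # Previous behavior: disable the internal gate when overlap is off,
        # so configurations that turn off overlap (e.g. for EPLB
        # correctness) never route through it.
        return self._gate if self.use_overlapped else None


layer = SharedFusedMoESketch(gate=object(), use_overlapped=False)
print(layer.gate)  # gate is disabled when use_overlapped is False
```

Preserving a property like this would keep the gate inert whenever `use_overlapped` is false, avoiding the potential regression noted above.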
This pull request has merge conflicts that must be resolved before it can be merged.
# TODO: Once the OOM issue for the TPU backend is resolved, we will
# switch to using the moe_forward custom op.
# Note: CPU doesn't require wrapped forward_impl.
if current_platform.is_tpu() or current_platform.is_cpu():
note, we should be able to remove the TPU stuff soon
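The platform check quoted above can be sketched in isolation like this; `FakePlatform` and `select_forward` are stand-ins for illustration, not vLLM's real `current_platform` object or dispatch code:

```python
# Hedged sketch of the per-platform forward selection discussed above:
# TPU (until its OOM issue is fixed) and CPU call forward_impl directly,
# while other platforms go through the moe_forward custom op.
class FakePlatform:
    def __init__(self, name):
        self.name = name

    def is_tpu(self):
        return self.name == "tpu"

    def is_cpu(self):
        return self.name == "cpu"


def select_forward(platform):
    # Mirrors the `if current_platform.is_tpu() or current_platform.is_cpu()`
    # branch quoted from the diff.
    if platform.is_tpu() or platform.is_cpu():
        return "forward_impl"
    return "moe_forward_custom_op"


print(select_forward(FakePlatform("cuda")))
```

Once the TPU OOM issue is resolved, this branch collapses to a CPU-only special case, which matches the reviewer's note.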
Hi @bnellnm, the pre-commit checks have failed. Please run:

uv pip install pre-commit
pre-commit install
pre-commit run --all-files

Then, commit the changes and push to your branch.
Signed-off-by: Bill Nell <bnell@redhat.com>
LGTM, pending fixing the overlap.
Signed-off-by: Bill Nell <bnell@redhat.com>
I ran some LL performance tests (B200, NVFP4 DeepSeek, EP=4):
============ Serving Benchmark Result ============
Successful requests: 10
Failed requests: 0
Maximum request concurrency: 1
Benchmark duration (s): 12.48
Total input tokens: 10
Total generated tokens: 1000
Request throughput (req/s): 0.80
Output token throughput (tok/s): 80.11
Peak output token throughput (tok/s): 130.00
Peak concurrent requests: 3.00
Total token throughput (tok/s): 80.91
---------------Time to First Token----------------
Mean TTFT (ms): 495.57
Median TTFT (ms): 18.25
P99 TTFT (ms): 4359.99
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 7.60
Median TPOT (ms): 7.60
P99 TPOT (ms): 7.61
---------------Inter-token Latency----------------
Mean ITL (ms): 7.60
Median ITL (ms): 7.60
P99 ITL (ms): 8.02
==================================================
vllm bench serve --port 7890 --model nvidia/DeepSeek-R1-NVFP4 --dataset-name random --input-len 2 --output-len 100 --max-concurrency 4 --num-prompts 40 --seed $(date +%s) --temperature 0.0
============ Serving Benchmark Result ============
Successful requests: 40
Failed requests: 0
Maximum request concurrency: 4
Benchmark duration (s): 9.66
Total input tokens: 40
Total generated tokens: 4000
Request throughput (req/s): 4.14
Output token throughput (tok/s): 414.13
Peak output token throughput (tok/s): 416.00
Peak concurrent requests: 8.00
Total token throughput (tok/s): 418.27
---------------Time to First Token----------------
Mean TTFT (ms): 32.90
Median TTFT (ms): 31.46
P99 TTFT (ms): 54.51
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 9.41
Median TPOT (ms): 9.39
P99 TPOT (ms): 9.48
---------------Inter-token Latency----------------
Mean ITL (ms): 9.41
Median ITL (ms): 9.38
P99 ITL (ms): 9.98
==================================================
vllm bench serve --port 7890 --model nvidia/DeepSeek-R1-NVFP4 --dataset-name random --input-len 2 --output-len 100 --max-concurrency 8 --num-prompts 80 --seed $(date +%s) --temperature 0.0
============ Serving Benchmark Result ============
Successful requests: 80
Failed requests: 0
Maximum request concurrency: 8
Benchmark duration (s): 11.35
Total input tokens: 80
Total generated tokens: 8000
Request throughput (req/s): 7.05
Output token throughput (tok/s): 704.90
Peak output token throughput (tok/s): 728.00
Peak concurrent requests: 16.00
Total token throughput (tok/s): 711.95
---------------Time to First Token----------------
Mean TTFT (ms): 35.80
Median TTFT (ms): 36.65
P99 TTFT (ms): 61.96
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 11.09
Median TPOT (ms): 11.08
P99 TPOT (ms): 11.22
---------------Inter-token Latency----------------
Mean ITL (ms): 11.09
Median ITL (ms): 11.06
P99 ITL (ms): 11.69
==================================================
vllm bench serve --port 7890 --model nvidia/DeepSeek-R1-NVFP4 --dataset-name random --input-len 2 --output-len 100 --max-concurrency 16 --num-prompts 160 --seed $(date +%s) --temperature 0.0
============ Serving Benchmark Result ============
Successful requests: 160
Failed requests: 0
Maximum request concurrency: 16
Benchmark duration (s): 14.13
Total input tokens: 160
Total generated tokens: 16000
Request throughput (req/s): 11.32
Output token throughput (tok/s): 1132.04
Peak output token throughput (tok/s): 1168.00
Peak concurrent requests: 32.00
Total token throughput (tok/s): 1143.36
---------------Time to First Token----------------
Mean TTFT (ms): 51.19
Median TTFT (ms): 45.91
P99 TTFT (ms): 88.35
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 13.74
Median TPOT (ms): 13.73
P99 TPOT (ms): 13.96
---------------Inter-token Latency----------------
Mean ITL (ms): 13.74
Median ITL (ms): 13.68
P99 ITL (ms): 18.07
==================================================
Results look identical.

Going to merge this. The docs build is stuck.
Merged commit 9279c59 into vllm-project:main
Signed-off-by: Bill Nell <bnell@redhat.com>
### What this PR does / why we need it?
Main2main: upgrade the vLLM commit to 0320 17:00.
1. Fix: vLLM refactored `_moe_forward` to call `runner.forward_impl_chunked()` when `runner.use_dp_chunking` is True. vLLM PR: "[MoE Refactor] DefaultMoERunner simplification" [#33049](vllm-project/vllm#33049)
2. Fix: vLLM moved the call to `self._set_compile_ranges()` in `VllmConfig.__post_init__` from **before** `check_and_update_config()` to **after** it (to allow platforms to lower `max_num_batched_tokens` first). vLLM PR: "fix(xpu): Re-compute compile ranges after platform-specific config updates" [#37523](vllm-project/vllm#37523)
### Does this PR introduce _any_ user-facing change?
NA
### How was this patch tested?
NA
- vLLM version: v0.17.0
- vLLM main: vllm-project/vllm@8b63257
---------
Signed-off-by: leo-pony <nengjunma@outlook.com>
Co-authored-by: Claude Code <noreply@anthropic.com>
Purpose

Simplify the `forward` methods in DefaultMoERunner by moving bits of functionality to helper methods.

Test Plan

Run all MoE integration tests + all kernel MoE tests (including fp8 + fp4, etc.)

Test Result
cc @yzong-rh
Essential Elements of an Effective PR Description Checklist

- Update `supported_models.md` and `examples` for a new model.