
[MoE Refactor] DefaultMoERunner simplification #33049

Merged
robertgshaw2-redhat merged 5 commits into vllm-project:main from neuralmagic:moe-runner-1
Mar 19, 2026

Conversation


@bnellnm commented Jan 26, 2026

Purpose

  • Simplify the forward methods in DefaultMoERunner by moving bits of functionality to helper methods.
  • Disable cloning of shared experts input when inplace is disabled.
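The second bullet can be sketched as follows; the helper name and signature are hypothetical, for illustration only:

```python
import torch

def maybe_clone_shared_input(hidden_states: torch.Tensor, inplace: bool) -> torch.Tensor:
    # Illustrative sketch, not the vLLM implementation: when the fused-experts
    # kernel runs in-place it may overwrite hidden_states, so the shared-experts
    # branch needs its own copy. When inplace is disabled the kernel writes to a
    # fresh output buffer, so the clone (and its extra memory traffic) is skipped.
    return hidden_states.clone() if inplace else hidden_states
```

Skipping the clone in the non-inplace path avoids one allocation and copy per forward pass.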

Test Plan

Run all MoE Integration tests + all Kernel MoE tests (including fp8 + fp4, etc.)

Test Result

cc @yzong-rh


Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
  • (Optional) Release notes update. If your change is user facing, please update the release notes draft in the Google Doc.


mergify bot commented Jan 26, 2026

Documentation preview: https://vllm--33049.org.readthedocs.build/en/33049/

mergify bot added the `documentation` (Improvements or additions to documentation) and `v1` labels Jan 26, 2026

@gemini-code-assist bot left a comment


Code Review

This pull request introduces a significant and beneficial refactoring of the MoE implementation. By moving the forward pass logic from the FusedMoE layer into a new DefaultMoERunner class, the code is now cleaner, more modular, and easier to maintain. The dynamic registration of custom ops per layer is also a great improvement. I've found a couple of issues in the new implementation that need to be addressed, one of which is critical. Otherwise, this is a solid simplification.
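The overall shape of the refactor, as described above, might look like this sketch (the class and method names here are hypothetical, not vLLM's actual code):

```python
import torch


class MoERunnerSketch:
    """Hypothetical sketch: the layer keeps configuration and weights,
    while the runner owns the forward logic, decomposed into small
    helper methods that can be reasoned about (and tested) in isolation."""

    def __init__(self, quant_method):
        self.quant_method = quant_method

    def _maybe_dispatch(self, hidden_states, router_logits):
        # Placeholder for EP/DP dispatch; identity in this sketch.
        return hidden_states, router_logits

    def forward(self, layer, hidden_states, router_logits):
        hidden_states, router_logits = self._maybe_dispatch(
            hidden_states, router_logits
        )
        return self.quant_method.apply(layer, hidden_states, router_logits)
```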

I was unable to create individual review comments, so my feedback is consolidated below.

vllm/model_executor/layers/fused_moe/runner/default_moe_runner.py (422-470)

critical

The prepare_dp_allgather_tensor method is being called with incorrect arguments. It expects a layer object as its first argument, but it's receiving self (the DefaultMoERunner instance) instead. This will cause a runtime error when post_quant_allgather is true.

This suggestion fixes the issue by passing the layer object to _maybe_dispatch and using it correctly. You will also need to update the call to _maybe_dispatch in forward_impl to pass the layer object:

# in forward_impl
hidden_states, router_logits, extra_tensor = self._maybe_dispatch(
    layer, hidden_states, router_logits
)
    def _maybe_dispatch(
        self,
        layer: torch.nn.Module,
        hidden_states: torch.Tensor,
        router_logits: torch.Tensor,
    ) -> tuple[torch.Tensor, torch.Tensor, torch.Tensor | None]:
        extra_tensor: torch.Tensor | None = None

        if self.do_naive_dispatch_combine:
            post_quant_allgather = (
                self.moe_config.dp_size > 1
                and self.moe_config.use_ep
                and getattr(self.quant_method, "do_post_quant_allgather", False)
            )
            if post_quant_allgather:
                hidden_states_to_dispatch, extra_tensors = (
                    self.quant_method.prepare_dp_allgather_tensor(
                        layer, hidden_states, router_logits
                    )
                )
            else:
                hidden_states_to_dispatch = hidden_states
                # Ensure extra_tensors is defined for the dispatch call below.
                extra_tensors = None

            hidden_states, router_logits, extra_tensors_dispatched = (
                get_ep_group().dispatch(
                    hidden_states_to_dispatch,
                    router_logits,
                    self.moe_config.is_sequence_parallel,
                    extra_tensors=extra_tensors,
                )
            )

            if extra_tensors_dispatched is not None:
                assert len(extra_tensors_dispatched) == 1
                extra_tensor = extra_tensors_dispatched[0]

        # NOTE: Similar with DP, PCP also needs dispatch and combine. For
        # simplicity, AgRsAll2All was added separately for PCP here. Maybe
        # we should modify All2AllManager abstract to better support PCP.
        if self.moe_config.pcp_size > 1:
            hidden_states = get_pcp_group().all_gather(
                hidden_states,
                dim=0,
            )
            router_logits = get_pcp_group().all_gather(
                router_logits,
                dim=0,
            )

        return hidden_states, router_logits, extra_tensor

vllm/model_executor/layers/fused_moe/shared_fused_moe.py (51-53)

high

Removing this property changes the behavior of SharedFusedMoE. Previously, the internal gate was disabled when self.use_overlapped was false. Now, the gate from the parent FusedMoE class will always be used if it's provided, regardless of use_overlapped.

The use_overlapped flag can be disabled for correctness reasons in certain configurations (e.g., with EPLB). This change might re-introduce correctness issues that the conditional gate was intended to prevent. The previous behavior of disabling the gate when not using overlap should be preserved to avoid potential regressions.


mergify bot commented Jan 27, 2026

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @bnellnm.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

mergify bot added the needs-rebase label Jan 27, 2026
# TODO: Once the OOM issue for the TPU backend is resolved, we will
# switch to using the moe_forward custom op.
# Note: CPU doesn't require wrapped forward_impl.
if current_platform.is_tpu() or current_platform.is_cpu():

note, we should be able to remove the TPU stuff soon

@bnellnm marked this pull request as ready for review February 5, 2026 01:04

mergify bot commented Feb 5, 2026

Hi @bnellnm, the pre-commit checks have failed. Please run:

uv pip install pre-commit
pre-commit install
pre-commit run --all-files

Then, commit the changes and push to your branch.

For future commits, pre-commit will run automatically on changed files before each commit.

Tip

Is mypy or markdownlint failing?
mypy and markdownlint are run differently in CI. If the failure is related to either of these checks, please use the following commands to run them locally:
# For mypy (substitute "3.10" with the failing version if needed)
pre-commit run --hook-stage manual mypy-3.10
# For markdownlint
pre-commit run --hook-stage manual markdownlint

mergify bot removed the needs-rebase label Feb 7, 2026
@bnellnm force-pushed the moe-runner-1 branch 2 times, most recently from 4f0285e to f8ea830 on February 9, 2026 21:23
@bnellnm requested a review from tjtanaa as a code owner February 9, 2026 21:23

mergify bot commented Feb 11, 2026

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @bnellnm.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork


mergify bot commented Feb 12, 2026

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @bnellnm.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork


mergify bot commented Mar 4, 2026

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @bnellnm.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

mergify bot added the needs-rebase label Mar 4, 2026
bnellnm added 4 commits March 18, 2026 16:48
Signed-off-by: Bill Nell <bnell@redhat.com>
Signed-off-by: Bill Nell <bnell@redhat.com>
Signed-off-by: Bill Nell <bnell@redhat.com>
Signed-off-by: Bill Nell <bnell@redhat.com>
robertgshaw2-redhat added the `ready` label (ONLY add when PR is ready to merge/full CI is needed) Mar 18, 2026
@robertgshaw2-redhat

LGTM, pending fixing the overlap

Signed-off-by: Bill Nell <bnell@redhat.com>
@robertgshaw2-redhat

I ran some LL performance tests, B200 NVFP4 DeepSeek EP=4

  • pr
(vllm) robertgshaw2-redhat@dgx-b200-02:~/vllm$ just sweep-vllm
just benchmark 1 10 7890 && just benchmark 4 40 7890 && just benchmark 8 80 7890 && just benchmark 16 160 7890
vllm bench serve --port 7890 --model nvidia/DeepSeek-R1-NVFP4 --dataset-name random --input-len 2 --output-len 100 --max-concurrency 1 --num-prompts 10 --seed $(date +%s) --temperature 0.0
Namespace(subparser='bench', bench_type='serve', dispatch_function=<function BenchmarkServingSubcommand.cmd at 0x7e0b7d1f2840>, trust_remote_code=False, seed=1773943563, num_prompts=10, dataset_name='random', no_stream=False, dataset_path=None, no_oversample=False, skip_chat_template=False, enable_multimodal_chat=False, disable_shuffle=False, custom_output_len=256, spec_bench_output_len=256, spec_bench_category=None, sonnet_input_len=550, sonnet_output_len=150, sonnet_prefix_len=200, sharegpt_output_len=None, blazedit_min_distance=0.0, blazedit_max_distance=1.0, asr_max_audio_len_sec=inf, asr_min_audio_len_sec=0.0, random_input_len=1024, random_output_len=128, random_range_ratio=0.0, random_prefix_len=0, random_batch_size=1, no_reranker=False, random_mm_base_items_per_request=1, random_mm_num_mm_items_range_ratio=0.0, random_mm_limit_mm_per_prompt={'image': 255, 'video': 1}, random_mm_bucket_config={(256, 256, 1): 0.5, (720, 1280, 1): 0.5, (720, 1280, 16): 0.0}, hf_subset=None, hf_split=None, hf_name=None, hf_output_len=None, prefix_repetition_prefix_len=256, prefix_repetition_suffix_len=256, prefix_repetition_num_prefixes=10, prefix_repetition_output_len=128, label=None, backend='openai', base_url=None, host='127.0.0.1', port=7890, endpoint='/v1/completions', header=None, max_concurrency=1, model='nvidia/DeepSeek-R1-NVFP4', input_len=2, output_len=100, tokenizer=None, tokenizer_mode='auto', use_beam_search=False, logprobs=None, request_rate=inf, burstiness=1.0, disable_tqdm=False, num_warmups=0, profile=False, save_result=False, save_detailed=False, append_result=False, metadata=None, result_dir=None, result_filename=None, ignore_eos=False, percentile_metrics=None, metric_percentiles='99', goodput=None, request_id_prefix='bench-0dcc4c60-', top_p=None, top_k=None, min_p=None, temperature=0.0, frequency_penalty=None, presence_penalty=None, repetition_penalty=None, served_model_name=None, lora_modules=None, lora_assignment='random', ramp_up_strategy=None, 
ramp_up_start_rps=None, ramp_up_end_rps=None, ready_check_timeout_sec=0, extra_body=None, skip_tokenizer_init=False, insecure=False, plot_timeline=False, timeline_itl_thresholds=[25.0, 50.0], plot_dataset_stats=False)
INFO 03-19 14:06:10 [datasets.py:703] Sampling input_len from [1, 1] and output_len from [100, 100]
Starting initial single prompt test run...
Skipping endpoint ready check.
Starting main benchmark run...
Traffic request rate: inf
Burstiness factor: 1.0 (Poisson process)
Maximum request concurrency: 1
100%|█████████████████████████████████████████████████████████████████████| 10/10 [00:12<00:00,  1.27s/it]
tip: install termplotlib and gnuplot to plot the metrics
============ Serving Benchmark Result ============
Successful requests:                     10        
Failed requests:                         0         
Maximum request concurrency:             1         
Benchmark duration (s):                  12.65     
Total input tokens:                      10        
Total generated tokens:                  1000      
Request throughput (req/s):              0.79      
Output token throughput (tok/s):         79.05     
Peak output token throughput (tok/s):    130.00    
Peak concurrent requests:                3.00      
Total token throughput (tok/s):          79.84     
---------------Time to First Token----------------
Mean TTFT (ms):                          509.63    
Median TTFT (ms):                        18.42     
P99 TTFT (ms):                           4489.20   
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          7.63      
Median TPOT (ms):                        7.63      
P99 TPOT (ms):                           7.64      
---------------Inter-token Latency----------------
Mean ITL (ms):                           7.63      
Median ITL (ms):                         7.62      
P99 ITL (ms):                            8.14      
==================================================
vllm bench serve --port 7890 --model nvidia/DeepSeek-R1-NVFP4 --dataset-name random --input-len 2 --output-len 100 --max-concurrency 4 --num-prompts 40 --seed $(date +%s) --temperature 0.0
Namespace(subparser='bench', bench_type='serve', dispatch_function=<function BenchmarkServingSubcommand.cmd at 0x722498902d40>, trust_remote_code=False, seed=1773943583, num_prompts=40, dataset_name='random', no_stream=False, dataset_path=None, no_oversample=False, skip_chat_template=False, enable_multimodal_chat=False, disable_shuffle=False, custom_output_len=256, spec_bench_output_len=256, spec_bench_category=None, sonnet_input_len=550, sonnet_output_len=150, sonnet_prefix_len=200, sharegpt_output_len=None, blazedit_min_distance=0.0, blazedit_max_distance=1.0, asr_max_audio_len_sec=inf, asr_min_audio_len_sec=0.0, random_input_len=1024, random_output_len=128, random_range_ratio=0.0, random_prefix_len=0, random_batch_size=1, no_reranker=False, random_mm_base_items_per_request=1, random_mm_num_mm_items_range_ratio=0.0, random_mm_limit_mm_per_prompt={'image': 255, 'video': 1}, random_mm_bucket_config={(256, 256, 1): 0.5, (720, 1280, 1): 0.5, (720, 1280, 16): 0.0}, hf_subset=None, hf_split=None, hf_name=None, hf_output_len=None, prefix_repetition_prefix_len=256, prefix_repetition_suffix_len=256, prefix_repetition_num_prefixes=10, prefix_repetition_output_len=128, label=None, backend='openai', base_url=None, host='127.0.0.1', port=7890, endpoint='/v1/completions', header=None, max_concurrency=4, model='nvidia/DeepSeek-R1-NVFP4', input_len=2, output_len=100, tokenizer=None, tokenizer_mode='auto', use_beam_search=False, logprobs=None, request_rate=inf, burstiness=1.0, disable_tqdm=False, num_warmups=0, profile=False, save_result=False, save_detailed=False, append_result=False, metadata=None, result_dir=None, result_filename=None, ignore_eos=False, percentile_metrics=None, metric_percentiles='99', goodput=None, request_id_prefix='bench-e37fb4ab-', top_p=None, top_k=None, min_p=None, temperature=0.0, frequency_penalty=None, presence_penalty=None, repetition_penalty=None, served_model_name=None, lora_modules=None, lora_assignment='random', ramp_up_strategy=None, 
ramp_up_start_rps=None, ramp_up_end_rps=None, ready_check_timeout_sec=0, extra_body=None, skip_tokenizer_init=False, insecure=False, plot_timeline=False, timeline_itl_thresholds=[25.0, 50.0], plot_dataset_stats=False)
INFO 03-19 14:06:29 [datasets.py:703] Sampling input_len from [1, 1] and output_len from [100, 100]
Starting initial single prompt test run...
Skipping endpoint ready check.
Starting main benchmark run...
Traffic request rate: inf
Burstiness factor: 1.0 (Poisson process)
Maximum request concurrency: 4
100%|█████████████████████████████████████████████████████████████████████| 40/40 [00:09<00:00,  4.13it/s]
tip: install termplotlib and gnuplot to plot the metrics
============ Serving Benchmark Result ============
Successful requests:                     40        
Failed requests:                         0         
Maximum request concurrency:             4         
Benchmark duration (s):                  9.68      
Total input tokens:                      40        
Total generated tokens:                  4000      
Request throughput (req/s):              4.13      
Output token throughput (tok/s):         413.04    
Peak output token throughput (tok/s):    416.00    
Peak concurrent requests:                8.00      
Total token throughput (tok/s):          417.17    
---------------Time to First Token----------------
Mean TTFT (ms):                          34.65     
Median TTFT (ms):                        33.15     
P99 TTFT (ms):                           65.52     
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          9.42      
Median TPOT (ms):                        9.40      
P99 TPOT (ms):                           9.56      
---------------Inter-token Latency----------------
Mean ITL (ms):                           9.42      
Median ITL (ms):                         9.38      
P99 ITL (ms):                            9.89      
==================================================
vllm bench serve --port 7890 --model nvidia/DeepSeek-R1-NVFP4 --dataset-name random --input-len 2 --output-len 100 --max-concurrency 8 --num-prompts 80 --seed $(date +%s) --temperature 0.0
Namespace(subparser='bench', bench_type='serve', dispatch_function=<function BenchmarkServingSubcommand.cmd at 0x7dc6cbafed40>, trust_remote_code=False, seed=1773943599, num_prompts=80, dataset_name='random', no_stream=False, dataset_path=None, no_oversample=False, skip_chat_template=False, enable_multimodal_chat=False, disable_shuffle=False, custom_output_len=256, spec_bench_output_len=256, spec_bench_category=None, sonnet_input_len=550, sonnet_output_len=150, sonnet_prefix_len=200, sharegpt_output_len=None, blazedit_min_distance=0.0, blazedit_max_distance=1.0, asr_max_audio_len_sec=inf, asr_min_audio_len_sec=0.0, random_input_len=1024, random_output_len=128, random_range_ratio=0.0, random_prefix_len=0, random_batch_size=1, no_reranker=False, random_mm_base_items_per_request=1, random_mm_num_mm_items_range_ratio=0.0, random_mm_limit_mm_per_prompt={'image': 255, 'video': 1}, random_mm_bucket_config={(256, 256, 1): 0.5, (720, 1280, 1): 0.5, (720, 1280, 16): 0.0}, hf_subset=None, hf_split=None, hf_name=None, hf_output_len=None, prefix_repetition_prefix_len=256, prefix_repetition_suffix_len=256, prefix_repetition_num_prefixes=10, prefix_repetition_output_len=128, label=None, backend='openai', base_url=None, host='127.0.0.1', port=7890, endpoint='/v1/completions', header=None, max_concurrency=8, model='nvidia/DeepSeek-R1-NVFP4', input_len=2, output_len=100, tokenizer=None, tokenizer_mode='auto', use_beam_search=False, logprobs=None, request_rate=inf, burstiness=1.0, disable_tqdm=False, num_warmups=0, profile=False, save_result=False, save_detailed=False, append_result=False, metadata=None, result_dir=None, result_filename=None, ignore_eos=False, percentile_metrics=None, metric_percentiles='99', goodput=None, request_id_prefix='bench-00c4d2f0-', top_p=None, top_k=None, min_p=None, temperature=0.0, frequency_penalty=None, presence_penalty=None, repetition_penalty=None, served_model_name=None, lora_modules=None, lora_assignment='random', ramp_up_strategy=None, 
ramp_up_start_rps=None, ramp_up_end_rps=None, ready_check_timeout_sec=0, extra_body=None, skip_tokenizer_init=False, insecure=False, plot_timeline=False, timeline_itl_thresholds=[25.0, 50.0], plot_dataset_stats=False)
INFO 03-19 14:06:45 [datasets.py:703] Sampling input_len from [1, 1] and output_len from [100, 100]
Starting initial single prompt test run...
Skipping endpoint ready check.
Starting main benchmark run...
Traffic request rate: inf
Burstiness factor: 1.0 (Poisson process)
Maximum request concurrency: 8
100%|█████████████████████████████████████████████████████████████████████| 80/80 [00:11<00:00,  7.02it/s]
tip: install termplotlib and gnuplot to plot the metrics
============ Serving Benchmark Result ============
Successful requests:                     80        
Failed requests:                         0         
Maximum request concurrency:             8         
Benchmark duration (s):                  11.40     
Total input tokens:                      80        
Total generated tokens:                  8000      
Request throughput (req/s):              7.02      
Output token throughput (tok/s):         701.97    
Peak output token throughput (tok/s):    720.00    
Peak concurrent requests:                16.00     
Total token throughput (tok/s):          708.99    
---------------Time to First Token----------------
Mean TTFT (ms):                          43.08     
Median TTFT (ms):                        41.37     
P99 TTFT (ms):                           63.82     
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          11.06     
Median TPOT (ms):                        11.07     
P99 TPOT (ms):                           11.15     
---------------Inter-token Latency----------------
Mean ITL (ms):                           11.06     
Median ITL (ms):                         10.98     
P99 ITL (ms):                            12.17     
==================================================
vllm bench serve --port 7890 --model nvidia/DeepSeek-R1-NVFP4 --dataset-name random --input-len 2 --output-len 100 --max-concurrency 16 --num-prompts 160 --seed $(date +%s) --temperature 0.0
Namespace(subparser='bench', bench_type='serve', dispatch_function=<function BenchmarkServingSubcommand.cmd at 0x75fa70116840>, trust_remote_code=False, seed=1773943617, num_prompts=160, dataset_name='random', no_stream=False, dataset_path=None, no_oversample=False, skip_chat_template=False, enable_multimodal_chat=False, disable_shuffle=False, custom_output_len=256, spec_bench_output_len=256, spec_bench_category=None, sonnet_input_len=550, sonnet_output_len=150, sonnet_prefix_len=200, sharegpt_output_len=None, blazedit_min_distance=0.0, blazedit_max_distance=1.0, asr_max_audio_len_sec=inf, asr_min_audio_len_sec=0.0, random_input_len=1024, random_output_len=128, random_range_ratio=0.0, random_prefix_len=0, random_batch_size=1, no_reranker=False, random_mm_base_items_per_request=1, random_mm_num_mm_items_range_ratio=0.0, random_mm_limit_mm_per_prompt={'image': 255, 'video': 1}, random_mm_bucket_config={(256, 256, 1): 0.5, (720, 1280, 1): 0.5, (720, 1280, 16): 0.0}, hf_subset=None, hf_split=None, hf_name=None, hf_output_len=None, prefix_repetition_prefix_len=256, prefix_repetition_suffix_len=256, prefix_repetition_num_prefixes=10, prefix_repetition_output_len=128, label=None, backend='openai', base_url=None, host='127.0.0.1', port=7890, endpoint='/v1/completions', header=None, max_concurrency=16, model='nvidia/DeepSeek-R1-NVFP4', input_len=2, output_len=100, tokenizer=None, tokenizer_mode='auto', use_beam_search=False, logprobs=None, request_rate=inf, burstiness=1.0, disable_tqdm=False, num_warmups=0, profile=False, save_result=False, save_detailed=False, append_result=False, metadata=None, result_dir=None, result_filename=None, ignore_eos=False, percentile_metrics=None, metric_percentiles='99', goodput=None, request_id_prefix='bench-2a5c1e8f-', top_p=None, top_k=None, min_p=None, temperature=0.0, frequency_penalty=None, presence_penalty=None, repetition_penalty=None, served_model_name=None, lora_modules=None, lora_assignment='random', ramp_up_strategy=None, 
ramp_up_start_rps=None, ramp_up_end_rps=None, ready_check_timeout_sec=0, extra_body=None, skip_tokenizer_init=False, insecure=False, plot_timeline=False, timeline_itl_thresholds=[25.0, 50.0], plot_dataset_stats=False)
INFO 03-19 14:07:02 [datasets.py:703] Sampling input_len from [1, 1] and output_len from [100, 100]
Starting initial single prompt test run...
Skipping endpoint ready check.
Starting main benchmark run...
Traffic request rate: inf
Burstiness factor: 1.0 (Poisson process)
Maximum request concurrency: 16
100%|███████████████████████████████████████████████████████████████████| 160/160 [00:14<00:00, 11.30it/s]
tip: install termplotlib and gnuplot to plot the metrics
============ Serving Benchmark Result ============
Successful requests:                     160       
Failed requests:                         0         
Maximum request concurrency:             16        
Benchmark duration (s):                  14.16     
Total input tokens:                      160       
Total generated tokens:                  16000     
Request throughput (req/s):              11.30     
Output token throughput (tok/s):         1129.58   
Peak output token throughput (tok/s):    1184.00   
Peak concurrent requests:                32.00     
Total token throughput (tok/s):          1140.88   
---------------Time to First Token----------------
Mean TTFT (ms):                          51.34     
Median TTFT (ms):                        46.76     
P99 TTFT (ms):                           82.99     
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          13.77     
Median TPOT (ms):                        13.77     
P99 TPOT (ms):                           13.91     
---------------Inter-token Latency----------------
Mean ITL (ms):                           13.77     
Median ITL (ms):                         13.70     
P99 ITL (ms):                            18.43     
==================================================
  • main
============ Serving Benchmark Result ============
Successful requests:                     10        
Failed requests:                         0         
Maximum request concurrency:             1         
Benchmark duration (s):                  12.48     
Total input tokens:                      10        
Total generated tokens:                  1000      
Request throughput (req/s):              0.80      
Output token throughput (tok/s):         80.11     
Peak output token throughput (tok/s):    130.00    
Peak concurrent requests:                3.00      
Total token throughput (tok/s):          80.91     
---------------Time to First Token----------------
Mean TTFT (ms):                          495.57    
Median TTFT (ms):                        18.25     
P99 TTFT (ms):                           4359.99   
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          7.60      
Median TPOT (ms):                        7.60      
P99 TPOT (ms):                           7.61      
---------------Inter-token Latency----------------
Mean ITL (ms):                           7.60      
Median ITL (ms):                         7.60      
P99 ITL (ms):                            8.02      
==================================================
vllm bench serve --port 7890 --model nvidia/DeepSeek-R1-NVFP4 --dataset-name random --input-len 2 --output-len 100 --max-concurrency 4 --num-prompts 40 --seed $(date +%s) --temperature 0.0
Namespace(subparser='bench', bench_type='serve', dispatch_function=<function BenchmarkServingSubcommand.cmd at 0x7265a87165c0>, trust_remote_code=False, seed=1773945917, num_prompts=40, dataset_name='random', no_stream=False, dataset_path=None, no_oversample=False, skip_chat_template=False, enable_multimodal_chat=False, disable_shuffle=False, custom_output_len=256, spec_bench_output_len=256, spec_bench_category=None, sonnet_input_len=550, sonnet_output_len=150, sonnet_prefix_len=200, sharegpt_output_len=None, blazedit_min_distance=0.0, blazedit_max_distance=1.0, asr_max_audio_len_sec=inf, asr_min_audio_len_sec=0.0, random_input_len=1024, random_output_len=128, random_range_ratio=0.0, random_prefix_len=0, random_batch_size=1, no_reranker=False, random_mm_base_items_per_request=1, random_mm_num_mm_items_range_ratio=0.0, random_mm_limit_mm_per_prompt={'image': 255, 'video': 1}, random_mm_bucket_config={(256, 256, 1): 0.5, (720, 1280, 1): 0.5, (720, 1280, 16): 0.0}, hf_subset=None, hf_split=None, hf_name=None, hf_output_len=None, prefix_repetition_prefix_len=256, prefix_repetition_suffix_len=256, prefix_repetition_num_prefixes=10, prefix_repetition_output_len=128, label=None, backend='openai', base_url=None, host='127.0.0.1', port=7890, endpoint='/v1/completions', header=None, max_concurrency=4, model='nvidia/DeepSeek-R1-NVFP4', input_len=2, output_len=100, tokenizer=None, tokenizer_mode='auto', use_beam_search=False, logprobs=None, request_rate=inf, burstiness=1.0, disable_tqdm=False, num_warmups=0, profile=False, save_result=False, save_detailed=False, append_result=False, metadata=None, result_dir=None, result_filename=None, ignore_eos=False, percentile_metrics=None, metric_percentiles='99', goodput=None, request_id_prefix='bench-6b3a2a38-', top_p=None, top_k=None, min_p=None, temperature=0.0, frequency_penalty=None, presence_penalty=None, repetition_penalty=None, served_model_name=None, lora_modules=None, lora_assignment='random', ramp_up_strategy=None, 
ramp_up_start_rps=None, ramp_up_end_rps=None, ready_check_timeout_sec=0, extra_body=None, skip_tokenizer_init=False, insecure=False, plot_timeline=False, timeline_itl_thresholds=[25.0, 50.0], plot_dataset_stats=False)
INFO 03-19 14:45:23 [datasets.py:703] Sampling input_len from [1, 1] and output_len from [100, 100]
Starting initial single prompt test run...
Skipping endpoint ready check.
Starting main benchmark run...
Traffic request rate: inf
Burstiness factor: 1.0 (Poisson process)
Maximum request concurrency: 4
100%|█████████████████████████████████████████████████████████████████████| 40/40 [00:09<00:00,  4.14it/s]
tip: install termplotlib and gnuplot to plot the metrics
============ Serving Benchmark Result ============
Successful requests:                     40        
Failed requests:                         0         
Maximum request concurrency:             4         
Benchmark duration (s):                  9.66      
Total input tokens:                      40        
Total generated tokens:                  4000      
Request throughput (req/s):              4.14      
Output token throughput (tok/s):         414.13    
Peak output token throughput (tok/s):    416.00    
Peak concurrent requests:                8.00      
Total token throughput (tok/s):          418.27    
---------------Time to First Token----------------
Mean TTFT (ms):                          32.90     
Median TTFT (ms):                        31.46     
P99 TTFT (ms):                           54.51     
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          9.41      
Median TPOT (ms):                        9.39      
P99 TPOT (ms):                           9.48      
---------------Inter-token Latency----------------
Mean ITL (ms):                           9.41      
Median ITL (ms):                         9.38      
P99 ITL (ms):                            9.98      
==================================================
vllm bench serve --port 7890 --model nvidia/DeepSeek-R1-NVFP4 --dataset-name random --input-len 2 --output-len 100 --max-concurrency 8 --num-prompts 80 --seed $(date +%s) --temperature 0.0
Namespace(subparser='bench', bench_type='serve', dispatch_function=<function BenchmarkServingSubcommand.cmd at 0x75ecd630e5c0>, trust_remote_code=False, seed=1773945933, num_prompts=80, dataset_name='random', no_stream=False, dataset_path=None, no_oversample=False, skip_chat_template=False, enable_multimodal_chat=False, disable_shuffle=False, custom_output_len=256, spec_bench_output_len=256, spec_bench_category=None, sonnet_input_len=550, sonnet_output_len=150, sonnet_prefix_len=200, sharegpt_output_len=None, blazedit_min_distance=0.0, blazedit_max_distance=1.0, asr_max_audio_len_sec=inf, asr_min_audio_len_sec=0.0, random_input_len=1024, random_output_len=128, random_range_ratio=0.0, random_prefix_len=0, random_batch_size=1, no_reranker=False, random_mm_base_items_per_request=1, random_mm_num_mm_items_range_ratio=0.0, random_mm_limit_mm_per_prompt={'image': 255, 'video': 1}, random_mm_bucket_config={(256, 256, 1): 0.5, (720, 1280, 1): 0.5, (720, 1280, 16): 0.0}, hf_subset=None, hf_split=None, hf_name=None, hf_output_len=None, prefix_repetition_prefix_len=256, prefix_repetition_suffix_len=256, prefix_repetition_num_prefixes=10, prefix_repetition_output_len=128, label=None, backend='openai', base_url=None, host='127.0.0.1', port=7890, endpoint='/v1/completions', header=None, max_concurrency=8, model='nvidia/DeepSeek-R1-NVFP4', input_len=2, output_len=100, tokenizer=None, tokenizer_mode='auto', use_beam_search=False, logprobs=None, request_rate=inf, burstiness=1.0, disable_tqdm=False, num_warmups=0, profile=False, save_result=False, save_detailed=False, append_result=False, metadata=None, result_dir=None, result_filename=None, ignore_eos=False, percentile_metrics=None, metric_percentiles='99', goodput=None, request_id_prefix='bench-db844791-', top_p=None, top_k=None, min_p=None, temperature=0.0, frequency_penalty=None, presence_penalty=None, repetition_penalty=None, served_model_name=None, lora_modules=None, lora_assignment='random', ramp_up_strategy=None, 
ramp_up_start_rps=None, ramp_up_end_rps=None, ready_check_timeout_sec=0, extra_body=None, skip_tokenizer_init=False, insecure=False, plot_timeline=False, timeline_itl_thresholds=[25.0, 50.0], plot_dataset_stats=False)
INFO 03-19 14:45:39 [datasets.py:703] Sampling input_len from [1, 1] and output_len from [100, 100]
Starting initial single prompt test run...
Skipping endpoint ready check.
Starting main benchmark run...
Traffic request rate: inf
Burstiness factor: 1.0 (Poisson process)
Maximum request concurrency: 8
100%|█████████████████████████████████████████████████████████████████████| 80/80 [00:11<00:00,  7.05it/s]
tip: install termplotlib and gnuplot to plot the metrics
============ Serving Benchmark Result ============
Successful requests:                     80        
Failed requests:                         0         
Maximum request concurrency:             8         
Benchmark duration (s):                  11.35     
Total input tokens:                      80        
Total generated tokens:                  8000      
Request throughput (req/s):              7.05      
Output token throughput (tok/s):         704.90    
Peak output token throughput (tok/s):    728.00    
Peak concurrent requests:                16.00     
Total token throughput (tok/s):          711.95    
---------------Time to First Token----------------
Mean TTFT (ms):                          35.80     
Median TTFT (ms):                        36.65     
P99 TTFT (ms):                           61.96     
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          11.09     
Median TPOT (ms):                        11.08     
P99 TPOT (ms):                           11.22     
---------------Inter-token Latency----------------
Mean ITL (ms):                           11.09     
Median ITL (ms):                         11.06     
P99 ITL (ms):                            11.69     
==================================================
vllm bench serve --port 7890 --model nvidia/DeepSeek-R1-NVFP4 --dataset-name random --input-len 2 --output-len 100 --max-concurrency 16 --num-prompts 160 --seed $(date +%s) --temperature 0.0
Namespace(subparser='bench', bench_type='serve', dispatch_function=<function BenchmarkServingSubcommand.cmd at 0x794e381fa5c0>, trust_remote_code=False, seed=1773945951, num_prompts=160, dataset_name='random', no_stream=False, dataset_path=None, no_oversample=False, skip_chat_template=False, enable_multimodal_chat=False, disable_shuffle=False, custom_output_len=256, spec_bench_output_len=256, spec_bench_category=None, sonnet_input_len=550, sonnet_output_len=150, sonnet_prefix_len=200, sharegpt_output_len=None, blazedit_min_distance=0.0, blazedit_max_distance=1.0, asr_max_audio_len_sec=inf, asr_min_audio_len_sec=0.0, random_input_len=1024, random_output_len=128, random_range_ratio=0.0, random_prefix_len=0, random_batch_size=1, no_reranker=False, random_mm_base_items_per_request=1, random_mm_num_mm_items_range_ratio=0.0, random_mm_limit_mm_per_prompt={'image': 255, 'video': 1}, random_mm_bucket_config={(256, 256, 1): 0.5, (720, 1280, 1): 0.5, (720, 1280, 16): 0.0}, hf_subset=None, hf_split=None, hf_name=None, hf_output_len=None, prefix_repetition_prefix_len=256, prefix_repetition_suffix_len=256, prefix_repetition_num_prefixes=10, prefix_repetition_output_len=128, label=None, backend='openai', base_url=None, host='127.0.0.1', port=7890, endpoint='/v1/completions', header=None, max_concurrency=16, model='nvidia/DeepSeek-R1-NVFP4', input_len=2, output_len=100, tokenizer=None, tokenizer_mode='auto', use_beam_search=False, logprobs=None, request_rate=inf, burstiness=1.0, disable_tqdm=False, num_warmups=0, profile=False, save_result=False, save_detailed=False, append_result=False, metadata=None, result_dir=None, result_filename=None, ignore_eos=False, percentile_metrics=None, metric_percentiles='99', goodput=None, request_id_prefix='bench-5051fcef-', top_p=None, top_k=None, min_p=None, temperature=0.0, frequency_penalty=None, presence_penalty=None, repetition_penalty=None, served_model_name=None, lora_modules=None, lora_assignment='random', ramp_up_strategy=None, 
ramp_up_start_rps=None, ramp_up_end_rps=None, ready_check_timeout_sec=0, extra_body=None, skip_tokenizer_init=False, insecure=False, plot_timeline=False, timeline_itl_thresholds=[25.0, 50.0], plot_dataset_stats=False)
INFO 03-19 14:45:57 [datasets.py:703] Sampling input_len from [1, 1] and output_len from [100, 100]
Starting initial single prompt test run...
Skipping endpoint ready check.
Starting main benchmark run...
Traffic request rate: inf
Burstiness factor: 1.0 (Poisson process)
Maximum request concurrency: 16
100%|███████████████████████████████████████████████████████████████████| 160/160 [00:14<00:00, 11.32it/s]
tip: install termplotlib and gnuplot to plot the metrics
============ Serving Benchmark Result ============
Successful requests:                     160       
Failed requests:                         0         
Maximum request concurrency:             16        
Benchmark duration (s):                  14.13     
Total input tokens:                      160       
Total generated tokens:                  16000     
Request throughput (req/s):              11.32     
Output token throughput (tok/s):         1132.04   
Peak output token throughput (tok/s):    1168.00   
Peak concurrent requests:                32.00     
Total token throughput (tok/s):          1143.36   
---------------Time to First Token----------------
Mean TTFT (ms):                          51.19     
Median TTFT (ms):                        45.91     
P99 TTFT (ms):                           88.35     
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          13.74     
Median TPOT (ms):                        13.73     
P99 TPOT (ms):                           13.96     
---------------Inter-token Latency----------------
Mean ITL (ms):                           13.74     
Median ITL (ms):                         13.68     
P99 ITL (ms):                            18.07     
==================================================
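As a quick sanity check (not part of the PR itself), the reported output token throughputs in the two runs above are consistent with total generated tokens divided by benchmark duration:

```python
# Cross-check the "Output token throughput" rows against the raw totals
# reported in the same tables (8000 tok / 11.35 s and 16000 tok / 14.13 s).
def throughput(tokens: int, duration_s: float) -> float:
    """Tokens per second over the benchmark window."""
    return tokens / duration_s

run1 = throughput(8000, 11.35)   # close to the reported 704.90 tok/s
run2 = throughput(16000, 14.13)  # close to the reported 1132.04 tok/s
```

The small residuals come from the duration being rounded to two decimals in the printed table.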

@robertgshaw2-redhat
Collaborator

results look identical

@robertgshaw2-redhat
Collaborator

going to merge this. docs build is stuck.

@robertgshaw2-redhat robertgshaw2-redhat merged commit 9279c59 into vllm-project:main Mar 19, 2026
69 of 70 checks passed
chooper26 pushed a commit to intellistream/vllm-hust that referenced this pull request Mar 21, 2026
wangxiyuan pushed a commit to vllm-project/vllm-ascend that referenced this pull request Mar 23, 2026
### What this PR does / why we need it?
Main2main Upgrade vllm commit to 0320 17:00

1. Fix for vllm having refactored `_moe_forward` to call
`runner.forward_impl_chunked()` when `runner.use_dp_chunking` is True.
vllm PR: "[MoE Refactor] DefaultMoERunner simplification
[#33049](vllm-project/vllm#33049)"

2. Fix for vllm having moved the call to `self._set_compile_ranges()` in
`VllmConfig.__post_init__` from **before** `check_and_update_config()`
to **after** it (to allow platforms to lower `max_num_batched_tokens`
first). vllm PR: "fix(xpu): Re-compute compile ranges after
platform-specific config updates"
[#37523](vllm-project/vllm#37523)


### Does this PR introduce _any_ user-facing change?
NA

### How was this patch tested?
NA

- vLLM version: v0.17.0
- vLLM main:
vllm-project/vllm@8b63257

---------

Signed-off-by: leo-pony <nengjunma@outlook.com>
Co-authored-by: Claude Code <noreply@anthropic.com>
starmountain1997 pushed a commit to starmountain1997/vllm-ascend that referenced this pull request Mar 25, 2026
(Same commit message as the vllm-project/vllm-ascend commit above.)

Labels

documentation Improvements or additions to documentation ready ONLY add when PR is ready to merge/full CI is needed v1

2 participants