DRAFT: Mistral Large 3 Extended Blackwell Support #29884

jdebache wants to merge 12 commits into vllm-project:main
Conversation
mgoin
left a comment
It seems there are a few things mixed into this one. Do you think we could prioritize the critical perf features like kernel support and tuned configs?
requirements/cuda.txt (outdated)

```diff
@@ -11,3 +11,4 @@ torchaudio==2.9.0
 torchvision==0.24.0 # Required for phi3v processor. See https://github.com/pytorch/vision?tab=readme-ov-file#installation for corresponding version
 # FlashInfer should be updated together with the Dockerfile
 flashinfer-python==0.5.3
+nvtx==0.2.13
```
I think this isn't a big deal but do we need this dep?
We've added some NVTX ranges to `gpu_model_runner.py` to make profiling easier, which requires the `nvtx` package. This can be split into another PR if desirable.
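For context, a minimal sketch of what such a range helper can look like, assuming the `nvtx` PyPI package (the helper name and the range names are illustrative, not the actual PR code), with a no-op fallback when the package is absent:

```python
from contextlib import contextmanager

try:
    import nvtx  # pip install nvtx
    _HAVE_NVTX = True
except ImportError:
    _HAVE_NVTX = False

@contextmanager
def profile_range(name: str):
    """Emit an NVTX range (visible in Nsight Systems), or fall back to a no-op."""
    if _HAVE_NVTX:
        handle = nvtx.start_range(name)
        try:
            yield
        finally:
            nvtx.end_range(handle)
    else:
        yield

# Illustrative usage inside a model-runner step:
with profile_range("prepare_inputs"):
    batch = list(range(4))
with profile_range("forward"):
    out = [x * 2 for x in batch]
```

The fallback keeps the helper importable on machines without `nvtx` installed, so profiling support does not become a hard dependency.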
vllm/v1/worker/gpu_model_runner.py (outdated)
```python
# Count context tokens per request
context_requests = 0
decode_requests = 0
for req in scheduler_output.scheduled_new_reqs:
    context_len = len(req.prompt_token_ids) if req.prompt_token_ids else 0
    num_computed = req.num_computed_tokens
    if num_computed < context_len:
        context_requests += 1
    else:
        decode_requests += 1
# For cached requests
for i, req_id in enumerate(scheduler_output.scheduled_cached_reqs.req_ids):
    context_len = self.requests[req_id].num_prompt_tokens
    num_computed = scheduler_output.scheduled_cached_reqs.num_computed_tokens[i]
    if num_computed < context_len:
        context_requests += 1
    else:
        decode_requests += 1
```
Can we put this in another PR? We don't want to eat this cost when profiling isn't enabled
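A minimal sketch of one way to address this concern, gating the accounting behind an environment flag so the cost is only paid when profiling is enabled; the flag name and the simplified request representation here are assumptions for illustration, not the PR's actual code:

```python
import os

# Hypothetical flag name; the real change may gate this differently.
PROFILING_ENABLED = os.environ.get("VLLM_NVTX_PROFILING", "0") == "1"

def classify_requests(reqs):
    """Count prefill (context) vs. decode requests.

    Each request is simplified here to a (num_computed_tokens, prompt_len)
    pair; a request still consuming its prompt counts as prefill.
    """
    context_requests = 0
    decode_requests = 0
    for num_computed, prompt_len in reqs:
        if num_computed < prompt_len:
            context_requests += 1
        else:
            decode_requests += 1
    return context_requests, decode_requests

# Only pay the per-step accounting cost when profiling is enabled.
if PROFILING_ENABLED:
    ctx, dec = classify_requests([(0, 5), (5, 5), (3, 8)])
```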
I'll start by splitting the NVTX stuff out, see how it looks after this.
mgoin
left a comment
This looks straightforward to me, thanks for splitting it up!
Force-pushed 4e91a18 to 9f5af28.
This pull request has merge conflicts that must be resolved before it can be merged.
Force-pushed b240199 to 8d0de6b.
Hi @hypdeb, the pre-commit checks have failed. Please run:

```shell
uv pip install pre-commit
pre-commit install
pre-commit run --all-files
```

Then, commit the changes and push to your branch.
Signed-off-by: jdebache <jdebache@nvidia.com>
Signed-off-by: Dimitrios Bariamis <12195802+dbari@users.noreply.github.com>
Signed-off-by: Julien Debache <jdebache@cpu-0002.cm.cluster>
Signed-off-by: Dan Blanaru <48605845+DanBlanaru@users.noreply.github.com>
Force-pushed 8d0de6b to 0c16645.
Signed-off-by: Dimitrios Bariamis <12195802+dbari@users.noreply.github.com>
…ock_scale_moe
Signed-off-by: Dimitrios Bariamis <12195802+dbari@users.noreply.github.com>
I added support for FlashInfer TRTLLM per-block scaled FP8 MoE kernels for the target model; you can cross it off the list. However, some fixes will need to be merged to …
Hi @hypdeb, the pre-commit checks have failed. Please run:

```shell
uv pip install pre-commit
pre-commit install
pre-commit run --all-files
```

Then, commit the changes and push to your branch.
Signed-off-by: Dimitrios Bariamis <12195802+dbari@users.noreply.github.com>
This pull request has merge conflicts that must be resolved before it can be merged.
Purpose
Improve performance and support of Mistral Large 3 on Blackwell.
Details

- `benchmarks/kernels/benchmark_moe.py`
- `vllm/benchmarks/throughput.py`

Best Performance Usage
FP8 Checkpoint on DGX B200 (8 devices)
The FP8 model will fit on a single node.
At low concurrencies, deploy with TP8:
At higher concurrencies (128 concurrent requests and above), deploy with DP8 and expert parallelism:
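As a sketch, the two deployment modes above could look like the following with the standard vLLM CLI (`--tensor-parallel-size`, `--data-parallel-size`, and `--enable-expert-parallel` are existing vLLM flags; the checkpoint path is a placeholder):

```shell
# Low concurrency: tensor parallelism across the 8 devices
vllm serve <fp8-checkpoint-path> \
    --tensor-parallel-size 8

# Higher concurrency (>= 128 concurrent requests): data parallelism
# with expert parallelism for the MoE layers
vllm serve <fp8-checkpoint-path> \
    --data-parallel-size 8 \
    --enable-expert-parallel
```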
NVFP4
For NVFP4 checkpoints, add the following to leverage the optimized kernels from FlashInfer:

With a FlashInfer version > 0.5.3: a bug in the auto-tuner, fixed recently (flashinfer-ai/flashinfer#2140), allows using `flashinfer-cudnn`.

GB200 P/D Disaggregated Dynamo Deployment
There are two options to set up a Dynamo P/D disaggregated deployment of this model. The first one is available immediately and relies on the processing pipeline of Dynamo. The second is pending a PR on Dynamo to enable delegating pre-processing to the vLLM backend.
For compatibility with ToT vLLM, you might need to include some changes that are not currently in upstream Dynamo:
With Dynamo request processing
With Dynamo request processing

Start by copying `config.json` from Ministral to your model directory.

With delegated request processing
Pending some changes (TODO LINK) in Dynamo, you will be able to skip the file-copying step above.
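The file-copying step from the first option can be sketched as follows (both paths are placeholders for your local checkpoint directories):

```shell
# Copy Ministral's config.json into the Mistral Large 3 model directory
# so Dynamo's processing pipeline can read it (paths are illustrative).
cp /path/to/Ministral/config.json /path/to/Mistral-Large-3/
```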
Next steps
We have identified further optimizations which will be part of other PRs:
Contributors
@dbari, @DanBlanaru, @evezhier, @hypdeb