DRAFT: Mistral Large 3 Extended Blackwell Support #29884

jdebache wants to merge 12 commits into vllm-project:main
Conversation
mgoin
left a comment
It seems there are a few things mixed into this one. Do you think we could prioritize the critical perf features like kernel support and tuned configs?
requirements/cuda.txt (outdated)

```diff
@@ -11,3 +11,4 @@ torchaudio==2.9.0
 torchvision==0.24.0 # Required for phi3v processor. See https://github.com/pytorch/vision?tab=readme-ov-file#installation for corresponding version
 # FlashInfer should be updated together with the Dockerfile
 flashinfer-python==0.5.3
+nvtx==0.2.13
```
I think this isn't a big deal but do we need this dep?
We've added some NVTX ranges to `gpu_model_runner.py` to make profiling easier, which requires the `nvtx` package. This can be split into another PR if desirable.
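For context, a minimal sketch of what such a range helper can look like, assuming the `nvtx` PyPI package (the helper name and the range names are illustrative, not the actual PR code), with a no-op fallback when the package is absent:

```python
from contextlib import contextmanager

try:
    import nvtx  # pip install nvtx
    _HAVE_NVTX = True
except ImportError:
    _HAVE_NVTX = False

@contextmanager
def profile_range(name: str):
    """Emit an NVTX range (visible in Nsight Systems), or fall back to a no-op."""
    if _HAVE_NVTX:
        handle = nvtx.start_range(name)
        try:
            yield
        finally:
            nvtx.end_range(handle)
    else:
        yield

# Illustrative usage inside a model-runner step:
with profile_range("prepare_inputs"):
    batch = list(range(4))
with profile_range("forward"):
    out = [x * 2 for x in batch]
```

The fallback keeps the helper importable on machines without `nvtx` installed, so profiling support does not become a hard dependency.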
vllm/v1/worker/gpu_model_runner.py (outdated)
```python
# Count context tokens per request
context_requests = 0
decode_requests = 0
for req in scheduler_output.scheduled_new_reqs:
    context_len = len(req.prompt_token_ids) if req.prompt_token_ids else 0
    num_computed = req.num_computed_tokens
    if num_computed < context_len:
        context_requests += 1
    else:
        decode_requests += 1
# For cached requests
for i, req_id in enumerate(scheduler_output.scheduled_cached_reqs.req_ids):
    context_len = self.requests[req_id].num_prompt_tokens
    num_computed = scheduler_output.scheduled_cached_reqs.num_computed_tokens[i]
    if num_computed < context_len:
        context_requests += 1
    else:
        decode_requests += 1
```
Can we put this in another PR? We don't want to eat this cost when profiling isn't enabled
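A minimal sketch of one way to address this concern, gating the accounting behind an environment flag so the cost is only paid when profiling is enabled; the flag name and the simplified request representation here are assumptions for illustration, not the PR's actual code:

```python
import os

# Hypothetical flag name; the real change may gate this differently.
PROFILING_ENABLED = os.environ.get("VLLM_NVTX_PROFILING", "0") == "1"

def classify_requests(reqs):
    """Count prefill (context) vs. decode requests.

    Each request is simplified here to a (num_computed_tokens, prompt_len)
    pair; a request still consuming its prompt counts as prefill.
    """
    context_requests = 0
    decode_requests = 0
    for num_computed, prompt_len in reqs:
        if num_computed < prompt_len:
            context_requests += 1
        else:
            decode_requests += 1
    return context_requests, decode_requests

# Only pay the per-step accounting cost when profiling is enabled.
if PROFILING_ENABLED:
    ctx, dec = classify_requests([(0, 5), (5, 5), (3, 8)])
```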
I'll start by splitting the NVTX stuff out, see how it looks after this.
mgoin
left a comment
This looks straightforward to me, thanks for splitting it up!
Force-pushed 4e91a18 to 9f5af28.
This pull request has merge conflicts that must be resolved before it can be merged.
Force-pushed b240199 to 8d0de6b.
Hi @hypdeb, the pre-commit checks have failed. Please run:

```shell
uv pip install pre-commit
pre-commit install
pre-commit run --all-files
```

Then, commit the changes and push to your branch.
Signed-off-by: jdebache <jdebache@nvidia.com>
Signed-off-by: Dimitrios Bariamis <12195802+dbari@users.noreply.github.com>
Signed-off-by: Julien Debache <jdebache@cpu-0002.cm.cluster>
Signed-off-by: Dan Blanaru <48605845+DanBlanaru@users.noreply.github.com>
Force-pushed 8d0de6b to 0c16645.
Signed-off-by: Dimitrios Bariamis <12195802+dbari@users.noreply.github.com>
…ock_scale_moe
Signed-off-by: Dimitrios Bariamis <12195802+dbari@users.noreply.github.com>
I added support for FlashInfer TRTLLM per-block scaled FP8 MoE kernels for the target model; you can cross it off the list. However, some fixes will need to be merged to …
Hi @hypdeb, the pre-commit checks have failed. Please run:

```shell
uv pip install pre-commit
pre-commit install
pre-commit run --all-files
```

Then, commit the changes and push to your branch.
Signed-off-by: Dimitrios Bariamis <12195802+dbari@users.noreply.github.com>
This pull request has merge conflicts that must be resolved before it can be merged.
Purpose
Improve performance and support of Mistral Large 3 on Blackwell.
Details

- `benchmarks/kernels/benchmark_moe.py`
- `vllm/benchmarks/throughput.py`

Best Performance Usage
FP8 Checkpoint on DGX B200 (8 devices)
The FP8 model will fit on a single node.
At low concurrencies, deploy with TP8:
At higher concurrencies (128 concurrent requests and above), deploy with DP8 and expert parallelism:
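As a sketch, the two deployment modes above could look like the following with the standard vLLM CLI (`--tensor-parallel-size`, `--data-parallel-size`, and `--enable-expert-parallel` are existing vLLM flags; the checkpoint path is a placeholder):

```shell
# Low concurrency: tensor parallelism across the 8 devices
vllm serve <fp8-checkpoint-path> \
    --tensor-parallel-size 8

# Higher concurrency (>= 128 concurrent requests): data parallelism
# with expert parallelism for the MoE layers
vllm serve <fp8-checkpoint-path> \
    --data-parallel-size 8 \
    --enable-expert-parallel
```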
NVFP4
For NVFP4 checkpoints, add the following to leverage the optimized kernels from FlashInfer:

With a FlashInfer version > 0.5.3: a bug in the auto-tuner, fixed recently (flashinfer-ai/flashinfer#2140), allows using `flashinfer-cudnn`.

GB200 P/D Disaggregated Dynamo Deployment
There are two options to set up a Dynamo P/D disaggregated deployment of this model. The first one is available immediately and relies on the processing pipeline of Dynamo. The second is pending a PR on Dynamo to enable delegating pre-processing to the vLLM backend.
For compatibility with ToT vLLM, you might need to include some changes that are not currently in upstream Dynamo:
With Dynamo request processing
With Dynamo request processing

Start by copying `config.json` from Ministral to your model directory.

With delegated request processing
Pending some changes (TODO LINK) in Dynamo, you will be able to skip the file-copying step above.
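The file-copying step from the first option can be sketched as follows (both paths are placeholders for your local checkpoint directories):

```shell
# Copy Ministral's config.json into the Mistral Large 3 model directory
# so Dynamo's processing pipeline can read it (paths are illustrative).
cp /path/to/Ministral/config.json /path/to/Mistral-Large-3/
```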
Next steps
We have identified further optimizations which will be part of other PRs:
Contributors
@dbari, @DanBlanaru, @evezhier, @hypdeb