Use maximum number of batched tokens to autotune MoE #28106
nvjullin wants to merge 2 commits into vllm-project:main
Conversation
Code Review
This pull request updates the Mixture-of-Experts (MoE) autotuning logic to use the maximum number of batched tokens from the scheduler configuration, which is a more appropriate parameter for this purpose than the CUDA graph capture size. The changes are logical, but the refactoring is incomplete: a removed attribute is still accessed elsewhere in the code, which is a critical issue.
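The idea behind the change can be illustrated with a small sketch. All names below are hypothetical and not taken from the PR's diff; it only shows the general shape of deriving autotune token-count buckets from the scheduler's `max_num_batched_tokens` cap instead of from a CUDA graph capture size:

```python
# Illustrative sketch, not the PR's actual code: build the list of
# token-count buckets to autotune the MoE kernels over, capped by the
# scheduler's max_num_batched_tokens (a hypothetical standalone helper).

def autotune_token_buckets(max_num_batched_tokens: int) -> list[int]:
    """Power-of-two token counts up to and including the scheduler cap."""
    buckets = []
    n = 1
    while n < max_num_batched_tokens:
        buckets.append(n)
        n *= 2
    # Always tune at the cap itself, even if it is not a power of two.
    buckets.append(max_num_batched_tokens)
    return buckets

print(autotune_token_buckets(8192))
```

Using the scheduler cap guarantees the tuned shapes cover every batch the scheduler can actually produce, whereas the CUDA graph capture size is tied to a different concern (graph replay) and may not match the real upper bound.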
💡 Codex Review
Here are some automated review suggestions for this pull request.
Signed-off-by: Julien Lin <jullin@nvidia.com>
@nvjullin could you rebase so that we can keep driving this PR? Thanks!
Purpose
Follow up on #27904.
CC @varun-sundar-rabindranath.
Test Plan
Test Result
Essential Elements of an Effective PR Description Checklist
`supported_models.md` and `examples` for a new model.