[Bugfix] VLLM_V1 supports passing other compilation levels #19340
Conversation
Hello @zou3519, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!
Summary of Changes
Hello! Gemini here, providing a summary of this pull request by @zou3519. This PR addresses a bug in VLLM V1 where the compilation level was previously hardcoded to PIECEWISE (level 3), preventing users from utilizing other compilation levels. The primary goal is to restore the user's ability to specify different compilation levels, particularly NO_COMPILATION (level 0) and DYNAMO_AS_IS (level 1), which are valuable for debugging purposes. The changes involve modifying the configuration handling, adding specific logic for the DYNAMO_AS_IS level in the V1 worker, introducing a counter for tracking DYNAMO_AS_IS usage, and adding new tests to cover levels 0 and 1.
Highlights
- Bugfix: Allow non-PIECEWISE compilation levels in V1: Previously, VLLM V1 forced the compilation level to PIECEWISE. This PR removes that restriction, allowing users to specify other levels like `NO_COMPILATION` (0) or `DYNAMO_AS_IS` (1).
- Configuration Update: The default compilation level in `CompilationConfig` is updated to be PIECEWISE if `VLLM_USE_V1` is enabled, and NO_COMPILATION otherwise. Crucially, the explicit forcing of PIECEWISE within the V1 config initialization (`VllmConfig.__post_init__`) has been removed.
- DYNAMO_AS_IS Handling: Specific logic is added in the V1 GPU model runner (`vllm/v1/worker/gpu_model_runner.py`) to explicitly call `model.compile` when the `DYNAMO_AS_IS` compilation level is selected and supported.
- Compilation Counter: A new counter, `dynamo_as_is_count`, is added to `vllm/compilation/counter.py` to track how many times the `DYNAMO_AS_IS` compilation level is used. This counter is incremented in both the V0 and V1 model loading paths.
- New Tests: Tests are added in `tests/compile/test_config.py` to specifically verify the behavior of `CompilationLevel.DYNAMO_AS_IS` (level 1) and `CompilationLevel.NO_COMPILATION` (level 0) using the new compilation counter.
Changelog
- tests/compile/test_config.py
  - Added `test_dynamo_as_is` to verify compilation level 1 (DYNAMO_AS_IS).
  - Added `test_no_compilation` to verify compilation level 0 (NO_COMPILATION).
- vllm/compilation/counter.py
  - Added `dynamo_as_is_count` to track usage of `CompilationLevel.DYNAMO_AS_IS`.
- vllm/config.py
  - Changed the default value of `CompilationConfig.level` to be PIECEWISE for V1 and NO_COMPILATION otherwise (lines 3870-3871).
  - Removed the line that explicitly forced `self.compilation_config.level = CompilationLevel.PIECEWISE` in `VllmConfig.__post_init__` (removed line 4446).
- vllm/v1/worker/gpu_model_runner.py
  - Imported `envs`, `compilation_counter`, and `supports_dynamo` (lines 15, 21, 43).
  - Added logic in `load_model` to call `self.model.compile` and increment `compilation_counter.dynamo_as_is_count` when `CompilationLevel.DYNAMO_AS_IS` is used (lines 1626-1635).
- vllm/worker/model_runner.py
  - Imported `compilation_counter` (line 25).
  - Added a line to increment `compilation_counter.dynamo_as_is_count` within the `DYNAMO_AS_IS` check in `load_model` (line 1220).
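For readers who want the shape of the V1 change without opening the diff, here is a minimal sketch of the `DYNAMO_AS_IS` branch described above. It is a reconstruction from this changelog, not the verbatim diff; in particular, `init_backend` and the exact attribute paths are assumptions based on vLLM's existing compilation config.

```python
import torch

import vllm.envs as envs
from vllm.compilation.counter import compilation_counter
from vllm.config import CompilationLevel, VllmConfig
from vllm.utils import supports_dynamo


def maybe_compile_as_is(model: torch.nn.Module, vllm_config: VllmConfig) -> None:
    """Whole-model torch.compile for -O1 (DYNAMO_AS_IS), mirroring V0."""
    compilation_config = vllm_config.compilation_config
    if (compilation_config.level == CompilationLevel.DYNAMO_AS_IS
            and supports_dynamo()):
        backend = compilation_config.init_backend(vllm_config)
        compilation_counter.dynamo_as_is_count += 1
        # nn.Module.compile wraps forward in place, so the model runner keeps
        # using the same model object afterwards.
        model.compile(
            fullgraph=envs.VLLM_TEST_DYNAMO_FULLGRAPH_CAPTURE,
            backend=backend,
        )
```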
Hello @zou3519, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!
Summary of Changes
Hello team, Gemini here with a summary of this pull request. This PR addresses a limitation in VLLM V1 where the compilation level was previously hardcoded to PIECEWISE (level 3). The main goal is to allow users to specify and utilize other compilation levels, specifically levels 0 (NO_COMPILATION) and 1 (DYNAMO_AS_IS), which are particularly useful for debugging scenarios as highlighted in the description. The changes involve modifying the configuration logic to respect the user-provided compilation level, adding specific handling for the DYNAMO_AS_IS level in the V1 model runner, and introducing new tests to ensure these alternative compilation levels function correctly.
Highlights
- Compilation Level Flexibility: VLLM V1 now respects the user-configured compilation level instead of forcing PIECEWISE (level 3).
- Debugging Support: Enables the use of NO_COMPILATION (level 0) and DYNAMO_AS_IS (level 1) for easier debugging of potential issues.
- Configuration Update: Removed the explicit forcing of PIECEWISE compilation level in the V1 configuration initialization.
- New Tests: Added tests (`test_dynamo_as_is`, `test_no_compilation`) to verify the behavior of compilation levels 1 and 0.
- Compilation Counter: Introduced a counter (`dynamo_as_is_count`) to track occurrences of the DYNAMO_AS_IS compilation level being used.
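As a rough sketch of what such a test can assert (hedged: the model name is a placeholder and this is not the PR's actual test code; it also assumes the engine runs in-process so the counter is visible to the test):

```python
from vllm import LLM
from vllm.compilation.counter import compilation_counter


def test_no_compilation():
    # Level 0 should produce no Dynamo activity at all, so the
    # DYNAMO_AS_IS counter must not move.
    before = compilation_counter.dynamo_as_is_count
    llm = LLM(model="facebook/opt-125m", compilation_config=0)
    llm.generate(["hello"])
    assert compilation_counter.dynamo_as_is_count == before
```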
Changelog
- tests/compile/test_config.py
  - Added `test_dynamo_as_is` to verify model loading with compilation level 1 (DYNAMO_AS_IS) and check the corresponding counter.
  - Added `test_no_compilation` to verify model loading with compilation level 0 (NO_COMPILATION) and check that no graphs or dynamo compilations occur.
- vllm/compilation/counter.py
  - Added a `dynamo_as_is_count` field to the `CompilationCounter` class to track the number of times the DYNAMO_AS_IS compilation level is used.
- vllm/config.py
  - Modified the default value of `CompilationConfig.level` to be `CompilationLevel.PIECEWISE` if `VLLM_USE_V1` is true, otherwise `CompilationLevel.NO_COMPILATION` (DIFF 1).
  - Removed the line `self.compilation_config.level = CompilationLevel.PIECEWISE` from the `VllmConfig.__post_init__` method when `VLLM_USE_V1` is true, which previously forced this level (DIFF 2).
- vllm/v1/worker/gpu_model_runner.py
  - Imported `envs`, `compilation_counter`, and `supports_dynamo`.
  - Added a conditional block in the `load_model` method to explicitly call `self.model.compile` with the appropriate backend and increment `compilation_counter.dynamo_as_is_count` when the compilation level is `DYNAMO_AS_IS` and dynamo is supported.
- vllm/worker/model_runner.py
  - Imported `compilation_counter`.
  - Added `compilation_counter.dynamo_as_is_count += 1` within the existing `torch.compile` block for the `DYNAMO_AS_IS` level in the `load_model` method (this seems to be the non-V1 path).
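The counter change itself is small; a hedged sketch (the pre-existing field names here are illustrative, and only `dynamo_as_is_count` is the addition from this PR):

```python
from dataclasses import dataclass


@dataclass
class CompilationCounter:
    # Pre-existing counters (names illustrative; see vllm/compilation/counter.py)
    num_graphs_seen: int = 0
    num_backend_compilations: int = 0
    # Added in this PR: counts full-model compiles triggered by DYNAMO_AS_IS
    dynamo_as_is_count: int = 0


# Module-level singleton that the model runners increment.
compilation_counter = CompilationCounter()
```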
Code Review
This pull request effectively addresses the issue where the V1 engine previously forced the compilation level to PIECEWISE. By removing this forced setting and adjusting the default compilation level, users now have the flexibility to specify other compilation levels, which is particularly useful for debugging purposes as highlighted in the PR description.
The changes are clear, and the addition of tests for CompilationLevel.NO_COMPILATION (level 0) and CompilationLevel.DYNAMO_AS_IS (level 1) is a great step towards ensuring correctness. The new counter for DYNAMO_AS_IS compilations is also a good addition for observability and testing.
Regarding the test plan, for CompilationLevel.DYNAMO_ONCE (level 2), you might consider a test that verifies the model is compiled only once. This could involve:
- Setting the compilation level to 2.
- Running the model with a set of inputs and checking a relevant compilation counter (e.g., `num_backend_compilations` or a more specific Dynamo counter, if available) to confirm compilation occurred.
- Running the model again (perhaps with different data but compatible shapes if the compilation is shape-generic, or same-shaped data).
- Asserting that the compilation counter did not increment on the second run, indicating the cached compiled graph was used.
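If helpful, a hedged sketch of that test could look like the following (the model name is a placeholder, and relying on `torch._dynamo`'s built-in counters, visible only if the engine runs in-process, is my assumption; this is not code from the PR):

```python
import torch._dynamo
from torch._dynamo.utils import counters

from vllm import LLM, SamplingParams


def test_dynamo_once_compiles_only_once():
    torch._dynamo.reset()  # clear graphs cached by earlier tests
    llm = LLM(model="facebook/opt-125m", compilation_config=2)  # DYNAMO_ONCE
    params = SamplingParams(max_tokens=8)

    llm.generate(["hello world"], params)
    first = counters["stats"]["unique_graphs"]
    assert first > 0  # the first run triggered compilation

    llm.generate(["another prompt, same shapes"], params)
    assert counters["stats"]["unique_graphs"] == first  # cached graph reused
```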
Overall, the PR is well-structured and improves the flexibility of the V1 engine's compilation features. Nice work!
Summary of Findings
- Stylistic Suggestion: Line Continuation: In `vllm/v1/worker/gpu_model_runner.py` (lines 1627-1628), the `if` condition uses a `\` for line continuation. Since the condition is already enclosed in parentheses, this backslash is redundant and can be removed for slightly improved readability, adhering to PEP 8 guidelines for line breaks within parentheses. This is a minor stylistic point and does not affect functionality.
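To illustrate the point with a self-contained toy snippet (placeholder names, not the verbatim diff):

```python
# Placeholder names so the snippet runs on its own.
level, DYNAMO_AS_IS = 1, 1


def supports_dynamo() -> bool:
    return True


# Before: the backslash is redundant inside parentheses.
if (level == DYNAMO_AS_IS and \
        supports_dynamo()):
    pass

# After: the parentheses alone carry the implicit line break (PEP 8).
if (level == DYNAMO_AS_IS
        and supports_dynamo()):
    pass
```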
Merge Readiness
The pull request appears to be in good shape and addresses the intended bugfix effectively. The core changes are sound, and the new tests cover the enabled compilation levels (0 and 1). The identified stylistic point is minor. I believe this PR is ready for merge after considering the feedback. As I am an AI, I am not authorized to approve pull requests; please ensure other reviewers approve this change before merging.
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels. Just a reminder: PRs do not trigger a full CI run by default. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging. 🚀
Code Review
This pull request effectively addresses the issue where VLLM V1 previously forced the compilation level to PIECEWISE. The changes allow V1 to respect user-defined compilation levels, which is particularly useful for debugging purposes, as highlighted in the description. The introduction of tests for compilation levels 0 (NO_COMPILATION) and 1 (DYNAMO_AS_IS) is a good addition, and the update to CompilationCounter to track DYNAMO_AS_IS usage is appropriate.
The core logic changes in vllm/config.py to adjust the default compilation level and remove the override in VllmConfig.__post_init__ are correct. The modifications in vllm/v1/worker/gpu_model_runner.py and vllm/worker/model_runner.py to handle DYNAMO_AS_IS and update the counter are also well-implemented.
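For context, the effective default-level logic after this change looks roughly like the following (a sketch of intent; the real field is defined through vLLM's config machinery rather than module-level code):

```python
import vllm.envs as envs
from vllm.config import CompilationLevel

# Default for CompilationConfig.level after this PR: PIECEWISE under V1,
# NO_COMPILATION otherwise. Users can still override it explicitly.
default_level = (CompilationLevel.PIECEWISE
                 if envs.VLLM_USE_V1 else CompilationLevel.NO_COMPILATION)
```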
One area that might need further consideration or clarification is the handling of CompilationLevel.DYNAMO_ONCE (level 2) in the V1 execution path, as detailed in the specific comment.
Summary of Findings
- Handling of `CompilationLevel.DYNAMO_ONCE` (level 2) in V1: The V1 `gpu_model_runner.py` explicitly calls `torch.compile` for `CompilationLevel.DYNAMO_AS_IS` but not for `CompilationLevel.DYNAMO_ONCE`. This differs from V0's behavior. Clarification on whether V1 should also perform a full model compile for level 2 would be beneficial.
- Use of `VLLM_TEST_DYNAMO_FULLGRAPH_CAPTURE`: In `vllm/v1/worker/gpu_model_runner.py` (and `vllm/worker/model_runner.py`), the `fullgraph` parameter for `torch.compile` is set using `envs.VLLM_TEST_DYNAMO_FULLGRAPH_CAPTURE`. This suggests it's primarily for testing. While this might be acceptable for debugging-focused compilation levels, it's worth noting. (Severity: low, not commented inline due to settings)
Merge Readiness
The pull request is well on its way to achieving its goals. The primary changes for supporting compilation levels 0 and 1 are solid. However, there's a medium-severity point regarding the handling of CompilationLevel.DYNAMO_ONCE in the V1 pathway that should be addressed or clarified before merging. Addressing this will ensure consistent behavior or clear understanding of the differences between V0 and V1 for all compilation levels. I am unable to approve pull requests, so please have another reviewer approve this code before merging.
This pull request has merge conflicts that must be resolved before it can be merged.
vllm/v1/worker/gpu_model_runner.py
Outdated
Could we move this to the TorchCompileDispatcher wrapper that already handles the other two levels, or is that not possible?
This is how it worked in V0; this was handled separately:

vllm/compilation/decorators.py, lines 154-155 at 10be209:

```python
# for CompilationLevel.DYNAMO_AS_IS , the upper level model runner
# will handle the compilation, so we don't need to do anything here.
```
I don't want to boil the ocean in this PR, but I agree everything is nicer when they're together. I can see if this is possible in a follow-up if that sounds reasonable.
Yep super fair, I have a (bad) habit of asking for improvements on code that's not directly related to the PR
houseroad left a comment
Looks reasonable. Left some comments.
vllm/v1/worker/gpu_model_runner.py
Outdated
we should consider removing TEST from this env var name as well.
This is pre-existing; also, this flag is now True by default, so I will try to delete it in a follow-up.
@zou3519 should we merge this?
This pull request has merge conflicts that must be resolved before it can be merged.
vllm/config.py
Outdated
This will appear in the API and CLI docs, but the note is outside of the docstring, so the reader won't know what the default behaviour is.
Fair enough, I'll move it up here.
Summary:
Before this PR, V1 always forced the compilation level to be 3 (PIECEWISE). In VLLM V0, users had an option to set the compilation level to be something else. This PR changes V1 to respect the compilation level set by the user.
The other compilation levels are mostly used for debugging, and they are very helpful for that. I've been trying to track down some silent incorrectness, and -O1 runs vLLM with stock torch.compile, so it helps show whether the problem is in stock torch.compile or in vLLM.
Test Plan:
Added some new tests covering CompilationLevel 0 and 1. I'm not sure exactly how to write a behavior test for CompilationLevel 2.
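With this change, selecting a level under V1 looks like the following hedged sketch (the model name is a placeholder; passing a bare int as `compilation_config` mirrors the CLI's `-O` shorthand):

```python
from vllm import LLM

# Level 0 = NO_COMPILATION, 1 = DYNAMO_AS_IS (stock torch.compile),
# 3 = PIECEWISE (the V1 default). Equivalent to `vllm serve ... -O1`.
llm = LLM(model="facebook/opt-125m", compilation_config=1)
outputs = llm.generate(["The quick brown fox"])
print(outputs[0].outputs[0].text)
```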