[Frontend][torch.compile] CompilationConfig Overhaul (#20283): Set up -O infrastructure #26847
Conversation

Documentation preview: https://vllm--26847.org.readthedocs.build/en/26847/
@ProExpertProg Tracking a list of keys for determining if a model is MoE/sequential here.
vllm/config/vllm.py (outdated):

    logger = init_logger(__name__)

    # PassConfig preset instances for each compilation mode. Default fields set.
    pass_config_none = PassConfig(
@ProExpertProg let us know what the defaults for each mode need to be.
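For context, the preset idea under discussion could look something like the sketch below. The field names (`enable_noop`, `enable_fusion`) and the exact per-level values are illustrative assumptions, not vLLM's actual `PassConfig` API:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PassConfig:
    """Illustrative stand-in for vLLM's PassConfig; field names are assumptions."""
    enable_noop: bool = False
    enable_fusion: bool = False


# Preset instances keyed by compilation mode (-O0 .. -O3).
pass_config_none = PassConfig()
pass_config_full = PassConfig(enable_noop=True, enable_fusion=True)

optimization_level_to_pass_config = {
    0: pass_config_none,               # -O0: no passes
    1: PassConfig(enable_noop=True),   # -O1: cheap passes only
    2: pass_config_full,               # -O2: all passes
    3: pass_config_full,               # -O3: currently mirrors -O2
}
```

Frozen instances keep the shared presets from being mutated when a level hands its preset to a config.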
ProExpertProg left a comment:
Some initial thoughts
vllm/config/vllm.py (outdated):

    - O3 (VLLM_COMPILE): Maximum optimization with autotuning
    """
    # TODO: Implement model-specific parameters
    default_config = optimization_level_to_config[self.optimization_level]
Here we should ask the platform for an opinion on the defaults.
3 options:
- we pass default to platform, it makes modifications
- platform returns default, can use these global defaults as a starting point
- each platform owns its own defaults
Let's chat about this at the end of the release meeting, maybe.
@ProExpertProg Are we still including platform specific modifications? Are we pushing that until inductor partition is on by default?
I think for now we can just do this on is_cuda_alike() platforms. But we're gonna need a more robust approach here
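The first option above (pass the global default to the platform, which makes modifications) could be sketched as follows. The hook name `adjust_compilation_defaults`, the class names, and the fields are all hypothetical, chosen only to illustrate the shape of the discussion:

```python
from dataclasses import dataclass


@dataclass
class CompilationDefaults:
    """Illustrative default bundle; field names are assumptions."""
    use_inductor: bool = True
    cudagraph: bool = True


class Platform:
    # Hypothetical hook: receives the global defaults and may modify them.
    def adjust_compilation_defaults(self, d: CompilationDefaults) -> CompilationDefaults:
        return d  # generic platforms keep the global defaults


class CudaAlikePlatform(Platform):
    def adjust_compilation_defaults(self, d: CompilationDefaults) -> CompilationDefaults:
        return d  # CUDA-alike platforms keep graph capture on


class CpuPlatform(Platform):
    def adjust_compilation_defaults(self, d: CompilationDefaults) -> CompilationDefaults:
        d.cudagraph = False  # no CUDA graphs off-GPU
        return d


def resolve_defaults(platform: Platform) -> CompilationDefaults:
    # Option 1: global defaults are built first, then handed to the platform.
    return platform.adjust_compilation_defaults(CompilationDefaults())
```

Options 2 and 3 from the list differ only in where the starting point lives: the platform would either return its own config seeded from the globals, or own the defaults entirely.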
Signed-off-by: morrison-turnansky <mturnans@redhat.com>
Docstring under review (diff):

    This is separate from general `CompilationConfig` so that inductor passes
    don't all have access to full configuration - that would create a cycle as
    the `PassManager` is set as a property of config.
This was actually changed recently; we should still keep the pass config separate, but this reason is no longer true. Can be done in a follow-up. cc @ilmarkov
Co-authored-by: Luka Govedič <ProExpertProg@users.noreply.github.com> Signed-off-by: Morrison Turnansky <mturnans@redhat.com>
vllm/config/vllm.py (outdated):

    ):
        logger.info(
    -        "Cudagraph mode %s is not compatible with compilation mode %s. "
    +        "Cudagraph mode is not compatible with compilation mode %s."
Why this? Aren't you giving it 2 strings?
cudagraph mode could be None (not the enum), and I didn't want to write None to the user in the log, but yes, I should clean it up. At one point I moved this check after the defaults were applied and saw a failure in Example Test. I wanted to keep everything exactly the same, but I'm now more convinced that the test is just very flaky.
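The formatting concern can be sketched as a small helper: build the message conditionally so the user never sees a literal "None". The function name is illustrative, not the code in the PR:

```python
def incompat_message(cudagraph_mode, compilation_mode) -> str:
    """Build the incompatibility warning, omitting the mode when it is unset."""
    if cudagraph_mode is None:
        return (
            f"Cudagraph mode is not compatible with "
            f"compilation mode {compilation_mode}."
        )
    return (
        f"Cudagraph mode {cudagraph_mode} is not compatible with "
        f"compilation mode {compilation_mode}."
    )
```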
[Frontend][torch.compile] CompilationConfig Overhaul (#20283): Set up -O infrastructure (vllm-project#26847) Signed-off-by: morrison-turnansky <mturnans@redhat.com> Signed-off-by: adabeyta <aabeyta@redhat.com> Signed-off-by: Morrison Turnansky <mturnans@redhat.com> Co-authored-by: adabeyta <aabeyta@redhat.com> Co-authored-by: Luka Govedič <ProExpertProg@users.noreply.github.com> Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
1. fix vllm-project/vllm#28542: the model-structure modifications involved are Qwen2.5-VL (some patches still exist), Qwen2-VL, Qwen2, the DeepSeek series, and the Qwen-MoE series
2. fix vllm-project/vllm#29121: the output token type has changed from np to `list[list[int]]`
3. fix vllm-project/vllm#29262: the `xformers` backend for multimodal has been deprecated
4. fix vllm-project/vllm#29342
5. fix vllm-project/vllm#28579
6. fix vllm-project/vllm#28718
7. fix vllm-project/vllm#28665
8. fix vllm-project/vllm#26847: vLLM introduced `optimization-level`, some default config values have changed, and the `--enforce-eager` param has been deprecated
9. fix vllm-project/vllm#29223: it now returns a tuple for the sampler
10. fix vllm-project/vllm#29471: we'll remove the related patch to avoid this kind of error

vLLM version: v0.11.2

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com> Signed-off-by: wangli <wangli858794774@gmail.com> Signed-off-by: hfadzxy <starmoon_zhang@163.com> Co-authored-by: wangli <wangli858794774@gmail.com> Co-authored-by: hfadzxy <starmoon_zhang@163.com>
CompilationConfig Overhaul and Optimization Levels
Overview
Description of changes to `VllmConfig` and `CompilationConfig`: the introduction of meaningful optimization levels (`-O0`, `-O1`, `-O2`, `-O3`) from GitHub Issue #20283. This change aims to improve user experience by providing intuitive optimization levels that trade startup time for performance, while consolidating and simplifying the compilation configuration system. We have also changed defaults to help users get the desired out-of-the-box performance. These defaults are determined by `optimization-level`. Importantly, defaults are purely defaults; explicit user settings will not be overwritten.

Key Changes
1. Repurposing `-O` for Optimization Levels

The `-O<n>` flags now represent meaningful optimization levels that trade startup time for performance:

- `-O0`: No optimization; replaces `--enforce-eager` (now deprecated)
- `-O1`: Quick optimizations
- `-O2`: Full optimizations (default): `-O1` + `CUDAGraphMode.FULL_AND_PIECEWISE`
- `-O3`: Full optimization. Still in development; the infrastructure is added now to avoid changing the API in a future release. Currently behaves the same as `-O2`.
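The "defaults are purely defaults" rule from the overview can be illustrated with a minimal sketch. The field names and the per-level preset values below are hypothetical stand-ins, not the actual vLLM presets:

```python
from dataclasses import dataclass, fields
from typing import Optional


@dataclass
class CompilationConfig:
    """Illustrative config: None means the user left the field unset."""
    mode: Optional[int] = None
    cudagraph_mode: Optional[str] = None


# Hypothetical per-level default presets (values are illustrative).
LEVEL_DEFAULTS = {
    0: CompilationConfig(mode=0, cudagraph_mode="NONE"),
    1: CompilationConfig(mode=3, cudagraph_mode="PIECEWISE"),
    2: CompilationConfig(mode=3, cudagraph_mode="FULL_AND_PIECEWISE"),
    3: CompilationConfig(mode=3, cudagraph_mode="FULL_AND_PIECEWISE"),  # same as -O2 for now
}


def apply_defaults(user: CompilationConfig, level: int) -> CompilationConfig:
    """Fill only fields the user left unset; never overwrite explicit settings."""
    default = LEVEL_DEFAULTS[level]
    for f in fields(user):
        if getattr(user, f.name) is None:
            setattr(user, f.name, getattr(default, f.name))
    return user
```

For example, a user who explicitly sets `cudagraph_mode="NONE"` under `-O2` keeps that choice; only the unset `mode` field is filled from the level's preset.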
Troubleshooting
Common Issues
- Use `-O0` or `-O1` for faster startup
- Set `debug_dump_path` for additional debugging information
- Use `-O2` for production

Added functions

Added functionality for determining whether a model is quantized and whether a model is MoE; this will be relevant for future work. Also added lambdas to easily get information about the configuration.
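The detection helpers described above could look roughly like this sketch, probing a tracked list of config keys (as mentioned in the review discussion). The key names and function names here are illustrative assumptions, not the PR's exact implementation:

```python
# Hypothetical list of hf_config keys whose presence indicates an MoE model.
MOE_CONFIG_KEYS = ("num_experts", "n_routed_experts", "num_local_experts")


def is_moe_model(hf_config: dict) -> bool:
    """True if any known expert-count key is present and non-zero."""
    return any(hf_config.get(key) for key in MOE_CONFIG_KEYS)


def is_quantized(hf_config: dict) -> bool:
    """True if the model config carries a quantization_config section."""
    return hf_config.get("quantization_config") is not None
```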
Test Plan
Test Result