
[Frontend][torch.compile] CompilationConfig Overhaul (#20283): Set up -O infrastructure#26847

Merged
vllm-bot merged 85 commits into vllm-project:main from morrison-turnansky:issue-20283-model-config
Nov 27, 2025

Conversation

@morrison-turnansky (Contributor) commented Oct 14, 2025

CompilationConfig Overhaul and Optimization Levels

Overview

This PR overhauls VllmConfig and CompilationConfig and introduces meaningful optimization levels (-O0, -O1, -O2, -O3), as proposed in GitHub Issue #20283. The change aims to improve user experience by providing intuitive optimization levels that trade startup time for performance, while consolidating and simplifying the compilation configuration system. We have also changed defaults to give users the desired out-of-the-box performance; these defaults are determined by the optimization level. Importantly, defaults are purely defaults: explicit user settings will never be overwritten.
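The "defaults never overwrite explicit settings" rule can be sketched as follows. This is a hypothetical helper for illustration, not the actual vLLM implementation; the level-to-default mapping shown is assumed from the descriptions in this PR.

```python
# Hypothetical sketch: optimization-level defaults are applied only to
# fields the user left unset (None); explicit settings always win.
_LEVEL_DEFAULTS = {
    0: {"cudagraph_mode": "NONE", "use_inductor": False},
    1: {"cudagraph_mode": "PIECEWISE", "use_inductor": True},
    2: {"cudagraph_mode": "FULL_AND_PIECEWISE", "use_inductor": True},
}

def apply_level_defaults(level: int, user_settings: dict) -> dict:
    """Merge level defaults under user settings; user values take priority."""
    merged = dict(_LEVEL_DEFAULTS[level])
    for key, value in user_settings.items():
        if value is not None:  # only explicit (non-None) settings override
            merged[key] = value
    return merged

# A user on -O2 who explicitly disables inductor keeps that choice,
# while the unset cudagraph mode still picks up the -O2 default:
cfg = apply_level_defaults(2, {"use_inductor": False, "cudagraph_mode": None})
```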

Key Changes

1. Repurposing -O for Optimization Levels

The -O<n> flags now represent meaningful optimization levels that trade startup time for performance:

-O0: No Optimization

  • Startup: Fastest startup time
  • Performance: No compilation, no cudagraphs, no optimizations
  • Equivalent to: --enforce-eager (deprecated)
  • Use case: Development, debugging, or when startup time is critical
# CLI usage
python -m vllm.entrypoints.api_server --model microsoft/DialoGPT-medium -O0

# Python API usage
from vllm.entrypoints.llm import LLM
from vllm.config.vllm import OptimizationLevel

llm = LLM(
    model="microsoft/DialoGPT-medium",
    optimization_level=OptimizationLevel.O0
)

-O1: Quick Optimizations

  • Startup: Moderate startup time
  • Performance: Inductor compilation, CUDAGraphMode.PIECEWISE
  • Use case: Balance for most development scenarios
# CLI usage
python -m vllm.entrypoints.api_server --model microsoft/DialoGPT-medium -O1

# Python API usage
from vllm.entrypoints.llm import LLM
from vllm.config.vllm import OptimizationLevel

llm = LLM(
    model="microsoft/DialoGPT-medium",
    optimization_level=OptimizationLevel.O1
)

-O2: Full Optimizations (Default)

  • Startup: Longer startup time
  • Performance: -O1 + CUDAGraphMode.FULL_AND_PIECEWISE
  • Use case: Production workloads where performance is important. This is the default, and it is very similar to the previous default; the primary difference is that the noop and fusion passes are now enabled.
# CLI usage (default, so optional)
python -m vllm.entrypoints.api_server --model microsoft/DialoGPT-medium -O2

# Python API usage
from vllm.entrypoints.llm import LLM
from vllm.config.vllm import OptimizationLevel

llm = LLM(
    model="microsoft/DialoGPT-medium",
    optimization_level=OptimizationLevel.O2  # This is the default
)

-O3: Maximum Optimization

Still in development. This PR adds the infrastructure now to avoid changing the API in a future release. Currently behaves the same as -O2.

Troubleshooting

Common Issues

  1. Startup Time Too Long: Use -O0 or -O1 for faster startup
  2. Compilation Errors: Use debug_dump_path for additional debugging information
  3. Performance Issues: Ensure using -O2 for production
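For item 2 above, the debug dump directory can be passed through the compilation config. The sketch below builds the JSON value for the `--compilation-config` CLI flag; `debug_dump_path` is assumed to be the relevant CompilationConfig field, so verify the exact field name against your installed vLLM version.

```python
import json

# Hedged sketch: building a --compilation-config value that enables a
# compilation debug dump. The field name `debug_dump_path` is an
# assumption; check it against your vLLM version's CompilationConfig.
debug_config = {"debug_dump_path": "/tmp/vllm_compile_debug"}
cli_value = json.dumps(debug_config)

# Would be passed on the command line roughly as:
#   --compilation-config '{"debug_dump_path": "/tmp/vllm_compile_debug"}'
```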

Added functions

Added functionality that determines whether a model is quantized and whether a model is MoE. This will be relevant for future work. Also added lambdas to easily retrieve information about the configuration.
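A hedged sketch of what such predicates can look like. The names and the key list below are hypothetical, chosen for illustration; the actual PR tracks its own list of HF-config keys for MoE detection.

```python
# Hypothetical sketch of "is this model MoE / quantized?" predicates:
# scan the HF config for expert-count keys commonly used by MoE
# architectures. This key list is illustrative, not the PR's list.
_MOE_KEYS = ("num_experts", "num_local_experts", "n_routed_experts")

def is_moe_model(hf_config: dict) -> bool:
    """True if any known expert-count key is present and greater than 1."""
    return any(hf_config.get(k, 0) and hf_config[k] > 1 for k in _MOE_KEYS)

def is_quantized_model(hf_config: dict) -> bool:
    """Presence of a quantization_config section is the usual signal."""
    return hf_config.get("quantization_config") is not None
```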

Test Plan

Test Result


Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
  • (Optional) Release notes update. If your change is user facing, please update the release notes draft in the Google Doc.

mergify bot commented Oct 14, 2025

Documentation preview: https://vllm--26847.org.readthedocs.build/en/26847/

@mergify mergify bot added documentation Improvements or additions to documentation frontend llama Related to Llama models speculative-decoding v1 tpu Related to Google TPUs labels Oct 14, 2025
@morrison-turnansky morrison-turnansky force-pushed the issue-20283-model-config branch 2 times, most recently from 930fd18 to e849607 Compare October 15, 2025 12:04
@mergify mergify bot removed the tpu Related to Google TPUs label Oct 15, 2025
@morrison-turnansky (Contributor, Author) commented:

@ProExpertProg Tracking a list of keys for determining if a model is moe/sequential here.

logger = init_logger(__name__)

# PassConfig preset instances for each compilation mode. Default fields set.
pass_config_none = PassConfig(
@morrison-turnansky (Contributor, Author) commented Oct 15, 2025:

@ProExpertProg let us know what the defaults for each mode need to be.

@ProExpertProg (Collaborator) left a comment:

Some initial thoughts

- O3 (VLLM_COMPILE): Maximum optimization with autotuning
"""
# TODO: Implement model specific paramters,
default_config = optimization_level_to_config[self.optimization_level]
Collaborator:

Here we should ask the platform for an opinion on the defaults.

Collaborator:

3 options:

  • we pass default to platform, it makes modifications
  • platform returns default, can use these global defaults as a starting point
  • each platform owns its own defaults

Let's chat about this at the end of release meeting maybe
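The first option above (pass the global defaults to the platform, which modifies them) could look roughly like this. The interface is entirely hypothetical, sketched only to make the discussion concrete; vLLM's Platform class has its own API.

```python
# Hypothetical sketch of option 1: global defaults are built first,
# then the platform gets a chance to adjust them.
class Platform:
    def adjust_compilation_defaults(self, defaults: dict) -> dict:
        return defaults  # base platform: no changes

class CudaLikePlatform(Platform):
    def adjust_compilation_defaults(self, defaults: dict) -> dict:
        # CUDA-alike platforms could upgrade the cudagraph default.
        defaults["cudagraph_mode"] = "FULL_AND_PIECEWISE"
        return defaults

def resolve_defaults(platform: Platform, level: int) -> dict:
    global_defaults = {"level": level, "cudagraph_mode": "PIECEWISE"}
    return platform.adjust_compilation_defaults(global_defaults)
```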

@morrison-turnansky (Contributor, Author):

@ProExpertProg Are we still including platform specific modifications? Are we pushing that until inductor partition is on by default?

Collaborator:

I think for now we can just do this on is_cuda_alike() platforms. But we're gonna need a more robust approach here

Signed-off-by: morrison-turnansky <mturnans@redhat.com>
Signed-off-by: morrison-turnansky <mturnans@redhat.com>
Signed-off-by: morrison-turnansky <mturnans@redhat.com>
Comment on lines 98 to +100
This is separate from general `CompilationConfig` so that inductor passes
don't all have access to full configuration - that would create a cycle as
the `PassManager` is set as a property of config."""
the `PassManager` is set as a property of config.
Collaborator:

This was actually changed recently; we should still keep pass config separate, but this reason is no longer true. Can be done in follow up cc @ilmarkov

morrison-turnansky and others added 2 commits November 26, 2025 13:37
Co-authored-by: Luka Govedič <ProExpertProg@users.noreply.github.com>
Signed-off-by: Morrison Turnansky <mturnans@redhat.com>
Signed-off-by: morrison-turnansky <mturnans@redhat.com>
):
logger.info(
"Cudagraph mode %s is not compatible with compilation mode %s. "
"Cudagraph mode is not compatible with compilation mode %s."
Collaborator:

Why this? Aren't you giving it 2 strings?

@morrison-turnansky (Contributor, Author):

cudagraph mode could be None (not the enum). I didn't want to write None to the user in the log, but yes should clean it up. At one point I had moved this after the defaults were applied, and then I saw a failure in Example Test. I wanted to keep everything exactly the same, but I am now more convinced that the test is just very flaky.
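The concern above (not printing `None` to the user) can be handled by selecting the message before logging. A small sketch, not the code merged in the PR:

```python
import logging

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("vllm.sketch")

def warn_incompatible(cudagraph_mode, compilation_mode) -> str:
    # Build one message that reads well whether or not a cudagraph
    # mode was explicitly set (sketch; not the PR's actual code).
    if cudagraph_mode is None:
        msg = f"Cudagraph mode is not compatible with compilation mode {compilation_mode}."
    else:
        msg = (f"Cudagraph mode {cudagraph_mode} is not compatible "
               f"with compilation mode {compilation_mode}.")
    logger.info(msg)
    return msg
```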

Signed-off-by: morrison-turnansky <mturnans@redhat.com>
@vllm-bot vllm-bot merged commit 0838b52 into vllm-project:main Nov 27, 2025
129 of 133 checks passed
@github-project-automation github-project-automation bot moved this from To triage to Done in torch.compile integration Nov 27, 2025
@github-project-automation github-project-automation bot moved this from In review to Done in NVIDIA Nov 27, 2025
kitaekatt pushed a commit to kitaekatt/vllm that referenced this pull request Dec 1, 2025
): Set up -O infrastructure (vllm-project#26847)

Signed-off-by: morrison-turnansky <mturnans@redhat.com>
Signed-off-by: adabeyta <aabeyta@redhat.com>
Signed-off-by: Morrison Turnansky <mturnans@redhat.com>
Co-authored-by: adabeyta <aabeyta@redhat.com>
Co-authored-by: Luka Govedič <ProExpertProg@users.noreply.github.com>
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
amd-hhashemi pushed a commit to amd-hhashemi/vllm that referenced this pull request Dec 2, 2025
(Same commit as above.)
wangxiyuan added a commit to vllm-project/vllm-ascend that referenced this pull request Dec 2, 2025
1. fix vllm-project/vllm#28542
   The model structure modifications involved are:
     - Qwen2.5-VL (some patches still exist)
     - Qwen2-VL
     - Qwen2
     - DeepSeek series
     - Qwen-moe series
2. fix vllm-project/vllm#29121
   The output token type changed from numpy arrays to `list[list[int]]`.
3. fix vllm-project/vllm#29262
   The `xformers` backend for multimodal has been deprecated.
4. fix vllm-project/vllm#29342
5. fix vllm-project/vllm#28579
6. fix vllm-project/vllm#28718
7. fix vllm-project/vllm#28665
8. fix vllm-project/vllm#26847
   vLLM introduced the `optimization-level`; some default config has been
   changed, and the `--enforce-eager` param has been deprecated.
9. fix vllm-project/vllm#29223
   It now returns a tuple for the sampler.
10. fix vllm-project/vllm#29471
    We'll remove the related patch to avoid this kind of error.

Co-authored-by: hfadzxy <starmoon_zhang@163.com>
Co-authored-by: wangli <wangli858794774@gmail.com>


- vLLM version: v0.11.2

---------

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
Signed-off-by: wangli <wangli858794774@gmail.com>
Signed-off-by: hfadzxy <starmoon_zhang@163.com>
Co-authored-by: wangli <wangli858794774@gmail.com>
Co-authored-by: hfadzxy <starmoon_zhang@163.com>
ChenCangtao pushed a commit to ChenCangtao/vllm-ascend that referenced this pull request Dec 3, 2025
(Same commit message as above.)
Mercykid-bash pushed a commit to Mercykid-bash/vllm-ascend that referenced this pull request Dec 4, 2025
(Same commit message as above.)

Labels

documentation Improvements or additions to documentation frontend llama Related to Llama models nvidia ready ONLY add when PR is ready to merge/full CI is needed ready-run-all-tests Trigger CI with all tests for wide-ranging PRs speculative-decoding torch.compile v1

Projects

Status: Done (torch.compile integration)
Status: Done (NVIDIA)

Development

Successfully merging this pull request may close these issues.

[RFC][UX][torch.compile][CUDAGraph]: Overhaul CompilationConfig and improve CLI -O<n>

6 participants