
[Frontend][torch.compile] CompilationConfig Overhaul (#20283): Set up -O infrastructure#26847

Merged
vllm-bot merged 85 commits into vllm-project:main from morrison-turnansky:issue-20283-model-config
Nov 27, 2025

Conversation

@morrison-turnansky (Contributor) commented Oct 14, 2025

CompilationConfig Overhaul and Optimization Levels

Overview

This PR overhauls VllmConfig and CompilationConfig and introduces meaningful optimization levels (-O0, -O1, -O2, -O3), as proposed in GitHub Issue #20283. The change aims to improve user experience by providing intuitive optimization levels that trade startup time for performance, while consolidating and simplifying the compilation configuration system. We have also changed defaults to give users the desired out-of-the-box performance; these defaults are determined by the optimization level. Importantly, defaults are purely defaults: explicit user settings will never be overwritten.
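The "defaults never overwrite explicit settings" rule can be sketched as follows. This is a hypothetical helper for illustration, not the actual vLLM implementation; the level-to-default mapping shown is assumed from the descriptions in this PR.

```python
# Hypothetical sketch: optimization-level defaults are applied only to
# fields the user left unset (None); explicit settings always win.
_LEVEL_DEFAULTS = {
    0: {"cudagraph_mode": "NONE", "use_inductor": False},
    1: {"cudagraph_mode": "PIECEWISE", "use_inductor": True},
    2: {"cudagraph_mode": "FULL_AND_PIECEWISE", "use_inductor": True},
}

def apply_level_defaults(level: int, user_settings: dict) -> dict:
    """Merge level defaults under user settings; user values take priority."""
    merged = dict(_LEVEL_DEFAULTS[level])
    for key, value in user_settings.items():
        if value is not None:  # only explicit (non-None) settings override
            merged[key] = value
    return merged

# A user on -O2 who explicitly disables inductor keeps that choice,
# while the unset cudagraph mode still picks up the -O2 default:
cfg = apply_level_defaults(2, {"use_inductor": False, "cudagraph_mode": None})
```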

Key Changes

1. Repurposing -O for Optimization Levels

The -O<n> flags now represent meaningful optimization levels that trade startup time for performance:

-O0: No Optimization

  • Startup: Fastest startup time
  • Performance: No compilation, no cudagraphs, no optimizations
  • Equivalent to: --enforce-eager (deprecated)
  • Use case: Development, debugging, or when startup time is critical
# CLI usage
python -m vllm.entrypoints.api_server --model microsoft/DialoGPT-medium -O0

# Python API usage
from vllm.entrypoints.llm import LLM
from vllm.config.vllm import OptimizationLevel

llm = LLM(
    model="microsoft/DialoGPT-medium",
    optimization_level=OptimizationLevel.O0
)

-O1: Quick Optimizations

  • Startup: Moderate startup time
  • Performance: Inductor compilation, CUDAGraphMode.PIECEWISE
  • Use case: Balance for most development scenarios
# CLI usage
python -m vllm.entrypoints.api_server --model microsoft/DialoGPT-medium -O1

# Python API usage
from vllm.entrypoints.llm import LLM
from vllm.config.vllm import OptimizationLevel

llm = LLM(
    model="microsoft/DialoGPT-medium",
    optimization_level=OptimizationLevel.O1
)

-O2: Full Optimizations (Default)

  • Startup: Longer startup time
  • Performance: -O1 + CUDAGraphMode.FULL_AND_PIECEWISE
  • Use case: Production workloads where performance is important. This is the default, and it is very similar to the previous default; the primary difference is that the noop and fusion passes are now enabled.
# CLI usage (default, so optional)
python -m vllm.entrypoints.api_server --model microsoft/DialoGPT-medium -O2

# Python API usage
from vllm.entrypoints.llm import LLM
from vllm.config.vllm import OptimizationLevel

llm = LLM(
    model="microsoft/DialoGPT-medium",
    optimization_level=OptimizationLevel.O2  # This is the default
)

-O3: Maximum Optimization

Still in development. This PR adds the infrastructure now to avoid changing the API in a future release. Currently behaves the same as -O2.

Troubleshooting

Common Issues

  1. Startup Time Too Long: Use -O0 or -O1 for faster startup
  2. Compilation Errors: Use debug_dump_path for additional debugging information
  3. Performance Issues: Ensure using -O2 for production
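For item 2 above, the debug dump directory can be passed through the compilation config. The sketch below builds the JSON value for the `--compilation-config` CLI flag; `debug_dump_path` is assumed to be the relevant CompilationConfig field, so verify the exact field name against your installed vLLM version.

```python
import json

# Hedged sketch: building a --compilation-config value that enables a
# compilation debug dump. The field name `debug_dump_path` is an
# assumption; check it against your vLLM version's CompilationConfig.
debug_config = {"debug_dump_path": "/tmp/vllm_compile_debug"}
cli_value = json.dumps(debug_config)

# Would be passed on the command line roughly as:
#   --compilation-config '{"debug_dump_path": "/tmp/vllm_compile_debug"}'
```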

Added functions

Added functionality that determines whether a model is quantized and whether a model is MoE. This will be relevant for future work. Also added lambdas to easily retrieve information about the configuration.
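A hedged sketch of what such predicates can look like. The names and the key list below are hypothetical, chosen for illustration; the actual PR tracks its own list of HF-config keys for MoE detection.

```python
# Hypothetical sketch of "is this model MoE / quantized?" predicates:
# scan the HF config for expert-count keys commonly used by MoE
# architectures. This key list is illustrative, not the PR's list.
_MOE_KEYS = ("num_experts", "num_local_experts", "n_routed_experts")

def is_moe_model(hf_config: dict) -> bool:
    """True if any known expert-count key is present and greater than 1."""
    return any(hf_config.get(k, 0) and hf_config[k] > 1 for k in _MOE_KEYS)

def is_quantized_model(hf_config: dict) -> bool:
    """Presence of a quantization_config section is the usual signal."""
    return hf_config.get("quantization_config") is not None
```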

Test Plan

Test Result


Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
  • (Optional) Release notes update. If your change is user facing, please update the release notes draft in the Google Doc.

mergify bot commented Oct 14, 2025

Documentation preview: https://vllm--26847.org.readthedocs.build/en/26847/

@mergify mergify bot added documentation Improvements or additions to documentation frontend llama Related to Llama models speculative-decoding v1 tpu Related to Google TPUs labels Oct 14, 2025
@morrison-turnansky morrison-turnansky force-pushed the issue-20283-model-config branch 2 times, most recently from 930fd18 to e849607 Compare October 15, 2025 12:04
@mergify mergify bot removed the tpu Related to Google TPUs label Oct 15, 2025
@morrison-turnansky (Contributor, Author) commented:

@ProExpertProg Tracking a list of keys for determining if a model is moe/sequential here.

logger = init_logger(__name__)

# PassConfig preset instances for each compilation mode. Default fields set.
pass_config_none = PassConfig(
@morrison-turnansky (Contributor, Author) commented Oct 15, 2025:

@ProExpertProg let us know what the defaults for each mode need to be.

@ProExpertProg (Collaborator) left a comment:

Some initial thoughts

- O3 (VLLM_COMPILE): Maximum optimization with autotuning
"""
# TODO: Implement model specific paramters,
default_config = optimization_level_to_config[self.optimization_level]
Collaborator:

Here we should ask the platform for an opinion on the defaults.

Collaborator:

3 options:

  • we pass default to platform, it makes modifications
  • platform returns default, can use these global defaults as a starting point
  • each platform owns its own defaults

Let's chat about this at the end of release meeting maybe
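The first option above (pass the global defaults to the platform, which modifies them) could look roughly like this. The interface is entirely hypothetical, sketched only to make the discussion concrete; vLLM's Platform class has its own API.

```python
# Hypothetical sketch of option 1: global defaults are built first,
# then the platform gets a chance to adjust them.
class Platform:
    def adjust_compilation_defaults(self, defaults: dict) -> dict:
        return defaults  # base platform: no changes

class CudaLikePlatform(Platform):
    def adjust_compilation_defaults(self, defaults: dict) -> dict:
        # CUDA-alike platforms could upgrade the cudagraph default.
        defaults["cudagraph_mode"] = "FULL_AND_PIECEWISE"
        return defaults

def resolve_defaults(platform: Platform, level: int) -> dict:
    global_defaults = {"level": level, "cudagraph_mode": "PIECEWISE"}
    return platform.adjust_compilation_defaults(global_defaults)
```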

@morrison-turnansky (Contributor, Author):

@ProExpertProg Are we still including platform specific modifications? Are we pushing that until inductor partition is on by default?

Collaborator:

I think for now we can just do this on is_cuda_alike() platforms. But we're gonna need a more robust approach here

Signed-off-by: morrison-turnansky <mturnans@redhat.com>
Signed-off-by: morrison-turnansky <mturnans@redhat.com>
Signed-off-by: morrison-turnansky <mturnans@redhat.com>
Comment on lines 98 to +100
This is separate from general `CompilationConfig` so that inductor passes
don't all have access to full configuration - that would create a cycle as
the `PassManager` is set as a property of config."""
the `PassManager` is set as a property of config.
Collaborator:

This was actually changed recently; we should still keep pass config separate, but this reason is no longer true. Can be done in follow up cc @ilmarkov

morrison-turnansky and others added 2 commits November 26, 2025 13:37
Co-authored-by: Luka Govedič <ProExpertProg@users.noreply.github.com>
Signed-off-by: Morrison Turnansky <mturnans@redhat.com>
Signed-off-by: morrison-turnansky <mturnans@redhat.com>
):
logger.info(
"Cudagraph mode %s is not compatible with compilation mode %s. "
"Cudagraph mode is not compatible with compilation mode %s."
Collaborator:

Why this? Aren't you giving it 2 strings?

@morrison-turnansky (Contributor, Author):

cudagraph mode could be None (not the enum). I didn't want to write None to the user in the log, but yes should clean it up. At one point I had moved this after the defaults were applied, and then I saw a failure in Example Test. I wanted to keep everything exactly the same, but I am now more convinced that the test is just very flaky.
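The concern above (not printing `None` to the user) can be handled by selecting the message before logging. A small sketch, not the code merged in the PR:

```python
import logging

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("vllm.sketch")

def warn_incompatible(cudagraph_mode, compilation_mode) -> str:
    # Build one message that reads well whether or not a cudagraph
    # mode was explicitly set (sketch; not the PR's actual code).
    if cudagraph_mode is None:
        msg = f"Cudagraph mode is not compatible with compilation mode {compilation_mode}."
    else:
        msg = (f"Cudagraph mode {cudagraph_mode} is not compatible "
               f"with compilation mode {compilation_mode}.")
    logger.info(msg)
    return msg
```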

Signed-off-by: morrison-turnansky <mturnans@redhat.com>
@vllm-bot vllm-bot merged commit 0838b52 into vllm-project:main Nov 27, 2025
129 of 133 checks passed
@github-project-automation github-project-automation bot moved this from To triage to Done in torch.compile integration Nov 27, 2025
@github-project-automation github-project-automation bot moved this from In review to Done in NVIDIA Nov 27, 2025
kitaekatt pushed a commit to kitaekatt/vllm that referenced this pull request Dec 1, 2025
): Set up -O infrastructure (vllm-project#26847)

Signed-off-by: morrison-turnansky <mturnans@redhat.com>
Signed-off-by: adabeyta <aabeyta@redhat.com>
Signed-off-by: Morrison Turnansky <mturnans@redhat.com>
Co-authored-by: adabeyta <aabeyta@redhat.com>
Co-authored-by: Luka Govedič <ProExpertProg@users.noreply.github.com>
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
amd-hhashemi pushed a commit to amd-hhashemi/vllm that referenced this pull request Dec 2, 2025
(Same commit as above.)
wangxiyuan added a commit to vllm-project/vllm-ascend that referenced this pull request Dec 2, 2025
1. fix vllm-project/vllm#28542
   The model structure modifications involved are:
     - Qwen2.5-VL (some patches still exist)
     - Qwen2-VL
     - Qwen2
     - DeepSeek series
     - Qwen-moe series
2. fix vllm-project/vllm#29121
   The output token type changed from numpy arrays to `list[list[int]]`.
3. fix vllm-project/vllm#29262
   The `xformers` backend for multimodal has been deprecated.
4. fix vllm-project/vllm#29342
5. fix vllm-project/vllm#28579
6. fix vllm-project/vllm#28718
7. fix vllm-project/vllm#28665
8. fix vllm-project/vllm#26847
   vLLM introduced the `optimization-level`; some default config has been
   changed, and the `--enforce-eager` param has been deprecated.
9. fix vllm-project/vllm#29223
   It now returns a tuple for the sampler.
10. fix vllm-project/vllm#29471
    We'll remove the related patch to avoid this kind of error.

Co-authored-by: hfadzxy <starmoon_zhang@163.com>
Co-authored-by: wangli <wangli858794774@gmail.com>


- vLLM version: v0.11.2

---------

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
Signed-off-by: wangli <wangli858794774@gmail.com>
Signed-off-by: hfadzxy <starmoon_zhang@163.com>
Co-authored-by: wangli <wangli858794774@gmail.com>
Co-authored-by: hfadzxy <starmoon_zhang@163.com>
ChenCangtao pushed a commit to ChenCangtao/vllm-ascend that referenced this pull request Dec 3, 2025
(Same commit message as above.)
Mercykid-bash pushed a commit to Mercykid-bash/vllm-ascend that referenced this pull request Dec 4, 2025
(Same commit message as above.)

Labels

documentation Improvements or additions to documentation frontend llama Related to Llama models nvidia ready ONLY add when PR is ready to merge/full CI is needed ready-run-all-tests Trigger CI with all tests for wide-ranging PRs speculative-decoding torch.compile v1

Projects

Status: Done (torch.compile integration)
Status: Done (NVIDIA)

Development

Successfully merging this pull request may close these issues.

[RFC][UX][torch.compile][CUDAGraph]: Overhaul CompilationConfig and improve CLI -O<n>

6 participants