[CPU] Refine default config for the CPU backend by bigPYJ1151 · Pull Request #19539 · vllm-project/vllm

bigPYJ1151 · 2025-06-12T09:00:05Z

Essential Elements of an Effective PR Description Checklist

The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
The test plan, such as providing test command.
The test results, such as pasting the results comparison before and after, or e2e results
(Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.

Purpose

Refine CPU backend default torch.compile config. Enable inductor as default. Disable inductor in most of CI tests to save time.
Allow setting default_max_num_seqs for different UsageContext. Setting default_max_num_seqs and default_max_num_batched_tokens for the CPU backend.

Test Plan

CI pipeline

Test Result

(Optional) Documentation Update

Signed-off-by: jiang1.li <jiang1.li@intel.com>

github-actions · 2025-06-12T09:00:21Z

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs would not trigger full CI run by default. Instead, it would only run fastcheck CI which starts running only a small and essential subset of CI tests to quickly catch errors. You can run other CI tests on top of those by going to your fastcheck build on Buildkite UI (linked in the PR checks section) and unblock them. If you do not have permission to unblock, ping simon-mo or khluu to add you in our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either: Add ready label to the PR or enable auto-merge.

🚀

gemini-code-assist

Summary of Changes

Hello @bigPYJ1151, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request refines the default configuration for the CPU backend. The primary goal is to enable torch.compile with the inductor backend by default for better performance, while simultaneously optimizing CI run times by disabling inductor for most tests. Additionally, it introduces more granular control over default sequence and batched token limits based on the usage context and sets specific defaults for the CPU platform.

Highlights

CPU Backend Defaults: Enabled torch.compile with the inductor backend by default for the CPU platform in V1 mode, moving away from the previously enforced eager mode.
CI Performance Optimization: Introduced an environment variable (VLLM_CPU_CI_ENV) to disable the inductor backend specifically within most CPU CI test runs to reduce compilation time and speed up tests. The benchmark test explicitly re-enables it.
Per-Context Sequence/Token Limits: Refactored the default max_num_seqs setting to be a dictionary allowing different default values based on the UsageContext (e.g., LLM_CLASS, OPENAI_API_SERVER). Added specific, lower default values for max_num_seqs and max_num_batched_tokens tailored for the CPU backend.
Compilation Settings Refinement: Adjusted the global compilation settings context manager used during model warm-up to pass the VLLM config and make the freezing setting conditional on max_autotune.

Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point in your pull request via creating an issue comment (i.e. comment on the pull request page) using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands.

Feature	Command	Description
Code Review	`/gemini review`	Performs a code review for the current pull request in its current state.
Pull Request Summary	`/gemini summary`	Provides a summary of the current pull request in its current state.
Comment	@gemini-code-assist	Responds in comments when explicitly tagged, both in issue comments and review comments.
Help	`/gemini help`	Displays a list of available commands.

Customization

To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist is currently in preview and may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments to provide feedback.

You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configureGemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution. ↩

gemini-code-assist

Code Review

This pull request refines the CPU backend's default PyTorch Compile (Inductor) configuration and introduces more granular control over max_num_seqs based on usage context. The changes aim to enable Inductor by default for CPU for better performance, while using eager mode in most CI tests to save time.

Key changes include enabling Inductor by default for CPU, allowing context-specific defaults for sequence and batch token limits, and updating Inductor settings in the cpu_model_runner.

A critical concern is the modification to torch._inductor.config.freezing handling, which might default to False when True could be required by CPU Inductor backends. Additionally, the removal of force_disable_caches and changes to custom_ops handling warrant verification to prevent regressions or unexpected behavior. The CI script changes also warrant review.

gemini-code-assist · 2025-06-12T09:01:51Z

vllm/v1/worker/cpu_model_runner.py

+def _set_global_compilation_settings(config: VllmConfig):
    import torch._inductor.config

-    # Note: The CPPGEMM backend requires freezing parameters.
-    freezing_value = torch._inductor.config.freezing
-    torch._inductor.config.freezing = True
-    # Note: workaround for "ValueError: fast mode: can't pickle cyclic objects
-    # including object type dict"
-    force_disable_caches = torch._inductor.config.force_disable_caches
-    torch._inductor.config.force_disable_caches = True
-    yield
-    torch._inductor.config.freezing = freezing_value
-    torch._inductor.config.force_disable_caches = force_disable_caches
+    inductor_config = config.compilation_config.inductor_compile_config
+    try:
+        # Note: The MKLDNN and CPPGEMM backend requires freezing parameters.
+        freezing_value = torch._inductor.config.freezing
+        if inductor_config.get("max_autotune", False):
+            torch._inductor.config.freezing = True
+        yield
+    finally:
+        torch._inductor.config.freezing = freezing_value


The logic for setting torch._inductor.config.freezing has changed. It's now conditional on inductor_config.get("max_autotune", False). Verify if freezing=False is safe and intended for the default CPU Inductor path, or if freezing=True should be set by default for CPU, independently of max_autotune if required by the underlying Inductor backends.

gemini-code-assist · 2025-06-12T09:01:51Z

vllm/platforms/cpu.py

+            if os.environ.get("VLLM_CPU_CI_ENV", "0") != "0":
+                backend = "eager"
+            else:
+                backend = "inductor"


The line compilation_config.custom_ops += ["none"] was removed. Clarify the expected behavior of custom ops with this change, for both "eager" and "inductor" backends under CompilationLevel.DYNAMO_ONCE. Ensure custom ops are correctly handled.

gemini-code-assist · 2025-06-12T09:01:51Z

.buildkite/scripts/hardware_ci/run-cpu-test.sh

    python3 -m vllm.entrypoints.openai.api_server --model facebook/opt-125m --dtype half & 
    timeout 600 bash -c 'until curl localhost:8000/v1/models; do sleep 1; done' || exit 1
-    python3 benchmarks/benchmark_serving.py \
+    VLLM_CPU_CI_ENV=0 python3 benchmarks/benchmark_serving.py \


Setting VLLM_CPU_CI_ENV=0 here seems to override the environment variable set in the docker run command. Is this intentional?

Suggested change

VLLM_CPU_CI_ENV=0 python3 benchmarks/benchmark_serving.py \

python3 benchmarks/benchmark_serving.py \

--backend vllm \

bigPYJ1151 · 2025-06-12T10:23:48Z

Hi @Isotr0py @DarkLight1337 Would you please help to take a review on this PR? Some minor changes of default CPU backend configs, the cpu-test has passed. Thanks :)

Isotr0py

Overall LGTM!

gemini-code-assist · 2025-06-13T05:27:43Z

Warning

You have reached your daily quota limit. Please wait up to 24 hours and I will start processing your requests again!

refine CPU default config

a0a1963

Signed-off-by: jiang1.li <jiang1.li@intel.com>

bigPYJ1151 requested review from WoosukKwon, alexm-redhat, comaniac, njhill, robertgshaw2-redhat and ywang96 as code owners June 12, 2025 09:00

gemini-code-assist bot reviewed Jun 12, 2025

View reviewed changes

mergify bot added ci/build v1 labels Jun 12, 2025

bigPYJ1151 changed the title ~~[CPU] Refine CPU default config~~ [CPU] Refine default config for the CPU backend Jun 12, 2025

gemini-code-assist bot reviewed Jun 12, 2025

View reviewed changes

Isotr0py approved these changes Jun 12, 2025

View reviewed changes

Isotr0py added the ready ONLY add when PR is ready to merge/full CI is needed label Jun 12, 2025

DarkLight1337 merged commit 6458721 into vllm-project:main Jun 13, 2025
81 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

[CPU] Refine default config for the CPU backend#19539

[CPU] Refine default config for the CPU backend#19539
DarkLight1337 merged 1 commit intovllm-project:mainfrom
bigPYJ1151:cpu_config

bigPYJ1151 commented Jun 12, 2025 •

edited by github-actions bot

Loading

Uh oh!

github-actions bot commented Jun 12, 2025

Uh oh!

gemini-code-assist bot left a comment

Uh oh!

gemini-code-assist bot left a comment

Uh oh!

gemini-code-assist bot Jun 12, 2025

Uh oh!

gemini-code-assist bot Jun 12, 2025

Uh oh!

gemini-code-assist bot Jun 12, 2025

Uh oh!

bigPYJ1151 commented Jun 12, 2025

Uh oh!

Isotr0py left a comment

Uh oh!

Uh oh!

gemini-code-assist bot commented Jun 13, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

	VLLM_CPU_CI_ENV=0 python3 benchmarks/benchmark_serving.py \
	python3 benchmarks/benchmark_serving.py \
	--backend vllm \

Uh oh!

Conversation

bigPYJ1151 commented Jun 12, 2025 • edited by github-actions bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Essential Elements of an Effective PR Description Checklist

Purpose

Test Plan

Test Result

(Optional) Documentation Update

Uh oh!

github-actions bot commented Jun 12, 2025

Uh oh!

gemini-code-assist bot left a comment

Choose a reason for hiding this comment

Summary of Changes

Highlights

Footnotes

Uh oh!

gemini-code-assist bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

gemini-code-assist bot Jun 12, 2025

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist bot Jun 12, 2025

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist bot Jun 12, 2025

Choose a reason for hiding this comment

Uh oh!

bigPYJ1151 commented Jun 12, 2025

Uh oh!

Isotr0py left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

gemini-code-assist bot commented Jun 13, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

bigPYJ1151 commented Jun 12, 2025 •

edited by github-actions bot

Loading