[Core] Use standalone autograd_cache_key for compilation dedup optimization #39517

Open

frgossen wants to merge 1 commit into vllm-project:main
Conversation
Contributor

Code Review
This pull request refactors the compilation backend to support the new standalone autograd_cache_key API available in PyTorch 2.12+, while maintaining a legacy monkey-patching path for older versions. It also introduces a debug mode to verify the equivalence of both paths. Feedback indicates that the new standalone path is missing necessary timing and logging logic for cache hits, and the autograd_cache_normalize_inputs configuration is not correctly applied during the actual compilation step in the new path.
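The two pieces of feedback above can be made concrete with a small sketch. This is not vLLM code: `compile_with_cache`, `do_compile`, and the `normalize_inputs` parameter are illustrative stand-ins. The sketch shows (1) a cache-hit path that keeps timing and logging, and (2) the normalize-inputs setting being threaded through to the actual compile call rather than only to key computation.

```python
import logging
import time

logger = logging.getLogger("compile_cache")

def compile_with_cache(key, cache, do_compile, normalize_inputs=False):
    """Hypothetical sketch of the reviewer's two points.

    - On a cache hit, report the hit with timing, matching what the
      legacy path logged.
    - The normalize-inputs configuration must reach the actual compile
      step, not just the up-front key computation.
    """
    start = time.monotonic()
    if key in cache:
        elapsed = time.monotonic() - start
        # Point 1: the standalone path should keep this hit logging.
        logger.debug("cache hit for %s in %.3fs", key, elapsed)
        return cache[key]
    # Point 2: the config is applied at the compile call itself.
    cache[key] = do_compile(normalize_inputs=normalize_inputs)
    return cache[key]
```

On a miss the setting is forwarded to the compile callable; on a hit the cached artifact is returned and the hit is logged with its lookup time.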
frgossen force-pushed from 74e9ad1 to b92c1cd
frgossen force-pushed from b92c1cd to e15d901
frgossen added a commit to frgossen/vllm that referenced this pull request on Apr 10, 2026
…zation

## Purpose

Use the new torch.compile standalone_compile.autograd_cache_key API (torch >= 2.12) to compute cache keys up front, avoiding the legacy monkey-patching of autograd_cache.autograd_cache_key during compilation. This enables deduplication without compiling duplicate subgraphs.

A new VLLM_DEBUG_COMPILE_CACHE_KEY env var cross-checks the standalone API against the legacy path. This uses a dedicated env var rather than a log-level guard because the check changes the compilation codepath (forces legacy compile + extra key computation), not just verbosity. This follows the established VLLM_DEBUG_* convention (VLLM_DEBUG_WORKSPACE, VLLM_DEBUG_MFU_METRICS, etc.).

## Test Plan

- Run meta-llama/Meta-Llama-3-70B-Instruct with TP=4. We expect no functional changes.
- Unit tests: tests/compile/ (test_config, test_wrapper, test_compile_ranges, passes/test_pass_manager, passes/test_noop_elimination).
- E2E test matrix: 2x2 matrix of (standalone vs legacy path) × (VLLM_DEBUG_COMPILE_CACHE_KEY=0 vs =1). Legacy path simulated by setting has_standalone_key_api=False locally.

## Test Result

Cold-compile benchmark (Llama 3 70B, TP=4):

- No dedup (109b4a0): mean 35.6s ± 0.8s, median 35.7s
- Dedup via monkey-patch (baseline): mean 34.2s ± 0.8s, median 34.3s
- Dedup via standalone API (108414a): mean 34.7s ± 0.9s, median 34.8s

The standalone API is ~0.9s (2.5%) faster than no dedup, and within noise of the monkey-patch baseline it replaces.

E2E test matrix (Llama 3 70B, TP=4, 1 run each):

Path       | DEBUG=0  | DEBUG=1
-----------|----------|--------
Standalone | 34.3s OK | 40.5s OK (cross-check passed)
Legacy     | 34.9s OK | 35.1s OK (cross-check passed)

All 4 combinations served the model successfully. The DEBUG=1 standalone path is slower as expected (runs legacy compile + standalone key computation to cross-check equivalence).

Signed-off-by: Frederik Gossen <frgossen@meta.com>
PR: vllm-project#39517
Branch: core-standalone-autograd-cache-key
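The dedup idea the commit message describes — compute each subgraph's cache key up front, then compile each unique key only once — can be sketched as follows. This is a minimal illustration, not vLLM or torch code: `compute_cache_key` is a stand-in for the standalone autograd_cache_key API (here just a hash of the graph's textual form), and `compile_fn` stands in for the actual compile step.

```python
import hashlib
from typing import Any, Callable

def compute_cache_key(graph_src: str) -> str:
    # Stand-in key function: hash of the graph's source text. The real
    # standalone API derives the key from the FX graph and its inputs
    # *before* compilation, which is what makes up-front dedup possible.
    return hashlib.sha256(graph_src.encode()).hexdigest()

def compile_deduped(
    graphs: list[str],
    compile_fn: Callable[[str], Any],
) -> list[Any]:
    """Compile each structurally identical subgraph only once.

    Keys are computed before compiling, so duplicates are detected
    without ever invoking the compiler on them -- unlike the legacy
    monkey-patch, which only learns the key during compilation.
    """
    compiled: dict[str, Any] = {}
    out = []
    for g in graphs:
        key = compute_cache_key(g)
        if key not in compiled:
            compiled[key] = compile_fn(g)  # compile only on first sight
        out.append(compiled[key])
    return out

calls = []
def fake_compile(g):
    calls.append(g)
    return f"compiled({g})"

# Two of the three "subgraphs" are identical, so only two compiles run.
results = compile_deduped(["layer_a", "layer_b", "layer_a"], fake_compile)
```

With three subgraphs of which two are identical, `fake_compile` runs twice and the duplicate reuses the first artifact, which is the source of the cold-compile speedup reported in the benchmark above.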
frgossen force-pushed from e15d901 to 6d024dc
frgossen added a commit to frgossen/vllm that referenced this pull request on Apr 14, 2026
frgossen force-pushed from 6d024dc to d471fe2
frgossen force-pushed from d471fe2 to 0bc2392