[Core] Use standalone autograd_cache_key for compilation dedup optimization #39517

Open

frgossen wants to merge 1 commit into vllm-project:main
Conversation
Contributor

Code Review
This pull request refactors the compilation backend to support the new standalone autograd_cache_key API available in PyTorch 2.12+, while maintaining a legacy monkey-patching path for older versions. It also introduces a debug mode to verify the equivalence of both paths. Feedback indicates that the new standalone path is missing necessary timing and logging logic for cache hits, and the autograd_cache_normalize_inputs configuration is not correctly applied during the actual compilation step in the new path.
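The two pieces of feedback above can be made concrete with a small sketch. This is not vLLM code: `compile_with_cache`, `do_compile`, and the `normalize_inputs` parameter are illustrative stand-ins. The sketch shows (1) a cache-hit path that keeps timing and logging, and (2) the normalize-inputs setting being threaded through to the actual compile call rather than only to key computation.

```python
import logging
import time

logger = logging.getLogger("compile_cache")

def compile_with_cache(key, cache, do_compile, normalize_inputs=False):
    """Hypothetical sketch of the reviewer's two points.

    - On a cache hit, report the hit with timing, matching what the
      legacy path logged.
    - The normalize-inputs configuration must reach the actual compile
      step, not just the up-front key computation.
    """
    start = time.monotonic()
    if key in cache:
        elapsed = time.monotonic() - start
        # Point 1: the standalone path should keep this hit logging.
        logger.debug("cache hit for %s in %.3fs", key, elapsed)
        return cache[key]
    # Point 2: the config is applied at the compile call itself.
    cache[key] = do_compile(normalize_inputs=normalize_inputs)
    return cache[key]
```

On a miss the setting is forwarded to the compile callable; on a hit the cached artifact is returned and the hit is logged with its lookup time.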
frgossen force-pushed from 74e9ad1 to b92c1cd
frgossen force-pushed from b92c1cd to e15d901
frgossen added a commit to frgossen/vllm that referenced this pull request on Apr 10, 2026
…zation

## Purpose

Use the new torch.compile standalone_compile.autograd_cache_key API (torch >= 2.12) to compute cache keys up front, avoiding the legacy monkey-patching of autograd_cache.autograd_cache_key during compilation. This enables deduplication without compiling duplicate subgraphs.

A new VLLM_DEBUG_COMPILE_CACHE_KEY env var cross-checks the standalone API against the legacy path. This uses a dedicated env var rather than a log-level guard because the check changes the compilation codepath (forces legacy compile + extra key computation), not just verbosity. This follows the established VLLM_DEBUG_* convention (VLLM_DEBUG_WORKSPACE, VLLM_DEBUG_MFU_METRICS, etc.).

## Test Plan

- Run meta-llama/Meta-Llama-3-70B-Instruct with TP=4. We expect no functional changes.
- Unit tests: tests/compile/ (test_config, test_wrapper, test_compile_ranges, passes/test_pass_manager, passes/test_noop_elimination).
- E2E test matrix: 2x2 matrix of (standalone vs legacy path) × (VLLM_DEBUG_COMPILE_CACHE_KEY=0 vs =1). Legacy path simulated by setting has_standalone_key_api=False locally.

## Test Result

Cold-compile benchmark (Llama 3 70B, TP=4):

- No dedup (109b4a0): mean 35.6s ± 0.8s, median 35.7s
- Dedup via monkey-patch (baseline): mean 34.2s ± 0.8s, median 34.3s
- Dedup via standalone API (108414a): mean 34.7s ± 0.9s, median 34.8s

The standalone API is ~0.9s (2.5%) faster than no dedup, and within noise of the monkey-patch baseline it replaces.

E2E test matrix (Llama 3 70B, TP=4, 1 run each):

Path       | DEBUG=0  | DEBUG=1
-----------|----------|--------
Standalone | 34.3s OK | 40.5s OK (cross-check passed)
Legacy     | 34.9s OK | 35.1s OK (cross-check passed)

All 4 combinations served the model successfully. The DEBUG=1 standalone path is slower as expected (runs legacy compile + standalone key computation to cross-check equivalence).

Signed-off-by: Frederik Gossen <frgossen@meta.com>
PR: vllm-project#39517
Branch: core-standalone-autograd-cache-key
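The dedup idea the commit message describes — compute each subgraph's cache key up front, then compile each unique key only once — can be sketched as follows. This is a minimal illustration, not vLLM or torch code: `compute_cache_key` is a stand-in for the standalone autograd_cache_key API (here just a hash of the graph's textual form), and `compile_fn` stands in for the actual compile step.

```python
import hashlib
from typing import Any, Callable

def compute_cache_key(graph_src: str) -> str:
    # Stand-in key function: hash of the graph's source text. The real
    # standalone API derives the key from the FX graph and its inputs
    # *before* compilation, which is what makes up-front dedup possible.
    return hashlib.sha256(graph_src.encode()).hexdigest()

def compile_deduped(
    graphs: list[str],
    compile_fn: Callable[[str], Any],
) -> list[Any]:
    """Compile each structurally identical subgraph only once.

    Keys are computed before compiling, so duplicates are detected
    without ever invoking the compiler on them -- unlike the legacy
    monkey-patch, which only learns the key during compilation.
    """
    compiled: dict[str, Any] = {}
    out = []
    for g in graphs:
        key = compute_cache_key(g)
        if key not in compiled:
            compiled[key] = compile_fn(g)  # compile only on first sight
        out.append(compiled[key])
    return out

calls = []
def fake_compile(g):
    calls.append(g)
    return f"compiled({g})"

# Two of the three "subgraphs" are identical, so only two compiles run.
results = compile_deduped(["layer_a", "layer_b", "layer_a"], fake_compile)
```

With three subgraphs of which two are identical, `fake_compile` runs twice and the duplicate reuses the first artifact, which is the source of the cold-compile speedup reported in the benchmark above.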
frgossen force-pushed from e15d901 to 6d024dc
frgossen added a commit to frgossen/vllm that referenced this pull request on Apr 14, 2026
frgossen force-pushed from 6d024dc to d471fe2
frgossen force-pushed from d471fe2 to 0bc2392