[Core] Use standalone autograd_cache_key for compilation dedup optimization #37929
frgossen wants to merge 1 commit into vllm-project:main
Conversation
Code Review
This pull request refactors the compilation deduplication logic to use the new autograd_cache_key API in torch >= 2.11, which is a great improvement. The code is well-structured, separating the new path and the legacy monkey-patching path. However, I've found a critical issue where the new caching logic assumes an Inductor backend, which could cause failures when other backends like 'eager' are used. My review includes a suggestion to make the caching logic conditional on the backend type.
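The reviewer's concern can be addressed by gating key computation on the backend. The sketch below is illustrative only: the function and parameter names (`get_dedup_key`, `graph_repr`) are hypothetical stand-ins, not vLLM's actual API, and a real implementation would call into torch's AOTAutograd cache rather than hash a string.

```python
import hashlib
from typing import Optional

def get_dedup_key(backend: str, graph_repr: str) -> Optional[str]:
    """Return a cache key for subgraph dedup, or None when the backend
    cannot provide one (hypothetical helper, not vLLM's real code)."""
    if backend != "inductor":
        # Backends such as "eager" have no Inductor/AOTAutograd cache
        # key; skip dedup instead of assuming Inductor and failing.
        return None
    # Stand-in for the real autograd_cache_key computation.
    return hashlib.sha256(graph_repr.encode()).hexdigest()
```

With this guard, non-Inductor backends simply opt out of deduplication instead of erroring.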
Force-pushed from b5d03e8 to 919bfc7
Force-pushed from 4ae8674 to 8e74249
This pull request has merge conflicts that must be resolved before it can be merged.
Force-pushed from 8e74249 to 4755a59
I tested this locally with
Force-pushed from 4755a59 to 1fb06db
Force-pushed from 60b7522 to 8a949ef
Force-pushed from ed08987 to c161ce1
Commit message:

[Core] Use standalone autograd_cache_key for compilation dedup optimization

## Purpose

Use the new torch.compile standalone_compile.autograd_cache_key API (torch >= 2.12) to compute cache keys up front, avoiding the legacy monkey-patching of autograd_cache.autograd_cache_key during compilation. This enables deduplication without compiling duplicate subgraphs.

A new VLLM_DEBUG_COMPILE_CACHE_KEY env var cross-checks the standalone API against the legacy path. This uses a dedicated env var rather than a log-level guard because the check changes the compilation codepath (forces legacy compile + extra key computation), not just verbosity. This follows the established VLLM_DEBUG_* convention (VLLM_DEBUG_WORKSPACE, VLLM_DEBUG_MFU_METRICS, etc.).

## Test Plan

- Run meta-llama/Meta-Llama-3-70B-Instruct with TP=4. We expect no functional changes.

## Test Result

Cold-compile benchmark (Llama 3 70B, TP=4, 16 runs each):

- Before (1e688fa): mean 34.2s ± 0.8s, median 34.3s
- After (eecf384): mean 34.4s ± 0.6s, median 34.7s

No significant difference (within noise)

Signed-off-by: Frederik Gossen <frgossen@meta.com>
Force-pushed from c161ce1 to 24f24fd
Purpose
Use the new torch.compile standalone_compile.autograd_cache_key API
(torch >= 2.12) to compute cache keys up front, avoiding the legacy
monkey-patching of autograd_cache.autograd_cache_key during compilation.
This enables deduplication without compiling duplicate subgraphs.
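The deduplication pattern described here can be sketched in isolation: compute each subgraph's key first, then compile only subgraphs whose key has not been seen. This is a minimal illustration with hypothetical names (`compile_with_dedup`, `key_fn`, `compile_fn`), not the PR's actual implementation.

```python
def compile_with_dedup(subgraphs, key_fn, compile_fn):
    """Compile each subgraph at most once per cache key, reusing the
    compiled artifact for duplicates (hypothetical helper names)."""
    compiled = {}
    results = []
    for graph in subgraphs:
        key = key_fn(graph)       # key computed up front, before compiling
        if key not in compiled:   # only the first occurrence is compiled
            compiled[key] = compile_fn(graph)
        results.append(compiled[key])
    return results
```

Because the key is available before compilation starts, duplicate subgraphs never reach the compiler, which is the saving the legacy monkey-patching approach could not provide up front.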
A new VLLM_DEBUG_COMPILE_CACHE_KEY env var cross-checks the standalone
API against the legacy path. This uses a dedicated env var rather than
a log-level guard because the check changes the compilation codepath
(forces legacy compile + extra key computation), not just verbosity.
This follows the established VLLM_DEBUG_* convention (VLLM_DEBUG_WORKSPACE,
VLLM_DEBUG_MFU_METRICS, etc.).
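The cross-check described above could look roughly like the following. This is a hedged sketch: the function name and the `new_key_fn`/`legacy_key_fn` parameters are illustrative stand-ins for the standalone and legacy key paths, though the `VLLM_DEBUG_COMPILE_CACHE_KEY` env var name comes from the PR itself.

```python
import os

def cache_key_with_debug_check(graph, new_key_fn, legacy_key_fn):
    """Compute the key via the new standalone API; when the debug env
    var is set, also run the legacy path and verify both keys agree.
    (Illustrative names, not vLLM's actual implementation.)"""
    key = new_key_fn(graph)
    if os.environ.get("VLLM_DEBUG_COMPILE_CACHE_KEY"):
        # Deliberately forces the legacy computation too, which is why
        # this lives behind its own env var rather than a log level.
        legacy_key = legacy_key_fn(graph)
        if key != legacy_key:
            raise AssertionError(
                f"cache key mismatch: {key!r} != {legacy_key!r}")
    return key
```

When the env var is unset, only the new path runs, so production compilation pays no extra cost for the debug facility.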
Test Plan
Run meta-llama/Meta-Llama-3-70B-Instruct with TP=4. We expect no functional changes.
Test Result
Cold-compile benchmark (Llama 3 70B, TP=4, 16 runs each):
Before (1e688fa): mean 34.2s ± 0.8s, median 34.3s
After (eecf384): mean 34.4s ± 0.6s, median 34.7s
No significant difference (within noise)