[Core] Use standalone autograd_cache_key for compilation dedup optimization #37929

Closed
frgossen wants to merge 1 commit into vllm-project:main from frgossen:use-aot-autograd-cache-key

Conversation

@frgossen (Contributor) commented Mar 23, 2026

Purpose

Use the new torch.compile standalone_compile.autograd_cache_key API
(torch >= 2.12) to compute cache keys up front, avoiding the legacy
monkey-patching of autograd_cache.autograd_cache_key during compilation.
This enables deduplication without compiling duplicate subgraphs.
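A minimal, torch-free sketch of the dedup pattern this enables (the `cache_key` helper and placeholder compile step are illustrative, not vLLM's actual code): computing the key up front lets duplicate subgraphs reuse the first compiled artifact instead of being compiled again.

```python
import hashlib

def cache_key(graph_src: str) -> str:
    # Stand-in for the standalone cache-key API: any stable hash of
    # the subgraph is enough to illustrate up-front deduplication.
    return hashlib.sha256(graph_src.encode()).hexdigest()

def compile_all(subgraphs):
    compiled = {}    # key -> compiled artifact
    results = []
    n_compiles = 0
    for g in subgraphs:
        key = cache_key(g)  # computed up front, before any compilation
        if key not in compiled:
            compiled[key] = f"compiled<{g}>"  # placeholder for the real compile
            n_compiles += 1
        results.append(compiled[key])
    return results, n_compiles
```

With three subgraphs where two are identical, only two compilations happen and the duplicate reuses the cached artifact.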

A new VLLM_DEBUG_COMPILE_CACHE_KEY env var cross-checks the standalone
API against the legacy path. This uses a dedicated env var rather than
a log-level guard because the check changes the compilation codepath
(forces legacy compile + extra key computation), not just verbosity.
This follows the established VLLM_DEBUG_* convention (VLLM_DEBUG_WORKSPACE,
VLLM_DEBUG_MFU_METRICS, etc.).
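A hedged sketch of the cross-check pattern described above (only the env var name comes from this PR; the helper and the key functions are hypothetical): when the debug var is set, both paths compute a key and any disagreement fails loudly.

```python
import os

def get_cache_key(graph, standalone_key_fn, legacy_key_fn):
    # Normal path: the standalone API computes the key up front.
    key = standalone_key_fn(graph)
    if os.environ.get("VLLM_DEBUG_COMPILE_CACHE_KEY") == "1":
        # Debug-only cross-check: also run the legacy path and compare.
        # This changes the codepath (extra key computation), which is
        # why it sits behind a dedicated env var, not a log level.
        legacy = legacy_key_fn(graph)
        assert key == legacy, f"cache key mismatch: {key} != {legacy}"
    return key
```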

Test Plan

  • Run meta-llama/Meta-Llama-3-70B-Instruct with TP=4. We expect no
    functional changes.

Test Result

Cold-compile benchmark (Llama 3 70B, TP=4, 16 runs each):

  • Before (1e688fa): mean 34.2 s ± 0.8 s, median 34.3 s
  • After (eecf384): mean 34.4 s ± 0.6 s, median 34.7 s
  • No significant difference (within noise)

@github-actions

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either: Add ready label to the PR or enable auto-merge.

If you have any questions, please reach out to us on Slack at https://slack.vllm.ai.

🚀

@gemini-code-assist (bot) left a comment


Code Review

This pull request refactors the compilation deduplication logic to use the new autograd_cache_key API in torch >= 2.12, which is a welcome improvement. The code is well structured, separating the new path from the legacy monkey-patching path. However, I found a critical issue: the new caching logic assumes an Inductor backend, which could cause failures when other backends such as 'eager' are used. My review includes a suggestion to make the caching logic conditional on the backend type.
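A sketch of the backend guard the review suggests, under the assumption that only the Inductor backend provides the cache-key machinery (the function, cache shape, and `hash`-based key are illustrative, not the reviewed code):

```python
def maybe_dedup(backend: str, graph, compile_fn, cache: dict):
    # Hypothetical guard: key-based dedup assumes Inductor's cache-key
    # machinery; other backends (e.g. "eager") fall through to a plain
    # compile with no cache lookup.
    if backend != "inductor":
        return compile_fn(graph)
    key = hash(graph)  # stand-in for the real cache key
    if key not in cache:
        cache[key] = compile_fn(graph)
    return cache[key]
```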

Comment thread vllm/compilation/backends.py Outdated
@frgossen frgossen force-pushed the use-aot-autograd-cache-key branch from b5d03e8 to 919bfc7 Compare March 23, 2026 22:06
Comment thread vllm/compilation/backends.py Outdated
Comment thread vllm/compilation/backends.py
@frgossen frgossen force-pushed the use-aot-autograd-cache-key branch 2 times, most recently from 4ae8674 to 8e74249 Compare March 27, 2026 13:56
@frgossen frgossen requested a review from tjtanaa as a code owner March 27, 2026 13:56
@mergify mergify bot added ci/build nvidia rocm Related to AMD ROCm labels Mar 27, 2026
@mergify mergify bot added the cpu Related to CPU backends label Mar 27, 2026
@github-project-automation github-project-automation bot moved this to Todo in AMD Mar 27, 2026
@frgossen frgossen marked this pull request as draft March 27, 2026 13:57
@mergify

mergify bot commented Mar 27, 2026

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @frgossen.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Mar 27, 2026
@frgossen frgossen force-pushed the use-aot-autograd-cache-key branch from 8e74249 to 4755a59 Compare March 27, 2026 14:02
@mergify mergify bot removed the needs-rebase label Mar 27, 2026
@frgossen (Contributor, Author)

I tested this locally with meta-llama/Meta-Llama-3-70B-Instruct. I don't think the CI will cover this, because it is hidden behind a debug var and only kicks in with a newer PyTorch version. I will run the tests locally before landing this.

| | has_standalone_key_api = False | has_standalone_key_api = is_torch_equal_or_newer("2.12.0.dev") |
|---|---|---|
| VLLM_DEBUG_COMPILE_CACHE_KEY=0 | PASS | PASS |
| VLLM_DEBUG_COMPILE_CACHE_KEY=1 | PASS | PASS |

@frgossen frgossen force-pushed the use-aot-autograd-cache-key branch from 4755a59 to 1fb06db Compare April 2, 2026 18:33
@mergify

mergify bot commented Apr 2, 2026

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @frgossen.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Apr 2, 2026
@frgossen frgossen force-pushed the use-aot-autograd-cache-key branch 2 times, most recently from 60b7522 to 8a949ef Compare April 2, 2026 22:03
@mergify mergify bot removed the needs-rebase label Apr 2, 2026
@frgossen frgossen force-pushed the use-aot-autograd-cache-key branch 4 times, most recently from ed08987 to c161ce1 Compare April 8, 2026 14:39
@frgossen frgossen force-pushed the use-aot-autograd-cache-key branch from c161ce1 to 24f24fd Compare April 8, 2026 21:31
@frgossen frgossen closed this Apr 10, 2026
@github-project-automation github-project-automation bot moved this from Todo to Done in AMD Apr 10, 2026
@github-project-automation github-project-automation bot moved this to Done in NVIDIA Apr 10, 2026
@frgossen (Contributor, Author)

duplicate creation


Labels

ci/build cpu Related to CPU backends nvidia rocm Related to AMD ROCm

Projects

Status: Done

2 participants