
[Core] Use standalone autograd_cache_key for compilation dedup optimization#39517

Open
frgossen wants to merge 1 commit into vllm-project:main from frgossen:core-standalone-autograd-cache-key

Conversation

@frgossen
Contributor

@frgossen frgossen commented Apr 10, 2026

Purpose

Use the new torch.compile standalone_compile.autograd_cache_key API
(torch >= 2.12) to compute cache keys up front, avoiding the legacy
monkey-patching of autograd_cache.autograd_cache_key during compilation.
This enables deduplication without compiling duplicate subgraphs.
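The up-front keying idea can be sketched as follows. This is a minimal illustration, not the vLLM implementation: a plain hash stands in for the real `standalone_compile.autograd_cache_key`, and `compile_subgraphs` is a hypothetical name.

```python
import hashlib

def autograd_cache_key(graph_repr: str) -> str:
    # Stand-in for torch's standalone_compile.autograd_cache_key
    # (torch >= 2.12): any stable hash of the subgraph works for the sketch.
    return hashlib.sha256(graph_repr.encode()).hexdigest()

def compile_subgraphs(subgraphs):
    # Compute keys up front, then compile each unique subgraph exactly once;
    # duplicate subgraphs reuse the already-compiled artifact.
    compiled = {}
    artifacts = []
    for graph in subgraphs:
        key = autograd_cache_key(graph)
        if key not in compiled:
            compiled[key] = f"compiled<{graph}>"  # placeholder for real compile
        artifacts.append(compiled[key])
    return artifacts, len(compiled)

artifacts, unique_compiles = compile_subgraphs(
    ["decoder_layer", "decoder_layer", "lm_head"]
)
# Three subgraphs, but only two compilations.
```

The legacy path had to run the compile to learn the key as a side effect; computing the key first is what makes skipping duplicate compiles possible.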

A new VLLM_DEBUG_COMPILE_CACHE_KEY env var cross-checks the standalone
API against the legacy path. This uses a dedicated env var rather than
a log-level guard because the check changes the compilation codepath
(forces legacy compile + extra key computation), not just verbosity.
This follows the established VLLM_DEBUG_* convention (VLLM_DEBUG_WORKSPACE,
VLLM_DEBUG_MFU_METRICS, etc.).
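A minimal sketch of that gating, assuming hypothetical `standalone_key`/`legacy_key` callables standing in for the two real code paths:

```python
import os

def resolve_cache_key(graph_repr, standalone_key, legacy_key):
    # The standalone key is always the one used; the env var only adds a
    # verification step against the legacy path, which is why it changes
    # the compilation codepath rather than just the log output.
    key = standalone_key(graph_repr)
    if os.environ.get("VLLM_DEBUG_COMPILE_CACHE_KEY") == "1":
        reference = legacy_key(graph_repr)
        if reference != key:
            raise RuntimeError(
                f"cache key mismatch: standalone={key} legacy={reference}"
            )
    return key

os.environ["VLLM_DEBUG_COMPILE_CACHE_KEY"] = "1"
key = resolve_cache_key("subgraph", lambda g: hash(g), lambda g: hash(g))
```

With the env var unset, the legacy path never runs, so the check adds no cost in production.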

Test Plan

  • Run meta-llama/Meta-Llama-3-70B-Instruct with TP=4. We expect no
    functional changes.
  • Unit tests: tests/compile/ (test_config, test_wrapper, test_compile_ranges,
    passes/test_pass_manager, passes/test_noop_elimination).
  • E2E tests: 2×2 matrix of (standalone vs legacy path) ×
    (VLLM_DEBUG_COMPILE_CACHE_KEY=0 vs =1). The legacy path is simulated by
    setting has_standalone_key_api=False locally.
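Forcing the legacy path in a test can be done by patching the flag locally, roughly as below. The `backend` object and `select_path` helper are hypothetical stand-ins; only the flag name `has_standalone_key_api` comes from the test plan.

```python
import types
from unittest import mock

# Hypothetical stand-in for the compilation backend module.
backend = types.SimpleNamespace(has_standalone_key_api=True)

def select_path():
    return "standalone" if backend.has_standalone_key_api else "legacy"

# Temporarily flip the flag so the legacy path runs under test.
with mock.patch.object(backend, "has_standalone_key_api", False):
    path_under_test = select_path()  # "legacy" inside the context

# The patch is undone on exit, so other tests see the standalone path.
assert select_path() == "standalone"
```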

Test Result

Cold-compile benchmark (Llama 3 70B, TP=4):

  No dedup (109b4a0):                 mean 35.6s ± 0.8s, median 35.7s
  Dedup via monkey-patch (baseline):  mean 34.2s ± 0.8s, median 34.3s
  Dedup via standalone API (108414a): mean 34.7s ± 0.9s, median 34.8s

The standalone API is ~0.9s (2.5%) faster than no dedup, and within
noise of the monkey-patch baseline it replaces.

E2E test matrix (Llama 3 70B, TP=4, 1 run each):

  Path       | DEBUG=0  | DEBUG=1
  -----------|----------|--------
  Standalone | 34.3s OK | 40.5s OK (cross-check passed)
  Legacy     | 34.9s OK | 35.1s OK (cross-check passed)

All 4 combinations served the model successfully. The DEBUG=1
standalone path is slower as expected (runs legacy compile + standalone
key computation to cross-check equivalence).

@frgossen frgossen changed the title from placeholder to [Core] Use standalone autograd_cache_key for compilation dedup optimization on Apr 10, 2026

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request refactors the compilation backend to support the new standalone autograd_cache_key API available in PyTorch 2.12+, while maintaining a legacy monkey-patching path for older versions. It also introduces a debug mode to verify the equivalence of both paths. Feedback indicates that the new standalone path is missing necessary timing and logging logic for cache hits, and the autograd_cache_normalize_inputs configuration is not correctly applied during the actual compilation step in the new path.

Comment thread vllm/compilation/backends.py
Comment thread vllm/compilation/backends.py Outdated
@frgossen frgossen force-pushed the core-standalone-autograd-cache-key branch from 74e9ad1 to b92c1cd on April 10, 2026 18:06
@frgossen frgossen force-pushed the core-standalone-autograd-cache-key branch from b92c1cd to e15d901 on April 10, 2026 18:17
frgossen added a commit to frgossen/vllm that referenced this pull request Apr 10, 2026
…zation

Signed-off-by: Frederik Gossen <frgossen@meta.com>
PR: vllm-project#39517
Branch: core-standalone-autograd-cache-key
@frgossen frgossen force-pushed the core-standalone-autograd-cache-key branch from e15d901 to 6d024dc on April 10, 2026 19:26
frgossen added a commit to frgossen/vllm that referenced this pull request Apr 14, 2026
…zation

@frgossen frgossen force-pushed the core-standalone-autograd-cache-key branch from 6d024dc to d471fe2 on April 14, 2026 15:33
@frgossen frgossen force-pushed the core-standalone-autograd-cache-key branch from d471fe2 to 0bc2392 on April 20, 2026 18:48
@frgossen frgossen marked this pull request as ready for review April 20, 2026 18:49

@claude claude bot left a comment


Claude Code Review

This pull request is from a fork — automated review is disabled. A repository maintainer can comment @claude review to run a one-time review.
