
[CUDA Graph] Enhance CUDA graph input address debugging#35605

Open
yiz-liu wants to merge 2 commits into vllm-project:main from yiz-liu:meta-check

Conversation

Contributor

@yiz-liu yiz-liu commented Feb 28, 2026

Purpose

Previously, the debugging mode for CUDA Graphs only checked a flat list of tensor data pointers directly passed in *args. This missed tensors nested inside structures, **kwargs, or attn_metadata, which could lead to silent data corruption if addresses changed between capture and replay.

This commit strengthens the validation by:

  • Conditionally including forward_context.attn_metadata in the validation set only when runtime_mode is CUDAGraphMode.FULL.
  • Adding _extract_tensors to extract tensor addresses from attn_metadata.
  • Adding kwargs tensor addresses to input_addresses.
  • Improving the AssertionError during replay to explicitly list which keys (paths) are missing, added, or have mismatched memory addresses.

Test Plan

We can manually clone() a tensor inside attn_metadata between capture and replay; the new check should then flag the changed address.

Test Result

(EngineCore pid=809263) RuntimeError: Worker failed with error 'Input addresses for cudagraphs are different during replay.
(EngineCore pid=809263) Differences in attn_metadata detected:
(EngineCore pid=809263)   Changed addresses:
(EngineCore pid=809263)     model.layers.14.self_attn.attn.query_start_loc: 291420241920 -> 90296045568
(EngineCore pid=809263)     model.layers.12.self_attn.attn.query_start_loc: 291420241920 -> 90296045568
(EngineCore pid=809263)     model.layers.38.self_attn.attn.query_start_loc: 291420241920 -> 90296045568
...
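A report like the one above falls out of a set-based comparison of the capture-time and replay-time address maps. A minimal sketch (`diff_addresses` and the exact message layout are illustrative, not the PR's actual code):

```python
def diff_addresses(captured: dict[str, int], replayed: dict[str, int]) -> str:
    """Compare capture-time vs replay-time address maps via set operations."""
    cap_keys, rep_keys = set(captured), set(replayed)
    missing = cap_keys - rep_keys          # recorded at capture, gone at replay
    added = rep_keys - cap_keys            # new at replay, never captured
    changed = {k for k in cap_keys & rep_keys if captured[k] != replayed[k]}

    lines = []
    if missing:
        lines.append("  Missing keys: " + ", ".join(sorted(missing)))
    if added:
        lines.append("  Added keys: " + ", ".join(sorted(added)))
    if changed:
        lines.append("  Changed addresses:")
        lines += [f"    {k}: {captured[k]} -> {replayed[k]}" for k in sorted(changed)]
    return "\n".join(lines)
```

Feeding it the path-to-pointer maps produced at capture and replay yields the per-key "old -> new" lines shown in the log above.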

Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
  • (Optional) Release notes update. If your change is user facing, please update the release notes draft in the Google Doc.

Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request significantly enhances the debugging capabilities for CUDA graphs by implementing a more thorough check of input tensor memory addresses. The new _extract_tensor_addresses function recursively traverses nested input structures, and _get_inputs_to_extract uses introspection to provide meaningful paths for tensors in debug logs. The improved error reporting for mismatched addresses is also a great addition. I've found one area for improvement in the object traversal logic to make it more comprehensive by adding support for objects that use __slots__.

Comment on lines +165 to +177
elif hasattr(obj, "__dict__"):
    # Use vars(obj) instead of dir() to avoid triggering @property getters
    try:
        for k, v in vars(obj).items():
            if not k.startswith("_") and not isinstance(
                getattr(type(obj), k, None), property
            ):
                new_prefix = f"{prefix}.{k}" if prefix else k
                addresses.update(
                    _extract_tensor_addresses(v, new_prefix, visited)
                )
    except TypeError:
        pass


high

The current implementation for traversing custom objects only considers __dict__. This will miss tensors stored in objects that use __slots__ for attribute storage. This could lead to the debugging feature failing to detect address changes for tensors in such objects, undermining the purpose of this PR.

To make the traversal more robust, it should be updated to handle objects with __slots__, as well as objects that might have both __slots__ and __dict__.

        else:
            # Handle objects with __slots__
            if hasattr(obj, "__slots__"):
                for slot_name in obj.__slots__:
                    if slot_name == "__dict__":
                        continue
                    if not slot_name.startswith("_"):
                        new_prefix = f"{prefix}.{slot_name}" if prefix else slot_name
                        try:
                            v = getattr(obj, slot_name)
                            addresses.update(
                                _extract_tensor_addresses(v, new_prefix, visited))
                        except AttributeError:
                            pass  # Attribute not set
            # Handle objects with __dict__
            if hasattr(obj, "__dict__"):
                # Use vars(obj) instead of dir() to avoid triggering @property getters
                try:
                    for k, v in vars(obj).items():
                        if not k.startswith("_") and not isinstance(
                                getattr(type(obj), k, None), property):
                            new_prefix = f"{prefix}.{k}" if prefix else k
                            addresses.update(
                                _extract_tensor_addresses(v, new_prefix, visited))
                except TypeError:
                    pass

@ProExpertProg
Collaborator

Cc @BoyuanFeng

@BoyuanFeng
Collaborator

vLLM's torch.compile should flatten the model's inputs into args only, with empty kwargs, so we shouldn't need _extract_tensor_addresses to manually flatten the args/kwargs?

In addition, could you share more on use cases, (e.g., a specific failure command)?

Improves the validation of CUDA graph input stability by recursively tracking tensor memory addresses within nested data structures.

Instead of tracking a flat list of pointers, this uses a structured approach to map specific input paths to their memory locations. This provides significantly more detailed error messages when a tensor's address changes between capture and replay, making it easier to identify exactly which part of the input state is violating CUDA graph requirements.

Signed-off-by: Yizhou Liu <liu_yizhou@outlook.com>
@yiz-liu yiz-liu force-pushed the meta-check branch 2 times, most recently from edf8721 to 0e96045 on March 17, 2026 at 11:33
@yiz-liu
Contributor Author

yiz-liu commented Mar 17, 2026

vLLM's torch.compile should flatten the model's inputs into args only, with empty kwargs, so we shouldn't need _extract_tensor_addresses to manually flatten the args/kwargs?

In addition, could you share more on use cases, (e.g., a specific failure command)?

@BoyuanFeng Sorry for the late response. The model inputs are indeed already flattened after torch.compile, but this PR targets the AttentionMetadata. For example, this helped me with #34880 (comment), and it should be easy to reproduce by clone()-ing any tensor in common_attention_metadata when cudagraph_mode == CUDAGraphMode.FULL.

I have simplified the process; it should be much easier to review now. Could you please take a look again? Thanks.

@mergify

mergify bot commented Mar 17, 2026

Hi @yiz-liu, the pre-commit checks have failed. Please run:

uv pip install "pre-commit>=4.5.1"
pre-commit install
pre-commit run --all-files

Then, commit the changes and push to your branch.

For future commits, pre-commit will run automatically on changed files before each commit.

Tip

Is mypy failing?
mypy is run differently in CI. If the failure is related to this check, please use the following command to run it locally:
# For mypy (substitute "3.10" with the failing version if needed)
pre-commit run --hook-stage manual mypy-3.10

This commit vastly simplifies the tracking and validation of CUDA tensor memory addresses during CUDAGraph debug runs.

- Removes complex recursive object traversal, replacing it with explicit, shallow extraction for `attn_metadata`.

- Drops the expensive `inspect.signature(...).bind()` method. Uses simple list concatenation for checking arguments and kwargs.

- Refactors the mismatch logging into an elegant set-based intersection/difference check for `attn_metadata`.

- Corrects behavior where the code previously lost the `is_cuda` check for tensors.

- Substantially lowers performance overhead for the debug mode.

Signed-off-by: Yizhou Liu <liu_yizhou@outlook.com>