
[CUDA Graph] Enhance CUDA graph input address debugging#35605

Open
yiz-liu wants to merge 2 commits into vllm-project:main from yiz-liu:meta-check

Conversation

Contributor

@yiz-liu yiz-liu commented Feb 28, 2026

Purpose

Previously, the debugging mode for CUDA Graphs only checked a flat list of tensor data pointers directly passed in *args. This missed tensors nested inside structures, **kwargs, or attn_metadata, which could lead to silent data corruption if addresses changed between capture and replay.

This commit strengthens the validation by:

  • Conditionally including forward_context.attn_metadata in the validation set only when runtime_mode is CUDAGraphMode.FULL.
  • Adding _extract_tensors to extract tensor addresses from attn_metadata.
  • Adding kwargs tensor addresses to input_addresses.
  • Improving the AssertionError during replay to explicitly list which keys (paths) are missing, added, or have mismatched memory addresses.

Test Plan

We can manually clone() a tensor inside attn_metadata between capture and replay; the new check should then flag the changed address.

Test Result

(EngineCore pid=809263) RuntimeError: Worker failed with error 'Input addresses for cudagraphs are different during replay.
(EngineCore pid=809263) Differences in attn_metadata detected:
(EngineCore pid=809263)   Changed addresses:
(EngineCore pid=809263)     model.layers.14.self_attn.attn.query_start_loc: 291420241920 -> 90296045568
(EngineCore pid=809263)     model.layers.12.self_attn.attn.query_start_loc: 291420241920 -> 90296045568
(EngineCore pid=809263)     model.layers.38.self_attn.attn.query_start_loc: 291420241920 -> 90296045568
...
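A report like the one above falls out of a set-based comparison of the capture-time and replay-time address maps. A minimal sketch (`diff_addresses` and the exact message layout are illustrative, not the PR's actual code):

```python
def diff_addresses(captured: dict[str, int], replayed: dict[str, int]) -> str:
    """Compare capture-time vs replay-time address maps via set operations."""
    cap_keys, rep_keys = set(captured), set(replayed)
    missing = cap_keys - rep_keys          # recorded at capture, gone at replay
    added = rep_keys - cap_keys            # new at replay, never captured
    changed = {k for k in cap_keys & rep_keys if captured[k] != replayed[k]}

    lines = []
    if missing:
        lines.append("  Missing keys: " + ", ".join(sorted(missing)))
    if added:
        lines.append("  Added keys: " + ", ".join(sorted(added)))
    if changed:
        lines.append("  Changed addresses:")
        lines += [f"    {k}: {captured[k]} -> {replayed[k]}" for k in sorted(changed)]
    return "\n".join(lines)
```

Feeding it the path-to-pointer maps produced at capture and replay yields the per-key "old -> new" lines shown in the log above.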

Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
  • (Optional) Release notes update. If your change is user facing, please update the release notes draft in the Google Doc.

Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request significantly enhances the debugging capabilities for CUDA graphs by implementing a more thorough check of input tensor memory addresses. The new _extract_tensor_addresses function recursively traverses nested input structures, and _get_inputs_to_extract uses introspection to provide meaningful paths for tensors in debug logs. The improved error reporting for mismatched addresses is also a great addition. I've found one area for improvement in the object traversal logic to make it more comprehensive by adding support for objects that use __slots__.

Comment on lines +165 to +177
elif hasattr(obj, "__dict__"):
    # Use vars(obj) instead of dir() to avoid triggering @property getters
    try:
        for k, v in vars(obj).items():
            if not k.startswith("_") and not isinstance(
                getattr(type(obj), k, None), property
            ):
                new_prefix = f"{prefix}.{k}" if prefix else k
                addresses.update(
                    _extract_tensor_addresses(v, new_prefix, visited)
                )
    except TypeError:
        pass


high

The current implementation for traversing custom objects only considers __dict__. This will miss tensors stored in objects that use __slots__ for attribute storage. This could lead to the debugging feature failing to detect address changes for tensors in such objects, undermining the purpose of this PR.

To make the traversal more robust, it should be updated to handle objects with __slots__, as well as objects that might have both __slots__ and __dict__.

        else:
            # Handle objects with __slots__
            if hasattr(obj, "__slots__"):
                for slot_name in obj.__slots__:
                    if slot_name == "__dict__":
                        continue
                    if not slot_name.startswith("_"):
                        new_prefix = f"{prefix}.{slot_name}" if prefix else slot_name
                        try:
                            v = getattr(obj, slot_name)
                            addresses.update(
                                _extract_tensor_addresses(v, new_prefix, visited))
                        except AttributeError:
                            pass  # Attribute not set
            # Handle objects with __dict__
            if hasattr(obj, "__dict__"):
                # Use vars(obj) instead of dir() to avoid triggering @property getters
                try:
                    for k, v in vars(obj).items():
                        if not k.startswith("_") and not isinstance(
                                getattr(type(obj), k, None), property):
                            new_prefix = f"{prefix}.{k}" if prefix else k
                            addresses.update(
                                _extract_tensor_addresses(v, new_prefix, visited))
                except TypeError:
                    pass

@ProExpertProg
Collaborator

Cc @BoyuanFeng

@BoyuanFeng
Collaborator

vLLM's torch.compile should flatten the model's inputs into args only, with empty kwargs, so we shouldn't need _extract_tensor_addresses to manually flatten the args/kwargs?

In addition, could you share more on use cases, (e.g., a specific failure command)?

Improves the validation of CUDA graph input stability by recursively tracking tensor memory addresses within nested data structures.

Instead of tracking a flat list of pointers, this uses a structured approach to map specific input paths to their memory locations. This provides significantly more detailed error messages when a tensor's address changes between capture and replay, making it easier to identify exactly which part of the input state is violating CUDA graph requirements.

Signed-off-by: Yizhou Liu <liu_yizhou@outlook.com>
@yiz-liu yiz-liu force-pushed the meta-check branch 2 times, most recently from edf8721 to 0e96045 on March 17, 2026 at 11:33
@yiz-liu
Contributor Author

yiz-liu commented Mar 17, 2026

vLLM's torch.compile should flatten the model's inputs into args only, with empty kwargs, so we shouldn't need _extract_tensor_addresses to manually flatten the args/kwargs?

In addition, could you share more on use cases, (e.g., a specific failure command)?

@BoyuanFeng Sorry for the late response. The model inputs are indeed already flattened after torch.compile, but this PR targets the AttentionMetadata. For example, this helped me with #34880 (comment), and it should be easy to reproduce by clone()-ing any tensor in common_attention_metadata when cudagraph_mode == CUDAGraphMode.FULL.

I have simplified the process; it should be much easier to review now. Could you please take a look again? Thanks.

@mergify

mergify bot commented Mar 17, 2026

Hi @yiz-liu, the pre-commit checks have failed. Please run:

uv pip install "pre-commit>=4.5.1"
pre-commit install
pre-commit run --all-files

Then, commit the changes and push to your branch.

For future commits, pre-commit will run automatically on changed files before each commit.

Tip

Is mypy failing?
mypy is run differently in CI. If the failure is related to this check, please use the following command to run it locally:
# For mypy (substitute "3.10" with the failing version if needed)
pre-commit run --hook-stage manual mypy-3.10

This commit vastly simplifies the tracking and validation of CUDA tensor memory addresses during CUDAGraph debug runs.

- Removes complex recursive object traversal, replacing it with explicit, shallow extraction for `attn_metadata`.

- Drops the expensive `inspect.signature(...).bind()` method. Uses simple list concatenation for checking arguments and kwargs.

- Refactors the mismatch logging into an elegant set-based intersection/difference check for `attn_metadata`.

- Corrects behavior where the code previously lost the `is_cuda` check for tensors.

- Substantially lowers performance overhead for the debug mode.

Signed-off-by: Yizhou Liu <liu_yizhou@outlook.com>