[CUDA Graph] Enhance CUDA graph input address debugging #35605
yiz-liu wants to merge 2 commits into vllm-project:main
Conversation
Code Review
This pull request significantly enhances the debugging capabilities for CUDA graphs by implementing a more thorough check of input tensor memory addresses. The new _extract_tensor_addresses function recursively traverses nested input structures, and _get_inputs_to_extract uses introspection to provide meaningful paths for tensors in debug logs. The improved error reporting for mismatched addresses is also a great addition. I've found one area for improvement in the object traversal logic to make it more comprehensive by adding support for objects that use __slots__.
vllm/compilation/cuda_graph.py
Outdated
```python
elif hasattr(obj, "__dict__"):
    # Use vars(obj) instead of dir() to avoid triggering @property getters
    try:
        for k, v in vars(obj).items():
            if not k.startswith("_") and not isinstance(
                getattr(type(obj), k, None), property
            ):
                new_prefix = f"{prefix}.{k}" if prefix else k
                addresses.update(
                    _extract_tensor_addresses(v, new_prefix, visited)
                )
    except TypeError:
        pass
```
The current implementation for traversing custom objects only considers __dict__. This will miss tensors stored in objects that use __slots__ for attribute storage. This could lead to the debugging feature failing to detect address changes for tensors in such objects, undermining the purpose of this PR.
To make the traversal more robust, it should be updated to handle objects with __slots__, as well as objects that might have both __slots__ and __dict__.
```python
else:
    # Handle objects with __slots__
    if hasattr(obj, "__slots__"):
        for slot_name in obj.__slots__:
            if slot_name == "__dict__":
                continue
            if not slot_name.startswith("_"):
                new_prefix = f"{prefix}.{slot_name}" if prefix else slot_name
                try:
                    v = getattr(obj, slot_name)
                    addresses.update(
                        _extract_tensor_addresses(v, new_prefix, visited))
                except AttributeError:
                    pass  # Attribute not set
    # Handle objects with __dict__
    if hasattr(obj, "__dict__"):
        # Use vars(obj) instead of dir() to avoid triggering @property getters
        try:
            for k, v in vars(obj).items():
                if not k.startswith("_") and not isinstance(
                        getattr(type(obj), k, None), property):
                    new_prefix = f"{prefix}.{k}" if prefix else k
                    addresses.update(
                        _extract_tensor_addresses(v, new_prefix, visited))
        except TypeError:
            pass
```
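The distinction the reviewer is pointing at can be shown with a minimal stand-alone example (not vLLM code): objects that declare only `__slots__` have no `__dict__` at all, so a `vars()`-based traversal never sees their attributes, while iterating `__slots__` does.

```python
# Demonstration of why vars()/__dict__ traversal misses __slots__ attributes.

class WithDict:
    def __init__(self):
        self.weight = "tensor_a"

class WithSlots:
    __slots__ = ("weight",)
    def __init__(self):
        self.weight = "tensor_b"

d = WithDict()
s = WithSlots()

print(vars(d))                  # {'weight': 'tensor_a'} -- visible via __dict__
print(hasattr(s, "__dict__"))   # False: vars(s) would raise TypeError
# The only way to reach the slotted attribute is through __slots__ itself:
print([getattr(s, name) for name in s.__slots__])  # ['tensor_b']
```

A class can also define both `__slots__` (including a `"__dict__"` entry) and dynamic attributes, which is why the suggestion checks both storage mechanisms rather than using `elif`.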
Cc @BoyuanFeng
vllm's … In addition, could you share more on use cases (e.g., a specific failure command)?
Improves the validation of CUDA graph input stability by recursively tracking tensor memory addresses within nested data structures. Instead of tracking a flat list of pointers, this uses a structured approach to map specific input paths to their memory locations. This provides significantly more detailed error messages when a tensor's address changes between capture and replay, making it easier to identify exactly which part of the input state is violating CUDA graph requirements.

Signed-off-by: Yizhou Liu <liu_yizhou@outlook.com>
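The path-to-address mapping described in this commit message can be sketched as follows. This is an illustrative reconstruction, not the vLLM implementation: `extract_addresses` is a hypothetical name, and `FakeTensor` stands in for `torch.Tensor` so the example runs anywhere (with real tensors, an `isinstance(obj, torch.Tensor)` check and `obj.data_ptr()` would play the same roles).

```python
# Sketch: recursively map input paths (e.g. "meta.slots[0]") to addresses.

class FakeTensor:
    """Stand-in for torch.Tensor exposing only data_ptr()."""
    def __init__(self, addr):
        self._addr = addr
    def data_ptr(self):
        return self._addr

def extract_addresses(obj, prefix=""):
    addresses = {}
    if isinstance(obj, FakeTensor):
        addresses[prefix or "<root>"] = obj.data_ptr()
    elif isinstance(obj, dict):
        for k, v in obj.items():
            key = f"{prefix}.{k}" if prefix else str(k)
            addresses.update(extract_addresses(v, key))
    elif isinstance(obj, (list, tuple)):
        for i, v in enumerate(obj):
            addresses.update(extract_addresses(v, f"{prefix}[{i}]"))
    return addresses

inputs = {"ids": FakeTensor(0x1000), "meta": {"slots": [FakeTensor(0x2000)]}}
print(extract_addresses(inputs))
# {'ids': 4096, 'meta.slots[0]': 8192}
```

Keeping the structural path as the dictionary key is what lets the later assertion message say exactly *which* nested input moved, instead of only reporting that some pointer in a flat list changed.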
Force-pushed edf8721 to 0e96045
@BoyuanFeng Sorry for the late response. Indeed, the model inputs are already flattened. After simplifying the process, it should be much easier to review now; could you please take a look again? Thanks.
Hi @yiz-liu, the pre-commit checks have failed. Please run:

```shell
uv pip install pre-commit>=4.5.1
pre-commit install
pre-commit run --all-files
```

Then, commit the changes and push to your branch.
This commit vastly simplifies the tracking and validation of CUDA tensor memory addresses during CUDAGraph debug runs.

- Removes complex recursive object traversal, replacing it with explicit, shallow extraction for `attn_metadata`.
- Drops the expensive `inspect.signature(...).bind()` method; uses simple list concatenation for checking arguments and kwargs.
- Refactors the mismatch logging into an elegant set-based intersection/difference check for `attn_metadata`.
- Corrects behavior where the code previously lost the `is_cuda` check for tensors.
- Substantially lowers performance overhead for the debug mode.

Signed-off-by: Yizhou Liu <liu_yizhou@outlook.com>
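The set-based intersection/difference check mentioned in the commit message can be illustrated with a short sketch. The function and variable names here (`report_mismatch`, `captured`, `replayed`) are illustrative, not vLLM's; the idea is to compare the path-to-address map recorded at capture time against the one observed at replay time.

```python
# Sketch: classify address mismatches with set operations on the path keys.

def report_mismatch(captured, replayed):
    cap_keys, rep_keys = set(captured), set(replayed)
    missing = cap_keys - rep_keys                      # present at capture only
    added = rep_keys - cap_keys                        # present at replay only
    moved = {k for k in cap_keys & rep_keys            # same key, new address
             if captured[k] != replayed[k]}
    return missing, added, moved

captured = {"ids": 0x1000, "meta.slots": 0x2000}
replayed = {"ids": 0x1000, "meta.slots": 0x3000, "extra": 0x4000}
missing, added, moved = report_mismatch(captured, replayed)
print(missing, added, moved)  # set() {'extra'} {'meta.slots'}
```

Because each category is computed independently, the resulting assertion message can list missing, added, and relocated keys separately, which is far more actionable than a single "pointers differ" failure.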
Purpose
Previously, the debugging mode for CUDA Graphs only checked a flat list of tensor data pointers directly passed in `*args`. This missed tensors nested inside structures, `**kwargs`, or `attn_metadata`, which could lead to silent data corruption if addresses changed between capture and replay. This commit strengthens the validation by:
- Including `forward_context.attn_metadata` in the validation set only when `runtime_mode` is `CUDAGraphMode.FULL`.
- Adding `_extract_tensors` to extract tensor addresses from `attn_metadata`.
- Adding `kwargs` tensor addresses to `input_addresses`.
- Improving the `AssertionError` during replay to explicitly list which keys (paths) are missing, added, or have mismatched memory addresses.

Test Plan
We can manually clone a tensor between capture and replay; the resulting address change should then be caught by the guard.
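A minimal simulation of this test plan (not vLLM code): cloning an input between capture and replay gives it fresh storage, so its recorded address no longer matches. Here `id()` stands in for `torch.Tensor.data_ptr()`; with real tensors, `t.clone()` allocates new device memory, so `data_ptr()` would differ in the same way.

```python
# Simulate the guard: an input "cloned" after capture gets a new address.

original = bytearray(16)              # pretend this is the captured input tensor
captured = {"input_ids": id(original)}

clone = bytearray(original)           # "t = t.clone()": new storage, same data
replayed = {"input_ids": id(clone)}

mismatched = [k for k in captured if captured[k] != replayed[k]]
print(mismatched)  # ['input_ids'] -- the guard should report this path
```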
Test Result