[Perf] Fix slow hasattr in CUDAGraphWrapper.__getattr__#37425
Isotr0py merged 5 commits into vllm-project:main from
Conversation
Code Review
The pull request addresses a performance issue in CUDAGraphWrapper.__getattr__ caused by slow exception string formatting when hasattr is used on non-existent attributes. The fix is correct in its goal, but I have provided a suggestion for a more idiomatic and performant approach using getattr with a default sentinel value. This alternative avoids the performance pitfall of hasattr more directly and is a common pattern for implementing efficient proxy objects.
vllm/compilation/cuda_graph.py
Outdated
```diff
 if hasattr(self.runnable, key):
     return getattr(self.runnable, key)
 raise AttributeError(
-    f"Attribute {key} not exists in the runnable of "
-    f"cudagraph wrapper: {self.runnable}"
+    f"Attribute {key} not exists in the runnable of cudagraph wrapper"
 )
```
While removing the object representation from the error string fixes the performance issue with hasattr, a more idiomatic and robust way to implement a performant proxy __getattr__ is to use getattr with a default value. This avoids relying on exception handling for control flow, which is what makes hasattr slow on failure. This approach is generally faster and results in cleaner code.
To implement this, you would need to define a sentinel object at the module level, for example:
_sentinel = object()
```diff
-if hasattr(self.runnable, key):
-    return getattr(self.runnable, key)
-raise AttributeError(
-    f"Attribute {key} not exists in the runnable of cudagraph wrapper"
-)
+value = getattr(self.runnable, key, _sentinel)
+if value is not _sentinel:
+    return value
+raise AttributeError(
+    f"Attribute {key} not exists in the runnable of cudagraph wrapper"
+)
```
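The suggested pattern can be sketched as a minimal, self-contained proxy. `Wrapper` and `Model` below are hypothetical stand-ins, not the actual vLLM classes; the point is only to show the sentinel-based delegation:

```python
# Module-level sentinel: a unique object no real attribute can equal.
_sentinel = object()

class Wrapper:
    """Hypothetical stand-in for CUDAGraphWrapper."""

    def __init__(self, runnable):
        self.runnable = runnable

    def __getattr__(self, key):
        # Only called when normal attribute lookup fails on the wrapper.
        value = getattr(self.runnable, key, _sentinel)
        if value is not _sentinel:
            return value
        # Keep the message cheap: no repr() of the (possibly huge) runnable.
        raise AttributeError(
            f"Attribute {key} not exists in the runnable of cudagraph wrapper"
        )

class Model:
    def forward(self):
        return "ok"

w = Wrapper(Model())
print(w.forward())            # delegated to the wrapped model -> "ok"
print(hasattr(w, "missing"))  # False, without formatting an expensive repr
```

On the failure path this still raises (and lets `hasattr` catch) an `AttributeError`, but the sentinel avoids the nested `hasattr`/`getattr` double lookup on the success path.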
@ProExpertProg Could you please take a look at this?
Isotr0py
left a comment
LGTM, but can we cache __repr__(self.runnable) to keep the message friendly for debugging?
👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs would not trigger full CI run by default; only a limited set of checks runs automatically. You can ask your reviewers to trigger select CI tests on top of that. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

If you have any questions, please reach out to us on Slack at https://slack.vllm.ai. 🚀
Signed-off-by: 智鸣 <hzm414167@alibaba-inc.com>
vllm/compilation/cuda_graph.py
Outdated
```diff
 raise AttributeError(
     f"Attribute {key} not exists in the runnable of "
-    f"cudagraph wrapper: {self.runnable}"
+    f"cudagraph wrapper: {self._runnable_str}"
```
Wouldn't it be more performant to add some sort of hasattr override? (does Python support those?) instead of relying on raising and catching an exception
Looks like Python does not have a custom `hasattr` override.
Sorry, I want to take a closer look at this. This is going to regress the cold start time (but I agree it fixes the runtime overhead issue). Trying to see if we can avoid that as well.
@zou3519 I was wondering why this affects cold start time?
This PR moves the str(self.runnable) to the CUDAGraphWrapper constructor. If we're deploying with piecewise cudagraphs, an extreme case is 100 piecewise graphs. If str(self.runnable) is expensive (10ms), then this is one additional second of startup time. It would be better to just not do
vllm/compilation/cuda_graph.py
Outdated
```diff
     cudagraph_options: CUDAGraphOptions | None = None,
 ) -> None:
     self.runnable = runnable
+    self._runnable_str = str(runnable)
```
If you want to ship this PR now and think later, then my vote is to remove this line and avoid putting str(runnable) into the AttributeError message.
Otherwise I want to learn why a hasattr call is in the hot-path
I see. It would be great if we could do
I think in piecewise CUDA graph mode there is still only one CUDAGraphWrapper, so this should still correspond to only one str(self.runnable) call rather than one per piecewise graph.
Sorry to be clear: let's say we have llama-3-70b
At the very least, we are creating 81 CUDAGraphWrappers for this model.
Signed-off-by: 智鸣 <hzm414167@alibaba-inc.com>
Thanks for clarifying.
```diff
-    f"Attribute {key} not exists in the runnable of "
-    f"cudagraph wrapper: {self._runnable_str}"
-)
+raise AttributeError
```
does this need to be AttributeError()? (I don't know how this works)
Per https://docs.python.org/3/library/functions.html#getattr, the default implementation raises AttributeError when the attribute does not exist.
Does it matter if it's "raise AttributeError" vs "raise AttributeError()" ?
I think they are nearly the same when no error message is provided.
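They are indeed equivalent: when a class (rather than an instance) is raised, Python instantiates it with no arguments. A quick check:

```python
def bare():
    raise AttributeError      # raising the class

def called():
    raise AttributeError()    # raising an instance explicitly

def catch(fn):
    try:
        fn()
    except AttributeError as e:
        # Capture type, args, and message for comparison.
        return (type(e), e.args, str(e))

# Both paths produce an AttributeError with empty args and empty message.
print(catch(bare) == catch(called))  # True
```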
Signed-off-by: 智鸣 <hzm414167@alibaba-inc.com>
Head branch was pushed to by a user without write access
…#37425) Signed-off-by: 智鸣 <hzm414167@alibaba-inc.com>
### What this PR does / why we need it?
Follow vllm-project/vllm#37425, vllm-project/vllm-omni#1982. Copied from them:

Notice that `hasattr(self.model, "flush_pending_metadata")` costs 6ms per decode step when profiling Qwen3 Omni. The original `CUDAGraphWrapper.__getattr__` raises:

```python
raise AttributeError(f"... cudagraph wrapper: {self.runnable}")
```

When `hasattr()` is called for a non-existent attribute, Python internally calls `__getattr__`, which constructs this `AttributeError`. The `{self.runnable}` triggers `__repr__()` on the underlying model (e.g., `Qwen3OmniMoeForConditionalGeneration`), which recursively traverses the entire nn.Module tree to generate an 18,000+ character string. This takes ~6-7ms per call. Since `hasattr(self.model, "flush_pending_metadata")` is called every decode step in the Talker forward path, this adds ~6ms overhead per step, severely impacting audio inter-chunk latency (ICL).

```python
hasattr(self.model, "flush_pending_metadata")
→ getattr(self.model, "flush_pending_metadata")
→ not found in CUDAGraphWrapper.__dict__
→ not found in the CUDAGraphWrapper class hierarchy
→ triggers CUDAGraphWrapper.__getattr__("flush_pending_metadata")
→ hasattr(self.runnable, "flush_pending_metadata")  # runnable also doesn't have it
→ executes raise AttributeError(f"... {self.runnable}")
→ Python needs to construct the exception object
→ the f-string triggers self.runnable.__repr__()
→ Qwen3OmniMoeForConditionalGeneration.__repr__()
→ recursively traverses the entire nn.Module tree
→ generates an 18,000+ character string
→ takes ~6 ms
→ AttributeError object is created
→ hasattr catches the AttributeError and returns False
→ the 18,000-character string is immediately discarded (no one ever sees it)
```

### Does this PR introduce _any_ user-facing change?
NO.

### How was this patch tested?
See vllm-project/vllm-omni#1982

- vLLM version: v0.17.0
- vLLM main: vllm-project/vllm@4497431

---------
Signed-off-by: gcanlin <canlinguosdu@gmail.com>
…#37425) Signed-off-by: 智鸣 <hzm414167@alibaba-inc.com> Signed-off-by: Monishver Chandrasekaran <monishverchandrasekaran@gmail.com>
…#37425) Signed-off-by: 智鸣 <hzm414167@alibaba-inc.com> Signed-off-by: Vinay Damodaran <vrdn@hey.com>
Purpose
Ref vllm-project/vllm-omni#1982
Test Plan
Test Result