[Perf] Fix slow hasattr in CUDAGraphWrapper.__getattr__#37425
Isotr0py merged 5 commits into vllm-project:main from
Conversation
Code Review
The pull request addresses a performance issue in CUDAGraphWrapper.__getattr__ caused by slow exception string formatting when hasattr is used on non-existent attributes. The fix is correct in its goal, but I have provided a suggestion for a more idiomatic and performant approach using getattr with a default sentinel value. This alternative avoids the performance pitfall of hasattr more directly and is a common pattern for implementing efficient proxy objects.
vllm/compilation/cuda_graph.py
Outdated
```diff
 if hasattr(self.runnable, key):
     return getattr(self.runnable, key)
 raise AttributeError(
-    f"Attribute {key} not exists in the runnable of "
-    f"cudagraph wrapper: {self.runnable}"
+    f"Attribute {key} not exists in the runnable of cudagraph wrapper"
 )
```
While removing the object representation from the error string fixes the performance issue with hasattr, a more idiomatic and robust way to implement a performant proxy __getattr__ is to use getattr with a default value. This avoids relying on exception handling for control flow, which is what makes hasattr slow on failure. This approach is generally faster and results in cleaner code.
To implement this, you would need to define a sentinel object at the module level, for example:
_sentinel = object()
```diff
-if hasattr(self.runnable, key):
-    return getattr(self.runnable, key)
-raise AttributeError(
-    f"Attribute {key} not exists in the runnable of cudagraph wrapper"
-)
+value = getattr(self.runnable, key, _sentinel)
+if value is not _sentinel:
+    return value
+raise AttributeError(
+    f"Attribute {key} not exists in the runnable of cudagraph wrapper"
+)
```
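The suggested pattern can be sketched as a minimal, self-contained proxy. `Wrapper` and `Model` below are hypothetical stand-ins, not the actual vLLM classes; the point is only to show the sentinel-based delegation:

```python
# Module-level sentinel: a unique object no real attribute can equal.
_sentinel = object()

class Wrapper:
    """Hypothetical stand-in for CUDAGraphWrapper."""

    def __init__(self, runnable):
        self.runnable = runnable

    def __getattr__(self, key):
        # Only called when normal attribute lookup fails on the wrapper.
        value = getattr(self.runnable, key, _sentinel)
        if value is not _sentinel:
            return value
        # Keep the message cheap: no repr() of the (possibly huge) runnable.
        raise AttributeError(
            f"Attribute {key} not exists in the runnable of cudagraph wrapper"
        )

class Model:
    def forward(self):
        return "ok"

w = Wrapper(Model())
print(w.forward())            # delegated to the wrapped model -> "ok"
print(hasattr(w, "missing"))  # False, without formatting an expensive repr
```

On the failure path this still raises (and lets `hasattr` catch) an `AttributeError`, but the sentinel avoids the nested `hasattr`/`getattr` double lookup on the success path.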
@ProExpertProg Could you please take a look at this?
Isotr0py
left a comment
LGTM, but can we cache __repr__(self.runnable) to keep the message friendly for debugging?
👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs would not trigger full CI run by default; only a limited set of checks runs automatically. You can ask your reviewers to trigger select CI tests on top of that. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

If you have any questions, please reach out to us on Slack at https://slack.vllm.ai. 🚀
Signed-off-by: 智鸣 <hzm414167@alibaba-inc.com>
vllm/compilation/cuda_graph.py
Outdated
```diff
 raise AttributeError(
     f"Attribute {key} not exists in the runnable of "
-    f"cudagraph wrapper: {self.runnable}"
+    f"cudagraph wrapper: {self._runnable_str}"
```
Wouldn't it be more performant to add some sort of hasattr override? (does Python support those?) instead of relying on raising and catching an exception
Looks like Python does not have a custom `hasattr` override.
Sorry, I want to take a closer look at this. This is going to regress the cold start time (but I agree it fixes the runtime overhead issue). Trying to see if we can avoid that as well.
@zou3519 I was wondering why this affects cold start time?
This PR moves the str(self.runnable) to the CUDAGraphWrapper constructor. If we're deploying with piecewise cudagraphs, an extreme case is 100 piecewise graphs. If str(self.runnable) is expensive (10ms), then this is one additional second of startup time. It would be better to just not do
vllm/compilation/cuda_graph.py
Outdated
```diff
     cudagraph_options: CUDAGraphOptions | None = None,
 ) -> None:
     self.runnable = runnable
+    self._runnable_str = str(runnable)
```
If you want to ship this PR now and think later, then my vote is to remove this line and avoid putting str(runnable) into the AttributeError message.
Otherwise I want to learn why a hasattr call is in the hot-path
I see. It would be great if we could do
I think in piecewise CUDA graph mode there is still only one CUDAGraphWrapper, so this should still correspond to only one str(self.runnable) call rather than one per piecewise graph.
Sorry to be clear: let's say we have llama-3-70b
At the very least, we are creating 81 CUDAGraphWrappers for this model.
Signed-off-by: 智鸣 <hzm414167@alibaba-inc.com>
Thanks for clarifying.
```diff
-    f"Attribute {key} not exists in the runnable of "
-    f"cudagraph wrapper: {self._runnable_str}"
-)
+raise AttributeError
```
does this need to be AttributeError()? (I don't know how this works)
Per https://docs.python.org/3/library/functions.html#getattr, the default implementation raises AttributeError when the attribute does not exist.
Does it matter if it's "raise AttributeError" vs "raise AttributeError()" ?
I think they are nearly the same when no error message is provided.
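They are indeed equivalent: when a class (rather than an instance) is raised, Python instantiates it with no arguments. A quick check:

```python
def bare():
    raise AttributeError      # raising the class

def called():
    raise AttributeError()    # raising an instance explicitly

def catch(fn):
    try:
        fn()
    except AttributeError as e:
        # Capture type, args, and message for comparison.
        return (type(e), e.args, str(e))

# Both paths produce an AttributeError with empty args and empty message.
print(catch(bare) == catch(called))  # True
```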
Signed-off-by: 智鸣 <hzm414167@alibaba-inc.com>
Head branch was pushed to by a user without write access
…#37425) Signed-off-by: 智鸣 <hzm414167@alibaba-inc.com>
### What this PR does / why we need it?
Follow vllm-project/vllm#37425, vllm-project/vllm-omni#1982. Copied from them:

Notice that `hasattr(self.model, "flush_pending_metadata")` costs 6ms per decode step when profiling Qwen3 Omni. The original `CUDAGraphWrapper.__getattr__` raises:

```python
raise AttributeError(f"... cudagraph wrapper: {self.runnable}")
```

When `hasattr()` is called for a non-existent attribute, Python internally calls `__getattr__`, which constructs this `AttributeError`. The `{self.runnable}` triggers `__repr__()` on the underlying model (e.g., `Qwen3OmniMoeForConditionalGeneration`), which recursively traverses the entire nn.Module tree to generate an 18,000+ character string. This takes ~6-7ms per call. Since `hasattr(self.model, "flush_pending_metadata")` is called every decode step in the Talker forward path, this adds ~6ms overhead per step, severely impacting audio inter-chunk latency (ICL).

```python
hasattr(self.model, "flush_pending_metadata")
→ getattr(self.model, "flush_pending_metadata")
→ not found in CUDAGraphWrapper.__dict__
→ not found in the CUDAGraphWrapper class hierarchy
→ triggers CUDAGraphWrapper.__getattr__("flush_pending_metadata")
→ hasattr(self.runnable, "flush_pending_metadata")  # runnable also doesn't have it
→ executes raise AttributeError(f"... {self.runnable}")
→ Python needs to construct the exception object
→ the f-string triggers self.runnable.__repr__()
→ Qwen3OmniMoeForConditionalGeneration.__repr__()
→ recursively traverses the entire nn.Module tree
→ generates an 18,000+ character string
→ takes ~6 ms
→ AttributeError object is created
→ hasattr catches the AttributeError and returns False
→ the 18,000-character string is immediately discarded (no one ever sees it)
```

### Does this PR introduce _any_ user-facing change?
NO.

### How was this patch tested?
See vllm-project/vllm-omni#1982

- vLLM version: v0.17.0
- vLLM main: vllm-project/vllm@4497431

---------
Signed-off-by: gcanlin <canlinguosdu@gmail.com>
…#37425) Signed-off-by: 智鸣 <hzm414167@alibaba-inc.com> Signed-off-by: Monishver Chandrasekaran <monishverchandrasekaran@gmail.com>
…#37425) Signed-off-by: 智鸣 <hzm414167@alibaba-inc.com> Signed-off-by: Vinay Damodaran <vrdn@hey.com>
Purpose
Ref vllm-project/vllm-omni#1982
Test Plan
Test Result