[Bugfix][ROCm] Fix worker startup OOM on ROCm by skipping unreliable cudagraph memory profiling#36720
Conversation
|
Hi @JartX, the pre-commit checks have failed. Please run: uv pip install pre-commit
pre-commit install
pre-commit run --all-filesThen, commit the changes and push to your branch. For future commits, Tip Is
|
There was a problem hiding this comment.
Code Review
This pull request addresses an out-of-memory error on ROCm platforms during worker startup by skipping the unreliable CUDA graph memory profiling. The changes in vllm/v1/worker/gpu_worker.py correctly implement this fix by adding a platform check. The logic is sound and directly addresses the root cause described. I've found one minor issue: a stray string in a docstring that appears to be a leftover from development and should be removed.
…ory estimation error Signed-off-by: JartX <sagformas@epdcenter.es>
d2fce77 to
5cf8086
Compare
|
/gemini review |
There was a problem hiding this comment.
Code Review
This pull request addresses an Out-Of-Memory (OOM) error on ROCm platforms during worker startup. The fix involves skipping the CUDA graph memory profiling on ROCm, as it has been identified as unreliable on that platform. This is achieved by adding and not current_platform.is_rocm() to the condition for profiling. The change is correct, well-targeted, and should resolve the described issue without side effects on other platforms. The added comments also improve code clarity. I approve these changes.
|
Similar on XPU. Will |
|
Hi! @xinyu-intel For example, Qwen3.5 27B Int4 crashed on 48GB VRAM (RDNA3) due to the estimation logic. In contrast, Qwen3.5 35B Int4 functioned correctly on the same 48GB RDNA3 setup; at a 40,960 context length, there was still nearly 16x available, whereas with the 27B model, only 5x remained. I have not been able to reproduce this behavior on NVIDIA hardware. The specific error reported was a HIP Memory error. |
Hi. From my understanding, it is difficult to estimate the behavior consistently across different models. This is because mem_get_info queries the free memory reported by the device driver, whereas memory_reserved reflects the memory reserved by the PyTorch caching allocator. Since these two values come from different layers of the stack, their relationship is not straightforward. In particular, the behavior is not fully predictable because it depends on how the device driver manages and reports memory. I’m not sure whether memory_reserved is a more reasonable metric, but since graph capture is a PyTorch feature, querying memory statistics from PyTorch seems more straightforward. |
|
RuntimeError: Worker failed with error 'HIP out of memory. Tried to allocate 784.00 MiB. GPU 0 has a total capacity of 23.98 GiB of which 630.00 MiB is free. Of the allocated memory 22.27 GiB is allocated by PyTorch, and 492.89 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)', please check the stack trace above for the root cause After torch.cuda.mem_get_info to torch.cuda.memory_reserved |
|
CUDA graph memory profiling is a feature and will enabled in 0.19 by default. I think your code change will always disable this feature on ROCm platform. |
|
The problem seems to occur here: _init_minimal_kv_cache_for_profiling in gpu_model_runner.py If I set a 90% maximum usage safety margin in the free RAM calculation, then it works correctly. We need to try to get the calculation right, as xinyu-intel correctly pointed out: |
|
@JartX I have noticed this, there are 2 issues in this PR #30515 . The PR introduced a flag (Critical) However, this line vllm/vllm/v1/worker/gpu_worker.py Lines 458 to 496 in 8374387 cudagraph_memory_estimate rather than cudagraph_memory_estimate_applied.
Based on the logic flow of vllm/vllm/v1/worker/gpu_worker.py Lines 409 to 413 in 8374387
So, a proper fix for now is to make sure that those @MatthewBonanni Could you check if my understanding is correct? |
|
@tjtanaa Both of these are intentional and correct as is. Line 458 is intentionally supposed to be Line 419 is also intentionally supposed to be |
|
@JartX could you clarify the following:
edit: I see @jikunshang's comment now: #36720 (comment) could you inform what method would be more reliable for other platforms? |
|
Not should be work |
|
/gemini review |
There was a problem hiding this comment.
Code Review
This pull request fixes an out-of-memory error on ROCm platforms during worker startup by correcting the CUDA graph memory profiling logic. The change replaces torch.cuda.mem_get_info(), which is unreliable on ROCm, with torch.cuda.memory_allocated() for a more accurate and platform-agnostic measurement of memory usage. The fix is well-targeted and correctly addresses the root cause. The implementation is sound and I have no further comments.
|
Hi @JartX, the pre-commit checks have failed. Please run: uv pip install pre-commit>=4.5.1
pre-commit install
pre-commit run --all-filesThen, commit the changes and push to your branch. For future commits, Tip Is
|
|
@JartX Replacing with |
We're talking about how it no longer crashes on me due to HIP, but that it can crash in other cases too. |
|
@JartX re-upping my question from earlier, what is actually causing the ValueError? Can you add a stack trace? |
988b309 to
5cf8086
Compare
|
This PR also solves (works around) the issue where after #30515 on ROCm MoE model performance would drop by up to 20% |
There was a problem hiding this comment.
Thanks for updating the PR, I think disabling on ROCm is the right approach for now. The fix works in my testing too. Can you update the description to reflect the final state of the PR?
In the future, can you also avoid force-pushing? It wipes out the history which makes the discussion in the PR much harder to parse
I've rewritten and rolled back to the original branch because you were right. I didn't mean to cause any trouble; I apologize if it bothered you.🙏 |
|
@JartX no problem! It's just good practice for the sake of documentation. I appreciate the fix! |
…cudagraph memory profiling (vllm-project#36720) Signed-off-by: JartX <sagformas@epdcenter.es>
…cudagraph memory profiling (vllm-project#36720) Signed-off-by: JartX <sagformas@epdcenter.es>
Problem
On ROCm/HIP platforms, vLLM fails to start with:
This was introduced by the pr
[UX][Startup] Account for CUDA graphs during memory profiling
#30515), which removed
profile_cudagraph_memory()andthe explicit recalculation of
torch_peak_increaseandnon_kv_cache_memory.Root Cause
profile_cudagraph_memory()usestorch.cuda.mem_get_info()andgraph_pool_handle()which behave differently on ROCm/HIP and can produceincorrect or negative memory estimates. When this estimate is applied to
available_kv_cache_memory_bytes, the result is inflated beyond the actualfree GPU memory, causing
request_memory()to fail.The workaround
VLLM_MEMORY_PROFILER_ESTIMATE_CUDAGRAPHS=1happened to workbecause it forced the (incorrect) estimate to be applied, accidentally
compensating for the missing
torch_peak_increaserecalculation.Fix
Disable memory profile on Rocm