[BugFix] Fix memory spike in workspace allocation #30744
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>
Code Review
This pull request addresses a memory spike during workspace allocation by replacing torch.Tensor.resize_ with an explicit deallocation followed by a fresh allocation. As the change identifies, resize_ can temporarily double memory usage and cause out-of-memory errors. The new code drops the reference to the old tensor first, so its memory can be released before the new, larger tensor is allocated. The logic is sound, and the fix should effectively mitigate the memory spikes.
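To make the peak-memory argument concrete, here is a minimal sketch of the accounting only. It does not use PyTorch or the actual vLLM workspace code: `ToyAllocator`, `grow_with_copy`, and `grow_release_first` are hypothetical names, and the model simply assumes that a growing resize keeps the old buffer alive until the new one exists, which is the behavior the review attributes to resize_.

```python
class ToyAllocator:
    """Hypothetical byte counter standing in for a device memory pool."""

    def __init__(self):
        self.in_use = 0  # bytes currently allocated
        self.peak = 0    # highest value in_use has reached

    def alloc(self, nbytes):
        self.in_use += nbytes
        self.peak = max(self.peak, self.in_use)
        return nbytes  # the "handle" is just the size in this toy model

    def free(self, handle):
        self.in_use -= handle


def grow_with_copy(pool, old_bytes, new_bytes):
    # Assumed resize-style growth: the new buffer is allocated while the
    # old one is still alive, and the old one is released afterwards.
    old = pool.alloc(old_bytes)
    pool.alloc(new_bytes)
    pool.free(old)
    return pool.peak


def grow_release_first(pool, old_bytes, new_bytes):
    # Pattern described in this PR: release the old buffer first,
    # then allocate the larger one.
    old = pool.alloc(old_bytes)
    pool.free(old)
    pool.alloc(new_bytes)
    return pool.peak


peak_copy = grow_with_copy(ToyAllocator(), 100, 150)
peak_release = grow_release_first(ToyAllocator(), 100, 150)
print(peak_copy, peak_release)
```

In this toy model the first pattern peaks at the sum of the old and new sizes, while the second peaks at the new size alone. The trade-off is that releasing first discards the old contents, which is acceptable only when the buffer is scratch space.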
# FIXIT: find out which code initialize cuda before running the test
# before the fix, we need to use spawn to test it
- export VLLM_WORKER_MULTIPROC_METHOD=spawn
# Alot of these tests are on the edge of OOMing
NIT
Suggested change:
- # Alot of these tests are on the edge of OOMing
+ # A lot of these tests are on the edge of OOMing
Will open a separate fix for the failing fusion tests; the failure is related to the recent deprecation (#30396).
Fixed by #30787
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com> Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk> (cherry picked from commit 00a8d76)
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com> Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk> Signed-off-by: Ubuntu <mjtaheri68@gmail.com>
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com> Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk>
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com> Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk> Signed-off-by: dsuhinin <suhinin.dmitriy@gmail.com>
Attempt to fix: https://buildkite.com/vllm/ci/builds/43469#019b1ba9-b250-451b-8125-dc941489fe04