[CI] Fix GPU memory leak when RemoteOpenAIServer fails to start in __init__ #37230
…init__ Signed-off-by: Andreas Karatzas <akaratza@amd.com>
Code Review
This pull request introduces several improvements to the test server cleanup logic to prevent resource leaks, particularly GPU memory, when the server fails to start. The main changes include:
- Refactoring the cleanup logic into a `_shutdown` method, which is now correctly called from `__init__` on startup failure.
- Replacing the `psutil`-based process cleanup with a more robust implementation that directly parses `/proc` to find and terminate processes in a process group.
- Making the GPU memory release check stricter by removing the "stabilization" fallback, ensuring that tests with memory leaks fail immediately.
- Increasing server startup and memory release timeouts to accommodate slower CI environments.
The changes are well-reasoned and significantly improve the reliability of the test suite. I have one suggestion to make the new process cleanup logic even more robust.
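The failure mode behind the first bullet can be sketched in isolation: if `__init__` raises, the `with` statement never receives an object, so `__exit__` cannot run and cleanup has to happen inside `__init__` itself. The class below is a hypothetical stand-in; `DemoServer`, `_wait_until_healthy`, and the sleeping child process are assumptions for illustration, not the code in `tests/utils.py`.

```python
import subprocess
import sys


class DemoServer:
    """Hypothetical stand-in for a test server wrapper (not the real class)."""

    def __init__(self, healthy: bool = True):
        # Start a long-lived child that stands in for the real server process.
        self.proc = subprocess.Popen(
            [sys.executable, "-c", "import time; time.sleep(60)"]
        )
        try:
            self._wait_until_healthy(healthy)
        except BaseException:
            # __exit__ never runs when __init__ raises, so clean up here.
            self._shutdown()
            raise

    def _wait_until_healthy(self, healthy: bool) -> None:
        # Placeholder for a real health check with a timeout.
        if not healthy:
            raise RuntimeError("server failed to start")

    def _shutdown(self) -> None:
        # Single cleanup path shared by __init__ (on failure) and __exit__.
        if self.proc.poll() is None:
            self.proc.terminate()
            try:
                self.proc.wait(timeout=10)
            except subprocess.TimeoutExpired:
                self.proc.kill()
                self.proc.wait()

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc, tb):
        self._shutdown()
```

Routing both paths through one `_shutdown` method keeps the failure path and the normal exit path from drifting apart.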
tests/utils.py
```python
# Field 5 (0-indexed 4) in /proc/<pid>/stat is the pgid.
fields = stat.split()
if int(fields[4]) == pgid:
    members.append(int(entry.name))
```
The current parsing of `/proc/<pid>/stat` using `stat.split()` is not robust against process names containing spaces. The process name (field 2) is enclosed in parentheses and can contain spaces. A plain `split()` then shifts every later field, so the pgid check reads the wrong field (or raises on a non-numeric one) and can miss group members, potentially leaving orphan processes running.

A more robust approach is to find the last closing parenthesis of the process name and split only the rest of the string. This extracts the correct fields regardless of the process name's content. This change also removes the now-incorrect comment about the field index.
```python
# The process name is in parentheses. Find the last ')'
# to locate the end of the process name.
last_paren_idx = stat.rfind(')')
if last_paren_idx == -1:
    continue
# The fields after the process name start after ') '.
# pgrp is the 3rd field after the process name (state, ppid, pgrp).
fields = stat[last_paren_idx + 2:].split()
if len(fields) > 2 and int(fields[2]) == pgid:
    members.append(int(entry.name))
```
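To make the difference concrete, here is a small self-contained sketch of the suggested parsing, run against a made-up stat line whose process name contains a space. The helper name `pgrp_from_stat` and the sample line are assumptions for illustration only.

```python
from typing import Optional


def pgrp_from_stat(stat: str) -> Optional[int]:
    """Return the process group ID from a /proc/<pid>/stat line, or None."""
    # The process name is wrapped in parentheses and may itself contain
    # spaces or ')' characters, so split only after the last ')'.
    last_paren_idx = stat.rfind(")")
    if last_paren_idx == -1:
        return None
    # Remaining fields start with: state, ppid, pgrp.
    fields = stat[last_paren_idx + 1:].split()
    if len(fields) < 3:
        return None
    return int(fields[2])


sample = "1234 (my worker) S 1 4321 4321 0 -1"
# A naive split lands on the ppid field because the name added one token.
print(sample.split()[4])       # 1
print(pgrp_from_stat(sample))  # 4321
```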
Done :) Replaced with os.getpgid which is cleaner too.
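The reply does not show the final code, so the following is only a guess at what an `os.getpgid`-based lookup might look like: list the numeric entries under `/proc` and ask the kernel for each PID's group instead of parsing `stat`. The function name `process_group_members` and the `proc_root` parameter are hypothetical.

```python
import os


def process_group_members(pgid: int, proc_root: str = "/proc") -> list[int]:
    """Return PIDs whose process group is `pgid` (assumes a Linux-style procfs)."""
    members: list[int] = []
    try:
        entries = os.listdir(proc_root)
    except OSError:
        # No procfs available; report no members rather than failing.
        return members
    for name in entries:
        if not name.isdigit():
            continue  # skip non-process entries such as "meminfo"
        pid = int(name)
        try:
            if os.getpgid(pid) == pgid:
                members.append(pid)
        except OSError:
            # The process exited between listdir() and getpgid(), or
            # the lookup was not permitted; either way, skip it.
            continue
    return members
```

Because the lookup is keyed on the process group rather than on parent-child links, it does not depend on the original parent still being alive.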
…init__ (vllm-project#37230) Signed-off-by: Andreas Karatzas <akaratza@amd.com>
When `RemoteOpenAIServer.__init__` raises (e.g. health check timeout), `__exit__` is never called by Python's `with` statement, leaking the server + EngineCore subprocesses and their GPU memory. Every subsequent test then OOMs.

- Extract cleanup into `_shutdown()`, called from both `__exit__` and `__init__`'s exception handler.
- Replace `_kill_orphaned_children` (psutil parent-child lookup, broken after parent is reaped) with `_kill_process_group_survivors` that scans `/proc/*/stat` for pgid members -- works even after the parent process is gone.
- Remove the stabilization fallback in `_wait_for_gpu_memory_release` that silently proceeded when GPU memory was still held. Now raises `RuntimeError` so the leaking test fails instead of poisoning all later tests.

Test plan
`pytest -s -v models/language/pooling -m 'not core_model'`

cc @kenroche
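The stricter memory-release behavior described in this PR can be sketched as a polling loop that raises on timeout instead of proceeding. This is a minimal sketch under stated assumptions: `wait_for_memory_release`, the injected `read_used_bytes` callable, and the threshold parameters are hypothetical and do not reflect the actual signature of `_wait_for_gpu_memory_release`.

```python
import time
from typing import Callable


def wait_for_memory_release(
    read_used_bytes: Callable[[], int],
    threshold_bytes: int,
    timeout_s: float = 60.0,
    poll_interval_s: float = 0.5,
) -> None:
    """Poll until usage drops to the threshold; raise if it never does."""
    deadline = time.monotonic() + timeout_s
    while True:
        used = read_used_bytes()
        if used <= threshold_bytes:
            return
        if time.monotonic() >= deadline:
            # No "usage looks stable, carry on" fallback: fail loudly so
            # the leaking test is the one that gets reported.
            raise RuntimeError(
                f"GPU memory not released after {timeout_s}s: "
                f"{used} bytes still in use (threshold {threshold_bytes})"
            )
        time.sleep(poll_interval_s)
```

Passing the reader in as a callable keeps the sketch independent of any particular GPU query library and makes the timeout path easy to exercise with a fake.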