Skip to content

[CI] Fix GPU memory leak when RemoteOpenAIServer fails to start in __init__#37230

Merged
vllm-bot merged 3 commits intovllm-project:mainfrom
ROCm:akaratza_remote_server_refactor
Mar 17, 2026
Merged

[CI] Fix GPU memory leak when RemoteOpenAIServer fails to start in __init__#37230
vllm-bot merged 3 commits intovllm-project:mainfrom
ROCm:akaratza_remote_server_refactor

Conversation

@AndreasKaratzas
Copy link
Collaborator

@AndreasKaratzas AndreasKaratzas commented Mar 16, 2026

  • When RemoteOpenAIServer.__init__ raises (e.g. health check timeout), __exit__ is never called by Python's with statement, leaking the server + EngineCore subprocesses and their GPU memory. Every subsequent test then OOMs.
  • Extracted cleanup into _shutdown() called from both __exit__ and __init__'s exception handler.
  • Replaced _kill_orphaned_children (psutil parent-child lookup, broken after parent is reaped) with _kill_process_group_survivors that scans /proc/*/stat for pgid members -- works even after the parent process is gone.
  • Removed the Phase 2 "stabilization" fallback in _wait_for_gpu_memory_release that silently proceeded when GPU memory was still held. Now raises RuntimeError so the leaking test fails instead of poisoning all later tests.
  • Increased server startup timeout from 360s to 480s (model load alone took ~350s on MI250 ROCm CI).
  • Increased memory release timeout from 60s to 120s for ROCm GPU driver reclaim latency.

Test plan

  • pytest -s -v models/language/pooling -m 'not core_model'

cc @kenroche

…init__

Signed-off-by: Andreas Karatzas <akaratza@amd.com>
Copy link
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces several improvements to the test server cleanup logic to prevent resource leaks, particularly GPU memory, when the server fails to start. The main changes include:

  1. Refactoring the cleanup logic into a _shutdown method, which is now correctly called from __init__ on startup failure.
  2. Replacing the psutil-based process cleanup with a more robust implementation that directly parses /proc to find and terminate processes in a process group.
  3. Making the GPU memory release check stricter by removing the "stabilization" fallback, ensuring that tests with memory leaks fail immediately.
  4. Increasing server startup and memory release timeouts to accommodate slower CI environments.

The changes are well-reasoned and significantly improve the reliability of the test suite. I have one suggestion to make the new process cleanup logic even more robust.

tests/utils.py Outdated
Comment on lines +365 to +368
# Field 5 (0-indexed 4) in /proc/<pid>/stat is the pgid.
fields = stat.split()
if int(fields[4]) == pgid:
members.append(int(entry.name))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

The current parsing of /proc/<pid>/stat using stat.split() is not robust against process names containing spaces. The process name (field 2) is enclosed in parentheses and can contain spaces. A simple split() will incorrectly parse the fields if the name has spaces, causing the check to fail and potentially leaving orphan processes running.

A more robust approach is to find the last parenthesis of the process name and split the rest of the string. This ensures the correct fields are extracted regardless of the process name's content. This change also removes the now-incorrect comment about the field index.

                # The process name is in parentheses. Find the last ')'
                # to locate the end of the process name.
                last_paren_idx = stat.rfind(')')
                if last_paren_idx == -1:
                    continue

                # The fields after the process name start after ') '.
                # pgrp is the 3rd field after the process name (state, ppid, pgrp).
                fields = stat[last_paren_idx + 2:].split()
                if len(fields) > 2 and int(fields[2]) == pgid:
                    members.append(int(entry.name))

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done :) Replaced with os.getpgid which is cleaner too.

…init__

Signed-off-by: Andreas Karatzas <akaratza@amd.com>
Signed-off-by: Andreas Karatzas <akaratza@amd.com>
@DarkLight1337 DarkLight1337 added ready ONLY add when PR is ready to merge/full CI is needed ready-run-all-tests Trigger CI with all tests for wide-ranging PRs labels Mar 17, 2026
@vllm-bot vllm-bot merged commit 4ed5130 into vllm-project:main Mar 17, 2026
164 of 167 checks passed
@AndreasKaratzas AndreasKaratzas deleted the akaratza_remote_server_refactor branch March 17, 2026 16:08
Lucaskabela pushed a commit to Lucaskabela/vllm that referenced this pull request Mar 17, 2026
…init__ (vllm-project#37230)

Signed-off-by: Andreas Karatzas <akaratza@amd.com>
wendyliu235 pushed a commit to wendyliu235/vllm-public that referenced this pull request Mar 18, 2026
…init__ (vllm-project#37230)

Signed-off-by: Andreas Karatzas <akaratza@amd.com>
fxdawnn pushed a commit to fxdawnn/vllm that referenced this pull request Mar 19, 2026
…init__ (vllm-project#37230)

Signed-off-by: Andreas Karatzas <akaratza@amd.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

ready ONLY add when PR is ready to merge/full CI is needed ready-run-all-tests Trigger CI with all tests for wide-ranging PRs

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants