[CI] Fix GPU memory leak when RemoteOpenAIServer fails to start in __init__ by AndreasKaratzas · Pull Request #37230 · vllm-project/vllm

AndreasKaratzas · 2026-03-16T21:38:50Z

When RemoteOpenAIServer.__init__ raises (e.g. health check timeout), __exit__ is never called by Python's with statement, leaking the server + EngineCore subprocesses and their GPU memory. Every subsequent test then OOMs.
Extracted cleanup into _shutdown() called from both __exit__ and __init__'s exception handler.
Replaced _kill_orphaned_children (psutil parent-child lookup, broken after parent is reaped) with _kill_process_group_survivors that scans /proc/*/stat for pgid members -- works even after the parent process is gone.
Removed the Phase 2 "stabilization" fallback in _wait_for_gpu_memory_release that silently proceeded when GPU memory was still held. Now raises RuntimeError so the leaking test fails instead of poisoning all later tests.
Increased server startup timeout from 360s to 480s (model load alone took ~350s on MI250 ROCm CI).
Increased memory release timeout from 60s to 120s for ROCm GPU driver reclaim latency.

Test plan

pytest -s -v models/language/pooling -m 'not core_model'

…init__ Signed-off-by: Andreas Karatzas <akaratza@amd.com>

gemini-code-assist

Code Review

This pull request introduces several improvements to the test server cleanup logic to prevent resource leaks, particularly GPU memory, when the server fails to start. The main changes include:

Refactoring the cleanup logic into a _shutdown method, which is now correctly called from __init__ on startup failure.
Replacing the psutil-based process cleanup with a more robust implementation that directly parses /proc to find and terminate processes in a process group.
Making the GPU memory release check stricter by removing the "stabilization" fallback, ensuring that tests with memory leaks fail immediately.
Increasing server startup and memory release timeouts to accommodate slower CI environments.

The changes are well-reasoned and significantly improve the reliability of the test suite. I have one suggestion to make the new process cleanup logic even more robust.

gemini-code-assist · 2026-03-16T21:43:58Z

tests/utils.py

+                # Field 5 (0-indexed 4) in /proc/<pid>/stat is the pgid.
+                fields = stat.split()
+                if int(fields[4]) == pgid:
+                    members.append(int(entry.name))


The current parsing of /proc/<pid>/stat using stat.split() is not robust against process names containing spaces. The process name (field 2) is enclosed in parentheses and can contain spaces. A simple split() will incorrectly parse the fields if the name has spaces, causing the check to fail and potentially leaving orphan processes running.

A more robust approach is to find the last parenthesis of the process name and split the rest of the string. This ensures the correct fields are extracted regardless of the process name's content. This change also removes the now-incorrect comment about the field index.

# The process name is in parentheses. Find the last ')' # to locate the end of the process name. last_paren_idx = stat.rfind(')') if last_paren_idx == -1: continue # The fields after the process name start after ') '. # pgrp is the 3rd field after the process name (state, ppid, pgrp). fields = stat[last_paren_idx + 2:].split() if len(fields) > 2 and int(fields[2]) == pgid: members.append(int(entry.name))

Done :) Replaced with os.getpgid which is cleaner too.

…init__ Signed-off-by: Andreas Karatzas <akaratza@amd.com>

tests/utils.py

Signed-off-by: Andreas Karatzas <akaratza@amd.com>

…init__ (vllm-project#37230) Signed-off-by: Andreas Karatzas <akaratza@amd.com>

[CI] Fix GPU memory leak when RemoteOpenAIServer fails to start in __…

3e37fb7

…init__ Signed-off-by: Andreas Karatzas <akaratza@amd.com>

AndreasKaratzas mentioned this pull request Mar 16, 2026

[ROCm][CI] Revamping AMD mirrors #35897

Open

gemini-code-assist bot reviewed Mar 16, 2026

View reviewed changes

[CI] Fix GPU memory leak when RemoteOpenAIServer fails to start in __…

7ee6467

…init__ Signed-off-by: Andreas Karatzas <akaratza@amd.com>

DarkLight1337 reviewed Mar 17, 2026

View reviewed changes

tests/utils.py Outdated Show resolved Hide resolved

DarkLight1337 reviewed Mar 17, 2026

View reviewed changes

tests/utils.py Outdated Show resolved Hide resolved

[CI] Transposing magic numbers to named parameters

4339b0e

Signed-off-by: Andreas Karatzas <akaratza@amd.com>

DarkLight1337 approved these changes Mar 17, 2026

View reviewed changes

DarkLight1337 added ready ONLY add when PR is ready to merge/full CI is needed ready-run-all-tests Trigger CI with all tests for wide-ranging PRs labels Mar 17, 2026

vllm-bot merged commit 4ed5130 into vllm-project:main Mar 17, 2026
164 of 167 checks passed

AndreasKaratzas deleted the akaratza_remote_server_refactor branch March 17, 2026 16:08

Lucaskabela pushed a commit to Lucaskabela/vllm that referenced this pull request Mar 17, 2026

[CI] Fix GPU memory leak when RemoteOpenAIServer fails to start in __…

ac3c6e9

…init__ (vllm-project#37230) Signed-off-by: Andreas Karatzas <akaratza@amd.com>

wendyliu235 pushed a commit to wendyliu235/vllm-public that referenced this pull request Mar 18, 2026

[CI] Fix GPU memory leak when RemoteOpenAIServer fails to start in __…

e64ec18

…init__ (vllm-project#37230) Signed-off-by: Andreas Karatzas <akaratza@amd.com>

fxdawnn pushed a commit to fxdawnn/vllm that referenced this pull request Mar 19, 2026

[CI] Fix GPU memory leak when RemoteOpenAIServer fails to start in __…

c642273

…init__ (vllm-project#37230) Signed-off-by: Andreas Karatzas <akaratza@amd.com>

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

[CI] Fix GPU memory leak when RemoteOpenAIServer fails to start in init#37230

[CI] Fix GPU memory leak when RemoteOpenAIServer fails to start in init#37230
vllm-bot merged 3 commits intovllm-project:mainfrom
ROCm:akaratza_remote_server_refactor

AndreasKaratzas commented Mar 16, 2026 •

edited by github-actions bot

Loading

Uh oh!

gemini-code-assist bot left a comment

Uh oh!

gemini-code-assist bot Mar 16, 2026

Uh oh!

AndreasKaratzas Mar 16, 2026

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Uh oh!

Conversation

AndreasKaratzas commented Mar 16, 2026 • edited by github-actions bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Test plan

Uh oh!

gemini-code-assist bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

gemini-code-assist bot Mar 16, 2026

Choose a reason for hiding this comment

Uh oh!

AndreasKaratzas Mar 16, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

AndreasKaratzas commented Mar 16, 2026 •

edited by github-actions bot

Loading