Add Prometheus child_exit cleanup for gunicorn workers#22324

Merged
ryan-crabbe merged 1 commit into main from litellm_prometheus_child_exit_cleanup on Feb 28, 2026

Conversation

@ryan-crabbe
Collaborator

@ryan-crabbe ryan-crabbe commented Feb 28, 2026

Relevant issues

Pre-Submission checklist

Please complete all items before asking a LiteLLM maintainer to review your PR

  • I have added testing in the tests/litellm/ directory (adding at least 1 test is a hard requirement; see details)
  • My PR passes all unit tests on make test-unit
  • My PR's scope is as isolated as possible, it only solves 1 specific problem
  • I have requested a Greptile review by commenting @greptileai and received a Confidence Score of at least 4/5 before requesting a maintainer review

CI (LiteLLM team)

CI status guideline:

  • 50-55 passing tests: main is stable with minor issues.
  • 45-49 passing tests: acceptable but needs attention
  • <= 40 passing tests: unstable; be careful with your merges and assess the risk.
  • Branch creation CI run
    Link:

  • CI run for the last commit
    Link:

  • Merge / cherry-pick CI run
    Links:

Type

Changes

For deployments running gunicorn, we hook into the child_exit event that fires when a worker process dies and use prometheus_client's mark_process_dead to clean up that worker's stale live-tracking files.

When a gunicorn worker exits (e.g. from max_requests recycling), its
per-process prometheus .db files remain on disk. For gauges using
livesum/liveall mode, this means the dead worker's last-known values
persist as if the process were still alive. Wire gunicorn's child_exit
hook to call mark_process_dead() so live-tracking gauges accurately
reflect only running workers.
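
The cleanup described above can be sketched roughly as follows. This is a minimal illustration, not the merged code: the function name mark_worker_exit follows the PR, while the boolean return value and the logger name are assumptions added here so the behavior is easy to check. The import is kept inside the function because prometheus_client is an optional dependency.

```python
import logging
import os

# Logger name is illustrative only.
logger = logging.getLogger("prometheus_cleanup_sketch")


def mark_worker_exit(worker_pid: int) -> bool:
    """Remove live-gauge files left behind by a dead worker.

    Returns True when cleanup ran, False when it was skipped or failed.
    (The return value is an assumption of this sketch.)
    """
    # Do nothing unless Prometheus multiprocess mode is configured.
    if not os.environ.get("PROMETHEUS_MULTIPROC_DIR"):
        return False
    try:
        # Imported lazily so the module loads without prometheus_client.
        from prometheus_client import multiprocess

        multiprocess.mark_process_dead(worker_pid)
        return True
    except Exception as exc:
        # A cleanup failure must never take down the gunicorn master.
        logger.warning("Prometheus cleanup failed for pid %s: %s", worker_pid, exc)
        return False
```

Swallowing the exception is a deliberate choice in this sketch: the hook runs in the gunicorn master process, so a failed cleanup should be logged rather than raised.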
@vercel

vercel bot commented Feb 28, 2026

The latest updates on your projects. Learn more about Vercel for GitHub.

Project: litellm
Deployment: Ready
Actions: Preview, Comment
Updated (UTC): Feb 28, 2026 0:14am

@greptile-apps
Contributor

greptile-apps bot commented Feb 28, 2026

Greptile Summary

This PR adds automatic Prometheus metric cleanup when gunicorn worker processes exit, preventing "ghost" gauge values from dead workers. It hooks into gunicorn's child_exit lifecycle event to call prometheus_client.multiprocess.mark_process_dead(), which removes the stale live-gauge .db files (livesum/liveall mode) for the dead worker PID.

  • Adds mark_worker_exit() in prometheus_cleanup.py that wraps prometheus_client.multiprocess.mark_process_dead() with env var guard and error handling
  • Registers a gunicorn child_exit hook in proxy_cli.py when PROMETHEUS_MULTIPROC_DIR is configured
  • Adds 3 unit tests covering the happy path, noop when env is unset, and exception handling
  • The change is well-scoped and only affects gunicorn deployments with Prometheus multiprocess mode enabled
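
The conditional hook registration summarized in the list above might look like the sketch below. The helper names make_child_exit and with_prometheus_cleanup, and the idea of passing the cleanup function in as a parameter, are hypothetical choices made to keep the example self-contained; they are not taken from proxy_cli.py. Only the hook signature child_exit(server, worker) and the PROMETHEUS_MULTIPROC_DIR check come from the PR.

```python
import os
from typing import Any, Callable, Dict


def make_child_exit(cleanup: Callable[[int], Any]) -> Callable[[Any, Any], None]:
    """Build a gunicorn-style child_exit(server, worker) hook."""

    def child_exit(server: Any, worker: Any) -> None:
        # Gunicorn calls this in the master process after a worker has exited.
        cleanup(worker.pid)

    return child_exit


def with_prometheus_cleanup(
    options: Dict[str, Any], cleanup: Callable[[int], Any]
) -> Dict[str, Any]:
    """Return a copy of the gunicorn options, adding the hook only when
    Prometheus multiprocess mode is configured."""
    merged = dict(options)
    if os.environ.get("PROMETHEUS_MULTIPROC_DIR"):
        merged["child_exit"] = make_child_exit(cleanup)
    return merged
```

With a custom gunicorn application, an options dict like this is typically copied into the config inside load_config via self.cfg.set(key, value); whether proxy_cli.py wires it exactly this way is an assumption.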

Confidence Score: 4/5

  • This PR is safe to merge — it only activates when PROMETHEUS_MULTIPROC_DIR is set and gunicorn is used, with proper error handling.
  • The change is small, well-tested, and narrowly scoped. It uses the official prometheus_client API for cleanup and only activates conditionally. The gunicorn child_exit hook is the correct integration point. Tests cover the key paths. Minor style note on inline imports but justified for optional dependency.
  • No files require special attention. All changes are straightforward and well-contained.

Important Files Changed

Filename and overview:

  • litellm/proxy/prometheus_cleanup.py: Adds mark_worker_exit() function that calls prometheus_client.multiprocess.mark_process_dead() for a dead worker PID, with proper env var guard and exception handling. Clean and correct.
  • litellm/proxy/proxy_cli.py: Registers gunicorn child_exit hook conditionally when PROMETHEUS_MULTIPROC_DIR is set. Correct integration point, though import is inline rather than module-level.
  • tests/test_litellm/proxy/test_prometheus_cleanup.py: Adds 3 well-structured mock-only tests for mark_worker_exit(): positive case, env-not-set noop, and exception handling. No real network calls.

Sequence Diagram

sequenceDiagram
    participant G as Gunicorn Master
    participant W as Worker Process
    participant PC as proxy_cli.py
    participant PM as prometheus_cleanup.py
    participant PL as prometheus_client

    G->>W: Worker starts
    W->>W: Serves requests (writes .db files)
    W-->>G: Worker exits (crash/recycle)
    G->>PC: child_exit(server, worker)
    PC->>PM: mark_worker_exit(worker.pid)
    PM->>PM: Check PROMETHEUS_MULTIPROC_DIR env
    PM->>PL: multiprocess.mark_process_dead(pid)
    PL->>PL: Remove stale .db files for PID
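
To make the last step of the diagram concrete, here is a small stdlib-only simulation of the on-disk effect. The function remove_live_gauge_files is a simplified stand-in written for this example, not the prometheus_client implementation, and it covers only the two live modes named in the PR description. The file names assume the library's per-process naming pattern of metric type, gauge mode, and PID.

```python
import glob
import os
import tempfile
from typing import List, Tuple

# The two live gauge modes mentioned in the PR description.
LIVE_MODES = ("livesum", "liveall")


def remove_live_gauge_files(pid: int, path: str) -> List[str]:
    """Stand-in for the cleanup: delete live-gauge files belonging to one PID."""
    removed = []
    for mode in LIVE_MODES:
        for f in glob.glob(os.path.join(path, f"gauge_{mode}_{pid}.db")):
            os.remove(f)
            removed.append(os.path.basename(f))
    return sorted(removed)


def demo() -> Tuple[List[str], List[str]]:
    """Create fake metric files for two workers, then clean up worker 101."""
    with tempfile.TemporaryDirectory() as multiproc_dir:
        for name in (
            "gauge_livesum_101.db",
            "gauge_liveall_101.db",
            "counter_101.db",        # not a live gauge, so it is kept
            "gauge_livesum_202.db",  # belongs to a worker that is still running
        ):
            open(os.path.join(multiproc_dir, name), "w").close()
        removed = remove_live_gauge_files(101, multiproc_dir)
        remaining = sorted(os.listdir(multiproc_dir))
    return removed, remaining
```

In this simulation the dead worker's counter file is left in place, so totals accumulated by that worker still count after it exits; only the gauges meant to reflect live processes are dropped.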

Last reviewed commit: 4fa6742

Contributor

@greptile-apps greptile-apps bot left a comment

3 files reviewed, no comments

@ryan-crabbe ryan-crabbe merged commit e445a32 into main Feb 28, 2026
77 of 94 checks passed
@ryan-crabbe ryan-crabbe deleted the litellm_prometheus_child_exit_cleanup branch February 28, 2026 00:51