
fix(proxy): auto-configure PROMETHEUS_MULTIPROC_DIR for multi-worker setups #20911

Open
jquinter wants to merge 2 commits into BerriAI:main from jquinter:fix/prometheus-multiprocess-auto-setup
Conversation

@jquinter
Contributor

Summary

When running the LiteLLM proxy with multiple uvicorn workers (--num_workers > 1) and Prometheus callbacks enabled, Prometheus metrics are silently lost because each worker process maintains its own metrics registry. This PR auto-detects this scenario in proxy_cli.py and creates a temporary shared directory for PROMETHEUS_MULTIPROC_DIR, enabling MultiProcessCollector (already supported in prometheus.py) to aggregate metrics across workers.

  • Auto-creates a temp directory and sets PROMETHEUS_MULTIPROC_DIR when num_workers > 1 and prometheus is in litellm_settings.callbacks
  • Registers an atexit handler to clean up the temp directory on shutdown
  • Respects any existing PROMETHEUS_MULTIPROC_DIR environment variable (does not overwrite)
  • Logs a green status message so operators know the feature is active
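The behaviour listed above can be sketched roughly as follows. This is a minimal illustration rather than the code in proxy_cli.py: the function name `maybe_setup_multiproc_dir`, the directory prefix, and the shape of the `litellm_settings` dict are assumptions made for the example, and it only looks at `callbacks`, as this version of the description does.

```python
import atexit
import os
import shutil
import tempfile

ENV_VAR = "PROMETHEUS_MULTIPROC_DIR"


def maybe_setup_multiproc_dir(num_workers, litellm_settings):
    """Hypothetical helper: return the directory created, or None if nothing was done."""
    callbacks = (litellm_settings or {}).get("callbacks") or []
    if isinstance(callbacks, str):
        callbacks = [callbacks]

    # Only relevant for multi-worker runs with the prometheus callback enabled.
    if num_workers <= 1 or "prometheus" not in callbacks:
        return None

    # Respect a directory the operator has already configured.
    if os.environ.get(ENV_VAR):
        return None

    # Create a shared directory and export it before any worker starts.
    path = tempfile.mkdtemp(prefix="litellm_prometheus_")  # prefix is an assumption
    os.environ[ENV_VAR] = path

    # Remove the directory when the parent process exits.
    atexit.register(shutil.rmtree, path, ignore_errors=True)
    return path
```

Calling `maybe_setup_multiproc_dir(4, {"callbacks": ["prometheus"]})` with the variable unset would create a directory and export it; calling it a second time would leave the existing value alone.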

Fixes #10595
Supersedes #11067 — reimplemented against current codebase. Full credit to @Penagwin for the original approach and thorough investigation.

Test plan

  • Added TestPrometheusMultiprocessSetup with 3 test cases:
    • test_prometheus_multiproc_dir_auto_created — verifies env var is set and directory exists when num_workers=4 with prometheus callback
    • test_prometheus_multiproc_dir_not_set_for_single_worker — verifies env var is NOT set for single worker
    • test_prometheus_multiproc_dir_respects_existing_env — verifies pre-existing env var is not overwritten
  • make test-unit passes

🤖 Generated with Claude Code

…setups

When running LiteLLM proxy with multiple uvicorn workers and Prometheus
callbacks enabled, automatically create and set PROMETHEUS_MULTIPROC_DIR
so metrics are correctly aggregated across all worker processes.

Fixes BerriAI#10595
Supersedes BerriAI#11067 — reimplemented against current codebase, crediting
original author @Penagwin for the approach.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
@vercel

vercel bot commented Feb 11, 2026

The latest updates on your projects. Learn more about Vercel for GitHub.

| Project | Deployment | Actions | Updated (UTC) |
| --- | --- | --- | --- |
| litellm | Ready | Preview, Comment | Feb 11, 2026 2:06am |


@greptile-apps
Contributor

greptile-apps bot commented Feb 11, 2026

Greptile Overview

Greptile Summary

Auto-configures PROMETHEUS_MULTIPROC_DIR when running the proxy with multiple uvicorn workers and Prometheus callbacks enabled, fixing silent metric loss in multi-worker deployments. Creates a temp directory, sets the env var before workers fork, and registers an atexit cleanup handler. Respects pre-existing env var values.

  • The detection only checks litellm_settings.callbacks but misses litellm_settings.success_callback, which is another valid way to configure Prometheus in the proxy config — users configuring prometheus via success_callback will still silently lose metrics in multi-worker mode.
  • Inline imports of atexit, shutil, tempfile diverge from the project's module-level import convention.
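The detection gap described in the first point could be closed with a check along these lines. This is a sketch under assumptions: `prometheus_configured` is a hypothetical name, and the settings are assumed to arrive as a plain dict whose callback entries are either a string or a list of strings.

```python
def prometheus_configured(litellm_settings):
    """Hypothetical check: is "prometheus" listed under either callback key?"""
    settings = litellm_settings or {}
    for key in ("callbacks", "success_callback"):
        value = settings.get(key) or []
        # Accept a single string as well as a list of names.
        if isinstance(value, str):
            value = [value]
        if "prometheus" in value:
            return True
    return False
```

With this shape, `{"success_callback": ["prometheus"]}` and `{"callbacks": ["prometheus"]}` are treated the same way, which is the behaviour the review asks for.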

Confidence Score: 3/5

  • Low-risk change scoped to CLI startup, but has a gap in prometheus detection that could leave some users unprotected.
  • The core logic is sound and well-tested, but it only detects prometheus via callbacks and misses the success_callback configuration path, which means multi-worker users configuring prometheus that way will still lose metrics silently. The change is additive and won't break existing behavior.
  • litellm/proxy/proxy_cli.py — the callback detection logic should also check success_callback.

Important Files Changed

| Filename | Overview |
| --- | --- |
| litellm/proxy/proxy_cli.py | Adds auto-configuration of PROMETHEUS_MULTIPROC_DIR for multi-worker setups. Logic is correct but only checks callbacks, missing success_callback as another valid configuration path for prometheus. |
| tests/test_litellm/proxy/test_proxy_cli.py | Adds 3 well-structured mock-only tests covering auto-creation, single-worker skip, and existing env var preservation. Proper cleanup in finally blocks. No real network calls. |

Contributor

@greptile-apps greptile-apps bot left a comment

2 files reviewed, 2 comments


…to top

- Also check `litellm_settings.success_callback` for prometheus (not just `callbacks`)
- Move atexit, shutil, tempfile imports to module-level per CLAUDE.md style guide
- Add test for prometheus detection via success_callback

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
@jquinter
Contributor Author

@greptile-apps wake up dude! can you re-review this?

@greptile-apps
Contributor

greptile-apps bot commented Feb 11, 2026

Greptile Overview

Greptile Summary

Auto-configures PROMETHEUS_MULTIPROC_DIR when running the proxy with multiple workers (num_workers > 1) and Prometheus callbacks enabled. This fixes silent metric loss in multi-worker setups by creating a temp directory for prometheus_client's MultiProcessCollector (already supported in prometheus.py). The env var is set before uvicorn.run() forks workers, and an atexit handler cleans up the directory on shutdown.
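Why the ordering matters can be shown with the standard library alone: a child process only sees an environment variable that was already set in the parent when the child was started. In the sketch below, a child Python interpreter stands in for a uvicorn worker; the helper names are hypothetical and no LiteLLM or prometheus_client code is involved.

```python
import os
import subprocess
import sys
import tempfile

ENV_VAR = "PROMETHEUS_MULTIPROC_DIR"
CHILD_CODE = "import os; print(os.environ.get('PROMETHEUS_MULTIPROC_DIR', ''))"


def read_var_in_child():
    """Start a child interpreter (a stand-in for a worker) and return the value it sees."""
    result = subprocess.run(
        [sys.executable, "-c", CHILD_CODE],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip()


def demo():
    """Return (value seen before setting, value seen after setting, directory)."""
    saved = os.environ.pop(ENV_VAR, None)
    try:
        before = read_var_in_child()  # variable not set yet in the parent
        with tempfile.TemporaryDirectory(prefix="prom_multiproc_") as directory:
            os.environ[ENV_VAR] = directory
            after = read_var_in_child()  # set before the child was started
            return before, after, directory
    finally:
        # Put the parent's environment back the way it was.
        os.environ.pop(ENV_VAR, None)
        if saved is not None:
            os.environ[ENV_VAR] = saved
```

In `demo()`, the first child sees an empty value and the second sees the directory, which is the reason the variable has to be exported before the workers are launched.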

  • Auto-detects Prometheus in both litellm_settings.callbacks and litellm_settings.success_callback
  • Respects pre-existing PROMETHEUS_MULTIPROC_DIR values
  • Adds 4 mock-only unit tests with proper env var save/restore
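One common way to get env var save/restore in tests like these is `unittest.mock.patch.dict`, which restores `os.environ` when the block exits. The sketch below illustrates that pattern only; it is not the PR's test code (the earlier review describes cleanup in finally blocks), and `set_if_unset` is a hypothetical stand-in for the startup logic.

```python
import os
from unittest import mock

ENV_VAR = "PROMETHEUS_MULTIPROC_DIR"


def set_if_unset(path):
    """Hypothetical stand-in: export `path` only when no value is configured."""
    os.environ.setdefault(ENV_VAR, path)
    return os.environ[ENV_VAR]


def test_respects_existing_env():
    # A pre-existing value must survive untouched.
    with mock.patch.dict(os.environ, {ENV_VAR: "/custom/prom_dir"}):
        assert set_if_unset("/tmp/auto") == "/custom/prom_dir"


def test_env_restored_after_test():
    before = os.environ.get(ENV_VAR)
    with mock.patch.dict(os.environ, {}):
        os.environ.pop(ENV_VAR, None)
        assert set_if_unset("/tmp/auto") == "/tmp/auto"
    # patch.dict puts os.environ back to its state before the block.
    assert os.environ.get(ENV_VAR) == before
```

The paths used here are placeholders; nothing is created on disk.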

Confidence Score: 4/5

  • This PR is safe to merge — it only activates when multi-worker + prometheus conditions are met, and gracefully handles existing configurations.
  • The changes are well-scoped and address a real issue (silent metric loss in multi-worker setups). The env var is set at the correct point in the lifecycle (before workers fork), the existing prometheus.py code already supports MultiProcessCollector, and the tests are thorough. All prior review feedback has been addressed (imports at top, success_callback check). Minor deduction because the feature only works when a config file is provided (not when callbacks are set via env vars alone), but that's a narrow edge case.
  • No files require special attention.

Important Files Changed

| Filename | Overview |
| --- | --- |
| litellm/proxy/proxy_cli.py | Added auto-configuration of PROMETHEUS_MULTIPROC_DIR for multi-worker setups. Checks both callbacks and success_callback, creates a temp directory, sets the env var before workers fork, and registers an atexit cleanup handler. Imports moved to top of file per style guide. |
| tests/test_litellm/proxy/test_proxy_cli.py | Added 4 well-structured mock-only tests covering: auto-creation with callbacks, single-worker no-op, existing env var preservation, and auto-creation with success_callback. Proper env var save/restore in all test cases. |

Contributor

@greptile-apps greptile-apps bot left a comment

2 files reviewed, no comments


Development

Successfully merging this pull request may close these issues.

[Bug]: Prometheus metrics aren't shared across Uvicorn workers