
[Fix] Remove dangerous schema diff force-apply from Prisma migration startup#23464

Draft
yuneng-jiang wants to merge 12 commits into litellm_internal_dev_03_12_2026 from litellm_fix_prisma_migrations

Conversation

@yuneng-jiang
Contributor

Relevant issues

Reported by a user who experienced schema thrashing during rolling deployments due to the migration system's diff-and-apply mechanism.

Summary

Failure Path (Before Fix)

_resolve_all_migrations was called after every successful prisma migrate deploy as a "sanity check." This method:

  1. Generated a diff between the DB and schema.prisma
  2. Force-applied that diff via prisma db execute, bypassing Prisma's migration tracking

During rolling deployments, if version A and version B had different schema.prisma files, each startup would force-push its own schema over the other's, causing a thrashing loop. Additionally, non-idempotent P3009 migration failures were silently retried instead of failing fast.
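The thrashing failure mode can be sketched with a toy simulation (hypothetical code, not from this repo): each pod's startup force-applies its own schema, so alternating restarts flip the database between two states instead of converging.

```python
# Hypothetical illustration of the rolling-deployment thrashing loop:
# each pod version force-applies its own schema.prisma on startup,
# so the database never converges while both versions are running.

def force_apply(db_state: dict, desired_schema: dict) -> dict:
    """Mimics diff-and-apply: make the DB match this pod's schema exactly."""
    return dict(desired_schema)

schema_a = {"LiteLLM_MCPServerTable": ["id", "approval_status"]}  # version A
schema_b = {"LiteLLM_MCPServerTable": ["id"]}                     # version B

db = dict(schema_a)
history = []
for startup_schema in [schema_a, schema_b, schema_a, schema_b]:  # alternating pod restarts
    db = force_apply(db, startup_schema)
    history.append(db["LiteLLM_MCPServerTable"])

# The column set flip-flops on every startup instead of settling.
print(history)
```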

Fix

  • Removed _resolve_all_migrations and the post-deploy "sanity check" call — if prisma migrate deploy succeeds, the DB is already correct
  • Replaced with _mark_all_migrations_applied which only updates the _prisma_migrations tracking table without generating or applying diffs
  • P3005 (non-empty DB, first migration) now uses _mark_all_migrations_applied + prisma migrate deploy instead of the diff mechanism
  • P3009 non-idempotent errors now raise RuntimeError with actionable instructions instead of silently retrying
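As a rough sketch of the new marking behavior (assumptions: the method globs migrations_dir/migrations/*/migration.sql and runs prisma migrate resolve --applied per folder; the resolver is stubbed here, and all names are illustrative):

```python
import pathlib
import tempfile

def mark_all_migrations_applied(migrations_dir: str, resolve_applied) -> list:
    """Sketch: mark every migration folder as applied without running its SQL.
    `resolve_applied` stands in for `prisma migrate resolve --applied <name>`."""
    marked = []
    for sql in sorted(pathlib.Path(migrations_dir, "migrations").glob("*/migration.sql")):
        name = sql.parent.name
        try:
            resolve_applied(name)
        except RuntimeError as exc:
            if "already recorded as applied" in str(exc):
                continue  # skip migrations Prisma already tracks as applied
            raise
        marked.append(name)
    return marked

# Usage with a fake resolver in a temp directory; one migration is
# already tracked as applied and is skipped:
def fake_resolver(name):
    if name == "20240101_init":
        raise RuntimeError(f"{name} is already recorded as applied in the database.")

with tempfile.TemporaryDirectory() as d:
    for name in ["20240101_init", "20240202_add_col"]:
        folder = pathlib.Path(d, "migrations", name)
        folder.mkdir(parents=True)
        (folder / "migration.sql").write_text("-- ddl here")
    seen = mark_all_migrations_applied(d, fake_resolver)

print(seen)
```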

Testing

  • All 24 unit tests pass (tests/litellm-proxy-extras/test_litellm_proxy_extras_utils.py)
  • Added 5 new tests:
    • TestMarkAllMigrationsApplied: verifies the new method marks migrations without generating/applying diffs, and skips already-applied migrations
    • TestSetupDatabaseFailFast: verifies P3009 non-idempotent errors raise RuntimeError, and successful deploys don't call diff/resolve logic
  • Audited all 110 existing migration SQL files — all contain complete, self-contained DDL and do not depend on the removed diff mechanism
  • Verified no CI tests reference _resolve_all_migrations or depend on the removed behavior

Type

🐛 Bug Fix
✅ Test

The `_resolve_all_migrations` method was generating a diff between the DB
and schema.prisma and force-applying it via `prisma db execute`, bypassing
migration tracking entirely. This caused schema thrashing during rolling
deployments when two versions with different schemas were running.

- Remove `_resolve_all_migrations` and its post-deploy "sanity check" call
- Replace with `_mark_all_migrations_applied` (marks only, no diff/apply)
- P3005 handler now uses `_mark_all_migrations_applied` + `prisma migrate deploy`
- P3009 non-idempotent errors now fail fast with actionable error messages

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@vercel

vercel bot commented Mar 12, 2026

The latest updates on your projects:

Project: litellm | Deployment: Error | Updated (UTC): Mar 12, 2026 7:26pm

@yuneng-jiang yuneng-jiang changed the base branch from main to litellm_internal_dev_03_12_2026 March 12, 2026 18:04
@greptile-apps
Contributor

greptile-apps bot commented Mar 12, 2026

Greptile Summary

This PR addresses a real and critical rolling-deployment bug where _resolve_all_migrations was force-applying schema diffs on every startup, causing thrashing between pods running different schema versions. The fix removes that mechanism and replaces it with a cleaner _deploy_with_idempotent_resolution loop that handles P3009/P3018 Prisma errors natively. The new approach is architecturally sound for the normal deploy path.

Key changes and remaining concerns:

  • Core fix is correct: Removing _resolve_all_migrations eliminates the schema-thrashing root cause. If prisma migrate deploy succeeds, no post-deploy diff is needed.
  • _deploy_with_idempotent_resolution properly loops through multiple idempotent failures and raises RuntimeError for non-recoverable ones — this is a significant improvement over the previous silent-retry pattern.
  • Deleted migration file (20260309115809_add_missing_indexes): Removing this file without a corresponding DROP INDEX migration leaves LiteLLM_VerificationToken_key_alias_idx and LiteLLM_SpendLogs_user_startTime_idx orphaned in any database that already applied this migration. Schema drift will be permanent unless manually resolved. Additionally, new deployments will silently lose the index-backed query performance those entries were providing.
  • P3005 baseline path (_mark_all_migrations_applied then return True): Multiple prior review threads have detailed that this path marks all migrations — including newly-shipped ones — as applied without executing their SQL, and then returns success without re-running deploy. This remains an open concern.
  • Tests: New tests for _deploy_with_idempotent_resolution and _mark_all_migrations_applied are a good addition, but several assertions against the exact Prisma command array do not mock _get_prisma_command(), making them sensitive to the PRISMA_OFFLINE_MODE environment variable.
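A minimal illustration of the environment-sensitivity concern (all names are hypothetical stand-ins for _get_prisma_command and the deploy call):

```python
import os

def get_prisma_command() -> str:
    # Stand-in for _get_prisma_command(): env-dependent binary choice.
    return "prisma-offline" if os.environ.get("PRISMA_OFFLINE_MODE") else "prisma"

def build_deploy_argv(command_lookup=get_prisma_command) -> list:
    return [command_lookup(), "migrate", "deploy"]

# Without pinning the lookup, the asserted argv depends on the environment:
env_dependent = build_deploy_argv()

# Pinning the lookup (what mocking _get_prisma_command achieves in a test)
# makes the assertion deterministic regardless of PRISMA_OFFLINE_MODE:
pinned = build_deploy_argv(lambda: "prisma")

print(pinned)
```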

Confidence Score: 2/5

  • Not safe to merge without resolving the deleted migration file issue and the open P3005 baseline path concerns from prior threads
  • The core rolling-deployment fix is correct and well-tested. However, deleting 20260309115809_add_missing_indexes without a DROP INDEX migration creates permanent schema drift and potential index loss for any environment that already applied it. Additionally, multiple substantive concerns about the P3005 path (new migrations silently skipped) and the idempotent resolution re-deploy gap remain open from prior review threads and have not been addressed in this revision.
  • Pay close attention to litellm-proxy-extras/litellm_proxy_extras/migrations/20260309115809_add_missing_indexes/migration.sql (deleted without drop migration) and litellm-proxy-extras/litellm_proxy_extras/utils.py lines 443–460 (P3005 baseline path).

Important Files Changed

Filename Overview
litellm-proxy-extras/litellm_proxy_extras/utils.py Core migration logic refactored to remove dangerous diff-and-apply mechanism; adds _deploy_with_idempotent_resolution and _mark_all_migrations_applied; multiple logic concerns in P3005 baseline path and error handling already flagged in prior threads
litellm-proxy-extras/litellm_proxy_extras/migrations/20260311180521_schema_sync/migration.sql New migration drops six columns and one index from LiteLLM_MCPServerTable — destructive DDL already flagged as unsafe for rolling deployments in prior threads
litellm-proxy-extras/litellm_proxy_extras/migrations/20260309115809_add_missing_indexes/migration.sql Migration file deleted without a corresponding DROP INDEX migration; any database that already applied this migration will have orphaned indexes that are no longer tracked by the Prisma schema
litellm-proxy-extras/litellm_proxy_extras/schema.prisma Removes @@index([key_alias]) from LiteLLM_VerificationToken and @@index([user, startTime]) from LiteLLM_SpendLogs without any DROP INDEX migration, creating schema drift for existing deployments and silent performance regression for new ones
tests/litellm-proxy-extras/test_litellm_proxy_extras_utils.py Adds meaningful new tests for _mark_all_migrations_applied and _deploy_with_idempotent_resolution; assertions against exact Prisma command arrays are environment-sensitive because _get_prisma_command() is not mocked

Flowchart

```mermaid
%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[setup_database called\nuse_migrate=True] --> B[_deploy_with_idempotent_resolution]
    B --> C{prisma migrate deploy}
    C -->|Success| D[return True ✅]
    C -->|CalledProcessError| E{P3009 or P3018?}
    E -->|No| F[raise CalledProcessError]
    E -->|Yes| G[_resolve_failed_migration]
    G --> H{idempotent error?}
    H -->|No - non-idempotent| I[_roll_back_migration\nthen raise RuntimeError ❌]
    H -->|Yes - idempotent| J[_roll_back_migration\n+ _resolve_specific_migration]
    J --> K[continue loop\nRe-deploy]
    K --> C
    B -->|max_resolutions exceeded| L[raise RuntimeError\nmax iterations ❌]

    C -->|CalledProcessError\nP3005 - non-empty DB| M[_create_baseline_migration]
    M --> N[_mark_all_migrations_applied\nmarks ALL migrations applied\nwithout running SQL]
    N --> O[return True ⚠️\nNew migrations not deployed]

    style D fill:#90EE90
    style I fill:#FFB6C1
    style L fill:#FFB6C1
    style O fill:#FFD700
```

Comments Outside Diff (1)

  1. litellm-proxy-extras/litellm_proxy_extras/migrations/20260309115809_add_missing_indexes/migration.sql

    Deleted migration file causes orphaned indexes and schema drift

    This migration file created LiteLLM_VerificationToken_key_alias_idx and LiteLLM_SpendLogs_user_startTime_idx via CREATE INDEX CONCURRENTLY. Deleting the file without adding a corresponding DROP INDEX migration in its place leaves two problems:

    1. Schema drift for existing deployments: Any database that already applied 20260309115809_add_missing_indexes (e.g. a staging or CI environment on the litellm_internal_dev_03_12_2026 branch) will have these indexes present in the database. Because schema.prisma no longer declares @@index([key_alias]) or @@index([user, startTime]), Prisma will permanently consider the DB to be drifted, and prisma migrate diff/prisma migrate status will flag it as out of sync.

    2. Silent performance regression for new deployments: New databases will never get these indexes. The comments in the deleted schema entries indicate they were specifically added to accelerate ORDER BY key_alias queries on LiteLLM_VerificationToken and time-bounded spend queries on LiteLLM_SpendLogs. Removing them without documentation will silently degrade those query paths.

    The standard Prisma approach is to add a new migration that drops the indexes rather than delete the migration that created them:

```sql
-- DropIndex
DROP INDEX CONCURRENTLY IF EXISTS "LiteLLM_VerificationToken_key_alias_idx";

-- DropIndex
DROP INDEX CONCURRENTLY IF EXISTS "LiteLLM_SpendLogs_user_startTime_idx";
```

    If no environment has ever run 20260309115809_add_missing_indexes (i.e. it was only ever on an internal dev branch), please add a comment confirming this so reviewers can verify the deletion is safe.

Last reviewed commit: b4ee36c

Comment on lines +357 to +363
```python
raise RuntimeError(
    f"Migration {failed_migration} failed and requires manual intervention. "
    f"Please inspect the migration and database state, then either:\n"
    f" - Fix the issue and restart, or\n"
    f" - Run: prisma migrate resolve --applied {failed_migration}\n"
    f"Original error: {e.stderr}"
) from e
```

Misleading remediation instruction in error message

The RuntimeError message suggests running prisma migrate resolve --applied {failed_migration} as an alternative to fixing and restarting. However, --applied marks the migration as applied in Prisma's tracking table without actually executing the migration SQL, which can leave the database schema in an inconsistent state if the migration contained important DDL changes.

A user following this instruction after a partial migration failure would silently skip the migration entirely. A clearer message would clarify that --applied should only be used when the changes were already applied to the database by other means (e.g., run manually):

```python
raise RuntimeError(
    f"Migration {failed_migration} failed and requires manual intervention. "
    f"The migration has been marked as rolled-back. Please either:\n"
    f"  1. Fix the migration SQL and restart to re-run it, or\n"
    f"  2. Manually apply the changes to the DB, then run:\n"
    f"       prisma migrate resolve --applied {failed_migration}\n"
    f"Original error: {e.stderr}"
) from e
```

- P3009 with unparseable migration name now fails fast instead of
  silently falling through to the generic retry handler
- Removed --applied suggestion from RuntimeError message since it
  skips running migration SQL and can leave schema inconsistent
- Increased retry attempts from 4 to 5

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
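The fail-fast parsing described in the commit above can be sketched as follows (the regex mirrors the r"`(\d+_.*)` migration" pattern shown in the diff; the helper name and error text are illustrative):

```python
import re

def extract_failed_migration(stderr: str) -> str:
    """Pull the failed migration name out of a P3009 message, failing fast
    when the name cannot be parsed instead of falling through to a retry."""
    match = re.search(r"`(\d+_.*)` migration", stderr)
    if match is None:
        raise RuntimeError(
            f"P3009 reported but migration name could not be parsed: {stderr}"
        )
    return match.group(1)

stderr = (
    "P3009: migrate found failed migrations in the target database, "
    "`20250329084805_new_cron_job_table` migration."
)
print(extract_failed_migration(stderr))  # 20250329084805_new_cron_job_table
```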
Comment on lines +332 to +345
```python
                if ProxyExtrasDBManager._is_idempotent_error(e.stderr):
                    logger.info(
                        f"Migration {failed_migration} failed due to idempotent error (e.g., column already exists), resolving as applied"
                    )
                    ProxyExtrasDBManager._roll_back_migration(
                        failed_migration
                    )
                    ProxyExtrasDBManager._resolve_specific_migration(
                        failed_migration
                    )
                    logger.info(
                        f"✅ Migration {failed_migration} resolved."
                    )
                    return True
```

P3009 idempotent path returns True without re-running deploy

After resolving the idempotent P3009 failure the code immediately returns True, but there may be additional pending migrations after the now-resolved one that were never applied. prisma migrate deploy stops at the first failing migration, so any migrations B, C, … that followed the resolved migration A are still unapplied when the function returns successfully.

This stands in contrast to the P3005 path added in this same PR, which correctly re-runs prisma migrate deploy after _mark_all_migrations_applied to pick up remaining pending migrations. The inconsistency means an idempotent P3009 failure leaves the database in a partially-migrated state that is only corrected on the next startup (when deploy runs again cleanly).

Suggested fix — add a re-deploy after the rollback/resolve, mirroring the P3005 pattern:

```python
                ProxyExtrasDBManager._roll_back_migration(
                    failed_migration
                )
                ProxyExtrasDBManager._resolve_specific_migration(
                    failed_migration
                )
                logger.info(
                    f"✅ Migration {failed_migration} resolved."
                )
                # Re-run deploy to apply any remaining pending migrations
                result = subprocess.run(
                    [_get_prisma_command(), "migrate", "deploy"],
                    timeout=60,
                    check=True,
                    capture_output=True,
                    text=True,
                    env=_get_prisma_env(),
                )
                logger.info(f"prisma migrate deploy stdout: {result.stdout}")
                return True
```

Comment on lines +210 to +224
```python
    def test_p3009_non_idempotent_raises_runtime_error(
        self, mock_run, mock_dir, mock_getcwd, mock_chdir
    ):
        """P3009 with non-idempotent error should raise RuntimeError, not silently retry"""
        error = subprocess.CalledProcessError(
            1,
            "prisma",
            stderr="P3009: migrate found failed migrations in the target database, `20250329084805_new_cron_job_table` migration. Error: syntax error at or near 'ALTR'",
            output="",
        )
        mock_run.side_effect = error

        with pytest.raises(RuntimeError, match="requires manual intervention"):
            ProxyExtrasDBManager.setup_database(use_migrate=True)
```

Test doesn't isolate rollback call from initial deploy call

mock_run.side_effect = error causes every subprocess.run call to raise the same P3009 CalledProcessError. This means _roll_back_migration (which calls subprocess.run internally) also raises, is silently caught by except Exception as rollback_error, and then RuntimeError is raised. The test passes, but it is actually testing the "rollback-also-fails" code path rather than the primary one.

A more representative test would let _roll_back_migration succeed (mock side-effects as an iterable — first call raises, second call returns normally) so the test exercises the path where rollback succeeds but RuntimeError is still raised:

```python
mock_run.side_effect = [
    error,              # prisma migrate deploy → raises P3009
    MagicMock(),        # _roll_back_migration → succeeds
]
```

With the current single side-effect this also makes it impossible to assert that the rollback was actually attempted (mock_run.call_count is unverified here).

- P3009 idempotent path now re-runs prisma migrate deploy after resolving
  the failed migration, so subsequent migrations are not left unapplied
- _mark_all_migrations_applied raises RuntimeError on unexpected errors
  instead of silently swallowing them
- P3009 non-idempotent test now exercises the rollback-succeeds path
  and verifies rollback was called

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The previous approach resolved one idempotent migration per outer retry
attempt, exhausting all 5 retries when a DB had 4+ idempotent failures
(e.g., tables/columns already created by the old force-apply logic).

Now _deploy_with_idempotent_resolution loops internally: deploy → detect
idempotent P3009/P3018 → rollback + resolve → re-deploy, until all
idempotent migrations are resolved or a non-recoverable error is hit.
This keeps outer retries available for transient errors (timeouts, etc).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
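The internal loop described above can be sketched with stubbed deploy/resolve steps (all names hypothetical):

```python
def deploy_with_idempotent_resolution(deploy, resolve, max_resolutions: int = 10) -> int:
    """Sketch of the loop: deploy, resolve one idempotent failure, re-deploy,
    until deploy succeeds or the resolution budget is exhausted."""
    for attempt in range(max_resolutions):
        failed = deploy()   # returns None on success, else a failed migration name
        if failed is None:
            return attempt  # number of resolutions that were needed
        resolve(failed)     # rollback + mark-applied for one idempotent failure
    raise RuntimeError(f"Exceeded maximum idempotent resolutions ({max_resolutions})")

# Fake DB with three migrations that fail idempotently until resolved:
pending = ["m1", "m2", "m3"]
resolved = []

def deploy():
    for m in pending:
        if m not in resolved:
            return m
    return None

def resolve(name):
    resolved.append(name)

n = deploy_with_idempotent_resolution(deploy, resolve)
print(n)
```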
Comment on lines +395 to +414
```python
                        text=True,
                        env=_get_prisma_env(),
                    )
                    logger.info(f"prisma migrate deploy stdout: {result.stdout}")
                    logger.info("✅ prisma migrate deploy completed")
                    return
                except subprocess.CalledProcessError as e:
                    logger.info(f"prisma db error: {e.stderr}, e: {e.stdout}")
                    if "P3009" in e.stderr or "P3018" in e.stderr:
                        # Raises RuntimeError for non-recoverable errors,
                        # returns normally for resolved idempotent errors
                        ProxyExtrasDBManager._resolve_failed_migration(e)
                        logger.info("Re-deploying remaining migrations...")
                        continue
                    else:
                        raise

            raise RuntimeError(
                f"Exceeded maximum idempotent resolutions ({max_resolutions}). "
                f"This likely indicates a deeper issue with migration state."
```

P3005 path: new pending migrations silently skipped

_mark_all_migrations_applied globs all migration folders under migrations_dir/migrations/*/migration.sql and marks each one as applied via prisma migrate resolve --applied. This includes any migration file that was shipped in the current version but has not yet been run against the database.

Concretely: if version N+1 ships a new migration (20250401_add_new_feature/migration.sql), the P3005 path will:

  1. Create the 0_init baseline (correct – captures current DB state)
  2. Call _mark_all_migrations_applied, which marks 20250401_add_new_feature as applied without executing its SQL
  3. Call prisma migrate deploy – this is now a no-op because every migration is already recorded as applied

The net result is that the new migration's DDL changes are never applied to the database, and the function returns True, silently reporting success.

The old _resolve_all_migrations avoided this by generating and executing a diff from the current DB to schema.prisma, so new schema changes were actually materialised even when all migrations were then marked as applied. The new approach drops that safety net for the P3005 path.

A minimal fix is to only mark migrations that are truly already reflected in the DB — i.e., those whose SQL was captured in the 0_init baseline — and let prisma migrate deploy run the genuinely pending ones normally:

```python
# After _create_baseline_migration, only mark 0_init as applied
# (it already does this internally). Then let prisma migrate deploy
# apply any truly pending migrations.
logger.info(
    "Baseline migration created; running prisma migrate deploy for pending migrations..."
)
result = subprocess.run(
    [_get_prisma_command(), "migrate", "deploy"],
    timeout=60,
    check=True,
    capture_output=True,
    text=True,
    env=_get_prisma_env(),
)
```

_create_baseline_migration already marks 0_init as applied, so Prisma will skip it and only run the truly pending migrations on the subsequent deploy. _mark_all_migrations_applied would no longer be needed for this path.

```python
                ) from e
            migration_name = migration_match.group(1)
        else:
            raise  # Not a P3009/P3018 — let outer handler deal with it
```

Bare raise is dead code and relies on fragile implicit exception context

This raise statement appears in the else branch meaning neither "P3009" nor "P3018" appears in stderr. However, _resolve_failed_migration is only ever called from _deploy_with_idempotent_resolution after an explicit check if "P3009" in e.stderr or "P3018" in e.stderr:, so this else branch is unreachable dead code.

Additionally, a bare raise inside a @staticmethod called from an except block works only because Python implicitly preserves the active exception in sys.exc_info() across function call frames during exception handling. If _resolve_failed_migration is ever called from a non-exception context in the future (e.g. a new call site), this line will instead raise RuntimeError: No active exception to re-raise, obscuring the original problem.

Since the method already receives e as a parameter, it should use raise e to make intent explicit and avoid the fragile implicit dependency:

Suggested change

```diff
-            raise  # Not a P3009/P3018 — let outer handler deal with it
+            raise e  # Not a P3009/P3018 — let outer handler deal with it
```

- Fix bare `raise` → `raise e` in unreachable else branch of
  _resolve_failed_migration for correctness and clarity
- Remove _mark_all_migrations_applied from P3005 path — it was marking
  ALL migrations (including new ones) as applied, causing new migration
  SQL to never run. Now baseline creation is followed directly by
  _deploy_with_idempotent_resolution which actually runs pending migrations.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Comment on lines 444 to 458
```diff
@@ -456,86 +450,14 @@ def setup_database(use_migrate: bool = False) -> bool:
                         )
                         ProxyExtrasDBManager._create_baseline_migration(schema_path)
                         logger.info(
-                            "Baseline migration created, resolving all migrations"
-                        )
-                        ProxyExtrasDBManager._resolve_all_migrations(
-                            migrations_dir, schema_path
+                            "Baseline migration created, running deploy with idempotent resolution "
+                            "to apply remaining migrations..."
                         )
-                        logger.info("✅ All migrations resolved.")
+                        ProxyExtrasDBManager._deploy_with_idempotent_resolution()
+                        logger.info("✅ All migrations applied.")
                         return True
```

P3005 path silently skips new migrations on non-empty databases

In the P3005 path, _create_baseline_migration creates and marks 0_init as applied (representing the current DB state), then immediately calls _deploy_with_idempotent_resolution(). Any subsequent migration that performs a DROP COLUMN, DROP INDEX, or DROP TABLE on an object that doesn't yet exist in this particular DB will fail with a "does not exist" error. The _is_idempotent_error pattern list includes r"does not exist", so this failure is treated as idempotent and the migration is silently marked as applied without its SQL ever running.

Concrete scenario: a database was set up via db push (not migrate), so it lacks the columns being dropped by 20260311180521_schema_sync. The P3005 path will silently skip that migration, which is fine here — but if a future migration adds something that depends on a prior migration's ALTER TABLE ... ADD COLUMN having run, the silent skip will leave the DB in a broken state.

The PR description states "P3005 now uses _mark_all_migrations_applied + prisma migrate deploy", but the actual code uses _create_baseline_migration + _deploy_with_idempotent_resolution without calling _mark_all_migrations_applied at all. The two approaches differ: the current code attempts to re-execute each migration (relying on idempotent resolution to skip already-applied ones), whereas _mark_all_migrations_applied would skip them unconditionally. Consider whether the stated intent matches the implementation, and whether a conditional _mark_all_migrations_applied call before _deploy_with_idempotent_resolution would be safer for the P3005 case.

Comment on lines +1 to +11
```sql
-- DropIndex
DROP INDEX "LiteLLM_MCPServerTable_approval_status_idx";

-- AlterTable
ALTER TABLE "LiteLLM_MCPServerTable" DROP COLUMN "approval_status",
DROP COLUMN "review_notes",
DROP COLUMN "reviewed_at",
DROP COLUMN "source_url",
DROP COLUMN "submitted_at",
DROP COLUMN "submitted_by";
```

Destructive column removal is not safe for rolling deployments

This migration permanently drops approval_status, review_notes, reviewed_at, source_url, submitted_at, and submitted_by from LiteLLM_MCPServerTable. During a rolling deployment, pods still running the previous version will attempt to read or write these columns and receive a column does not exist database error the moment the new pod applies this migration.

The PR is specifically addressing rolling-deployment safety, so this bundled schema change works against that goal. The standard zero-downtime pattern is: (1) deploy new code that no longer references the columns, then (2) in a subsequent deploy, run the DROP COLUMN migration once all pods have moved to the new code. If these columns are already unreferenced by application code in the current branch, please confirm that before merging, and consider documenting that a brief read-path error window is expected.
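The expand/contract sequence suggested above could look like this pair of releases (illustrative SQL; the column subset and migration name are made up, not files from this PR):

```sql
-- Release N: application code stops reading/writing the columns; no DDL yet.

-- Release N+1 (after all pods run release N code): drop the now-unused columns.
-- e.g. 20260401_drop_mcp_review_columns/migration.sql
ALTER TABLE "LiteLLM_MCPServerTable"
    DROP COLUMN IF EXISTS "approval_status",
    DROP COLUMN IF EXISTS "review_notes";
```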

Comment on lines 431 to +460
```diff
@@ -382,72 +437,11 @@ def setup_database(use_migrate: bool = False) -> bool:
                 if use_migrate:
                     logger.info("Running prisma migrate deploy")
                     try:
-                        # Set migrations directory for Prisma
-                        result = subprocess.run(
-                            [_get_prisma_command(), "migrate", "deploy"],
-                            timeout=60,
-                            check=True,
-                            capture_output=True,
-                            text=True,
-                            env=_get_prisma_env(),
-                        )
-                        logger.info(f"prisma migrate deploy stdout: {result.stdout}")
-
-                        logger.info("prisma migrate deploy completed")
-
-                        # Run sanity check to ensure DB matches schema
-                        logger.info("Running post-migration sanity check...")
-                        ProxyExtrasDBManager._resolve_all_migrations(
-                            migrations_dir, schema_path, mark_all_applied=False
-                        )
-                        logger.info("✅ Post-migration sanity check completed")
+                        ProxyExtrasDBManager._deploy_with_idempotent_resolution()
                         return True
                     except subprocess.CalledProcessError as e:
                         logger.info(f"prisma db error: {e.stderr}, e: {e.stdout}")
-                        if "P3009" in e.stderr:
-                            # Extract the failed migration name from the error message
-                            migration_match = re.search(
-                                r"`(\d+_.*)` migration", e.stderr
-                            )
-                            if migration_match:
-                                failed_migration = migration_match.group(1)
-                                if ProxyExtrasDBManager._is_idempotent_error(e.stderr):
-                                    logger.info(
-                                        f"Migration {failed_migration} failed due to idempotent error (e.g., column already exists), resolving as applied"
-                                    )
-                                    ProxyExtrasDBManager._roll_back_migration(
-                                        failed_migration
-                                    )
-                                    ProxyExtrasDBManager._resolve_specific_migration(
-                                        failed_migration
-                                    )
-                                    logger.info(
-                                        f"✅ Migration {failed_migration} resolved."
-                                    )
-                                    return True
-                                else:
-                                    logger.info(
-                                        f"Found failed migration: {failed_migration}, marking as rolled back"
-                                    )
-                                    # Mark the failed migration as rolled back
-                                    subprocess.run(
-                                        [
-                                            _get_prisma_command(),
-                                            "migrate",
-                                            "resolve",
-                                            "--rolled-back",
-                                            failed_migration,
-                                        ],
-                                        timeout=60,
-                                        check=True,
-                                        capture_output=True,
-                                        text=True,
-                                        env=_get_prisma_env(),
-                                    )
-                                    logger.info(
-                                        f"✅ Migration {failed_migration} marked as rolled back... retrying"
-                                    )
-                        elif (
+                        if (
                             "P3005" in e.stderr
                             and "database schema is not empty" in e.stderr
                         ):
@@ -456,86 +450,14 @@ def setup_database(use_migrate: bool = False) -> bool:
                             )
                             ProxyExtrasDBManager._create_baseline_migration(schema_path)
                             logger.info(
-                                "Baseline migration created, resolving all migrations"
-                            )
-                            ProxyExtrasDBManager._resolve_all_migrations(
-                                migrations_dir, schema_path
+                                "Baseline migration created, running deploy with idempotent resolution "
+                                "to apply remaining migrations..."
                             )
-                            logger.info("✅ All migrations resolved.")
+                            ProxyExtrasDBManager._deploy_with_idempotent_resolution()
+                            logger.info("✅ All migrations applied.")
                             return True
-                        elif "P3018" in e.stderr:
-                            # Check if this is a permission error or idempotent error
-                            if ProxyExtrasDBManager._is_permission_error(e.stderr):
-                                # Permission errors should NOT be marked as applied
-                                # Extract migration name for logging
-                                migration_match = re.search(
-                                    r"Migration name: (\d+_.*)", e.stderr
-                                )
-                                migration_name = (
-                                    migration_match.group(1)
-                                    if migration_match
-                                    else "unknown"
-                                )
-
-                                logger.error(
-                                    f"❌ Migration {migration_name} failed due to insufficient permissions. "
-                                    f"Please check database user privileges. Error: {e.stderr}"
-                                )
-
-                                # Mark as rolled back and exit with error
-                                if migration_match:
-                                    try:
-                                        ProxyExtrasDBManager._roll_back_migration(
-                                            migration_name
-                                        )
-                                        logger.info(
-                                            f"Migration {migration_name} marked as rolled back"
-                                        )
-                                    except Exception as rollback_error:
-                                        logger.warning(
-                                            f"Failed to mark migration as rolled back: {rollback_error}"
-                                        )
-
-                                # Re-raise the error to prevent silent failures
-                                raise RuntimeError(
-                                    f"Migration failed due to permission error. Migration {migration_name} "
-                                    f"was NOT applied. Please grant necessary database permissions and retry."
-                                ) from e
-
-                            elif ProxyExtrasDBManager._is_idempotent_error(e.stderr):
-                                # Idempotent errors mean the migration has effectively been applied
-                                logger.info(
-                                    "Migration failed due to idempotent error (e.g., column already exists), "
-                                    "resolving as applied"
-                                )
-                                # Extract the migration name from the error message
-                                migration_match = re.search(
-                                    r"Migration name: (\d+_.*)", e.stderr
-                                )
-                                if migration_match:
-                                    migration_name = migration_match.group(1)
-                                    logger.info(
-                                        f"Rolling back migration {migration_name}"
-                                    )
-                                    ProxyExtrasDBManager._roll_back_migration(
-                                        migration_name
-                                    )
-                                    logger.info(
-                                        f"Resolving migration {migration_name} that failed "
-                                        f"due to existing schema objects"
-                                    )
-                                    ProxyExtrasDBManager._resolve_specific_migration(
-                                        migration_name
-                                    )
-                                    logger.info("✅ Migration resolved.")
-                            else:
-                                # Unknown P3018 error - log and re-raise for safety
-                                logger.warning(
-                                    f"P3018 error encountered but could not classify "
-                                    f"as permission or idempotent error. "
-                                    f"Error: {e.stderr}"
-                                )
-                                raise
                         else:
                             raise
```

RuntimeError from _deploy_with_idempotent_resolution bypasses all retries and propagates uncaught

_deploy_with_idempotent_resolution raises RuntimeError for non-idempotent failures (via _resolve_failed_migration). In setup_database, the outer retry loop only catches subprocess.CalledProcessError and subprocess.TimeoutExpired:

```python
for attempt in range(5):
    try:
        ...
        ProxyExtrasDBManager._deploy_with_idempotent_resolution()  # can raise RuntimeError
        ...
    except subprocess.TimeoutExpired:
        ...
    except subprocess.CalledProcessError as e:
        ...  # retried
    # RuntimeError is NOT caught → escapes setup_database entirely
```

A RuntimeError from a non-idempotent migration failure escapes setup_database without being caught, which is intentional for fail-fast semantics. However, call sites of setup_database that previously only expected a bool return value (or a CalledProcessError) will now receive an uncaught RuntimeError. The callers should be updated to handle this explicitly, or at minimum the function docstring should document that RuntimeError can be raised.
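A stripped-down model of this control flow (hypothetical stand-ins, not the actual setup_database) shows the fail-fast escape:

```python
import subprocess

def setup_database_like(step) -> bool:
    """Retry loop that, like setup_database, only catches subprocess errors.
    Any RuntimeError raised by `step` propagates to the caller immediately."""
    for attempt in range(5):
        try:
            step()
            return True
        except (subprocess.CalledProcessError, subprocess.TimeoutExpired):
            continue  # transient errors are retried
    return False

def non_idempotent_failure():
    raise RuntimeError("Migration requires manual intervention")

try:
    setup_database_like(non_idempotent_failure)
    escaped = False
except RuntimeError:
    escaped = True  # fail-fast: no retries consumed, caller must handle it

print(escaped)
```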

"duplicate key value violates" on CREATE UNIQUE INDEX means the index
can't be created due to duplicate data — this is NOT idempotent. Marking
it as applied silently skips index creation, leaving the DB without the
unique constraint.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
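The distinction drawn in the commit message can be expressed as a small classifier. This is a hypothetical helper for illustration only; the substring checks mirror the error messages quoted above and are assumptions about the Postgres error text, not the PR's actual logic:

```python
def is_idempotent_p3018(stderr: str) -> bool:
    """Classify a P3018 failure by its stderr text (illustrative only).

    "already exists" means the object was created by a previous partial
    run, so the migration is safe to mark as applied. A duplicate-key
    violation means the data itself blocks a CREATE UNIQUE INDEX;
    marking that migration applied would silently drop the unique
    constraint.
    """
    if "duplicate key value violates" in stderr:
        return False  # non-idempotent: duplicate rows block the index
    return "already exists" in stderr
```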
Comment on lines 244 to +283
@@ -354,10 +273,145 @@ def _resolve_all_migrations(
)
logger.debug(f"Resolved migration: {migration_name}")
except subprocess.CalledProcessError as e:
-                if "is already recorded as applied in the database." not in e.stderr:
-                    logger.warning(
-                        f"Failed to resolve migration {migration_name}: {e.stderr}"
-                    )
+                if "is already recorded as applied in the database." in e.stderr:
+                    logger.debug(
+                        f"Migration {migration_name} already recorded as applied"
+                    )
+                else:
+                    raise RuntimeError(
+                        f"Failed to mark migration {migration_name} as applied: {e.stderr}"
+                    ) from e
_mark_all_migrations_applied is dead production code

This method is defined, documented, and covered by four new tests, but it is never called in any production code path. The PR description states "P3005 (non-empty DB, first migration) now uses _mark_all_migrations_applied + prisma migrate deploy", but the actual implementation uses _create_baseline_migration (which marks only 0_init as applied) followed by _deploy_with_idempotent_resolution.

A quick grep confirms no call site exists outside of tests:

grep -n "_mark_all_migrations_applied" utils.py
# 245: def _mark_all_migrations_applied(migrations_dir: str):  ← definition only

The intended semantics (mark all shipped migrations as already reflected in the DB) are important to document correctly. If this method is meant as a utility for future use, a clear "# NOTE: not currently called in production; see _create_baseline_migration for the P3005 path" comment would help. If it is genuinely needed in the P3005 flow, it should be wired in; otherwise, the tests cover a code path that never executes in production.

P3005 fires only during the db-push-to-migrate transition, when the
_prisma_migrations table doesn't exist yet. Since db push already kept
the schema in sync, all shipped migrations are already reflected in the
DB. Mark them all as applied (standard Prisma baselining) instead of
running deploy-with-resolution, which would fail on pre-existing data
issues like duplicate rows preventing unique index creation.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
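Standard Prisma baselining, as described in the commit message above, can be sketched like this. This is an illustrative helper, not the PR's exact _mark_all_migrations_applied; the injectable run parameter is an assumption added so the sketch can be tested without a real Prisma CLI:

```python
import os
import subprocess

def mark_all_migrations_applied(migrations_dir: str, run=subprocess.run):
    """Sketch of Prisma baselining: mark every shipped migration as
    applied without executing its SQL.

    Each migration lives in its own subdirectory of migrations_dir;
    names sort chronologically because of their timestamp prefix.
    """
    names = sorted(
        d for d in os.listdir(migrations_dir)
        if os.path.isdir(os.path.join(migrations_dir, d))
    )
    for name in names:
        # `prisma migrate resolve --applied` only updates the
        # _prisma_migrations tracking table; it never runs migration SQL.
        run(["prisma", "migrate", "resolve", "--applied", name], check=True)
    return names
```

Because `migrate resolve --applied` touches only the tracking table, this path cannot thrash the schema the way the old diff-and-apply mechanism could.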
Comment on lines +188 to +196
    return_value=["20250326162113_baseline"],
)
def test_skips_already_applied_migration(self, mock_get_names, mock_run):
    """Verify already-applied migrations are silently skipped"""
    mock_run.side_effect = subprocess.CalledProcessError(
        1,
        "prisma",
        stderr="Migration `20250326162113_baseline` is already recorded as applied in the database.",
    )
Tests assert exact Prisma command without mocking _get_prisma_command()

Multiple tests assert against a hardcoded ["prisma", "migrate", "deploy"] array:

assert cmd == ["prisma", "migrate", "deploy"]

_get_prisma_command() is environment-sensitive — when PRISMA_OFFLINE_MODE=true is set (a valid production and potentially CI configuration), it returns a full binary path like /app/.cache/prisma-python/binaries/node_modules/.bin/prisma or a PRISMA_CLI_PATH override instead of "prisma". These assertions would then fail despite the production code working correctly.

The same pattern appears in test_p3009_non_idempotent_raises_runtime_error (line 222), test_p3009_idempotent_redeploys (line 341), and test_successful_deploy (line 343). Patching _get_prisma_command to return a fixed value in each test class would make the assertions environment-agnostic:

@patch("litellm_proxy_extras.utils._get_prisma_command", return_value="prisma")
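The suggested patching pattern can be shown on a stand-in class. FakeManager is hypothetical; it only mimics the shape of ProxyExtrasDBManager so the example is self-contained and does not depend on the real module:

```python
from unittest.mock import patch

class FakeManager:
    """Stand-in for ProxyExtrasDBManager (assumed shape, for illustration)."""

    @staticmethod
    def _get_prisma_command():
        # Environment-sensitive in reality: may be a PRISMA_CLI_PATH
        # override or a cached binary path under PRISMA_OFFLINE_MODE.
        return "/app/.cache/prisma-python/.bin/prisma"

    @classmethod
    def deploy(cls):
        return [cls._get_prisma_command(), "migrate", "deploy"]

# Pinning the command makes the assertion environment-agnostic:
with patch.object(FakeManager, "_get_prisma_command", return_value="prisma"):
    assert FakeManager.deploy() == ["prisma", "migrate", "deploy"]
```

With the patch in place, the test asserts on intent ("a migrate deploy was issued") rather than on whichever binary path the host environment happens to resolve.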

@yuneng-jiang yuneng-jiang marked this pull request as draft March 12, 2026 19:32