[Fix] Remove dangerous schema diff force-apply from Prisma migration startup#23464
yuneng-jiang wants to merge 12 commits into litellm_internal_dev_03_12_2026
Conversation
The `_resolve_all_migrations` method was generating a diff between the DB and `schema.prisma` and force-applying it via `prisma db execute`, bypassing migration tracking entirely. This caused schema thrashing during rolling deployments when two versions with different schemas were running.

- Remove `_resolve_all_migrations` and its post-deploy "sanity check" call
- Replace with `_mark_all_migrations_applied` (marks only, no diff/apply)
- P3005 handler now uses `_mark_all_migrations_applied` + `prisma migrate deploy`
- P3009 non-idempotent errors now fail fast with actionable error messages

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Greptile Summary

This PR addresses a real and critical rolling-deployment bug. Key changes and remaining concerns:
Confidence Score: 2/5
| Filename | Overview |
|---|---|
| litellm-proxy-extras/litellm_proxy_extras/utils.py | Core migration logic refactored to remove dangerous diff-and-apply mechanism; adds _deploy_with_idempotent_resolution and _mark_all_migrations_applied; multiple logic concerns in P3005 baseline path and error handling already flagged in prior threads |
| litellm-proxy-extras/litellm_proxy_extras/migrations/20260311180521_schema_sync/migration.sql | New migration drops six columns and one index from LiteLLM_MCPServerTable — destructive DDL already flagged as unsafe for rolling deployments in prior threads |
| litellm-proxy-extras/litellm_proxy_extras/migrations/20260309115809_add_missing_indexes/migration.sql | Migration file deleted without a corresponding DROP INDEX migration; any database that already applied this migration will have orphaned indexes that are no longer tracked by the Prisma schema |
| litellm-proxy-extras/litellm_proxy_extras/schema.prisma | Removes @@index([key_alias]) from LiteLLM_VerificationToken and @@index([user, startTime]) from LiteLLM_SpendLogs without any DROP INDEX migration, creating schema drift for existing deployments and silent performance regression for new ones |
| tests/litellm-proxy-extras/test_litellm_proxy_extras_utils.py | Adds meaningful new tests for _mark_all_migrations_applied and _deploy_with_idempotent_resolution; assertions against exact Prisma command arrays are environment-sensitive because _get_prisma_command() is not mocked |
Flowchart

```mermaid
%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[setup_database called\nuse_migrate=True] --> B[_deploy_with_idempotent_resolution]
    B --> C{prisma migrate deploy}
    C -->|Success| D[return True ✅]
    C -->|CalledProcessError| E{P3009 or P3018?}
    E -->|No| F[raise CalledProcessError]
    E -->|Yes| G[_resolve_failed_migration]
    G --> H{idempotent error?}
    H -->|No - non-idempotent| I[_roll_back_migration\nthen raise RuntimeError ❌]
    H -->|Yes - idempotent| J[_roll_back_migration\n+ _resolve_specific_migration]
    J --> K[continue loop\nRe-deploy]
    K --> C
    B -->|max_resolutions exceeded| L[raise RuntimeError\nmax iterations ❌]
    C -->|CalledProcessError\nP3005 - non-empty DB| M[_create_baseline_migration]
    M --> N[_mark_all_migrations_applied\nmarks ALL migrations applied\nwithout running SQL]
    N --> O[return True ⚠️\nNew migrations not deployed]
    style D fill:#90EE90
    style I fill:#FFB6C1
    style L fill:#FFB6C1
    style O fill:#FFD700
```
Comments Outside Diff (1)
- `litellm-proxy-extras/litellm_proxy_extras/migrations/20260309115809_add_missing_indexes/migration.sql` — Deleted migration file causes orphaned indexes and schema drift

This migration file created `LiteLLM_VerificationToken_key_alias_idx` and `LiteLLM_SpendLogs_user_startTime_idx` via `CREATE INDEX CONCURRENTLY`. Deleting the file without adding a corresponding `DROP INDEX` migration in its place leaves two problems:
- Schema drift for existing deployments: Any database that already applied `20260309115809_add_missing_indexes` (e.g. a staging or CI environment on the `litellm_internal_dev_03_12_2026` branch) will have these indexes present in the database. Because `schema.prisma` no longer declares `@@index([key_alias])` or `@@index([user, startTime])`, Prisma will permanently consider the DB to be drifted, and `prisma migrate diff`/`prisma migrate status` will flag it as out of sync.
- Silent performance regression for new deployments: New databases will never get these indexes. The comments in the deleted schema entries indicate they were specifically added to accelerate `ORDER BY key_alias` queries on `LiteLLM_VerificationToken` and time-bounded spend queries on `LiteLLM_SpendLogs`. Removing them without documentation will silently degrade those query paths.
The standard Prisma approach is to add a new migration that drops the indexes rather than delete the migration that created them:
```sql
-- DropIndex
DROP INDEX CONCURRENTLY IF EXISTS "LiteLLM_VerificationToken_key_alias_idx";

-- DropIndex
DROP INDEX CONCURRENTLY IF EXISTS "LiteLLM_SpendLogs_user_startTime_idx";
```
If no environment has ever run `20260309115809_add_missing_indexes` (i.e. it was only ever on an internal dev branch), please add a comment confirming this so reviewers can verify the deletion is safe.
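One way to check this empirically is to query Prisma's own tracking table for the migration name. A minimal sketch, assuming the standard `_prisma_migrations` layout with a `migration_name` column; the in-memory sqlite database here only stands in for the real Postgres connection:

```python
import sqlite3

# Stand-in for a Postgres connection; Prisma records every applied migration
# as a row in the _prisma_migrations table.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE _prisma_migrations (migration_name TEXT, finished_at TEXT)"
)
conn.execute(
    "INSERT INTO _prisma_migrations VALUES ('20250326162113_baseline', '2025-03-26')"
)

def was_ever_applied(conn, migration_name: str) -> bool:
    """Return True if the migration is recorded as applied in the tracking table."""
    row = conn.execute(
        "SELECT 1 FROM _prisma_migrations WHERE migration_name = ?",
        (migration_name,),
    ).fetchone()
    return row is not None

print(was_ever_applied(conn, "20260309115809_add_missing_indexes"))  # False
```

Running the equivalent query against each staging/CI database would confirm whether the deletion is safe for that environment.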
Last reviewed commit: b4ee36c
```python
raise RuntimeError(
    f"Migration {failed_migration} failed and requires manual intervention. "
    f"Please inspect the migration and database state, then either:\n"
    f"  - Fix the issue and restart, or\n"
    f"  - Run: prisma migrate resolve --applied {failed_migration}\n"
    f"Original error: {e.stderr}"
) from e
```
Misleading remediation instruction in error message
The RuntimeError message suggests running prisma migrate resolve --applied {failed_migration} as an alternative to fixing and restarting. However, --applied marks the migration as applied in Prisma's tracking table without actually executing the migration SQL, which can leave the database schema in an inconsistent state if the migration contained important DDL changes.
A user following this instruction after a partial migration failure would silently skip the migration entirely. A clearer message would clarify that --applied should only be used when the changes were already applied to the database by other means (e.g., run manually):
```python
raise RuntimeError(
    f"Migration {failed_migration} failed and requires manual intervention. "
    f"The migration has been marked as rolled-back. Please either:\n"
    f"  1. Fix the migration SQL and restart to re-run it, or\n"
    f"  2. Manually apply the changes to the DB, then run:\n"
    f"     prisma migrate resolve --applied {failed_migration}\n"
    f"Original error: {e.stderr}"
) from e
```

- P3009 with unparseable migration name now fails fast instead of silently falling through to the generic retry handler
- Removed --applied suggestion from RuntimeError message since it skips running migration SQL and can leave schema inconsistent
- Increased retry attempts from 4 to 5

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
```python
if ProxyExtrasDBManager._is_idempotent_error(e.stderr):
    logger.info(
        f"Migration {failed_migration} failed due to idempotent error (e.g., column already exists), resolving as applied"
    )
    ProxyExtrasDBManager._roll_back_migration(
        failed_migration
    )
    ProxyExtrasDBManager._resolve_specific_migration(
        failed_migration
    )
    logger.info(
        f"✅ Migration {failed_migration} resolved."
    )
    return True
```
P3009 idempotent path returns True without re-running deploy
After resolving the idempotent P3009 failure the code immediately returns True, but there may be additional pending migrations after the now-resolved one that were never applied. prisma migrate deploy stops at the first failing migration, so any migrations B, C, … that followed the resolved migration A are still unapplied when the function returns successfully.
This stands in contrast to the P3005 path added in this same PR, which correctly re-runs prisma migrate deploy after _mark_all_migrations_applied to pick up remaining pending migrations. The inconsistency means an idempotent P3009 failure leaves the database in a partially-migrated state that is only corrected on the next startup (when deploy runs again cleanly).
Suggested fix — add a re-deploy after the rollback/resolve, mirroring the P3005 pattern:
```python
ProxyExtrasDBManager._roll_back_migration(
    failed_migration
)
ProxyExtrasDBManager._resolve_specific_migration(
    failed_migration
)
logger.info(
    f"✅ Migration {failed_migration} resolved."
)
# Re-run deploy to apply any remaining pending migrations
result = subprocess.run(
    [_get_prisma_command(), "migrate", "deploy"],
    timeout=60,
    check=True,
    capture_output=True,
    text=True,
    env=_get_prisma_env(),
)
logger.info(f"prisma migrate deploy stdout: {result.stdout}")
return True
```

```python
def test_p3009_non_idempotent_raises_runtime_error(
    self, mock_run, mock_dir, mock_getcwd, mock_chdir
):
    """P3009 with non-idempotent error should raise RuntimeError, not silently retry"""
    error = subprocess.CalledProcessError(
        1,
        "prisma",
        stderr="P3009: migrate found failed migrations in the target database, `20250329084805_new_cron_job_table` migration. Error: syntax error at or near 'ALTR'",
        output="",
    )
    mock_run.side_effect = error

    with pytest.raises(RuntimeError, match="requires manual intervention"):
        ProxyExtrasDBManager.setup_database(use_migrate=True)
```
Test doesn't isolate rollback call from initial deploy call
mock_run.side_effect = error causes every subprocess.run call to raise the same P3009 CalledProcessError. This means _roll_back_migration (which calls subprocess.run internally) also raises, is silently caught by except Exception as rollback_error, and then RuntimeError is raised. The test passes, but it is actually testing the "rollback-also-fails" code path rather than the primary one.
A more representative test would let _roll_back_migration succeed (mock side-effects as an iterable — first call raises, second call returns normally) so the test exercises the path where rollback succeeds but RuntimeError is still raised:
```python
mock_run.side_effect = [
    error,        # prisma migrate deploy → raises P3009
    MagicMock(),  # _roll_back_migration → succeeds
]
```

With the current single side-effect this also makes it impossible to assert that the rollback was actually attempted (`mock_run.call_count` is unverified here).
- P3009 idempotent path now re-runs prisma migrate deploy after resolving the failed migration, so subsequent migrations are not left unapplied
- _mark_all_migrations_applied raises RuntimeError on unexpected errors instead of silently swallowing them
- P3009 non-idempotent test now exercises the rollback-succeeds path and verifies rollback was called

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The previous approach resolved one idempotent migration per outer retry attempt, exhausting all 5 retries when a DB had 4+ idempotent failures (e.g., tables/columns already created by the old force-apply logic). Now _deploy_with_idempotent_resolution loops internally: deploy → detect idempotent P3009/P3018 → rollback + resolve → re-deploy, until all idempotent migrations are resolved or a non-recoverable error is hit. This keeps outer retries available for transient errors (timeouts, etc). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
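The internal loop described in the commit message can be sketched as follows. This is a hedged, simplified illustration: `deploy`, `resolve_migration`, and `MigrateError` are stand-ins for the actual Prisma subprocess calls and error parsing in `litellm-proxy-extras`, not the real implementation:

```python
# Hypothetical sketch of the deploy → detect → rollback/resolve → re-deploy loop.
class MigrateError(Exception):
    """Stand-in for a parsed P3009/P3018 failure."""
    def __init__(self, migration: str, idempotent: bool):
        self.migration = migration
        self.idempotent = idempotent

def deploy_with_idempotent_resolution(deploy, resolve_migration, max_resolutions=10):
    for _ in range(max_resolutions):
        try:
            deploy()  # may raise MigrateError for a failed migration
            return True
        except MigrateError as e:
            if not e.idempotent:
                # Non-recoverable: fail fast instead of burning outer retries
                raise RuntimeError(
                    f"Migration {e.migration} requires manual intervention"
                ) from e
            resolve_migration(e.migration)  # rollback + mark applied, then loop re-deploys
    raise RuntimeError(f"Exceeded maximum idempotent resolutions ({max_resolutions})")

# Simulate a DB with two idempotent failures before deploy finally succeeds
pending = [MigrateError("a", True), MigrateError("b", True), None]
resolved = []

def deploy():
    err = pending.pop(0)
    if err:
        raise err

ok = deploy_with_idempotent_resolution(deploy, resolved.append)
print(ok, resolved)  # True ['a', 'b']
```

The key property, matching the commit message, is that any number of idempotent failures are resolved inside one call, so the outer retry budget is reserved for transient errors.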
```python
                text=True,
                env=_get_prisma_env(),
            )
            logger.info(f"prisma migrate deploy stdout: {result.stdout}")
            logger.info("✅ prisma migrate deploy completed")
            return
        except subprocess.CalledProcessError as e:
            logger.info(f"prisma db error: {e.stderr}, e: {e.stdout}")
            if "P3009" in e.stderr or "P3018" in e.stderr:
                # Raises RuntimeError for non-recoverable errors,
                # returns normally for resolved idempotent errors
                ProxyExtrasDBManager._resolve_failed_migration(e)
                logger.info("Re-deploying remaining migrations...")
                continue
            else:
                raise

    raise RuntimeError(
        f"Exceeded maximum idempotent resolutions ({max_resolutions}). "
        f"This likely indicates a deeper issue with migration state."
```
P3005 path: new pending migrations silently skipped
_mark_all_migrations_applied globs all migration folders under migrations_dir/migrations/*/migration.sql and marks each one as applied via prisma migrate resolve --applied. This includes any migration file that was shipped in the current version but has not yet been run against the database.
Concretely: if version N+1 ships a new migration (`20250401_add_new_feature/migration.sql`), the P3005 path will:

1. Create the `0_init` baseline (correct – captures current DB state)
2. Call `_mark_all_migrations_applied`, which marks `20250401_add_new_feature` as applied without executing its SQL
3. Call `prisma migrate deploy` – this is now a no-op because every migration is already recorded as applied
The net result is that the new migration's DDL changes are never applied to the database, and the function returns True, silently reporting success.
The old _resolve_all_migrations avoided this by generating and executing a diff from the current DB to schema.prisma, so new schema changes were actually materialised even when all migrations were then marked as applied. The new approach drops that safety net for the P3005 path.
A minimal fix is to only mark migrations that are truly already reflected in the DB — i.e., those whose SQL was captured in the 0_init baseline — and let prisma migrate deploy run the genuinely pending ones normally:
```python
# After _create_baseline_migration, only mark 0_init as applied
# (it already does this internally). Then let prisma migrate deploy
# apply any truly pending migrations.
logger.info(
    "Baseline migration created; running prisma migrate deploy for pending migrations..."
)
result = subprocess.run(
    [_get_prisma_command(), "migrate", "deploy"],
    timeout=60,
    check=True,
    capture_output=True,
    text=True,
    env=_get_prisma_env(),
)
```

`_create_baseline_migration` already marks `0_init` as applied, so Prisma will skip it and only run the truly pending migrations on the subsequent deploy. `_mark_all_migrations_applied` would no longer be needed for this path.
Revert "fix: add missing indexes for top CPU-consuming queries (#23147)"
```python
            ) from e
        migration_name = migration_match.group(1)
    else:
        raise  # Not a P3009/P3018 — let outer handler deal with it
```
Bare raise is dead code and relies on fragile implicit exception context
This raise statement appears in the else branch meaning neither "P3009" nor "P3018" appears in stderr. However, _resolve_failed_migration is only ever called from _deploy_with_idempotent_resolution after an explicit check if "P3009" in e.stderr or "P3018" in e.stderr:, so this else branch is unreachable dead code.
Additionally, a bare raise inside a @staticmethod called from an except block works only because Python implicitly preserves the active exception in sys.exc_info() across function call frames during exception handling. If _resolve_failed_migration is ever called from a non-exception context in the future (e.g. a new call site), this line will instead raise RuntimeError: No active exception to re-raise, obscuring the original problem.
Since the method already receives e as a parameter, it should use raise e to make intent explicit and avoid the fragile implicit dependency:
```diff
-        raise  # Not a P3009/P3018 — let outer handler deal with it
+        raise e  # Not a P3009/P3018 — let outer handler deal with it
```
- Fix bare `raise` → `raise e` in unreachable else branch of _resolve_failed_migration for correctness and clarity
- Remove _mark_all_migrations_applied from P3005 path — it was marking ALL migrations (including new ones) as applied, causing new migration SQL to never run. Now baseline creation is followed directly by _deploy_with_idempotent_resolution which actually runs pending migrations.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
```diff
@@ -456,86 +450,14 @@ def setup_database(use_migrate: bool = False) -> bool:
                     )
                     ProxyExtrasDBManager._create_baseline_migration(schema_path)
                     logger.info(
-                        "Baseline migration created, resolving all migrations"
-                    )
-                    ProxyExtrasDBManager._resolve_all_migrations(
-                        migrations_dir, schema_path
+                        "Baseline migration created, running deploy with idempotent resolution "
+                        "to apply remaining migrations..."
                     )
-                    logger.info("✅ All migrations resolved.")
+                    ProxyExtrasDBManager._deploy_with_idempotent_resolution()
+                    logger.info("✅ All migrations applied.")
                     return True
```
P3005 path silently skips new migrations on non-empty databases
In the P3005 path, _create_baseline_migration creates and marks 0_init as applied (representing the current DB state), then immediately calls _deploy_with_idempotent_resolution(). Any subsequent migration that performs a DROP COLUMN, DROP INDEX, or DROP TABLE on an object that doesn't yet exist in this particular DB will fail with a "does not exist" error. The _is_idempotent_error pattern list includes r"does not exist", so this failure is treated as idempotent and the migration is silently marked as applied without its SQL ever running.
Concrete scenario: a database was set up via db push (not migrate), so it lacks the columns being dropped by 20260311180521_schema_sync. The P3005 path will silently skip that migration, which is fine here — but if a future migration adds something that depends on a prior migration's ALTER TABLE ... ADD COLUMN having run, the silent skip will leave the DB in a broken state.
The PR description states "P3005 now uses _mark_all_migrations_applied + prisma migrate deploy", but the actual code uses _create_baseline_migration + _deploy_with_idempotent_resolution without calling _mark_all_migrations_applied at all. The two approaches differ: the current code attempts to re-execute each migration (relying on idempotent resolution to skip already-applied ones), whereas _mark_all_migrations_applied would skip them unconditionally. Consider whether the stated intent matches the implementation, and whether a conditional _mark_all_migrations_applied call before _deploy_with_idempotent_resolution would be safer for the P3005 case.
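The misclassification risk described above is easy to see in a toy version of the check. This is a hedged sketch: the pattern list here is illustrative and is not the actual `_is_idempotent_error` implementation, beyond the `r"does not exist"` pattern the review calls out:

```python
import re

# Illustrative pattern list in the spirit of _is_idempotent_error.
IDEMPOTENT_PATTERNS = [
    r"already exists",
    r"does not exist",
    r"duplicate column",
]

def looks_idempotent(stderr: str) -> bool:
    """Return True if the migration failure looks safely skippable."""
    return any(re.search(p, stderr) for p in IDEMPOTENT_PATTERNS)

# A failed DROP COLUMN on a db-push-provisioned database matches
# "does not exist" and would be resolved as applied without its SQL running:
drop_err = 'ERROR: column "approval_status" of relation "LiteLLM_MCPServerTable" does not exist'
print(looks_idempotent(drop_err))  # True
```

Note that a genuinely non-idempotent failure such as "duplicate key value violates unique constraint" does not match this list, consistent with the later commit that excludes it.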
```sql
-- DropIndex
DROP INDEX "LiteLLM_MCPServerTable_approval_status_idx";

-- AlterTable
ALTER TABLE "LiteLLM_MCPServerTable" DROP COLUMN "approval_status",
DROP COLUMN "review_notes",
DROP COLUMN "reviewed_at",
DROP COLUMN "source_url",
DROP COLUMN "submitted_at",
DROP COLUMN "submitted_by";
```
Destructive column removal is not safe for rolling deployments
This migration permanently drops approval_status, review_notes, reviewed_at, source_url, submitted_at, and submitted_by from LiteLLM_MCPServerTable. During a rolling deployment, pods still running the previous version will attempt to read or write these columns and receive a column does not exist database error the moment the new pod applies this migration.
The PR is specifically addressing rolling-deployment safety, so this bundled schema change works against that goal. The standard zero-downtime pattern is: (1) deploy new code that no longer references the columns, then (2) in a subsequent deploy, run the DROP COLUMN migration once all pods have moved to the new code. If these columns are already unreferenced by application code in the current branch, please confirm that before merging, and consider documenting that a brief read-path error window is expected.
```diff
@@ -382,72 +437,11 @@ def setup_database(use_migrate: bool = False) -> bool:
             if use_migrate:
                 logger.info("Running prisma migrate deploy")
                 try:
-                    # Set migrations directory for Prisma
-                    result = subprocess.run(
-                        [_get_prisma_command(), "migrate", "deploy"],
-                        timeout=60,
-                        check=True,
-                        capture_output=True,
-                        text=True,
-                        env=_get_prisma_env(),
-                    )
-                    logger.info(f"prisma migrate deploy stdout: {result.stdout}")
-
-                    logger.info("prisma migrate deploy completed")
-
-                    # Run sanity check to ensure DB matches schema
-                    logger.info("Running post-migration sanity check...")
-                    ProxyExtrasDBManager._resolve_all_migrations(
-                        migrations_dir, schema_path, mark_all_applied=False
-                    )
-                    logger.info("✅ Post-migration sanity check completed")
+                    ProxyExtrasDBManager._deploy_with_idempotent_resolution()
                     return True
                 except subprocess.CalledProcessError as e:
                     logger.info(f"prisma db error: {e.stderr}, e: {e.stdout}")
-                    if "P3009" in e.stderr:
-                        # Extract the failed migration name from the error message
-                        migration_match = re.search(
-                            r"`(\d+_.*)` migration", e.stderr
-                        )
-                        if migration_match:
-                            failed_migration = migration_match.group(1)
-                            if ProxyExtrasDBManager._is_idempotent_error(e.stderr):
-                                logger.info(
-                                    f"Migration {failed_migration} failed due to idempotent error (e.g., column already exists), resolving as applied"
-                                )
-                                ProxyExtrasDBManager._roll_back_migration(
-                                    failed_migration
-                                )
-                                ProxyExtrasDBManager._resolve_specific_migration(
-                                    failed_migration
-                                )
-                                logger.info(
-                                    f"✅ Migration {failed_migration} resolved."
-                                )
-                                return True
-                            else:
-                                logger.info(
-                                    f"Found failed migration: {failed_migration}, marking as rolled back"
-                                )
-                                # Mark the failed migration as rolled back
-                                subprocess.run(
-                                    [
-                                        _get_prisma_command(),
-                                        "migrate",
-                                        "resolve",
-                                        "--rolled-back",
-                                        failed_migration,
-                                    ],
-                                    timeout=60,
-                                    check=True,
-                                    capture_output=True,
-                                    text=True,
-                                    env=_get_prisma_env(),
-                                )
-                                logger.info(
-                                    f"✅ Migration {failed_migration} marked as rolled back... retrying"
-                                )
-                    elif (
+                    if (
                         "P3005" in e.stderr
                         and "database schema is not empty" in e.stderr
                     ):
@@ -456,86 +450,14 @@ def setup_database(use_migrate: bool = False) -> bool:
                         )
                         ProxyExtrasDBManager._create_baseline_migration(schema_path)
                         logger.info(
-                            "Baseline migration created, resolving all migrations"
-                        )
-                        ProxyExtrasDBManager._resolve_all_migrations(
-                            migrations_dir, schema_path
+                            "Baseline migration created, running deploy with idempotent resolution "
+                            "to apply remaining migrations..."
                         )
-                        logger.info("✅ All migrations resolved.")
+                        ProxyExtrasDBManager._deploy_with_idempotent_resolution()
+                        logger.info("✅ All migrations applied.")
                         return True
-                    elif "P3018" in e.stderr:
-                        # Check if this is a permission error or idempotent error
-                        if ProxyExtrasDBManager._is_permission_error(e.stderr):
-                            # Permission errors should NOT be marked as applied
-                            # Extract migration name for logging
-                            migration_match = re.search(
-                                r"Migration name: (\d+_.*)", e.stderr
-                            )
-                            migration_name = (
-                                migration_match.group(1)
-                                if migration_match
-                                else "unknown"
-                            )
-
-                            logger.error(
-                                f"❌ Migration {migration_name} failed due to insufficient permissions. "
-                                f"Please check database user privileges. Error: {e.stderr}"
-                            )
-
-                            # Mark as rolled back and exit with error
-                            if migration_match:
-                                try:
-                                    ProxyExtrasDBManager._roll_back_migration(
-                                        migration_name
-                                    )
-                                    logger.info(
-                                        f"Migration {migration_name} marked as rolled back"
-                                    )
-                                except Exception as rollback_error:
-                                    logger.warning(
-                                        f"Failed to mark migration as rolled back: {rollback_error}"
-                                    )
-
-                            # Re-raise the error to prevent silent failures
-                            raise RuntimeError(
-                                f"Migration failed due to permission error. Migration {migration_name} "
-                                f"was NOT applied. Please grant necessary database permissions and retry."
-                            ) from e
-
-                        elif ProxyExtrasDBManager._is_idempotent_error(e.stderr):
-                            # Idempotent errors mean the migration has effectively been applied
-                            logger.info(
-                                "Migration failed due to idempotent error (e.g., column already exists), "
-                                "resolving as applied"
-                            )
-                            # Extract the migration name from the error message
-                            migration_match = re.search(
-                                r"Migration name: (\d+_.*)", e.stderr
-                            )
-                            if migration_match:
-                                migration_name = migration_match.group(1)
-                                logger.info(
-                                    f"Rolling back migration {migration_name}"
-                                )
-                                ProxyExtrasDBManager._roll_back_migration(
-                                    migration_name
-                                )
-                                logger.info(
-                                    f"Resolving migration {migration_name} that failed "
-                                    f"due to existing schema objects"
-                                )
-                                ProxyExtrasDBManager._resolve_specific_migration(
-                                    migration_name
-                                )
-                                logger.info("✅ Migration resolved.")
-                            else:
-                                # Unknown P3018 error - log and re-raise for safety
-                                logger.warning(
-                                    f"P3018 error encountered but could not classify "
-                                    f"as permission or idempotent error. "
-                                    f"Error: {e.stderr}"
-                                )
-                                raise
                     else:
                         raise
```
RuntimeError from _deploy_with_idempotent_resolution bypasses all retries and propagates uncaught
_deploy_with_idempotent_resolution raises RuntimeError for non-idempotent failures (via _resolve_failed_migration). In setup_database, the outer retry loop only catches subprocess.CalledProcessError and subprocess.TimeoutExpired:
```python
for attempt in range(5):
    try:
        ...
        ProxyExtrasDBManager._deploy_with_idempotent_resolution()  # can raise RuntimeError
        ...
    except subprocess.TimeoutExpired:
        ...
    except subprocess.CalledProcessError as e:
        ...  # retried
# RuntimeError is NOT caught → escapes setup_database entirely
```

A RuntimeError from a non-idempotent migration failure escapes setup_database without being caught, which is intentional for fail-fast semantics. However, call sites of setup_database that previously only expected a bool return value (or a CalledProcessError) will now receive an uncaught RuntimeError. The callers should be updated to handle this explicitly, or at minimum the function docstring should document that RuntimeError can be raised.
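A call site could handle the new failure mode explicitly. This is a hedged sketch of one possible caller, not the actual LiteLLM startup code; `start_proxy`, `failing_setup`, and the migration name are all hypothetical stand-ins:

```python
import subprocess

def start_proxy(setup_database):
    """Hypothetical caller that distinguishes fail-fast from transient errors."""
    try:
        ok = setup_database(use_migrate=True)
    except RuntimeError as e:
        # Non-idempotent migration failure: abort with the actionable message
        return f"startup aborted: {e}"
    except (subprocess.CalledProcessError, subprocess.TimeoutExpired) as e:
        return f"transient DB error after retries: {e}"
    return "started" if ok else "setup returned False"

def failing_setup(use_migrate=False):
    # Simulates the new fail-fast path
    raise RuntimeError("Migration 20250329_x failed and requires manual intervention")

msg = start_proxy(failing_setup)
print(msg)
```

Making the RuntimeError part of the documented contract (or wrapping it in a dedicated exception type) would make this handling less fragile.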
"duplicate key value violates" on CREATE UNIQUE INDEX means the index can't be created due to duplicate data — this is NOT idempotent. Marking it as applied silently skips index creation, leaving the DB without the unique constraint. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
```diff
@@ -354,10 +273,145 @@ def _resolve_all_migrations(
                     )
                     logger.debug(f"Resolved migration: {migration_name}")
                 except subprocess.CalledProcessError as e:
-                    if "is already recorded as applied in the database." not in e.stderr:
-                        logger.warning(
-                            f"Failed to resolve migration {migration_name}: {e.stderr}"
+                    if "is already recorded as applied in the database." in e.stderr:
+                        logger.debug(
+                            f"Migration {migration_name} already recorded as applied"
                         )
+                    else:
+                        raise RuntimeError(
+                            f"Failed to mark migration {migration_name} as applied: {e.stderr}"
+                        ) from e
```
_mark_all_migrations_applied is dead production code
This method is defined, documented, and covered by four new tests, but it is never called in any production code path. The PR description states "P3005 (non-empty DB, first migration) now uses _mark_all_migrations_applied + prisma migrate deploy", but the actual implementation uses _create_baseline_migration (which marks only 0_init as applied) followed by _deploy_with_idempotent_resolution.
A quick grep confirms no call site exists outside of tests:
```shell
grep -n "_mark_all_migrations_applied" utils.py
# 245: def _mark_all_migrations_applied(migrations_dir: str):  ← definition only
```
The intended semantics — mark all migrations as already reflected in the DB — are important to document correctly. If this method is meant as a utility for future use, a clear # NOTE: not currently called in production; see _create_baseline_migration for the P3005 path comment would help. If it is genuinely needed in the P3005 flow, it should be wired in; otherwise, the tests are covering a code path that never executes in production.
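The globbing step that the method performs can be sketched in isolation. This is a hedged illustration assuming the directory layout described in the review (`migrations_dir/migrations/<name>/migration.sql`); the `prisma migrate resolve --applied` subprocess call is replaced by simple name collection so the sketch runs without Prisma:

```python
import pathlib
import tempfile

def collect_migration_names(migrations_dir: str) -> list:
    """Collect migration folder names the way the glob in the review describes."""
    root = pathlib.Path(migrations_dir) / "migrations"
    return sorted(p.parent.name for p in root.glob("*/migration.sql"))

# Build a throwaway migrations tree with a baseline and a newer migration
with tempfile.TemporaryDirectory() as d:
    for name in ("0_init", "20250401_add_new_feature"):
        mig = pathlib.Path(d) / "migrations" / name
        mig.mkdir(parents=True)
        (mig / "migration.sql").write_text("-- sql")
    print(collect_migration_names(d))  # ['0_init', '20250401_add_new_feature']
```

The sketch makes the review's concern concrete: the glob cannot distinguish migrations already reflected in the DB from brand-new ones, so marking everything it finds as applied would skip the new migration's SQL.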
P3005 fires only during the db-push-to-migrate transition, when the _prisma_migrations table doesn't exist yet. Since db push already kept the schema in sync, all shipped migrations are already reflected in the DB. Mark them all as applied (standard Prisma baselining) instead of running deploy-with-resolution, which would fail on pre-existing data issues like duplicate rows preventing unique index creation. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
```python
    return_value=["20250326162113_baseline"],
)
def test_skips_already_applied_migration(self, mock_get_names, mock_run):
    """Verify already-applied migrations are silently skipped"""
    mock_run.side_effect = subprocess.CalledProcessError(
        1,
        "prisma",
        stderr="Migration `20250326162113_baseline` is already recorded as applied in the database.",
    )
```
Tests assert exact Prisma command without mocking _get_prisma_command()
Multiple tests assert against a hardcoded `["prisma", "migrate", "deploy"]` array:

```python
assert cmd == ["prisma", "migrate", "deploy"]
```

`_get_prisma_command()` is environment-sensitive — when PRISMA_OFFLINE_MODE=true is set (a valid production and potentially CI configuration), it returns a full binary path like `/app/.cache/prisma-python/binaries/node_modules/.bin/prisma` or a PRISMA_CLI_PATH override instead of `"prisma"`. These assertions would then fail despite the production code working correctly.
The same pattern appears in test_p3009_non_idempotent_raises_runtime_error (line 222), test_p3009_idempotent_redeploys (line 341), and test_successful_deploy (line 343). Patching _get_prisma_command to return a fixed value in each test class would make the assertions environment-agnostic:
```python
@patch("litellm_proxy_extras.utils._get_prisma_command", return_value="prisma")
```
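The effect of pinning the command can be shown with a self-contained stand-in. A hedged sketch: the `utils` class below imitates the real `litellm_proxy_extras.utils` module, and the binary path is an invented example of an environment-dependent value:

```python
from unittest import mock

class utils:  # stand-in for the litellm_proxy_extras.utils module
    @staticmethod
    def _get_prisma_command():
        # In production this varies with PRISMA_OFFLINE_MODE / PRISMA_CLI_PATH
        return "/app/.cache/prisma-python/.bin/prisma"

    @staticmethod
    def build_deploy_cmd():
        return [utils._get_prisma_command(), "migrate", "deploy"]

# Patching pins the executable name, so the command-array assertion no longer
# depends on the environment the tests happen to run in.
with mock.patch.object(utils, "_get_prisma_command", return_value="prisma"):
    assert utils.build_deploy_cmd() == ["prisma", "migrate", "deploy"]
print("assertion is environment-agnostic under the patch")
```

Outside the `with` block the original (environment-sensitive) behavior is restored, which is exactly why the unpatched tests are fragile.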
Relevant issues
Reported by a user who experienced schema thrashing during rolling deployments due to the migration system's diff-and-apply mechanism.
Summary
Failure Path (Before Fix)
`_resolve_all_migrations` was called after every successful `prisma migrate deploy` as a "sanity check." This method:

- generated a diff between the live database and `schema.prisma`
- force-applied that diff via `prisma db execute`, bypassing Prisma's migration tracking

During rolling deployments, if version A and version B had different `schema.prisma` files, each startup would force-push its own schema over the other's, causing a thrashing loop. Additionally, non-idempotent P3009 migration failures were silently retried instead of failing fast.

Fix

- Removed `_resolve_all_migrations` and the post-deploy "sanity check" call — if `prisma migrate deploy` succeeds, the DB is already correct
- Added `_mark_all_migrations_applied`, which only updates the `_prisma_migrations` tracking table without generating or applying diffs
- P3005 handler now uses `_mark_all_migrations_applied` + `prisma migrate deploy` instead of the diff mechanism
- P3009 non-idempotent errors now raise `RuntimeError` with actionable instructions instead of silently retrying

Testing

- New tests (`tests/litellm-proxy-extras/test_litellm_proxy_extras_utils.py`):
  - `TestMarkAllMigrationsApplied`: verifies the new method marks migrations without generating/applying diffs, and skips already-applied migrations
  - `TestSetupDatabaseFailFast`: verifies P3009 non-idempotent errors raise `RuntimeError`, and successful deploys don't call diff/resolve logic
- Tests that exercised `_resolve_all_migrations` or depend on the removed behavior were removed

Type
🐛 Bug Fix
✅ Test