fix: minimize the occurrence of deadlocks#15281

Merged
krrishdholakia merged 2 commits into BerriAI:main from CAFxX:cafxx-avoid-deadlock
Oct 24, 2025
Conversation

@CAFxX
Contributor

@CAFxX CAFxX commented Oct 7, 2025

Title

Attempt to avoid deadlocks during upserts in daily spend tables.

In our deployments we are observing repeated deadlocks under load that surface as exceptions. The documentation suggests adding Redis to accumulate writes and, primarily, to avoid concurrent write transactions; however, with the appropriate care the database should be able to handle this workload even without Redis and the write coordination (which, FWIW, would not require Redis to begin with, because RDBMSes normally provide some mechanism to emulate a semaphore/mutex1).
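As a sketch of the footnoted alternative (not part of this PR): PostgreSQL advisory locks take a signed 64-bit key, so a stable key can be derived from a lock name. The `lock_key_for` helper and the usage outline below are illustrative assumptions, not LiteLLM code:

```python
import hashlib
import struct

def lock_key_for(name: str) -> int:
    """Derive a stable signed 64-bit key for pg_advisory_lock from a name."""
    digest = hashlib.sha256(name.encode("utf-8")).digest()
    # pg_advisory_lock takes a bigint; interpret the first 8 bytes as signed.
    return struct.unpack(">q", digest[:8])[0]

# Illustrative usage with any async PostgreSQL driver:
#   key = lock_key_for("daily_spend_upsert")
#   await conn.execute("SELECT pg_advisory_lock($1)", key)
#   try:
#       ... perform the batched upserts ...
#   finally:
#       await conn.execute("SELECT pg_advisory_unlock($1)", key)
```

MySQL's GET_LOCK works analogously but takes the lock name as a string directly, so no key derivation is needed there.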

Relevant issues

Pre-Submission checklist

Please complete all items before asking a LiteLLM maintainer to review your PR

  • I have added testing in the tests/litellm/ directory. Adding at least 1 test is a hard requirement - see details
  • I have added a screenshot of my new test passing locally
  • My PR passes all unit tests on make test-unit
  • My PR's scope is as isolated as possible, it only solves 1 specific problem

Type

🐛 Bug Fix
🧹 Refactoring
✅ Test

Changes

The main change is providing a consistent ordering across transactions for the rows being updated, which is the textbook solution to deadlocks. The actual ordering does not matter much (it matters for data locality, but that depends on other factors as well); what matters is ensuring that all concurrent writers use the same sort order.
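A minimal sketch of the idea (the key fields here are assumptions for illustration, not the exact schema the PR touches): sorting the queued rows by a deterministic composite key before issuing the upserts guarantees that any two concurrent batches acquire row locks in the same order, so no lock cycle can form between them:

```python
def order_rows_for_upsert(rows):
    """Sort queued daily-spend rows by a deterministic composite key.

    Two transactions that upsert overlapping sets of rows will now lock
    those rows in the same order, which prevents lock cycles (and
    therefore deadlocks) between them.
    """
    return sorted(rows, key=lambda r: (r["date"], r["entity_id"], r["model"]))

rows = [
    {"date": "2025-10-07", "entity_id": "user2", "model": "gpt-4"},
    {"date": "2025-10-07", "entity_id": "user1", "model": "gpt-4"},
    {"date": "2025-10-06", "entity_id": "user1", "model": "gpt-4"},
]
ordered = order_rows_for_upsert(rows)  # oldest date first, then by entity_id
```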

A smaller secondary change adds a bit of randomness (jitter) to the wait durations before retries. When two or more transactions deadlock, the database aborts all of the deadlocked transactions at the same time. If the delay before retry is the same for every aborted transaction, as it currently can be, they will all retry at roughly the same time, which, especially under load, is likely to lead to further deadlocks. With this change we instead wait a random amount of time (still with exponential backoff) before retrying.
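A sketch of the jittered exponential backoff described above (the base delay and cap values are illustrative assumptions, not the ones used in the PR):

```python
import random

def retry_delay(attempt: int, base: float = 0.05, cap: float = 5.0) -> float:
    """Exponential backoff with full jitter.

    The upper bound doubles with each attempt (capped at `cap`), and the
    actual delay is drawn uniformly from [0, bound], so transactions that
    were aborted together are unlikely to retry at the same moment.
    """
    bound = min(cap, base * (2 ** attempt))
    return random.uniform(0.0, bound)
```

Drawing the whole delay from the jittered range (rather than adding a small jitter on top of a fixed delay) maximally decorrelates the retry times of the aborted transactions.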

The other formatting changes in the modified files were produced by make format.

One test is added to check that the queued rows are sorted before being sent to the database:

tests/test_litellm/proxy/db/test_db_spend_update_writer.py::test_update_daily_spend_sorting 
[gw0] [ 91%] PASSED tests/test_litellm/proxy/db/test_db_spend_update_writer.py::test_update_daily_spend_sorting 

https://github.com/BerriAI/litellm/actions/runs/18333952249/job/52214304926?pr=15281#step:8:6926

Some other tests are failing but they seem unrelated.

Footnotes

  1. e.g. GET_LOCK in MySQL and pg_advisory_lock in PostgreSQL

@vercel

vercel bot commented Oct 7, 2025

@CAFxX is attempting to deploy a commit to the CLERKIEAI Team on Vercel.

A member of the Team first needs to authorize it.

@CLAassistant

CLAassistant commented Oct 7, 2025

CLA assistant check
All committers have signed the CLA.

@CAFxX CAFxX force-pushed the cafxx-avoid-deadlock branch 2 times, most recently from 1771e42 to f735257 Compare October 7, 2025 15:31
@JehandadK
Contributor

Error log for deadlocks:

```
Alert type: db_exceptions
Level: High
Timestamp: 07:30:38
Message: DB read/write call failed: Error occurred during query execution:
ConnectorError(ConnectorError { user_facing_error: None, kind: QueryError(PostgresError { code: "40P01", message: "deadlock detected", severity: "ERROR", detail: Some("Process 1837691 waits for ShareLock on transaction 2775259; blocked by process 1837653.\nProcess 1837653 waits for ShareLock on transaction 2775258; blocked by process 1837691."), column: None, hint: Some("See server log for query details.") }), transient: false })
[Non-Blocking]LiteLLM Prisma Client Exception - update spend logs: Error occurred during query execution:
ConnectorError(ConnectorError { user_facing_error: None, kind: QueryError(PostgresError { code: "40P01", message: "deadlock detected", severity: "ERROR", detail: Some("Process 1837691 waits for ShareLock on transaction 2775259; blocked by process 1837653.\nProcess 1837653 waits for ShareLock on transaction 2775258; blocked by process 1837691."), column: None, hint: Some("See server log for query details.") }), transient: false })
Traceback (most recent call last):
  File "/usr/lib/python3.13/site-packages/litellm/proxy/db/db_spend_update_writer.py", line 862, in _update_daily_spend
    async with prisma_client.db.batch_() as batcher:
               ~~~~~~~~~~~~~~~~~~~~~~~^^
  File "/usr/lib/python3.13/site-packages/prisma/client.py", line 844, in __aexit__
    await self.commit()
  File "/usr/lib/python3.13/site-packages/prisma/client.py", line 820, in commit
    await self.__clien
```

@CAFxX CAFxX force-pushed the cafxx-avoid-deadlock branch 5 times, most recently from 17f1257 to c7fd3f4 Compare October 8, 2025 04:37
@CAFxX CAFxX marked this pull request as ready for review October 8, 2025 05:32
@CAFxX CAFxX force-pushed the cafxx-avoid-deadlock branch from c7fd3f4 to 83674d9 Compare October 8, 2025 05:41
@CAFxX CAFxX force-pushed the cafxx-avoid-deadlock branch from 83674d9 to 8e32bac Compare October 8, 2025 05:45
@CAFxX CAFxX changed the title attempt to avoid/minimize deadlocks fix: attempt to minimize the occurrence of deadlocks Oct 9, 2025
@CAFxX CAFxX changed the title fix: attempt to minimize the occurrence of deadlocks fix: minimize the occurrence of deadlocks Oct 9, 2025
@CAFxX
Contributor Author

CAFxX commented Oct 9, 2025

I think this is ready to be looked at. CONTRIBUTING.md only says I should wait for review without doing anything else, but I am unsure whether that is up to date (this is my first contribution here).

@CAFxX
Contributor Author

CAFxX commented Oct 16, 2025

@TeddyAmkie this is the PR I was talking about on Slack. It got a bit stale in the meantime, will rebase ASAP.

@berri-teddy
Contributor

> @TeddyAmkie this is the PR I was talking about on Slack. It got a bit stale in the meantime, will rebase ASAP.

Thanks! Bumping to team

@krrishdholakia krrishdholakia merged commit 8b14241 into BerriAI:main Oct 24, 2025
4 of 7 checks passed
@cursor cursor bot left a comment

This PR is being reviewed by Cursor Bugbot

```python
)

# Verify that table.upsert was called
mock_table.upsert.assert_has_calls(upsert_calls)
```

Bug: Test Fails Due to Lexicographic User ID Sorting

The test_update_daily_spend_sorting test expects upsert calls to be made in a specific order (user11 through user60), but the sorting logic in the code sorts by user_id lexicographically as strings, not numerically. Lexicographic sorting of "user11", "user12", ..., "user19", "user2", "user20", etc. will produce: user11, user12, ..., user19, user2, user20, ..., user29, user3, user30, ..., user60. This does not match the expected order in the test. The test should either pass any_order=True parameter to assert_has_calls() to ignore order, or the expected calls list should be corrected to match the actual lexicographic sort order.


Contributor Author

@CAFxX CAFxX Oct 27, 2025
This is not true: since the IDs run from 11 to 60, the lexicographic order obviously coincides with the numeric order. And passing any_order=True to assert_has_calls would completely defeat the very purpose of the test.
Also, FWIW, the test passes.
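The claim above is easy to verify: for the two-digit range 11–60 the string sort and the numeric sort agree, because every suffix has the same number of digits (this snippet is illustrative, not part of the test suite):

```python
user_ids = [f"user{i}" for i in range(11, 61)]

# Lexicographic sort of the strings matches the numeric order,
# because every numeric suffix has exactly two digits.
assert sorted(user_ids) == user_ids

# By contrast, a range mixing digit lengths would diverge:
mixed = [f"user{i}" for i in range(2, 13)]
assert sorted(mixed) != mixed  # "user10" sorts before "user2"
```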
