fix: minimize the occurrence of deadlocks#15281
Conversation
I think this is ready to be looked at. CONTRIBUTING.md only says I should wait for review without doing anything else, but I am unsure whether that is up to date (this is my first contribution here).
@TeddyAmkie this is the PR I was talking about on Slack. It got a bit stale in the meantime; will rebase ASAP.
Thanks! Bumping to team |
This PR is being reviewed by Cursor Bugbot
    # Verify that table.upsert was called
    mock_table.upsert.assert_has_calls(upsert_calls)
Bug: Test Fails Due to Lexicographic User ID Sorting
The test_update_daily_spend_sorting test expects upsert calls to be made in a specific order (user11 through user60), but the sorting logic in the code sorts by user_id lexicographically as strings, not numerically. Lexicographic sorting of "user11", "user12", ..., "user19", "user2", "user20", etc. will produce: user11, user12, ..., user19, user2, user20, ..., user29, user3, user30, ..., user60. This does not match the expected order in the test. The test should either pass the any_order=True parameter to assert_has_calls() to ignore order, or the expected calls list should be corrected to match the actual lexicographic sort order.
This is not true: since every ID runs from user11 to user60 (all two-digit suffixes), the lexicographic order is the same as the numeric order. And passing any_order=True to assert_has_calls would completely defeat the very purpose of the test.
Also FWIW, the test passes.
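The equivalence is easy to check mechanically. A minimal sketch (standalone, not taken from the test file itself):

```python
# All IDs in the test have two-digit suffixes (11..60), so comparing the
# strings character by character compares the digits in order: lexicographic
# sort and numeric sort agree.
ids = [f"user{i}" for i in range(11, 61)]

# The list, generated in numeric order, is already in lexicographic order.
assert sorted(ids) == ids
```

The bot's concern would only apply if the IDs mixed suffix lengths (e.g. "user2" vs "user11"), which this test deliberately avoids.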
Title
Attempt to avoid deadlocks during upserts in daily spend tables.
In our deployments we are observing repeated deadlocks that terminate in exceptions under load. The documentation suggests adding Redis to accumulate writes and, primarily, to avoid concurrent write transactions; however, with appropriate care the database should be able to handle this workload even without Redis and the write coordination (which, FWIW, would not require Redis to begin with, because RDBMSes normally provide some mechanism to emulate a semaphore/mutex1).
Relevant issues
Pre-Submission checklist
Please complete all items before asking a LiteLLM maintainer to review your PR
- I have added at least 1 test in the tests/litellm/ directory (a hard requirement - see details)
- My PR passes all unit tests on make test-unit
Type
🐛 Bug Fix
🧹 Refactoring
✅ Test
Changes
The main change is providing a consistent ordering, across transactions, for the rows being updated, which is the textbook solution to deadlocks. The actual ordering does not matter much (it matters for data locality, but that depends on other factors as well); what matters is ensuring that all concurrent writes use the same sort order.
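The idea can be sketched as follows (illustrative only: the function, key, and schema below are assumptions, not the actual LiteLLM code):

```python
# Sketch: every writer sorts its pending rows by the same key before
# upserting. Any two transactions touching the same rows then acquire
# the row locks in the same order, so neither can hold a lock the other
# needs while waiting on one the other holds -- no deadlock cycle.
def upsert_in_consistent_order(table, pending: dict) -> None:
    # 'pending' maps user_id -> accumulated spend delta (hypothetical shape).
    for user_id in sorted(pending):
        table.upsert(
            where={"user_id": user_id},
            data={"spend": pending[user_id]},
        )
```

Without the sort, one transaction might lock user_a then user_b while another locks user_b then user_a, producing exactly the lock cycle the database has to break by aborting one of them.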
A smaller secondary change is adding a bit of randomness to the wait durations before retries. When two or more transactions deadlock, the database aborts all of the deadlocked transactions at the same time. If the delay before retry is the same for all involved transactions, as can currently happen, they will all be retried at roughly the same time, which, especially under load, makes further deadlocks more likely. With this change we instead wait a random amount of time (still with exponential backoff) before retrying.
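A minimal sketch of the retry policy, assuming full jitter on top of exponential backoff (the function name and delay constants are illustrative, not the PR's exact values):

```python
import random
import time


def retry_with_jitter(op, attempts=5, base=0.05, cap=2.0):
    """Run 'op', retrying on failure with exponential backoff plus full
    jitter. Transactions aborted together thus pick different delays and
    do not retry in lockstep."""
    for attempt in range(attempts):
        try:
            return op()
        except Exception:
            if attempt == attempts - 1:
                raise
            # Full jitter: uniform in [0, min(cap, base * 2**attempt)],
            # rather than a deterministic base * 2**attempt delay.
            time.sleep(random.uniform(0, min(cap, base * 2 ** attempt)))
```

The key property is that two aborted transactions almost never sleep for the same duration, so their retries are spread out instead of colliding again.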
Other formatting changes to the files being modified were added by make format.
One test is added to check that the queued rows are sorted before being sent to the database:
https://github.com/BerriAI/litellm/actions/runs/18333952249/job/52214304926?pr=15281#step:8:6926
Some other tests are failing but they seem unrelated.
Footnotes
1. e.g. GET_LOCK in MySQL and pg_advisory_lock in PostgreSQL