Fix starvation with async server and interleaving optimization #13
Conversation
JoeZijunZhou commented Mar 15, 2024 (edited)
- Make the server async to handle high throughput from the client (the throughput of a sync server is limited by the number of server-side threads, since each thread handles one request in a blocking manner). See the channel sketch after this list.
- Add an AsyncMultiFuture data structure for the return_channel to stream response tokens asynchronously (thread safety ensures no tokens are dropped in the async server).
- Use blocking fixed-size queues to block and yield threads efficiently:
  - blocking get and put operations for the generate and detokenize queues
  - blocking get operation for the prefill queue
  - non-blocking put operation for the prefill queue (unbounded queue size, so all requests from the client can be enqueued)
- Optimize the interleaving of prefill, insert, and generate (see the interleaving sketch after this list):
  - When decode slots are empty, yield to prefill, then insert as many prefill results as possible into the decode slots.
  - When decode slots are full, block insert and prefill; keep decoding and detokenizing until some decode completes and frees a slot.
  - Set the generate queue size to 3x the decode batch size:
    - saturates the decode slots quickly (a larger generate queue ensures there are enough requests to insert into the slots)
    - avoids OOM when the decode slots and the generate queue are both saturated (the generate queue can't be too large, otherwise too many prefill results pile up in it)
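A minimal sketch of the AsyncMultiFuture idea (illustrative only; the class and method names below are placeholders, not the actual implementation): worker threads publish tokens from any thread via `call_soon_threadsafe`, and the async handler consumes them without dropping any.

```python
import asyncio


class TokenChannel:
    """Sketch of an AsyncMultiFuture-style return channel.

    Worker threads call add_result()/close() from any thread; the async
    server iterates the channel to stream tokens back to the client.
    """

    _DONE = object()  # sentinel marking end-of-stream

    def __init__(self) -> None:
        # Must be constructed on the event loop thread so we can capture it.
        self._loop = asyncio.get_running_loop()
        self._queue: asyncio.Queue = asyncio.Queue()

    def add_result(self, token: str) -> None:
        # Thread-safe: hop onto the event loop before touching asyncio
        # state, so no token is dropped even under heavy producer load.
        self._loop.call_soon_threadsafe(self._queue.put_nowait, token)

    def close(self) -> None:
        self._loop.call_soon_threadsafe(self._queue.put_nowait, self._DONE)

    def __aiter__(self):
        return self

    async def __anext__(self) -> str:
        item = await self._queue.get()
        if item is self._DONE:
            raise StopAsyncIteration
        return item
```

An async handler can then stream with `async for token in channel: yield token`, so the server thread count no longer bounds the number of in-flight requests.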
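And a hedged sketch of the queue wiring and interleaving policy described above. The queue names, the decode batch size, and the model-call stubs (`run_prefill`, `insert_into_slot`, `run_decode_step`, `completed_slots`) are all placeholders, not the actual JetStream identifiers.

```python
import queue

DECODE_BATCH = 32  # number of decode slots; illustrative value

# Unbounded: put() never blocks, so every client request can be enqueued.
prefill_queue: queue.Queue = queue.Queue()
# Bounded at 3x the decode batch: keeps decode slots saturated while
# capping how many prefill results can pile up (avoids OOM).
generate_queue: queue.Queue = queue.Queue(3 * DECODE_BATCH)
# Bounded at 8: decode never runs more than 8 steps ahead of detokenize.
detokenize_queue: queue.Queue = queue.Queue(8)


def run_prefill(request): ...  # placeholder model call
def insert_into_slot(slot, prefix): ...  # placeholder
def run_decode_step(): ...  # placeholder model call
def completed_slots(step_result): return []  # placeholder


def prefill_worker() -> None:
    while True:
        request = prefill_queue.get()  # blocks (and yields) until work arrives
        prefix = run_prefill(request)
        generate_queue.put(prefix)  # blocks when decode is saturated


def generate_worker() -> None:
    free_slots = list(range(DECODE_BATCH))
    while True:
        # Insert as many pending prefill results as there are free slots.
        while free_slots:
            try:
                prefix = generate_queue.get_nowait()
            except queue.Empty:
                break  # no prefill results ready; keep decoding
            insert_into_slot(free_slots.pop(), prefix)
        step_result = run_decode_step()
        detokenize_queue.put(step_result)  # blocks after 8 outstanding steps
        free_slots.extend(completed_slots(step_result))  # reclaim finished slots
```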
This CR cannot merge. It is driving contention (probably GIL), causing stalls.
We cannot merge these changes because they drive a duty cycle migration.
@@ -0,0 +1,89 @@
import asyncio
Where is this module used? I can't find any references.
Can you please add an xprof with the new changes so we can compare against the GIL-contention xprof that Rafi provided and rule out any contention issues?
Added an xprof in https://docs.google.com/document/d/1PWje3lup0ZHjtgbcWQbK8r1nNwgKE2rHBdVLbWU2aas/edit?tab=t.0#heading=h.f39bxdhl45e2 (since this thread is public, it's only shared in the internal doc).
Thanks for the stack trace, the fix looks good. I had a quick question about the detokenize queue:
# We don't let detokenization accumulate more than 8 steps to avoid
# synchronization issues.
queue.Queue(8)
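For context, a minimal standalone illustration (not project code) of the backpressure this bound provides: the decode thread blocks on `put()` once 8 items are outstanding, until detokenization drains the queue.

```python
import queue
import threading
import time

q: queue.Queue = queue.Queue(8)


def decode_thread() -> None:
    for step in range(12):
        q.put(step)  # blocks once 8 steps are outstanding
        print(f"decode step {step} enqueued")


threading.Thread(target=decode_thread, daemon=True).start()
time.sleep(0.5)  # decode races ahead until the queue fills at 8
for _ in range(12):
    print(f"detokenized step {q.get()}")  # each get() unblocks the producer
```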
How did you come up with 8? Is that hardware-dependent -- will this work for all configs (v5e-8, v5e-1, etc.)?
The JET has an investigation into the GIL contention issue and the detokenize queue size, and figured out that this size avoids the synchronization issue. It isn't hardware-dependent.
LGTM
Issue resolved and code reviewed.