Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix starvation with async server and interleaving optimization #13

Merged
merged 9 commits into from
Mar 27, 2024

Conversation

JoeZijunZhou
Copy link
Collaborator

@JoeZijunZhou JoeZijunZhou commented Mar 15, 2024

  • Make server async to handle high throughput from client (the number of requests a sync server can handle is limited by server side thread number, since each thread handles a request in blocking manner).
  • Add AsyncMultiFuture data structure for the retuern_channel to stream response tokens asynchronously (threadsafe ensures no token drop in async server)
  • Using blocking fixed size queue to block and yield threads efficiently
    • blocking for get and put operations for generate and detokenize queue
    • blocking for prefill queue get operation
    • unblocking for prefill queue put operation (unbound queue size to enqueue all requests from client)
  • Optimized interleaving prefill, insert, and generate
    • When decode slots are empty, yield to prefill, then insert the prefill results as many as possible to the decode slots
    • When decode slots are full, block insert and prefill, keep decoding and detokenizing until some deocde completes and returns available slot.
    • generate queue size set to 3/1 decode batch size
      • get decode slots saturated fast (larger generate queue size ensures it has enough requests to insert to slots)
      • avoid OOM when both decode slots and generate queue are both saturated (can't set generate queue size too large, otherwise too many prefill results enqueued in generate queue)

@rwitten
Copy link
Collaborator

rwitten commented Mar 17, 2024

This CR cannot merge. It is driving contention (probably GIL) causing stalls.
https://xprof.corp.google.com/trace_viewer/rwitten-7572422995686954905?host_index=0

@rwitten rwitten self-requested a review March 17, 2024 18:09
Copy link
Collaborator

@rwitten rwitten left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We cannot merge these changes because they drive a duty cycle migration.

@JoeZijunZhou JoeZijunZhou changed the title Fix starvation with async server Fix starvation with async server and interleaving optimization Mar 25, 2024
@@ -0,0 +1,89 @@
import asyncio
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Where is this module used? I can't find any references.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@vipannalla
Copy link
Collaborator

Can you please add a xprof with the new changes so we can compare against the GIL-contention xprof that Rafi provided and rule-out any contention issues?

@JoeZijunZhou
Copy link
Collaborator Author

Can you please add a xprof with the new changes so we can compare against the GIL-contention xprof that Rafi provided and rule-out any contention issues?

Added a xprof in https://docs.google.com/document/d/1PWje3lup0ZHjtgbcWQbK8r1nNwgKE2rHBdVLbWU2aas/edit?tab=t.0#heading=h.f39bxdhl45e2 (Since here is public, just shared in the internal doc).

Copy link
Collaborator

@vipannalla vipannalla left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for stacktrace, fix looks good. I had a quick question about detokenize queue

Comment on lines +264 to +266
# We don't let detokenization accumulate more than 8 steps to avoid
# synchronization issues.
queue.Queue(8)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How did you comeup with 8? Is that hardware dependent -- will this work for all configs (v5e-8, v5e-1 etc?)

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The JET has an investigation on GIL contention issue and detokenize queue size, and figured out that this size can avoid synchronization issue. It's not dependent to hardware.

Copy link
Collaborator

@vipannalla vipannalla left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@JoeZijunZhou JoeZijunZhou dismissed rwitten’s stale review March 27, 2024 00:02

Issue resolved and code reviewed.

@JoeZijunZhou JoeZijunZhou merged commit 64b1e50 into main Mar 27, 2024
3 checks passed
@JoeZijunZhou JoeZijunZhou deleted the zijun/fix-starvation branch March 27, 2024 00:03
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants