[BOUNTY - $500] Pipeline Parallel Inference #4
Comments
I'd like to work on this.
That would be excellent! I can help here and on Discord with any questions or issues you have.
Hi there, I was taking a look at what it would take to make this work and did some testing. I found that when you start two chat sessions and run inference at the same time, they interfere with each other and tokens from the two sessions bleed into each other. See the two last messages: the one on the left hangs after a while; the right one finishes but is also gibberish. Does this reproduce on your end? I think fixing session isolation might need to precede pipeline parallelism?
@the-alex-b Very interesting - you're totally right, we should fix session isolation first. That makes sense, since both sessions would share the same KV caches (it's stateful). This can still be part of the same bounty.
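For illustration only, here is a minimal sketch of the kind of per-session KV-cache isolation discussed above. The names (`SessionCacheStore`, `decode_step`) are hypothetical and not part of exo's actual API; the point is simply that each session id maps to its own cache, so concurrent chat sessions cannot write into each other's attention state.

```python
# Minimal sketch of per-session KV-cache isolation (hypothetical names; not exo's actual API).
# Each chat session gets its own cache keyed by a session id, so concurrent requests never
# read or write each other's attention state.
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class KVCache:
    """Cached key/value state for one session (represented here as plain lists)."""
    keys: List[float] = field(default_factory=list)
    values: List[float] = field(default_factory=list)


class SessionCacheStore:
    """Maps session ids to independent KV caches instead of sharing one global cache."""

    def __init__(self) -> None:
        self._caches: Dict[str, KVCache] = {}

    def get(self, session_id: str) -> KVCache:
        # Lazily create an isolated cache the first time a session appears.
        return self._caches.setdefault(session_id, KVCache())

    def evict(self, session_id: str) -> None:
        # Free the cache when the chat session ends.
        self._caches.pop(session_id, None)


def decode_step(store: SessionCacheStore, session_id: str, token: float) -> Tuple[float, float]:
    """Stand-in for one decode step: appends only to this session's cache."""
    cache = store.get(session_id)
    cache.keys.append(token)
    cache.values.append(token * 2.0)
    return cache.keys[-1], cache.values[-1]


if __name__ == "__main__":
    store = SessionCacheStore()
    decode_step(store, "session-a", 1.0)
    decode_step(store, "session-b", 9.0)
    # The two sessions keep separate state, so tokens cannot bleed between them.
    assert store.get("session-a").keys != store.get("session-b").keys
```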
Hi @AlexCheema, |
Hey @pranav4501, can you also DM me on Discord so we can find a good task for you? I can update the bounties with something you'd be interested to work on, as there aren't that many left now!
Hi @AlexCheema, |
Hello, can we update the GSheet to denote this is taken (if it is, which it seems to be)? cc @AlexCheema [apologies for the pings]
That PyTorch page is giving me a 404. Is the idea here to be able to process multiple separate requests at once, or to have a batch API that accepts multiple requests in one API call?
Hey! First off, I love this project - props for the great work! And I love your mission! I think I did implement this for my MS thesis; you can find it here. The whole point of it was to show how maximizing GPU utilization by introducing pipeline parallelism at inference time makes it possible to serve multiple requests efficiently. The code I wrote is definitely not production-ready (more like tomato sauce-ready, considering the amount of spaghetti code), and I have to say I haven't dived deep into the exo codebase yet, but if someone is working on it, maybe I could help out (or just be another pair of eyes for debugging). Let me know if this sounds good; I'd be super glad to be part of this!
Your thesis is interesting. We're working on this issue for exo v2.
Prerequisite: #1
Motivation: exo should use device resources as efficiently as possible. The current implementation underutilises the available resources.
What: See https://pytorch.org/docs/stable/pipeline.html. A minimal sketch of the idea is included below.
Reward: $500 bounty, paid out in USDC on Ethereum; email [email protected].
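Since the linked PyTorch page now returns a 404 (per the comment above), here is a minimal, framework-free sketch of the pipelining idea, under assumed, hypothetical names (`run_pipeline`, `Stage`); it is not exo's or PyTorch's implementation. The model is split into sequential stages (as it would be across devices), each stage runs in its own worker thread, and multiple requests stay in flight at once, so a later stage can process request A while an earlier stage is already handling request B.

```python
# Minimal sketch of pipeline-parallel inference (illustrative only; names are hypothetical).
# Stages stand in for model shards placed on different devices; queues connect adjacent
# stages so several requests flow through the pipeline concurrently.
import queue
import threading
from typing import Callable, List, Optional

Stage = Callable[[float], float]


def run_pipeline(stages: List[Stage], inputs: List[float]) -> List[float]:
    """Push inputs through the stage pipeline with one worker thread per stage."""
    # One queue between every pair of adjacent stages, plus input and output queues.
    queues: List[queue.Queue] = [queue.Queue() for _ in range(len(stages) + 1)]
    results: List[Optional[float]] = [None] * len(inputs)

    def worker(stage: Stage, q_in: queue.Queue, q_out: queue.Queue) -> None:
        while True:
            item = q_in.get()
            if item is None:          # Sentinel: no more work; propagate and shut down.
                q_out.put(None)
                return
            idx, value = item
            q_out.put((idx, stage(value)))

    threads = [
        threading.Thread(target=worker, args=(s, queues[i], queues[i + 1]), daemon=True)
        for i, s in enumerate(stages)
    ]
    for t in threads:
        t.start()

    # Feed all requests; they overlap across stages instead of running one at a time.
    for idx, value in enumerate(inputs):
        queues[0].put((idx, value))
    queues[0].put(None)

    # Collect results from the final stage until the sentinel arrives.
    while True:
        item = queues[-1].get()
        if item is None:
            break
        idx, value = item
        results[idx] = value
    return [r for r in results if r is not None]


if __name__ == "__main__":
    # Two toy "model shards" standing in for layers placed on different devices.
    shards: List[Stage] = [lambda x: x + 1.0, lambda x: x * 2.0]
    print(run_pipeline(shards, [1.0, 2.0, 3.0]))  # [4.0, 6.0, 8.0]
```

In a real deployment the queues would be network links between devices and each stage would hold a shard of the model weights, but the scheduling idea is the same: keep every device busy on some request rather than leaving all but one idle.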