[BOUNTY - $500] Pipeline Parallel Inference #4
Comments
I'd like to work on this.
That would be excellent! I can help here and on Discord with any questions or issues you have.
Hi there, I was taking a look at what it would take to make this work and did some testing. I found that when you start two chat sessions and run inference at the same time, they interfere with each other and tokens from the two sessions bleed into each other. See the two last messages: the one on the left hangs after a while; the right one finishes but is also gibberish. Does this reproduce on your end? I think fixing session isolation might need to precede pipeline parallelism?
@the-alex-b Very interesting - you're totally right, we should fix session isolation first. That makes sense, since both sessions would share the same KV caches (it's stateful). This can still be part of the same bounty.
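For illustration only, here is a minimal sketch of the kind of per-session KV-cache isolation discussed above. The names (`SessionCacheStore`, `decode_step`) are hypothetical and not part of exo's actual API; the point is simply that each session id maps to its own cache, so concurrent chat sessions cannot write into each other's attention state.

```python
# Minimal sketch of per-session KV-cache isolation (hypothetical names; not exo's actual API).
# Each chat session gets its own cache keyed by a session id, so concurrent requests never
# read or write each other's attention state.
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class KVCache:
    """Cached key/value state for one session (represented here as plain lists)."""
    keys: List[float] = field(default_factory=list)
    values: List[float] = field(default_factory=list)


class SessionCacheStore:
    """Maps session ids to independent KV caches instead of sharing one global cache."""

    def __init__(self) -> None:
        self._caches: Dict[str, KVCache] = {}

    def get(self, session_id: str) -> KVCache:
        # Lazily create an isolated cache the first time a session appears.
        return self._caches.setdefault(session_id, KVCache())

    def evict(self, session_id: str) -> None:
        # Free the cache when the chat session ends.
        self._caches.pop(session_id, None)


def decode_step(store: SessionCacheStore, session_id: str, token: float) -> Tuple[float, float]:
    """Stand-in for one decode step: appends only to this session's cache."""
    cache = store.get(session_id)
    cache.keys.append(token)
    cache.values.append(token * 2.0)
    return cache.keys[-1], cache.values[-1]


if __name__ == "__main__":
    store = SessionCacheStore()
    decode_step(store, "session-a", 1.0)
    decode_step(store, "session-b", 9.0)
    # The two sessions keep separate state, so tokens cannot bleed between them.
    assert store.get("session-a").keys != store.get("session-b").keys
```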
Hi @AlexCheema, |
Hey @pranav4501, can you also DM me on Discord so we can find a good task for you? I can update the bounties with something you'd be interested to work on, as there aren't that many left now!
Hi @AlexCheema, |
Hello, can we update the GSheet to denote this is taken (if it is, which it seems to be)? cc @AlexCheema [apologies for the pings]
That PyTorch page is giving me a 404. Is the idea here to be able to process multiple separate requests at once, or to have a batch API that accepts multiple requests in one API call?
Hey! First off, I love this project - props for the great work! And I love your mission! I think I did implement this for my MS thesis; you can find it here. The whole point of it was to show how maximizing GPU utilization by introducing pipeline parallelism at inference time makes it possible to serve multiple requests efficiently. The code I wrote is definitely not production-ready (more like tomato sauce-ready, considering the amount of spaghetti code), and I have to say I haven't dived deep into the exo codebase yet, but if someone is working on it, maybe I could help out (or just be another pair of eyes for debugging). Let me know if this sounds good; I'd be super glad to be part of this!
Your thesis is interesting. We're working on this issue for exo v2.
Prerequisite: #1
Motivation: exo should use device resources as efficiently as possible. The current implementation underutilises the available resources.
What: See https://pytorch.org/docs/stable/pipeline.html. A minimal sketch of the idea is included below.
Reward: $500 bounty, paid out in USDC on Ethereum; email [email protected].
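Since the linked PyTorch page now returns a 404 (per the comment above), here is a minimal, framework-free sketch of the pipelining idea, under assumed, hypothetical names (`run_pipeline`, `Stage`); it is not exo's or PyTorch's implementation. The model is split into sequential stages (as it would be across devices), each stage runs in its own worker thread, and multiple requests stay in flight at once, so a later stage can process request A while an earlier stage is already handling request B.

```python
# Minimal sketch of pipeline-parallel inference (illustrative only; names are hypothetical).
# Stages stand in for model shards placed on different devices; queues connect adjacent
# stages so several requests flow through the pipeline concurrently.
import queue
import threading
from typing import Callable, List, Optional

Stage = Callable[[float], float]


def run_pipeline(stages: List[Stage], inputs: List[float]) -> List[float]:
    """Push inputs through the stage pipeline with one worker thread per stage."""
    # One queue between every pair of adjacent stages, plus input and output queues.
    queues: List[queue.Queue] = [queue.Queue() for _ in range(len(stages) + 1)]
    results: List[Optional[float]] = [None] * len(inputs)

    def worker(stage: Stage, q_in: queue.Queue, q_out: queue.Queue) -> None:
        while True:
            item = q_in.get()
            if item is None:          # Sentinel: no more work; propagate and shut down.
                q_out.put(None)
                return
            idx, value = item
            q_out.put((idx, stage(value)))

    threads = [
        threading.Thread(target=worker, args=(s, queues[i], queues[i + 1]), daemon=True)
        for i, s in enumerate(stages)
    ]
    for t in threads:
        t.start()

    # Feed all requests; they overlap across stages instead of running one at a time.
    for idx, value in enumerate(inputs):
        queues[0].put((idx, value))
    queues[0].put(None)

    # Collect results from the final stage until the sentinel arrives.
    while True:
        item = queues[-1].get()
        if item is None:
            break
        idx, value = item
        results[idx] = value
    return [r for r in results if r is not None]


if __name__ == "__main__":
    # Two toy "model shards" standing in for layers placed on different devices.
    shards: List[Stage] = [lambda x: x + 1.0, lambda x: x * 2.0]
    print(run_pipeline(shards, [1.0, 2.0, 3.0]))  # [4.0, 6.0, 8.0]
```

In a real deployment the queues would be network links between devices and each stage would hold a shard of the model weights, but the scheduling idea is the same: keep every device busy on some request rather than leaving all but one idle.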