exo currently implements Pipeline Parallel inference. This splits up layers of a model over multiple devices and executes them sequentially, device-by-device.
There are different ways we can split up the model layers. For this purpose, exo defines a `PartitioningStrategy` (see `exo/topology/partitioning_strategy.py`, lines 16-19 at commit `5e0db20`). It takes a `Topology` and returns a `List[Partition]`.
A `Partition` consists of `node_id`, `start` and `end`. Partitions must be contiguous ranges `[start, end)`: the first `start` must be 0 and the last `end` must be 1.
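The interface described above looks roughly like this. This is a sketch paraphrased from the description, not the real definitions in `exo/topology/partitioning_strategy.py`; the stub `Topology` stands in for exo's actual class:

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import List


class Topology:
    """Stand-in for exo's Topology (nodes, capabilities, connections)."""
    pass


@dataclass
class Partition:
    node_id: str
    start: float  # inclusive, as a fraction of the model's layers
    end: float    # exclusive, as a fraction of the model's layers


class PartitioningStrategy(ABC):
    @abstractmethod
    def partition(self, topology: Topology) -> List[Partition]:
        ...
```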
There are two things going on here:
1. It decides the order in which nodes execute and send messages to each other. For example, if you return `[node1, node2, node3]`, then `node1` executes first, followed by `node2`, then `node3`, which sends an output token back to `node1` to continue the cycle. If you instead return `[node2, node1, node3]`, then `node2` executes first, followed by `node1`, then `node3`, which sends an output token back to `node2` to continue the cycle.
2. It decides how many layers each node gets. Each node gets a number of layers proportional to `end - start`. For example, if `start=0, end=1`, that node gets all the layers. If `node1` has `start=0, end=0.5` and `node2` has `start=0.5, end=1`, then each node gets 50% of the layers.
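As a concrete sanity check, mapping a fractional `[start, end)` range onto whole layer indices might look like the following (a hypothetical helper for illustration, not exo's actual code):

```python
def layer_range(start: float, end: float, n_layers: int) -> range:
    """Map a fractional [start, end) partition onto whole layer indices."""
    return range(round(start * n_layers), round(end * n_layers))


# With a 32-layer model split 50/50, each node gets 16 layers.
first = layer_range(0.0, 0.5, 32)   # layers 0..15
second = layer_range(0.5, 1.0, 32)  # layers 16..31
```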
The default and only `PartitioningStrategy` right now is `RingMemoryWeightedPartitioningStrategy` (see `exo/topology/ring_memory_weighted_partitioning_strategy.py`, lines 7-18 at commit `5e0db20`).
It sorts primarily by memory and secondarily by `node_id`. Each partition's size is proportional to the device's memory: if deviceA has 4GB and deviceB has 6GB, deviceA gets 40% of the layers and deviceB gets 60% (modulo some rounding). The secondary sort by `node_id` is important to keep the ordering deterministic and consistent when two devices have the same memory.
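The memory-weighted scheme can be sketched in a few lines. This is an illustration of the idea, not exo's implementation; the descending sort direction and the `(node_id, start, end)` tuple layout are assumptions:

```python
from typing import Dict, List, Tuple


def memory_weighted_partitions(devices: Dict[str, int]) -> List[Tuple[str, float, float]]:
    """devices: node_id -> memory in bytes. Returns (node_id, start, end) tuples.

    Sort primarily by memory (descending, assumed), secondarily by node_id
    for determinism, then size each partition proportionally to memory.
    """
    ordered = sorted(devices.items(), key=lambda kv: (-kv[1], kv[0]))
    total = sum(devices.values())
    partitions, start = [], 0.0
    for node_id, mem in ordered:
        end = start + mem / total
        partitions.append((node_id, start, end))
        start = end
    # Pin the last end to exactly 1.0 to absorb floating-point drift.
    if partitions:
        node_id, s, _ = partitions[-1]
        partitions[-1] = (node_id, s, 1.0)
    return partitions
```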
The task
The task is to implement a new, improved `PartitioningStrategy` that takes into account more than just memory. This may require augmenting the `Topology` class with more information than it currently has, which will require changes across the codebase. Some things you might want to consider: device FLOPS and inter-node latency. There are many other factors you could take into account, which I will leave to you to decide.
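An augmented topology might carry per-node capabilities and pairwise link measurements. Every name below is a hypothetical sketch of what "more information" could look like, not existing exo code:

```python
from dataclasses import dataclass, field
from typing import Dict, Tuple


@dataclass
class DeviceProfile:
    """Hypothetical per-node capabilities an augmented Topology could carry."""
    memory_bytes: int
    flops: float  # e.g. peak FP16 FLOPS


@dataclass
class AugmentedTopology:
    devices: Dict[str, DeviceProfile] = field(default_factory=dict)
    # Measured one-way latency in seconds between ordered node pairs.
    latency_s: Dict[Tuple[str, str], float] = field(default_factory=dict)
```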
I have some ideas for how to do this, and there are many potential approaches; however, I'm looking for out-of-the-box ideas here.
I'll leave it up to you to reason about how to lay this out, but there are two high-level metrics that would make sense to optimise for (should they be optimised separately or together?):
- Time-to-first-token (latency)
- Tokens per second (throughput)
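A toy cost model makes it concrete how throughput falls out of a given ring order and layer split. Everything here is an assumption for illustration, not exo code; it models single-stream decoding where one token must traverse the whole ring, and ignores prompt processing (which is what time-to-first-token would additionally capture):

```python
from typing import Dict, List, Tuple


def tokens_per_second(
    order: List[str],
    layer_counts: Dict[str, int],
    flops_per_layer: float,
    node_flops: Dict[str, float],
    hop_latency_s: Dict[Tuple[str, str], float],
) -> float:
    """Estimate decode throughput for one ring traversal per token."""
    # Each decode step runs every node's layers in sequence...
    compute = sum(layer_counts[n] * flops_per_layer / node_flops[n] for n in order)
    # ...plus every inter-node hop; the ring closes back to the first node.
    comms = sum(
        hop_latency_s.get((order[i], order[(i + 1) % len(order)]), 0.0)
        for i in range(len(order))
    )
    return 1.0 / (compute + comms)
```

Under this model, slow links argue for fewer hops through them and fast devices argue for more layers, so order and split interact; a strategy could search over both.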
Deliverables
- A set of unit tests for your new `PartitioningStrategy` that show it works in different cases.
- A set of unit tests that "simulate" different scenarios and show that this `PartitioningStrategy` achieves the optimal solution in each scenario.
- An option added to the main script to enable this `PartitioningStrategy` (you decide if other parameters should be added to configure it).
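Whatever strategy you implement, the spec above implies invariants that every test can assert: the first `start` is 0, the last `end` is 1, and the ranges are contiguous. A minimal checker, using a `(node_id, start, end)` tuple layout that is just a convention for this sketch:

```python
from typing import List, Tuple


def check_partition_invariants(
    partitions: List[Tuple[str, float, float]], tol: float = 1e-6
) -> None:
    """Assert the [start, end) coverage invariants from the spec."""
    assert partitions, "at least one partition is required"
    assert abs(partitions[0][1] - 0.0) < tol, "first start must be 0"
    assert abs(partitions[-1][2] - 1.0) < tol, "last end must be 1"
    # Adjacent partitions must meet exactly: no gaps, no overlaps.
    for (_, _, prev_end), (_, start, _) in zip(partitions, partitions[1:]):
        assert abs(prev_end - start) < tol, "partitions must be contiguous"
```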
Hi @AlexCheema, I've seen that you usually create bounties on issues; maybe you're interested in using Opire. You don't pay until someone claims the bounty with a PR.
PS: I'm the cofounder, so if you need anything, feel free to contact me.