Train working #480
Conversation
Force-pushed from 0797f32 to 39f2267
Force-pushed from b658d66 to e766b31
Very interesting. @dtnewman what's your take on the training? |
My overall take is that @blindcrone is a brilliant engineer doing amazing work here. I think it will be awesome when we can all train our own models locally, and I can't wait to see this built out further. Some high-level notes though:
|
Awesome feedback @dtnewman, and indeed great work from @blindcrone, thank you so much for the effort! I'm probably around 2/1000 steps into reading this PR, so I'm trying to see what the architecture choice would be if training worked as well as inference (i.e. what kind of bottlenecks to expect). |
The bottlenecks are interesting here, because the asynchronous nature of the communication and the heavy operation that is backpropagation mean that most of the communication overhead is effectively "hidden": it can happen while the system overall is waiting on training steps. |
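As a rough illustration of how that overlap can work, here is a minimal asyncio sketch; `peer.send`, `forward_backward`, and the message shape are hypothetical placeholders, not exo's actual comms API:

```python
import asyncio

async def send_gradients(peer, grads):
    # Hypothetical peer.send; stands in for whatever the real comms layer does.
    await peer.send("gradients", grads)

async def train_loop(peer, batches, forward_backward):
    loop = asyncio.get_running_loop()
    pending_send = None
    for batch in batches:
        # Run the heavy forward/backward pass in a worker thread so the event
        # loop stays free to drive the in-flight gradient transfer.
        loss, grads = await loop.run_in_executor(None, forward_backward, batch)
        if pending_send is not None:
            # By the time backprop finishes, the previous transfer has usually
            # completed, so this await costs little: the overhead is "hidden".
            await pending_send
        pending_send = asyncio.create_task(send_gradients(peer, grads))
    if pending_send is not None:
        await pending_send
```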
Since training on exo clusters is a whole new use case, I think its features will need to be built out over time, in conversation with the community as they play with this capability on their clusters. However, some deeper architectural decisions need to be made to support this functionality, so I'm requesting this merge specifically because I'm considering the future and want to avoid massive conflicts with other people's work as more of it gets built out |
Force-pushed from 742ed1d to 54263ec
Force-pushed from 54263ec to 181d770
Looks good as a prototype for training.
No big structural changes needed, just some small things you can check out.
Please see unresolved threads @blindcrone |
Okay, so at this point it seems like removing the abstract base class as part of the requirements for merging this is starting to produce conflicts. I think I've resolved every outstanding issue and will try to integrate the changes from main into the reconciled node class, but maintaining this as a separate branch will become less and less feasible as other changes land elsewhere, now that this refactor is part of it |
Still debugging some tinygrad stuff, and fixing comms
Only works on unquantized models on MLX so far. Also, for some weird reason, any optimizer but SGD seems to NaN everything
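For reference, a minimal sketch of a single SGD training step written against the public MLX API (this is not the code in this PR; `model`, the data shapes, and `loss_fn` are placeholders):

```python
import mlx.core as mx
import mlx.nn as nn
import mlx.optimizers as optim

def loss_fn(model, inputs, targets):
    # Plain next-token cross-entropy; model and data shapes are placeholders.
    logits = model(inputs)
    return nn.losses.cross_entropy(logits, targets, reduction="mean")

def sgd_step(model, optimizer, inputs, targets):
    # nn.value_and_grad differentiates loss_fn w.r.t. the model's parameters.
    loss_and_grad = nn.value_and_grad(model, loss_fn)
    loss, grads = loss_and_grad(model, inputs, targets)
    optimizer.update(model, grads)
    # MLX is lazy, so force evaluation of the updated weights and optimizer state.
    mx.eval(model.parameters(), optimizer.state)
    return loss

optimizer = optim.SGD(learning_rate=1e-4)
```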
…t the requestor shard's loss
Force-pushed from a5704f2 to dffa17b
Fixed the conflicts AFAICT, hopefully this can be merged now |
Ultimately this whole synchronization mechanism is kind of a hack for this proof of concept, and making a more robust mechanism would be part of designing a more integrated testing system. However, I think this addresses the code comprehensibility concern somewhat while still keeping the synchronization in a working state
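For the sake of discussion, a more explicit synchronization primitive might look something like the sketch below (hypothetical, not the mechanism used in this PR): a node blocks a step until every expected peer has reported in, with a timeout so a dropped peer can't hang training forever.

```python
import asyncio

class StepBarrier:
    """Minimal sketch: wait until every expected peer has reported a step."""

    def __init__(self, expected_peers: int):
        self.expected = expected_peers
        self.reported: set[str] = set()
        self.done = asyncio.Event()

    def report(self, peer_id: str) -> None:
        self.reported.add(peer_id)
        if len(self.reported) >= self.expected:
            self.done.set()

    async def wait(self, timeout: float = 30.0) -> None:
        # Times out instead of hanging forever if a peer drops mid-step.
        await asyncio.wait_for(self.done.wait(), timeout)
        self.reported.clear()
        self.done.clear()
```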
Now tinygrad will build a model from just the shard, just like MLX does
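Roughly, "build a model from just the shard" means instantiating only the layer range a node is responsible for; a hypothetical sketch of that idea (the `Shard` fields and `make_layer` here are illustrative, not exo's exact classes):

```python
from dataclasses import dataclass

@dataclass
class Shard:
    # Illustrative stand-in for a shard description: which slice of the model this node owns.
    model_id: str
    start_layer: int
    end_layer: int
    n_layers: int

def build_sharded_layers(shard: Shard, make_layer):
    # Only the blocks in [start_layer, end_layer] are ever allocated, so each
    # node holds (and trains) just its own slice of the model.
    return [make_layer(i) for i in range(shard.start_layer, shard.end_layer + 1)]
```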
Force-pushed from 608a3d8 to b1397b4
Currently this trains llama-3.2-3B on MLX, but I had to pull a different version of it because I guess the MLX ops needed to train quantized models are not yet implemented
Definitely still a toy, but it does what it advertises: this is distributed training across an exo network cluster