Train working #480
Conversation
Force-pushed from 0797f32 to 39f2267
Force-pushed from b658d66 to e766b31
Very interesting. @dtnewman what's your take on the training? |
My overall take is that @blindcrone is a brilliant engineer doing amazing work here. I think it will be awesome when we can all train our own models locally, and I can't wait to see this built out further. Some high-level notes though:
|
Awesome feedback @dtnewman, and indeed great work from @blindcrone, thank you so much for the effort! I'm probably around 2/1000 steps into reading this PR, so I'm trying to see what the architecture choice would be if training worked as well as inference (i.e. what kind of bottlenecks to expect). |
The bottlenecks are interesting here, because the asynchronous nature of the communication and the heavy operation that is backpropagation mean that most of the communication overhead is effectively "hidden": it can happen while the system overall is waiting on training steps. |
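As a rough illustration of how that overlap can work, here is a minimal asyncio sketch; `peer.send`, `forward_backward`, and the message shape are hypothetical placeholders, not exo's actual comms API:

```python
import asyncio

async def send_gradients(peer, grads):
    # Hypothetical peer.send; stands in for whatever the real comms layer does.
    await peer.send("gradients", grads)

async def train_loop(peer, batches, forward_backward):
    loop = asyncio.get_running_loop()
    pending_send = None
    for batch in batches:
        # Run the heavy forward/backward pass in a worker thread so the event
        # loop stays free to drive the in-flight gradient transfer.
        loss, grads = await loop.run_in_executor(None, forward_backward, batch)
        if pending_send is not None:
            # By the time backprop finishes, the previous transfer has usually
            # completed, so this await costs little: the overhead is "hidden".
            await pending_send
        pending_send = asyncio.create_task(send_gradients(peer, grads))
    if pending_send is not None:
        await pending_send
```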
Since training on exo clusters is a whole new use case, I think its features will need to be built out over time, in conversation with the community as they play with this capability on their clusters. However, some deeper architectural decisions need to be made to support this functionality, so I'm requesting this merge specifically because I'm considering the future and want to avoid massive conflicts with other people's work as more of it gets built out |
Force-pushed from 742ed1d to 54263ec
Force-pushed from 54263ec to 181d770
Looks good as a prototype for training.
No big structural changes needed, just some small things you can check out.
Please see unresolved threads @blindcrone |
Okay, so at this point it seems like removing the abstract base class as part of the requirements for merging this is starting to produce conflicts. I think I've resolved every outstanding issue and will try to integrate the changes from main into the reconciled node class, but maintaining this as a separate branch will become less and less feasible as other changes land elsewhere, now that this refactor is part of it |
Still debugging some tinygrad stuff, and fixing comms
Only works on unquantized models on MLX so far. Also, for some weird reason, any optimizer but SGD seems to NaN everything
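For reference, a minimal sketch of a single SGD training step written against the public MLX API (this is not the code in this PR; `model`, the data shapes, and `loss_fn` are placeholders):

```python
import mlx.core as mx
import mlx.nn as nn
import mlx.optimizers as optim

def loss_fn(model, inputs, targets):
    # Plain next-token cross-entropy; model and data shapes are placeholders.
    logits = model(inputs)
    return nn.losses.cross_entropy(logits, targets, reduction="mean")

def sgd_step(model, optimizer, inputs, targets):
    # nn.value_and_grad differentiates loss_fn w.r.t. the model's parameters.
    loss_and_grad = nn.value_and_grad(model, loss_fn)
    loss, grads = loss_and_grad(model, inputs, targets)
    optimizer.update(model, grads)
    # MLX is lazy, so force evaluation of the updated weights and optimizer state.
    mx.eval(model.parameters(), optimizer.state)
    return loss

optimizer = optim.SGD(learning_rate=1e-4)
```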
…t the requestor shard's loss
Force-pushed from a5704f2 to dffa17b
Fixed the conflicts AFAICT, hopefully this can be merged now |
Ultimately this whole synchronization mechanism is kind of a hack for this proof of concept, and making a more robust mechanism would be part of designing a more integrated testing system. However, I think this addresses the code comprehensibility concern somewhat while still keeping the synchronization in a working state
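For the sake of discussion, a more explicit synchronization primitive might look something like the sketch below (hypothetical, not the mechanism used in this PR): a node blocks a step until every expected peer has reported in, with a timeout so a dropped peer can't hang training forever.

```python
import asyncio

class StepBarrier:
    """Minimal sketch: wait until every expected peer has reported a step."""

    def __init__(self, expected_peers: int):
        self.expected = expected_peers
        self.reported: set[str] = set()
        self.done = asyncio.Event()

    def report(self, peer_id: str) -> None:
        self.reported.add(peer_id)
        if len(self.reported) >= self.expected:
            self.done.set()

    async def wait(self, timeout: float = 30.0) -> None:
        # Times out instead of hanging forever if a peer drops mid-step.
        await asyncio.wait_for(self.done.wait(), timeout)
        self.reported.clear()
        self.done.clear()
```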
Now tinygrad will build a model from just the shard, just like MLX does
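Roughly, "build a model from just the shard" means instantiating only the layer range a node is responsible for; a hypothetical sketch of that idea (the `Shard` fields and `make_layer` here are illustrative, not exo's exact classes):

```python
from dataclasses import dataclass

@dataclass
class Shard:
    # Illustrative stand-in for a shard description: which slice of the model this node owns.
    model_id: str
    start_layer: int
    end_layer: int
    n_layers: int

def build_sharded_layers(shard: Shard, make_layer):
    # Only the blocks in [start_layer, end_layer] are ever allocated, so each
    # node holds (and trains) just its own slice of the model.
    return [make_layer(i) for i in range(shard.start_layer, shard.end_layer + 1)]
```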
Force-pushed from 608a3d8 to b1397b4
Currently this trains llama-3.2-3B on MLX, but I had to pull a different version of it because I guess the MLX ops needed to train quantized models are not yet implemented
Definitely still a toy, but it does what it advertises: this is distributed training across an exo network cluster