Issue running llama3_distributed.py #14
Comments
Hey, thanks for trying exo out. Really appreciate you taking the time to file an issue. Let's figure this out. The asyncio errors are, I think, just a side effect. Try these things:
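One of the results reported in the reply below is packet loss between the nodes, so a minimal two-way reachability check is worth sketching here. The addresses are placeholders, not the reporter's actual IPs:

```sh
# From node 1, ping node 2, and vice versa; 0.0% packet loss both
# ways means basic network connectivity is fine.
# Addresses below are placeholders.
ping -c 5 192.168.1.2   # run on node 1, targeting node 2
ping -c 5 192.168.1.1   # run on node 2, targeting node 1
```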
Thanks for the quick response. I've also been having fun hacking on it over the last day. Node 1 is at
After noticing that the two nodes appear together in the topology cluster description, I decided to try a curl from Node 1 (along the lines of the sketch after this comment), and I noticed that it started downloading the 4-bit Meta-Llama-3 70B, which wasn't already downloaded (the 8B model was the only one downloaded when I was initially trying to get exo nodes running). This actually triggered a download on Node 2! So that's interesting and aligns with the README. Here are the requested outputs. Node 1 logs:
```
(py312) agent@Karens-MBP exo % DEBUG=9 python main.py
Server got itself in trouble% Server got itself in trouble%
```
Ping in the other direction also shows 0.0% packet loss. I specifically created two Python 3.12 venvs, one on each node, and then installed requirements.txt.
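The curl mentioned above that triggered the download would be a request against exo's API. A sketch of what that looks like, assuming an OpenAI-compatible chat-completions endpoint on port 8000; the port, path, and model id are assumptions, so check the node's startup logs for the real address:

```sh
# Sketch of a request that triggers inference (and hence the model
# download) on the cluster. Port, path, and model id are assumptions.
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama-3-70b",
    "messages": [{"role": "user", "content": "Hello from node 1"}]
  }'
```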
Oh no, it looks like multiple processes are still running on each node. I can tell from this log:
That's a hell of a lot of UUIDs. I need to fix this properly to prevent multiple nodes from running on the same machine (I'm surprised the port binding doesn't fail, tbh; or maybe it does, but silently). For now, just follow these steps:
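Not necessarily the exact steps being referred to, but a generic way to find and clear stray node processes before relaunching a single fresh one. The `main.py` match pattern is an assumption based on the launch command shown in this thread:

```sh
# List any processes whose command line mentions main.py, then kill
# them all and start exactly one node per machine.
pgrep -fl main.py
pkill -f main.py
DEBUG=9 python main.py
```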
Thanks again, this helps a lot, and let me know how it goes. I created two issues based on your feedback: #15 and #16.
Did you try again @arthur-brainchain?
This should be fixed now. Please reopen if still an issue @arthur-brainchain.
Yep, it's been working. Thanks. Also, p.s.: I had deliberately added multiple nodes on a single machine to see what the failure mode was. Cool project; looking forward to continuing to work with it.
…ust topology updates. both fix exo-explore#15 and exo-explore#14
I moved examples/llama3_distributed.py to the repo root to get around the `exo` package import issue.
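An alternative to moving the file, assuming the script lives under examples/ in a checkout of the repo, is to run it from the repo root with the root on the import path. A sketch, not the project's documented invocation:

```sh
# Run the example from the repo root so the exo package resolves,
# instead of moving the script. Paths assume the layout described above.
cd exo
PYTHONPATH=. python examples/llama3_distributed.py
```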
Then I ran it after getting 2 nodes to connect successfully (2x M2 Max, each with 64GB unified memory).
Here is the output I get:
This was after running two nodes and getting these logs from DEBUG=9 python main.py in two Python 3.12 environments.
Here are the node 1 server logs:
Node 2's logs look similar. It looks like the nodes are able to discover each other, but when I try to run inference I get the output above.