
Commit 43a1f61

Merge pull request #11 from risingsunomi/exo-fork-update
Exo fork update
2 parents: 46667b6 + 131c158

35 files changed: +1924 -641 lines

.circleci/config.yml

+3 -3

@@ -17,11 +17,11 @@ commands:
       source env/bin/activate

       # Start first instance
-      HF_HOME="$(pwd)/.hf_cache_node1" DEBUG_DISCOVERY=7 DEBUG=7 python3 main.py --inference-engine <<parameters.inference_engine>> --node-id "node1" --listen-port 5678 --broadcast-port 5679 --chatgpt-api-port 8000 --chatgpt-api-response-timeout-secs 900 > output1.log 2>&1 &
+      HF_HOME="$(pwd)/.hf_cache_node1" DEBUG_DISCOVERY=7 DEBUG=7 python3 main.py --inference-engine <<parameters.inference_engine>> --node-id "node1" --listen-port 5678 --broadcast-port 5679 --chatgpt-api-port 8000 --chatgpt-api-response-timeout 900 2>&1 | tee output1.log &
       PID1=$!

       # Start second instance
-      HF_HOME="$(pwd)/.hf_cache_node2" DEBUG_DISCOVERY=7 DEBUG=7 python3 main.py --inference-engine <<parameters.inference_engine>> --node-id "node2" --listen-port 5679 --broadcast-port 5678 --chatgpt-api-port 8001 --chatgpt-api-response-timeout-secs 900 > output2.log 2>&1 &
+      HF_HOME="$(pwd)/.hf_cache_node2" DEBUG_DISCOVERY=7 DEBUG=7 python3 main.py --inference-engine <<parameters.inference_engine>> --node-id "node2" --listen-port 5679 --broadcast-port 5678 --chatgpt-api-port 8001 --chatgpt-api-response-timeout 900 2>&1 | tee output2.log &
       PID2=$!

       # Wait for discovery
@@ -144,7 +144,7 @@ jobs:
       PID2=$!
       sleep 10
       kill $PID1 $PID2
-      if grep -q "Connected to peer" output1.log && grep -q "Connected to peer" output2.log; then
+      if grep -q "Successfully connected peers: \['node2@.*:.*'\]" output1.log && ! grep -q "Failed to connect peers:" output1.log && grep -q "Successfully connected peers: \['node1@.*:.*'\]" output2.log && ! grep -q "Failed to connect peers:" output2.log; then
         echo "Test passed: Both instances discovered each other"
         exit 0
       else
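The new check is stricter than the old `grep -q "Connected to peer"` test: each node's log must contain a success line naming the other node, and neither log may contain a failure line. A minimal Python restatement of the same logic, for illustration only (the log file names and patterns come from the config above; the helper itself is not part of the repo):

```python
# Illustrative restatement of the CI discovery check above; not part of the commit.
import re
from pathlib import Path

def discovered_each_other(log1: str = "output1.log", log2: str = "output2.log") -> bool:
    t1 = Path(log1).read_text()
    t2 = Path(log2).read_text()
    ok1 = re.search(r"Successfully connected peers: \['node2@.*:.*'\]", t1) and "Failed to connect peers:" not in t1
    ok2 = re.search(r"Successfully connected peers: \['node1@.*:.*'\]", t2) and "Failed to connect peers:" not in t2
    return bool(ok1 and ok2)
```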

README.md

+31 -9

@@ -5,12 +5,12 @@
     <img alt="exo logo" src="/docs/exo-logo-transparent.png" width="50%" height="50%">
   </picture>

-exo: Run your own AI cluster at home with everyday devices. Maintained by [exo labs](https://x.com/exolabs_).
+exo: Run your own AI cluster at home with everyday devices. Maintained by [exo labs](https://x.com/exolabs).


 <h3>

-[Discord](https://discord.gg/EUnjGpsmWw) | [Telegram](https://t.me/+Kh-KqHTzFYg3MGNk) | [X](https://x.com/exolabs_)
+[Discord](https://discord.gg/EUnjGpsmWw) | [Telegram](https://t.me/+Kh-KqHTzFYg3MGNk) | [X](https://x.com/exolabs)

 </h3>

@@ -25,14 +25,12 @@ exo: Run your own AI cluster at home with everyday devices. Maintained by [exo l
 Forget expensive NVIDIA GPUs, unify your existing devices into one powerful GPU: iPhone, iPad, Android, Mac, Linux, pretty much any device!

 <div align="center">
-<h2>Update: Exo Supports Llama 3.1</h2>
-<p>Run 8B, 70B and 405B parameter Llama 3.1 models on your own devices</p>
-<p><a href="https://github.com/exo-explore/exo/blob/main/exo/inference/mlx/models/llama.py">See the code</a></p>
+<h2>Update: exo is hiring. See <a href="https://exolabs.net">here</a> for more details.</h2>
 </div>

 ## Get Involved

-exo is **experimental** software. Expect bugs early on. Create issues so they can be fixed. The [exo labs](https://x.com/exolabs_) team will strive to resolve issues quickly.
+exo is **experimental** software. Expect bugs early on. Create issues so they can be fixed. The [exo labs](https://x.com/exolabs) team will strive to resolve issues quickly.

 We also welcome contributions from the community. We have a list of bounties in [this sheet](https://docs.google.com/spreadsheets/d/1cTCpTIp48UnnIvHeLEUNg1iMy_Q6lRybgECSFCoVJpE/edit?usp=sharing).

@@ -52,7 +50,7 @@ exo will [automatically discover](https://github.com/exo-explore/exo/blob/945f90

 ### ChatGPT-compatible API

-exo provides a [ChatGPT-compatible API](exo/api/chatgpt_api.py) for running models. It's a [one-line change](examples/chatgpt_api.py) in your application to run models on your own hardware using exo.
+exo provides a [ChatGPT-compatible API](exo/api/chatgpt_api.py) for running models. It's a [one-line change](examples/chatgpt_api.sh) in your application to run models on your own hardware using exo.

 ### Device Equality

@@ -108,8 +106,6 @@ python3 main.py

 That's it! No configuration required - exo will automatically discover the other device(s).

-The native way to access models running on exo is using the exo library with peer handles. See how in [this example for Llama 3](examples/llama3_distributed.py).
-
 exo starts a ChatGPT-like WebUI (powered by [tinygrad tinychat](https://github.com/tinygrad/tinygrad/tree/master/examples/tinychat)) on http://localhost:8000

 For developers, exo also starts a ChatGPT-compatible API endpoint on http://localhost:8000/v1/chat/completions. Example with curls:
@@ -150,6 +146,26 @@ curl http://localhost:8000/v1/chat/completions \
 }'
 ```

+### Example Usage on Multiple Heterogenous Devices (MacOS + Linux)
+
+#### Device 1 (MacOS):
+
+```sh
+python3 main.py --inference-engine tinygrad
+```
+
+Here we explicitly tell exo to use the **tinygrad** inference engine.
+
+#### Device 2 (Linux):
+```sh
+python3 main.py
+```
+
+Linux devices will automatically default to using the **tinygrad** inference engine.
+
+You can read about tinygrad-specific env vars [here](https://docs.tinygrad.org/env_vars/). For example, you can configure tinygrad to use the cpu by specifying `CLANG=1`.
+
 ## Debugging

 Enable debug logs with the DEBUG environment variable (0-9).
@@ -158,6 +174,12 @@ Enable debug logs with the DEBUG environment variable (0-9).
 DEBUG=9 python3 main.py
 ```

+For the **tinygrad** inference engine specifically, there is a separate DEBUG flag `TINYGRAD_DEBUG` that can be used to enable debug logs (1-6).
+
+```sh
+TINYGRAD_DEBUG=2 python3 main.py
+```
+
 ## Known Issues

 - 🚧 As the library is evolving so quickly, the iOS implementation has fallen behind Python. We have decided for now not to put out the buggy iOS version and receive a bunch of GitHub issues for outdated code. We are working on solving this properly and will make an announcement when it's ready. If you would like access to the iOS implementation now, please email [email protected] with your GitHub username explaining your use-case and you will be granted access on GitHub.
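Since the README above documents a ChatGPT-compatible endpoint at http://localhost:8000/v1/chat/completions, the "one-line change" can also be illustrated with an OpenAI-style Python client. A minimal sketch, assuming the `openai` package is installed; the model id is the one used in examples/chatgpt_api.sh and the dummy API key is an assumption (a local exo node is not expected to require one):

```python
# Hedged sketch: point an OpenAI-compatible client at exo's local endpoint.
# Assumes the `openai` Python package; it is not a dependency of this commit.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")  # placeholder key

response = client.chat.completions.create(
    model="llama-3.1-8b",  # model id used in examples/chatgpt_api.sh
    messages=[{"role": "user", "content": "What is the meaning of exo?"}],
    temperature=0.7,
)
print(response.choices[0].message.content)
```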

examples/chatgpt_api.sh

+39
@@ -0,0 +1,39 @@
+# exo provides an API that aims to be a drop-in replacements for the ChatGPT-API.
+# This example shows how you can use the API first without streaming and second with streaming.
+# This works the same in a single-node set up and in a multi-node setup.
+# You need to start exo before running this by running `python3 main.py`.
+
+API_ENDPOINT="http://${API_ENDPOINT:-$(ifconfig | grep 'inet ' | grep -v '127.0.0.1' | awk '{print $2}' | head -n 1):8000}"
+MODEL="llama-3.1-8b"
+PROMPT="What is the meaning of exo?"
+TEMPERATURE=0.7
+
+echo ""
+echo ""
+echo "--- Output without streaming:"
+echo ""
+curl "${API_ENDPOINT}/v1/chat/completions" --silent \
+  -H "Content-Type: application/json" \
+  -d '{
+    "model": "'"${MODEL}"'",
+    "messages": [{"role": "user", "content": "'"${PROMPT}"'"}],
+    "temperature": '"${TEMPERATURE}"'
+  }'
+
+echo ""
+echo ""
+echo "--- Output with streaming:"
+echo ""
+curl "${API_ENDPOINT}/v1/chat/completions" --silent \
+  -H "Content-Type: application/json" \
+  -d '{
+    "model": "'"${MODEL}"'",
+    "messages": [{"role": "user", "content": "'"${PROMPT}"'"}],
+    "temperature": '"${TEMPERATURE}"',
+    "stream": true
+  }' | while read -r line; do
+    if [[ $line == data:* ]]; then
+      content=$(echo "$line" | sed 's/^data: //')
+      echo "$content" | jq -r '.choices[].delta.content' --unbuffered | tr -d '\n'
+    fi
+  done
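For comparison, a hedged Python sketch of the same streaming call: the endpoint path, model, prompt, and the `data:` line format mirror the script above, while the use of the `requests` package and the `[DONE]` sentinel (an OpenAI-API convention) are assumptions:

```python
# Hedged Python equivalent of the streaming half of the script above.
import json
import requests  # assumption: not a dependency of this repo

resp = requests.post(
    "http://localhost:8000/v1/chat/completions",
    json={
        "model": "llama-3.1-8b",
        "messages": [{"role": "user", "content": "What is the meaning of exo?"}],
        "temperature": 0.7,
        "stream": True,
    },
    stream=True,
)
for line in resp.iter_lines(decode_unicode=True):
    if not line or not line.startswith("data: "):
        continue
    payload = line[len("data: "):]
    if payload.strip() == "[DONE]":  # OpenAI-style sentinel; may or may not be emitted by exo
        break
    chunk = json.loads(payload)
    for choice in chunk.get("choices", []):
        print(choice.get("delta", {}).get("content") or "", end="", flush=True)
print()
```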

examples/llama3_distributed.py

-81
This file was deleted.

exo/api/chatgpt_api.py

+19 -16

@@ -58,6 +58,9 @@ def generate_completion(
       "finish_reason": finish_reason,
     }],
   }
+
+  if DEBUG >= 3:
+    print(f"completion: {completion}")

   if not stream:
     completion["usage"] = {
@@ -67,9 +70,16 @@ def generate_completion(
     }

   choice = completion["choices"][0]
+  print(f"\nchoice {choice}")
   if object_type.startswith("chat.completion"):
     key_name = "delta" if stream else "message"
-    choice[key_name] = {"role": "assistant", "content": tokenizer.decode(tokens)}
+
+    token_decode = tokenizer.batch_decode(
+      tokens,
+      skip_special_tokens=True,
+      clean_up_tokenization_spaces=False
+    )
+    choice[key_name] = {"role": "assistant", "content": token_decode}
   elif object_type == "text_completion":
     choice["text"] = tokenizer.decode(tokens)
   else:
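The replaced line used `tokenizer.decode(tokens)`, which joins a flat list of token ids into one string; `tokenizer.batch_decode(tokens, ...)` decodes each element of its input separately, so the resulting `content` has a different shape. A minimal sketch of the difference with a Hugging Face tokenizer (the `transformers` package and the "gpt2" tokenizer are illustrative assumptions, not part of this commit):

```python
# Illustration of decode vs. batch_decode on a flat list of token ids.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")  # arbitrary tokenizer for the demo
tokens = tokenizer.encode("What is the meaning of exo?")

print(tokenizer.decode(tokens))  # one joined string
print(tokenizer.batch_decode(
    tokens,
    skip_special_tokens=True,
    clean_up_tokenization_spaces=False,
))  # a list with one decoded string per element of `tokens`
```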
@@ -113,16 +123,9 @@ def remap_messages(messages: List[Message]) -> List[Message]:


 def build_prompt(tokenizer, _messages: List[Message]):
-  if len(_messages) == 1:
-    user_msg = _messages[0]
-
-    # get instruct sys message
-    sys_msg = Message(role="system", content="You are a helpful assistant.")
-
-    # restructure for sys_msg to go first
-    _messages = [sys_msg, user_msg]
-
   messages = remap_messages(_messages)
+  if DEBUG >= 3:
+    print(f"messages: {messages}")
   prompt = tokenizer.apply_chat_template(
     messages,
     tokenize=False,
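build_prompt now passes the remapped messages straight to the tokenizer's chat template instead of injecting a default system message first. A hedged sketch of that call in isolation (the model name and the `add_generation_prompt` flag are assumptions; the hunk above is cut off before any further keyword arguments):

```python
# Standalone sketch of the chat-template step used in build_prompt above.
# Assumes a transformers tokenizer that ships a chat template; the model name is illustrative.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("mlx-community/Meta-Llama-3.1-8B-Instruct-4bit")
messages = [{"role": "user", "content": "What is the meaning of exo?"}]

prompt = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,  # assumption: append the assistant header before generation
)
print(prompt)
```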
@@ -140,7 +143,7 @@ def build_prompt(tokenizer, _messages: List[Message]):
       continue

     for content in message.content:
-      # note: we only support one image at a time right now. Multiple is possible. See: https://github.com/huggingface/transformers/blob/e68ec18ce224af879f22d904c7505a765fb77de3/docs/source/en/model_doc/llava.md?plain=1#L41
+      # note: wae only support one image at time right now. Multiple is possible. See: https://github.com/huggingface/transformers/blob/e68ec18ce224af879f22d904c7505a765fb77de3/docs/source/en/model_doc/llava.md?plain=1#L41
       # follows the convention in https://platform.openai.com/docs/guides/vision
       if isinstance(content, dict) and content.get("type", None) == "image":
         image_str = content.get("image", None)
@@ -171,10 +174,10 @@ def __init__(self, request_id: str, timestamp: int, prompt: str):


 class ChatGPTAPI:
-  def __init__(self, node: Node, inference_engine_classname: str, response_timeout_secs: int = 90, on_chat_completion_request: Callable[[str, ChatCompletionRequest, str], None] = None):
+  def __init__(self, node: Node, inference_engine_classname: str, response_timeout: int = 90, on_chat_completion_request: Callable[[str, ChatCompletionRequest, str], None] = None):
     self.node = node
     self.inference_engine_classname = inference_engine_classname
-    self.response_timeout_secs = response_timeout_secs
+    self.response_timeout = response_timeout
     self.on_chat_completion_request = on_chat_completion_request
     self.app = web.Application(client_max_size=100*1024*1024) # 100MB to support image upload
     self.prompts: PrefixDict[str, PromptSession] = PrefixDict()
@@ -273,7 +276,7 @@ async def handle_post_chat_completions(self, request):
       return web.json_response({"detail": f"Error processing prompt (see logs with DEBUG>=2): {str(e)}"}, status=500)

     try:
-      if DEBUG >= 2: print(f"Waiting for response to finish. timeout={self.response_timeout_secs}s")
+      if DEBUG >= 2: print(f"Waiting for response to finish. timeout={self.response_timeout}s")

       if stream:
         response = web.StreamResponse(
@@ -322,7 +325,7 @@ def on_result(_request_id: str, tokens: List[int], is_finished: bool):

           return _request_id == request_id and is_finished

-        _, tokens, _ = await callback.wait(on_result, timeout=self.response_timeout_secs)
+        _, tokens, _ = await callback.wait(on_result, timeout=self.response_timeout)
         if request_id in self.stream_tasks: # in case there is still a stream task running, wait for it to complete
           if DEBUG >= 2: print("Pending stream task. Waiting for stream task to complete.")
           try:
@@ -334,7 +337,7 @@ def on_result(_request_id: str, tokens: List[int], is_finished: bool):
       else:
         _, tokens, _ = await callback.wait(
           lambda _request_id, tokens, is_finished: _request_id == request_id and is_finished,
-          timeout=self.response_timeout_secs,
+          timeout=self.response_timeout,
         )

         finish_reason = "length"
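The renamed `response_timeout` is ultimately a per-request bound on how long `callback.wait(...)` blocks for tokens. That callback object is internal to exo; as a hedged illustration of the same pattern only, here is how a timeout-bounded wait looks with plain asyncio:

```python
# Illustrative only: exo's callback.wait is internal to the repo. This shows the generic
# "wait for a result, but give up after response_timeout seconds" pattern it follows.
import asyncio

async def wait_for_tokens(result: asyncio.Future, response_timeout: int = 90):
    try:
        return await asyncio.wait_for(result, timeout=response_timeout)
    except asyncio.TimeoutError:
        return None  # caller maps this to an HTTP timeout error
```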
