Exo fork update #11

Merged · 107 commits · Sep 19, 2024

Commits
65e0488
logs for file filtering, grpc_discovery -> udp_discovery
AlexCheema Aug 21, 2024
5a9f4ba
update examples: remove old llama3_distributed, add chatgpt_api
AlexCheema Aug 28, 2024
2341aa1
Merge branch 'main' into better_networking
AlexCheema Aug 29, 2024
d4a932e
fix merge
AlexCheema Aug 29, 2024
f93f811
generalise UDPDiscovery to any kind of PeerHandle that accepts an add…
AlexCheema Aug 29, 2024
12609cb
integration test for udp discovery with grpc server
AlexCheema Aug 30, 2024
dc3b2bd
use NousResearch/Meta-Llama-3.1-70B-Instruct as tinygrad llama-3.1-70…
AlexCheema Aug 30, 2024
032c9b1
rewrite of sharded model using new split testing of huggingface models
risingsunomi Sep 1, 2024
626b223
building out new hf.py class, testing qwen and llama3 8b
risingsunomi Sep 1, 2024
5721504
todo for speculative model
AlexCheema Sep 1, 2024
8baaad7
Merge branch 'main' into better_networking
AlexCheema Sep 2, 2024
f983e93
trying to load in weights but transformers/pytorch doesnt allow that …
risingsunomi Sep 4, 2024
baf6efd
cleaner discovery
AlexCheema Sep 4, 2024
15b5043
test for reconnect
AlexCheema Sep 4, 2024
3dd81a1
fix UDPDiscovery params, create a new transport every time we broadcast
AlexCheema Sep 4, 2024
dcb3ac7
test kill pids
AlexCheema Sep 4, 2024
8114a79
add back listen and cleanup tasks
AlexCheema Sep 4, 2024
355c579
more robust discovery / peer handling. now we track if the same node …
AlexCheema Sep 4, 2024
80c48b9
update visited with self.id, timeout on collecting topology from a pe…
AlexCheema Sep 4, 2024
c97da54
add id to set
AlexCheema Sep 4, 2024
8cb678e
better logs around peer connecting / disconnecting
AlexCheema Sep 4, 2024
a0d9c90
shorten cli name --chatgpt-api-response-timeout
AlexCheema Sep 4, 2024
f342cdc
get rid of -secs suffix
AlexCheema Sep 4, 2024
56c1bf9
consistent remove _secs / -secs suffix
AlexCheema Sep 4, 2024
4537d61
circleci use tee to output logs in realtime as well as capture them
AlexCheema Sep 4, 2024
35aba75
Merge pull request #194 from exo-explore/better_networking
AlexCheema Sep 4, 2024
41dd700
less aggressive logs for opaque status / download progress. too much …
AlexCheema Sep 4, 2024
01cc6a4
fix Mistral-Large special case when we pass in a path
AlexCheema Sep 4, 2024
41f0a22
DEBUG>=8 for SendOpaqueStatus logs
AlexCheema Sep 4, 2024
2950373
experiment with tinygrad on its own thread, so it doesnt block event …
AlexCheema Sep 5, 2024
58f535d
formatting
AlexCheema Sep 5, 2024
0ca5c26
run mlx inference engine on a single thread too
AlexCheema Sep 5, 2024
9db16f8
use a queue for non-blocking mlx inference
AlexCheema Sep 5, 2024
6881722
simplify non-blocking mlx inference
AlexCheema Sep 5, 2024
8f65e1e
fix weight_map resolution. previously we were always defaulting to al…
AlexCheema Sep 5, 2024
b239c8a
Merge branch 'main' into non_blocking
AlexCheema Sep 5, 2024
e0fda94
use sets for shard specific patterns
AlexCheema Sep 5, 2024
ea3322d
remove comment
AlexCheema Sep 5, 2024
11dd952
use set for shard specific patterns
AlexCheema Sep 5, 2024
2948a83
add llama-3.1-70b-bf16 model option
AlexCheema Sep 5, 2024
de19f0a
Merge branch 'main' into non_blocking
AlexCheema Sep 5, 2024
a1a0ffa
add tinychat option for llama-3.1-70b-bf16
AlexCheema Sep 5, 2024
6342384
Merge branch 'main' into non_blocking
AlexCheema Sep 5, 2024
4ec613d
simplify tinygrad non blocking
AlexCheema Sep 5, 2024
8418711
add a test for hf get_weight_map
AlexCheema Sep 5, 2024
caf9b57
trigger ci
AlexCheema Sep 5, 2024
d6e661f
match previous impl with np.array in mlx
AlexCheema Sep 5, 2024
9345684
closely match prev impl mlx non blocking
AlexCheema Sep 5, 2024
e616d4e
run realize on the result in tinygrad
AlexCheema Sep 5, 2024
874886a
simplify mlx non blocking
AlexCheema Sep 5, 2024
87e08f8
Merge pull request #203 from exo-explore/non_blocking
AlexCheema Sep 5, 2024
ca64456
fix broken links in README
AlexCheema Sep 8, 2024
0fa1536
Merge pull request #208 from exo-explore/broken_links_readme
AlexCheema Sep 8, 2024
4b00940
move `.exo_used_ports` to `/tmp`
GaetanLepage Sep 8, 2024
e0ed917
Merge pull request #209 from GaetanLepage/used-ports
AlexCheema Sep 10, 2024
20522e0
update docs to make tinygrad usage clearer
AlexCheema Sep 11, 2024
198cd6f
trigger ci
AlexCheema Sep 13, 2024
074228e
update README with hiring
AlexCheema Sep 13, 2024
6c875dc
update hiring link
AlexCheema Sep 13, 2024
db9f44d
website link
AlexCheema Sep 13, 2024
d142be0
adding more testing, refining logit selection
risingsunomi Sep 13, 2024
be8d7fb
working split model test, updating class
risingsunomi Sep 15, 2024
9d1ecdd
working on class and inference engine updates
risingsunomi Sep 15, 2024
4b0df06
building out inference engine test
risingsunomi Sep 15, 2024
623468c
adding working tests, update to forward function to just use input_id…
risingsunomi Sep 16, 2024
19b322d
cleaning up code and tests, debugging and adding in cleaned up loggin…
risingsunomi Sep 16, 2024
cc2c14c
getting infer and stop token issues
risingsunomi Sep 16, 2024
583629c
add tracking of next token and other logits into the full input_ids s…
risingsunomi Sep 17, 2024
7ec5bb8
grpc testing
risingsunomi Sep 17, 2024
5903e63
grpc testing
risingsunomi Sep 17, 2024
e7a3fd0
grpc testing
risingsunomi Sep 17, 2024
f6eec5a
grpc testing
risingsunomi Sep 17, 2024
d441a51
grpc testing
risingsunomi Sep 17, 2024
e7f6dcb
grpc testing
risingsunomi Sep 17, 2024
ba5b005
grpc testing
risingsunomi Sep 17, 2024
6242d76
grpc testing
risingsunomi Sep 17, 2024
5630731
grpc testing
risingsunomi Sep 17, 2024
4a29268
testing passing hidden states in inference_state
risingsunomi Sep 17, 2024
2daf65f
testing passing hidden states in inference_state
risingsunomi Sep 17, 2024
36d5cde
fixing scalar issue, reversing passing hidden_states
risingsunomi Sep 17, 2024
6917f30
inference bug fix, grpc testing
risingsunomi Sep 17, 2024
adab336
inference bug fix, grpc testing
risingsunomi Sep 17, 2024
73146dd
fixing hf model for hidden_states
risingsunomi Sep 17, 2024
929386d
fixing hf model for hidden_states
risingsunomi Sep 17, 2024
32b8f67
fixing hf model for hidden_states
risingsunomi Sep 17, 2024
c86facb
fixing hf model for hidden_states
risingsunomi Sep 17, 2024
d15b20d
fixing hf model for hidden_states
risingsunomi Sep 17, 2024
5e41bc4
fixing hf model for hidden_states
risingsunomi Sep 17, 2024
b29c5f8
fixing hf model for hidden_states
risingsunomi Sep 17, 2024
ddaa79c
fixing kvcache issue
risingsunomi Sep 17, 2024
3164d38
fixing kvcache issue
risingsunomi Sep 17, 2024
e8532bc
fixing kvcache issue
risingsunomi Sep 17, 2024
6a5b8db
fixing kvcache issue
risingsunomi Sep 17, 2024
515687d
working on passing past input_ids between infers and nodes
risingsunomi Sep 18, 2024
3597fba
add support for qwen2.5, initially adding mlx-community/Qwen2.5-14B-I…
AlexCheema Sep 18, 2024
dee83e4
add more qwen2.5 models: mlx-community/Qwen2.5-7B-Instruct-4bit mlx-c…
AlexCheema Sep 18, 2024
68028cc
ignore Qwen models in tokenizers test until bos issue is fixed
AlexCheema Sep 18, 2024
8ad19a5
Merge pull request #221 from exo-explore/qwen2.5
AlexCheema Sep 18, 2024
311c819
update twitter handle exolabs_ -> exolabs
AlexCheema Sep 19, 2024
92ebdd5
implemented infer caching and passing cache information via inference…
risingsunomi Sep 19, 2024
f0795bd
removing dynamic cache passing in inference_state as model does its o…
risingsunomi Sep 19, 2024
b8f15a0
removed clearning cache on infer prompt and only on finished infer te…
risingsunomi Sep 19, 2024
d0f3cb7
hidden state dropping between nodes issue
risingsunomi Sep 19, 2024
fa6f263
hidden state dropping between nodes issue
risingsunomi Sep 19, 2024
2b0e7b5
hidden state dropping between nodes issue
risingsunomi Sep 19, 2024
f793c00
Merge branch 'main' of github.com:exo-explore/exo into exo-fork-update
risingsunomi Sep 19, 2024
131c158
Merge branch 'exo-fork-update' of github.com:risingsunomi/exo-nvidia …
risingsunomi Sep 19, 2024
6 changes: 3 additions & 3 deletions .circleci/config.yml
@@ -17,11 +17,11 @@ commands:
source env/bin/activate

# Start first instance
HF_HOME="$(pwd)/.hf_cache_node1" DEBUG_DISCOVERY=7 DEBUG=7 python3 main.py --inference-engine <<parameters.inference_engine>> --node-id "node1" --listen-port 5678 --broadcast-port 5679 --chatgpt-api-port 8000 --chatgpt-api-response-timeout-secs 900 > output1.log 2>&1 &
HF_HOME="$(pwd)/.hf_cache_node1" DEBUG_DISCOVERY=7 DEBUG=7 python3 main.py --inference-engine <<parameters.inference_engine>> --node-id "node1" --listen-port 5678 --broadcast-port 5679 --chatgpt-api-port 8000 --chatgpt-api-response-timeout 900 2>&1 | tee output1.log &
PID1=$!

# Start second instance
HF_HOME="$(pwd)/.hf_cache_node2" DEBUG_DISCOVERY=7 DEBUG=7 python3 main.py --inference-engine <<parameters.inference_engine>> --node-id "node2" --listen-port 5679 --broadcast-port 5678 --chatgpt-api-port 8001 --chatgpt-api-response-timeout-secs 900 > output2.log 2>&1 &
HF_HOME="$(pwd)/.hf_cache_node2" DEBUG_DISCOVERY=7 DEBUG=7 python3 main.py --inference-engine <<parameters.inference_engine>> --node-id "node2" --listen-port 5679 --broadcast-port 5678 --chatgpt-api-port 8001 --chatgpt-api-response-timeout 900 2>&1 | tee output2.log &
PID2=$!

# Wait for discovery
@@ -144,7 +144,7 @@ jobs:
PID2=$!
sleep 10
kill $PID1 $PID2
if grep -q "Connected to peer" output1.log && grep -q "Connected to peer" output2.log; then
if grep -q "Successfully connected peers: \['node2@.*:.*'\]" output1.log && ! grep -q "Failed to connect peers:" output1.log && grep -q "Successfully connected peers: \['node1@.*:.*'\]" output2.log && ! grep -q "Failed to connect peers:" output2.log; then
echo "Test passed: Both instances discovered each other"
exit 0
else
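The change above swaps plain file redirection for a pipe into `tee`, so each node's log streams to the CI console in real time while still being captured for the later `grep` checks. A minimal sketch of the difference, using a placeholder command rather than the actual exo invocation:

```sh
# Before: stdout and stderr go only to the log file; nothing appears in the CI console until the job ends.
some_command > output.log 2>&1 &

# After: stderr is merged into stdout, and tee writes the combined stream to both the console and the log file.
some_command 2>&1 | tee output.log &
```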
40 changes: 31 additions & 9 deletions README.md
@@ -5,12 +5,12 @@
<img alt="exo logo" src="/docs/exo-logo-transparent.png" width="50%" height="50%">
</picture>

exo: Run your own AI cluster at home with everyday devices. Maintained by [exo labs](https://x.com/exolabs_).
exo: Run your own AI cluster at home with everyday devices. Maintained by [exo labs](https://x.com/exolabs).


<h3>

[Discord](https://discord.gg/EUnjGpsmWw) | [Telegram](https://t.me/+Kh-KqHTzFYg3MGNk) | [X](https://x.com/exolabs_)
[Discord](https://discord.gg/EUnjGpsmWw) | [Telegram](https://t.me/+Kh-KqHTzFYg3MGNk) | [X](https://x.com/exolabs)

</h3>

@@ -25,14 +25,12 @@ exo: Run your own AI cluster at home with everyday devices. Maintained by [exo l
Forget expensive NVIDIA GPUs, unify your existing devices into one powerful GPU: iPhone, iPad, Android, Mac, Linux, pretty much any device!

<div align="center">
<h2>Update: Exo Supports Llama 3.1</h2>
<p>Run 8B, 70B and 405B parameter Llama 3.1 models on your own devices</p>
<p><a href="https://github.com/exo-explore/exo/blob/main/exo/inference/mlx/models/llama.py">See the code</a></p>
<h2>Update: exo is hiring. See <a href="https://exolabs.net">here</a> for more details.</h2>
</div>

## Get Involved

exo is **experimental** software. Expect bugs early on. Create issues so they can be fixed. The [exo labs](https://x.com/exolabs_) team will strive to resolve issues quickly.
exo is **experimental** software. Expect bugs early on. Create issues so they can be fixed. The [exo labs](https://x.com/exolabs) team will strive to resolve issues quickly.

We also welcome contributions from the community. We have a list of bounties in [this sheet](https://docs.google.com/spreadsheets/d/1cTCpTIp48UnnIvHeLEUNg1iMy_Q6lRybgECSFCoVJpE/edit?usp=sharing).

@@ -52,7 +50,7 @@ exo will [automatically discover](https://github.com/exo-explore/exo/blob/945f90

### ChatGPT-compatible API

exo provides a [ChatGPT-compatible API](exo/api/chatgpt_api.py) for running models. It's a [one-line change](examples/chatgpt_api.py) in your application to run models on your own hardware using exo.
exo provides a [ChatGPT-compatible API](exo/api/chatgpt_api.py) for running models. It's a [one-line change](examples/chatgpt_api.sh) in your application to run models on your own hardware using exo.

### Device Equality

@@ -108,8 +106,6 @@ python3 main.py

That's it! No configuration required - exo will automatically discover the other device(s).

The native way to access models running on exo is using the exo library with peer handles. See how in [this example for Llama 3](examples/llama3_distributed.py).

exo starts a ChatGPT-like WebUI (powered by [tinygrad tinychat](https://github.com/tinygrad/tinygrad/tree/master/examples/tinychat)) on http://localhost:8000

For developers, exo also starts a ChatGPT-compatible API endpoint on http://localhost:8000/v1/chat/completions. Example with curls:
@@ -150,6 +146,26 @@ curl http://localhost:8000/v1/chat/completions \
}'
```

### Example Usage on Multiple Heterogenous Devices (MacOS + Linux)

#### Device 1 (MacOS):

```sh
python3 main.py --inference-engine tinygrad
```

Here we explicitly tell exo to use the **tinygrad** inference engine.

#### Device 2 (Linux):
```sh
python3 main.py
```

Linux devices will automatically default to using the **tinygrad** inference engine.

You can read about tinygrad-specific env vars [here](https://docs.tinygrad.org/env_vars/). For example, you can configure tinygrad to use the cpu by specifying `CLANG=1`.
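As a minimal illustration of the tinygrad environment variable mentioned above (combining it with the `--inference-engine` flag from the macOS example is an assumption, not part of the diff):

```sh
# Force tinygrad onto its CPU (Clang) backend
CLANG=1 python3 main.py --inference-engine tinygrad
```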


## Debugging

Enable debug logs with the DEBUG environment variable (0-9).
@@ -158,6 +174,12 @@ Enable debug logs with the DEBUG environment variable (0-9).
DEBUG=9 python3 main.py
```

For the **tinygrad** inference engine specifically, there is a separate DEBUG flag `TINYGRAD_DEBUG` that can be used to enable debug logs (1-6).

```sh
TINYGRAD_DEBUG=2 python3 main.py
```

## Known Issues

- 🚧 As the library is evolving so quickly, the iOS implementation has fallen behind Python. We have decided for now not to put out the buggy iOS version and receive a bunch of GitHub issues for outdated code. We are working on solving this properly and will make an announcement when it's ready. If you would like access to the iOS implementation now, please email [email protected] with your GitHub username explaining your use-case and you will be granted access on GitHub.
39 changes: 39 additions & 0 deletions examples/chatgpt_api.sh
@@ -0,0 +1,39 @@
# exo provides an API that aims to be a drop-in replacements for the ChatGPT-API.
# This example shows how you can use the API first without streaming and second with streaming.
# This works the same in a single-node set up and in a multi-node setup.
# You need to start exo before running this by running `python3 main.py`.

API_ENDPOINT="http://${API_ENDPOINT:-$(ifconfig | grep 'inet ' | grep -v '127.0.0.1' | awk '{print $2}' | head -n 1):8000}"
MODEL="llama-3.1-8b"
PROMPT="What is the meaning of exo?"
TEMPERATURE=0.7

echo ""
echo ""
echo "--- Output without streaming:"
echo ""
curl "${API_ENDPOINT}/v1/chat/completions" --silent \
-H "Content-Type: application/json" \
-d '{
"model": "'"${MODEL}"'",
"messages": [{"role": "user", "content": "'"${PROMPT}"'"}],
"temperature": '"${TEMPERATURE}"'
}'

echo ""
echo ""
echo "--- Output with streaming:"
echo ""
curl "${API_ENDPOINT}/v1/chat/completions" --silent \
-H "Content-Type: application/json" \
-d '{
"model": "'"${MODEL}"'",
"messages": [{"role": "user", "content": "'"${PROMPT}"'"}],
"temperature": '"${TEMPERATURE}"',
"stream": true
}' | while read -r line; do
if [[ $line == data:* ]]; then
content=$(echo "$line" | sed 's/^data: //')
echo "$content" | jq -r '.choices[].delta.content' --unbuffered | tr -d '\n'
fi
done
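Assuming exo is already running locally on the default port, and that `curl` and `jq` are installed (the streaming half of the script pipes through `jq`), the script above could be invoked as follows; the host and port in the second form are placeholders:

```sh
# Query the local exo instance started with `python3 main.py`
bash examples/chatgpt_api.sh

# Or target another node on the network by overriding API_ENDPOINT (host:port)
API_ENDPOINT=192.168.1.42:8000 bash examples/chatgpt_api.sh
```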
81 changes: 0 additions & 81 deletions examples/llama3_distributed.py

This file was deleted.

35 changes: 19 additions & 16 deletions exo/api/chatgpt_api.py
@@ -58,6 +58,9 @@ def generate_completion(
"finish_reason": finish_reason,
}],
}

if DEBUG >= 3:
print(f"completion: {completion}")

if not stream:
completion["usage"] = {
@@ -67,9 +70,16 @@ }
}

choice = completion["choices"][0]
print(f"\nchoice {choice}")
if object_type.startswith("chat.completion"):
key_name = "delta" if stream else "message"
choice[key_name] = {"role": "assistant", "content": tokenizer.decode(tokens)}

token_decode = tokenizer.batch_decode(
tokens,
skip_special_tokens=True,
clean_up_tokenization_spaces=False
)
choice[key_name] = {"role": "assistant", "content": token_decode}
elif object_type == "text_completion":
choice["text"] = tokenizer.decode(tokens)
else:
@@ -113,16 +123,9 @@ def remap_messages(messages: List[Message]) -> List[Message]:


def build_prompt(tokenizer, _messages: List[Message]):
if len(_messages) == 1:
user_msg = _messages[0]

# get instruct sys message
sys_msg = Message(role="system", content="You are a helpful assistant.")

# restructure for sys_msg to go first
_messages = [sys_msg, user_msg]

messages = remap_messages(_messages)
if DEBUG >= 3:
print(f"messages: {messages}")
prompt = tokenizer.apply_chat_template(
messages,
tokenize=False,
@@ -140,7 +143,7 @@ def build_prompt(tokenizer, _messages: List[Message]):
continue

for content in message.content:
# note: we only support one image at a time right now. Multiple is possible. See: https://github.com/huggingface/transformers/blob/e68ec18ce224af879f22d904c7505a765fb77de3/docs/source/en/model_doc/llava.md?plain=1#L41
# note: wae only support one image at time right now. Multiple is possible. See: https://github.com/huggingface/transformers/blob/e68ec18ce224af879f22d904c7505a765fb77de3/docs/source/en/model_doc/llava.md?plain=1#L41
# follows the convention in https://platform.openai.com/docs/guides/vision
if isinstance(content, dict) and content.get("type", None) == "image":
image_str = content.get("image", None)
@@ -171,10 +174,10 @@ def __init__(self, request_id: str, timestamp: int, prompt: str):


class ChatGPTAPI:
def __init__(self, node: Node, inference_engine_classname: str, response_timeout_secs: int = 90, on_chat_completion_request: Callable[[str, ChatCompletionRequest, str], None] = None):
def __init__(self, node: Node, inference_engine_classname: str, response_timeout: int = 90, on_chat_completion_request: Callable[[str, ChatCompletionRequest, str], None] = None):
self.node = node
self.inference_engine_classname = inference_engine_classname
self.response_timeout_secs = response_timeout_secs
self.response_timeout = response_timeout
self.on_chat_completion_request = on_chat_completion_request
self.app = web.Application(client_max_size=100*1024*1024) # 100MB to support image upload
self.prompts: PrefixDict[str, PromptSession] = PrefixDict()
@@ -273,7 +276,7 @@ async def handle_post_chat_completions(self, request):
return web.json_response({"detail": f"Error processing prompt (see logs with DEBUG>=2): {str(e)}"}, status=500)

try:
if DEBUG >= 2: print(f"Waiting for response to finish. timeout={self.response_timeout_secs}s")
if DEBUG >= 2: print(f"Waiting for response to finish. timeout={self.response_timeout}s")

if stream:
response = web.StreamResponse(
@@ -322,7 +325,7 @@ def on_result(_request_id: str, tokens: List[int], is_finished: bool):

return _request_id == request_id and is_finished

_, tokens, _ = await callback.wait(on_result, timeout=self.response_timeout_secs)
_, tokens, _ = await callback.wait(on_result, timeout=self.response_timeout)
if request_id in self.stream_tasks: # in case there is still a stream task running, wait for it to complete
if DEBUG >= 2: print("Pending stream task. Waiting for stream task to complete.")
try:
@@ -334,7 +337,7 @@ def on_result(_request_id: str, tokens: List[int], is_finished: bool):
else:
_, tokens, _ = await callback.wait(
lambda _request_id, tokens, is_finished: _request_id == request_id and is_finished,
timeout=self.response_timeout_secs,
timeout=self.response_timeout,
)

finish_reason = "length"
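The `response_timeout_secs` -> `response_timeout` rename in `ChatGPTAPI` lines up with the CLI flag change already visible in the CircleCI config earlier in this diff; a sketch of the renamed flag as used there:

```sh
python3 main.py --chatgpt-api-response-timeout 900
```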