
Commit 43a1f61

Merge pull request #11 from risingsunomi/exo-fork-update
Exo fork update
2 parents: 46667b6 + 131c158

35 files changed: +1924 -641 lines

.circleci/config.yml

+3 -3

@@ -17,11 +17,11 @@ commands:
       source env/bin/activate

       # Start first instance
-      HF_HOME="$(pwd)/.hf_cache_node1" DEBUG_DISCOVERY=7 DEBUG=7 python3 main.py --inference-engine <<parameters.inference_engine>> --node-id "node1" --listen-port 5678 --broadcast-port 5679 --chatgpt-api-port 8000 --chatgpt-api-response-timeout-secs 900 > output1.log 2>&1 &
+      HF_HOME="$(pwd)/.hf_cache_node1" DEBUG_DISCOVERY=7 DEBUG=7 python3 main.py --inference-engine <<parameters.inference_engine>> --node-id "node1" --listen-port 5678 --broadcast-port 5679 --chatgpt-api-port 8000 --chatgpt-api-response-timeout 900 2>&1 | tee output1.log &
       PID1=$!

       # Start second instance
-      HF_HOME="$(pwd)/.hf_cache_node2" DEBUG_DISCOVERY=7 DEBUG=7 python3 main.py --inference-engine <<parameters.inference_engine>> --node-id "node2" --listen-port 5679 --broadcast-port 5678 --chatgpt-api-port 8001 --chatgpt-api-response-timeout-secs 900 > output2.log 2>&1 &
+      HF_HOME="$(pwd)/.hf_cache_node2" DEBUG_DISCOVERY=7 DEBUG=7 python3 main.py --inference-engine <<parameters.inference_engine>> --node-id "node2" --listen-port 5679 --broadcast-port 5678 --chatgpt-api-port 8001 --chatgpt-api-response-timeout 900 2>&1 | tee output2.log &
       PID2=$!

       # Wait for discovery
@@ -144,7 +144,7 @@ jobs:
       PID2=$!
       sleep 10
       kill $PID1 $PID2
-      if grep -q "Connected to peer" output1.log && grep -q "Connected to peer" output2.log; then
+      if grep -q "Successfully connected peers: \['node2@.*:.*'\]" output1.log && ! grep -q "Failed to connect peers:" output1.log && grep -q "Successfully connected peers: \['node1@.*:.*'\]" output2.log && ! grep -q "Failed to connect peers:" output2.log; then
         echo "Test passed: Both instances discovered each other"
         exit 0
       else
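The new check is stricter than the old `grep -q "Connected to peer"` test: each node's log must contain a success line naming the other node, and neither log may contain a failure line. A minimal Python restatement of the same logic, for illustration only (the log file names and patterns come from the config above; the helper itself is not part of the repo):

```python
# Illustrative restatement of the CI discovery check above; not part of the commit.
import re
from pathlib import Path

def discovered_each_other(log1: str = "output1.log", log2: str = "output2.log") -> bool:
    t1 = Path(log1).read_text()
    t2 = Path(log2).read_text()
    ok1 = re.search(r"Successfully connected peers: \['node2@.*:.*'\]", t1) and "Failed to connect peers:" not in t1
    ok2 = re.search(r"Successfully connected peers: \['node1@.*:.*'\]", t2) and "Failed to connect peers:" not in t2
    return bool(ok1 and ok2)
```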

README.md

+31 -9

@@ -5,12 +5,12 @@
     <img alt="exo logo" src="/docs/exo-logo-transparent.png" width="50%" height="50%">
   </picture>

-exo: Run your own AI cluster at home with everyday devices. Maintained by [exo labs](https://x.com/exolabs_).
+exo: Run your own AI cluster at home with everyday devices. Maintained by [exo labs](https://x.com/exolabs).


 <h3>

-[Discord](https://discord.gg/EUnjGpsmWw) | [Telegram](https://t.me/+Kh-KqHTzFYg3MGNk) | [X](https://x.com/exolabs_)
+[Discord](https://discord.gg/EUnjGpsmWw) | [Telegram](https://t.me/+Kh-KqHTzFYg3MGNk) | [X](https://x.com/exolabs)

 </h3>

@@ -25,14 +25,12 @@ exo: Run your own AI cluster at home with everyday devices. Maintained by [exo l
 Forget expensive NVIDIA GPUs, unify your existing devices into one powerful GPU: iPhone, iPad, Android, Mac, Linux, pretty much any device!

 <div align="center">
-<h2>Update: Exo Supports Llama 3.1</h2>
-<p>Run 8B, 70B and 405B parameter Llama 3.1 models on your own devices</p>
-<p><a href="https://github.com/exo-explore/exo/blob/main/exo/inference/mlx/models/llama.py">See the code</a></p>
+<h2>Update: exo is hiring. See <a href="https://exolabs.net">here</a> for more details.</h2>
 </div>

 ## Get Involved

-exo is **experimental** software. Expect bugs early on. Create issues so they can be fixed. The [exo labs](https://x.com/exolabs_) team will strive to resolve issues quickly.
+exo is **experimental** software. Expect bugs early on. Create issues so they can be fixed. The [exo labs](https://x.com/exolabs) team will strive to resolve issues quickly.

 We also welcome contributions from the community. We have a list of bounties in [this sheet](https://docs.google.com/spreadsheets/d/1cTCpTIp48UnnIvHeLEUNg1iMy_Q6lRybgECSFCoVJpE/edit?usp=sharing).

@@ -52,7 +50,7 @@ exo will [automatically discover](https://github.com/exo-explore/exo/blob/945f90

 ### ChatGPT-compatible API

-exo provides a [ChatGPT-compatible API](exo/api/chatgpt_api.py) for running models. It's a [one-line change](examples/chatgpt_api.py) in your application to run models on your own hardware using exo.
+exo provides a [ChatGPT-compatible API](exo/api/chatgpt_api.py) for running models. It's a [one-line change](examples/chatgpt_api.sh) in your application to run models on your own hardware using exo.

 ### Device Equality

@@ -108,8 +106,6 @@ python3 main.py

 That's it! No configuration required - exo will automatically discover the other device(s).

-The native way to access models running on exo is using the exo library with peer handles. See how in [this example for Llama 3](examples/llama3_distributed.py).
-
 exo starts a ChatGPT-like WebUI (powered by [tinygrad tinychat](https://github.com/tinygrad/tinygrad/tree/master/examples/tinychat)) on http://localhost:8000

 For developers, exo also starts a ChatGPT-compatible API endpoint on http://localhost:8000/v1/chat/completions. Example with curls:
@@ -150,6 +146,26 @@ curl http://localhost:8000/v1/chat/completions \
 }'
 ```

+### Example Usage on Multiple Heterogenous Devices (MacOS + Linux)
+
+#### Device 1 (MacOS):
+
+```sh
+python3 main.py --inference-engine tinygrad
+```
+
+Here we explicitly tell exo to use the **tinygrad** inference engine.
+
+#### Device 2 (Linux):
+```sh
+python3 main.py
+```
+
+Linux devices will automatically default to using the **tinygrad** inference engine.
+
+You can read about tinygrad-specific env vars [here](https://docs.tinygrad.org/env_vars/). For example, you can configure tinygrad to use the cpu by specifying `CLANG=1`.
+
 ## Debugging

 Enable debug logs with the DEBUG environment variable (0-9).
@@ -158,6 +174,12 @@ Enable debug logs with the DEBUG environment variable (0-9).
 DEBUG=9 python3 main.py
 ```

+For the **tinygrad** inference engine specifically, there is a separate DEBUG flag `TINYGRAD_DEBUG` that can be used to enable debug logs (1-6).
+
+```sh
+TINYGRAD_DEBUG=2 python3 main.py
+```
+
 ## Known Issues

 - 🚧 As the library is evolving so quickly, the iOS implementation has fallen behind Python. We have decided for now not to put out the buggy iOS version and receive a bunch of GitHub issues for outdated code. We are working on solving this properly and will make an announcement when it's ready. If you would like access to the iOS implementation now, please email [email protected] with your GitHub username explaining your use-case and you will be granted access on GitHub.
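Since the README above documents a ChatGPT-compatible endpoint at http://localhost:8000/v1/chat/completions, the "one-line change" can also be illustrated with an OpenAI-style Python client. A minimal sketch, assuming the `openai` package is installed; the model id is the one used in examples/chatgpt_api.sh and the dummy API key is an assumption (a local exo node is not expected to require one):

```python
# Hedged sketch: point an OpenAI-compatible client at exo's local endpoint.
# Assumes the `openai` Python package; it is not a dependency of this commit.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")  # placeholder key

response = client.chat.completions.create(
    model="llama-3.1-8b",  # model id used in examples/chatgpt_api.sh
    messages=[{"role": "user", "content": "What is the meaning of exo?"}],
    temperature=0.7,
)
print(response.choices[0].message.content)
```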

examples/chatgpt_api.sh

+39
@@ -0,0 +1,39 @@
+# exo provides an API that aims to be a drop-in replacements for the ChatGPT-API.
+# This example shows how you can use the API first without streaming and second with streaming.
+# This works the same in a single-node set up and in a multi-node setup.
+# You need to start exo before running this by running `python3 main.py`.
+
+API_ENDPOINT="http://${API_ENDPOINT:-$(ifconfig | grep 'inet ' | grep -v '127.0.0.1' | awk '{print $2}' | head -n 1):8000}"
+MODEL="llama-3.1-8b"
+PROMPT="What is the meaning of exo?"
+TEMPERATURE=0.7
+
+echo ""
+echo ""
+echo "--- Output without streaming:"
+echo ""
+curl "${API_ENDPOINT}/v1/chat/completions" --silent \
+  -H "Content-Type: application/json" \
+  -d '{
+    "model": "'"${MODEL}"'",
+    "messages": [{"role": "user", "content": "'"${PROMPT}"'"}],
+    "temperature": '"${TEMPERATURE}"'
+  }'
+
+echo ""
+echo ""
+echo "--- Output with streaming:"
+echo ""
+curl "${API_ENDPOINT}/v1/chat/completions" --silent \
+  -H "Content-Type: application/json" \
+  -d '{
+    "model": "'"${MODEL}"'",
+    "messages": [{"role": "user", "content": "'"${PROMPT}"'"}],
+    "temperature": '"${TEMPERATURE}"',
+    "stream": true
+  }' | while read -r line; do
+    if [[ $line == data:* ]]; then
+      content=$(echo "$line" | sed 's/^data: //')
+      echo "$content" | jq -r '.choices[].delta.content' --unbuffered | tr -d '\n'
+    fi
+  done
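For comparison, a hedged Python sketch of the same streaming call: the endpoint path, model, prompt, and the `data:` line format mirror the script above, while the use of the `requests` package and the `[DONE]` sentinel (an OpenAI-API convention) are assumptions:

```python
# Hedged Python equivalent of the streaming half of the script above.
import json
import requests  # assumption: not a dependency of this repo

resp = requests.post(
    "http://localhost:8000/v1/chat/completions",
    json={
        "model": "llama-3.1-8b",
        "messages": [{"role": "user", "content": "What is the meaning of exo?"}],
        "temperature": 0.7,
        "stream": True,
    },
    stream=True,
)
for line in resp.iter_lines(decode_unicode=True):
    if not line or not line.startswith("data: "):
        continue
    payload = line[len("data: "):]
    if payload.strip() == "[DONE]":  # OpenAI-style sentinel; may or may not be emitted by exo
        break
    chunk = json.loads(payload)
    for choice in chunk.get("choices", []):
        print(choice.get("delta", {}).get("content") or "", end="", flush=True)
print()
```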

examples/llama3_distributed.py

-81
This file was deleted.

exo/api/chatgpt_api.py

+19 -16

@@ -58,6 +58,9 @@ def generate_completion(
       "finish_reason": finish_reason,
     }],
   }
+
+  if DEBUG >= 3:
+    print(f"completion: {completion}")

   if not stream:
     completion["usage"] = {
@@ -67,9 +70,16 @@ def generate_completion(
     }

   choice = completion["choices"][0]
+  print(f"\nchoice {choice}")
   if object_type.startswith("chat.completion"):
     key_name = "delta" if stream else "message"
-    choice[key_name] = {"role": "assistant", "content": tokenizer.decode(tokens)}
+
+    token_decode = tokenizer.batch_decode(
+      tokens,
+      skip_special_tokens=True,
+      clean_up_tokenization_spaces=False
+    )
+    choice[key_name] = {"role": "assistant", "content": token_decode}
   elif object_type == "text_completion":
     choice["text"] = tokenizer.decode(tokens)
   else:
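The replaced line used `tokenizer.decode(tokens)`, which joins a flat list of token ids into one string; `tokenizer.batch_decode(tokens, ...)` decodes each element of its input separately, so the resulting `content` has a different shape. A minimal sketch of the difference with a Hugging Face tokenizer (the `transformers` package and the "gpt2" tokenizer are illustrative assumptions, not part of this commit):

```python
# Illustration of decode vs. batch_decode on a flat list of token ids.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")  # arbitrary tokenizer for the demo
tokens = tokenizer.encode("What is the meaning of exo?")

print(tokenizer.decode(tokens))  # one joined string
print(tokenizer.batch_decode(
    tokens,
    skip_special_tokens=True,
    clean_up_tokenization_spaces=False,
))  # a list with one decoded string per element of `tokens`
```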
@@ -113,16 +123,9 @@ def remap_messages(messages: List[Message]) -> List[Message]:


 def build_prompt(tokenizer, _messages: List[Message]):
-  if len(_messages) == 1:
-    user_msg = _messages[0]
-
-    # get instruct sys message
-    sys_msg = Message(role="system", content="You are a helpful assistant.")
-
-    # restructure for sys_msg to go first
-    _messages = [sys_msg, user_msg]
-
   messages = remap_messages(_messages)
+  if DEBUG >= 3:
+    print(f"messages: {messages}")
   prompt = tokenizer.apply_chat_template(
     messages,
     tokenize=False,
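build_prompt now passes the remapped messages straight to the tokenizer's chat template instead of injecting a default system message first. A hedged sketch of that call in isolation (the model name and the `add_generation_prompt` flag are assumptions; the hunk above is cut off before any further keyword arguments):

```python
# Standalone sketch of the chat-template step used in build_prompt above.
# Assumes a transformers tokenizer that ships a chat template; the model name is illustrative.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("mlx-community/Meta-Llama-3.1-8B-Instruct-4bit")
messages = [{"role": "user", "content": "What is the meaning of exo?"}]

prompt = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,  # assumption: append the assistant header before generation
)
print(prompt)
```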
@@ -140,7 +143,7 @@ def build_prompt(tokenizer, _messages: List[Message]):
       continue

     for content in message.content:
-      # note: we only support one image at a time right now. Multiple is possible. See: https://github.com/huggingface/transformers/blob/e68ec18ce224af879f22d904c7505a765fb77de3/docs/source/en/model_doc/llava.md?plain=1#L41
+      # note: wae only support one image at time right now. Multiple is possible. See: https://github.com/huggingface/transformers/blob/e68ec18ce224af879f22d904c7505a765fb77de3/docs/source/en/model_doc/llava.md?plain=1#L41
       # follows the convention in https://platform.openai.com/docs/guides/vision
       if isinstance(content, dict) and content.get("type", None) == "image":
         image_str = content.get("image", None)
@@ -171,10 +174,10 @@ def __init__(self, request_id: str, timestamp: int, prompt: str):


 class ChatGPTAPI:
-  def __init__(self, node: Node, inference_engine_classname: str, response_timeout_secs: int = 90, on_chat_completion_request: Callable[[str, ChatCompletionRequest, str], None] = None):
+  def __init__(self, node: Node, inference_engine_classname: str, response_timeout: int = 90, on_chat_completion_request: Callable[[str, ChatCompletionRequest, str], None] = None):
     self.node = node
     self.inference_engine_classname = inference_engine_classname
-    self.response_timeout_secs = response_timeout_secs
+    self.response_timeout = response_timeout
     self.on_chat_completion_request = on_chat_completion_request
     self.app = web.Application(client_max_size=100*1024*1024) # 100MB to support image upload
     self.prompts: PrefixDict[str, PromptSession] = PrefixDict()
@@ -273,7 +276,7 @@ async def handle_post_chat_completions(self, request):
       return web.json_response({"detail": f"Error processing prompt (see logs with DEBUG>=2): {str(e)}"}, status=500)

     try:
-      if DEBUG >= 2: print(f"Waiting for response to finish. timeout={self.response_timeout_secs}s")
+      if DEBUG >= 2: print(f"Waiting for response to finish. timeout={self.response_timeout}s")

       if stream:
         response = web.StreamResponse(
@@ -322,7 +325,7 @@ def on_result(_request_id: str, tokens: List[int], is_finished: bool):

           return _request_id == request_id and is_finished

-        _, tokens, _ = await callback.wait(on_result, timeout=self.response_timeout_secs)
+        _, tokens, _ = await callback.wait(on_result, timeout=self.response_timeout)
         if request_id in self.stream_tasks: # in case there is still a stream task running, wait for it to complete
           if DEBUG >= 2: print("Pending stream task. Waiting for stream task to complete.")
           try:
@@ -334,7 +337,7 @@ def on_result(_request_id: str, tokens: List[int], is_finished: bool):
       else:
         _, tokens, _ = await callback.wait(
           lambda _request_id, tokens, is_finished: _request_id == request_id and is_finished,
-          timeout=self.response_timeout_secs,
+          timeout=self.response_timeout,
         )

         finish_reason = "length"
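The renamed `response_timeout` is ultimately a per-request bound on how long `callback.wait(...)` blocks for tokens. That callback object is internal to exo; as a hedged illustration of the same pattern only, here is how a timeout-bounded wait looks with plain asyncio:

```python
# Illustrative only: exo's callback.wait is internal to the repo. This shows the generic
# "wait for a result, but give up after response_timeout seconds" pattern it follows.
import asyncio

async def wait_for_tokens(result: asyncio.Future, response_timeout: int = 90):
    try:
        return await asyncio.wait_for(result, timeout=response_timeout)
    except asyncio.TimeoutError:
        return None  # caller maps this to an HTTP timeout error
```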
