[Bounty] PyTorch & HuggingFace Interface #139

risingsunomi · 2024-08-11T02:24:36Z

Hello all,

I’ve made some updates to the exo library based on the bounty mentioned in this tweet/X post. These changes aim to integrate PyTorch and expand access to various language models through Hugging Face’s AutoModelForCausalLM.

What's New?

ShardedHuggingFaceModel: Adds sharding support for Hugging Face models.
PyTorchDynamicShardInferenceEngine: A new inference engine that uses PyTorch tensors for dynamic sharding.

These updates enable the exo library to use PyTorch, allowing access to a broader range of language models.

Limitations and Bugs

Right now the ShardedHuggingFaceModel is focused on using LlamaForCausalLM from the huggingface transformers library. From that model we break it up using LLamaModel and the layers it contains. We can then select the layers and run the pytorch tensors over them as need. I focused on using llama3.1 8B as I could only slightly run that.

Due to my current hardware limitations (specifically GPU and VRAM), I wasn’t able to fully test this across multiple nodes. The model currently takes about 30 seconds per token to generate for me (I have slow GPUs), which might be related to the absence of caching (not implemented due to VRAM constraints). It’s running without reaching an EOT and the outputs seem random.

Request for Feedback

I’m sharing this in the hope that others can test it on more capable setups and provide feedback on how to enhance performance and stability.

Important Note on Meta LLaMA 3.1 Model

If you plan to test with the official Meta LLaMA 3.1 model, please note:

Access: You’ll need to request access and authenticate using huggingface-cli to download it.
Command: Run the following command before using the model:
```
huggingface-cli login
```
I’m exploring ways to simplify this process, but for now, it’s necessary.

Chat API Update

Added an option to select the LLaMA 3.1 model in the chat API.

Looking forward to any feedback or suggestions you might have.

Thank you

AlexCheema · 2024-08-18T19:09:23Z

Hey, sorry for the delay. I haven't had a chance to check this properly yet. I'll be able to look next week.

risingsunomi · 2024-08-18T22:35:31Z

Hey, sorry for the delay. I haven't had a chance to check this properly yet. I'll be able to look next week.

Sounds good. Let me know anything needed. Thank you

AlexCheema · 2024-08-23T17:08:49Z

exo/inference/pytorch/model/hf.py

+
+        # self.past_key_values = DynamicCache()
+
+    def forward_layers(


I really like this approach of generalising this so it works for other models without having to explicitly implement them.

Can you write a test for a model with a different architecture to make sure this generalises e.g. recurrent Gemma?
I wonder if we need a little bit of model-specific behaviour to enable this in general?

AlexCheema · 2024-08-23T17:09:17Z

exo/inference/pytorch/helpers.py

+            async for chunk in response.content.iter_chunked(8192):
+                f.write(chunk)
+
+async def download_files(urls: List[str], output_paths: List[Path]):


No, I can remove

Please remove

Please remove

Sorry forgot to. Will do that now.

AlexCheema · 2024-08-23T17:11:26Z

exo/inference/pytorch/inference.py

+        self.shard = None
+        self.model = None
+        self.tokenizer = None
+        self.device = torch.device("cuda" if torch.cuda.is_available() else "cpu")


Are these the only options? I think supporting e.g. Mac with mps would be great since then you can run heterogeneous clusters.

One thing to try at some point would be mixing MLX and PyTorch and see if they are interoperable with exactly the same model.

With pytorch I don't think mac is fully rolled out yet. There seems to be some work arounds but CUDA and CPU are the only options on the pytorch download website. pytorch even stopped ROCm support for AMD

They have a nightly for testing MPS https://pytorch.org/blog/introducing-accelerated-pytorch-training-on-mac/

What about this in the official "stable" docs: https://pytorch.org/docs/stable/notes/mps.html

Will try that but currently no mac to test. When I get through these other fixes though I can definitely add it for you or other mac users to test.

AlexCheema · 2024-08-23T17:12:22Z

exo/inference/pytorch/model/hf.py

+        # Load the model
+        self.full_model = AutoModelForCausalLM.from_pretrained(
+            shard.model_id,
+            torch_dtype=torch.float16 if torch.cuda.is_available() else torch.float32,


Again, are these the only options? Would want support across other platforms

AlexCheema · 2024-08-23T17:13:27Z

exo/inference/pytorch/model/hf.py

+
+            layers.append(layer)
+
+        self.full_model.model.layers = nn.ModuleList(layers)


What does the peak memory usage look like here? I'm not sure of the specifics of python if this is going to hold each layer twice. Not sure but perhaps setting them in place would be more memory efficient.

They shouldn't be held twice as when the ensure_shard function is called in the infer_prompt or infer_tensor the init class function is called which loads the needed layers each time depending on the shard. Will make sure about memory limits though and usage.

AlexCheema · 2024-08-23T17:15:12Z

exo/inference/pytorch/model/hf.py

+
+
+        # Load the model
+        self.full_model = AutoModelForCausalLM.from_pretrained(


Will this download the entire model?
We have code to selectively download the model from HuggingFace so you don't have to download all layers on every device: exo/download

Also this won't work with our download progress code. We show in the TUI what the download progress of the model is.

Will this download the entire model? We have code to selectively download the model from HuggingFace so you don't have to download all layers on every device: exo/download

Will look at using that code because yes it currently does download all the model

AlexCheema · 2024-08-23T17:16:01Z

exo/inference/pytorch/model/utils.py

+import torch
+from torch.nn import functional as F
+
+def sample_logits(logits, temperature=1.0, top_k=0, top_p=1.0, alpha_f=0.0, alpha_p=0.0):


Can this be imported from somewhere rather than copy-pasta into the codebase? It looks like boilerplate code from somewhere.

Yes, I was testing it as the default values but will clean that part up. I will set it in the Interface class settings to be used.

AlexCheema · 2024-08-23T17:16:31Z

install.ps1

+& .\.venv\Scripts\Activate.ps1
+
+# Install the package in the virtual environment
+pip install .


Windows? Did this work on windows? Curious if it works there.

I was testing on Windows but couldn't fully get it working right. Will test again and make sure as I switched to using Linux to do further dev

@AlexCheema I can confirm that it is working on Windows, but there are a few issues:

PyTorch 2.4 doesn't install. In order to get it working, it needs the nightly build

In main.py

for s in [signal.SIGINT, signal.SIGTERM]: loop.add_signal_handler(s, handle_exit)

isn't supported on windows.

If I change handle_exit() to:

def handle_exit(): asyncio.ensure_future(shutdown(loop)) if platform.system() != "Windows": for s in [signal.SIGINT, signal.SIGTERM]: loop.add_signal_handler(s, handle_exit) else: # On Windows, we can only reliably catch SIGINT (Ctrl+C) signal.signal(signal.SIGINT, lambda signum, frame: handle_exit())

Getting some kind of network error between the GUI and the backend

It seems to work. It's insanely slow for me though (no GPU... the raspberry pis are much faster 😄). Windows changes perhaps out of scope for this PR though.

Wowzah!

This is exciting. A lot of the community were mad we didn't support Windows. We have a bounty here if you want to get it working there once this one is merged: #186

AlexCheema · 2024-08-23T17:18:40Z

Great work. You clearly thought about this and implemented a really nice solution. I particularly like the generalisation of model splitting, rather than doing each one separately.

Take a look through the comments I left.

AlexCheema · 2024-08-23T17:19:59Z

The main thing I want to address and test is device support. We can make this the default inference engine if it works reliably across many devices.

On that point, if we can automate the bootstrapping of the environment for each user (e.g. install drivers, whatever else is needed to run on their device) that would be great. We don't have to do this in this PR/bounty, we can do another. But I would love to discuss and figure out how this can best be done.

AlexCheema · 2024-08-23T17:21:45Z

exo/inference/pytorch/test_build_transformer.py

@@ -0,0 +1,21 @@
+import unittest


Run this test in circle ci ./.circleci/config.yml

AlexCheema · 2024-08-23T17:21:49Z

exo/inference/pytorch/test_inference_engine.py

@@ -0,0 +1,33 @@
+


Run this test in circle ci ./.circleci/config.yml

AlexCheema · 2024-08-23T17:22:33Z

exo/inference/pytorch/test_inference_engine.py

+        n_layers=12
+    )
+
+    engine = PyTorchDynamicShardInferenceEngine(


Is this test complete? We need a test that tests the model splitting. Take a look at exo/inference/test_inference_engine.py. You can just add the test there.

risingsunomi

Reviewed and looks good. Will work through notes to improve

risingsunomi · 2024-09-19T13:24:53Z

I have updated my main fork branch with the pytorch interface changes. Please take a look and test. Thank you!

AlexCheema · 2024-10-02T12:44:55Z

Hey @risingsunomi I'm thinking of making this the default inference engine on linux machines.
Could you resolve conflicts please?

risingsunomi · 2024-10-02T18:11:13Z

Hey @risingsunomi I'm thinking of making this the default inference engine on linux machines. Could you resolve conflicts please?

Will do and clean up more.

risingsunomi · 2024-10-02T18:38:21Z

@AlexCheema clean up finished and no conflicts with base branch

AlexCheema · 2024-10-03T21:31:48Z

torch not added as a dependency

AlexCheema · 2024-10-03T21:32:34Z

error loading and splitting model: Using `low_cpu_mem_usage=True` or a `device_map` requires Accelerate: `pip install accelerate`
Error processing prompt: Using `low_cpu_mem_usage=True` or a `device_map` requires Accelerate: `pip install accelerate`
Traceback (most recent call last):
  File "/Users/alex/exo/exo/main.py", line 161, in run_model_cli
    await node.process_prompt(shard, prompt, None, request_id=request_id)
  File "/Users/alex/exo/exo/orchestration/standard_node.py", line 98, in process_prompt
    resp = await self._process_prompt(base_shard, prompt, image_str, request_id, inference_state)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/alex/exo/exo/orchestration/standard_node.py", line 134, in _process_prompt
    result, inference_state, is_finished = await self.inference_engine.infer_prompt(request_id, shard, prompt, image_str, inference_state=inference_state)
                                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/alex/exo/exo/inference/pytorch/inference.py", line 101, in infer_prompt
    await self.ensure_shard(shard)
  File "/Users/alex/exo/exo/inference/pytorch/inference.py", line 271, in ensure_shard
    self.stateful_sharded_model = ShardedHuggingFaceModel(
                                  ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/alex/exo/exo/inference/pytorch/model/hf.py", line 58, in __init__
    self.llm_model = AutoModelForCausalLM.from_pretrained(
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/alex/exo/.venv/lib/python3.12/site-packages/transformers/models/auto/auto_factory.py", line 564, in from_pretrained
    return model_class.from_pretrained(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/alex/exo/.venv/lib/python3.12/site-packages/transformers/modeling_utils.py", line 3274, in from_pretrained
    raise ImportError(
ImportError: Using `low_cpu_mem_usage=True` or a `device_map` requires Accelerate: `pip install accelerate`

accelerate package needs to be installed

risingsunomi · 2024-10-03T21:34:50Z

Let me add the dependences to the exo setup.py install_requires

AlexCheema · 2024-10-03T21:35:00Z

seems to use some other downloader (perhaps transformers?)
it should use the exo downloader for integration with exo (also these other downloads aren't necessarily async friendly but the exo one is). also we are going to remove the transformers dependency as soon as possible as it bloats the whole of exo if we want to distribute this as an installable.

AlexCheema · 2024-10-03T21:36:52Z

seems to use some other downloader (perhaps transformers?) it should use the exo downloader for integration with exo (also these other downloads aren't necessarily async friendly but the exo one is). also we are going to remove the transformers dependency as soon as possible as it bloats the whole of exo if we want to distribute this as an installable.

for this look at the other inference engine implementations for reference. you should first download using the exo downloader then point PyTorch to the directory on disk.

Node fixes jan242025

…l model to route needed model structure but same base mha structure, updated tests, updated inference engine, generalized some llm functions in llm_utils.py

Qwen dev jan25

--help list --interence-engine=torch

risingsunomi · 2025-02-14T01:45:45Z

Have updated branch again. Waiting on this to be approved. Please let me know any updates or any changes needed. Going to test to see if the new updates broke anything but so far so good.

amdyuzhoulu · 2025-03-06T02:26:12Z

Hi, @risingsunomi . Very nice project!

After testing the latest commit (2d1a) on both 4090D and 7900XTX, I am pleased to report that it runs and produces output.

However, there are several issues compared to the original TinyGrad implementation that need addressing:

On the Qwen-1.5B model with the 7900XTX GPU, the output results are nonsensical and do not meet expectations.
With the Qwen-0.5B model using either the 7900XTX or 4090D GPUs, there are problems with memory allocation and deallocation. Specifically, while the TinyGrad solution occupies about 4GB of VRAM, due to what seems to be an incorrect memory deallocation policy, the current version's VRAM usage spikes significantly, consuming almost all available VRAM and cause EXO to halt.

I hope this feedback will be helpful for further improvements.

risingsunomi · 2025-03-07T13:18:36Z

Hi, @risingsunomi . Very nice project!

After testing the latest commit (2d1a) on both 4090D and 7900XTX, I am pleased to report that it runs and produces output.

However, there are several issues compared to the original TinyGrad implementation that need addressing:

On the Qwen-1.5B model with the 7900XTX GPU, the output results are nonsensical and do not meet expectations.

With the Qwen-0.5B model using either the 7900XTX or 4090D GPUs, there are problems with memory allocation and deallocation. Specifically, while the TinyGrad solution occupies about 4GB of VRAM, due to what seems to be an incorrect memory deallocation policy, the current version's VRAM usage spikes significantly, consuming almost all available VRAM and cause EXO to halt.

I hope this feedback will be helpful for further improvements.

Hi @amdyuzhoulu

Appreciate you trying it out and I am happy to hear it is working. Hope to have this PR accepted by @AlexCheema and we can get this working for others. The VRAM/RAM issue in Tinygrad is also present in PyTorch. The quick fix is to clear and reload cache, weights and model with the last chat message + a small chat summary. Without that, with each chat prompt, the kvcache exponentially grows in VRAM/RAM. When reloading from the last message + a summary or a small number of past messages, the kvcache is smaller.

When trying to use the builtin clearing cache function inside our clear_model method, there was some possible async issues as space wasn't freed up. A full reset via deleting the shard in the inference engine seems to be the best way to reload but causes some delays. After reloading, inference is quick but depending on the machine, it is a bit of a bottleneck.

It is only done in two occasions

When encoding a prompt from user/system if it already has a model loaded
When hitting a torch.cuda.OutOfMemoryError during inference in infer_tensor.

We still keep a OOM counter but it is just for logging currently. OOM seems to only happen once and, if again, it is shut down by the operating system, at least on Linux.

Possibly we can add a way to track VRAM/RAM usage during inference and make it more smart. Will to work on that further implementation after hopefully getting this PR accepted.

Thank you,
Vincent

amdyuzhoulu · 2025-03-11T10:01:17Z

Hi, @risingsunomi . Very nice project!

Thank you for your swift follow-up.
Congratulations on the progress so far! Utilizing Torch memory profiling tools and modifying the tests/test_llama3_full.py file, I identified the segment of code responsible for the memory allocation issues.

@@ -232,14 +232,14 @@ class ShardedGeneralModel(nn.Module):
       print(f"mask: {mask}")
       print(f"input_pos: {input_pos}")
 
-    
-    model_output = self.model(
-      tokens=tokens,
-      mask=mask,
-      input_pos=input_pos,
-      hidden_state=hidden_state,
-      dtype=self.dtype
-    )
+    with torch.no_grad():
+      model_output = self.model(
+        tokens=tokens,
+        mask=mask,
+        input_pos=input_pos,
+        hidden_state=hidden_state,
+        dtype=self.dtype
+      )
 
     if self.shard.is_last_layer():
       model_logits = model_output

The modifications implemented have significantly improved performance; both the 0.5B and 7B models can now generate extensive responses without irregular GPU VRAM occupation using GPU 4090D.

I kindly request that you consider merging these changes. Additionally, there are a few more observations and potential areas for improvement:

After completing an answer, if a new question is asked, the GPU VRAM appears to be cleared before reloading from disk, which may affect initial token generation performance concurrently on speed.
Inference speed decreases with each subsequent question (1st: 10tk/s, 2nd: ~6tk/s, 3rd: ~4.5tk/s), indicating a possible issue with performance degradation over time.

A single CPU core shows very high workload while the GPU utilization remains around 30%. This suggests a potential bottleneck or inefficiency that might need further investigation, especially when compared to the Tinygrad implementation where GPU utilization stays close to 100% throughout inference without significant speed drops.

This is the TinyGrad-version exo on 7900XTX. 7900XTX GPU should be slower than 4090D for 40%~.

I plan to delve deeper into the memory clean-up functions and other related areas to investigate these issues further.

Looking forward to your feedback.

Best,
Yuzhou Lu

risingsunomi · 2025-03-11T13:44:13Z

Hi, @risingsunomi . Very nice project!

Thank you for your swift follow-up. Congratulations on the progress so far! Utilizing Torch memory profiling tools and modifying the tests/test_llama3_full.py file, I identified the segment of code responsible for the memory allocation issues.
@@ -232,14 +232,14 @@ class ShardedGeneralModel(nn.Module):
       print(f"mask: {mask}")
       print(f"input_pos: {input_pos}")
 
-    
-    model_output = self.model(
-      tokens=tokens,
-      mask=mask,
-      input_pos=input_pos,
-      hidden_state=hidden_state,
-      dtype=self.dtype
-    )
+    with torch.no_grad():
+      model_output = self.model(
+        tokens=tokens,
+        mask=mask,
+        input_pos=input_pos,
+        hidden_state=hidden_state,
+        dtype=self.dtype
+      )
 
     if self.shard.is_last_layer():
       model_logits = model_output
The modifications implemented have significantly improved performance; both the 0.5B and 7B models can now generate extensive responses without irregular GPU VRAM occupation using GPU 4090D.
I kindly request that you consider merging these changes. Additionally, there are a few more observations and potential areas for improvement:

After completing an answer, if a new question is asked, the GPU VRAM appears to be cleared before reloading from disk, which may affect initial token generation performance concurrently on speed.

Inference speed decreases with each subsequent question (1st: 10tk/s, 2nd: ~6tk/s, 3rd: ~4.5tk/s), indicating a possible issue with performance degradation over time.

3. A single CPU core shows very high workload while the GPU utilization remains around 30%. This suggests a potential bottleneck or inefficiency that might need further investigation, especially when compared to the Tinygrad implementation where GPU utilization stays close to 100% throughout inference without significant speed drops. This is the TinyGrad-version exo on 7900XTX. 7900XTX GPU should be slower than 4090D for 40%~.
I plan to delve deeper into the memory clean-up functions and other related areas to investigate these issues further.

Looking forward to your feedback.

Best, Yuzhou Lu

Hi Yuzhou,

I appreciate the work on this and will accept the changes. Please open a PR on my fork and will merge it in. I can also then take out the resetting of the model/shard and that should help with the other issues. Let me know if you see something else though to improve this. Hopefully @AlexCheema can make it so we can share the bounty.

Thank you and, again, appreciate your work, time and effort on this. I needed another pair of skilled eyes to see more of what I couldn't.

Vincent

AlexCheema reviewed Aug 23, 2024

View reviewed changes

exo/inference/pytorch/test_inference_engine.py Outdated

@@ -0,0 +1,33 @@

Copy link

Contributor

AlexCheema Aug 23, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Run this test in circle ci ./.circleci/config.yml

AlexCheema reviewed Aug 23, 2024

View reviewed changes

risingsunomi commented Aug 23, 2024

View reviewed changes

This was referenced Aug 27, 2024

[BOUNTY - $500] Llama.cpp inference engine #167

Open

[BOUNTY - $200] Windows native support #186

Open

This was referenced Sep 27, 2024

Error: Failed to fetch completions: Error processing prompt (see logs with DEBUG>=2): No module named '_posixshmem' #237

Open

Support local and custom models #246

Draft

AlexCheema mentioned this pull request Oct 2, 2024

Many things to improve or fix #269

Open

risingsunomi and others added 24 commits January 24, 2025 14:53

model load issue

efcb5b9

pass token issue

4bf752b

pass token issue

7cc42f1

pass token issue

4e3e53e

Merge pull request #46 from risingsunomi/node-fixes-jan242025

c3bde74

Node fixes jan242025

removing screenshot

a09956e

fix for sharded weights

f508dff

adding torch support for llama-3.2-3b

8920a87

fixing tok_embeddings

1d7262d

fixing caching setup

ec91e09

adding qwen2 model, creating a general multihead attention transforma…

a7757d3

…l model to route needed model structure but same base mha structure, updated tests, updated inference engine, generalized some llm functions in llm_utils.py

removing duplicate mlp for single layer_mlp

1431d48

fixes to general mha for detecting to use_tied or not

7ad4b1c

Merge pull request #47 from risingsunomi/qwen-dev-jan25

fbb6e55

Qwen dev jan25

Merge branch 'main' of github.com:exo-explore/exo into exo-explore-main

f7028c7

Merge branch 'exo-explore-main'

5f6b22d

adding new shard download method, adding all llama models for torch

76e141a

adding torch support for qwen models

85d25c1

updating torch to 2.6.0 latest stable

57b43f7

test for mistral support

611bffb

--help list --interence-engine=torch

0523893

Merge branch 'exo-explore:main' into main

2bc2b3d

Merge pull request #49 from divinity76/patch-2

aac3e75

--help list --interence-engine=torch

Merge branch 'exo-explore:main' into main

2d1a9a6

Merge branch 'exo-explore:main' into main

3f93257


		layers.append(layer)

		self.full_model.model.layers = nn.ModuleList(layers)



		# Load the model
		self.full_model = AutoModelForCausalLM.from_pretrained(

[Bounty] PyTorch & HuggingFace Interface #139

Are you sure you want to change the base?

[Bounty] PyTorch & HuggingFace Interface #139

Conversation

risingsunomi commented Aug 11, 2024

What's New?

Limitations and Bugs

Request for Feedback

Important Note on Meta LLaMA 3.1 Model

Chat API Update

AlexCheema commented Aug 18, 2024

risingsunomi commented Aug 18, 2024

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

AlexCheema commented Aug 23, 2024

AlexCheema commented Aug 23, 2024

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

risingsunomi left a comment

Choose a reason for hiding this comment

risingsunomi commented Sep 19, 2024

AlexCheema commented Oct 2, 2024

risingsunomi commented Oct 2, 2024

risingsunomi commented Oct 2, 2024

AlexCheema commented Oct 3, 2024

AlexCheema commented Oct 3, 2024

risingsunomi commented Oct 3, 2024

AlexCheema commented Oct 3, 2024 • edited Loading

AlexCheema commented Oct 3, 2024

risingsunomi commented Feb 14, 2025

amdyuzhoulu commented Mar 6, 2025 • edited Loading

risingsunomi commented Mar 7, 2025

amdyuzhoulu commented Mar 11, 2025

risingsunomi commented Mar 11, 2025

AlexCheema commented Oct 3, 2024 •

edited

Loading

amdyuzhoulu commented Mar 6, 2025 •

edited

Loading