
With exo unable to run llama-3.2-1b #459

Open
FFAMax opened this issue Nov 14, 2024 · 5 comments
Comments

@FFAMax
Contributor

FFAMax commented Nov 14, 2024

It's trying to load but never completes.

Removing download task for Shard(model_id='llama-3.2-1b', start_layer=0, end_layer=15, n_layers=16): True
  0%|                                                                                                           | 0/148 [00:00<?, ?it/s]
ram used:  0.00 GB, layers.0.attention.wq.weight                      :   1%|▏                          | 1/148 [00:00<00:07, 19.19it/s]
ram used:  0.01 GB, layers.0.attention.wk.weight                      :   1%|▎                          | 2/148 [00:00<00:07, 19.45it/s]
ram used:  0.01 GB, layers.0.attention.wv.weight                      :   2%|▌                          | 3/148 [00:00<00:05, 26.73it/s]
....
ram used:  2.47 GB, output.weight                                     :  99%|████████████████████████▊| 147/148 [00:07<00:00, 18.38it/s]
ram used:  2.47 GB, freqs_cis                                         : 100%|█████████████████████████| 148/148 [00:07<00:00, 18.50it/s]
ram used:  2.47 GB, freqs_cis                                         : 100%|█████████████████████████| 148/148 [00:08<00:00, 18.50it/s]
loaded weights in 8005.22 ms, 2.47 GB loaded at 0.31 GB/s
  0%|                                                                                                           | 0/148 [00:00<?, ?it/s]
ram used:  2.47 GB, layers.0.attention.wq.weight                      :   1%|▏                         | 1/148 [00:00<00:00, 224.25it/s]
ram used:  2.48 GB, layers.0.attention.wk.weight                      :   1%|▎                         | 2/148 [00:00<00:00, 221.67it/s]
ram used:  2.48 GB, layers.0.attention.wo.weight                      :   3%|▋                         | 4/148 [00:00<00:00, 262.63it/s]
...
ram used:  4.42 GB, tok_embeddings.weight                             :  99%|████████████████████████▋| 146/148 [00:07<00:00, 18.27it/s]
ram used:  4.94 GB, freqs_cis                                         : 100%|█████████████████████████| 148/148 [00:07<00:00, 18.51it/s]
ram used:  4.94 GB, freqs_cis                                         : 100%|█████████████████████████| 148/148 [00:07<00:00, 18.50it/s]
loaded weights in 8002.23 ms, 2.47 GB loaded at 0.31 GB/s
  0%|                                                                                                           | 0/148 [00:00<?, ?it/s]
ram used:  4.94 GB, layers.0.attention.wq.weight                      :   1%|▏                         | 1/148 [00:00<00:00, 179.52it/s]
ram used:  4.95 GB, layers.0.attention.wk.weight                      :   1%|▎                          | 2/148 [00:00<00:15,  9.20it/s]
...
ram used:  7.41 GB, freqs_cis                                         : 100%|█████████████████████████| 148/148 [00:08<00:00, 18.31it/s]
ram used:  7.41 GB, freqs_cis                                         : 100%|█████████████████████████| 148/148 [00:08<00:00, 18.31it/s]
loaded weights in 8087.63 ms, 2.47 GB loaded at 0.31 GB/s
  0%|                                                                                                           | 0/148 [00:00<?, ?it/s]
ram used:  7.41 GB, layers.0.attention.wq.weight                      :   1%|▏                         | 1/148 [00:00<00:00, 202.87it/s]
ram used:  7.42 GB, layers.0.attention.wk.weight                      :   1%|▎                         | 2/148 [00:00<00:00, 207.12it/s]
ram used:  7.43 GB, layers.0.attention.wo.weight

Final:

ram used: 11.83 GB, tok_embeddings.weight                             :  99%|████████████████████████▋| 146/148 [00:08<00:00, 18.15it/s]
ram used: 12.36 GB, output.weight                                     :  99%|████████████████████████▊| 147/148 [00:08<00:00, 18.27it/s]
ram used: 12.36 GB, freqs_cis                                         : 100%|█████████████████████████| 148/148 [00:08<00:00, 18.39it/s]
ram used: 12.36 GB, freqs_cis                                         : 100%|█████████████████████████| 148/148 [00:08<00:00, 18.38it/s]
loaded weights in 8055.68 ms, 2.47 GB loaded at 0.31 GB/s
Task exception was never retrieved
future: <Task finished name='Task-447' coro=<StandardNode.process_prompt() done, defined at
/home/user/exo/exo/orchestration/standard_node.py:144> exception=RuntimeError('Wait timeout: 10000 ms! (the signal is not set to 7830,
but 7828)')>
Traceback (most recent call last):
  File "/home/user/exo/exo/orchestration/standard_node.py", line 166, in process_prompt
    resp = await self._process_prompt(base_shard, prompt, request_id)
  File "/home/user/exo/exo/orchestration/standard_node.py", line 198, in _process_prompt
    result = await self.inference_engine.infer_prompt(request_id, shard, prompt)
  File "/home/user/exo/exo/inference/inference_engine.py", line 29, in infer_prompt
    output_data = await self.infer_tensor(request_id, shard, tokens)
  File "/home/user/exo/exo/inference/tinygrad/inference.py", line 88, in infer_tensor
    return output_data.numpy()
  File "/home/user/exo/lib/python3.10/site-packages/tinygrad/tensor.py", line 3500, in _wrapper
    ret = fn(*args, **kwargs)
  File "/home/user/exo/lib/python3.10/site-packages/tinygrad/tensor.py", line 310, in numpy
    return np.frombuffer(self._data(), dtype=_to_np_dtype(self.dtype)).reshape(self.shape)
  File "/home/user/exo/lib/python3.10/site-packages/tinygrad/tensor.py", line 3475, in _wrapper
    if _METADATA.get() is not None: return fn(*args, **kwargs)
  File "/home/user/exo/lib/python3.10/site-packages/tinygrad/tensor.py", line 254, in _data
    cpu = self.cast(self.dtype.scalar()).contiguous().to("CLANG").realize()
  File "/home/user/exo/lib/python3.10/site-packages/tinygrad/tensor.py", line 3475, in _wrapper
    if _METADATA.get() is not None: return fn(*args, **kwargs)
  File "/home/user/exo/lib/python3.10/site-packages/tinygrad/tensor.py", line 213, in realize
    run_schedule(*self.schedule_with_vars(*lst), do_update_stats=do_update_stats)
  File "/home/user/exo/lib/python3.10/site-packages/tinygrad/engine/realize.py", line 224, in run_schedule
    ei.run(var_vals, do_update_stats=do_update_stats)
  File "/home/user/exo/lib/python3.10/site-packages/tinygrad/engine/realize.py", line 174, in run
    et = self.prg(bufs, var_vals if var_vals is not None else {}, wait=wait or DEBUG >= 2)
  File "/home/user/exo/lib/python3.10/site-packages/tinygrad/engine/realize.py", line 140, in __call__
    self.copy(dest, src)
  File "/home/user/exo/lib/python3.10/site-packages/tinygrad/engine/realize.py", line 135, in copy
    dest.copyin(src.as_buffer(allow_zero_copy=True))  # may allocate a CPU buffer depending on allow_zero_copy
  File "/home/user/exo/lib/python3.10/site-packages/tinygrad/device.py", line 114, in as_buffer
    return self.copyout(memoryview(bytearray(self.nbytes)))
  File "/home/user/exo/lib/python3.10/site-packages/tinygrad/device.py", line 125, in copyout
    self.allocator.copyout(mv, self._buf)
  File "/home/user/exo/lib/python3.10/site-packages/tinygrad/device.py", line 664, in copyout
    self.device.timeline_signal.wait(self.device.timeline_value)
  File "/home/user/exo/lib/python3.10/site-packages/tinygrad/device.py", line 424, in wait
    raise RuntimeError(f"Wait timeout: {timeout} ms! (the signal is not set to {value}, but {self.value})")
RuntimeError: Wait timeout: 10000 ms! (the signal is not set to 7830, but 7828)
Task exception was never retrieved
future: <Task finished name='Task-30321' coro=<StandardNode.process_prompt() done, defined at
/home/user/exo/exo/orchestration/standard_node.py:144> exception=RuntimeError('Wait timeout: 10000 ms! (the signal is not set to 8753,
but 7828)')>
Traceback (most recent call last):
  File "/home/user/exo/exo/orchestration/standard_node.py", line 166, in process_prompt
    resp = await self._process_prompt(base_shard, prompt, request_id)
  File "/home/user/exo/exo/orchestration/standard_node.py", line 198, in _process_prompt
    result = await self.inference_engine.infer_prompt(request_id, shard, prompt)
  File "/home/user/exo/exo/inference/inference_engine.py", line 29, in infer_prompt
    output_data = await self.infer_tensor(request_id, shard, tokens)
  File "/home/user/exo/exo/inference/tinygrad/inference.py", line 88, in infer_tensor
    return output_data.numpy()
  File "/home/user/exo/lib/python3.10/site-packages/tinygrad/tensor.py", line 3500, in _wrapper
    ret = fn(*args, **kwargs)
  File "/home/user/exo/lib/python3.10/site-packages/tinygrad/tensor.py", line 310, in numpy
    return np.frombuffer(self._data(), dtype=_to_np_dtype(self.dtype)).reshape(self.shape)
  File "/home/user/exo/lib/python3.10/site-packages/tinygrad/tensor.py", line 3475, in _wrapper
    if _METADATA.get() is not None: return fn(*args, **kwargs)
  File "/home/user/exo/lib/python3.10/site-packages/tinygrad/tensor.py", line 254, in _data
    cpu = self.cast(self.dtype.scalar()).contiguous().to("CLANG").realize()
  File "/home/user/exo/lib/python3.10/site-packages/tinygrad/tensor.py", line 3475, in _wrapper
    if _METADATA.get() is not None: return fn(*args, **kwargs)
  File "/home/user/exo/lib/python3.10/site-packages/tinygrad/tensor.py", line 213, in realize
    run_schedule(*self.schedule_with_vars(*lst), do_update_stats=do_update_stats)
  File "/home/user/exo/lib/python3.10/site-packages/tinygrad/engine/realize.py", line 224, in run_schedule
    ei.run(var_vals, do_update_stats=do_update_stats)
  File "/home/user/exo/lib/python3.10/site-packages/tinygrad/engine/realize.py", line 174, in run
    et = self.prg(bufs, var_vals if var_vals is not None else {}, wait=wait or DEBUG >= 2)
  File "/home/user/exo/lib/python3.10/site-packages/tinygrad/engine/realize.py", line 140, in __call__
    self.copy(dest, src)
  File "/home/user/exo/lib/python3.10/site-packages/tinygrad/engine/realize.py", line 135, in copy
    dest.copyin(src.as_buffer(allow_zero_copy=True))  # may allocate a CPU buffer depending on allow_zero_copy
  File "/home/user/exo/lib/python3.10/site-packages/tinygrad/device.py", line 114, in as_buffer
    return self.copyout(memoryview(bytearray(self.nbytes)))
  File "/home/user/exo/lib/python3.10/site-packages/tinygrad/device.py", line 125, in copyout
    self.allocator.copyout(mv, self._buf)
  File "/home/user/exo/lib/python3.10/site-packages/tinygrad/device.py", line 657, in copyout
    self.device.synchronize()
  File "/home/user/exo/lib/python3.10/site-packages/tinygrad/device.py", line 519, in synchronize
    self.timeline_signal.wait(self.timeline_value - 1)
  File "/home/user/exo/lib/python3.10/site-packages/tinygrad/device.py", line 424, in wait
    raise RuntimeError(f"Wait timeout: {timeout} ms! (the signal is not set to {value}, but {self.value})")
RuntimeError: Wait timeout: 10000 ms! (the signal is not set to 8753, but 7828)
Task exception was never retrieved
future: <Task finished name='Task-61811' coro=<StandardNode.process_prompt() done, defined at
/home/user/exo/exo/orchestration/standard_node.py:144> exception=RuntimeError('Wait timeout: 10000 ms! (the signal is not set to 8753,
but 7828)')>
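
For anyone triaging: the log above shows the same 2.47 GB of weights being loaded over and over (RAM climbing 2.47 → 4.94 → 7.41 → 12.36 GB), and the timeout then surfaces in tinygrad's copy-back path: `infer_tensor` returns `output_data.numpy()`, which forces a device-to-host copy that blocks on `timeline_signal.wait(...)`. A minimal sketch of the same call pattern outside exo (illustrative only; it assumes a working tinygrad install with a GPU backend and is not a guaranteed repro):

```python
# Illustrative sketch of the call pattern from the traceback: realizing a
# GPU tensor and copying it to host blocks on the device's timeline signal,
# which is where the "Wait timeout: 10000 ms" error is raised.
from tinygrad import Tensor

x = Tensor.randn(1024, 1024)   # lazy tensor on the default device
y = (x @ x).realize()          # schedule and run the matmul kernel
out = y.numpy()                # copyout -> synchronize -> timeline_signal.wait()
print(out.shape)               # never reached if the signal never fires
```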
@VGLALALA

Are you using tinygrad?

@FFAMax
Contributor Author

FFAMax commented Nov 14, 2024

> Are you using tinygrad?

Yes. It's a Linux machine, so the TinygradDynamicShardInferenceEngine is picked up.
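
(For context, a rough sketch of that platform-based default; the function name below is hypothetical, not exo's actual API:)

```python
import platform

def pick_default_engine() -> str:
    """Hypothetical sketch: MLX on Apple-silicon macOS, tinygrad elsewhere."""
    if platform.system() == "Darwin" and platform.machine() == "arm64":
        return "mlx"       # Apple-silicon Macs
    return "tinygrad"      # Linux and everything else

print(pick_default_engine())   # prints "tinygrad" on this Linux box
```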

@VGLALALA

You can't run Llama 3.2 with tinygrad yet; currently it only supports Llama 3.1 and Llama 3.

@johnykes

> You can't run Llama 3.2 with tinygrad yet; currently it only supports Llama 3.1 and Llama 3.

What about `exo --inference-engine mlx`?

I get errors with both on a Linux server.

Does MLX have specific system requirements (e.g., a GPU installed)?

@VGLALALA

MLX is Apple-only (Apple silicon, not even Intel Macs). MLX supports every model, and those models are quantized to 4-bit. On Linux, the only models currently available are Llama 3 and 3.1 (8B and 70B) in fp32.
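
(Two quick checks that follow from this, assuming only that MLX needs macOS on Apple silicon and that fp32 is 4 bytes per parameter:)

```python
import platform

# MLX needs macOS ("Darwin") on Apple silicon ("arm64"); a Linux server
# prints e.g. "Linux x86_64", so the MLX engine is not an option there
# regardless of what GPU is installed.
print(platform.system(), platform.machine())

# Rough weight-size estimate for the fp32 models available on Linux:
# bytes = parameters * 4 (fp32), ignoring activations and KV cache.
for name, params in [("llama-3-8b", 8e9), ("llama-3-70b", 70e9)]:
    print(f"{name}: ~{params * 4 / 1e9:.0f} GB of weights in fp32")
```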
