clang issues on ubuntu 24.04 and Python 3.12 #458

Open
tunguz opened this issue Nov 14, 2024 · 17 comments

@tunguz commented Nov 14, 2024

[Screenshot from 2024-11-14 08-18-52]

After installing exo and clang on an Ubuntu 24.04 machine with a Ryzen CPU, I got an error while trying to run a prompt (see attached screenshot). Anyone have any idea what might be going on?

@devinatkin

I got this same issue myself and am currently trying my hand at diagnosing the cause. I'm guessing it has something to do with the fact that it's running Python in a venv.

[image attached]

I'm on a beefy setup, so I know it's not a resource-constraint issue (two 1080 Tis).

@tunguz (Author) commented Nov 14, 2024

@devinatkin I don't think it's a venv issue. I'm running it on bare-metal Ubuntu 24.04, which ships with Python 3.12 as the default system Python.

@devinatkin

> @devinatkin I don't think it's a venv issue. I'm running it on bare-metal Ubuntu 24.04, which ships with Python 3.12 as the default system Python.

Well, that's good to know. I'm on a fresh install of Ubuntu 24.04.1 LTS, and I decided to try with just the one machine (a pretty good one) before adding the rest of the junk heap.

@devinatkin

Error processing prompt: Command '['clang', '-shared', '-march=native', '-O2', '-Wall', '-Werror', '-x', 'c', '-fPIC', '-ffreestanding', '-nostdlib', '-', '-o', '/tmp/tmp8o4ea_pa']' returned non-zero exit status 1.
Traceback (most recent call last):
  File "/home/dmatkin/exo/exo/main.py", line 193, in run_model_cli
    await node.process_prompt(shard, prompt, request_id=request_id)
  File "/home/dmatkin/exo/exo/orchestration/standard_node.py", line 166, in process_prompt
    resp = await self._process_prompt(base_shard, prompt, request_id)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/dmatkin/exo/exo/orchestration/standard_node.py", line 198, in _process_prompt
    result = await self.inference_engine.infer_prompt(request_id, shard, prompt)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/dmatkin/exo/exo/inference/inference_engine.py", line 28, in infer_prompt
    tokens = await self.encode(shard, prompt)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/dmatkin/exo/exo/inference/tinygrad/inference.py", line 76, in encode
    await self.ensure_shard(shard)
  File "/home/dmatkin/exo/exo/inference/tinygrad/inference.py", line 99, in ensure_shard
    model_shard = await loop.run_in_executor(self.executor, build_transformer, model_path, shard, parameters)
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/concurrent/futures/thread.py", line 58, in run
    result = self.fn(*self.args, **self.kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/dmatkin/exo/exo/inference/tinygrad/inference.py", line 59, in build_transformer
    load_state_dict(model, weights, strict=False, consume=False)  # consume=True
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/dmatkin/exo/.venv/lib/python3.12/site-packages/tinygrad/nn/state.py", line 129, in load_state_dict
    else: v.replace(state_dict[k].to(v.device)).realize()
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/dmatkin/exo/.venv/lib/python3.12/site-packages/tinygrad/tensor.py", line 3500, in _wrapper
    ret = fn(*args, **kwargs)
          ^^^^^^^^^^^^^^^^^^^
  File "/home/dmatkin/exo/.venv/lib/python3.12/site-packages/tinygrad/tensor.py", line 213, in realize
    run_schedule(*self.schedule_with_vars(*lst), do_update_stats=do_update_stats)
  File "/home/dmatkin/exo/.venv/lib/python3.12/site-packages/tinygrad/engine/realize.py", line 222, in run_schedule
    for ei in lower_schedule(schedule):
  File "/home/dmatkin/exo/.venv/lib/python3.12/site-packages/tinygrad/engine/realize.py", line 215, in lower_schedule
    raise e
  File "/home/dmatkin/exo/.venv/lib/python3.12/site-packages/tinygrad/engine/realize.py", line 209, in lower_schedule
    try: yield lower_schedule_item(si)
               ^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/dmatkin/exo/.venv/lib/python3.12/site-packages/tinygrad/engine/realize.py", line 193, in lower_schedule_item
    runner = get_runner(si.outputs[0].device, si.ast)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/dmatkin/exo/.venv/lib/python3.12/site-packages/tinygrad/engine/realize.py", line 162, in get_runner
    method_cache[ckey] = method_cache[bkey] = ret = CompiledRunner(replace(prg, dname=dname))
                                                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/dmatkin/exo/.venv/lib/python3.12/site-packages/tinygrad/engine/realize.py", line 84, in __init__
    self.lib:bytes = precompiled if precompiled is not None else Device[p.dname].compiler.compile_cached(p.src)
                                                                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/dmatkin/exo/.venv/lib/python3.12/site-packages/tinygrad/device.py", line 183, in compile_cached
    lib = self.compile(src)
          ^^^^^^^^^^^^^^^^^
  File "/home/dmatkin/exo/.venv/lib/python3.12/site-packages/tinygrad/runtime/ops_clang.py", line 15, in compile
    subprocess.check_output(['clang', '-shared', *self.args, '-O2', '-Wall', '-Werror', '-x', 'c', '-fPIC', '-ffreestanding', '-nostdlib',
  File "/usr/lib/python3.12/subprocess.py", line 466, in check_output
    return run(*popenargs, stdout=PIPE, timeout=timeout, check=True,
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/subprocess.py", line 571, in run
    raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['clang', '-shared', '-march=native', '-O2', '-Wall', '-Werror', '-x', 'c', '-fPIC', '-ffreestanding', '-nostdlib', '-', '-o', '/tmp/tmp8o4ea_pa']' returned non-zero exit status 1.
Received exit signal SIGTERM...

Trying to launch with a run command seems to deliver the same type of issue.
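For what it's worth, the failing command can be exercised outside exo entirely. A minimal sketch (the kernel and output names here are made up) that feeds a trivial hand-written kernel through the same flag set as the failing invocation above:

```shell
# Probe the toolchain with tinygrad's exact clang flags on a trivial kernel.
# If even this fails, the clang setup itself is broken; if it succeeds, the
# failure is specific to the generated kernel source.
command -v clang >/dev/null || { echo "clang not installed"; exit 0; }
printf 'float add(float a, float b){ return a + b; }\n' | \
  clang -shared -march=native -O2 -Wall -Werror -x c -fPIC -ffreestanding -nostdlib \
        - -o /tmp/clang_probe.so \
  && echo "toolchain OK: failure is specific to the generated kernel" \
  || echo "toolchain probe failed: clang setup itself is broken"
```

Either outcome narrows things down: a probe failure points at the compiler install, a pass points at the kernel source tinygrad generates.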

@tunguz (Author) commented Nov 14, 2024

> > @devinatkin I don't think it's a venv issue. I'm running it on bare-metal Ubuntu 24.04, which ships with Python 3.12 as the default system Python.
>
> Well, that's good to know. I'm on a fresh install of Ubuntu 24.04.1 LTS, and I decided to try with just the one machine (a pretty good one) before adding the rest of the junk heap.

Yup, I am using a brand new machine with a brand new Ubuntu install. This was pretty much the first thing I had tried on it.

@cadenmackenzie (Contributor)

+1, I was getting this as well. I thought maybe the clang version wasn't compatible with the current tinygrad implementation, so I tried clang 14 and 16, but couldn't get it to work.

@cnukaus commented Nov 15, 2024

Adding -v to clang, you can probably see the actual error:
clang -v -include tgmath.h -shared -march=native -O2 -Wall -Werror -x c -fPIC - -o /tmp/tefsd
Ubuntu clang version 14.0.0-1ubuntu1.1
[long output trimmed]
on-dir=/root/exo -ferror-limit 19 -fgnuc-version=4.2.1 -fcolor-diagnostics -vectorize-loops -vectorize-slp -faddrsig -D__GCC_HAVE_DWARF2_CFI_ASM=1 -o /tmp/--368d98.o -x c -
clang -cc1 version 14.0.0 based upon LLVM 14.0.0 default target x86_64-pc-linux-gnu
ignoring nonexistent directory "/usr/bin/../lib/gcc/x86_64-linux-gnu/11/../../../../x86_64-linux-gnu/include"
ignoring nonexistent directory "/include"

Basically it looks like some kind of path issue, since one of the search paths has a repeating component that shouldn't repeat. The compiler is reading the source from a pipe rather than a file, so I can't debug further.
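One possible way around the compiling-from-a-pipe limitation: a hypothetical wrapper script placed ahead of the real clang in PATH that saves stdin to a file before forwarding it, so the generated kernel can be inspected and recompiled by hand. Paths below are just examples, and this assumes the source always arrives on stdin, as it does here:

```shell
# Hypothetical wrapper: capture the C source piped to clang before forwarding
# it to the real compiler. Wrapper and dump locations are examples only.
mkdir -p "$HOME/clang-wrap"
cat > "$HOME/clang-wrap/clang" <<'EOF'
#!/bin/sh
# save the piped kernel source, then hand everything to the real clang
tee /tmp/tinygrad_kernel.c | exec /usr/bin/clang "$@"
EOF
chmod +x "$HOME/clang-wrap/clang"
# Then run exo with the wrapper first in PATH, and after the failure inspect
# and recompile /tmp/tinygrad_kernel.c by hand:
#   PATH="$HOME/clang-wrap:$PATH" exo ...
```

This keeps the failing invocation byte-for-byte identical while leaving a copy of the source behind for debugging.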

@tunguz (Author) commented Nov 15, 2024

> Basically it looks like some kind of path issue, since one of the search paths has a repeating component that shouldn't repeat. The compiler is reading the source from a pipe rather than a file, so I can't debug further.

Would it be possible to get around this with a different kind of clang installation?

@jonstelly

I'm having other clang problems trying to get a Docker image together (Dockerfile, if curious), so this might just trade the issue you're seeing for the one I'm seeing. But on the question of trying other clang installation methods:

https://apt.llvm.org/ has nightly builds, other versions, etc.

I'd be curious to hear if you get things working with a different version or install method.

@AlexCheema (Contributor)

@blindcrone you're running this on your linux box right? Could you take a look at what might be the issue here? Thanks!

@blindcrone (Contributor) commented Nov 19, 2024

I've only got the tinygrad backend working on Linux machines that have GPUs. I chased this rabbit hole a bit a few weeks ago and found that it's an issue in tinygrad in general: I've yet to find any report of a Linux user getting the clang backend in tinygrad working for llama and related models. Tinygrad contains a "fix-bf16"-style function that also doesn't seem to solve the issue.

The actual bug happens in LLVM when it tries to support float16 types. It's an issue I was able to chase down in that repository; I'll look for it again and post links here. The tl;dr is that this may be patched in LLVM 19, but no distro I know of currently packages that version because of build issues.
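If the 16-bit float support in the toolchain is the culprit, that can be probed directly without tinygrad in the loop. A hedged sketch (probe file names are arbitrary); on affected toolchains, one of these one-liners typically fails with a diagnostic like the ones reported in this thread:

```shell
# Check whether the installed clang can lower _Float16 and __bf16 at all.
command -v clang >/dev/null || { echo "clang not installed"; exit 0; }
printf '_Float16 h(float x){ return (_Float16)x; }\n' | \
  clang -O2 -x c -c - -o /tmp/f16_probe.o \
  && echo "_Float16 OK" || echo "_Float16 unsupported on this toolchain"
printf 'float b(__bf16 *p){ return (float)p[0]; }\n' | \
  clang -O2 -x c -c - -o /tmp/bf16_probe.o \
  && echo "__bf16 OK" || echo "__bf16 unsupported on this toolchain"
```

On older clang releases, __bf16 is a storage-only type, so even the load-and-convert in the second probe is rejected; that would be consistent with the bf16-heavy llama weights triggering this.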

@lexasub commented Nov 20, 2024

@blindcrone I installed it (clang 20), but it's not fixed.

@blindcrone (Contributor)

Yeah, I'm on Arch and I haven't gotten a newer LLVM to install cleanly (probably I have stuff that depends on old versions, or the wrong compilers to build it), so if it doesn't work, there goes that theory.

I think I'll just write another inference engine that supports CPU. I've been digging all around that code anyway.

@lexasub commented Nov 21, 2024

@blindcrone After clearing the ccache cache, I have another problem: exo doesn't use the GPU (cluster mode, two Linux machines, RTX 3060), and I don't get an answer in the UI.
log.log

@kdkd commented Nov 22, 2024

I'm attempting this on aarch64 (Raspberry Pi 5). With stock clang 14, it fails with error: __bf16 is not supported on this target. I upgraded to clang 18, which blows up in a weirder place:

fatal error: error in backend: Cannot select: 0x555591ddaa60: f16 = fp_round 0x555591ddac90, TargetConstant:i64<0>
  0x555591ddac90: bf16,ch = load<(load (s16) from %ir.7 + 4, !tbaa !4)> 0x555591d763e0, 0x555591dd63e0, undef:i64
    0x555591dd63e0: i64 = add nuw 0x555591dd5960, Constant:i64<4>
      0x555591dd5960: i64 = add 0x555591dd59d0, 0x555591dd5a40
        0x555591dd59d0: i64,ch = CopyFromReg 0x555591d763e0, Register:i64 %3
          0x555591dd5ab0: i64 = Register %3
        0x555591dd5a40: i64 = shl nuw nsw 0x555591dd57a0, Constant:i64<3>
          0x555591dd57a0: i64,ch = CopyFromReg 0x555591d763e0, Register:i64 %0
            0x555591dd5810: i64 = Register %0
          0x555591dd5b20: i64 = Constant<3>
      0x555591dd61b0: i64 = Constant<4>
    0x555591dd58f0: i64 = undef
  0x555591dd5c00: i64 = TargetConstant<0>
In function: E_4194304_4
clang-18: error: clang frontend command failed with exit code 70 (use -v to see invocation)
Debian clang version 18.1.8 (++20240731024826+3b5b5c1ec4a3-1~exp1~20240731144843.145)

With clang 19, it builds and executes okay, but I still don't have it working, because I'm hitting socket errors immediately afterwards that I haven't debugged yet.
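The selection failure above is an f16 = fp_round of a bf16 load, so a standalone one-liner along these lines may reproduce it outside tinygrad. This is an assumption based on the DAG dump, not a confirmed repro, and the output path is arbitrary:

```shell
# Assumed minimal repro of the fp_round(bf16 -> f16) crash from the DAG dump
# above: load a __bf16 from memory and narrow it to _Float16.
command -v clang >/dev/null || { echo "clang not installed"; exit 0; }
printf '_Float16 conv(__bf16 *p){ return (_Float16)p[2]; }\n' | \
  clang -O2 -x c -c - -o /tmp/bf16_to_f16.o \
  && echo "bf16 -> f16 conversion compiles on this toolchain" \
  || echo "bf16 -> f16 conversion fails on this toolchain"
```

If this reproduces the crash on clang 18/aarch64 but passes on clang 19, that would back up the theory that the fix landed in LLVM 19.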

@Coastline-3102

I believe I'm also hitting this error. I'm on Debian 12 (bookworm) with Python 3.12.7 and clang 14.0.6.

Is there a "known good" combination of distro and Python/clang versions that works? I've been testing my own version of a Dockerfile so I can deploy this to multiple systems, but it hits the same error.
