clang issues on ubuntu 24.04 and Python 3.12 #458

Open
tunguz opened this issue Nov 14, 2024 · 17 comments

@tunguz commented Nov 14, 2024

[Screenshot from 2024-11-14 08-18-52]

After installing exo and clang on an Ubuntu 24.04 machine with a Ryzen CPU, I got an error while trying to run a prompt (see attached screenshot). Anyone have any idea what might be going on?

@devinatkin

I got this same issue myself and am currently trying my hand at diagnosing the cause. I'm guessing it has something to do with the fact that it's running Python in a venv.

[image attached]

I'm on a beefy setup, so I know it's not a resource-constraint issue (two 1080 Tis).

@tunguz (Author) commented Nov 14, 2024

@devinatkin I don't think it's a venv issue. I'm running it on bare-metal Ubuntu 24.04, which ships with Python 3.12 as the default system Python.

@devinatkin

> @devinatkin I don't think it's a venv issue. I'm running it on bare-metal Ubuntu 24.04, which ships with Python 3.12 as the default system Python.

Well, that's good to know. I'm on a fresh install of Ubuntu 24.04.1 LTS, and I decided to try with just the one machine (a pretty good one) before adding the rest of the junk heap.

@devinatkin

Error processing prompt: Command '['clang', '-shared', '-march=native', '-O2', '-Wall', '-Werror', '-x', 'c', '-fPIC', '-ffreestanding', '-nostdlib', '-', '-o', '/tmp/tmp8o4ea_pa']' returned non-zero exit status 1.
Traceback (most recent call last):
  File "/home/dmatkin/exo/exo/main.py", line 193, in run_model_cli
    await node.process_prompt(shard, prompt, request_id=request_id)
  File "/home/dmatkin/exo/exo/orchestration/standard_node.py", line 166, in process_prompt
    resp = await self._process_prompt(base_shard, prompt, request_id)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/dmatkin/exo/exo/orchestration/standard_node.py", line 198, in _process_prompt
    result = await self.inference_engine.infer_prompt(request_id, shard, prompt)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/dmatkin/exo/exo/inference/inference_engine.py", line 28, in infer_prompt
    tokens = await self.encode(shard, prompt)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/dmatkin/exo/exo/inference/tinygrad/inference.py", line 76, in encode
    await self.ensure_shard(shard)
  File "/home/dmatkin/exo/exo/inference/tinygrad/inference.py", line 99, in ensure_shard
    model_shard = await loop.run_in_executor(self.executor, build_transformer, model_path, shard, parameters)
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/concurrent/futures/thread.py", line 58, in run
    result = self.fn(*self.args, **self.kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/dmatkin/exo/exo/inference/tinygrad/inference.py", line 59, in build_transformer
    load_state_dict(model, weights, strict=False, consume=False)  # consume=True
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/dmatkin/exo/.venv/lib/python3.12/site-packages/tinygrad/nn/state.py", line 129, in load_state_dict
    else: v.replace(state_dict[k].to(v.device)).realize()
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/dmatkin/exo/.venv/lib/python3.12/site-packages/tinygrad/tensor.py", line 3500, in _wrapper
    ret = fn(*args, **kwargs)
          ^^^^^^^^^^^^^^^^^^^
  File "/home/dmatkin/exo/.venv/lib/python3.12/site-packages/tinygrad/tensor.py", line 213, in realize
    run_schedule(*self.schedule_with_vars(*lst), do_update_stats=do_update_stats)
  File "/home/dmatkin/exo/.venv/lib/python3.12/site-packages/tinygrad/engine/realize.py", line 222, in run_schedule
    for ei in lower_schedule(schedule):
  File "/home/dmatkin/exo/.venv/lib/python3.12/site-packages/tinygrad/engine/realize.py", line 215, in lower_schedule
    raise e
  File "/home/dmatkin/exo/.venv/lib/python3.12/site-packages/tinygrad/engine/realize.py", line 209, in lower_schedule
    try: yield lower_schedule_item(si)
               ^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/dmatkin/exo/.venv/lib/python3.12/site-packages/tinygrad/engine/realize.py", line 193, in lower_schedule_item
    runner = get_runner(si.outputs[0].device, si.ast)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/dmatkin/exo/.venv/lib/python3.12/site-packages/tinygrad/engine/realize.py", line 162, in get_runner
    method_cache[ckey] = method_cache[bkey] = ret = CompiledRunner(replace(prg, dname=dname))
                                                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/dmatkin/exo/.venv/lib/python3.12/site-packages/tinygrad/engine/realize.py", line 84, in __init__
    self.lib:bytes = precompiled if precompiled is not None else Device[p.dname].compiler.compile_cached(p.src)
                                                                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/dmatkin/exo/.venv/lib/python3.12/site-packages/tinygrad/device.py", line 183, in compile_cached
    lib = self.compile(src)
          ^^^^^^^^^^^^^^^^^
  File "/home/dmatkin/exo/.venv/lib/python3.12/site-packages/tinygrad/runtime/ops_clang.py", line 15, in compile
    subprocess.check_output(['clang', '-shared', *self.args, '-O2', '-Wall', '-Werror', '-x', 'c', '-fPIC', '-ffreestanding', '-nostdlib',
  File "/usr/lib/python3.12/subprocess.py", line 466, in check_output
    return run(*popenargs, stdout=PIPE, timeout=timeout, check=True,
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/subprocess.py", line 571, in run
    raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['clang', '-shared', '-march=native', '-O2', '-Wall', '-Werror', '-x', 'c', '-fPIC', '-ffreestanding', '-nostdlib', '-', '-o', '/tmp/tmp8o4ea_pa']' returned non-zero exit status 1.
Received exit signal SIGTERM...

Trying to launch with a run command seems to deliver the same type of issue.
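For what it's worth, the failing command can be exercised outside exo entirely. A minimal sketch (the kernel and output names here are made up) that feeds a trivial hand-written kernel through the same flag set as the failing invocation above:

```shell
# Probe the toolchain with tinygrad's exact clang flags on a trivial kernel.
# If even this fails, the clang setup itself is broken; if it succeeds, the
# failure is specific to the generated kernel source.
command -v clang >/dev/null || { echo "clang not installed"; exit 0; }
printf 'float add(float a, float b){ return a + b; }\n' | \
  clang -shared -march=native -O2 -Wall -Werror -x c -fPIC -ffreestanding -nostdlib \
        - -o /tmp/clang_probe.so \
  && echo "toolchain OK: failure is specific to the generated kernel" \
  || echo "toolchain probe failed: clang setup itself is broken"
```

Either outcome narrows things down: a probe failure points at the compiler install, a pass points at the kernel source tinygrad generates.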

@tunguz (Author) commented Nov 14, 2024

> > @devinatkin I don't think it's a venv issue. I'm running it on bare-metal Ubuntu 24.04, which ships with Python 3.12 as the default system Python.
>
> Well, that's good to know. I'm on a fresh install of Ubuntu 24.04.1 LTS, and I decided to try with just the one machine (a pretty good one) before adding the rest of the junk heap.

Yup, I am using a brand new machine with a brand new Ubuntu install. This was pretty much the first thing I had tried on it.

@cadenmackenzie (Contributor)

+1, I was getting this as well. I thought maybe the clang version wasn't compatible with the current tinygrad implementation, so I tried clang 14 and 16, but couldn't get it to work.

@cnukaus commented Nov 15, 2024

Adding -v to clang, you can probably see the actual error:
clang -v -include tgmath.h -shared -march=native -O2 -Wall -Werror -x c -fPIC - -o /tmp/tefsd
Ubuntu clang version 14.0.0-1ubuntu1.1
[long output trimmed]
on-dir=/root/exo -ferror-limit 19 -fgnuc-version=4.2.1 -fcolor-diagnostics -vectorize-loops -vectorize-slp -faddrsig -D__GCC_HAVE_DWARF2_CFI_ASM=1 -o /tmp/--368d98.o -x c -
clang -cc1 version 14.0.0 based upon LLVM 14.0.0 default target x86_64-pc-linux-gnu
ignoring nonexistent directory "/usr/bin/../lib/gcc/x86_64-linux-gnu/11/../../../../x86_64-linux-gnu/include"
ignoring nonexistent directory "/include"

Basically it looks like some kind of path issue, since one of the search paths has a repeating component that shouldn't repeat. The compiler is reading the source from a pipe rather than a file, so I can't debug further.
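One possible way around the compiling-from-a-pipe limitation: a hypothetical wrapper script placed ahead of the real clang in PATH that saves stdin to a file before forwarding it, so the generated kernel can be inspected and recompiled by hand. Paths below are just examples, and this assumes the source always arrives on stdin, as it does here:

```shell
# Hypothetical wrapper: capture the C source piped to clang before forwarding
# it to the real compiler. Wrapper and dump locations are examples only.
mkdir -p "$HOME/clang-wrap"
cat > "$HOME/clang-wrap/clang" <<'EOF'
#!/bin/sh
# save the piped kernel source, then hand everything to the real clang
tee /tmp/tinygrad_kernel.c | exec /usr/bin/clang "$@"
EOF
chmod +x "$HOME/clang-wrap/clang"
# Then run exo with the wrapper first in PATH, and after the failure inspect
# and recompile /tmp/tinygrad_kernel.c by hand:
#   PATH="$HOME/clang-wrap:$PATH" exo ...
```

This keeps the failing invocation byte-for-byte identical while leaving a copy of the source behind for debugging.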

@tunguz (Author) commented Nov 15, 2024

> Basically it looks like some kind of path issue, since one of the search paths has a repeating component that shouldn't repeat. The compiler is reading the source from a pipe rather than a file, so I can't debug further.

Would it be possible to get around this with a different kind of clang installation?

@jonstelly

I'm having other clang problems trying to get a Docker image together (Dockerfile, if curious), so this might just trade the issue you're seeing for the one I'm seeing. But on the question of trying other clang installation methods:

https://apt.llvm.org/ has nightly builds, other versions, etc.

I'd be curious to hear if you get things working with a different version or install method.

@AlexCheema (Contributor)

@blindcrone you're running this on your linux box right? Could you take a look at what might be the issue here? Thanks!

@blindcrone (Contributor) commented Nov 19, 2024

I've only got the tinygrad backend working on Linux machines that have GPUs. I chased this rabbit hole a bit a few weeks ago and found that it's an issue in tinygrad in general: I've yet to find any report of a Linux user getting the clang backend in tinygrad working for llama and related models. Tinygrad contains a "fix-bf16"-style function that also doesn't seem to solve the issue.

The actual bug happens in LLVM when it tries to support float16 types. It's an issue I was able to chase down in that repository; I'll look for it again and post links here. The tl;dr is that this may be patched in LLVM 19, but no distro I know of currently packages that version because of build issues.
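If the 16-bit float support in the toolchain is the culprit, that can be probed directly without tinygrad in the loop. A hedged sketch (probe file names are arbitrary); on affected toolchains, one of these one-liners typically fails with a diagnostic like the ones reported in this thread:

```shell
# Check whether the installed clang can lower _Float16 and __bf16 at all.
command -v clang >/dev/null || { echo "clang not installed"; exit 0; }
printf '_Float16 h(float x){ return (_Float16)x; }\n' | \
  clang -O2 -x c -c - -o /tmp/f16_probe.o \
  && echo "_Float16 OK" || echo "_Float16 unsupported on this toolchain"
printf 'float b(__bf16 *p){ return (float)p[0]; }\n' | \
  clang -O2 -x c -c - -o /tmp/bf16_probe.o \
  && echo "__bf16 OK" || echo "__bf16 unsupported on this toolchain"
```

On older clang releases, __bf16 is a storage-only type, so even the load-and-convert in the second probe is rejected; that would be consistent with the bf16-heavy llama weights triggering this.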

@lexasub commented Nov 20, 2024

@blindcrone I installed it (clang 20), but it's not fixed.

@blindcrone (Contributor)

Yeah, I'm on Arch and I haven't gotten a newer LLVM to install cleanly (probably I have stuff that depends on old versions, or the wrong compilers to build it), so if it doesn't work, there goes that theory.

I think I'll just write another inference engine that supports CPU. I've been digging all around that code anyway.

@lexasub commented Nov 21, 2024

@blindcrone After clearing the ccache cache, I have another problem: exo doesn't use the GPU (cluster mode, two Linux machines, RTX 3060), and I don't get an answer in the UI.
log.log

@kdkd commented Nov 22, 2024

I'm attempting this on aarch64 (Raspberry Pi 5). With stock clang 14, it fails with error: __bf16 is not supported on this target. I upgraded to clang 18, which blows up in a weirder place:

fatal error: error in backend: Cannot select: 0x555591ddaa60: f16 = fp_round 0x555591ddac90, TargetConstant:i64<0>
  0x555591ddac90: bf16,ch = load<(load (s16) from %ir.7 + 4, !tbaa !4)> 0x555591d763e0, 0x555591dd63e0, undef:i64
    0x555591dd63e0: i64 = add nuw 0x555591dd5960, Constant:i64<4>
      0x555591dd5960: i64 = add 0x555591dd59d0, 0x555591dd5a40
        0x555591dd59d0: i64,ch = CopyFromReg 0x555591d763e0, Register:i64 %3
          0x555591dd5ab0: i64 = Register %3
        0x555591dd5a40: i64 = shl nuw nsw 0x555591dd57a0, Constant:i64<3>
          0x555591dd57a0: i64,ch = CopyFromReg 0x555591d763e0, Register:i64 %0
            0x555591dd5810: i64 = Register %0
          0x555591dd5b20: i64 = Constant<3>
      0x555591dd61b0: i64 = Constant<4>
    0x555591dd58f0: i64 = undef
  0x555591dd5c00: i64 = TargetConstant<0>
In function: E_4194304_4
clang-18: error: clang frontend command failed with exit code 70 (use -v to see invocation)
Debian clang version 18.1.8 (++20240731024826+3b5b5c1ec4a3-1~exp1~20240731144843.145)

With clang 19, it builds and executes okay, but I still don't have it working, because I'm hitting socket errors immediately afterwards that I haven't debugged yet.
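The selection failure above is an f16 = fp_round of a bf16 load, so a standalone one-liner along these lines may reproduce it outside tinygrad. This is an assumption based on the DAG dump, not a confirmed repro, and the output path is arbitrary:

```shell
# Assumed minimal repro of the fp_round(bf16 -> f16) crash from the DAG dump
# above: load a __bf16 from memory and narrow it to _Float16.
command -v clang >/dev/null || { echo "clang not installed"; exit 0; }
printf '_Float16 conv(__bf16 *p){ return (_Float16)p[2]; }\n' | \
  clang -O2 -x c -c - -o /tmp/bf16_to_f16.o \
  && echo "bf16 -> f16 conversion compiles on this toolchain" \
  || echo "bf16 -> f16 conversion fails on this toolchain"
```

If this reproduces the crash on clang 18/aarch64 but passes on clang 19, that would back up the theory that the fix landed in LLVM 19.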

@Coastline-3102

I believe I'm also hitting this error. I'm on Debian 12 (bookworm) with Python 3.12.7 and clang 14.0.6.

Is there a "known good" combination of distro and Python/clang versions that works? I've been testing my own version of a Dockerfile so I can deploy this to multiple systems, but it hits the same error.
