
[Bug] rocm57 flow nightly crashes #2144

@Sing-Li

Description

🐛 Bug

When using the ROCm 5.7 nightly to run serve or chat, the JIT compile crashes on the first run: after the weights are downloaded but before the MD5-named model lib is emitted.
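
The same crash reproduces from the Python API, which goes through the same JIT path (a minimal repro sketch, assuming the ChatModule API shipped in the current nightly):

    # Minimal repro sketch; triggers the JIT model-lib compile that crashes
    # in the ROCm link step. Model name is the one from the log below.
    from mlc_llm import ChatModule

    cm = ChatModule(
        model="HF://mlc-ai/gemma-2b-it-q4f16_1-MLC",
        device="rocm",
    )
    print(cm.generate("Hello"))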

To Reproduce

Steps to reproduce the behavior:

  1. Install the latest rocm57 nightly.
  2. Clear out any cache, then run serve or chat on any model (known supported and working).
  3. The flow crashes with the log below:
[04:10:24] /workspace/tvm/src/target/parsers/aprofile.cc:97: Warning: Cannot parse target features. LLVM was not compiled with support for Arm(R)-based targets.
[2024-04-15 04:10:26] INFO auto_device.py:85: Not found device: cuda:0
[2024-04-15 04:10:28] INFO auto_device.py:76: Found device: rocm:0
[2024-04-15 04:10:28] INFO auto_device.py:76: Found device: rocm:1
[2024-04-15 04:10:28] INFO auto_device.py:76: Found device: rocm:2
[2024-04-15 04:10:28] INFO auto_device.py:76: Found device: rocm:3
[2024-04-15 04:10:29] INFO auto_device.py:85: Not found device: metal:0
[2024-04-15 04:10:30] INFO auto_device.py:85: Not found device: vulkan:0
[2024-04-15 04:10:31] INFO auto_device.py:85: Not found device: opencl:0
[2024-04-15 04:10:31] INFO auto_device.py:33: Using device: rocm:0
[2024-04-15 04:10:31] INFO chat_module.py:362: Downloading model from HuggingFace: HF://mlc-ai/gemma-2b-it-q4f16_1-MLC
[2024-04-15 04:10:31] INFO download.py:131: Weights already downloaded: /root/.cache/mlc_llm/model_weights/mlc-ai/gemma-2b-it-q4f16_1-MLC
[2024-04-15 04:10:31] INFO chat_module.py:781: Model lib not found. Now compiling model lib on device...
[2024-04-15 04:10:32] INFO jit.py:35: MLC_JIT_POLICY = ON. Can be one of: ON, OFF, REDO, READONLY
[2024-04-15 04:10:32] INFO jit.py:94: Compiling using commands below:
[2024-04-15 04:10:32] INFO jit.py:95: /usr/bin/python3 -m mlc_llm compile /root/.cache/mlc_llm/model_weights/mlc-ai/gemma-2b-it-q4f16_1-MLC --opt 'flashinfer=1;cublas_gemm=1;faster_transformer=1;cudagraph=0;cutlass=1;ipc_allreduce_strategy=NONE' --overrides 'context_window_size=8192;prefill_chunk_size=1024;tensor_parallel_shards=1' --device rocm:0 --output /tmp/tmpzldeenzs/lib.so
[04:10:32] /workspace/tvm/src/target/parsers/aprofile.cc:97: Warning: Cannot parse target features. LLVM was not compiled with support for Arm(R)-based targets.
[04:10:32] /workspace/tvm/src/target/parsers/aprofile.cc:97: Warning: Cannot parse target features. LLVM was not compiled with support for Arm(R)-based targets.
[04:10:32] /workspace/tvm/src/target/parsers/aprofile.cc:97: Warning: Cannot parse target features. LLVM was not compiled with support for Arm(R)-based targets.
[04:10:32] /workspace/tvm/src/target/parsers/aprofile.cc:97: Warning: Cannot parse target features. LLVM was not compiled with support for Arm(R)-based targets.
[04:10:32] /workspace/tvm/src/target/parsers/aprofile.cc:97: Warning: Cannot parse target features. LLVM was not compiled with support for Arm(R)-based targets.
[04:10:32] /workspace/tvm/src/target/parsers/aprofile.cc:97: Warning: Cannot parse target features. LLVM was not compiled with support for Arm(R)-based targets.
[04:10:32] /workspace/tvm/src/target/parsers/aprofile.cc:97: Warning: Cannot parse target features. LLVM was not compiled with support for Arm(R)-based targets.
[04:10:32] /workspace/tvm/src/target/parsers/aprofile.cc:97: Warning: Cannot parse target features. LLVM was not compiled with support for Arm(R)-based targets.
[04:10:32] /workspace/tvm/src/target/parsers/aprofile.cc:97: Warning: Cannot parse target features. LLVM was not compiled with support for Arm(R)-based targets.
[2024-04-15 04:10:33] INFO auto_config.py:69: Found model configuration: /root/.cache/mlc_llm/model_weights/mlc-ai/gemma-2b-it-q4f16_1-MLC/mlc-chat-config.json
[2024-04-15 04:10:33] INFO auto_target.py:84: Detecting target device: rocm:0
[2024-04-15 04:10:33] INFO auto_target.py:86: Found target: {"thread_warp_size": 64, "mtriple": "amdgcn-amd-amdhsa-hcc", "max_threads_per_block": 1024, "max_num_threads": 256, "kind": "rocm", "max_shared_memory_per_block": 65536, "tag": "", "mcpu": "gfx908", "keys": ["rocm", "gpu"]}
[2024-04-15 04:10:33] INFO auto_target.py:103: Found host LLVM triple: x86_64-unknown-linux-gnu
[2024-04-15 04:10:33] INFO auto_target.py:104: Found host LLVM CPU: skylake-avx512
[2024-04-15 04:10:33] INFO auto_config.py:153: Found model type: gemma. Use `--model-type` to override.
Compiling with arguments:
  --config          GemmaConfig(hidden_size=2048, hidden_act='gelu', intermediate_size=16384, attention_bias=False, num_attention_heads=8, num_key_value_heads=1, head_dim=256, num_hidden_layers=18, rms_norm_eps=1e-06, vocab_size=256000, position_embedding_base=10000.0, context_window_size=8192, prefill_chunk_size=1024, tensor_parallel_shards=1, max_batch_size=80, kwargs={})
  --quantization    GroupQuantize(name='q4f16_1', kind='group-quant', group_size=32, quantize_dtype='int4', storage_dtype='uint32', model_dtype='float16', linear_weight_layout='NK', quantize_embedding=True, quantize_final_fc=True, num_elem_per_storage=8, num_storage_per_group=4, max_int_value=7)
  --model-type      gemma
  --target          {"thread_warp_size": 64, "host": {"mtriple": "x86_64-unknown-linux-gnu", "tag": "", "kind": "llvm", "mcpu": "skylake-avx512", "keys": ["cpu"]}, "mtriple": "amdgcn-amd-amdhsa-hcc", "max_threads_per_block": 1024, "max_num_threads": 256, "kind": "rocm", "max_shared_memory_per_block": 65536, "tag": "", "mcpu": "gfx908", "keys": ["rocm", "gpu"]}
  --opt             flashinfer=0;cublas_gemm=0;faster_transformer=0;cudagraph=0;cutlass=0;ipc_allreduce_strategy=NONE
  --system-lib-prefix ""
  --output          /tmp/tmpzldeenzs/lib.so
  --overrides       context_window_size=8192;sliding_window_size=None;prefill_chunk_size=1024;attention_sink_size=None;max_batch_size=None;tensor_parallel_shards=1
[2024-04-15 04:10:33] INFO config.py:106: Overriding context_window_size from 8192 to 8192
[2024-04-15 04:10:33] INFO config.py:106: Overriding prefill_chunk_size from 1024 to 1024
[2024-04-15 04:10:33] INFO config.py:106: Overriding tensor_parallel_shards from 1 to 1
[2024-04-15 04:10:33] INFO compile.py:137: Creating model from: GemmaConfig(hidden_size=2048, hidden_act='gelu', intermediate_size=16384, attention_bias=False, num_attention_heads=8, num_key_value_heads=1, head_dim=256, num_hidden_layers=18, rms_norm_eps=1e-06, vocab_size=256000, position_embedding_base=10000.0, context_window_size=8192, prefill_chunk_size=1024, tensor_parallel_shards=1, max_batch_size=80, kwargs={})
[2024-04-15 04:10:33] INFO compile.py:156: Exporting the model to TVM Unity compiler
[2024-04-15 04:10:34] INFO compile.py:162: Running optimizations using TVM Unity
[2024-04-15 04:10:34] INFO compile.py:176: Registering metadata: {'model_type': 'gemma', 'quantization': 'q4f16_1', 'context_window_size': 8192, 'sliding_window_size': -1, 'attention_sink_size': -1, 'prefill_chunk_size': 1024, 'tensor_parallel_shards': 1, 'kv_cache_bytes': 0}
[2024-04-15 04:10:35] INFO pipeline.py:50: Running TVM Relax graph-level optimizations
[2024-04-15 04:10:45] INFO pipeline.py:50: Lowering to TVM TIR kernels
[2024-04-15 04:10:46] INFO pipeline.py:50: Running TVM TIR-level optimizations
[2024-04-15 04:10:50] INFO pipeline.py:50: Running TVM Dlight low-level optimizations
[2024-04-15 04:10:52] INFO pipeline.py:50: Lowering to VM bytecode
[2024-04-15 04:10:53] INFO estimate_memory_usage.py:57: [Memory usage] Function `alloc_embedding_tensor`: 4.00 MB
[2024-04-15 04:10:53] INFO estimate_memory_usage.py:57: [Memory usage] Function `batch_decode`: 10.31 MB
[2024-04-15 04:10:53] INFO estimate_memory_usage.py:57: [Memory usage] Function `batch_prefill`: 132.00 MB
[2024-04-15 04:10:53] INFO estimate_memory_usage.py:57: [Memory usage] Function `batch_verify`: 132.00 MB
[2024-04-15 04:10:53] INFO estimate_memory_usage.py:57: [Memory usage] Function `create_tir_paged_kv_cache`: 0.00 MB
[2024-04-15 04:10:53] INFO estimate_memory_usage.py:57: [Memory usage] Function `decode`: 0.13 MB
[2024-04-15 04:10:53] INFO estimate_memory_usage.py:57: [Memory usage] Function `embed`: 4.00 MB
[2024-04-15 04:10:53] INFO estimate_memory_usage.py:57: [Memory usage] Function `prefill`: 132.00 MB
[2024-04-15 04:10:53] INFO estimate_memory_usage.py:57: [Memory usage] Function `softmax_with_temperature`: 0.00 MB
[2024-04-15 04:10:54] INFO pipeline.py:50: Compiling external modules
[2024-04-15 04:10:54] INFO pipeline.py:50: Compilation complete! Exporting to disk
Traceback (most recent call last):
  File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/usr/local/lib/python3.10/dist-packages/mlc_llm/__main__.py", line 52, in <module>
    main()
  File "/usr/local/lib/python3.10/dist-packages/mlc_llm/__main__.py", line 25, in main
    cli.main(sys.argv[2:])
  File "/usr/local/lib/python3.10/dist-packages/mlc_llm/cli/compile.py", line 128, in main
    compile(
  File "/usr/local/lib/python3.10/dist-packages/mlc_llm/interface/compile.py", line 234, in compile
    _compile(args, model_config)
  File "/usr/local/lib/python3.10/dist-packages/mlc_llm/interface/compile.py", line 179, in _compile
    args.build_func(
  File "/usr/local/lib/python3.10/dist-packages/mlc_llm/support/auto_target.py", line 266, in build
    relax.build(
  File "/usr/local/lib/python3.10/dist-packages/tvm/relax/vm_build.py", line 341, in build
    return _vmlink(
  File "/usr/local/lib/python3.10/dist-packages/tvm/relax/vm_build.py", line 247, in _vmlink
    lib = tvm.build(
  File "/usr/local/lib/python3.10/dist-packages/tvm/driver/build_module.py", line 297, in build
    rt_mod_host = _driver_ffi.tir_to_runtime(annotated_mods, target_host)
  File "tvm/_ffi/_cython/./packed_func.pxi", line 332, in tvm._ffi._cy3.core.PackedFuncBase.__call__
  File "tvm/_ffi/_cython/./packed_func.pxi", line 263, in tvm._ffi._cy3.core.FuncCall
  File "tvm/_ffi/_cython/./packed_func.pxi", line 252, in tvm._ffi._cy3.core.FuncCall3
  File "tvm/_ffi/_cython/./base.pxi", line 182, in tvm._ffi._cy3.core.CHECK_CALL
  File "/usr/local/lib/python3.10/dist-packages/tvm/_ffi/base.py", line 481, in raise_last_ffi_error
    raise py_err
  File "tvm/_ffi/_cython/./packed_func.pxi", line 56, in tvm._ffi._cy3.core.tvm_callback
  File "/usr/local/lib/python3.10/dist-packages/tvm/contrib/rocm.py", line 120, in callback_rocm_link
    rocm_link(tmp_obj, tmp_cobj)
  File "/usr/local/lib/python3.10/dist-packages/tvm/contrib/rocm.py", line 85, in rocm_link
    lld if lld is not None else find_lld()[0],
  File "/usr/local/lib/python3.10/dist-packages/tvm/contrib/rocm.py", line 59, in find_lld
    raise RuntimeError("cannot find ld.lld, candidates are: " + str(lld_list))
RuntimeError: cannot find ld.lld, candidates are: ['ld.lld-17.0', 'ld.lld-17', 'ld.lld', '/opt/rocm/llvm/bin']
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/mlc_llm/chat_module.py", line 772, in __init__
    self.model_lib_path = _get_lib_module_path(
  File "/usr/local/lib/python3.10/dist-packages/mlc_llm/chat_module.py", line 591, in _get_lib_module_path
    raise FileNotFoundError(err_msg)
FileNotFoundError: Cannot find the model library that corresponds to `None`.
`None` is either provided in the `chat_config` you passed in, or specified in /root/.cache/mlc_llm/model_weights/mlc-ai/gemma-2b-it-q4f16_1-MLC/mlc-chat-config.json.
We searched over the following possible paths: 
- None-rocm.so
- dist/prebuilt/lib/None-rocm.so
- dist/HF://mlc-ai/gemma-2b-it-q4f16_1-MLC/None-rocm.so
- /root/.cache/mlc_llm/model_weights/mlc-ai/gemma-2b-it-q4f16_1-MLC/None-rocm.so
- /root/.cache/mlc_llm/model_weights/mlc-ai/None-rocm.so
If you would like to directly specify the model library path, you may consider passing in the `ChatModule.model_lib_path` parameter.
Please checkout https://github.com/mlc-ai/notebooks/blob/main/mlc-llm/tutorial_chat_module_getting_started.ipynb for an example on how to load a model.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/bin/mlc_llm", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.10/dist-packages/mlc_llm/__main__.py", line 37, in main
    cli.main(sys.argv[2:])
  File "/usr/local/lib/python3.10/dist-packages/mlc_llm/cli/chat.py", line 41, in main
    chat(
  File "/usr/local/lib/python3.10/dist-packages/mlc_llm/interface/chat.py", line 133, in chat
    cm = ChatModule(model, device, chat_config=config, model_lib_path=model_lib_path)
  File "/usr/local/lib/python3.10/dist-packages/mlc_llm/chat_module.py", line 785, in __init__
    jit.jit(
  File "/usr/local/lib/python3.10/dist-packages/mlc_llm/interface/jit.py", line 123, in jit
    _run_jit(
  File "/usr/local/lib/python3.10/dist-packages/mlc_llm/interface/jit.py", line 96, in _run_jit
    subprocess.run(cmd, check=True)
  File "/usr/lib/python3.10/subprocess.py", line 526, in run
    raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['/usr/bin/python3', '-m', 'mlc_llm', 'compile', '/root/.cache/mlc_llm/
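
The immediate failure is in TVM's ROCm linker step: find_lld in tvm/contrib/rocm.py cannot resolve any of its candidates. Note that the last candidate in the RuntimeError, /opt/rocm/llvm/bin, is a bare directory, which a `which`-style lookup will never return. The candidates can be checked directly (a standalone diagnostic sketch, not part of MLC-LLM; shutil.which approximates the lookup):

    # Diagnostic sketch: check which ld.lld candidates from the RuntimeError
    # above actually resolve. shutil.which returns None for the bare
    # directory entry because it only matches executable files.
    import shutil

    for name in ["ld.lld-17.0", "ld.lld-17", "ld.lld", "/opt/rocm/llvm/bin"]:
        print(f"{name!r:>24} -> {shutil.which(name)}")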

Expected behavior

A simple invocation of the flow should work, as it does with the NVIDIA CUDA 12.2 nightly build.

Environment

  • Platform (e.g. WebGPU/Vulkan/IOS/Android/CUDA): ROCm 5.7
  • Operating system (e.g. Ubuntu/Windows/MacOS/...): Ubuntu 22.04 LTS
  • Device (e.g. iPhone 12 Pro, PC+RTX 3090, ...): AMD Instinct MI25
  • How you installed MLC-LLM (conda, source): nightly rocm57
  • How you installed TVM-Unity (pip, source): nightly rocm57
  • Python version (e.g. 3.10): 3.10
  • GPU driver version (if applicable):
  • CUDA/cuDNN version (if applicable):
  • TVM Unity Hash Tag (python -c "import tvm; print('\n'.join(f'{k}: {v}' for k, v in tvm.support.libinfo().items()))", applicable if you compile models):
  • Any other relevant information:

Additional context

Likely due to this issue: mlc-ai/relax#316
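
Until that is fixed, one possible workaround (an untested sketch, assuming ld.lld ships at the standard ROCm location /opt/rocm/llvm/bin) is to prepend that directory to PATH before mlc_llm spawns the compile subprocess, so the bare ld.lld candidate resolves:

    # Workaround sketch: expose ROCm's bundled ld.lld on PATH before the JIT
    # compile runs; the subprocess.run call in jit.py inherits this environment.
    import os

    os.environ["PATH"] = "/opt/rocm/llvm/bin" + os.pathsep + os.environ["PATH"]

    from mlc_llm import ChatModule  # import after patching PATH
    cm = ChatModule(model="HF://mlc-ai/gemma-2b-it-q4f16_1-MLC", device="rocm")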
