
[Bug] Llama-3 doesn't work #2281

@chongkuiqi


🐛 Bug

Thanks for your work! I downloaded the compiled/quantized Llama-3 weights from https://huggingface.co/mlc-ai/Llama-3-8B-Instruct-q4f16_1-MLC, but when I run it, it outputs the following:

Exception in thread Thread-1 (_background_loop):
Traceback (most recent call last):
File "/home/haige/miniconda3/envs/torch222/lib/python3.10/threading.py", line 1016, in _bootstrap_inner
self.run()
File "/home/haige/miniconda3/envs/torch222/lib/python3.10/threading.py", line 953, in run
self._target(*self._args, **self._kwargs)
File "/home/haige/miniconda3/envs/torch222/lib/python3.10/site-packages/mlc_llm/serve/engine_base.py", line 484, in _background_loop
self._ffi"run_background_loop"
File "tvm/_ffi/_cython/./packed_func.pxi", line 332, in tvm._ffi._cy3.core.PackedFuncBase.call
File "tvm/_ffi/_cython/./packed_func.pxi", line 263, in tvm._ffi._cy3.core.FuncCall
File "tvm/_ffi/_cython/./packed_func.pxi", line 252, in tvm._ffi._cy3.core.FuncCall3
File "tvm/_ffi/_cython/./base.pxi", line 182, in tvm._ffi._cy3.core.CHECK_CALL
File "/home/haige/miniconda3/envs/torch222/lib/python3.10/site-packages/tvm/_ffi/base.py", line 481, in raise_last_ffi_error
raise py_err
tvm._ffi.base.TVMError: Traceback (most recent call last):
13: mlc::llm::serve::ThreadedEngineImpl::RunBackgroundLoop()
at /workspace/mlc-llm/cpp/serve/threaded_engine.cc:168
12: mlc::llm::serve::EngineImpl::Step()
at /workspace/mlc-llm/cpp/serve/engine.cc:326
11: mlc::llm::serve::NewRequestPrefillActionObj::Step(mlc::llm::serve::EngineState)
at /workspace/mlc-llm/cpp/serve/engine_actions/new_request_prefill.cc:235
10: mlc::llm::serve::GPUSampler::BatchSampleTokensWithProbAfterTopP(tvm::runtime::NDArray, std::vector<int, std::allocator<int> > const&, tvm::runtime::Array<tvm::runtime::String, void> const&, tvm::runtime::Array<mlc::llm::serve::GenerationConfig, void> const&, std::vector<mlc::llm::RandomGenerator*, std::allocator<mlc::llm::RandomGenerator*> > const&, std::vector<tvm::runtime::NDArray, std::allocator<tvm::runtime::NDArray> >*)
at /workspace/mlc-llm/cpp/serve/sampler/gpu_sampler.cc:179
9: mlc::llm::serve::GPUSampler::BatchSampleTokensImpl(tvm::runtime::NDArray, std::vector<int, std::allocator<int> > const&, tvm::runtime::Array<tvm::runtime::String, void> const&, tvm::runtime::Array<mlc::llm::serve::GenerationConfig, void> const&, std::vector<mlc::llm::RandomGenerator*, std::allocator<mlc::llm::RandomGenerator*> > const&, bool, std::vector<tvm::runtime::NDArray, std::allocator<tvm::runtime::NDArray> >*)
at /workspace/mlc-llm/cpp/serve/sampler/gpu_sampler.cc:369
8: mlc::llm::serve::GPUSampler::ChunkSampleTokensImpl(tvm::runtime::NDArray, std::vector<int, std::allocator<int> > const&, tvm::runtime::Array<mlc::llm::serve::GenerationConfig, void> const&, std::vector<mlc::llm::RandomGenerator*, std::allocator<mlc::llm::RandomGenerator*> > const&, bool)
at /workspace/mlc-llm/cpp/serve/sampler/gpu_sampler.cc:450
7: mlc::llm::serve::GPUSampler::SampleOnGPU(tvm::runtime::NDArray, tvm::runtime::NDArray, tvm::runtime::NDArray, bool, bool, int, std::vector<int, std::allocator<int> > const&)
at /workspace/mlc-llm/cpp/serve/sampler/gpu_sampler.cc:567
6: tvm::runtime::relax_vm::VirtualMachineImpl::InvokeClosurePacked(tvm::runtime::ObjectRef const&, tvm::runtime::TVMArgs, tvm::runtime::TVMRetValue*)
5: tvm::runtime::PackedFuncObj::Extractor<tvm::runtime::PackedFuncSubObj<tvm::runtime::relax_vm::VirtualMachineImpl::GetClosureInternal(tvm::runtime::String const&, bool)::{lambda(tvm::runtime::TVMArgs, tvm::runtime::TVMRetValue*)#1}> >::Call(tvm::runtime::PackedFuncObj const*, tvm::runtime::TVMArgs, tvm::runtime::TVMRetValue*)
4: tvm::runtime::relax_vm::VirtualMachineImpl::InvokeBytecode(long, std::vector<tvm::runtime::TVMRetValue, std::allocator<tvm::runtime::TVMRetValue> > const&)
3: tvm::runtime::relax_vm::VirtualMachineImpl::RunLoop()
2: tvm::runtime::relax_vm::VirtualMachineImpl::RunInstrCall(tvm::runtime::relax_vm::VMFrame*, tvm::runtime::relax_vm::Instruction)
1: tvm::runtime::PackedFuncObj::Extractor<tvm::runtime::PackedFuncSubObj<tvm::runtime::WrapPackedFunc(int (*)(TVMValue*, int*, int, TVMValue*, int*, void*), tvm::runtime::ObjectPtr<tvm::runtime::Object> const&)::{lambda(tvm::runtime::TVMArgs, tvm::runtime::TVMRetValue*)#1}> >::Call(tvm::runtime::PackedFuncObj const*, tvm::runtime::TVMArgs, tvm::runtime::TVMRetValue*)
0: TVMThrowLastError.cold
TVMError: after determining tmp storage requirements for inclusive_scan: cudaErrorNoKernelImageForDevice: no kernel image is available for execution on the device

So I compiled llama-3-cuda.so myself with mlc_llm compile, but it outputs:

Exception in thread Thread-1 (_background_loop):
Traceback (most recent call last):
File "/home/haige/miniconda3/envs/torch222/lib/python3.10/threading.py", line 1016, in _bootstrap_inner
self.run()
File "/home/haige/miniconda3/envs/torch222/lib/python3.10/threading.py", line 953, in run
self._target(*self._args, **self._kwargs)
File "/home/haige/miniconda3/envs/torch222/lib/python3.10/site-packages/mlc_llm/serve/engine_base.py", line 484, in _background_loop
self._ffi"run_background_loop"
File "tvm/_ffi/_cython/./packed_func.pxi", line 332, in tvm._ffi._cy3.core.PackedFuncBase.call
File "tvm/_ffi/_cython/./packed_func.pxi", line 263, in tvm._ffi._cy3.core.FuncCall
File "tvm/_ffi/_cython/./packed_func.pxi", line 252, in tvm._ffi._cy3.core.FuncCall3
File "tvm/_ffi/_cython/./base.pxi", line 182, in tvm._ffi._cy3.core.CHECK_CALL
File "/home/haige/miniconda3/envs/torch222/lib/python3.10/site-packages/tvm/_ffi/base.py", line 481, in raise_last_ffi_error
raise py_err
tvm._ffi.base.TVMError: Traceback (most recent call last):
3: _ZN3tvm7runtime13PackedFuncObj9ExtractorINS0_16PackedFuncSubObjIZNS0_6detail17PackFuncVoidAddr_ILi8ENS0_15CUDAWrappedFuncEEENS0_10PackedFuncET0_RKSt6vectorINS4_1
2: tvm::runtime::CUDAWrappedFunc::operator()(tvm::runtime::TVMArgs, tvm::runtime::TVMRetValue*, void**) const [clone .isra.0]
1: tvm::runtime::CUDAModuleNode::GetFunc(int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)
0: _ZN3tvm7runtime6deta
File "/workspace/tvm/src/runtime/cuda/cuda_module.cc", line 110
CUDAError: cuModuleLoadData(&(module_[device_id]), data_.c_str()) failed with error: CUDA_ERROR_NO_BINARY_FOR_GPU

To Reproduce

from mlc_llm import MLCEngine

# Local paths to the downloaded weights and the compiled model library.
model = "Llama-3-8B-Instruct-q4f16_1-MLC"
engine = MLCEngine(
    model=model,
    model_lib="llama-3-cuda.so",
    device="cuda",
)
response = engine.chat.completions.create(
    messages=[{"role": "user", "content": "What is the meaning of life?"}],
    model=model,
    stream=False,
)
print(response)
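
For completeness, the MLC-LLM quick start ends this same flow by shutting the engine down; it is shown here only for reference and is unrelated to the crash:

# Shut down the background engine thread once the request is done
# (as in the MLC-LLM quick-start example).
engine.terminate()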

Environment

  • Platform: CUDA

  • Operating system: Ubuntu 20.04

  • Device: RTX 6000, 24 GB (see the compute-capability check below)

  • How you installed MLC-LLM: python3 -m pip install --pre -U -f https://mlc.ai/wheels mlc-llm-nightly-cu121 mlc-ai-nightly-cu121

  • How you installed TVM-Unity: python3 -m pip install --pre -U -f https://mlc.ai/wheels mlc-llm-nightly-cu121 mlc-ai-nightly-cu121

  • Python version: 3.10

  • GPU driver version: 535.171.04

  • CUDA/cuDNN version: CUDA 12.1, cuDNN 8.9.4
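
In case it helps with diagnosing the "no kernel image is available" / CUDA_ERROR_NO_BINARY_FOR_GPU errors above, here is a small sketch that reads the GPU's compute capability through the PyTorch already installed in this environment:

# Sketch: print the compute capability of GPU 0 and the CUDA version PyTorch was built with.
import torch

name = torch.cuda.get_device_name(0)
major, minor = torch.cuda.get_device_capability(0)  # e.g. (7, 5) means sm_75
print(name)
print(f"compute capability: sm_{major}{minor}")
print("torch CUDA version:", torch.version.cuda)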

Could you please provide some help?
