Description
🐛 Bug
I was running Llama-2-chat-13b (q8f16_1) on an A10G hosted on Hugging Face. After switching to an A100, I started getting the following error whenever I send a query to the LLM through the REST API:
To Reproduce
Steps to reproduce the behavior:
1- Run the mlc_chat REST app with Llama-2-chat-13b quantized with q8f16_1
2- Send a request to the server (a sketch of the request I use is shown below)
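For reference, this is roughly what I send. It is a minimal sketch only: the host, port, endpoint path, and JSON field come from my own rest.py wrapper around the chat module (visible in the traceback below), so treat them as placeholders rather than the official mlc_chat REST schema.

import requests

# Placeholder URL and payload: my rest.py exposes an endpoint whose handler
# reads request.prompt and calls chat_mod.generate(); the exact path and port
# depend on how the server is launched.
resp = requests.post(
    "http://127.0.0.1:8000/generate",
    json={"prompt": "What is the capital of France?"},
    timeout=120,
)
print(resp.status_code, resp.text)

As soon as the request reaches the generate handler, the server logs the following: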
ERROR: Exception in ASGI application
Traceback (most recent call last):
File "/usr/local/lib/python3.10/dist-packages/uvicorn/protocols/http/h11_impl.py", line 408, in run_asgi
result = await app( # type: ignore[func-returns-value]
File "/usr/local/lib/python3.10/dist-packages/uvicorn/middleware/proxy_headers.py", line 84, in call
return await self.app(scope, receive, send)
File "/usr/local/lib/python3.10/dist-packages/fastapi/applications.py", line 292, in call
await super().call(scope, receive, send)
File "/usr/local/lib/python3.10/dist-packages/starlette/applications.py", line 122, in call
await self.middleware_stack(scope, receive, send)
File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/errors.py", line 184, in call
raise exc
File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/errors.py", line 162, in call
await self.app(scope, receive, _send)
File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/cors.py", line 83, in call
await self.app(scope, receive, send)
File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/exceptions.py", line 79, in call
raise exc
File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/exceptions.py", line 68, in call
await self.app(scope, receive, sender)
File "/usr/local/lib/python3.10/dist-packages/fastapi/middleware/asyncexitstack.py", line 20, in call
raise e
File "/usr/local/lib/python3.10/dist-packages/fastapi/middleware/asyncexitstack.py", line 17, in call
await self.app(scope, receive, send)
File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 718, in call
await route.handle(scope, receive, send)
File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 276, in handle
await self.app(scope, receive, send)
File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 66, in app
response = await func(request)
File "/usr/local/lib/python3.10/dist-packages/fastapi/routing.py", line 273, in app
raw_response = await run_endpoint_function(
File "/usr/local/lib/python3.10/dist-packages/fastapi/routing.py", line 190, in run_endpoint_function
return await dependant.call(**values)
File "/code/run/rest.py", line 231, in generate
msg = session["chat_mod"].generate(prompt=request.prompt)
File "/usr/local/lib/python3.10/dist-packages/mlc_chat/chat_module.py", line 639, in generate
self._prefill(prompt)
File "/usr/local/lib/python3.10/dist-packages/mlc_chat/chat_module.py", line 808, in _prefill
self._prefill_func(input, decode_next_token, place_in_prompt.value)
File "tvm/_ffi/_cython/./packed_func.pxi", line 332, in tvm._ffi._cy3.core.PackedFuncBase.call
File "tvm/_ffi/_cython/./packed_func.pxi", line 263, in tvm._ffi._cy3.core.FuncCall
File "tvm/_ffi/_cython/./packed_func.pxi", line 252, in tvm._ffi._cy3.core.FuncCall3
File "tvm/_ffi/_cython/./base.pxi", line 182, in tvm._ffi._cy3.core.CHECK_CALL
File "/usr/local/lib/python3.10/dist-packages/tvm/_ffi/base.py", line 476, in raise_last_ffi_error
raise py_err
File "/workspace/mlc-llm/cpp/llm_chat.cc", line 1244, in mlc::llm::LLMChatModule::GetFunction(tvm::runtime::String const&, tvm::runtime::ObjectPtrtvm::runtime::Object const&)::{lambda(tvm::runtime::TVMArgs, tvm::runtime::TVMRetValue)#5}::operator()(tvm::runtime::TVMArgs, tvm::runtime::TVMRetValue) const
File "/workspace/mlc-llm/cpp/llm_chat.cc", line 771, in mlc::llm::LLMChat::PrefillStep(std::__cxx11::basic_string<char, std::char_traits, std::allocator >, bool, bool, mlc::llm::PlaceInPrompt)
File "/workspace/mlc-llm/cpp/llm_chat.cc", line 997, in mlc::llm::LLMChat::ForwardTokens(std::vector<int, std::allocator >, long)
tvm._ffi.base.TVMError: Traceback (most recent call last):
9: mlc::llm::LLMChatModule::GetFunction(tvm::runtime::String const&, tvm::runtime::ObjectPtr<tvm::runtime::Object> const&)::{lambda(tvm::runtime::TVMArgs, tvm::runtime::TVMRetValue*)#5}::operator()(tvm::runtime::TVMArgs, tvm::runtime::TVMRetValue*) const
at /workspace/mlc-llm/cpp/llm_chat.cc:1244
8: mlc::llm::LLMChat::PrefillStep(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, bool, bool, mlc::llm::PlaceInPrompt)
at /workspace/mlc-llm/cpp/llm_chat.cc:771
7: mlc::llm::LLMChat::ForwardTokens(std::vector<int, std::allocator<int> >, long)
at /workspace/mlc-llm/cpp/llm_chat.cc:997
6: tvm::runtime::relax_vm::VirtualMachineImpl::InvokeClosurePacked(tvm::runtime::ObjectRef const&, tvm::runtime::TVMArgs, tvm::runtime::TVMRetValue*)
5: tvm::runtime::PackedFuncObj::Extractor<tvm::runtime::PackedFuncSubObj<tvm::runtime::relax_vm::VirtualMachineImpl::GetClosureInternal(tvm::runtime::String const&, bool)::{lambda(tvm::runtime::TVMArgs, tvm::runtime::TVMRetValue*)#1}> >::Call(tvm::runtime::PackedFuncObj const*, tvm::runtime::TVMArgs, tvm::runtime::TVMRetValue*)
4: tvm::runtime::relax_vm::VirtualMachineImpl::InvokeBytecode(long, std::vector<tvm::runtime::TVMRetValue, std::allocator<tvm::runtime::TVMRetValue> > const&)
3: tvm::runtime::relax_vm::VirtualMachineImpl::RunLoop()
2: tvm::runtime::relax_vm::VirtualMachineImpl::RunInstrCall(tvm::runtime::relax_vm::VMFrame*, tvm::runtime::relax_vm::Instruction)
1: tvm::runtime::PackedFuncObj::Extractor<tvm::runtime::PackedFuncSubObj<tvm::runtime::WrapPackedFunc(int (*)(TVMValue*, int*, int, TVMValue*, int*, void*), tvm::runtime::ObjectPtr<tvm::runtime::Object> const&)::{lambda(tvm::runtime::TVMArgs, tvm::runtime::TVMRetValue*)#1}> >::Call(tvm::runtime::PackedFuncObj const*, tvm::runtime::TVMArgs, tvm::runtime::TVMRetValue*)
0: _ZN3tvm7runtime6deta
3: _ZN3tvm7runtime13PackedFuncObj9ExtractorINS0_16PackedFuncSubObjIZNS0_6detail17PackFuncVoidAddr_ILi8ENS0_15CUDAWrappedFuncEEENS0_10PackedFuncET0_RKSt6vectorINS4_1
2: tvm::runtime::CUDAWrappedFunc::operator()(tvm::runtime::TVMArgs, tvm::runtime::TVMRetValue*, void**) const [clone .isra.0]
1: tvm::runtime::CUDAModuleNode::GetFunc(int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)
0: _ZN3tvm7runtime6deta
File "/workspace/tvm/src/runtime/cuda/cuda_module.cc", line 110
File "/workspace/tvm/src/runtime/library_module.cc", line 78
CUDAError: cuModuleLoadData(&(module[device_id]), data.c_str()) failed with error: CUDA_ERROR_NO_BINARY_FOR_GPU
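If it helps with triage: my reading (not confirmed) is that CUDA_ERROR_NO_BINARY_FOR_GPU means the loaded module contains no binary the current GPU can execute, which would fit the model library having been built while the A10G (compute capability 8.6) was attached and then being loaded on the A100 (compute capability 8.0). Below is a quick check I can run on both machines to compare what TVM sees, assuming the TVM wheel bundled with mlc_chat:

import tvm

# Report the compute capability of CUDA device 0 as detected by TVM.
# I would expect "8.6" on the A10G instance and "8.0" on the A100 instance.
dev = tvm.cuda(0)
print("device exists:", dev.exist)
print("compute capability:", dev.compute_version)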
Expected behavior
I expect to get an answer to the prompt, just as I did on the other GPU (the A10G).
Environment
- Platform (e.g. WebGPU/Vulkan/IOS/Android/CUDA): CUDA
- Operating system (e.g. Ubuntu/Windows/MacOS/...): Ubuntu
- Device (e.g. iPhone 12 Pro, PC+RTX 3090, ...): NVIDIA A100
- How you installed MLC-LLM (conda, source): source
- How you installed TVM-Unity (pip, source):
- Python version (e.g. 3.10): 3.11
- GPU driver version (if applicable): 470.xx
- CUDA/cuDNN version (if applicable): 11.8
- TVM Unity Hash Tag (python -c "import tvm; print('\n'.join(f'{k}: {v}' for k, v in tvm.support.libinfo().items()))", applicable if you compile models):
- Any other relevant information: