
[Bug] CUDA_ERROR_NO_BINARY_FOR_GPU  #994

@ali-libertify

Description


🐛 Bug

I was running Llama 2 Chat 13B (q8f16_1) on an A10G hosted on Hugging Face. I switched to an A100 and started getting the following error whenever I send a query to the LLM through the REST API:
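For reference, the two GPUs report different compute capabilities (A10G: 8.6, A100: 8.0), which is the usual trigger for CUDA_ERROR_NO_BINARY_FOR_GPU when the loaded model library carries no binary/PTX for the running device. A minimal sketch to print what the TVM runtime sees on each machine (assumes the same TVM Unity CUDA wheel that mlc_chat uses):

import tvm

# Print the CUDA device the TVM runtime sees; run this on both the A10G and
# the A100 machine and compare against the arch the model library was built for.
dev = tvm.cuda(0)
if not dev.exist:
    raise SystemExit("No CUDA device visible to TVM")
print("device name     :", dev.device_name)
print("compute version :", dev.compute_version)  # A10G reports "8.6", A100 reports "8.0"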

To Reproduce

Steps to reproduce the behavior:

1- Run the mlc_chat REST app with Llama-2-chat-13b quantized with q8f16_1
2- Send a request to the server (a request sketch follows below)
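A minimal sketch of step 2 (the host, port, endpoint path, and JSON field are assumptions inferred from the rest.py generate handler in the traceback below, which reads request.prompt; adjust to the actual deployment):

import requests

# Hypothetical endpoint and payload, inferred from the traceback's `generate`
# handler (`request.prompt`); not the exact URL used in the deployment.
resp = requests.post(
    "http://127.0.0.1:8000/generate",
    json={"prompt": "What is the capital of France?"},
    timeout=120,
)
print(resp.status_code)  # 500 on the A100, since prefill fails with CUDA_ERROR_NO_BINARY_FOR_GPU
print(resp.text)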

ERROR: Exception in ASGI application
Traceback (most recent call last):
File "/usr/local/lib/python3.10/dist-packages/uvicorn/protocols/http/h11_impl.py", line 408, in run_asgi
result = await app( # type: ignore[func-returns-value]
File "/usr/local/lib/python3.10/dist-packages/uvicorn/middleware/proxy_headers.py", line 84, in call
return await self.app(scope, receive, send)
File "/usr/local/lib/python3.10/dist-packages/fastapi/applications.py", line 292, in call
await super().call(scope, receive, send)
File "/usr/local/lib/python3.10/dist-packages/starlette/applications.py", line 122, in call
await self.middleware_stack(scope, receive, send)
File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/errors.py", line 184, in call
raise exc
File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/errors.py", line 162, in call
await self.app(scope, receive, _send)
File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/cors.py", line 83, in call
await self.app(scope, receive, send)
File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/exceptions.py", line 79, in call
raise exc
File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/exceptions.py", line 68, in call
await self.app(scope, receive, sender)
File "/usr/local/lib/python3.10/dist-packages/fastapi/middleware/asyncexitstack.py", line 20, in call
raise e
File "/usr/local/lib/python3.10/dist-packages/fastapi/middleware/asyncexitstack.py", line 17, in call
await self.app(scope, receive, send)
File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 718, in call
await route.handle(scope, receive, send)
File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 276, in handle
await self.app(scope, receive, send)
File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 66, in app
response = await func(request)
File "/usr/local/lib/python3.10/dist-packages/fastapi/routing.py", line 273, in app
raw_response = await run_endpoint_function(
File "/usr/local/lib/python3.10/dist-packages/fastapi/routing.py", line 190, in run_endpoint_function
return await dependant.call(values)
File "/code/run/rest.py", line 231, in generate
msg = session["chat_mod"].generate(prompt=request.prompt)
File "/usr/local/lib/python3.10/dist-packages/mlc_chat/chat_module.py", line 639, in generate
self._prefill(prompt)
File "/usr/local/lib/python3.10/dist-packages/mlc_chat/chat_module.py", line 808, in _prefill
self._prefill_func(input, decode_next_token, place_in_prompt.value)
File "tvm/_ffi/_cython/./packed_func.pxi", line 332, in tvm._ffi._cy3.core.PackedFuncBase.call
File "tvm/_ffi/_cython/./packed_func.pxi", line 263, in tvm._ffi._cy3.core.FuncCall
File "tvm/_ffi/_cython/./packed_func.pxi", line 252, in tvm._ffi._cy3.core.FuncCall3
File "tvm/_ffi/_cython/./base.pxi", line 182, in tvm._ffi._cy3.core.CHECK_CALL
File "/usr/local/lib/python3.10/dist-packages/tvm/_ffi/base.py", line 476, in raise_last_ffi_error
raise py_err
File "/workspace/mlc-llm/cpp/llm_chat.cc", line 1244, in mlc::llm::LLMChatModule::GetFunction(tvm::runtime::String const&, tvm::runtime::ObjectPtrtvm::runtime::Object const&)::{lambda(tvm::runtime::TVMArgs, tvm::runtime::TVMRetValue
)#5}::operator()(tvm::runtime::TVMArgs, tvm::runtime::TVMRetValue
) const
File "/workspace/mlc-llm/cpp/llm_chat.cc", line 771, in mlc::llm::LLMChat::PrefillStep(std::__cxx11::basic_string<char, std::char_traits, std::allocator >, bool, bool, mlc::llm::PlaceInPrompt)
File "/workspace/mlc-llm/cpp/llm_chat.cc", line 997, in mlc::llm::LLMChat::ForwardTokens(std::vector<int, std::allocator >, long)
tvm._ffi.base.TVMError: Traceback (most recent call last):
9: mlc::llm::LLMChatModule::GetFunction(tvm::runtime::String const&, tvm::runtime::ObjectPtr<tvm::runtime::Object> const&)::{lambda(tvm::runtime::TVMArgs, tvm::runtime::TVMRetValue*)#5}::operator()(tvm::runtime::TVMArgs, tvm::runtime::TVMRetValue*) const
at /workspace/mlc-llm/cpp/llm_chat.cc:1244
8: mlc::llm::LLMChat::PrefillStep(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, bool, bool, mlc::llm::PlaceInPrompt)
at /workspace/mlc-llm/cpp/llm_chat.cc:771
7: mlc::llm::LLMChat::ForwardTokens(std::vector<int, std::allocator<int> >, long)
at /workspace/mlc-llm/cpp/llm_chat.cc:997
6: tvm::runtime::relax_vm::VirtualMachineImpl::InvokeClosurePacked(tvm::runtime::ObjectRef const&, tvm::runtime::TVMArgs, tvm::runtime::TVMRetValue*)
5: tvm::runtime::PackedFuncObj::Extractor<tvm::runtime::PackedFuncSubObj<tvm::runtime::relax_vm::VirtualMachineImpl::GetClosureInternal(tvm::runtime::String const&, bool)::{lambda(tvm::runtime::TVMArgs, tvm::runtime::TVMRetValue*)#1}> >::Call(tvm::runtime::PackedFuncObj const*, tvm::runtime::TVMArgs, tvm::runtime::TVMRetValue*)
4: tvm::runtime::relax_vm::VirtualMachineImpl::InvokeBytecode(long, std::vector<tvm::runtime::TVMRetValue, std::allocator<tvm::runtime::TVMRetValue> > const&)
3: tvm::runtime::relax_vm::VirtualMachineImpl::RunLoop()
2: tvm::runtime::relax_vm::VirtualMachineImpl::RunInstrCall(tvm::runtime::relax_vm::VMFrame*, tvm::runtime::relax_vm::Instruction)
1: tvm::runtime::PackedFuncObj::Extractor<tvm::runtime::PackedFuncSubObj<tvm::runtime::WrapPackedFunc(int (*)(TVMValue*, int*, int, TVMValue*, int*, void*), tvm::runtime::ObjectPtr<tvm::runtime::Object> const&)::{lambda(tvm::runtime::TVMArgs, tvm::runtime::TVMRetValue*)#1}> >::Call(tvm::runtime::PackedFuncObj const*, tvm::runtime::TVMArgs, tvm::runtime::TVMRetValue*)
0: _ZN3tvm7runtime6deta
3: _ZN3tvm7runtime13PackedFuncObj9ExtractorINS0_16PackedFuncSubObjIZNS0_6detail17PackFuncVoidAddr_ILi8ENS0_15CUDAWrappedFuncEEENS0_10PackedFuncET0_RKSt6vectorINS4_1
2: tvm::runtime::CUDAWrappedFunc::operator()(tvm::runtime::TVMArgs, tvm::runtime::TVMRetValue*, void**) const [clone .isra.0]
1: tvm::runtime::CUDAModuleNode::GetFunc(int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)
0: _ZN3tvm7runtime6deta
File "/workspace/tvm/src/runtime/cuda/cuda_module.cc", line 110
File "/workspace/tvm/src/runtime/library_module.cc", line 78
CUDAError: cuModuleLoadData(&(module_[device_id]), data_.c_str()) failed with error: CUDA_ERROR_NO_BINARY_FOR_GPU

Expected behavior

I expect to get an answer to the prompt, just as I do on the other GPU (the A10G).

Environment

  • Platform (e.g. WebGPU/Vulkan/IOS/Android/CUDA): CUDA
  • Operating system (e.g. Ubuntu/Windows/MacOS/...): Ubuntu
  • Device (e.g. iPhone 12 Pro, PC+RTX 3090, ...): NVIDIA A100
  • How you installed MLC-LLM (conda, source): source
  • How you installed TVM-Unity (pip, source):
  • Python version (e.g. 3.10): 3.11
  • GPU driver version (if applicable): 470.xx
  • CUDA/cuDNN version (if applicable): 11.8
  • TVM Unity Hash Tag (python -c "import tvm; print('\n'.join(f'{k}: {v}' for k, v in tvm.support.libinfo().items()))", applicable if you compile models; a runnable version of this one-liner follows after this list):
  • Any other relevant information:
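The libinfo one-liner from the list above, expanded into a short script that also filters the CUDA-related build entries (a sketch; it only uses tvm.support.libinfo() and a substring match on "CUDA", no specific keys are assumed):

import tvm

# Same information as the one-liner above, plus a filtered view of the
# CUDA-related build options, which helps confirm how the wheel was built.
info = tvm.support.libinfo()
for key, value in sorted(info.items()):
    print(f"{key}: {value}")

print("\nCUDA-related entries:")
for key, value in sorted(info.items()):
    if "CUDA" in key.upper():
        print(f"{key}: {value}")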

Additional context
