❓ General Questions
Hi, I am a beginner with mlc-llm and am trying to write my own quantization methods. I want to implement SmoothQuant quantization, and I have written some code that may contain bugs. When I test it by running debug_chat.py, I run into problems. Here is the output:
```
======================= Starts Tokenization & Embedding =======================
Parsed prompt using conversation template phi-2: ['Instruct: Where is Beijing?\nOutput:']
Input tokens: [43993 25 6350 318 11618 30 198 26410 25]
======================= Starts Prefill =======================
f11_fused_NT_matmul6_add6_add8_add8 has INF: 5
Traceback (most recent call last):
  File "/home/shi/smq/mlc-llm/python/mlc_llm/testing/debug_chat.py", line 454, in <module>
    main()
  File "/home/shi/smq/mlc-llm/python/mlc_llm/testing/debug_chat.py", line 450, in main
    dc.generate("Where is Beijing?", 3)
  File "/home/shi/smq/mlc-llm/python/mlc_llm/testing/debug_chat.py", line 384, in generate
    next_token = self._sample_token_from_logits(logits)
  File "/home/shi/smq/mlc-llm/python/mlc_llm/testing/debug_chat.py", line 358, in _sample_token_from_logits
    next_token = self.sample_topp_from_prob_func(logits, top_p, random.random())
  File "tvm/_ffi/_cython/./packed_func.pxi", line 332, in tvm._ffi._cy3.core.PackedFuncBase.call
  File "tvm/_ffi/_cython/./packed_func.pxi", line 263, in tvm._ffi._cy3.core.FuncCall
  File "tvm/_ffi/_cython/./packed_func.pxi", line 252, in tvm._ffi._cy3.core.FuncCall3
  File "tvm/_ffi/_cython/./base.pxi", line 182, in tvm._ffi._cy3.core.CHECK_CALL
  File "/home/shi/anaconda3/envs/newmlc/lib/python3.10/site-packages/tvm/_ffi/base.py", line 481, in raise_last_ffi_error
    raise py_err
tvm._ffi.base.TVMError: Traceback (most recent call last):
  3: _ZN3tvm7runtime13PackedFun
  2: tvm::runtime::TypedPackedFunc<int (tvm::runtime::NDArray, double, double)>::AssignTypedLambda<int (*)(tvm::runtime::NDArray, double, double)>(int (*)(tvm::runtime::NDArray, double, double), std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >)::{lambda(tvm::runtime::TVMArgs const&, tvm::runtime::TVMRetValue*)#1}::operator()(tvm::runtime::TVMArgs const&, tvm::runtime::TVMRetValue*) const
  1: tvm::runtime::relax_vm::SampleTopPFromProb(tvm::runtime::NDArray, double, double)
  0: _ZN3tvm7runtime6deta
  File "/workspace/tvm/src/runtime/relax_vm/lm_support.cc", line 490
TVMError: The output probabilities are all NaNs, can not sample from it
```
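The two messages are likely related: sampling presumably applies a softmax to the logits first, and a single INF in the logits is enough to turn every probability into NaN under the standard max-subtraction softmax. A minimal NumPy sketch of the effect (this mirrors the common numerically stabilized softmax, not the exact TVM kernel):

```python
import numpy as np

# One INF among the logits poisons every softmax probability:
logits = np.array([1.0, 2.0, np.inf], dtype=np.float32)

# standard stabilized softmax: subtract the max before exponentiating
m = logits.max()                  # m = inf
shifted = logits - m              # inf - inf = nan for the INF entry
e = np.exp(shifted)               # [0, 0, nan]
probs = e / e.sum()               # sum is nan, so every entry becomes nan

print(probs)                      # all NaN
```

So the "has INF" warning during prefill and the "all NaNs" sampling error point at the same root cause: an overflow somewhere in the matmul/add chain.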
My guess is that TVM may not support int8 multiplication, since I multiply w and x directly as int8. When I previously dequantized w and x back to FP16 and multiplied them, this problem did not happen. The relevant code is:
```python
w = self.q_weight
w = nn.op.permute_dims(w)
x = nn.op.matmul(
    x_q, w, out_dtype=self.out_dtype  # self.out_dtype is fp16
)
```
Thanks for any help.