
[Question] Quantization Problems #2523

@ponytaill

❓ General Questions

Hi, I am a beginner with mlc-llm and am trying to implement a quantization method myself. I want to add SmoothQuant quantization and have written some code that may contain bugs. When I test it by running debug_chat.py, I hit an error. Here is the output:

======================= Starts Tokenization & Embedding =======================
Parsed prompt using conversation template phi-2: ['Instruct: Where is Beijing?\nOutput:']
Input tokens: [43993 25 6350 318 11618 30 198 26410 25]
======================= Starts Prefill =======================
f11_fused_NT_matmul6_add6_add8_add8 has INF: 5
Traceback (most recent call last):
File "/home/shi/smq/mlc-llm/python/mlc_llm/testing/debug_chat.py", line 454, in
main()
File "/home/shi/smq/mlc-llm/python/mlc_llm/testing/debug_chat.py", line 450, in main
dc.generate("Where is Beijing?", 3)
File "/home/shi/smq/mlc-llm/python/mlc_llm/testing/debug_chat.py", line 384, in generate
next_token = self._sample_token_from_logits(logits)
File "/home/shi/smq/mlc-llm/python/mlc_llm/testing/debug_chat.py", line 358, in _sample_token_from_logits
next_token = self.sample_topp_from_prob_func(logits, top_p, random.random())
File "tvm/_ffi/_cython/./packed_func.pxi", line 332, in tvm._ffi._cy3.core.PackedFuncBase.call
File "tvm/_ffi/_cython/./packed_func.pxi", line 263, in tvm._ffi._cy3.core.FuncCall
File "tvm/_ffi/_cython/./packed_func.pxi", line 252, in tvm._ffi._cy3.core.FuncCall3
File "tvm/_ffi/_cython/./base.pxi", line 182, in tvm._ffi._cy3.core.CHECK_CALL
File "/home/shi/anaconda3/envs/newmlc/lib/python3.10/site-packages/tvm/_ffi/base.py", line 481, in raise_last_ffi_error
raise py_err
tvm._ffi.base.TVMError: Traceback (most recent call last):
3: _ZN3tvm7runtime13PackedFun
2: tvm::runtime::TypedPackedFunc<int (tvm::runtime::NDArray, double, double)>::AssignTypedLambda<int ()(tvm::runtime::NDArray, double, double)>(int ()(tvm::runtime::NDArray, double, double), std::__cxx11::basic_string<char, std::char_traits, std::allocator >)::{lambda(tvm::runtime::TVMArgs const&, tvm::runtime::TVMRetValue*)#1}::operator()(tvm::runtime::TVMArgs const&, tvm::runtime::TVMRetValue*) const
1: tvm::runtime::relax_vm::SampleTopPFromProb(tvm::runtime::NDArray, double, double)
0: _ZN3tvm7runtime6deta
File "/workspace/tvm/src/runtime/relax_vm/lm_support.cc", line 490
TVMError: The output probabilities are all NaNs, can not sample from it

My guess is that the problem comes from multiplying w and x directly while both are still quantized as int8 — perhaps TVM does not handle this int8 matmul the way I expect. When I instead dequantized w and x back to FP16 before multiplying them, the problem did not occur.

    # self.q_weight and x_q are both quantized to int8 here
    w = self.q_weight
    w = nn.op.permute_dims(w)  # transpose the weight for the matmul
    x = nn.op.matmul(
        x_q, w, out_dtype=self.out_dtype  # self.out_dtype is fp16
    )
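To illustrate what I suspect is happening, here is a standalone NumPy sketch (the scales, shapes, and variable names are made up for illustration; this is not the mlc-llm/SmoothQuant code path): multiplying the int8 tensors directly lets the narrow accumulator wrap around, while widening (or dequantizing) before the matmul gives a sane result.

```python
import numpy as np

# Made-up per-tensor scales and shapes, purely for illustration.
s_x, s_w = 0.04, 0.05

rng = np.random.default_rng(0)
x_q = rng.integers(-127, 128, size=(4, 64), dtype=np.int8)
w_q = rng.integers(-127, 128, size=(64, 64), dtype=np.int8)

# (a) Multiply the int8 tensors directly: NumPy keeps the int8 dtype,
# so the partial sums wrap around before the cast to fp16.
y_wrapped = (x_q @ w_q).astype(np.float16) * np.float16(s_x * s_w)

# (b) Widen first (equivalent to dequantizing before the matmul):
# accumulate in int32, then rescale to fp16.
y_wide = (x_q.astype(np.int32) @ w_q.astype(np.int32)).astype(np.float16) * np.float16(s_x * s_w)

# The two results disagree badly once the int8 accumulator wraps.
print(np.abs(y_wrapped - y_wide).max())
```

If something similar happens inside the generated kernel, the overflowed values could explain the INF reported for f11_fused_NT_matmul6_add6_add8_add8 and the all-NaN probabilities downstream.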

Thanks for any help.
