❓ General Questions
Hi, I am a beginner with mlc-llm and am trying to write my own quantization methods. I want to implement SmoothQuant quantization, and I have written some code that may contain bugs. When I test it by running debug_chat.py, I run into problems. Here is the output:
```
======================= Starts Tokenization & Embedding =======================
Parsed prompt using conversation template phi-2: ['Instruct: Where is Beijing?\nOutput:']
Input tokens: [43993 25 6350 318 11618 30 198 26410 25]
======================= Starts Prefill =======================
f11_fused_NT_matmul6_add6_add8_add8 has INF: 5
Traceback (most recent call last):
  File "/home/shi/smq/mlc-llm/python/mlc_llm/testing/debug_chat.py", line 454, in <module>
    main()
  File "/home/shi/smq/mlc-llm/python/mlc_llm/testing/debug_chat.py", line 450, in main
    dc.generate("Where is Beijing?", 3)
  File "/home/shi/smq/mlc-llm/python/mlc_llm/testing/debug_chat.py", line 384, in generate
    next_token = self._sample_token_from_logits(logits)
  File "/home/shi/smq/mlc-llm/python/mlc_llm/testing/debug_chat.py", line 358, in _sample_token_from_logits
    next_token = self.sample_topp_from_prob_func(logits, top_p, random.random())
  File "tvm/_ffi/_cython/./packed_func.pxi", line 332, in tvm._ffi._cy3.core.PackedFuncBase.call
  File "tvm/_ffi/_cython/./packed_func.pxi", line 263, in tvm._ffi._cy3.core.FuncCall
  File "tvm/_ffi/_cython/./packed_func.pxi", line 252, in tvm._ffi._cy3.core.FuncCall3
  File "tvm/_ffi/_cython/./base.pxi", line 182, in tvm._ffi._cy3.core.CHECK_CALL
  File "/home/shi/anaconda3/envs/newmlc/lib/python3.10/site-packages/tvm/_ffi/base.py", line 481, in raise_last_ffi_error
    raise py_err
tvm._ffi.base.TVMError: Traceback (most recent call last):
  3: _ZN3tvm7runtime13PackedFun
  2: tvm::runtime::TypedPackedFunc<int (tvm::runtime::NDArray, double, double)>::AssignTypedLambda<int (*)(tvm::runtime::NDArray, double, double)>(int (*)(tvm::runtime::NDArray, double, double), std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >)::{lambda(tvm::runtime::TVMArgs const&, tvm::runtime::TVMRetValue*)#1}::operator()(tvm::runtime::TVMArgs const&, tvm::runtime::TVMRetValue*) const
  1: tvm::runtime::relax_vm::SampleTopPFromProb(tvm::runtime::NDArray, double, double)
  0: _ZN3tvm7runtime6deta
  File "/workspace/tvm/src/runtime/relax_vm/lm_support.cc", line 490
TVMError: The output probabilities are all NaNs, can not sample from it
```
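The two messages are likely related: sampling presumably applies a softmax to the logits first, and a single INF in the logits is enough to turn every probability into NaN under the standard max-subtraction softmax. A minimal NumPy sketch of the effect (this mirrors the common numerically stabilized softmax, not the exact TVM kernel):

```python
import numpy as np

# One INF among the logits poisons every softmax probability:
logits = np.array([1.0, 2.0, np.inf], dtype=np.float32)

# standard stabilized softmax: subtract the max before exponentiating
m = logits.max()                  # m = inf
shifted = logits - m              # inf - inf = nan for the INF entry
e = np.exp(shifted)               # [0, 0, nan]
probs = e / e.sum()               # sum is nan, so every entry becomes nan

print(probs)                      # all NaN
```

So the "has INF" warning during prefill and the "all NaNs" sampling error point at the same root cause: an overflow somewhere in the matmul/add chain.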
My guess is that TVM may not support int8 multiplication, since I multiply w and x directly as int8. When I previously dequantized w and x back to FP16 and multiplied them, this problem did not happen. The relevant code is:
```python
w = self.q_weight
w = nn.op.permute_dims(w)
x = nn.op.matmul(
    x_q, w, out_dtype=self.out_dtype  # self.out_dtype is fp16
)
```
Thanks for any help.