Description
🐛 Bug
To Reproduce
Steps to reproduce the behavior:
- mlc_llm convert_weight /mnt/data/ehdd1/home/models/hf/deepseek-llm-7b-chat/deepseek-llm-7b-chat --quantization e4m3_e4m3_f16 -o dist/deepseek-llm-7b-chat-e4m3-MLC
- mlc_llm gen_config /mnt/data/ehdd1/home/models/hf/deepseek-llm-7b-chat/deepseek-llm-7b-chat --quantization e4m3_e4m3_f16 --conv-template deepseek -o dist/deepseek-llm-7b-chat-e4m3-MLC/
- mlc_llm compile ./dist/deepseek-llm-7b-chat-e4m3-MLC/mlc-chat-config.json --opt O0 --device cuda -o dist/libs/deepseek-llm-7b-chat-e4m3-O0-cuda.so
Expected behavior
With q0f16, compilation worked, but with e4m3 or e5m2 it failed. (I verified that every step in the e4m3 run is identical to the q0f16 run.) The raw model I used is deepseek-llm-7b-chat from Hugging Face; I also tried Llama 2, and it likewise failed to compile with e4m3.
BTW, does mlc-llm still not support int8 quantization?
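For reference, the `max_int_value=448` that appears in the quantization config in the log below matches the largest finite value of the OCP E4M3 format, so the config itself looks consistent; a quick sanity check (plain Python, no TVM needed):

```python
# OCP E4M3: 4 exponent bits (bias 7), 3 mantissa bits.
# The all-ones bit pattern S.1111.111 is reserved for NaN, so the
# largest finite value uses mantissa 110: (1 + 6/8) * 2^(15 - 7).
max_e4m3 = (1 + 6 / 8) * 2 ** (15 - 7)
print(max_e4m3)  # 448.0
```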
Environment
- Platform: CUDA
- Operating system: Ubuntu
- Device: NVIDIA A6000
- How you installed MLC-LLM: source
- How you installed TVM-Unity: source
- Python version: 3.10
Additional context
[2025-02-08 12:57:08] INFO auto_config.py:70: Found model configuration: dist/deepseek-llm-7b-chat-e4m3-MLC/mlc-chat-config.json
[2025-02-08 12:57:09] INFO auto_device.py:79: Found device: cuda:0
[2025-02-08 12:57:09] INFO auto_device.py:79: Found device: cuda:1
[2025-02-08 12:57:09] INFO auto_target.py:78: Found configuration of target device "cuda:0": {"thread_warp_size": runtime.BoxInt(32), "arch": "sm_86", "max_threads_per_block": runtime.BoxInt(1024), "max_num_threads": runtime.BoxInt(1024), "kind": "cuda", "max_shared_memory_per_block": runtime.BoxInt(49152), "tag": "", "keys": ["cuda", "gpu"]}
[2025-02-08 12:57:09] INFO auto_target.py:110: Found host LLVM triple: x86_64-redhat-linux-gnu
[2025-02-08 12:57:09] INFO auto_target.py:111: Found host LLVM CPU: znver3
[2025-02-08 12:57:09] INFO auto_target.py:334: Generating code for CUDA architecture: sm_86
[2025-02-08 12:57:09] INFO auto_target.py:335: To produce multi-arch fatbin, set environment variable MLC_MULTI_ARCH. Example: MLC_MULTI_ARCH=70,72,75,80,86,87,89,90a
[2025-02-08 12:57:09] INFO auto_config.py:154: Found model type: llama. Use --model-type to override.
Compiling with arguments:
--config LlamaConfig(hidden_size=4096, intermediate_size=11008, num_attention_heads=32, num_hidden_layers=30, rms_norm_eps=1e-06, vocab_size=102400, tie_word_embeddings=False, position_embedding_base=10000.0, rope_scaling=None, context_window_size=4096, prefill_chunk_size=4096, num_key_value_heads=32, head_dim=128, tensor_parallel_shards=1, pipeline_parallel_stages=1, max_batch_size=128, disaggregation=False, kwargs={})
--quantization PerTensorQuantize(name='e4m3_e4m3_f16', kind='per-tensor-quant', activation_dtype='e4m3_float8', weight_dtype='e4m3_float8', storage_dtype='e4m3_float8', model_dtype='float16', quantize_embedding=False, quantize_final_fc=False, quantize_linear=True, num_elem_per_storage=1, max_int_value=448, use_scale=True, calibration_mode='inference', tensor_parallel_shards=1)
--model-type llama
--target {"thread_warp_size": runtime.BoxInt(32), "host": {"mtriple": "x86_64-redhat-linux-gnu", "tag": "", "kind": "llvm", "mcpu": "znver3", "keys": ["cpu"]}, "arch": "sm_86", "max_threads_per_block": runtime.BoxInt(1024), "libs": ["thrust"], "max_num_threads": runtime.BoxInt(1024), "kind": "cuda", "max_shared_memory_per_block": runtime.BoxInt(49152), "tag": "", "keys": ["cuda", "gpu"]}
--opt flashinfer=0;cublas_gemm=0;faster_transformer=0;cudagraph=0;cutlass=0;ipc_allreduce_strategy=NONE
--system-lib-prefix ""
--output dist/libs/deepseek-llm-7b-chat-e4m3-O0-cuda.so
--overrides context_window_size=None;sliding_window_size=None;prefill_chunk_size=None;attention_sink_size=None;max_batch_size=None;tensor_parallel_shards=None;pipeline_parallel_stages=None;disaggregation=None
[2025-02-08 12:57:09] INFO compile.py:140: Creating model from: LlamaConfig(hidden_size=4096, intermediate_size=11008, num_attention_heads=32, num_hidden_layers=30, rms_norm_eps=1e-06, vocab_size=102400, tie_word_embeddings=False, position_embedding_base=10000.0, rope_scaling=None, context_window_size=4096, prefill_chunk_size=4096, num_key_value_heads=32, head_dim=128, tensor_parallel_shards=1, pipeline_parallel_stages=1, max_batch_size=128, disaggregation=False, kwargs={})
[2025-02-08 12:57:09] INFO compile.py:158: Exporting the model to TVM Unity compiler
[2025-02-08 12:57:11] INFO compile.py:164: Running optimizations using TVM Unity
[2025-02-08 12:57:11] INFO compile.py:186: Registering metadata: {'model_type': 'llama', 'quantization': 'e4m3_e4m3_f16', 'context_window_size': 4096, 'sliding_window_size': -1, 'attention_sink_size': -1, 'prefill_chunk_size': 4096, 'tensor_parallel_shards': 1, 'pipeline_parallel_stages': 1, 'disaggregation': False, 'kv_state_kind': 'kv_cache', 'max_batch_size': 128}
[2025-02-08 12:57:12] INFO pipeline.py:55: Running TVM Relax graph-level optimizations
[2025-02-08 12:57:15] INFO pipeline.py:55: Lowering to TVM TIR kernels
[2025-02-08 12:57:23] INFO pipeline.py:55: Running TVM TIR-level optimizations
[2025-02-08 12:57:38] INFO pipeline.py:55: Running TVM Dlight low-level optimizations
[2025-02-08 12:57:44] INFO pipeline.py:55: Lowering to VM bytecode
[2025-02-08 12:57:47] INFO estimate_memory_usage.py:58: [Memory usage] Function alloc_embedding_tensor: 32.00 MB
[2025-02-08 12:57:47] INFO estimate_memory_usage.py:58: [Memory usage] Function argsort_probs: 0.00 MB
[2025-02-08 12:57:47] INFO estimate_memory_usage.py:58: [Memory usage] Function batch_decode: 14.09 MB
[2025-02-08 12:57:47] INFO estimate_memory_usage.py:58: [Memory usage] Function batch_decode_to_last_hidden_states: 15.09 MB
[2025-02-08 12:57:47] INFO estimate_memory_usage.py:58: [Memory usage] Function batch_prefill: 452.00 MB
[2025-02-08 12:57:47] INFO estimate_memory_usage.py:58: [Memory usage] Function batch_prefill_to_last_hidden_states: 483.00 MB
[2025-02-08 12:57:47] INFO estimate_memory_usage.py:58: [Memory usage] Function batch_select_last_hidden_states: 1.00 MB
[2025-02-08 12:57:47] INFO estimate_memory_usage.py:58: [Memory usage] Function batch_verify: 451.00 MB
[2025-02-08 12:57:47] INFO estimate_memory_usage.py:58: [Memory usage] Function batch_verify_to_last_hidden_states: 483.00 MB
[2025-02-08 12:57:47] INFO estimate_memory_usage.py:58: [Memory usage] Function create_tir_paged_kv_cache: 0.00 MB
[2025-02-08 12:57:47] INFO estimate_memory_usage.py:58: [Memory usage] Function decode: 0.11 MB
[2025-02-08 12:57:47] INFO estimate_memory_usage.py:58: [Memory usage] Function decode_to_last_hidden_states: 0.12 MB
[2025-02-08 12:57:47] INFO estimate_memory_usage.py:58: [Memory usage] Function embed: 32.00 MB
[2025-02-08 12:57:47] INFO estimate_memory_usage.py:58: [Memory usage] Function gather_hidden_states: 0.00 MB
[2025-02-08 12:57:47] INFO estimate_memory_usage.py:58: [Memory usage] Function get_logits: 0.00 MB
[2025-02-08 12:57:47] INFO estimate_memory_usage.py:58: [Memory usage] Function multinomial_from_uniform: 0.00 MB
[2025-02-08 12:57:47] INFO estimate_memory_usage.py:58: [Memory usage] Function prefill: 451.01 MB
[2025-02-08 12:57:47] INFO estimate_memory_usage.py:58: [Memory usage] Function prefill_to_last_hidden_states: 483.00 MB
[2025-02-08 12:57:47] INFO estimate_memory_usage.py:58: [Memory usage] Function renormalize_by_top_p: 0.00 MB
[2025-02-08 12:57:47] INFO estimate_memory_usage.py:58: [Memory usage] Function sample_with_top_p: 0.00 MB
[2025-02-08 12:57:47] INFO estimate_memory_usage.py:58: [Memory usage] Function sampler_take_probs: 0.01 MB
[2025-02-08 12:57:47] INFO estimate_memory_usage.py:58: [Memory usage] Function sampler_verify_draft_tokens: 0.00 MB
[2025-02-08 12:57:47] INFO estimate_memory_usage.py:58: [Memory usage] Function scatter_hidden_states: 0.00 MB
[2025-02-08 12:57:47] INFO estimate_memory_usage.py:58: [Memory usage] Function softmax_with_temperature: 0.00 MB
[2025-02-08 12:57:49] INFO pipeline.py:55: Compiling external modules
[2025-02-08 12:57:49] INFO pipeline.py:55: Compilation complete! Exporting to disk
Traceback (most recent call last):
File "/home/myid/kz96891/anaconda3/envs/mlc-llm/bin/mlc_llm", line 8, in <module>
sys.exit(main())
^^^^^^
File "/home/myid/kz96891/anaconda3/envs/mlc-llm/lib/python3.11/site-packages/mlc_llm/main.py", line 34, in main
cli.main(sys.argv[2:])
File "/home/myid/kz96891/anaconda3/envs/mlc-llm/lib/python3.11/site-packages/mlc_llm/cli/compile.py", line 129, in main
compile(
File "/home/myid/kz96891/anaconda3/envs/mlc-llm/lib/python3.11/site-packages/mlc_llm/interface/compile.py", line 244, in compile
_compile(args, model_config)
File "/home/myid/kz96891/anaconda3/envs/mlc-llm/lib/python3.11/site-packages/mlc_llm/interface/compile.py", line 189, in _compile
args.build_func(
File "/home/myid/kz96891/anaconda3/envs/mlc-llm/lib/python3.11/site-packages/mlc_llm/support/auto_target.py", line 301, in build
relax.build(
File "/home/myid/kz96891/anaconda3/envs/mlc-llm/lib/python3.11/site-packages/tvm/relax/vm_build.py", line 353, in build
return _vmlink(
^^^^^^^^
File "/home/myid/kz96891/anaconda3/envs/mlc-llm/lib/python3.11/site-packages/tvm/relax/vm_build.py", line 249, in _vmlink
lib = tvm.build(
^^^^^^^^^^
File "/home/myid/kz96891/anaconda3/envs/mlc-llm/lib/python3.11/site-packages/tvm/driver/build_module.py", line 297, in build
rt_mod_host = _driver_ffi.tir_to_runtime(annotated_mods, target_host)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "tvm/_ffi/_cython/./packed_func.pxi", line 339, in tvm._ffi._cy3.core.PackedFuncBase.call
File "tvm/_ffi/_cython/./packed_func.pxi", line 270, in tvm._ffi._cy3.core.FuncCall
File "tvm/_ffi/_cython/./packed_func.pxi", line 259, in tvm._ffi._cy3.core.FuncCall3
File "tvm/_ffi/_cython/./base.pxi", line 185, in tvm._ffi._cy3.core.CHECK_CALL
File "/home/myid/kz96891/anaconda3/envs/mlc-llm/lib/python3.11/site-packages/tvm/_ffi/base.py", line 481, in raise_last_ffi_error
raise py_err
tvm.error.InternalError: Traceback (most recent call last):
63: tvm::runtime::PackedFuncObj::Extractor<tvm::runtime::PackedFuncSubObj<tvm::runtime::TypedPackedFunc<tvm::runtime::Module (tvm::runtime::Map<tvm::Target, tvm::IRModule, void, void> const&, tvm::Target)>::AssignTypedLambda<tvm::__mk_TVM24::{lambda(tvm::runtime::Map<tvm::Target, tvm::IRModule, void, void> const&, tvm::Target)#1}>(tvm::__mk_TVM24::{lambda(tvm::runtime::Map<tvm::Target, tvm::IRModule, void, void> const&, tvm::Target)#1}, std::__cxx11::basic_string<char, std::char_traits, std::allocator >)::{lambda(tvm::runtime::TVMArgs const&, tvm::runtime::TVMRetValue*)#1}> >::Call(tvm::runtime::PackedFuncObj const*, std::__cxx11::basic_string<char, std::char_traits, std::allocator >, tvm::runtime::TVMRetValue)
62: tvm::TIRToRuntime(tvm::runtime::Map<tvm::Target, tvm::IRModule, void, void> const&, tvm::Target const&)
61: tvm::SplitMixedModule(tvm::IRModule, tvm::Target const&, tvm::Target const&)
60: tvm::ApplyPasses(tvm::IRModule, tvm::transform::Sequential)
59: tvm::transform::Pass::operator()(tvm::IRModule) const
58: tvm::transform::Pass::operator()(tvm::IRModule, tvm::transform::PassContext const&) const
57: tvm::transform::SequentialNode::operator()(tvm::IRModule, tvm::transform::PassContext const&) const
56: tvm::transform::Pass::operator()(tvm::IRModule, tvm::transform::PassContext const&) const
55: tvm::tir::transform::PrimFuncPassNode::operator()(tvm::IRModule, tvm::transform::PassContext const&) const
54: _ZN3tvm7runtime13PackedFun
53: tvm::runtime::TypedPackedFunc<tvm::tir::PrimFunc (tvm::tir::PrimFunc, tvm::IRModule, tvm::transform::PassContext)>::AssignTypedLambda<tvm::tir::transform::FP8ComputeLegalize(tvm::runtime::String)::{lambda(tvm::tir::PrimFunc, tvm::IRModule, tvm::transform::PassContext)#1}>(tvm::tir::transform::FP8ComputeLegalize(tvm::runtime::String)::{lambda(tvm::tir::PrimFunc, tvm::IRModule, tvm::transform::PassContext)#1})::{lambda(tvm::runtime::TVMArgs const&, tvm::runtime::TVMRetValue*)#1}::operator()(tvm::runtime::TVMArgs const&, tvm::runtime::TVMRetValue*) const
52: tvm::tir::FP8ComputeLegalizer::Legalize(tvm::tir::PrimFunc)
51: tvm::tir::StmtFunctor<tvm::tir::Stmt (tvm::tir::Stmt const&)>::VisitStmt(tvm::tir::Stmt const&)
50: _ZZN3tvm3tir11StmtFunctorI
49: tvm::tir::StmtMutator::VisitStmt(tvm::tir::Stmt const&)
48: tvm::tir::StmtFunctor<tvm::tir::Stmt (tvm::tir::Stmt const&)>::VisitStmt(tvm::tir::Stmt const&)
47: ZZN3tvm3tir11StmtFunctorI
46: tvm::tir::ComputeLegalizer::VisitStmt(tvm::tir::AttrStmtNode const*)
45: tvm::tir::StmtMutator::VisitStmt(tvm::tir::Stmt const&)
44: tvm::tir::StmtFunctor<tvm::tir::Stmt (tvm::tir::Stmt const&)>::VisitStmt(tvm::tir::Stmt const&)
43: ZZN3tvm3tir11StmtFunctorI
42: tvm::tir::ComputeLegalizer::VisitStmt(tvm::tir::AllocateNode const*)
41: tvm::tir::StmtMutator::VisitStmt(tvm::tir::Stmt const&)
40: tvm::tir::StmtFunctor<tvm::tir::Stmt (tvm::tir::Stmt const&)>::VisitStmt(tvm::tir::Stmt const&)
39: ZZN3tvm3tir11StmtFunctorI
38: tvm::tir::ComputeLegalizer::VisitStmt(tvm::tir::AllocateNode const*)
37: tvm::tir::StmtMutator::VisitStmt(tvm::tir::Stmt const&)
36: tvm::tir::StmtFunctor<tvm::tir::Stmt (tvm::tir::Stmt const&)>::VisitStmt(tvm::tir::Stmt const&)
35: ZZN3tvm3tir11StmtFunctorI
34: tvm::tir::ComputeLegalizer::VisitStmt(tvm::tir::AllocateNode const*)
33: tvm::tir::StmtMutator::VisitStmt(tvm::tir::Stmt const&)
32: tvm::tir::StmtFunctor<tvm::tir::Stmt (tvm::tir::Stmt const&)>::VisitStmt(tvm::tir::Stmt const&)
31: ZZN3tvm3tir11StmtFunctorI
30: tvm::tir::ComputeLegalizer::VisitStmt(tvm::tir::AllocateNode const*)
29: tvm::tir::StmtMutator::VisitStmt(tvm::tir::Stmt const&)
28: tvm::tir::StmtFunctor<tvm::tir::Stmt (tvm::tir::Stmt const&)>::VisitStmt(tvm::tir::Stmt const&)
27: ZZN3tvm3tir11StmtFunctorI
26: tvm::tir::ComputeLegalizer::VisitStmt(tvm::tir::AttrStmtNode const*)
25: tvm::tir::StmtMutator::VisitStmt(tvm::tir::Stmt const&)
24: tvm::tir::StmtFunctor<tvm::tir::Stmt (tvm::tir::Stmt const&)>::VisitStmt(tvm::tir::Stmt const&)
23: ZZN3tvm3tir11StmtFunctorI
22: tvm::tir::ComputeLegalizer::VisitStmt(tvm::tir::AttrStmtNode const*)
21: tvm::tir::StmtMutator::VisitStmt(tvm::tir::Stmt const&)
20: tvm::tir::StmtFunctor<tvm::tir::Stmt (tvm::tir::Stmt const&)>::VisitStmt(tvm::tir::Stmt const&)
19: ZZN3tvm3tir11StmtFunctorI
18: tvm::tir::ComputeLegalizer::VisitStmt(tvm::tir::AttrStmtNode const*)
17: tvm::tir::StmtMutator::VisitStmt(tvm::tir::Stmt const&)
16: tvm::tir::StmtFunctor<tvm::tir::Stmt (tvm::tir::Stmt const&)>::VisitStmt(tvm::tir::Stmt const&)
15: _ZZN3tvm3tir11StmtFunctorIFNS
14: tvm::runtime::Array<tvm::tir::Stmt, std::enable_if<std::is_base_of<tvm::runtime::ObjectRef, tvm::tir::Stmt>::value, void>::type> tvm::tir::StmtMutator::Internal::MutateArray<tvm::tir::Stmt, tvm::tir::StmtMutator::Internal::Mutate(tvm::tir::StmtMutator*, tvm::runtime::Array<tvm::tir::Stmt, void> const&)::{lambda(tvm::tir::Stmt const&)#1}>(tvm::tir::StmtMutator*, tvm::runtime::Array<tvm::tir::Stmt, std::enable_if<std::is_base_of<tvm::runtime::ObjectRef, tvm::tir::Stmt>::value, void>::type> const&, tvm::tir::StmtMutator::Internal::Mutate(tvm::tir::StmtMutator*, tvm::runtime::Array<tvm::tir::Stmt, void> const&)::{lambda(tvm::tir::Stmt const&)#1})
13: tvm::runtime::ObjectPtrtvm::runtime::Object tvm::runtime::Array<tvm::tir::Stmt, void>::MapHelper<tvm::tir::StmtMutator::Internal::Mutate(tvm::tir::StmtMutator*, tvm::runtime::Array<tvm::tir::Stmt, void> const&)::{lambda(tvm::tir::Stmt const&)#1}, tvm::tir::Stmt>(tvm::runtime::ObjectPtrtvm::runtime::Object, tvm::tir::StmtMutator::Internal::Mutate(tvm::tir::StmtMutator*, tvm::runtime::Array<tvm::tir::Stmt, void> const&)::{lambda(tvm::tir::Stmt const&)#1})
12: tvm::tir::StmtMutator::VisitStmt(tvm::tir::Stmt const&)
11: tvm::tir::StmtFunctor<tvm::tir::Stmt (tvm::tir::Stmt const&)>::VisitStmt(tvm::tir::Stmt const&)
10: _ZZN3tvm3tir11StmtFunctorI
9: tvm::tir::StmtMutator::VisitStmt(tvm::tir::Stmt const&)
8: tvm::tir::StmtFunctor<tvm::tir::Stmt (tvm::tir::Stmt const&)>::VisitStmt(tvm::tir::Stmt const&)
7: _ZZN3tvm3tir11StmtFunctorIFNS
6: tvm::runtime::Array<tvm::tir::Stmt, std::enable_if<std::is_base_of<tvm::runtime::ObjectRef, tvm::tir::Stmt>::value, void>::type> tvm::tir::StmtMutator::Internal::MutateArray<tvm::tir::Stmt, tvm::tir::StmtMutator::Internal::Mutate(tvm::tir::StmtMutator*, tvm::runtime::Array<tvm::tir::Stmt, void> const&)::{lambda(tvm::tir::Stmt const&)#1}>(tvm::tir::StmtMutator*, tvm::runtime::Array<tvm::tir::Stmt, std::enable_if<std::is_base_of<tvm::runtime::ObjectRef, tvm::tir::Stmt>::value, void>::type> const&, tvm::tir::StmtMutator::Internal::Mutate(tvm::tir::StmtMutator*, tvm::runtime::Array<tvm::tir::Stmt, void> const&)::{lambda(tvm::tir::Stmt const&)#1})
5: tvm::runtime::ObjectPtrtvm::runtime::Object tvm::runtime::Array<tvm::tir::Stmt, void>::MapHelper<tvm::tir::StmtMutator::Internal::Mutate(tvm::tir::StmtMutator*, tvm::runtime::Array<tvm::tir::Stmt, void> const&)::{lambda(tvm::tir::Stmt const&)#1}, tvm::tir::Stmt>(tvm::runtime::ObjectPtrtvm::runtime::Object, tvm::tir::StmtMutator::Internal::Mutate(tvm::tir::StmtMutator*, tvm::runtime::Array<tvm::tir::Stmt, void> const&)::{lambda(tvm::tir::Stmt const&)#1})
4: tvm::tir::StmtMutator::VisitStmt(tvm::tir::Stmt const&)
3: tvm::tir::StmtFunctor<tvm::tir::Stmt (tvm::tir::Stmt const&)>::VisitStmt(tvm::tir::Stmt const&)
2: ZZN3tvm3tir11StmtFunctorI
1: tvm::tir::ComputeLegalizer::VisitStmt(tvm::tir::BufferStoreNode const*)
0: _ZN3tvm7runtime6deta
File "/workspace/tvm/src/tir/transforms/unsupported_dtype_legalize.cc", line 330
InternalError: Check failed: (MatchDType(value->dtype)) is false: