
Doesn't work with llama3.1 8b #38

Open
ShuaiShao93 opened this issue Dec 21, 2024 · 3 comments

@ShuaiShao93
When I ran

python -m deepcompressor.app.llm.ptq examples/llm/configs/qoq-g128.yaml --model-name ~/Meta-Llama-3.1-8B-Instruct

I got this error:

Traceback (most recent call last):
  File "/home/ss/deepcompressor/deepcompressor/app/llm/ptq.py", line 384, in <module>
    main(config, logging_level=tools.logging.DEBUG)
  File "/home/ss/deepcompressor/deepcompressor/app/llm/ptq.py", line 352, in main
    model = ptq(
            ^^^^
  File "/home/ss/deepcompressor/deepcompressor/app/llm/ptq.py", line 154, in ptq
    reorder_llm(model, config, tokenizer, reorder_cache=reorder_cache)
  File "/home/ss/.cache/pypoetry/virtualenvs/deepcompressor-_YvnLDOG-py3.12/lib/python3.12/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/home/ss/deepcompressor/deepcompressor/app/llm/quant/reorder.py", line 268, in reorder_llm
    if "residual" not in reorder_cache and not config.reorder.dynamic and config.reorder.is_enabled_for("residual"):
       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ss/.cache/pypoetry/virtualenvs/deepcompressor-_YvnLDOG-py3.12/lib/python3.12/site-packages/torch/_tensor.py", line 1114, in __contains__
    raise RuntimeError(
RuntimeError: Tensor.__contains__ only supports Tensor or scalar, but you passed in a <class 'str'>.
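
The check that fails is the membership test "residual" not in reorder_cache; if the cached reorder object was loaded as a plain tensor rather than a dict (for example from a stale or partially written cache file), PyTorch's Tensor.__contains__ rejects the string key. A minimal standalone reproduction of that RuntimeError, illustrative only and not the deepcompressor code path:

import torch

cache = torch.zeros(4)   # stands in for a reorder cache entry that is a tensor, not a dict
try:
    "residual" in cache  # Tensor.__contains__ only supports Tensor or scalar
except RuntimeError as err:
    print(err)           # same message as in the traceback above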
@bobboli (Contributor) commented Jan 3, 2025

Hi,
I can't reproduce the error. Could you try to clean up the cache runs/llm/llama-3.1/* and retry:

python -m deepcompressor.app.llm.ptq configs/qoq-g128.yaml --model-name llama-3.1-8b-instruct --model-path /path/to/Llama-3.1-8B-Instruct/
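
For example, removing the stale cache before rerunning (assuming the default runs/ output directory):

rm -rf runs/llm/llama-3.1/*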

@ShuaiShao93 (Author)

Thanks, this works!

However, when I further convert the checkpoint with trtllm 0.15.0:

export TRTLLM_DISABLE_UNIFIED_CONVERTER=1
python3 TensorRT-LLM/examples/llama/convert_checkpoint.py --model_dir ./Meta-Llama-3.1-8B-Instruct --output_dir ./tllm_8b_checkpoint_1gpu_w4a8  --dtype float16 --quant_ckpt_path ./Llama-3.1-8B-QServe-g128 --use_qserve --per_group

It fails with

[01/03/2025-23:43:00] [TRT-LLM] [I] Processing weights in layer: 0
Traceback (most recent call last):
  File "/home/ss/TensorRT-LLM/examples/llama/convert_checkpoint.py", line 555, in <module>
    main()
  File "/home/ss/TensorRT-LLM/examples/llama/convert_checkpoint.py", line 547, in main
    convert_and_save_hf(args)
  File "/home/ss/TensorRT-LLM/examples/llama/convert_checkpoint.py", line 488, in convert_and_save_hf
    execute(args.workers, [convert_and_save_rank] * world_size, args)
  File "/home/ss/TensorRT-LLM/examples/llama/convert_checkpoint.py", line 495, in execute
    f(args, rank)
  File "/home/ss/TensorRT-LLM/examples/llama/convert_checkpoint.py", line 472, in convert_and_save_rank
    llama = LLaMAForCausalLM.from_hugging_face(
  File "/opt/conda/lib/python3.10/site-packages/tensorrt_llm/models/llama/model.py", line 419, in from_hugging_face
    weights = load_weights_from_lmquant(quant_ckpt_path, config)
  File "/opt/conda/lib/python3.10/site-packages/tensorrt_llm/models/llama/convert.py", line 2072, in load_weights_from_lmquant
    v = [
  File "/opt/conda/lib/python3.10/site-packages/tensorrt_llm/models/llama/convert.py", line 2073, in <listcomp>
    load(f'{prefix}.self_attn.{comp}_proj.{suffix}')
  File "/opt/conda/lib/python3.10/site-packages/tensorrt_llm/models/llama/convert.py", line 1951, in load
    v = quant_params[key]
KeyError: 'model.layers.0.self_attn.q_proj.weight.zero'

Is this related?
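
One way to narrow this down is to check whether the exported QServe checkpoint contains any zero-point tensors at all. The snippet below is a rough diagnostic sketch; the *.pt glob is an assumption about the export layout, not a confirmed deepcompressor artifact, so adjust it to match the files the exporter actually wrote:

import glob
import torch

# List zero-point keys found in the exported checkpoint files (layout assumed; adjust as needed).
for path in glob.glob("Llama-3.1-8B-QServe-g128/*.pt"):
    state = torch.load(path, map_location="cpu")
    if isinstance(state, dict):
        zeros = [k for k in state if isinstance(k, str) and k.endswith(".zero")]
        print(path, len(zeros), zeros[:3])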

@ShuaiShao93 (Author)

Ah, I had to upgrade trtllm to 0.16.0.

Now it fails with:

Traceback (most recent call last):
  File "/opt/conda/bin/trtllm-build", line 8, in <module>
    sys.exit(main())
  File "/opt/conda/lib/python3.10/site-packages/tensorrt_llm/commands/build.py", line 627, in main
    parallel_build(model_config, ckpt_dir, build_config, args.output_dir,
  File "/opt/conda/lib/python3.10/site-packages/tensorrt_llm/commands/build.py", line 425, in parallel_build
    passed = build_and_save(rank, rank % workers, ckpt_dir,
  File "/opt/conda/lib/python3.10/site-packages/tensorrt_llm/commands/build.py", line 390, in build_and_save
    engine = build_model(build_config,
  File "/opt/conda/lib/python3.10/site-packages/tensorrt_llm/commands/build.py", line 383, in build_model
    return build(model, build_config)
  File "/opt/conda/lib/python3.10/site-packages/tensorrt_llm/builder.py", line 1237, in build
    model(**inputs)
  File "/opt/conda/lib/python3.10/site-packages/tensorrt_llm/module.py", line 52, in __call__
    output = self.forward(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/tensorrt_llm/models/modeling_utils.py", line 988, in forward
    hidden_states = self.transformer.forward(**kwargs)
  File "/opt/conda/lib/python3.10/site-packages/tensorrt_llm/models/llama/model.py", line 311, in forward
    hidden_states = self.layers.forward(
  File "/opt/conda/lib/python3.10/site-packages/tensorrt_llm/models/modeling_utils.py", line 543, in forward
    hidden_states = layer(
  File "/opt/conda/lib/python3.10/site-packages/tensorrt_llm/module.py", line 52, in __call__
    output = self.forward(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/tensorrt_llm/models/llama/model.py", line 163, in forward
    attention_output = self.attention(
  File "/opt/conda/lib/python3.10/site-packages/tensorrt_llm/module.py", line 52, in __call__
    output = self.forward(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/tensorrt_llm/quantization/layers.py", line 2793, in forward
    assert lora_layer_params is None, "lora is not supported on SmoothQuantAttention now"
AssertionError: lora is not supported on SmoothQuantAttention now
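
The assertion comes from SmoothQuantAttention.forward being handed lora_layer_params, which should only be populated when LoRA support is enabled for the engine build. So one thing to check is whether the trtllm-build invocation (or a default in the build config) turns on the LoRA plugin; a plain build without any LoRA options would look roughly like this (the output directory is a placeholder, and this is a sketch rather than a confirmed fix):

trtllm-build --checkpoint_dir ./tllm_8b_checkpoint_1gpu_w4a8 --output_dir ./tllm_8b_engine_1gpu_w4a8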
