
chatglm2-6b int8+kv8 build failed on 0.8.0 branch #1239

Closed
2 of 4 tasks
NaNAGISaSA opened this issue Mar 6, 2024 · 3 comments
Labels
bug Something isn't working

Comments

NaNAGISaSA commented Mar 6, 2024

System Info

    - CPU architecture: x86_64
    - GPU properties
      - GPU name: NVIDIA A100
      - GPU memory size: 40G
    - Libraries
      - TensorRT-LLM branch or tag: v0.8.0
      - TensorRT-LLM commit: 5955b8afbad
      - Container used: yes, `make -C docker release_build` on v0.8.0 branch
    - NVIDIA driver version: 525.89.02
    - OS: Ubuntu 22.04

Who can help?

@Tracin

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

```shell
pip install transformers==4.33.0  # fix: https://huggingface.co/THUDM/chatglm2-6b/discussions/87

tp_size=1

python examples/chatglm/convert_checkpoint.py --model_dir ${hf_model_dir} \
    --tp_size ${tp_size} \
    --dtype float16 \
    --use_weight_only \
    --weight_only_precision int8 \
    --int8_kv_cache \
    --workers ${tp_size} \
    --output_dir ${quant_out_dir}/int8-kv8/${tp_size}-gpu/

trtllm-build --checkpoint_dir ${quant_out_dir}/int8-kv8/${tp_size}-gpu/ \
    --output_dir ${trt_out_dir}/int8-kv8/${tp_size}-gpu/ \
    --gemm_plugin float16 \
    --gpt_attention_plugin float16 \
    --context_fmha_fp32_acc enable \
    --remove_input_padding enable \
    --max_batch_size 128 \
    --max_input_len 2048 \
    --max_output_len 2048
```

Expected behavior

The build succeeds.

Actual behavior

[TensorRT-LLM] TensorRT-LLM version: 0.8.0
Inferring chatglm version from path...
Chatglm version: chatglm2
Loading checkpoint shards: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 7/7 [00:09<00:00, 1.36s/it]
Calibration: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 64/64 [00:11<00:00, 5.42it/s]
Weights loaded. Total time: 00:06:07
Total time of converting checkpoints: 00:07:18
[TensorRT-LLM] TensorRT-LLM version: 0.8.0
[03/06/2024-03:40:23] [TRT-LLM] [I] Set bert_attention_plugin to float16.
[03/06/2024-03:40:23] [TRT-LLM] [I] Set gpt_attention_plugin to float16.
[03/06/2024-03:40:23] [TRT-LLM] [I] Set gemm_plugin to float16.
[03/06/2024-03:40:23] [TRT-LLM] [I] Set lookup_plugin to None.
[03/06/2024-03:40:23] [TRT-LLM] [I] Set lora_plugin to None.
[03/06/2024-03:40:23] [TRT-LLM] [I] Set context_fmha to True.
[03/06/2024-03:40:23] [TRT-LLM] [I] Set context_fmha_fp32_acc to True.
[03/06/2024-03:40:23] [TRT-LLM] [I] Set paged_kv_cache to True.
[03/06/2024-03:40:23] [TRT-LLM] [I] Set remove_input_padding to True.
[03/06/2024-03:40:23] [TRT-LLM] [I] Set use_custom_all_reduce to True.
[03/06/2024-03:40:23] [TRT-LLM] [I] Set multi_block_mode to False.
[03/06/2024-03:40:23] [TRT-LLM] [I] Set enable_xqa to True.
[03/06/2024-03:40:23] [TRT-LLM] [I] Set attention_qk_half_accumulation to False.
[03/06/2024-03:40:23] [TRT-LLM] [I] Set tokens_per_block to 128.
[03/06/2024-03:40:23] [TRT-LLM] [I] Set use_paged_context_fmha to False.
[03/06/2024-03:40:23] [TRT-LLM] [I] Set use_context_fmha_for_generation to False.
[03/06/2024-03:40:23] [TRT-LLM] [W] remove_input_padding is enabled, while max_num_tokens is not set, setting to max_batch_size * max_input_len.
It may not be optimal to set max_num_tokens=max_batch_size * max_input_len when remove_input_padding is enabled, because the number of packed input tokens are very likely to be smaller, we strongly recommend to set max_num_tokens according to your workloads.
Traceback (most recent call last):
File "/usr/local/bin/trtllm-build", line 8, in <module>
sys.exit(main())
File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/commands/build.py", line 497, in main
parallel_build(source, build_config, args.output_dir, workers,
File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/commands/build.py", line 420, in parallel_build
passed = build_and_save(rank, rank % workers, ckpt_dir,
File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/commands/build.py", line 392, in build_and_save
engine = build(build_config,
File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/commands/build.py", line 272, in build
model.load(weights)
File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/models/modeling_utils.py", line 338, in load
raise RuntimeError(err_msg)
RuntimeError: Provided tensor names are different from those expected by the engine.
Provided but not expected tensors: {'transformer.layers.2.attention.dense.act_scale', 'transformer.layers.25.attention.quantization_scaling_factor', 'transformer.layers.22.mlp.quantization_scaling_factor', 'transformer.layers.26.mlp.fc.act_scale', 'transformer.layers.24.attention.dense.act_scale', 'transformer.layers.18.mlp.fc.act_scale', 'transformer.layers.6.mlp.fc.act_scale', 'transformer.layers.19.mlp.fc.act_scale', 'transformer.layers.0.input_layernorm.scale_to_int', 'transformer.layers.4.attention.dense.act_scale', 'transformer.layers.21.mlp.fc.act_scale', 'transformer.layers.0.mlp.proj.act_scale', 'transformer.layers.16.post_layernorm.scale_to_int', 'transformer.layers.24.attention.quantization_scaling_factor', 'transformer.layers.17.attention.quantization_scaling_factor', 'transformer.layers.10.input_layernorm.scale_to_int', 'transformer.layers.0.mlp.fc.act_scale', 'transformer.layers.19.attention.quantization_scaling_factor', 'transformer.layers.15.mlp.fc.act_scale', 'transformer.layers.6.mlp.proj.act_scale', 'transformer.layers.9.attention.qkv.act_scale', 'transformer.layers.10.attention.dense.act_scale', 'transformer.layers.23.mlp.quantization_scaling_factor', 'transformer.layers.4.mlp.quantization_scaling_factor', 'transformer.layers.17.mlp.fc.act_scale', 'transformer.layers.21.input_layernorm.scale_to_int', 'transformer.layers.21.attention.dense.act_scale', 'transformer.layers.9.mlp.proj.act_scale', 'transformer.layers.1.mlp.proj.act_scale', 'transformer.layers.13.mlp.quantization_scaling_factor', 'transformer.layers.9.attention.dense.act_scale', 'transformer.layers.12.input_layernorm.scale_to_int', 'transformer.layers.21.attention.quantization_scaling_factor', 'transformer.layers.23.attention.quantization_scaling_factor', 'transformer.layers.14.mlp.quantization_scaling_factor', 'transformer.layers.16.input_layernorm.scale_to_int', 'transformer.layers.12.attention.quantization_scaling_factor', 'transformer.layers.11.attention.qkv.act_scale', 
'transformer.layers.11.input_layernorm.scale_to_int', 'transformer.layers.26.post_layernorm.scale_to_int', 'transformer.layers.4.mlp.proj.act_scale', 'transformer.layers.5.mlp.fc.act_scale', 'transformer.layers.23.mlp.fc.act_scale', 'transformer.layers.26.attention.qkv.act_scale', 'transformer.layers.0.attention.quantization_scaling_factor', 'transformer.layers.2.attention.quantization_scaling_factor', 'transformer.layers.25.input_layernorm.scale_to_int', 'transformer.layers.19.input_layernorm.scale_to_int', 'transformer.layers.26.attention.quantization_scaling_factor', 'transformer.layers.21.mlp.proj.act_scale', 'transformer.layers.2.input_layernorm.scale_to_int', 'transformer.layers.25.mlp.proj.act_scale', 'transformer.layers.23.mlp.proj.act_scale', 'transformer.layers.15.attention.qkv.act_scale', 'transformer.layers.16.mlp.proj.act_scale', 'transformer.layers.8.mlp.proj.act_scale', 'transformer.layers.17.input_layernorm.scale_to_int', 'transformer.layers.1.attention.quantization_scaling_factor', 'transformer.layers.16.mlp.fc.act_scale', 'transformer.layers.1.attention.qkv.act_scale', 'transformer.layers.5.input_layernorm.scale_to_int', 'transformer.layers.4.mlp.fc.act_scale', 'transformer.layers.10.attention.quantization_scaling_factor', 'transformer.layers.9.mlp.quantization_scaling_factor', 'transformer.layers.22.mlp.proj.act_scale', 'transformer.layers.8.attention.dense.act_scale', 'transformer.layers.22.input_layernorm.scale_to_int', 'transformer.layers.27.attention.dense.act_scale', 'transformer.layers.27.attention.qkv.act_scale', 'transformer.layers.3.input_layernorm.scale_to_int', 'transformer.layers.13.mlp.proj.act_scale', 'transformer.layers.24.mlp.proj.act_scale', 'transformer.layers.15.mlp.proj.act_scale', 'transformer.layers.22.post_layernorm.scale_to_int', 'transformer.layers.6.input_layernorm.scale_to_int', 'transformer.layers.19.mlp.quantization_scaling_factor', 'transformer.layers.8.mlp.quantization_scaling_factor', 
'transformer.layers.13.post_layernorm.scale_to_int', 'transformer.layers.20.post_layernorm.scale_to_int', 'transformer.layers.11.attention.dense.act_scale', 'transformer.layers.1.mlp.quantization_scaling_factor', 'transformer.layers.20.attention.qkv.act_scale', 'transformer.layers.23.attention.dense.act_scale', 'transformer.layers.18.attention.dense.act_scale', 'transformer.layers.7.attention.quantization_scaling_factor', 'transformer.layers.22.attention.qkv.act_scale', 'transformer.layers.7.attention.qkv.act_scale', 'transformer.layers.26.mlp.quantization_scaling_factor', 'transformer.layers.22.mlp.fc.act_scale', 'transformer.layers.11.post_layernorm.scale_to_int', 'transformer.layers.2.post_layernorm.scale_to_int', 'transformer.layers.3.attention.qkv.act_scale', 'transformer.layers.17.post_layernorm.scale_to_int', 'transformer.layers.24.input_layernorm.scale_to_int', 'transformer.layers.10.mlp.quantization_scaling_factor', 'transformer.layers.3.post_layernorm.scale_to_int', 'transformer.layers.3.mlp.fc.act_scale', 'transformer.layers.12.mlp.proj.act_scale', 'transformer.layers.8.mlp.fc.act_scale', 'transformer.layers.4.attention.quantization_scaling_factor', 'transformer.layers.6.mlp.quantization_scaling_factor', 'transformer.layers.6.attention.quantization_scaling_factor', 'transformer.layers.27.mlp.proj.act_scale', 'transformer.layers.5.mlp.proj.act_scale', 'transformer.layers.12.mlp.fc.act_scale', 'transformer.layers.15.input_layernorm.scale_to_int', 'transformer.layers.24.post_layernorm.scale_to_int', 'transformer.layers.5.post_layernorm.scale_to_int', 'transformer.layers.23.post_layernorm.scale_to_int', 'transformer.layers.3.attention.dense.act_scale', 'transformer.layers.20.input_layernorm.scale_to_int', 'transformer.layers.7.mlp.fc.act_scale', 'transformer.layers.17.mlp.proj.act_scale', 'transformer.layers.20.attention.quantization_scaling_factor', 'transformer.layers.27.mlp.quantization_scaling_factor', 
'transformer.layers.14.attention.quantization_scaling_factor', 'transformer.layers.11.attention.quantization_scaling_factor', 'transformer.layers.23.attention.qkv.act_scale', 'transformer.layers.17.attention.qkv.act_scale', 'transformer.layers.7.post_layernorm.scale_to_int', 'transformer.layers.9.post_layernorm.scale_to_int', 'transformer.layers.9.input_layernorm.scale_to_int', 'transformer.layers.14.mlp.fc.act_scale', 'transformer.layers.14.attention.qkv.act_scale', 'transformer.layers.3.mlp.quantization_scaling_factor', 'transformer.layers.0.mlp.quantization_scaling_factor', 'transformer.layers.18.post_layernorm.scale_to_int', 'transformer.layers.10.mlp.proj.act_scale', 'transformer.layers.7.mlp.quantization_scaling_factor', 'transformer.layers.13.attention.dense.act_scale', 'transformer.layers.17.mlp.quantization_scaling_factor', 'transformer.layers.27.attention.quantization_scaling_factor', 'transformer.layers.17.attention.dense.act_scale', 'transformer.layers.15.post_layernorm.scale_to_int', 'transformer.layers.18.attention.quantization_scaling_factor', 'transformer.layers.14.attention.dense.act_scale', 'transformer.layers.19.attention.qkv.act_scale', 'transformer.layers.8.input_layernorm.scale_to_int', 'transformer.layers.24.attention.qkv.act_scale', 'transformer.layers.19.attention.dense.act_scale', 'transformer.layers.2.mlp.quantization_scaling_factor', 'transformer.layers.22.attention.dense.act_scale', 'transformer.layers.15.attention.dense.act_scale', 'transformer.layers.12.attention.qkv.act_scale', 'transformer.layers.25.mlp.fc.act_scale', 'transformer.layers.12.post_layernorm.scale_to_int', 'transformer.layers.26.attention.dense.act_scale', 'transformer.layers.13.input_layernorm.scale_to_int', 'transformer.layers.1.input_layernorm.scale_to_int', 'transformer.layers.10.mlp.fc.act_scale', 'transformer.layers.3.mlp.proj.act_scale', 'transformer.layers.11.mlp.proj.act_scale', 'transformer.layers.24.mlp.fc.act_scale', 
'transformer.layers.23.input_layernorm.scale_to_int', 'transformer.layers.12.mlp.quantization_scaling_factor', 'transformer.layers.2.mlp.fc.act_scale', 'transformer.layers.4.attention.qkv.act_scale', 'transformer.layers.6.attention.qkv.act_scale', 'transformer.layers.9.mlp.fc.act_scale', 'transformer.layers.26.input_layernorm.scale_to_int', 'transformer.layers.19.mlp.proj.act_scale', 'transformer.layers.18.mlp.quantization_scaling_factor', 'transformer.layers.25.attention.qkv.act_scale', 'transformer.layers.21.post_layernorm.scale_to_int', 'transformer.layers.2.attention.qkv.act_scale', 'transformer.layers.15.mlp.quantization_scaling_factor', 'transformer.layers.7.input_layernorm.scale_to_int', 'transformer.layers.6.post_layernorm.scale_to_int', 'transformer.layers.18.input_layernorm.scale_to_int', 'transformer.layers.13.mlp.fc.act_scale', 'transformer.layers.14.mlp.proj.act_scale', 'transformer.layers.1.attention.dense.act_scale', 'transformer.layers.13.attention.quantization_scaling_factor', 'transformer.layers.10.attention.qkv.act_scale', 'transformer.layers.1.mlp.fc.act_scale', 'transformer.layers.7.attention.dense.act_scale', 'transformer.layers.22.attention.quantization_scaling_factor', 'transformer.layers.14.post_layernorm.scale_to_int', 'transformer.layers.6.attention.dense.act_scale', 'transformer.layers.24.mlp.quantization_scaling_factor', 'transformer.layers.9.attention.quantization_scaling_factor', 'transformer.layers.2.mlp.proj.act_scale', 'transformer.layers.13.attention.qkv.act_scale', 'transformer.layers.16.attention.dense.act_scale', 'transformer.layers.5.attention.qkv.act_scale', 'transformer.layers.5.attention.quantization_scaling_factor', 'transformer.layers.11.mlp.fc.act_scale', 'transformer.layers.3.attention.quantization_scaling_factor', 'transformer.layers.27.mlp.fc.act_scale', 'transformer.layers.20.attention.dense.act_scale', 'transformer.layers.21.mlp.quantization_scaling_factor', 'transformer.layers.25.attention.dense.act_scale', 
'transformer.layers.8.post_layernorm.scale_to_int', 'transformer.layers.8.attention.qkv.act_scale', 'transformer.layers.15.attention.quantization_scaling_factor', 'transformer.layers.27.post_layernorm.scale_to_int', 'transformer.layers.7.mlp.proj.act_scale', 'transformer.layers.4.input_layernorm.scale_to_int', 'transformer.layers.0.post_layernorm.scale_to_int', 'transformer.layers.16.mlp.quantization_scaling_factor', 'transformer.layers.1.post_layernorm.scale_to_int', 'transformer.layers.20.mlp.quantization_scaling_factor', 'transformer.layers.16.attention.qkv.act_scale', 'transformer.layers.5.attention.dense.act_scale', 'transformer.layers.20.mlp.proj.act_scale', 'transformer.layers.21.attention.qkv.act_scale', 'transformer.layers.11.mlp.quantization_scaling_factor', 'transformer.layers.0.attention.dense.act_scale', 'transformer.layers.25.mlp.quantization_scaling_factor', 'transformer.layers.18.mlp.proj.act_scale', 'transformer.layers.26.mlp.proj.act_scale', 'transformer.layers.5.mlp.quantization_scaling_factor', 'transformer.layers.20.mlp.fc.act_scale', 'transformer.layers.18.attention.qkv.act_scale', 'transformer.layers.16.attention.quantization_scaling_factor', 'transformer.layers.12.attention.dense.act_scale', 'transformer.layers.25.post_layernorm.scale_to_int', 'transformer.layers.8.attention.quantization_scaling_factor', 'transformer.layers.0.attention.qkv.act_scale', 'transformer.layers.10.post_layernorm.scale_to_int', 'transformer.layers.14.input_layernorm.scale_to_int', 'transformer.layers.19.post_layernorm.scale_to_int', 'transformer.layers.4.post_layernorm.scale_to_int', 'transformer.layers.27.input_layernorm.scale_to_int'}

Additional notes

None

NaNAGISaSA added the bug label Mar 6, 2024
@Tracin (Collaborator) commented Mar 6, 2024

Could you share the config.json attached to checkpoint?

@NaNAGISaSA (Author)

Hello @Tracin, this is the config.json:

```json
{
    "architecture": "ChatGLMForCausalLM",
    "dtype": "float16",
    "logits_dtype": "float32",
    "num_hidden_layers": 28,
    "num_attention_heads": 32,
    "num_key_value_heads": 2,
    "hidden_size": 4096,
    "intermediate_size": 13696,
    "norm_epsilon": 1e-05,
    "vocab_size": 65024,
    "position_embedding_type": "rope_gptj",
    "max_position_embeddings": 32768,
    "hidden_act": "swiglu",
    "use_parallel_embedding": false,
    "embedding_sharding_dim": 0,
    "share_embedding_table": false,
    "quantization": {
        "quant_algo": "W8A16",
        "kv_cache_quant_algo": "INT8",
        "sq_use_plugin": true
    },
    "mapping": {
        "world_size": 1,
        "tp_size": 1,
        "pp_size": 1
    },
    "chatglm_version": "chatglm2",
    "add_bias_linear": false,
    "add_qkv_bias": true,
    "apply_query_key_layer_scaling": false,
    "apply_residual_connection_post_layernorm": false,
    "rmsnorm": true,
    "rope_ratio": 1.0
}
```

@Tracin (Collaborator) commented Mar 6, 2024

@NaNAGISaSA We will fix this in the next update. Until then, you can build SQ + INT8 KV cache, or weight-only with an FP16 KV cache.
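As a sketch of the second workaround (weight-only INT8 with the KV cache left in FP16), the conversion command from the reproduction above can be rerun with `--int8_kv_cache` simply omitted. The flags are taken from the reproduction; the `int8-kv16` output path is a hypothetical name, not one the repo prescribes.

```shell
# Workaround sketch (v0.8.0 flags from the reproduction above):
# weight-only INT8 quantization; omitting --int8_kv_cache leaves the
# KV cache in FP16, which avoids the mismatched act_scale/scale_to_int
# tensors that trtllm-build rejects.
python examples/chatglm/convert_checkpoint.py --model_dir ${hf_model_dir} \
    --tp_size 1 \
    --dtype float16 \
    --use_weight_only \
    --weight_only_precision int8 \
    --output_dir ${quant_out_dir}/int8-kv16/1-gpu/   # hypothetical path
```

The subsequent `trtllm-build` invocation is unchanged apart from pointing `--checkpoint_dir` at this new output directory.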
