quantize.py fails to export important data to config.json (eg rotary scaling) #1676
Comments
Could you share which model you are using? |
thank you, https://huggingface.co/meta-llama/Llama-2-70b-hf , finetuned (w/o any change in architecture) and exported in bfloat16 |
It looks like the rope_scaling of llama-2-70b-hf is null: {
"architectures": [
"LlamaForCausalLM"
],
"bos_token_id": 1,
"eos_token_id": 2,
"hidden_act": "silu",
"hidden_size": 8192,
"initializer_range": 0.02,
"intermediate_size": 28672,
"max_position_embeddings": 2048,
"model_type": "llama",
"num_attention_heads": 64,
"num_hidden_layers": 80,
"num_key_value_heads": 8,
"pad_token_id": 0,
"pretraining_tp": 1,
"rms_norm_eps": 1e-05,
"rope_scaling": null,
"tie_word_embeddings": false,
"torch_dtype": "float16",
"transformers_version": "4.31.0",
"use_cache": true,
"vocab_size": 32000
} |
Apologies for not mentioning this explicitly earlier. We fine-tuned the model with a change to rope scaling. Below is the config.json of our fine-tuned model, saved in the Hugging Face format (it is in the $MODEL_DIR directory referred to above, see the
part). {
"_name_or_path": "OUR_PATH_HERE",
"architectures": [
"LlamaForCausalLM"
],
"attention_bias": false,
"attention_dropout": 0.0,
"bos_token_id": 1,
"eos_token_id": 2,
"hidden_act": "silu",
"hidden_size": 8192,
"initializer_range": 0.02,
"intermediate_size": 28672,
"max_position_embeddings": 4096,
"model_type": "llama",
"num_attention_heads": 64,
"num_hidden_layers": 80,
"num_key_value_heads": 8,
"pretraining_tp": 1,
"rms_norm_eps": 1e-05,
"rope_scaling": {
"factor": 4.0,
"type": "linear"
},
"rope_theta": 10000.0,
"tie_word_embeddings": false,
"torch_dtype": "bfloat16",
"transformers_version": "4.39.1",
"use_cache": false,
"vocab_size": 32000
} |
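The difference between the two configs above can be checked programmatically. The following is a minimal sketch, assuming only the standard Hugging Face `rope_scaling` layout shown in the thread; the helper name `describe_rope_scaling` is hypothetical and the two JSON strings are trimmed-down stand-ins for the full files.

```python
import json

def describe_rope_scaling(config_text: str) -> str:
    """Summarize the rope_scaling entry of a Hugging Face config.json string."""
    config = json.loads(config_text)
    scaling = config.get("rope_scaling")  # None when the key is absent or null
    if scaling is None:
        return "no rope scaling"
    return f"{scaling['type']} rope scaling, factor {scaling['factor']}"

# Trimmed-down versions of the base and fine-tuned configs quoted above.
base = '{"max_position_embeddings": 2048, "rope_scaling": null}'
tuned = '{"max_position_embeddings": 4096, "rope_scaling": {"factor": 4.0, "type": "linear"}}'

print(describe_rope_scaling(base))   # no rope scaling
print(describe_rope_scaling(tuned))  # linear rope scaling, factor 4.0
```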
Thank you for the reply. I try to change the |
Thank you for your reply. Please give me a few days; I will prepare (with simple instructions on how to obtain it) a model with rope_scaling in |
The deepseek-coder 33b model uses rope scaling and the llama architecture, and it has the same problem described here. Maybe you can try this model directly: https://huggingface.co/deepseek-ai/deepseek-coder-33b-instruct/blob/main/config.json |
Thank you for sharing. I will give it a try. |
Unfortunately, TRT-LLM does not support deepseek yet, and hence I cannot reproduce the issue on the checkpoint. |
You may use the Llama workflow for Deepseek models. It works for int8 weight-only quant (engine build + inference), which is provided by However, the FP8 quant provided by |
Hi @byshiue. |
Thank you for sharing. We can reproduce the issue now and are investigating it. |
Thanks for the reply. My model was obtained by fine-tuning llamav3-8b-instruct. During fine-tuning, rope_scaling was added, which made the fp8 conversion fail. Is there any way to solve this problem? Thanks @byshiue |
We have added it here https://github.com/NVIDIA/TensorRT-LLM/blob/main/tensorrt_llm/quantization/quantize_by_modelopt.py#L474-L478. Do you still encounter the same issue on the latest main branch? If so, could you try printing some debug messages there to make sure we add rope_scaling, and double-check the |
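One way to double-check an exported config without touching the library code is to search the generated JSON for any scaling-related key. This is a minimal sketch: `find_keys` is a hypothetical helper, and the `exported` dictionary is an invented stand-in for a config.json that lacks the entry, not the actual file from this thread.

```python
import json

def find_keys(obj, needle, path=""):
    """Yield (path, value) for every dict key whose name contains `needle`."""
    if isinstance(obj, dict):
        for key, value in obj.items():
            here = f"{path}.{key}" if path else key
            if needle in key:
                yield here, value
            # Keep descending so nested matches are reported as well.
            yield from find_keys(value, needle, here)
    elif isinstance(obj, list):
        for i, item in enumerate(obj):
            yield from find_keys(item, needle, f"{path}[{i}]")

# Hypothetical exported config with no rope/rotary scaling entry.
exported = json.loads('{"architecture": "LlamaForCausalLM", "quantization": {"quant_algo": "FP8"}}')
hits = list(find_keys(exported, "scaling"))
print(hits or "no scaling-related key found")
```

In practice the JSON would be loaded from the config.json in the quantization output directory; an empty result suggests the scaling entry was not written.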
@byshiue First, thanks for your reply. Yes, I still encounter this problem with the latest code. The tensorrtllm version is a96ccca convert code: this is huggingface model config |
@fan-niu you should look for the |
@wxsms thanks, but I can't get this package: do I need to install this package? On my side, I manually built the image based on the latest tensorrtllm_backend (commit 6053a5d) code, and then performed the fp8 conversion in that image. |
it was renamed to |
@wxsms Cool, thanks. After also changing this code, the conversion works well; thank you so much. But when I continue and build the TensorRT engine, I get this error: `[TensorRT-LLM] TensorRT-LLM version: 0.12.0.dev2024070900 [07/17/2024-09:59:20] [TRT-LLM] [W] Specifying a Invoked with: <tensorrt_bindings.tensorrt.IOptimizationProfile object at 0x7f734cd733b0>, 'cache_indirection', [1, 1, 1], [32, 1, 16384.0], [64, 1, 32768.0] this is convert engine script: @byshiue Please help look into this problem, thanks |
@kaiyux @byshiue Thank you for mentioning this version of tensorrtllm, but when I converted the engine based on this version of code, I still encountered the same error as before. Can you give a solution to this problem? Thanks tensorrtllm version: convert script: this is huggingface model config convert engine error log: [07/18/2024-08:30:28] [TRT-LLM] [W] Specifying a Invoked with: <tensorrt_bindings.tensorrt.IOptimizationProfile object at 0x7f9c42f420f0>, 'cache_indirection', [1, 1, 1], [16, 1, 16384.0], [32, 1, 32768.0] |
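The shapes in the logged call contain floats (16384.0, 32768.0) where integer dimensions would be expected. One way such values can arise in Python is multiplying an integer length by a float scaling factor; that this is the cause here is an assumption, not something the thread confirms. The sketch below only illustrates the effect and a defensive conversion; `to_int_shape` is a hypothetical helper, not part of TensorRT-LLM, and the numbers are taken from the fine-tuned config quoted earlier purely for illustration.

```python
def to_int_shape(shape):
    """Return the shape with every dimension converted to int, rejecting non-integral values."""
    result = []
    for dim in shape:
        if float(dim) != int(dim):
            raise ValueError(f"non-integral dimension: {dim!r}")
        result.append(int(dim))
    return result

max_position_embeddings = 4096  # illustrative value
factor = 4.0                    # illustrative value (a float, as in rope_scaling)

scaled = max_position_embeddings * factor
print(type(scaled).__name__, scaled)   # float 16384.0
print(to_int_shape([32, 1, scaled]))   # [32, 1, 16384]
```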
@kaiyux @byshiue @janpetrov I also added the hotfix code after /usr/local/lib/python3.10/dist-packages/tensorrt_llm/builder.py:276, and then I could successfully build the engine, but why is the converted engine's output so poor? I also deployed the model with vllm before conversion, and the vllm output was completely normal. hot fix code : tensorrtllm engine output: vllm output: |
System Info
4x NVIDIA H100, TensorRT-LLM backend 0.9.0
Who can help?
@Tracin
Information
Tasks
examples
folder (such as GLUE/SQuAD, ...)
Reproduction
(1) Have a HF transformers model with linear rope scaling.
(2) Edit /usr/local/lib/python3.10/dist-packages/ammo/torch/export/layer_utils.py, is_linear to (adding the
and ("Rotary"...
part) so that the rope scaling model is exported (without crashing on an error that weights cannot be exported from the Rotary scaling layer, see this issue)
(3) then run, as recommended here
Expected behavior
quantize.py should generate a detailed config.json file in the output dir. The subsequent run of
should build a well-working engine.
actual behavior
The config.json generated by quantize.py contains just the following (note, e.g., that the rope scaling is missing). The engine built by trtllm-build generates nonsense.
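Why a missing scaling entry leads to nonsense output can be seen from how linear rope scaling works: each position index is divided by the factor before the rotation angles are computed. The sketch below assumes the values from the fine-tuned config in this thread (factor 4.0, rope_theta 10000.0, head dimension 8192 / 64 = 128); `rope_angles` is a hypothetical helper for illustration, not a library function.

```python
def rope_angles(position, head_dim=128, theta=10000.0, factor=1.0):
    """Rotation angles for one position; linear scaling divides the position by `factor`."""
    scaled_position = position / factor
    return [scaled_position * theta ** (-2 * i / head_dim) for i in range(head_dim // 2)]

with_scaling = rope_angles(1000, factor=4.0)  # what the fine-tuned model was trained with
without_scaling = rope_angles(1000)           # what an engine uses if the factor is dropped

print(with_scaling[0], without_scaling[0])    # 250.0 1000.0
```

Every angle is off by the scaling factor when the entry is dropped, so the attention layers see positional encodings the fine-tuned weights were never trained on.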
additional notes
When I edit the config.json to have the following contents and then re-run trtllm-build, the resulting engine starts to generate fine text.
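The manual edit described above could be scripted roughly as follows. This is only a sketch under stated assumptions: the edited file contents are not shown in this thread, so the target key name `rotary_scaling` is an assumption to be checked against a config.json produced by convert_checkpoint.py, and `copy_rope_scaling` is a hypothetical helper. The demo runs on temporary files rather than a real checkpoint directory.

```python
import json
import os
import tempfile

def copy_rope_scaling(hf_config_path, exported_config_path, target_key="rotary_scaling"):
    """Copy `rope_scaling` from a Hugging Face config.json into an exported config.json.

    `target_key` is an assumed name; verify it against a known-good exported config.
    """
    with open(hf_config_path) as f:
        hf_config = json.load(f)
    with open(exported_config_path) as f:
        exported = json.load(f)

    scaling = hf_config.get("rope_scaling")
    # Only write when the source has scaling and the export does not already carry it.
    if scaling is not None and target_key not in exported:
        exported[target_key] = scaling
        with open(exported_config_path, "w") as f:
            json.dump(exported, f, indent=4)
    return exported

# Demo on temporary stand-in files.
with tempfile.TemporaryDirectory() as tmp:
    hf_path = os.path.join(tmp, "hf_config.json")
    out_path = os.path.join(tmp, "config.json")
    with open(hf_path, "w") as f:
        json.dump({"rope_scaling": {"factor": 4.0, "type": "linear"}}, f)
    with open(out_path, "w") as f:
        json.dump({"architecture": "LlamaForCausalLM"}, f)
    patched = copy_rope_scaling(hf_path, out_path)
    print(patched)
```

After patching, trtllm-build would need to be re-run on the output directory, as described in the note above.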
Please note that when the input to trtllm-build is generated by examples/llama/convert_checkpoint.py (and not by examples/quantization/quantize.py), the config.json looks as follows. This is for the same model but without quantization. Note the much richer data, including rotary scaling.