TypeError: ModelRunnerCpp.from_dir() got an unexpected keyword argument 'gpu_weights_percent' #1664
Comments
It looks like this is caused by a mismatch of the TRT-LLM version between the example scripts and the TRT-LLM core.
Please try installing the latest main branch.
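A quick way to confirm the mismatch (a minimal sketch; the install command follows the TensorRT-LLM docs, and the /TensorRT-LLM path is taken from the traceback below):

# Version of the installed TRT-LLM wheel
python3 -c "import tensorrt_llm; print(tensorrt_llm.__version__)"

# Version/commit of the example scripts you are running
cd /TensorRT-LLM && git describe --tags

# Install the latest wheel, as suggested above
pip3 install tensorrt_llm -U --pre --extra-index-url https://pypi.nvidia.com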
I checked the ModelRunnerCpp.py file; gpu_weights_percent is in the function signature with a default of 1. Anyway, that is not the problem I want to solve right now. I tried to deploy the engine built by TRT-LLM v0.9.0 using Triton Server, but it always fails. Could you please help me solve this? Below is the error I got. I just followed the QuickStart guide, but whatever version of Triton Server I used, the problem remained.

docker run -it --rm --gpus all --network host --shm-size=1g

# Log in to huggingface-cli to get the tokenizer
huggingface-cli login --token *****

# Install python dependencies
pip install sentencepiece protobuf

# Launch server
python /opt/scripts/launch_triton_server.py --model_repo /all_models/inflight_batcher_llm --world_size 2

E0530 02:04:50.281196 2894 model_lifecycle.cc:638] failed to load 'tensorrt_llm' version 1: Internal: unexpected error when creating modelInstanceState: [TensorRT-LLM][ERROR] Assertion failed: Failed to deserialize cuda engine (/tmp/tritonbuild/tensorrtllm/tensorrt_llm/cpp/tensorrt_llm/runtime/tllmRuntime.cpp:72)
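The "Failed to deserialize cuda engine" assertion typically means the engine was built by a different TensorRT-LLM/TensorRT build than the one bundled in the Triton image. A minimal check, assuming the TRT-LLM version reported inside the serving container is what matters (which Triton tag bundles 0.9.0 should be confirmed in the tensorrtllm_backend release notes):

# Run inside the Triton container that will load the engine
python3 -c "import tensorrt_llm; print(tensorrt_llm.__version__)"
# If this does not print 0.9.0, either rebuild the engine with the printed version
# or switch to the Triton image whose backend bundles TRT-LLM 0.9.0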
I tried converting the checkpoint and building the engine with TensorRT-LLM v0.8.0 and deploying it with Triton 24.02, but below is what I got:

root@ccnl06:/cognitive_comp/chenyun/tensorrtllm_backend# docker run -it --rm --gpus all --network host --shm-size=40g -v
Here it says that the latest examples/run.py (#1688) still has "gpu_weights_percent", so it will cause an error when running an engine built by TRT-LLM 0.9.0.
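One workaround, assuming the installed wheel is 0.9.0 and /TensorRT-LLM is a git checkout, is to run the examples from the matching release tag instead of main, so run.py does not pass arguments that the 0.9.0 runner does not accept:

# Confirm the installed wheel version
python3 -c "import tensorrt_llm; print(tensorrt_llm.__version__)"
# Check the repo (and therefore examples/run.py) out at the matching tag
cd /TensorRT-LLM
git fetch --tags
git checkout v0.9.0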
Thanks for the instructions. I wonder if you have tried deploying the engine using the Triton server? Do you have any suggestions for that?
Yes, I can deploy the engine successfully with Triton. You need to modify the config.pbtxt files as in the instructions here: https://developer.nvidia.com/blog/turbocharging-meta-llama-3-performance-with-nvidia-tensorrt-llm-and-nvidia-triton-inference-server/
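For reference, that post and the tensorrtllm_backend README fill the config.pbtxt templates with tools/fill_template.py. A shortened sketch, run from a tensorrtllm_backend checkout, with placeholder paths and an illustrative subset of parameters (take the authoritative list from the linked post):

python3 tools/fill_template.py -i all_models/inflight_batcher_llm/preprocessing/config.pbtxt tokenizer_dir:${HF_MODEL},triton_max_batch_size:64,preprocessing_instance_count:1
python3 tools/fill_template.py -i all_models/inflight_batcher_llm/postprocessing/config.pbtxt tokenizer_dir:${HF_MODEL},triton_max_batch_size:64,postprocessing_instance_count:1
python3 tools/fill_template.py -i all_models/inflight_batcher_llm/ensemble/config.pbtxt triton_max_batch_size:64
python3 tools/fill_template.py -i all_models/inflight_batcher_llm/tensorrt_llm/config.pbtxt triton_max_batch_size:64,decoupled_mode:False,engine_dir:${ENGINE_PATH},batching_strategy:inflight_fused_batching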
Thanks, but it is a bit different for LLaMA2-13B, which has to use tensor parallelism. Some parameters might differ from the instructions, and I failed again. Anyway, thank you very much for this advice.
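For what it's worth, the TP=2 case mainly differs in that the checkpoint is converted with --tp_size 2, the engine directory then holds one rank engine per GPU, and the server is launched with --world_size 2 to match. A sketch under those assumptions (paths are placeholders):

# Convert and build with tensor parallelism across 2 GPUs
python3 examples/llama/convert_checkpoint.py --model_dir ${HF_MODEL} --output_dir ${CKPT_DIR} --dtype float16 --tp_size 2
trtllm-build --checkpoint_dir ${CKPT_DIR} --output_dir ${ENGINE_PATH} --gemm_plugin float16

# Launch Triton with one MPI rank per GPU, matching the engine's tp_size
python3 scripts/launch_triton_server.py --world_size 2 --model_repo /all_models/inflight_batcher_llm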
This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 15 days.
Hi @YunChen1227, do you still have any further issue or question now? If not, we'll close it soon.
System Info
Using an RTX 3090 and the Docker image produced by following the QuickStart doc.
Who can help?
No response
Information
Tasks
An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
Reproduction
After building the LLaMA2 engine:
python3 ../run.py --max_output_len=40 --tokenizer_dir /models/0520/ckpt/0/global_step3900-hf/ --engine_dir /models/tmp/llama/7B/trt_engines/fp16/2-gpu/ --input_text ...
Expected behavior
Expected to get an answer from the model.
actual behavior
hwloc/linux: Ignoring PCI device with non-16bit domain.
Pass --enable-32bits-pci-domain to configure to support such devices
(warning: it would break the library ABI, don't enable unless really needed).
hwloc/linux: Ignoring PCI device with non-16bit domain.
Pass --enable-32bits-pci-domain to configure to support such devices
(warning: it would break the library ABI, don't enable unless really needed).
[TensorRT-LLM] TensorRT-LLM version: 0.9.0
Traceback (most recent call last):
File "/TensorRT-LLM/examples/llama/../run.py", line 571, in
main(args)
File "/TensorRT-LLM/examples/llama/../run.py", line 420, in main
runner = runner_cls.from_dir(**runner_kwargs)
TypeError: ModelRunnerCpp.from_dir() got an unexpected keyword argument 'gpu_weights_percent'
additional notes
The model I converted does not have many differences compared to the original LLaMA2 13B.
Every step before running, i.e. convert_checkpoint.py and trtllm-build, worked perfectly.
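Since the convert and build steps succeeded, the failure is on the runtime side. One more check worth trying, assuming the build config in the engine directory records a "version" field (it does in recent releases), is to compare it with the runtime that loads the engine:

# Version recorded at build time (engine dir taken from the command above)
grep -m1 '"version"' /models/tmp/llama/7B/trt_engines/fp16/2-gpu/config.json
# Version of the runtime that loads it
python3 -c "import tensorrt_llm; print(tensorrt_llm.__version__)"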