TypeError: ModelRunnerCpp.from_dir() got an unexpected keyword argument 'gpu_weights_percent' #1664

Closed
YunChen1227 opened this issue on May 24, 2024 · 11 comments
Labels: not a bug (Some known limitation, but not a bug) · stale · triaged (Issue has been triaged by maintainers)

@YunChen1227

System Info

Using an RTX 3090 and the Docker image produced by following the Quick Start documentation.

Who can help?

No response

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

After building the Llama 2 engine:

python3 ../run.py --max_output_len=40 --tokenizer_dir /models/0520/ckpt/0/global_step3900-hf/ --engine_dir /models/tmp/llama/7B/trt_engines/fp16/2-gpu/ --input_text ...

Expected behavior

I expect to get an answer from the model.

Actual behavior

hwloc/linux: Ignoring PCI device with non-16bit domain.
Pass --enable-32bits-pci-domain to configure to support such devices
(warning: it would break the library ABI, don't enable unless really needed).
hwloc/linux: Ignoring PCI device with non-16bit domain.
Pass --enable-32bits-pci-domain to configure to support such devices
(warning: it would break the library ABI, don't enable unless really needed).
[TensorRT-LLM] TensorRT-LLM version: 0.9.0
Traceback (most recent call last):
File "/TensorRT-LLM/examples/llama/../run.py", line 571, in
main(args)
File "/TensorRT-LLM/examples/llama/../run.py", line 420, in main
runner = runner_cls.from_dir(**runner_kwargs)
TypeError: ModelRunnerCpp.from_dir() got an unexpected keyword argument 'gpu_weights_percent'

additional notes

The model I converted does not differ much from the original Llama 2 13B.
Every step before running (convert_checkpoint.py and trtllm-build) worked perfectly.

YunChen1227 added the bug label on May 24, 2024
@byshiue (Collaborator) commented on May 29, 2024

This looks like it is caused by a mismatch between the version of the example scripts and the TRT-LLM core.

gpu_weights_percent was not added in TRT-LLM 0.9.0, so you are likely running newer example code against a TRT-LLM v0.9.0 installation.

Please try installing the latest main branch.
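
For reference, a minimal sketch (assuming the tensorrt_llm wheel is importable in the environment where run.py fails) to confirm what the installed runtime actually supports before touching the example code:

```python
# Sanity check: print the installed TRT-LLM version and test whether its
# ModelRunnerCpp.from_dir signature accepts gpu_weights_percent at all.
import inspect

import tensorrt_llm
from tensorrt_llm.runtime import ModelRunnerCpp

print("installed tensorrt_llm:", tensorrt_llm.__version__)
print("from_dir accepts gpu_weights_percent:",
      "gpu_weights_percent" in inspect.signature(ModelRunnerCpp.from_dir).parameters)
```

If the second line prints False, the examples checkout is newer than the installed package and the two should be aligned (e.g. by installing the matching main branch).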

byshiue added the triaged and not a bug labels and removed the bug label on May 29, 2024
byshiue self-assigned this on May 29, 2024
@YunChen1227 (Author) commented:

I checked the ModelRunnerCpp.py file; gpu_weights_percent is in the function signature, with a default of 1.

Anyway, this is not the problem I want to solve right now. I tried to deploy the engine built by TRT-LLM v0.9.0 using Triton Server, but it always fails. Could you please help me solve this? Below is the error I got. I followed the Quick Start Guide, but whichever version of Triton Server I used, the problem remained.

docker run -it --rm --gpus all --network host --shm-size=1g \
    -v $(pwd)/all_models:/all_models \
    -v $(pwd)/scripts:/opt/scripts \
    nvcr.io/nvidia/tritonserver:24.02-trtllm-python-py3

Log in to huggingface-cli to get tokenizer

huggingface-cli login --token *****

Install python dependencies

pip install sentencepiece protobuf

Launch Server

python /opt/scripts/launch_triton_server.py --model_repo /all_models/inflight_batcher_llm --world_size 2

E0530 02:04:50.281196 2894 model_lifecycle.cc:638] failed to load 'tensorrt_llm' version 1: Internal: unexpected error when creating modelInstanceState: [TensorRT-LLM][ERROR] Assertion failed: Failed to deserialize cuda engine (/tmp/tritonbuild/tensorrtllm/tensorrt_llm/cpp/tensorrt_llm/runtime/tllmRuntime.cpp:72)
1 0x7fd9982614ba tensorrt_llm::common::throwRuntimeError(char const*, int, std::string const&) + 102
2 0x7fd9982850a0 /opt/tritonserver/backends/tensorrtllm/libtensorrt_llm.so(+0x79c0a0) [0x7fd9982850a0]
3 0x7fd99a0cb572 tensorrt_llm::runtime::GptSession::GptSession(tensorrt_llm::runtime::GptSession::Config const&, tensorrt_llm::runtime::GptModelConfig const&, tensorrt_llm::runtime::WorldConfig const&, void const*, unsigned long, std::shared_ptr<nvinfer1::ILogger>) + 946
4 0x7fd99a15731d tensorrt_llm::batch_manager::TrtGptModelV1::TrtGptModelV1(int, std::shared_ptr<nvinfer1::ILogger>, tensorrt_llm::runtime::GptModelConfig, tensorrt_llm::runtime::WorldConfig, std::vector<unsigned char, std::allocator > const&, tensorrt_llm::batch_manager::batch_scheduler::SchedulerPolicy, tensorrt_llm::batch_manager::TrtGptModelOptionalParams const&) + 701
5 0x7fd99a125dd4 tensorrt_llm::batch_manager::TrtGptModelFactory::create(std::filesystem::path const&, tensorrt_llm::batch_manager::TrtGptModelType, int, tensorrt_llm::batch_manager::batch_scheduler::SchedulerPolicy, tensorrt_llm::batch_manager::TrtGptModelOptionalParams const&) + 2804
6 0x7fd99a11ce00 tensorrt_llm::batch_manager::GptManager::GptManager(std::filesystem::path const&, tensorrt_llm::batch_manager::TrtGptModelType, int, tensorrt_llm::batch_manager::batch_scheduler::SchedulerPolicy, std::function<std::list<std::shared_ptr<tensorrt_llm::batch_manager::InferenceRequest>, std::allocator<std::shared_ptr<tensorrt_llm::batch_manager::InferenceRequest> > > (int)>, std::function<void (unsigned long, std::list<tensorrt_llm::batch_manager::NamedTensor, std::allocator<tensorrt_llm::batch_manager::NamedTensor> > const&, bool, std::string const&)>, std::function<std::unordered_set<unsigned long, std::hash, std::equal_to, std::allocator > ()>, std::function<void (std::string const&)>, tensorrt_llm::batch_manager::TrtGptModelOptionalParams const&, std::optional, std::optional, bool) + 336
7 0x7fdb1412bb62 /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0x18b62) [0x7fdb1412bb62]
8 0x7fdb1412c3f2 /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0x193f2) [0x7fdb1412c3f2]
9 0x7fdb1411efd5 TRITONBACKEND_ModelInstanceInitialize + 101
10 0x7fdb2e732296 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x1ad296) [0x7fdb2e732296]
11 0x7fdb2e7334d6 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x1ae4d6) [0x7fdb2e7334d6]
12 0x7fdb2e716045 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x191045) [0x7fdb2e716045]
13 0x7fdb2e716686 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x191686) [0x7fdb2e716686]
14 0x7fdb2e722efd /opt/tritonserver/bin/../lib/libtritonserver.so(+0x19defd) [0x7fdb2e722efd]
15 0x7fdb2dd86ee8 /usr/lib/x86_64-linux-gnu/libc.so.6(+0x99ee8) [0x7fdb2dd86ee8]
16 0x7fdb2e70cf0b /opt/tritonserver/bin/../lib/libtritonserver.so(+0x187f0b) [0x7fdb2e70cf0b]
17 0x7fdb2e71dc65 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x198c65) [0x7fdb2e71dc65]
18 0x7fdb2e72231e /opt/tritonserver/bin/../lib/libtritonserver.so(+0x19d31e) [0x7fdb2e72231e]
19 0x7fdb2e8140c8 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x28f0c8) [0x7fdb2e8140c8]
20 0x7fdb2e8179ac /opt/tritonserver/bin/../lib/libtritonserver.so(+0x2929ac) [0x7fdb2e8179ac]
21 0x7fdb2e96b6c2 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x3e66c2) [0x7fdb2e96b6c2]
22 0x7fdb2dff2253 /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xdc253) [0x7fdb2dff2253]
23 0x7fdb2dd81ac3 /usr/lib/x86_64-linux-gnu/libc.so.6(+0x94ac3) [0x7fdb2dd81ac3]
24 0x7fdb2de12a04 clone + 68
I0530 02:04:50.281238 2894 model_lifecycle.cc:773] failed to load 'tensorrt_llm'
I0530 02:04:50.281621 2894 server.cc:607]

@byshiue (Collaborator) commented on May 30, 2024

nvcr.io/nvidia/tritonserver:24.02-trtllm-python-py3 only ships TRT-LLM v0.8.0. You cannot use it to serve an engine built by TRT-LLM v0.9.0.
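
If in doubt, here is a minimal sketch (run inside the container) that compares the bundled wheel with the engine. The engine config path and its "version" key are assumptions, based on the /all_models mount and the "Engine version ... found in the config file" log lines below:

```python
# Compare the TRT-LLM wheel shipped in the Triton container with the version
# recorded by the engine; the backend can only load engines built by a matching release.
import json
from importlib.metadata import version

print("tensorrt_llm wheel in container:", version("tensorrt_llm"))

# Hypothetical engine location under the mounted model repository; adjust to
# wherever the engine directory configured in config.pbtxt actually points.
with open("/all_models/inflight_batcher_llm/tensorrt_llm/1/config.json") as f:
    print("engine built with TRT-LLM:", json.load(f).get("version", "unknown"))
```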

@YunChen1227 (Author) commented:

I tried converting the checkpoint and building the engine with TensorRT-LLM v0.8.0 and deploying it with the 24.02 container, but below is what I got:

###

root@ccnl06:/cognitive_comp/chenyun/tensorrtllm_backend# docker run -it --rm --gpus all --network host --shm-size=40g -v $(pwd)/all_models:/all_models -v $(pwd)/scripts:/opt/scripts -v /cognitive_comp/chenyun/models:/cognitive_comp/chenyun/models nvcr.io/nvidia/tritonserver:24.02-trtllm-python-py3

=============================
== Triton Inference Server ==

NVIDIA Release 24.02 (build 83572707)
Triton Server Version 2.43.0

Copyright (c) 2018-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.

Various files include modifications (c) NVIDIA CORPORATION & AFFILIATES. All rights reserved.

This container image and its contents are governed by the NVIDIA Deep Learning Container License.
By pulling and using the container, you accept the terms and conditions of this license:
https://developer.nvidia.com/ngc/nvidia-deep-learning-container-license

root@ccnl06:/opt/tritonserver# pip install sentencepiece protobuf
Collecting sentencepiece
Downloading sentencepiece-0.2.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (7.7 kB)
Collecting protobuf
Downloading protobuf-5.27.0-cp38-abi3-manylinux2014_x86_64.whl.metadata (592 bytes)
Downloading sentencepiece-0.2.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.3 MB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1.3/1.3 MB 2.9 MB/s eta 0:00:00
Downloading protobuf-5.27.0-cp38-abi3-manylinux2014_x86_64.whl (309 kB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 309.2/309.2 kB 39.0 MB/s eta 0:00:00
Installing collected packages: sentencepiece, protobuf
Successfully installed protobuf-5.27.0 sentencepiece-0.2.0
WARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager. It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv
root@ccnl06:/opt/tritonserver#
root@ccnl06:/opt/tritonserver#
root@ccnl06:/opt/tritonserver#
root@ccnl06:/opt/tritonserver#
root@ccnl06:/opt/tritonserver# python3 /opt/scripts/launch_triton_server.py --model_repo /all_models/inflight_batcher_llm --world_size 2
root@ccnl06:/opt/tritonserver# I0530 07:21:33.732565 122 pinned_memory_manager.cc:275] Pinned memory pool is created at '0x7f5446000000' with size 268435456
I0530 07:21:33.732973 121 pinned_memory_manager.cc:275] Pinned memory pool is created at '0x7f7270000000' with size 268435456
I0530 07:21:33.740504 122 cuda_memory_manager.cc:107] CUDA memory pool is created on device 0 with size 67108864
I0530 07:21:33.740514 122 cuda_memory_manager.cc:107] CUDA memory pool is created on device 1 with size 67108864
I0530 07:21:33.740517 122 cuda_memory_manager.cc:107] CUDA memory pool is created on device 2 with size 67108864
I0530 07:21:33.740520 122 cuda_memory_manager.cc:107] CUDA memory pool is created on device 3 with size 67108864
I0530 07:21:33.740523 122 cuda_memory_manager.cc:107] CUDA memory pool is created on device 4 with size 67108864
I0530 07:21:33.740525 122 cuda_memory_manager.cc:107] CUDA memory pool is created on device 5 with size 67108864
I0530 07:21:33.740527 122 cuda_memory_manager.cc:107] CUDA memory pool is created on device 6 with size 67108864
I0530 07:21:33.740530 122 cuda_memory_manager.cc:107] CUDA memory pool is created on device 7 with size 67108864
I0530 07:21:33.740728 121 cuda_memory_manager.cc:107] CUDA memory pool is created on device 0 with size 67108864
I0530 07:21:33.740738 121 cuda_memory_manager.cc:107] CUDA memory pool is created on device 1 with size 67108864
I0530 07:21:33.740740 121 cuda_memory_manager.cc:107] CUDA memory pool is created on device 2 with size 67108864
I0530 07:21:33.740743 121 cuda_memory_manager.cc:107] CUDA memory pool is created on device 3 with size 67108864
I0530 07:21:33.740745 121 cuda_memory_manager.cc:107] CUDA memory pool is created on device 4 with size 67108864
I0530 07:21:33.740747 121 cuda_memory_manager.cc:107] CUDA memory pool is created on device 5 with size 67108864
I0530 07:21:33.740750 121 cuda_memory_manager.cc:107] CUDA memory pool is created on device 6 with size 67108864
I0530 07:21:33.740753 121 cuda_memory_manager.cc:107] CUDA memory pool is created on device 7 with size 67108864
W0530 07:21:34.815881 122 server.cc:251] failed to enable peer access for some device pairs
W0530 07:21:34.818235 121 server.cc:251] failed to enable peer access for some device pairs
I0530 07:21:34.820030 122 model_lifecycle.cc:469] loading: tensorrt_llm:1
I0530 07:21:34.830899 121 model_lifecycle.cc:469] loading: postprocessing:1
I0530 07:21:34.831271 121 model_lifecycle.cc:469] loading: preprocessing:1
I0530 07:21:34.831631 121 model_lifecycle.cc:469] loading: tensorrt_llm:1
I0530 07:21:34.831972 121 model_lifecycle.cc:469] loading: tensorrt_llm_bls:1
I0530 07:21:34.876348 121 python_be.cc:2381] TRITONBACKEND_ModelInstanceInitialize: postprocessing_0_0 (GPU device 1)
I0530 07:21:34.876418 121 python_be.cc:2381] TRITONBACKEND_ModelInstanceInitialize: postprocessing_0_0 (GPU device 0)
I0530 07:21:34.876475 121 python_be.cc:2381] TRITONBACKEND_ModelInstanceInitialize: preprocessing_0_0 (GPU device 2)
I0530 07:21:34.876535 121 python_be.cc:2381] TRITONBACKEND_ModelInstanceInitialize: preprocessing_0_0 (GPU device 3)
I0530 07:21:34.876708 121 python_be.cc:2381] TRITONBACKEND_ModelInstanceInitialize: preprocessing_0_0 (GPU device 4)
I0530 07:21:34.876744 121 python_be.cc:2381] TRITONBACKEND_ModelInstanceInitialize: preprocessing_0_0 (GPU device 1)
I0530 07:21:34.876990 121 python_be.cc:2381] TRITONBACKEND_ModelInstanceInitialize: preprocessing_0_0 (GPU device 6)
I0530 07:21:34.877019 121 python_be.cc:2381] TRITONBACKEND_ModelInstanceInitialize: preprocessing_0_0 (GPU device 0)
I0530 07:21:34.877031 121 python_be.cc:2381] TRITONBACKEND_ModelInstanceInitialize: preprocessing_0_0 (GPU device 5)
I0530 07:21:34.877144 121 python_be.cc:2381] TRITONBACKEND_ModelInstanceInitialize: postprocessing_0_0 (GPU device 2)
I0530 07:21:34.877180 121 python_be.cc:2381] TRITONBACKEND_ModelInstanceInitialize: preprocessing_0_0 (GPU device 7)
I0530 07:21:34.877272 121 python_be.cc:2381] TRITONBACKEND_ModelInstanceInitialize: postprocessing_0_0 (GPU device 3)
I0530 07:21:34.877419 121 python_be.cc:2381] TRITONBACKEND_ModelInstanceInitialize: postprocessing_0_0 (GPU device 4)
I0530 07:21:34.877577 121 python_be.cc:2381] TRITONBACKEND_ModelInstanceInitialize: postprocessing_0_0 (GPU device 5)
I0530 07:21:34.877706 121 python_be.cc:2381] TRITONBACKEND_ModelInstanceInitialize: postprocessing_0_0 (GPU device 6)
I0530 07:21:34.877731 121 python_be.cc:2381] TRITONBACKEND_ModelInstanceInitialize: postprocessing_0_0 (GPU device 7)
[TensorRT-LLM][WARNING] gpu_device_ids is not specified, will be automatically set
[TensorRT-LLM][WARNING] max_beam_width is not specified, will use default value of 1
[TensorRT-LLM][WARNING] max_tokens_in_paged_kv_cache is not specified, will use default value
[TensorRT-LLM][WARNING] batch_scheduler_policy parameter was not found or is invalid (must be max_utilization or guaranteed_no_evict)
[TensorRT-LLM][WARNING] enable_chunked_context is not specified, will be set to false.
[TensorRT-LLM][WARNING] enable_trt_overlap is not specified, will be set to false
[TensorRT-LLM][WARNING] normalize_log_probs is not specified, will be set to true
[TensorRT-LLM][WARNING] exclude_input_in_output is not specified, will be set to false
[TensorRT-LLM][WARNING] max_attention_window_size is not specified, will use default value (i.e. max_sequence_length)
[TensorRT-LLM][WARNING] enable_kv_cache_reuse is not specified, will be set to false
[TensorRT-LLM][INFO] Engine version 0.8.0 found in the config file, assuming engine(s) built by new builder API.
[TensorRT-LLM][WARNING] [json.exception.type_error.302] type must be array, but is null
[TensorRT-LLM][WARNING] Optional value for parameter lora_target_modules will not be set.
[TensorRT-LLM][WARNING] Parameter max_draft_len cannot be read from json:
[TensorRT-LLM][WARNING] [json.exception.out_of_range.403] key 'max_draft_len' not found
[TensorRT-LLM][WARNING] [json.exception.type_error.302] type must be string, but is null
[TensorRT-LLM][WARNING] Optional value for parameter quant_algo will not be set.
[TensorRT-LLM][WARNING] [json.exception.type_error.302] type must be string, but is null
[TensorRT-LLM][WARNING] Optional value for parameter kv_cache_quant_algo will not be set.
[TensorRT-LLM][INFO] Initializing MPI with thread mode 1
I0530 07:21:35.426101 121 python_be.cc:2381] TRITONBACKEND_ModelInstanceInitialize: tensorrt_llm_bls_0_0 (GPU device 6)
I0530 07:21:35.428677 121 python_be.cc:2381] TRITONBACKEND_ModelInstanceInitialize: tensorrt_llm_bls_0_0 (GPU device 2)
I0530 07:21:35.428772 121 python_be.cc:2381] TRITONBACKEND_ModelInstanceInitialize: tensorrt_llm_bls_0_0 (GPU device 1)
I0530 07:21:35.428694 121 python_be.cc:2381] TRITONBACKEND_ModelInstanceInitialize: tensorrt_llm_bls_0_0 (GPU device 0)
I0530 07:21:35.432108 121 python_be.cc:2381] TRITONBACKEND_ModelInstanceInitialize: tensorrt_llm_bls_0_0 (GPU device 3)
I0530 07:21:35.434864 121 python_be.cc:2381] TRITONBACKEND_ModelInstanceInitialize: tensorrt_llm_bls_0_0 (GPU device 7)
I0530 07:21:35.435693 121 python_be.cc:2381] TRITONBACKEND_ModelInstanceInitialize: tensorrt_llm_bls_0_0 (GPU device 4)
I0530 07:21:35.437788 121 python_be.cc:2381] TRITONBACKEND_ModelInstanceInitialize: tensorrt_llm_bls_0_0 (GPU device 5)
[TensorRT-LLM][WARNING] gpu_device_ids is not specified, will be automatically set
[TensorRT-LLM][WARNING] max_beam_width is not specified, will use default value of 1
[TensorRT-LLM][WARNING] max_tokens_in_paged_kv_cache is not specified, will use default value
[TensorRT-LLM][WARNING] batch_scheduler_policy parameter was not found or is invalid (must be max_utilization or guaranteed_no_evict)
[TensorRT-LLM][WARNING] enable_chunked_context is not specified, will be set to false.
[TensorRT-LLM][WARNING] enable_trt_overlap is not specified, will be set to false
[TensorRT-LLM][WARNING] normalize_log_probs is not specified, will be set to true
[TensorRT-LLM][WARNING] exclude_input_in_output is not specified, will be set to false
[TensorRT-LLM][WARNING] max_attention_window_size is not specified, will use default value (i.e. max_sequence_length)
[TensorRT-LLM][WARNING] enable_kv_cache_reuse is not specified, will be set to false
[TensorRT-LLM][INFO] Engine version 0.8.0 found in the config file, assuming engine(s) built by new builder API.
[TensorRT-LLM][WARNING] [json.exception.type_error.302] type must be array, but is null
[TensorRT-LLM][WARNING] Optional value for parameter lora_target_modules will not be set.
[TensorRT-LLM][WARNING] Parameter max_draft_len cannot be read from json:
[TensorRT-LLM][WARNING] [json.exception.out_of_range.403] key 'max_draft_len' not found
[TensorRT-LLM][WARNING] [json.exception.type_error.302] type must be string, but is null
[TensorRT-LLM][WARNING] Optional value for parameter quant_algo will not be set.
[TensorRT-LLM][WARNING] [json.exception.type_error.302] type must be string, but is null
[TensorRT-LLM][WARNING] Optional value for parameter kv_cache_quant_algo will not be set.
[TensorRT-LLM][INFO] Initializing MPI with thread mode 1
None of PyTorch, TensorFlow >= 2.0, or Flax have been found. Models won't be available and only tokenizers, configuration and file/data utilities can be used.
None of PyTorch, TensorFlow >= 2.0, or Flax have been found. Models won't be available and only tokenizers, configuration and file/data utilities can be used.
None of PyTorch, TensorFlow >= 2.0, or Flax have been found. Models won't be available and only tokenizers, configuration and file/data utilities can be used.
None of PyTorch, TensorFlow >= 2.0, or Flax have been found. Models won't be available and only tokenizers, configuration and file/data utilities can be used.
None of PyTorch, TensorFlow >= 2.0, or Flax have been found. Models won't be available and only tokenizers, configuration and file/data utilities can be used.
None of PyTorch, TensorFlow >= 2.0, or Flax have been found. Models won't be available and only tokenizers, configuration and file/data utilities can be used.
None of PyTorch, TensorFlow >= 2.0, or Flax have been found. Models won't be available and only tokenizers, configuration and file/data utilities can be used.
None of PyTorch, TensorFlow >= 2.0, or Flax have been found. Models won't be available and only tokenizers, configuration and file/data utilities can be used.
None of PyTorch, TensorFlow >= 2.0, or Flax have been found. Models won't be available and only tokenizers, configuration and file/data utilities can be used.
None of PyTorch, TensorFlow >= 2.0, or Flax have been found. Models won't be available and only tokenizers, configuration and file/data utilities can be used.
[TensorRT-LLM][INFO] MPI size: 2, rank: 0
[TensorRT-LLM][INFO] Rank 0 is using GPU 0
[TensorRT-LLM][INFO] MPI size: 2, rank: 1
[TensorRT-LLM][INFO] Rank 1 is using GPU 1
I0530 07:21:36.039394 121 model_lifecycle.cc:835] successfully loaded 'tensorrt_llm_bls'
None of PyTorch, TensorFlow >= 2.0, or Flax have been found. Models won't be available and only tokenizers, configuration and file/data utilities can be used.
None of PyTorch, TensorFlow >= 2.0, or Flax have been found. Models won't be available and only tokenizers, configuration and file/data utilities can be used.
None of PyTorch, TensorFlow >= 2.0, or Flax have been found. Models won't be available and only tokenizers, configuration and file/data utilities can be used.
None of PyTorch, TensorFlow >= 2.0, or Flax have been found. Models won't be available and only tokenizers, configuration and file/data utilities can be used.
None of PyTorch, TensorFlow >= 2.0, or Flax have been found. Models won't be available and only tokenizers, configuration and file/data utilities can be used.
Keyword arguments {'add_special_tokens': False} not recognized.
Keyword arguments {'add_special_tokens': False} not recognized.
Keyword arguments {'add_special_tokens': False} not recognized.
Keyword arguments {'add_special_tokens': False} not recognized.
None of PyTorch, TensorFlow >= 2.0, or Flax have been found. Models won't be available and only tokenizers, configuration and file/data utilities can be used.
Keyword arguments {'add_special_tokens': False} not recognized.
Keyword arguments {'add_special_tokens': False} not recognized.
Keyword arguments {'add_special_tokens': False} not recognized.
Keyword arguments {'add_special_tokens': False} not recognized.
Keyword arguments {'add_special_tokens': False} not recognized.
Keyword arguments {'add_special_tokens': False} not recognized.
Keyword arguments {'add_special_tokens': False} not recognized.
Keyword arguments {'add_special_tokens': False} not recognized.
Keyword arguments {'add_special_tokens': False} not recognized.
Keyword arguments {'add_special_tokens': False} not recognized.
Keyword arguments {'add_special_tokens': False} not recognized.
Keyword arguments {'add_special_tokens': False} not recognized.
I0530 07:21:36.287594 121 model_lifecycle.cc:835] successfully loaded 'preprocessing'
I0530 07:21:36.316022 121 model_lifecycle.cc:835] successfully loaded 'postprocessing'
[TensorRT-LLM][INFO] TRTGptModel maxNumSequences: 1
[TensorRT-LLM][INFO] TRTGptModel maxBatchSize: 1
[TensorRT-LLM][INFO] TRTGptModel mMaxAttentionWindowSize: 2048
[TensorRT-LLM][INFO] TRTGptModel enableTrtOverlap: 0
[TensorRT-LLM][INFO] TRTGptModel normalizeLogProbs: 1
[TensorRT-LLM][WARNING] Device 0 peer access Device 1 is not available.
[TensorRT-LLM][WARNING] Device 0 peer access Device 2 is not available.
[TensorRT-LLM][WARNING] Device 0 peer access Device 3 is not available.
[TensorRT-LLM][WARNING] Device 0 peer access Device 4 is not available.
[TensorRT-LLM][WARNING] Device 0 peer access Device 5 is not available.
[TensorRT-LLM][WARNING] Device 0 peer access Device 6 is not available.
[TensorRT-LLM][WARNING] Device 0 peer access Device 7 is not available.
[TensorRT-LLM][INFO] Loaded engine size: 12682 MiB
[TensorRT-LLM][INFO] TRTGptModel maxNumSequences: 1
[TensorRT-LLM][INFO] TRTGptModel maxBatchSize: 1
[TensorRT-LLM][INFO] TRTGptModel mMaxAttentionWindowSize: 2048
[TensorRT-LLM][INFO] TRTGptModel enableTrtOverlap: 0
[TensorRT-LLM][INFO] TRTGptModel normalizeLogProbs: 1
[TensorRT-LLM][WARNING] Device 1 peer access Device 0 is not available.
[TensorRT-LLM][WARNING] Device 1 peer access Device 2 is not available.
[TensorRT-LLM][WARNING] Device 1 peer access Device 3 is not available.
[TensorRT-LLM][WARNING] Device 1 peer access Device 4 is not available.
[TensorRT-LLM][WARNING] Device 1 peer access Device 5 is not available.
[TensorRT-LLM][WARNING] Device 1 peer access Device 6 is not available.
[TensorRT-LLM][WARNING] Device 1 peer access Device 7 is not available.
[TensorRT-LLM][INFO] Loaded engine size: 12682 MiB
[TensorRT-LLM][INFO] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +8, now: CPU 12804, GPU 13348 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] Init cuDNN: CPU +2, GPU +10, now: CPU 12806, GPU 13358 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +8, now: CPU 12801, GPU 13350 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] Init cuDNN: CPU +2, GPU +10, now: CPU 12803, GPU 13360 (MiB)
Failed, NCCL error /tmp/tritonbuild/tensorrtllm/tensorrt_llm/cpp/tensorrt_llm/plugins/common/plugin.cpp:86 'internal error - please report this issue to the NCCL developers'
Failed, NCCL error /tmp/tritonbuild/tensorrtllm/tensorrt_llm/cpp/tensorrt_llm/plugins/common/plugin.cpp:86 'internal error - please report this issue to the NCCL developers'

Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.


mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:

Process name: [[55548,1],1]
Exit code: 1

Would you please tell me what the problem is?

@YunChen1227 (Author) commented:

The log says:
None of PyTorch, TensorFlow >= 2.0, or Flax have been found. Models won't be available and only tokenizers, configuration and file/data utilities can be used.
But running pip install torch==2.2 did not solve this problem.

@dinhkt commented on May 30, 2024

The latest examples/run.py (#1688) still passes gpu_weights_percent, so it will cause this error when running an engine built by TRT-LLM 0.9.0.
A quick fix is to remove the line that passes that argument (line 430) and the related arguments (lines 445-449); then you can run the engine. I can run a Llama 3 engine built by TRT-LLM 0.9.0 with it.
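
As an alternative to deleting those lines by hand, here is a hedged sketch of the same workaround: filter runner_kwargs against whatever the installed from_dir actually accepts before the failing call in main(). The variable names are the ones shown in the traceback above; the helper name is made up for illustration.

```python
# Version-tolerant variant of the failing call in examples/run.py: drop any
# kwargs (gpu_weights_percent, etc.) that the installed runner's from_dir
# does not recognize, instead of hard-coding which lines to delete.
import inspect

def filter_from_dir_kwargs(runner_cls, runner_kwargs):
    """Keep only the kwargs that this TRT-LLM build's from_dir accepts."""
    accepted = inspect.signature(runner_cls.from_dir).parameters
    return {k: v for k, v in runner_kwargs.items() if k in accepted}

# In main(), replace the failing call with:
#     runner = runner_cls.from_dir(**filter_from_dir_kwargs(runner_cls, runner_kwargs))
```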

@YunChen1227 (Author) commented:

Thanks for the instructions. Have you tried deploying the engine with Triton Server? Do you have any suggestions for that?

@dinhkt commented on May 31, 2024

Yes, I can deploy the engine successfully with Triton. You need to modify the config.pbtxt files as in the instructions here: https://developer.nvidia.com/blog/turbocharging-meta-llama-3-performance-with-nvidia-tensorrt-llm-and-nvidia-triton-inference-server/

@YunChen1227 (Author) commented:

Thanks, but it is a bit different for Llama 2 13B, which has to use tensor parallelism; some parameters might differ from the instructions, and I failed again. Anyway, thank you very much for the advice.

github-actions bot commented on Jul 1, 2024

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 15 days.

github-actions bot added the stale label on Jul 1, 2024
@nv-guomingz (Collaborator) commented:

Hi @YunChen1227, do you still have any further issues or questions? If not, we'll close this soon.
