
gptManagerBenchmark launch failed #649

Closed
sleepwalker2017 opened this issue Dec 13, 2023 · 3 comments
Labels: triaged (Issue has been triaged by maintainers)

sleepwalker2017 commented Dec 13, 2023

I'm using commit f7eca56161d496cbd28e8e7689dbd90003594bd2.
gptSessionBenchmark runs fine, but when I try gptManagerBenchmark to test in-flight batching, it crashes:

mpirun -n 2 --allow-run-as-root benchmarks/gptManagerBenchmark --model llama --engine_dir ../../examples/llama/./tmp/llama/13B/trt_engines/fp16/2-gpu/ --dataset /data/TensorRT-LLM/benchmarks/cpp/preprocessed_dataset_256.json
[bms-airtrunk-c-g1v2-app-10-214-144-75:317357:0:318107] Caught signal 11 (Segmentation fault: address not mapped to object at address (nil))
==== backtrace (tid: 318107) ====
 0 0x0000000000042520 __sigaction()  ???:0
 1 0x000000000007c262 tensorrt_llm::batch_manager::NamedTensor::serialize()  ???:0
 2 0x000000000007b79d tensorrt_llm::batch_manager::InferenceRequest::serialize()  ???:0
 3 0x0000000000048f4e GptServer::getInferenceRequests()  ???:0
 4 0x0000000000049837 std::_Function_handler<std::list<std::shared_ptr<tensorrt_llm::batch_manager::InferenceRequest>, std::allocator<std::shared_ptr<tensorrt_llm::batch_manager::InferenceRequest> > > (int), GptServer::GptServer(std::filesystem::path const&, tensorrt_llm::batch_manager::TrtGptModelType, int, tensorrt_llm::batch_manager::batch_scheduler::SchedulerPolicy, tensorrt_llm::batch_manager::TrtGptModelOptionalParams const&, std::shared_ptr<Recorder>, std::optional<unsigned long>)::{lambda(int)#1}>::_M_invoke()  ???:0
 5 0x0000000000071ce4 tensorrt_llm::batch_manager::GptManager::fetchNewRequests()  ???:0
 6 0x00000000000733c6 tensorrt_llm::batch_manager::GptManager::decoupled_execution_loop()  ???:0
 7 0x00000000000dc253 std::error_code::default_error_condition()  ???:0
 8 0x0000000000094b43 pthread_condattr_setpshared()  ???:0
 9 0x0000000000125bb4 clone()  ???:0
=================================
[bms-airtrunk-c-g1v2-app-10-214-144-75:317357] *** Process received signal ***
[bms-airtrunk-c-g1v2-app-10-214-144-75:317357] Signal: Segmentation fault (11)
[bms-airtrunk-c-g1v2-app-10-214-144-75:317357] Signal code:  (-6)
[bms-airtrunk-c-g1v2-app-10-214-144-75:317357] Failing at address: 0x4d7ad
[bms-airtrunk-c-g1v2-app-10-214-144-75:317357] [ 0] /lib/x86_64-linux-gnu/libc.so.6(+0x42520)[0x7f5b0bd76520]
[bms-airtrunk-c-g1v2-app-10-214-144-75:317357] [ 1] benchmarks/gptManagerBenchmark(+0x7c262)[0x55f2d3804262]
[bms-airtrunk-c-g1v2-app-10-214-144-75:317357] [ 2] benchmarks/gptManagerBenchmark(+0x7b79d)[0x55f2d380379d]
[bms-airtrunk-c-g1v2-app-10-214-144-75:317357] [ 3] benchmarks/gptManagerBenchmark(+0x48f4e)[0x55f2d37d0f4e]
[bms-airtrunk-c-g1v2-app-10-214-144-75:317357] [ 4] benchmarks/gptManagerBenchmark(+0x49837)[0x55f2d37d1837]
[bms-airtrunk-c-g1v2-app-10-214-144-75:317357] [ 5] benchmarks/gptManagerBenchmark(+0x71ce4)[0x55f2d37f9ce4]
[bms-airtrunk-c-g1v2-app-10-214-144-75:317357] [ 6] benchmarks/gptManagerBenchmark(+0x733c6)[0x55f2d37fb3c6]
[bms-airtrunk-c-g1v2-app-10-214-144-75:317357] [ 7] /lib/x86_64-linux-gnu/libstdc++.so.6(+0xdc253)[0x7f5b0c058253]
[bms-airtrunk-c-g1v2-app-10-214-144-75:317357] [ 8] /lib/x86_64-linux-gnu/libc.so.6(+0x94b43)[0x7f5b0bdc8b43]
[bms-airtrunk-c-g1v2-app-10-214-144-75:317357] [ 9] /lib/x86_64-linux-gnu/libc.so.6(clone+0x44)[0x7f5b0be59bb4]
[bms-airtrunk-c-g1v2-app-10-214-144-75:317357] *** End of error message ***
--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that process rank 0 with PID 0 on node bms-airtrunk-c-g1v2-app-10-214-144-75 exited on signal 11 (Segmentation fault).

Here is the script I used to build the engine. I'm running on 2x V100 GPUs:

python build.py --model_dir /data/models/vicuna-13b-v1.5/vicuna-13b-v1.5/ \
                --dtype float16 \
                --use_gpt_attention_plugin float16 \
                --use_gemm_plugin float16 \
                --output_dir ./tmp/llama/13B/trt_engines/fp16/2-gpu/ \
                --max_batch_size 24 \
                --tp_size 2 \
                --world_size 2 \
                --use_inflight_batching \
                --remove_input_padding \
                --paged_kv_cache \
                --parallel_build
byshiue added the triaged label Dec 13, 2023
omer-dayan commented

Having the same issue here. It happens only with mpirun.

kaiyux assigned kaiyux and unassigned MartinMarciniszyn Dec 29, 2023
kaiyux (Member) commented Dec 29, 2023

Hi,

Thanks for reporting the issue. We have fixed it in the internal codebase; before that fix is pushed, could you please try the following change to unblock yourself?

diff --git a/benchmarks/cpp/gptManagerBenchmark.cpp b/benchmarks/cpp/gptManagerBenchmark.cpp
index 9f6681547..c1be42d98 100644
--- a/benchmarks/cpp/gptManagerBenchmark.cpp
+++ b/benchmarks/cpp/gptManagerBenchmark.cpp
@@ -465,8 +465,14 @@ std::shared_ptr<InferenceRequest> makeRequest(std::uint64_t reqId,
     request->setMaxNewTokens(
         bufferManager.copyFrom(&request_output_len, ITensor::makeShape({1, 1}), MemoryType::kPINNED));
     request->setBeamWidth(beamWidthTensor);
-    request->setEndId(eosId);
-    request->setPadId(padId);
+    if (eosId != nullptr)
+    {
+        request->setEndId(eosId);
+    }
+    if (padId != nullptr)
+    {
+        request->setPadId(padId);
+    }
     return request;
 }

Please let me know if you are still seeing the issue after applying the change, thanks!
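
For context on why the guard matters: when the dataset does not provide an end id or pad id, the corresponding tensor pointer stays null, and serializing the request (the NamedTensor::serialize() frame in the backtrace, reached while requests are shipped to the other MPI rank) dereferences that null pointer, which is consistent with the crash only showing up under mpirun. Below is a minimal, self-contained sketch of that failure mode; the types are hypothetical stand-ins, not the actual tensorrt_llm::batch_manager classes:

// Sketch of the failure mode with simplified, hypothetical types.
#include <cstdint>
#include <memory>
#include <vector>

struct Tensor
{
    std::vector<int32_t> data;
};

struct NamedTensorSketch
{
    std::shared_ptr<Tensor> tensor; // stays null when the field was never set

    std::vector<int32_t> serialize() const
    {
        return tensor->data; // dereferences a null pointer -> SIGSEGV
    }

    std::vector<int32_t> serializeSafe() const
    {
        // guard mirrors the patch above: only touch the tensor when it exists
        return tensor ? tensor->data : std::vector<int32_t>{};
    }
};

int main()
{
    NamedTensorSketch endId{};          // end id never provided, tensor == nullptr
    auto bytes = endId.serializeSafe(); // fine: empty payload
    // endId.serialize();               // would crash like the backtrace above
    return static_cast<int>(bytes.size());
}

The patch above takes the simpler route at the call site: it skips setEndId/setPadId when the pointers are null, so a null tensor never reaches serialization in the first place.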

kaiyux (Member) commented Jan 5, 2024

Hi,

This issue should already be fixed on the latest main branch; please check and let us know if you're still seeing it.

Closing. Thanks very much for the support.

kaiyux closed this as completed Jan 5, 2024