
gptManagerBenchmark launch failed #649

Closed
sleepwalker2017 opened this issue Dec 13, 2023 · 3 comments
Labels: triaged (Issue has been triaged by maintainers)

sleepwalker2017 commented Dec 13, 2023

I'm using commit f7eca56161d496cbd28e8e7689dbd90003594bd2.
gptSessionBenchmark runs fine, but when I try gptManagerBenchmark to test in-flight batching, it crashes:

mpirun -n 2 --allow-run-as-root benchmarks/gptManagerBenchmark --model llama --engine_dir ../../examples/llama/./tmp/llama/13B/trt_engines/fp16/2-gpu/ --dataset /data/TensorRT-LLM/benchmarks/cpp/preprocessed_dataset_256.json
[bms-airtrunk-c-g1v2-app-10-214-144-75:317357:0:318107] Caught signal 11 (Segmentation fault: address not mapped to object at address (nil))
==== backtrace (tid: 318107) ====
 0 0x0000000000042520 __sigaction()  ???:0
 1 0x000000000007c262 tensorrt_llm::batch_manager::NamedTensor::serialize()  ???:0
 2 0x000000000007b79d tensorrt_llm::batch_manager::InferenceRequest::serialize()  ???:0
 3 0x0000000000048f4e GptServer::getInferenceRequests()  ???:0
 4 0x0000000000049837 std::_Function_handler<std::list<std::shared_ptr<tensorrt_llm::batch_manager::InferenceRequest>, std::allocator<std::shared_ptr<tensorrt_llm::batch_manager::InferenceRequest> > > (int), GptServer::GptServer(std::filesystem::path const&, tensorrt_llm::batch_manager::TrtGptModelType, int, tensorrt_llm::batch_manager::batch_scheduler::SchedulerPolicy, tensorrt_llm::batch_manager::TrtGptModelOptionalParams const&, std::shared_ptr<Recorder>, std::optional<unsigned long>)::{lambda(int)#1}>::_M_invoke()  ???:0
 5 0x0000000000071ce4 tensorrt_llm::batch_manager::GptManager::fetchNewRequests()  ???:0
 6 0x00000000000733c6 tensorrt_llm::batch_manager::GptManager::decoupled_execution_loop()  ???:0
 7 0x00000000000dc253 std::error_code::default_error_condition()  ???:0
 8 0x0000000000094b43 pthread_condattr_setpshared()  ???:0
 9 0x0000000000125bb4 clone()  ???:0
=================================
[bms-airtrunk-c-g1v2-app-10-214-144-75:317357] *** Process received signal ***
[bms-airtrunk-c-g1v2-app-10-214-144-75:317357] Signal: Segmentation fault (11)
[bms-airtrunk-c-g1v2-app-10-214-144-75:317357] Signal code:  (-6)
[bms-airtrunk-c-g1v2-app-10-214-144-75:317357] Failing at address: 0x4d7ad
[bms-airtrunk-c-g1v2-app-10-214-144-75:317357] [ 0] /lib/x86_64-linux-gnu/libc.so.6(+0x42520)[0x7f5b0bd76520]
[bms-airtrunk-c-g1v2-app-10-214-144-75:317357] [ 1] benchmarks/gptManagerBenchmark(+0x7c262)[0x55f2d3804262]
[bms-airtrunk-c-g1v2-app-10-214-144-75:317357] [ 2] benchmarks/gptManagerBenchmark(+0x7b79d)[0x55f2d380379d]
[bms-airtrunk-c-g1v2-app-10-214-144-75:317357] [ 3] benchmarks/gptManagerBenchmark(+0x48f4e)[0x55f2d37d0f4e]
[bms-airtrunk-c-g1v2-app-10-214-144-75:317357] [ 4] benchmarks/gptManagerBenchmark(+0x49837)[0x55f2d37d1837]
[bms-airtrunk-c-g1v2-app-10-214-144-75:317357] [ 5] benchmarks/gptManagerBenchmark(+0x71ce4)[0x55f2d37f9ce4]
[bms-airtrunk-c-g1v2-app-10-214-144-75:317357] [ 6] benchmarks/gptManagerBenchmark(+0x733c6)[0x55f2d37fb3c6]
[bms-airtrunk-c-g1v2-app-10-214-144-75:317357] [ 7] /lib/x86_64-linux-gnu/libstdc++.so.6(+0xdc253)[0x7f5b0c058253]
[bms-airtrunk-c-g1v2-app-10-214-144-75:317357] [ 8] /lib/x86_64-linux-gnu/libc.so.6(+0x94b43)[0x7f5b0bdc8b43]
[bms-airtrunk-c-g1v2-app-10-214-144-75:317357] [ 9] /lib/x86_64-linux-gnu/libc.so.6(clone+0x44)[0x7f5b0be59bb4]
[bms-airtrunk-c-g1v2-app-10-214-144-75:317357] *** End of error message ***
--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that process rank 0 with PID 0 on node bms-airtrunk-c-g1v2-app-10-214-144-75 exited on signal 11 (Segmentation fault).

Here is the script I used to build the engine. I'm running on 2x V100 GPUs:

python build.py --model_dir /data/models/vicuna-13b-v1.5/vicuna-13b-v1.5/ \
                --dtype float16 \
                --use_gpt_attention_plugin float16 \
                --use_gemm_plugin float16 \
                --output_dir ./tmp/llama/13B/trt_engines/fp16/2-gpu/ \
                --max_batch_size 24 \
                --tp_size 2 \
                --world_size 2 \
                --use_inflight_batching \
                --remove_input_padding \
                --paged_kv_cache \
                --parallel_build
byshiue added the triaged label Dec 13, 2023
omer-dayan commented

Having the same issue here. It happens only with mpirun.

kaiyux assigned kaiyux and unassigned MartinMarciniszyn Dec 29, 2023
kaiyux (Member) commented Dec 29, 2023

Hi,

Thanks for reporting the issue. We have fixed it in the internal codebase; before that fix is pushed, could you please try the following change to unblock yourself?

diff --git a/benchmarks/cpp/gptManagerBenchmark.cpp b/benchmarks/cpp/gptManagerBenchmark.cpp
index 9f6681547..c1be42d98 100644
--- a/benchmarks/cpp/gptManagerBenchmark.cpp
+++ b/benchmarks/cpp/gptManagerBenchmark.cpp
@@ -465,8 +465,14 @@ std::shared_ptr<InferenceRequest> makeRequest(std::uint64_t reqId,
     request->setMaxNewTokens(
         bufferManager.copyFrom(&request_output_len, ITensor::makeShape({1, 1}), MemoryType::kPINNED));
     request->setBeamWidth(beamWidthTensor);
-    request->setEndId(eosId);
-    request->setPadId(padId);
+    if (eosId != nullptr)
+    {
+        request->setEndId(eosId);
+    }
+    if (padId != nullptr)
+    {
+        request->setPadId(padId);
+    }
     return request;
 }

Please let me know if you are still seeing the issue after applying the change, thanks!
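
For context on why the guard matters: when the dataset does not provide an end id or pad id, the corresponding tensor pointer stays null, and serializing the request (the NamedTensor::serialize() frame in the backtrace, reached while requests are shipped to the other MPI rank) dereferences that null pointer, which is consistent with the crash only showing up under mpirun. Below is a minimal, self-contained sketch of that failure mode; the types are hypothetical stand-ins, not the actual tensorrt_llm::batch_manager classes:

// Sketch of the failure mode with simplified, hypothetical types.
#include <cstdint>
#include <memory>
#include <vector>

struct Tensor
{
    std::vector<int32_t> data;
};

struct NamedTensorSketch
{
    std::shared_ptr<Tensor> tensor; // stays null when the field was never set

    std::vector<int32_t> serialize() const
    {
        return tensor->data; // dereferences a null pointer -> SIGSEGV
    }

    std::vector<int32_t> serializeSafe() const
    {
        // guard mirrors the patch above: only touch the tensor when it exists
        return tensor ? tensor->data : std::vector<int32_t>{};
    }
};

int main()
{
    NamedTensorSketch endId{};          // end id never provided, tensor == nullptr
    auto bytes = endId.serializeSafe(); // fine: empty payload
    // endId.serialize();               // would crash like the backtrace above
    return static_cast<int>(bytes.size());
}

The patch above takes the simpler route at the call site: it skips setEndId/setPadId when the pointers are null, so a null tensor never reaches serialization in the first place.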

kaiyux (Member) commented Jan 5, 2024

Hi,

This issue should already be fixed on the latest main branch; please check and let us know if you're still seeing it.

Closing. Thanks very much for the support.

kaiyux closed this as completed Jan 5, 2024