I'm using commit f7eca56161d496cbd28e8e7689dbd90003594bd2.
gptSessionBenchmark runs fine.
But when I run gptManagerBenchmark to test in-flight batching, it crashes with a segmentation fault:
mpirun -n 2 --allow-run-as-root benchmarks/gptManagerBenchmark --model llama --engine_dir ../../examples/llama/./tmp/llama/13B/trt_engines/fp16/2-gpu/ --dataset /data/TensorRT-LLM/benchmarks/cpp/preprocessed_dataset_256.json
[bms-airtrunk-c-g1v2-app-10-214-144-75:317357:0:318107] Caught signal 11 (Segmentation fault: address not mapped to object at address (nil))
==== backtrace (tid: 318107) ====
0 0x0000000000042520 __sigaction() ???:0
1 0x000000000007c262 tensorrt_llm::batch_manager::NamedTensor::serialize() ???:0
2 0x000000000007b79d tensorrt_llm::batch_manager::InferenceRequest::serialize() ???:0
3 0x0000000000048f4e GptServer::getInferenceRequests() ???:0
4 0x0000000000049837 std::_Function_handler<std::list<std::shared_ptr<tensorrt_llm::batch_manager::InferenceRequest>, std::allocator<std::shared_ptr<tensorrt_llm::batch_manager::InferenceRequest> > > (int), GptServer::GptServer(std::filesystem::path const&, tensorrt_llm::batch_manager::TrtGptModelType, int, tensorrt_llm::batch_manager::batch_scheduler::SchedulerPolicy, tensorrt_llm::batch_manager::TrtGptModelOptionalParams const&, std::shared_ptr<Recorder>, std::optional<unsigned long>)::{lambda(int)#1}>::_M_invoke() ???:0
5 0x0000000000071ce4 tensorrt_llm::batch_manager::GptManager::fetchNewRequests() ???:0
6 0x00000000000733c6 tensorrt_llm::batch_manager::GptManager::decoupled_execution_loop() ???:0
7 0x00000000000dc253 std::error_code::default_error_condition() ???:0
8 0x0000000000094b43 pthread_condattr_setpshared() ???:0
9 0x0000000000125bb4 clone() ???:0
=================================
[bms-airtrunk-c-g1v2-app-10-214-144-75:317357] *** Process received signal ***
[bms-airtrunk-c-g1v2-app-10-214-144-75:317357] Signal: Segmentation fault (11)
[bms-airtrunk-c-g1v2-app-10-214-144-75:317357] Signal code: (-6)
[bms-airtrunk-c-g1v2-app-10-214-144-75:317357] Failing at address: 0x4d7ad
[bms-airtrunk-c-g1v2-app-10-214-144-75:317357] [ 0] /lib/x86_64-linux-gnu/libc.so.6(+0x42520)[0x7f5b0bd76520]
[bms-airtrunk-c-g1v2-app-10-214-144-75:317357] [ 1] benchmarks/gptManagerBenchmark(+0x7c262)[0x55f2d3804262]
[bms-airtrunk-c-g1v2-app-10-214-144-75:317357] [ 2] benchmarks/gptManagerBenchmark(+0x7b79d)[0x55f2d380379d]
[bms-airtrunk-c-g1v2-app-10-214-144-75:317357] [ 3] benchmarks/gptManagerBenchmark(+0x48f4e)[0x55f2d37d0f4e]
[bms-airtrunk-c-g1v2-app-10-214-144-75:317357] [ 4] benchmarks/gptManagerBenchmark(+0x49837)[0x55f2d37d1837]
[bms-airtrunk-c-g1v2-app-10-214-144-75:317357] [ 5] benchmarks/gptManagerBenchmark(+0x71ce4)[0x55f2d37f9ce4]
[bms-airtrunk-c-g1v2-app-10-214-144-75:317357] [ 6] benchmarks/gptManagerBenchmark(+0x733c6)[0x55f2d37fb3c6]
[bms-airtrunk-c-g1v2-app-10-214-144-75:317357] [ 7] /lib/x86_64-linux-gnu/libstdc++.so.6(+0xdc253)[0x7f5b0c058253]
[bms-airtrunk-c-g1v2-app-10-214-144-75:317357] [ 8] /lib/x86_64-linux-gnu/libc.so.6(+0x94b43)[0x7f5b0bdc8b43]
[bms-airtrunk-c-g1v2-app-10-214-144-75:317357] [ 9] /lib/x86_64-linux-gnu/libc.so.6(clone+0x44)[0x7f5b0be59bb4]
[bms-airtrunk-c-g1v2-app-10-214-144-75:317357] *** End of error message ***
--------------------------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that process rank 0 with PID 0 on node bms-airtrunk-c-g1v2-app-10-214-144-75 exited on signal 11 (Segmentation fault).
Here is my script to convert the model. I'm running on 2x V100 GPUs.
Thanks for reporting the issue. We have fixed it in the internal codebase. Until the fix is pushed, could you please try the following changes to unblock yourself?