I'm using commit a21e2f85178111fed9812bb88c2cc7411b25f0ba on 2 * A30 GPUs, since the latest commit doesn't work for me (see #649).
I find that when I build the engine with different `max_batch_size` values, the performance differs a lot.
Here is the token throughput when building with different `max_batch_size` values and running with different `max_num_sequences` values.
You can see that the performance becomes bad once `max_batch_size` is 32 or larger.
The benchmark script is:

```bash
CUDA_VISIBLE_DEVICES=0,1 mpirun -n 2 --allow-run-as-root \
  benchmarks/gptManagerBenchmark \
  --model llama \
  --engine_dir ../../examples/llama/./tmp/llama/13B/trt_engines/fp16/2-gpu/ \
  --dataset /data/TensorRT-LLM/examples/llama/preprocessed_dataset_256.json \
  --max_num_sequences $1
```
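For reference, a minimal sketch of how this can be swept over `max_num_sequences`, assuming the command above is saved as a hypothetical `run_benchmark.sh` wrapper that forwards `$1` (the value list is illustrative, not from the original measurements):

```bash
# Hypothetical wrapper: run_benchmark.sh contains the mpirun command above
# and forwards $1 to --max_num_sequences.
for n in 8 16 32 64; do        # illustrative values
  bash run_benchmark.sh "$n"   # one benchmark run per max_num_sequences
done
```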
The build script is:

```bash
python build.py --model_dir /data/weilong.yu/vicuna-13b/vicuna-13b-v1.5/ \
  --dtype float16 \
  --use_gpt_attention_plugin float16 \
  --use_gemm_plugin float16 \
  --output_dir ./tmp/llama/13B/trt_engines/fp16/2-gpu/ \
  --max_batch_size $1 \
  --tp_size 2 \
  --world_size 2 \
  --parallel_build \
  --use_inflight_batching \
  --remove_input_padding \
  --paged_kv_cache \
  --enable_context_fmha
```
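A minimal sketch of the full build-and-benchmark sweep, assuming the build command above is saved as a hypothetical `build_engine.sh` wrapper that forwards `$1` to `--max_batch_size` (the batch sizes are illustrative):

```bash
# Hypothetical wrappers: build_engine.sh runs the build.py command above
# with $1 as --max_batch_size; run_benchmark.sh is the benchmark command
# shown earlier with $1 as --max_num_sequences.
for bs in 8 16 32 64 128; do    # illustrative batch sizes
  bash build_engine.sh "$bs"    # rebuild the engine at this max_batch_size
  bash run_benchmark.sh "$bs"   # benchmark with a matching max_num_sequences
done
```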
@sleepwalker2017 Do you still have the problem? If not, we will close it soon.