Why is llava trt-llm not much faster than transformers? #1123
Comments
Same issue.
cross-ref #1118 (comment) The main branch has been updated. Thanks.
@kaiyux Thanks for your update. I tried compiling the latest code in Docker (tensorrt-llm:0.9.0.dev2024022700) on a 3090. The maximum batch size used when building the engine file is 8, and I ran run.py with batch=1, 2, 4, 8.
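(For anyone reproducing this kind of sweep: a minimal sketch of how per-batch latency can be timed. `generate_fn` and `make_inputs` are hypothetical stand-ins for whatever the script wraps — a TRT-LLM runner or HF `generate` — not names from the actual run.py.)

```python
import time

def time_batches(generate_fn, make_inputs, batch_sizes=(1, 2, 4, 8),
                 warmup=2, iters=5):
    """Rough per-batch latency sweep over a generate-style callable."""
    for bs in batch_sizes:
        inputs = make_inputs(bs)
        for _ in range(warmup):      # discard warm-up runs (CUDA init, autotuning)
            generate_fn(inputs)
        t0 = time.perf_counter()
        for _ in range(iters):
            generate_fn(inputs)
        dt = (time.perf_counter() - t0) / iters
        # If s/image drops as batch grows, the engine is batching effectively.
        print(f"batch={bs}: {dt:.3f} s/batch, {dt / bs:.3f} s/image")
```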
Hi @bleedingfight Although …
@amukkara In my application scenario, we need to output image descriptions for many images, so the max_new_tokens for the output is the same across different images. I can indeed improve performance by reducing the value of max_new_tokens, but that is only because the number of iterations of the auto-regressive model decreases. I think the optimization in TRT-LLM is not deep enough.
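For what it's worth, that claim is easy to check empirically: with autoregressive decoding, end-to-end latency grows roughly linearly with the number of generated tokens on any backend. A minimal sketch, assuming a hypothetical `generate_fn(n)` that runs one inference capped at n new tokens:

```python
import time

def tokens_vs_latency(generate_fn, lengths=(50, 100, 200)):
    """Measure latency as a function of output length."""
    for n in lengths:
        t0 = time.perf_counter()
        generate_fn(n)
        dt = time.perf_counter() - t0
        # If ms/token is roughly constant across lengths, latency is
        # dominated by per-token decode steps, as the comment argues.
        print(f"max_new_tokens={n}: {dt:.2f} s total, {dt / n * 1000:.1f} ms/token")
```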
@bleedingfight …
@amukkara How can I maintain consistent output between TRT-LLM and Transformers? I use the generate interface in Transformers with max_new_tokens=200, top_k=50, and the same prompt, and the outputs of Transformers and TRT-LLM differ. I want to know how NVIDIA officially ensures the correctness of TRT model output; it seems difficult for developers to verify this on their own.
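One practical way to make such a comparison meaningful is to disable sampling on both sides: with top_k=50, decoding is stochastic, so Transformers and TRT-LLM will produce different outputs run to run even when both are correct. A minimal sketch of a deterministic Transformers reference (the helper name is illustrative); the TRT-LLM runner can be configured for greedy decoding analogously:

```python
def greedy_reference(model, inputs, max_new_tokens=200):
    """Deterministic Transformers baseline: with do_sample=False the
    top-k setting is ignored and decoding is greedy, so the two
    backends can be compared token-by-token rather than sample-by-sample."""
    out = model.generate(**inputs, max_new_tokens=max_new_tokens,
                         do_sample=False, num_beams=1)
    # generate() returns prompt + completion; keep only the new tokens.
    return out[0, inputs["input_ids"].shape[1]:].tolist()
```

Even under greedy decoding, small numerical differences in fp16 logits between backends (fused attention, different kernel orderings) can make the argmax diverge late in a long sequence, so token-level agreement over a short prefix is a more realistic acceptance criterion than exact 200-token equality.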
Same question. We tested on Qwen-VL-Chat. Have you solved this problem yet? Looking forward to any suggestions.
System Info
Who can help?
No response
Information
Tasks
An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
Reproduction
Expected behavior
Output is not much faster than Transformers!
actual behavior
[02/21/2024-11:36:57] [TRT-LLM] [I] TensorRT vision encoder latency: 0.0076348876953125 sec
[02/21/2024-11:36:57] [TRT-LLM] [I] TensorRT-LLM LLM latency: 4.899357304573059 sec
[02/21/2024-11:36:57] [TRT-LLM] [I] Generate latency: 4.9331824231147765 sec
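Note the breakdown: the vision encoder takes ~0.008 s of the ~4.93 s total, so over 99% of the end-to-end time is spent in autoregressive LLM decoding; any meaningful speedup has to come from the decoder side, not the encoder.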
additional notes
I only modified a part of the code that is not closely related to the algorithm (llava.py and run_demo.sh).

Pipeline with Transformers (llava_demo.py):
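The original llava_demo.py is not reproduced above; what follows is a minimal sketch of a comparable Transformers pipeline, with a warm-up run and CUDA synchronization so the measurement is fair. The checkpoint name, image path, and prompt are placeholders, not values from the original script:

```python
import time
import torch
from PIL import Image
from transformers import AutoProcessor, LlavaForConditionalGeneration

# Placeholder checkpoint; substitute the local model directory used in the test.
model_id = "llava-hf/llava-1.5-7b-hf"
processor = AutoProcessor.from_pretrained(model_id)
model = LlavaForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.float16).to("cuda")

image = Image.open("test.jpg")
prompt = "USER: <image>\nDescribe this image in detail. ASSISTANT:"
inputs = processor(text=prompt, images=image,
                   return_tensors="pt").to("cuda", torch.float16)

# Warm-up run so CUDA context creation and kernel autotuning do not
# inflate the measured latency (a likely cause of the 10 s outlier below).
model.generate(**inputs, max_new_tokens=200, do_sample=False)

torch.cuda.synchronize()
t0 = time.perf_counter()
out = model.generate(**inputs, max_new_tokens=200, do_sample=False)
torch.cuda.synchronize()
print(f"latency: {time.perf_counter() - t0:.2f} s")
print(processor.decode(out[0], skip_special_tokens=True))
```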
Although the above code may take around 10 seconds for a single inference on my machine, that was perhaps an outlier, as my previous testing took about 5 seconds. A single test fluctuates significantly, but it stabilizes at around 5 seconds over a large number of runs. Why is llava trt-llm not much faster than transformers?
During operation, GPU utilization did not reach 100%.
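A plausible contributing factor: at batch size 1, autoregressive decoding is typically memory-bandwidth bound rather than compute bound, so GPU utilization well below 100% is expected on a 3090; larger batches or in-flight batching are the usual ways to raise it.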