[Misc]: TTFT profiling with respect to prompt length #7635
Comments
There are a few things to discuss here:
Thank you @ywang96 for the suggestion. Here is some additional info from me:
Probably it enters the compute-bound region when the prompt length reaches 400.
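One way to picture this suggestion is a toy two-regime model: TTFT is the larger of a fixed floor and a term that grows linearly with prompt length. The sketch below is only an illustration. The 100 ms floor and the knee at 400 come from the observations reported in this issue; the max-of-two-terms form and the function name `ttft_model_ms` are assumptions, not a fit to the measured data.

```python
def ttft_model_ms(prompt_len: int,
                  floor_ms: float = 100.0,
                  knee_tokens: int = 400) -> float:
    """Toy TTFT model (assumed form): flat floor, then linear growth.

    floor_ms and knee_tokens are taken from the issue's observations;
    the slope is chosen so the two regimes meet exactly at the knee.
    """
    per_token_ms = floor_ms / knee_tokens  # 0.25 ms per token with the defaults
    return max(floor_ms, per_token_ms * prompt_len)


# Below the knee the floor dominates; above it the linear term does.
for n in (50, 200, 400, 800, 1600):
    print(f"prompt_len={n:5d}  modeled TTFT={ttft_model_ms(n):6.1f} ms")
```

Under this model the flat region simply means that some cost independent of prompt length dominates the prefill step for short prompts. The model does not say what that cost is.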
Anything you want to discuss about vllm.
I am profiling TTFT and TPOT on my machine. I cannot explain the behavior of TTFT, so I opened this issue to ask for advice.
The figure below shows TTFT with respect to prompt length on my machine. The test conditions are as follows:
Steps taken for TTFT and TPOT profiling:
1. Start the server with python -m vllm.entrypoints.openai.api_server --args
2. Run benchmark_serving.py to get the TTFT and TPOT, sending only one request at a time to eliminate the effect of waiting time.
The profiled TTFT is as below:
Observation 1: When the prompt length is less than 400, the TTFT stays roughly flat at ~100 ms. This value is consistent across different TP settings (tried TP=1, TP=2 and TP=4).
Observation 2: When the prompt length is greater than 400, TTFT grows linearly with prompt length. This result is in line with Figure 6b of this paper (https://arxiv.org/pdf/2405.06856).
I don't understand the result of observation 1. Can anyone provide some insight? What causes TTFT to be a horizontal line when the prompt length is less than 400?
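For reference, the two metrics from the profiling steps above can be computed from per-token arrival timestamps of a single streamed request. This is a minimal sketch under a common definition (TTFT is the delay until the first token; TPOT is the average gap between later tokens). `RequestTrace` and both function names are hypothetical helpers for illustration and are not part of vLLM.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class RequestTrace:
    """Timestamps in seconds for one streamed request (hypothetical type)."""
    sent_at: float
    token_times: List[float]  # arrival time of each streamed output token


def ttft(trace: RequestTrace) -> float:
    """Time to first token: first token arrival minus request send time."""
    if not trace.token_times:
        raise ValueError("no output tokens recorded")
    return trace.token_times[0] - trace.sent_at


def tpot(trace: RequestTrace) -> float:
    """Time per output token, averaged over the tokens after the first."""
    n = len(trace.token_times)
    if n < 2:
        raise ValueError("need at least two output tokens to compute TPOT")
    return (trace.token_times[-1] - trace.token_times[0]) / (n - 1)


# Made-up timestamps: first token after 100 ms, then one token every 20 ms.
trace = RequestTrace(sent_at=0.0, token_times=[0.10, 0.12, 0.14, 0.16])
print(f"TTFT = {ttft(trace) * 1e3:.1f} ms, TPOT = {tpot(trace) * 1e3:.1f} ms")
```

Sending one request at a time, as in the steps above, keeps queueing delay out of `sent_at`-to-first-token time, so the measured TTFT reflects the prefill step rather than waiting.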