
[Misc]: TTFT profiling with respect to prompt length #7635

Open
luowenjie14 opened this issue Aug 18, 2024 · 4 comments

Comments

@luowenjie14

Anything you want to discuss about vllm.

I am profiling TTFT and TPOT on my machine. I could not explain the behavior of TTFT, so I opened this issue to seek advice.

The figure below shows TTFT with respect to prompt length on my machine. The test conditions are as follows:

  • model: llama3-8B
  • GPU type: V100; the figure below shows the result for TP=2
  • dataset: ShareGPT

Steps taken for TTFT and TPOT profiling:

  1. Start the OpenAI-compatible API server using: python -m vllm.entrypoints.openai.api_server --args
  2. Iteratively run benchmark_serving.py to get TTFT and TPOT, each time sending only a single request to the server to eliminate the effect of queueing time (a minimal sketch of this single-request measurement is shown below).
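
For reference, a minimal sketch of what step 2 measures, assuming the server from step 1 is serving Llama3-8B at http://localhost:8000 and the openai Python client is installed (the model name and prompt are placeholders):

```python
import time

from openai import OpenAI

# Assumes the vLLM OpenAI-compatible server from step 1 is listening locally.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
MODEL = "meta-llama/Meta-Llama-3-8B"  # placeholder: whatever the server is serving


def measure_ttft_tpot(prompt: str, max_tokens: int = 128):
    """Send one streaming request and time the first token and the rest."""
    start = time.perf_counter()
    first_token_time = None
    num_chunks = 0

    stream = client.completions.create(
        model=MODEL,
        prompt=prompt,
        max_tokens=max_tokens,
        temperature=0.0,
        stream=True,
    )
    for _chunk in stream:
        now = time.perf_counter()
        if first_token_time is None:
            first_token_time = now
        num_chunks += 1
    end = time.perf_counter()

    ttft = first_token_time - start
    # TPOT: average time per output token after the first one
    # (each streamed chunk is treated as one token here).
    tpot = (end - first_token_time) / max(num_chunks - 1, 1)
    return ttft, tpot


ttft, tpot = measure_ttft_tpot("Summarize the history of GPUs in one paragraph.")
print(f"TTFT: {ttft * 1e3:.1f} ms, TPOT: {tpot * 1e3:.2f} ms")
```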

The profiled TTFT is shown below.
Observation 1: when the prompt length is less than 400, TTFT seems to be a flat value of ~100 ms. This value is consistent across different TP settings (tried TP=1, TP=2, and TP=4).
Observation 2: when the prompt length is greater than 400, TTFT is linear in prompt length. This result is in line with Figure 6b of this paper (https://arxiv.org/pdf/2405.06856).

I don't understand the result of observation 1. Can anyone provide some insight? What causes TTFT to be a horizontal line when the prompt length is less than 400?
(figure: measured TTFT vs. prompt length)

@ywang96 (Member) commented Aug 18, 2024

There are a few things to discuss here:

  1. ShareGPT is a dataset with prompts of different lengths. I suggest benchmarking with the sonnet dataset or the random dataset, where you can specify the length of all prompts (see the rough sketch after this list).
  2. What are your default args for launching the model server? For Llama3-8B, something to keep in mind is that chunked prefill is enabled by default for models with a long context window (>32768).
  3. TTFT measures the overall latency to the first token, so there's some static level of latency (though it definitely wouldn't be as high as 100 ms) introduced by other components of the server (API server, scheduler, preprocessing, etc.).
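
To illustrate point 1, here is a rough sketch of building synthetic prompts of a chosen token length with the model's tokenizer, similar in spirit to what the random dataset option does (the tokenizer name is a placeholder, and the random ids are re-encoded because decode/encode round-trips are not exactly length-preserving):

```python
import random

from transformers import AutoTokenizer

# Placeholder: use the tokenizer of the model the server is actually serving.
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B")


def random_prompt(target_len: int, seed: int = 0) -> str:
    """Build a prompt that tokenizes to roughly `target_len` tokens."""
    rng = random.Random(seed)
    # Sample random token ids, skipping the low end of the vocab where
    # special tokens tend to live.
    token_ids = [rng.randrange(1000, tokenizer.vocab_size) for _ in range(target_len)]
    text = tokenizer.decode(token_ids)
    # Re-encode and trim: the round-trip can change the token count slightly.
    token_ids = tokenizer(text, add_special_tokens=False).input_ids[:target_len]
    return tokenizer.decode(token_ids)


for n in (100, 200, 400, 800, 1600):
    prompt = random_prompt(n)
    print(n, len(tokenizer(prompt, add_special_tokens=False).input_ids))
```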

@luowenjie14 (Author)

Thank you @ywang96 for the suggestions, here is some additional info from me:
For point 1: I do sample requests with a specified prompt length from ShareGPT; for example, if I test prompt_length 100, I sample the request whose prompt length is closest to it.
For point 2: since I only test one request at a time, will chunked prefill affect the result? In addition, the total context length of the tested requests is less than 32768.
For point 3: I did measure the HTTP API server's latency (~8 ms), so it does not contribute much. I think I will evaluate the scheduler and preprocessing time to get a cleaner TTFT.
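
One way to estimate that static component (API server + scheduler + preprocessing, with essentially no prefill) is to time a tiny request with max_tokens=1. This is only a sketch, reusing the same placeholder server address and model name as above:

```python
import time

from openai import OpenAI

# Placeholder address/model: point these at the actual server under test.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
MODEL = "meta-llama/Meta-Llama-3-8B"


def static_overhead_ms(trials: int = 20) -> float:
    """Median latency of a near-empty request, approximating non-prefill overhead."""
    latencies = []
    for _ in range(trials):
        start = time.perf_counter()
        client.completions.create(model=MODEL, prompt="Hi", max_tokens=1, temperature=0.0)
        latencies.append(time.perf_counter() - start)
    latencies.sort()
    return latencies[len(latencies) // 2] * 1e3  # median, in ms


print(f"approx. static overhead: {static_overhead_ms():.1f} ms")
```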

@George-ao

Probably it enters the compute-bound region when the prompt length reaches 400.
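
A rough roofline-style sketch of that intuition, with assumed numbers (V100 FP16 tensor-core peak ~112 TFLOPS, ~900 GB/s HBM2 bandwidth, Llama3-8B ~8B FP16 parameters split over TP=2). With these ideal peaks the flat level comes out around 9 ms and the crossover around 120 tokens, well below the measured ~100 ms and ~400 tokens; fixed framework overhead and the lower GEMM efficiency achievable at short prompt lengths push both upward, so this only illustrates why a flat-then-linear curve is expected:

```python
# Back-of-envelope prefill model, per GPU with TP=2 (all numbers are assumptions).
PEAK_FLOPS = 112e12        # assumed V100 FP16 tensor-core peak, FLOP/s
MEM_BW = 900e9             # assumed HBM2 bandwidth, bytes/s
PARAMS_PER_GPU = 8e9 / 2   # Llama3-8B weights split across TP=2
BYTES_PER_PARAM = 2        # FP16


def prefill_time_ms(prompt_len: int) -> float:
    # Weights are read once per forward pass, regardless of prompt length.
    mem_time = PARAMS_PER_GPU * BYTES_PER_PARAM / MEM_BW
    # Roughly 2 FLOPs per parameter per prompt token (ignoring attention).
    compute_time = 2 * PARAMS_PER_GPU * prompt_len / PEAK_FLOPS
    # Whichever bound dominates sets the (idealized) prefill time.
    return max(mem_time, compute_time) * 1e3


for n in (100, 200, 400, 800, 1600):
    print(n, f"{prefill_time_ms(n):.1f} ms")
```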


This issue has been automatically marked as stale because it has not had any activity within 90 days. It will be automatically closed if no further activity occurs within 30 days. Leave a comment if you feel this issue should remain open. Thank you!

github-actions bot added the stale label Nov 18, 2024