[V1] Use pickle for serializing EngineCoreRequest & Add multimodal inputs to EngineCoreRequest #10245

WoosukKwon · 2024-11-12T06:37:11Z

This PR adds multimodal inputs to EngineCoreRequest before adding proper support for VLMs in #9871.

Since the multimodal inputs include types incompatible with msgspec (e.g., PIL images), we use pickle for serializing/de-serializing EngineCoreRequest.
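As a rough illustration of the trade-off described above, here is a minimal standard-library sketch. Everything in it is an assumption made for illustration: `SimpleNamespace` stands in for a PIL image, the dictionary keys are hypothetical rather than the actual EngineCoreRequest fields, and `json` stands in for a schema-based encoder such as msgspec. It only shows that pickle can round-trip an arbitrary nested Python object that a JSON-style encoder rejects.

```python
import json
import pickle
from types import SimpleNamespace

# Hypothetical stand-in for a PIL image: an arbitrary Python object
# that a JSON-style encoder has no built-in way to serialize.
image = SimpleNamespace(size=(2, 2), mode="L", pixels=bytes([0, 64, 128, 255]))

# Hypothetical request payload; the keys are illustrative, not vLLM's fields.
request = {
    "request_id": "req-0",
    "prompt_token_ids": [1, 2, 3],
    "mm_data": {"image": image},
}

# A plain JSON encoder rejects the nested object with a TypeError.
try:
    json.dumps(request)
    json_ok = True
except TypeError:
    json_ok = False

# pickle round-trips the whole request, including the nested object.
payload = pickle.dumps(request, protocol=pickle.HIGHEST_PROTOCOL)
restored = pickle.loads(payload)

print("JSON encodable:", json_ok)
print("pickle round-trip equal:", restored == request)
```

Note that `pickle.loads` can execute arbitrary code, so this approach only suits transport between trusted processes.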

Signed-off-by: Woosuk Kwon <[email protected]>

github-actions · 2024-11-12T06:37:25Z

👋 Hi! Thank you for contributing to the vLLM project.
Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, which covers a small and essential subset of CI tests to quickly catch errors. You can run other CI tests on top of those by going to your fastcheck build in the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can do one of these:

Add ready label to the PR
Enable auto-merge.

🚀


ywang96

I ran this branch vs main on 1xH100 with vllm serve meta-llama/Llama-3.1-8B-Instruct.

Benchmark command is

python3 vllm/benchmarks/benchmark_serving.py \
    --model meta-llama/Llama-3.1-8B-Instruct \
    --dataset-name sonnet \
    --dataset-path vllm/benchmarks/sonnet.txt \
    --num-prompts 1000 \
    --request-rate 40

On main:

============ Serving Benchmark Result ============
Successful requests:                     1000      
Benchmark duration (s):                  29.39     
Total input tokens:                      509089    
Total generated tokens:                  150000    
Request throughput (req/s):              34.02     
Output token throughput (tok/s):         5103.02   
Total Token throughput (tok/s):          22422.30  
---------------Time to First Token----------------
Mean TTFT (ms):                          221.11    
Median TTFT (ms):                        195.76    
P99 TTFT (ms):                           501.90    
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          36.36     
Median TPOT (ms):                        37.82     
P99 TPOT (ms):                           44.80     
---------------Inter-token Latency----------------
Mean ITL (ms):                           36.23     
Median ITL (ms):                         22.06     
P99 ITL (ms):                            317.37    
==================================================

On this branch (v1-pickle):

============ Serving Benchmark Result ============
Successful requests:                     1000      
Benchmark duration (s):                  29.46     
Total input tokens:                      509089    
Total generated tokens:                  150000    
Request throughput (req/s):              33.94     
Output token throughput (tok/s):         5090.95   
Total Token throughput (tok/s):          22369.28  
---------------Time to First Token----------------
Mean TTFT (ms):                          235.23    
Median TTFT (ms):                        213.75    
P99 TTFT (ms):                           531.09    
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          36.91     
Median TPOT (ms):                        38.43     
P99 TPOT (ms):                           45.99     
---------------Inter-token Latency----------------
Mean ITL (ms):                           36.77     
Median ITL (ms):                         22.82     
P99 ITL (ms):                            321.38    
==================================================

I have run this benchmark repeatedly on each branch, and there is a measurable performance hit from using pickle. However, from what I observed the hit is probably around 1%, so I think this change is acceptable.
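For reference, the relative differences between the two tables above can be computed directly. The sketch below uses only figures copied from those tables; the metric selection and the helper name `pct_change` are choices made here for illustration. These are single-run numbers and latency metrics such as TTFT are noisy, so they are not a substitute for the repeated runs mentioned above.

```python
# Figures copied from the two benchmark tables above: (main, v1-pickle).
results = {
    "Request throughput (req/s)": (34.02, 33.94),
    "Mean TTFT (ms)": (221.11, 235.23),
    "Mean TPOT (ms)": (36.36, 36.91),
    "Mean ITL (ms)": (36.23, 36.77),
}


def pct_change(baseline: float, candidate: float) -> float:
    """Relative change of candidate versus baseline, in percent."""
    return (candidate - baseline) / baseline * 100.0


changes = {name: pct_change(*pair) for name, pair in results.items()}

for name, delta in changes.items():
    print(f"{name:30s} {delta:+.2f}%")
```

In this particular pair of runs, request throughput differs by roughly 0.2%, mean TPOT and ITL by roughly 1.5%, and mean TTFT by roughly 6%.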

An alternative solution would be to encode the multimodal data as base64 strings and then use msgspec. This would leave text-only inference essentially unaffected, but I'm not sure about the performance implications for multimodal models or whether it's worth the complexity.
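A minimal sketch of that base64 idea, under stated assumptions: `json` from the standard library stands in for msgspec, and the function names and message keys (`encode_request`, `decode_request`, `image_b64`) are hypothetical. It is not how vLLM implements anything; it only shows that text-only requests skip the extra step while image bytes grow by about one third when base64-encoded.

```python
import base64
import json
from typing import Optional


def encode_request(request_id: str, prompt_token_ids: list,
                   image_bytes: Optional[bytes] = None) -> bytes:
    """Serialize a request, carrying raw image bytes as a base64 string."""
    msg = {"request_id": request_id, "prompt_token_ids": prompt_token_ids}
    if image_bytes is not None:
        # Text-only requests never reach this branch, so they pay no extra cost.
        msg["image_b64"] = base64.b64encode(image_bytes).decode("ascii")
    return json.dumps(msg).encode("utf-8")


def decode_request(buf: bytes) -> dict:
    """Inverse of encode_request: restore raw image bytes if present."""
    msg = json.loads(buf)
    if "image_b64" in msg:
        msg["image_bytes"] = base64.b64decode(msg.pop("image_b64"))
    return msg


raw = bytes(range(256)) + bytes(44)  # 300 bytes of hypothetical image data
text_only = decode_request(encode_request("req-0", [1, 2, 3]))
with_image = decode_request(encode_request("req-1", [4, 5], raw))

# base64 maps every 3 input bytes to 4 output characters.
encoded_len = len(base64.b64encode(raw))
print(len(raw), "raw bytes ->", encoded_len, "base64 characters")
```

The size overhead, plus the encode and decode passes on both sides, is the cost that would need to be measured for multimodal workloads.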

WoosukKwon · 2024-11-12T09:55:18Z

@ywang96 Thanks so much for profiling it!

njhill · 2024-11-12T17:50:25Z

Thanks @WoosukKwon @ywang96. I think this is a good temporary solution, and I'm hopeful we can address the perf difference with minimal complexity.


[V1] Use Pickle for serializing EngineCoreRequest

dd95be0

Signed-off-by: Woosuk Kwon <[email protected]>

dataclass

5b91864

Signed-off-by: Woosuk Kwon <[email protected]>

WoosukKwon requested review from robertgshaw2-neuralmagic, njhill and ywang96 November 12, 2024 06:46

ywang96 approved these changes Nov 12, 2024


WoosukKwon merged commit 7c65527 into main Nov 12, 2024
20 of 22 checks passed

WoosukKwon deleted the v1-pickle branch November 12, 2024 16:57

WoosukKwon mentioned this pull request Nov 13, 2024

[V1] Fix CI tests on V1 engine #10272

Merged

rickyyx pushed a commit to rickyyx/vllm that referenced this pull request Nov 13, 2024

[V1] Use pickle for serializing EngineCoreRequest & Add multimodal in…

a8f6d4f

…puts to EngineCoreRequest (vllm-project#10245) Signed-off-by: Woosuk Kwon <[email protected]>

KuntaiDu pushed a commit to KuntaiDu/vllm that referenced this pull request Nov 20, 2024

[V1] Use pickle for serializing EngineCoreRequest & Add multimodal in…

9ba1223

…puts to EngineCoreRequest (vllm-project#10245) Signed-off-by: Woosuk Kwon <[email protected]>
