
Conversation

@njhill (Member) commented Oct 16, 2025

Avoid serialization + copy overhead when the pickled payload contains large buffers by exploiting PEP 574 out-of-band buffer support (https://peps.python.org/pep-0574/).

This is beneficial, for example, when broadcasting bitmask arrays with the multiproc executor.
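The mechanism can be sketched with plain `pickle` (a hedged illustration, not vLLM's actual code; the bytearray stands in for a bitmask array):

```python
import pickle

# Stand-in for a large bitmask array. Wrapping it in PickleBuffer lets
# pickle protocol 5 (PEP 574) hand it to buffer_callback instead of
# copying it into the pickle stream.
big = bytearray(b"\x01" * 1024)
payload = {"mask": pickle.PickleBuffer(big)}

buffers: list[pickle.PickleBuffer] = []
meta = pickle.dumps(payload, protocol=5, buffer_callback=buffers.append)

# The metadata stream stays tiny; the buffer travels out-of-band and
# could be sent as extra ZMQ frames or placed directly in shared memory.
frames = [meta] + [bytes(b) for b in buffers]  # simulate a multipart send
restored = pickle.loads(frames[0], buffers=frames[1:])
assert bytes(restored["mask"]) == bytes(big)
```

Note that only objects exposing `PickleBuffer` views (e.g. numpy arrays under protocol 5) take the out-of-band path; ordinary `bytes` are still serialized in-band.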

Benchmark using structured outputs + multiproc executor:

canhazgpu run --gpus 1 -- vllm serve Qwen/Qwen3-1.7B --uvicorn-log-level=error --no-enable-prefix-caching --distributed-executor-backend mp

python3 benchmarks/benchmark_serving_structured_output.py --backend vllm --model Qwen/Qwen3-1.7B --structured-output-ratio 0.8 --request-rate 120 --max-concurrency 800 --num-prompts 5000 --json-schema-path ./test3.json  --output-len 128

Before:

============ Serving Benchmark Result ============
Successful requests:                     5000      
Maximum request concurrency:             800       
Request rate configured (RPS):           120.00    
Benchmark duration (s):                  83.96     
Total input tokens:                      2405000   
Total generated tokens:                  639936    
Request throughput (req/s):              59.55     
Output token throughput (tok/s):         7621.48   
Total Token throughput (tok/s):          36264.42  
---------------Time to First Token----------------
Mean TTFT (ms):                          327.90    
Median TTFT (ms):                        296.12    
P99 TTFT (ms):                           674.63    
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          96.30     
Median TPOT (ms):                        102.40    
P99 TPOT (ms):                           106.44    
---------------Inter-token Latency----------------
Mean ITL (ms):                           96.54     
Median ITL (ms):                         92.78     
P99 ITL (ms):                            251.71    
==================================================
correct_rate(%) 99.6 

After:

============ Serving Benchmark Result ============
Successful requests:                     5000      
Maximum request concurrency:             800       
Request rate configured (RPS):           120.00    
Benchmark duration (s):                  71.28     
Total input tokens:                      2405000   
Total generated tokens:                  639937    
Request throughput (req/s):              70.15     
Output token throughput (tok/s):         8977.91   
Total Token throughput (tok/s):          42718.52  
---------------Time to First Token----------------
Mean TTFT (ms):                          251.35    
Median TTFT (ms):                        236.60    
P99 TTFT (ms):                           586.14    
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          80.32     
Median TPOT (ms):                        86.29     
P99 TPOT (ms):                           90.48     
---------------Inter-token Latency----------------
Mean ITL (ms):                           80.32     
Median ITL (ms):                         78.74     
P99 ITL (ms):                            198.15    
==================================================
correct_rate(%) 99.58 

Avoid serialization + copy overhead when pickled payload contains large buffers.

This is beneficial for example when broadcasting bitmask arrays with the multiproc executor.

Signed-off-by: Nick Hill <[email protected]>
@gemini-code-assist (bot) left a comment


Code Review

This pull request introduces a performance optimization for shared memory broadcasting by leveraging pickle's out-of-band (OOB) buffer support. This avoids serialization and copy overhead for large buffers, which is a great improvement. The implementation correctly adapts both shared memory and ZMQ communication paths to handle multipart data. However, I've identified a critical bug in the size calculation for shared memory writes that could lead to a buffer overflow and crash the writer process. A fix is suggested to address this.

@chatgpt-codex-connector (bot) left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.


@njhill added the `ready` label (ONLY add when PR is ready to merge/full CI is needed) on Oct 16, 2025
Signed-off-by: Nick Hill <[email protected]>
Signed-off-by: Nick Hill <[email protected]>
@russellb (Member) left a comment


This is really nice!

One minor question and one minor suggestion, otherwise lgtm!

      n_local_reader,  # number of local readers through shared memory
      local_reader_ranks: list[int] | None = None,
-     max_chunk_bytes: int = 1024 * 1024 * 10,
+     max_chunk_bytes: int = 1024 * 1024 * 24,  # 24MiB
@russellb (Member):

just curious, what led to this change? What's special about 24 MB?

@njhill (Member, Author):

It's just to ensure that the largest bitmasks are covered; I observed that they could be up to ~20MiB.

@russellb (Member):

got it. It would be helpful to leave a comment in the code to explain where the number came from, and that it's "big enough based on observation" versus some observed technical limitation of the shared memory pathway.
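A hedged sketch of what such a comment could say (the constant name is illustrative; the ~20MiB figure comes from the discussion above):

```python
# Default chunk size for the shared-memory message queue. 24 MiB is
# "big enough based on observation" rather than a technical limit of
# the shared-memory pathway: structured-output bitmask payloads were
# observed to reach ~20 MiB, so this leaves some headroom.
DEFAULT_MAX_CHUNK_BYTES = 1024 * 1024 * 24  # 24 MiB
```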

break

def enqueue(self, obj, timeout: float | None = None):
"""Write to message queue with optional timeout (in seconds)"""
@russellb (Member):

I think an expanded docstring here that gives an overview of the encoding format would be helpful here.

@vllm-bot vllm-bot merged commit ab81379 into vllm-project:main Oct 17, 2025
50 of 52 checks passed
@njhill njhill deleted the oob-pickle branch October 17, 2025 03:44
Zhuul pushed a commit to Zhuul/vllm that referenced this pull request Oct 17, 2025
njhill added a commit to njhill/vllm that referenced this pull request Oct 17, 2025
This is a follow-on to PRs vllm-project#26737 and vllm-project#26961 to add some clarifying comments that were suggested in review.

Signed-off-by: Nick Hill <[email protected]>
lywa1998 pushed a commit to lywa1998/vllm that referenced this pull request Oct 20, 2025
albertoperdomo2 pushed a commit to albertoperdomo2/vllm that referenced this pull request Oct 23, 2025
alhridoy pushed a commit to alhridoy/vllm that referenced this pull request Oct 24, 2025
xuebwang-amd pushed a commit to xuebwang-amd/vllm that referenced this pull request Oct 24, 2025
xuebwang-amd pushed a commit to xuebwang-amd/vllm that referenced this pull request Oct 24, 2025
0xrushi pushed a commit to 0xrushi/vllm that referenced this pull request Oct 26, 2025
0xrushi pushed a commit to 0xrushi/vllm that referenced this pull request Oct 26, 2025
rtourgeman pushed a commit to rtourgeman/vllm that referenced this pull request Nov 10, 2025
Zhathw pushed a commit to Zhathw/vllm that referenced this pull request Nov 12, 2025
devpatelio pushed a commit to SumanthRH/vllm that referenced this pull request Nov 29, 2025