[Perf] Exploit out-of-band buffers in shm_broadcast #26961
Avoid serialization + copy overhead when the pickled payload contains large buffers, by exploiting pickle's out-of-band buffer support (https://peps.python.org/pep-0574/). This is beneficial, for example, when broadcasting bitmask arrays with the multiproc executor. Signed-off-by: Nick Hill <[email protected]>
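For readers unfamiliar with the mechanism the PR exploits, here is a minimal, self-contained sketch of PEP 574 out-of-band pickling. The payload shape and sizes are illustrative (a plain `bytearray` stands in for the numpy bitmask arrays the PR targets); this is not the vLLM implementation itself.

```python
import pickle

# Stand-in for a large bitmask array; wrapped in PickleBuffer so that
# protocol 5 serializes it out-of-band instead of copying it into the
# pickle byte stream.
big = bytearray(20 * 1024 * 1024)
payload = {"rank": 0, "bitmask": pickle.PickleBuffer(big)}

# buffer_callback collects each out-of-band buffer as a separate frame.
buffers = []
data = pickle.dumps(payload, protocol=5, buffer_callback=buffers.append)

# The pickle stream itself stays tiny; the 20 MiB buffer is referenced,
# not copied.
assert len(data) < 1024
assert len(buffers) == 1

# Receiver side: hand the raw frames back in; the large buffer is
# reconstructed without an extra copy of its contents.
restored = pickle.loads(data, buffers=[b.raw() for b in buffers])
assert bytes(restored["bitmask"]) == bytes(big)
```

In a broadcast setting, `data` and each raw buffer can be sent as separate frames (e.g. ZMQ multipart messages, or adjacent regions of a shared-memory chunk), which is the shape of the change this PR makes.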
Code Review
This pull request introduces a performance optimization for shared memory broadcasting by leveraging pickle's out-of-band (OOB) buffer support. This avoids serialization and copy overhead for large buffers, which is a great improvement. The implementation correctly adapts both shared memory and ZMQ communication paths to handle multipart data. However, I've identified a critical bug in the size calculation for shared memory writes that could lead to a buffer overflow and crash the writer process. A fix is suggested to address this.
russellb left a comment
This is really nice!
One minor question and one minor suggestion, otherwise lgtm!
```diff
     n_local_reader,  # number of local readers through shared memory
     local_reader_ranks: list[int] | None = None,
-    max_chunk_bytes: int = 1024 * 1024 * 10,
+    max_chunk_bytes: int = 1024 * 1024 * 24,  # 24MiB
```
just curious, what led to this change? What's special about 24 MB?
It's just to ensure that the largest bitmasks are covered; I observed that they could be up to ~20MiB.
got it. It would be helpful to leave a comment in the code explaining where the number came from, and that it's "big enough based on observation" rather than some technical limitation of the shared memory pathway.
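As a rough sanity check on the ~20MiB figure, a structured-output token bitmask is typically one bit per vocabulary entry, packed into int32 words per request (the xgrammar-style layout). The vocabulary and batch sizes below are hypothetical, chosen only to show how a bitmask approaches that scale:

```python
import math

# Illustrative sizing, not taken from the PR: one bit per vocab token,
# packed into 32-bit words, one row per concurrent request.
vocab_size = 152_064   # e.g. a large modern tokenizer vocabulary
batch_size = 1024      # e.g. max concurrent requests

words_per_request = math.ceil(vocab_size / 32)
bitmask_bytes = batch_size * words_per_request * 4
print(f"{bitmask_bytes / 2**20:.1f} MiB")  # ~18.6 MiB
```

With these assumptions the bitmask lands just under the new 24MiB `max_chunk_bytes` default, consistent with the "big enough based on observation" rationale above.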
```python
                break

    def enqueue(self, obj, timeout: float | None = None):
        """Write to message queue with optional timeout (in seconds)"""
```
I think an expanded docstring that gives an overview of the encoding format would be helpful here.
This is a follow-on to PRs vllm-project#26737 and vllm-project#26961 to add some clarifying comments that were suggested in review. Signed-off-by: Nick Hill <[email protected]>
Benchmark using structured outputs + multiproc executor:
Before:
After: