[grpc] Support gRPC server entrypoint #30190
Conversation
Code Review
This pull request introduces a gRPC server entrypoint for vLLM, providing an alternative to the existing HTTP/REST API. This is a significant feature that enables more efficient communication through binary protocols and HTTP/2 multiplexing. The implementation is well-structured, with a dedicated GrpcRequestManager to handle the interaction with the vLLM engine, and a clean server implementation in grpc_server.py. The code includes graceful shutdown handling and client cancellation, which are important for a production-ready server.
My review focuses on improving robustness and security. I've identified a potential security vulnerability related to unlimited gRPC message sizes and several places where logging could be improved to include full tracebacks for easier debugging of production issues. These are important for maintaining a reliable and secure service.
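The message-size concern above is typically addressed with gRPC channel arguments when constructing the server. A minimal sketch, assuming an illustrative 32 MiB cap (this PR does not specify a value):

```python
# Channel arguments capping gRPC message sizes; the 32 MiB limit is an
# assumed example value, not taken from this PR.
MAX_MSG_BYTES = 32 * 1024 * 1024

GRPC_SERVER_OPTIONS = [
    # Reject inbound requests larger than the cap instead of buffering them.
    ("grpc.max_receive_message_length", MAX_MSG_BYTES),
    # Bound outbound responses as well.
    ("grpc.max_send_message_length", MAX_MSG_BYTES),
]

# These tuples would be passed as grpc.aio.server(options=GRPC_SERVER_OPTIONS)
# when building the server (the call itself is not made here).
```

Without such options, gRPC's defaults apply (4 MiB receive, unlimited send), so an explicit cap makes the limit auditable.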
💡 Codex Review
Here are some automated review suggestions for this pull request.
Hi @CatherineSue, the pre-commit checks have failed. Please run:

```
uv pip install pre-commit
pre-commit install
pre-commit run --all-files
```

Then, commit the changes and push to your branch. For future commits, …
Thanks @CatherineSue! I think this is definitely something worthwhile to add. It would be interesting to see how the performance compares in the same tests with e.g. …

Also, just for reference, here's a gRPC wrapper we had used in the past at IBM/Red Hat. It's probably quite similar, but might be useful to compare: https://github.com/opendatahub-io/vllm-tgis-adapter
Also, I think if this were to be added, it might make sense to make it part of the OpenAI API server. We merged the Anthropic endpoints into that entrypoint as well, so you can support both paths at the same time.
simon-mo left a comment
LGTM on the first pass, pending @njhill @robertgshaw2-redhat quick skims on grpc_request_manager.py and grpc_server.py for correct usage of AsyncLLM.
Another question is whether we should just enable this by default, so that `vllm serve` also starts the gRPC server on another port?
cc @robertgshaw2-redhat for the protocol to share with llm-d folks. I'm not expecting a lot of changes to this btw, but if there's some standard we can follow that will be useful as well.
There are some service endpoints that are missing (like /scale_elastic_ep), is the goal going to be 100% complete coverage?
```
// Generate Request
// =====================

message GenerateRequest {
```
@NickLucche as we discussed, we should review this, as it is probably the better approach in general for high-efficiency integration for disaggregation, the coordinator, and broader, more decoupled efforts.
The biggest question in both the http and grpc version of this is "how do we allow for reasonable custom fields for things like the coordinator, or to disaggregate how the openai response is generated".
I don't think proto is worse than http, but I do think we should be deliberate in the http schema evolution to avoid creating a mismatch between HTTP and gRPC, especially if we think most people would prefer to use the gRPC endpoint.
Can we start by centralizing the message definition such that the HTTP/gRPC interface definition is shared?

Re: custom fields on the protobuf side, would we just bite the bullet and use an `Any` field?
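If an `Any` field were adopted, the schema side might look like this sketch (field name and number are hypothetical, not from this PR's proto):

```proto
import "google/protobuf/any.proto";

message GenerateRequest {
  string request_id = 1;
  // Opaque, component-specific extensions (e.g. coordinator metadata);
  // each consumer unpacks only the message types it recognizes.
  repeated google.protobuf.Any extensions = 15;
}
```

The tradeoff is that `Any` defers type checking to the consumer, so mismatched extension types surface at runtime rather than at codegen time.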
vllm/grpc/vllm_engine.proto
Outdated
```
string request_id = 1;

// Pre-tokenized input (required)
TokenizedInput tokenized = 2;
```
I guess there would be no harm in supporting either text or token ids input?
That would require the gRPC server to include a tokenizer.

This can be added in the future; it's currently not supported. (One of the reasons to have gRPC is so that the OpenAI server and other related components can be written entirely in Rust or other languages, which is "more" production-ready.)
You can also pass a string prompt to AsyncLLM.generate(), so it would be trivial to expose this as an option so that the gRPC API could also be used in a standalone manner if desired (albeit probably not recommended for performance).
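A hypothetical helper (not part of this PR) shows how a servicer could accept either form, mirroring a proto `oneof` of text and token ids:

```python
from typing import Optional, Sequence, Union

def resolve_prompt(
    text: Optional[str],
    token_ids: Optional[Sequence[int]],
) -> Union[str, list]:
    """Return exactly one prompt form to hand to AsyncLLM.generate().

    Mirrors a proto oneof: exactly one of the two inputs must be set.
    """
    if (text is None) == (token_ids is None):
        raise ValueError("set exactly one of text or token_ids")
    return text if text is not None else list(token_ids)
```

Accepting text would make the gRPC API usable standalone, at the cost of running the tokenizer in-process as discussed above.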
There are currently no tests for this, which I highly recommend adding before this is merged.
> Another question is whether we should just enable this by default so at `vllm serve` we also start the gRPC server at another port
@simon-mo I would be in favor of opt-in first (enable with a flag, or explicitly start the entrypoint) and making it the default in a later release.
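An opt-in surface could look roughly like this sketch (flag names are hypothetical and not defined by this PR):

```python
import argparse

# Hypothetical CLI surface: gRPC stays off unless explicitly requested.
parser = argparse.ArgumentParser(prog="vllm-serve-sketch")
parser.add_argument("--enable-grpc", action="store_true",
                    help="also start the gRPC server (opt-in)")
parser.add_argument("--grpc-port", type=int, default=50051,
                    help="port for the gRPC server when enabled")

# Example invocation: opt in and override the port.
args = parser.parse_args(["--enable-grpc", "--grpc-port", "8081"])
assert args.enable_grpc and args.grpc_port == 8081
```

Flipping the `store_true` default to `True` in a later release would implement the "default on later" plan without changing the flag shape.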
vllm/grpc/grpc_request_manager.py
Outdated
```
# 1. Create a ParentRequest to track all child requests
# 2. Fan out multiple child EngineCoreRequests with different
#    request_index values
# 3. Aggregate outputs from all children
# For now, we only support n=1, so parent_req=None and
# request_index=0
```
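The n>1 plan in the comment above could be prototyped as follows; `fan_out` and `aggregate` are hypothetical stand-ins for the ParentRequest bookkeeping, not code from this PR:

```python
from collections import defaultdict

def fan_out(request_id, n):
    """Derive (child_id, request_index) pairs for n sampled completions."""
    return [(f"{request_id}-{i}", i) for i in range(n)]

def aggregate(chunks):
    """Merge streamed (request_index, text) chunks into per-child outputs,
    ordered by request_index."""
    per_child = defaultdict(list)
    for idx, text in chunks:
        per_child[idx].append(text)
    return ["".join(per_child[i]) for i in sorted(per_child)]
```

With n=1 this degenerates to a single child with request_index=0, which matches the currently supported path.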
It would also be great to add tests, but we can do that as a follow-on if needed.
Signed-off-by: Chang Su <chang.s.su@oracle.com>
Signed-off-by: Nick Hill <nickhill123@gmail.com>
This PR is failing AMD CI: https://buildkite.com/vllm/amd-ci/builds/2524/steps/canvas?jid=019b9d04-99d5-4f11-a7c6-f524c5e5b35e
Fixed by #31970
Signed-off-by: Chang Su <chang.s.su@oracle.com> Signed-off-by: njhill <nickhill123@gmail.com> Signed-off-by: Nick Hill <nickhill123@gmail.com> Co-authored-by: njhill <nickhill123@gmail.com> Co-authored-by: Simon Mo <simon.mo@hey.com>
Signed-off-by: dsuhinin <suhinin.dmitriy@gmail.com>
Purpose
Add gRPC server support to vLLM, enabling the community to integrate vLLM via the gRPC protocol from any upstream application or routing layer.
Key Benefits:

- Native gRPC protocol support
- Integration with `sgl-model-gateway`

Changed Files

Protocol & Codegen:

- `vllm_scheduler.proto` - Protocol buffer definition (source)
- `vllm_scheduler_pb2.py` - Generated protobuf messages (auto-generated)
- `vllm_scheduler_pb2_grpc.py` - Generated gRPC service (auto-generated)
- `compile_protos.py` - Script to compile proto files
- `__init__.py` - Module initialization

Server Implementation:

- `vllm/grpc/grpc_request_manager.py` - Request manager (`GrpcRequestManager` class)
- `vllm/entrypoints/grpc_server.py` - Server entrypoint (`VllmSchedulerServicer` + main)

Compilation
To regenerate the Python code from the `.proto` file:

Requirements: `pip install grpcio-tools`

This generates:

- `vllm_scheduler_pb2.py` - Message classes
- `vllm_scheduler_pb2_grpc.py` - Service stubs and servicers
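`compile_protos.py` wraps the codegen; a hand-run equivalent could assemble the `grpc_tools.protoc` invocation like this sketch (the include path is an assumption based on the changed-file list, and the command is constructed but not executed here):

```python
# Sketch of what compile_protos.py presumably does; paths are assumptions
# based on the changed-file list (vllm/grpc/vllm_scheduler.proto).
PROTO_DIR = "vllm/grpc"
PROTO_FILE = f"{PROTO_DIR}/vllm_scheduler.proto"

# grpc_tools.protoc is provided by the grpcio-tools package.
PROTOC_ARGV = [
    "python3", "-m", "grpc_tools.protoc",
    f"-I{PROTO_DIR}",
    f"--python_out={PROTO_DIR}",       # emits vllm_scheduler_pb2.py
    f"--grpc_python_out={PROTO_DIR}",  # emits vllm_scheduler_pb2_grpc.py
    PROTO_FILE,
]
# e.g. subprocess.run(PROTOC_ARGV, check=True) would invoke the compiler.
```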
Test Plan

Run the gRPC server with `Llama-3.1-8B-Instruct`:

```
python3 -m vllm.entrypoints.grpc_server \
    --model meta-llama/Llama-3.1-8B-Instruct \
    --port 8080 \
    --host 0.0.0.0 \
    --tensor-parallel-size 1
```

Test gateway integration with `sgl-model-gateway`.

Verify:

- `GetModelInfo` and `GetServerInfo` return correct metadata
Test Result

We used `genai-bench` to measure the http_server vs (grpc_server + `sgl-model-gateway`) with `Llama-3.3-70B-Instruct` on 4xH100.

Performance Results (Llama-3.3-70B, D100_100, Concurrency 256):

At high concurrency, gRPC demonstrates superior production characteristics:
Key Value Proposition: