[grpc] Support gRPC server entrypoint #30190

Merged
vllm-bot merged 31 commits into vllm-project:main from CatherineSue:vllm-grpc-upstream on Jan 8, 2026

CatherineSue (Contributor) commented Dec 6, 2025

Purpose

Add gRPC server support to vLLM, enabling the community to integrate vLLM via the gRPC protocol with any upstream application or routing layer.

Key Benefits:

  1. Native gRPC Protocol Support

    • Enables upstream applications to connect via gRPC/Protobuf instead of HTTP/JSON
    • Binary protocol reduces serialization overhead
    • HTTP/2 multiplexing improves connection efficiency
    • Expands vLLM's integration options beyond HTTP/REST APIs
  2. Integration with sgl-model-gateway

    • Enables vLLM workers to operate as gRPC backends
    • Bypasses Python GIL bottleneck by moving tokenization logic to Rust
    • Provides production-grade features: advanced routing, secured MCP and database management, and the Responses API
    • Measured performance gains at high concurrency (see Test Results)
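
To make the "binary protocol reduces serialization overhead" point concrete, here is a minimal sketch comparing the size of a short list of token IDs as JSON text and as Protobuf-style varints. It is an illustration only: the token IDs are made up, `encode_varint` and `packed_size` are hypothetical helper names, and the comparison ignores field tags, length prefixes, and HTTP/2 framing, so it is not a benchmark of the PR.

```python
import json


def encode_varint(value: int) -> bytes:
    """Encode a non-negative integer as a base-128 varint (Protobuf wire format)."""
    if value < 0:
        raise ValueError("this sketch only handles non-negative integers")
    out = bytearray()
    while True:
        byte = value & 0x7F          # take the low 7 bits
        value >>= 7
        if value:
            out.append(byte | 0x80)  # more bytes follow: set the continuation bit
        else:
            out.append(byte)         # last byte: continuation bit stays clear
            return bytes(out)


def packed_size(token_ids: list[int]) -> int:
    """Payload size in bytes of the IDs when each is varint-encoded."""
    return sum(len(encode_varint(t)) for t in token_ids)


# Made-up token IDs, for illustration only.
token_ids = [128000, 9906, 1917, 0]
json_bytes = len(json.dumps(token_ids).encode("utf-8"))
binary_bytes = packed_size(token_ids)
print(f"JSON: {json_bytes} bytes, varint payload: {binary_bytes} bytes")
```

The gap grows with sequence length, since every ID in JSON also pays for a separator and decimal digits.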

Changed Files

Protocol & Codegen:

  • vllm_scheduler.proto - Protocol buffer definition (source)
  • vllm_scheduler_pb2.py - Generated protobuf messages (auto-generated)
  • vllm_scheduler_pb2_grpc.py - Generated gRPC service (auto-generated)
  • compile_protos.py - Script to compile proto files
  • __init__.py - Module initialization

Server Implementation:

  • vllm/grpc/grpc_request_manager.py - Request manager (GrpcRequestManager class)
  • vllm/entrypoints/grpc_server.py - Server entrypoint (VllmSchedulerServicer + main)

Compilation

To regenerate the Python code from the .proto file:

python vllm/grpc/compile_protos.py

Requirements: pip install grpcio-tools

This generates:

  • vllm_scheduler_pb2.py - Message classes
  • vllm_scheduler_pb2_grpc.py - Service stubs and servicers
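
For readers unfamiliar with grpcio-tools, the sketch below shows one way such a compile script could invoke the bundled protoc. It is an assumption about the general approach, not the contents of compile_protos.py: `build_protoc_args` and `compile_proto` are hypothetical names, and the choice of `--proto_path` (which affects the import lines in the generated files) is a guess.

```python
from pathlib import Path


def build_protoc_args(proto_file: Path, out_dir: Path) -> list[str]:
    """Build the argument list for grpc_tools.protoc (first item is the program name)."""
    return [
        "grpc_tools.protoc",
        f"--proto_path={proto_file.parent}",  # where protoc resolves the .proto file
        f"--python_out={out_dir}",            # emits *_pb2.py (message classes)
        f"--grpc_python_out={out_dir}",       # emits *_pb2_grpc.py (stubs and servicers)
        str(proto_file),
    ]


def compile_proto(proto_file: Path, out_dir: Path) -> int:
    """Run protoc; returns 0 on success. Requires `pip install grpcio-tools`."""
    from grpc_tools import protoc  # imported lazily so the helper above has no dependency

    return protoc.main(build_protoc_args(proto_file, out_dir))


if __name__ == "__main__":
    args = build_protoc_args(Path("vllm/grpc/vllm_scheduler.proto"), Path("vllm/grpc"))
    print(" ".join(args))
```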

Test Plan

Run the gRPC server with Llama-3.1-8B-Instruct:

  python3 -m vllm.entrypoints.grpc_server \
    --model meta-llama/Llama-3.1-8B-Instruct \
    --port 8080 \
    --host 0.0.0.0 \
    --tensor-parallel-size 1

Test the gateway integration with sgl-model-gateway.

Verify:

  1. Health check endpoint responds correctly
  2. Streaming generation returns token IDs (not text)
  3. gRPC reflection is available for introspection
  4. Request abort/cancellation works properly
  5. GetModelInfo and GetServerInfo return correct metadata
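
Item 4 (abort on cancellation) can be illustrated with plain asyncio. This is a minimal sketch under stated assumptions, not the PR's code: `FakeEngine`, `stream_tokens`, and `demo` are hypothetical names, and it assumes a client-side cancel reaches the streaming handler as `asyncio.CancelledError`.

```python
import asyncio


class FakeEngine:
    """Hypothetical stand-in for an inference engine; not vLLM's API."""

    def __init__(self) -> None:
        self.aborted: list[str] = []

    async def generate(self, request_id: str):
        token_id = 0
        while True:
            await asyncio.sleep(0)  # yield control, as a real engine would between steps
            yield token_id
            token_id += 1

    async def abort(self, request_id: str) -> None:
        self.aborted.append(request_id)


async def stream_tokens(engine: FakeEngine, request_id: str, out: list[int]) -> None:
    """Forward token IDs until finished; abort the engine request if cancelled."""
    try:
        async for token_id in engine.generate(request_id):
            out.append(token_id)
    except asyncio.CancelledError:
        await engine.abort(request_id)  # free engine resources for this request
        raise                           # keep cancellation semantics intact


async def demo() -> tuple[list[str], list[int]]:
    engine = FakeEngine()
    received: list[int] = []
    task = asyncio.create_task(stream_tokens(engine, "req-1", received))
    while len(received) < 3:            # wait until a few tokens have streamed
        await asyncio.sleep(0)
    task.cancel()                       # simulate the client cancelling the RPC
    try:
        await task
    except asyncio.CancelledError:
        pass
    return engine.aborted, received


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

Re-raising after the abort matters: swallowing `CancelledError` would make the handler look as if it completed normally.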

Test Result

We use genai-bench to compare http_server against grpc_server + sgl-model-gateway with Llama-3.3-70B-Instruct on 4xH100.

Performance Results (Llama-3.3-70B, D100_100, Concurrency 256):

At high concurrency, gRPC demonstrates superior production characteristics:

| Metric | gRPC | HTTP | Improvement |
|---|---|---|---|
| Throughput | 9,068 tok/s | 6,629 tok/s | +37% |
| Requests/sec | 45.7 | 33.4 | +37% |
| p99 TTFT | 1,792 ms | 2,434 ms | -26% |
| p90 TTFT | 1,728 ms | 2,188 ms | -21% |
| TTFT variance (stddev) | 428 ms | 651 ms | -34% |

Key Value Proposition:

  • Processes 37% more requests in the same time with 26% better tail latency (p99 TTFT)
  • 34% more consistent performance (lower variance)
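
The Improvement figures can be reproduced from the raw gRPC and HTTP numbers reported above. The sketch below assumes the column is the relative change against the HTTP baseline, rounded to the nearest percent; `pct_change` is a hypothetical helper name.

```python
def pct_change(grpc_value: float, http_value: float) -> int:
    """Relative change of the gRPC figure against the HTTP baseline, in whole percent."""
    return round((grpc_value - http_value) / http_value * 100)


# (gRPC, HTTP) pairs copied from the results table.
results = {
    "Throughput (tok/s)": (9068, 6629),
    "Requests/sec": (45.7, 33.4),
    "p99 TTFT (ms)": (1792, 2434),
    "p90 TTFT (ms)": (1728, 2188),
    "TTFT stddev (ms)": (428, 651),
}

for metric, (grpc_value, http_value) in results.items():
    print(f"{metric}: {pct_change(grpc_value, http_value):+d}%")
```

For throughput and requests/sec a positive value is better; for the TTFT rows a negative value is better.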
[Figures: D100_100_group_by_server_version_combined_plots_1x4, D100_1000_group_by_server_version_combined_plots_1x4]

gemini-code-assist bot (Contributor) left a comment

Code Review

This pull request introduces a gRPC server entrypoint for vLLM, providing an alternative to the existing HTTP/REST API. This is a significant feature that enables more efficient communication through binary protocols and HTTP/2 multiplexing. The implementation is well-structured, with a dedicated GrpcRequestManager to handle the interaction with the vLLM engine, and a clean server implementation in grpc_server.py. The code includes graceful shutdown handling and client cancellation, which are important for a production-ready server.

My review focuses on improving robustness and security. I've identified a potential security vulnerability related to unlimited gRPC message sizes and several places where logging could be improved to include full tracebacks for easier debugging of production issues. These are important for maintaining a reliable and secure service.
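
As an illustration of the message-size concern, the sketch below builds explicit limits for a `grpc.aio` server. The option keys are standard gRPC channel arguments (a value of -1 disables the limit); the 16 MiB cap and the helper names `bounded_server_options` and `make_server` are assumptions for the example, not values or functions from this PR.

```python
MAX_MESSAGE_BYTES = 16 * 1024 * 1024  # assumed cap for illustration; tune per deployment


def bounded_server_options(max_bytes: int = MAX_MESSAGE_BYTES) -> list[tuple[str, int]]:
    """Channel arguments that bound inbound and outbound message sizes."""
    if max_bytes <= 0:
        # -1 means "unlimited" in gRPC, which is what this helper is meant to avoid.
        raise ValueError("max_bytes must be a positive number of bytes")
    return [
        ("grpc.max_receive_message_length", max_bytes),
        ("grpc.max_send_message_length", max_bytes),
    ]


def make_server(max_bytes: int = MAX_MESSAGE_BYTES):
    """Create an asyncio gRPC server with bounded message sizes (requires grpcio)."""
    import grpc  # imported lazily so the options helper has no dependency

    return grpc.aio.server(options=bounded_server_options(max_bytes))


if __name__ == "__main__":
    print(bounded_server_options())
```

Requests larger than the receive limit are rejected by the gRPC runtime before the servicer code runs, which limits memory pressure from oversized payloads.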

chatgpt-codex-connector bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

mergify bot commented Dec 6, 2025

Hi @CatherineSue, the pre-commit checks have failed. Please run:

uv pip install pre-commit
pre-commit install
pre-commit run --all-files

Then, commit the changes and push to your branch.

For future commits, pre-commit will run automatically on changed files before each commit.

CatherineSue requested a review from hmellor as a code owner December 6, 2025 21:33

njhill (Member) commented Dec 7, 2025

Thanks @CatherineSue! I think this is definitely something worthwhile to add.

It would be interesting to see how the performance compares in the same tests with e.g. --api-server-scaleout=4.

Also just for reference here's a gRPC wrapper we had used in the past in IBM/Red Hat, probably quite similar but might be useful to compare https://github.com/opendatahub-io/vllm-tgis-adapter.

bbartels (Contributor) commented Dec 7, 2025

Also, I think if this were to be added, it might make sense to make it part of the OpenAI API server. We merged the Anthropic endpoints into that entrypoint as well, so you can support both paths at the same time.

simon-mo (Collaborator) left a comment

LGTM on the first pass, pending @njhill @robertgshaw2-redhat quick skims on grpc_request_manager.py and grpc_server.py for correct usage of AsyncLLM.

Another question is whether we should just enable this by default, so that vllm serve also starts the gRPC server on another port?

Collaborator

cc @robertgshaw2-redhat for the protocol to share with llm-d folks. I'm not expecting a lot of changes to this btw, but if there's some standard we can follow that will be useful as well.

wseaton (Contributor) commented Dec 10, 2025

There are some service endpoints that are missing (like /scale_elastic_ep); is the goal going to be 100% complete coverage?

simon-mo self-assigned this Dec 8, 2025
// Generate Request
// =====================

message GenerateRequest {
Contributor

@NickLucche as we discussed, we should review this as it is probably the better approach in general for high efficiency integration for both disaggregation, the coordinator, and broader more decoupled efforts.

The biggest question in both the http and grpc version of this is "how do we allow for reasonable custom fields for things like the coordinator, or to disaggregate how the openai response is generated".

I don't think proto is worse than http, but I do think we should be deliberate in the http schema evolution to avoid creating a mismatch between HTTP and gRPC, especially if we think most people would prefer to use the gRPC endpoint.

Collaborator

Can we start by centralizing the message definition so that the HTTP/gRPC interface definition is shared?
Re: custom fields on the protobuf side, would we just bite the bullet and use an Any field?

string request_id = 1;

// Pre-tokenized input (required)
TokenizedInput tokenized = 2;
Member

I guess there would be no harm in supporting either text or token ids input?

Contributor, Author

That would require the gRPC server to include a tokenizer.
This can be added in the future; it's currently not supported (one of the reasons to have gRPC is to have the OAI server and other related components fully written in Rust or other languages, which is "more" production ready).

Member

You can also pass a string prompt to AsyncLLM.generate() so it would be trivial to expose this as an option so that the gRPC API could also be used in a standalone manner if desired (albeit probably not recommended for performance).
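
A rough sketch of how a request carrying either text or token IDs could be mapped to a single prompt argument is shown below. It is hypothetical: `to_engine_prompt` is not part of the PR, and the `{"prompt_token_ids": ...}` dictionary shape is an assumption modelled on vLLM's tokens-prompt input, to be checked against the installed version.

```python
from typing import Optional, Sequence, Union

# Either a raw string prompt or a dict carrying pre-tokenized IDs (assumed shape).
EnginePrompt = Union[str, dict]


def to_engine_prompt(
    text: Optional[str] = None,
    token_ids: Optional[Sequence[int]] = None,
) -> EnginePrompt:
    """Return a prompt value for the engine from exactly one of text or token IDs."""
    if (text is None) == (token_ids is None):
        raise ValueError("provide exactly one of text or token_ids")
    if text is not None:
        return text  # the engine would tokenize this itself (slower path)
    return {"prompt_token_ids": list(token_ids)}  # pre-tokenized fast path


if __name__ == "__main__":
    print(to_engine_prompt(text="Hello"))
    print(to_engine_prompt(token_ids=[1, 2, 3]))
```

Rejecting requests that set both or neither keeps the two paths unambiguous for callers.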

dtrifiro (Contributor) commented

There are currently no tests for this, which I highly recommend having before this is merged.

NickLucche (Collaborator) left a comment

> Another question is whether we should just enable this by default so at vllm serve we also start gRPC server at another port

@simon-mo I would be in favor of opt-in first (enable with flag, or explicitly start entrypoint) and start as default in a later release.

Comment on lines +134 to +139
# 1. Create a ParentRequest to track all child requests
# 2. Fan out multiple child EngineCoreRequests with different
# request_index values
# 3. Aggregate outputs from all children
# For now, we only support n=1, so parent_req=None and
# request_index=0
Collaborator

plan looks good
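
The three-step plan in the quoted code comment (track a parent, fan out children with distinct request_index values, aggregate) could look roughly like the asyncio sketch below. It is a toy model under stated assumptions: `ParentTracker`, `run_child`, and `generate_n` are hypothetical names, the child ID format is invented, and `run_child` returns placeholder token IDs instead of calling an engine.

```python
import asyncio
from dataclasses import dataclass, field


@dataclass
class ParentTracker:
    """Hypothetical tracker for the n child requests of one parent request."""

    request_id: str
    n: int
    outputs: dict[int, list[int]] = field(default_factory=dict)

    def child_id(self, request_index: int) -> str:
        return f"{request_index}_{self.request_id}"  # invented ID scheme

    def record(self, request_index: int, token_ids: list[int]) -> None:
        self.outputs[request_index] = token_ids

    def finished(self) -> bool:
        return len(self.outputs) == self.n


async def run_child(child_id: str, request_index: int) -> list[int]:
    """Placeholder for submitting one child request to the engine."""
    await asyncio.sleep(0)
    return [request_index, request_index + 1]  # fake token IDs


async def generate_n(request_id: str, n: int) -> list[list[int]]:
    parent = ParentTracker(request_id, n)                      # 1. track the children
    results = await asyncio.gather(                            # 2. fan out
        *(run_child(parent.child_id(i), i) for i in range(n))
    )
    for request_index, token_ids in enumerate(results):        # 3. aggregate
        parent.record(request_index, token_ids)
    if not parent.finished():
        raise RuntimeError("some child requests did not report output")
    return [parent.outputs[i] for i in range(n)]


if __name__ == "__main__":
    print(asyncio.run(generate_n("req-1", 3)))
```

With n=1 this degenerates to a single child at request_index 0, which matches the case the comment says is currently supported.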

njhill (Member) commented Dec 24, 2025

It would also be great to add tests but we can do that as a follow-on if needed.

CatherineSue and others added 8 commits January 2, 2026 22:09
Signed-off-by: Chang Su <chang.s.su@oracle.com>
njhill self-assigned this Jan 5, 2026
simon-mo enabled auto-merge (squash) January 5, 2026 23:34
njhill disabled auto-merge January 7, 2026 20:07
njhill added 2 commits January 7, 2026 12:23
Signed-off-by: Nick Hill <nickhill123@gmail.com>
njhill enabled auto-merge (squash) January 7, 2026 20:50
vllm-bot merged commit 791b2fc into vllm-project:main Jan 8, 2026
91 of 93 checks passed
DarkLight1337 (Member) commented

cc @AndreasKaratzas

DarkLight1337 (Member) commented

Fixed by #31970

yugong333 pushed a commit to yugong333/vllm that referenced this pull request Jan 9, 2026
Signed-off-by: Chang Su <chang.s.su@oracle.com>
Signed-off-by: njhill <nickhill123@gmail.com>
Signed-off-by: Nick Hill <nickhill123@gmail.com>
Co-authored-by: njhill <nickhill123@gmail.com>
Co-authored-by: Simon Mo <simon.mo@hey.com>
akh64bit pushed a commit to akh64bit/vllm that referenced this pull request Jan 16, 2026
Signed-off-by: Chang Su <chang.s.su@oracle.com>
Signed-off-by: njhill <nickhill123@gmail.com>
Signed-off-by: Nick Hill <nickhill123@gmail.com>
Co-authored-by: njhill <nickhill123@gmail.com>
Co-authored-by: Simon Mo <simon.mo@hey.com>
dsuhinin pushed a commit to dsuhinin/vllm that referenced this pull request Jan 21, 2026
Signed-off-by: Chang Su <chang.s.su@oracle.com>
Signed-off-by: njhill <nickhill123@gmail.com>
Signed-off-by: Nick Hill <nickhill123@gmail.com>
Co-authored-by: njhill <nickhill123@gmail.com>
Co-authored-by: Simon Mo <simon.mo@hey.com>
Signed-off-by: dsuhinin <suhinin.dmitriy@gmail.com>
ItzDEXX pushed a commit to ItzDEXX/vllm that referenced this pull request Feb 19, 2026
Signed-off-by: Chang Su <chang.s.su@oracle.com>
Signed-off-by: njhill <nickhill123@gmail.com>
Signed-off-by: Nick Hill <nickhill123@gmail.com>
Co-authored-by: njhill <nickhill123@gmail.com>
Co-authored-by: Simon Mo <simon.mo@hey.com>
Labels

ci/build, frontend, ready
