[grpc] Support gRPC server entrypoint #30190
Conversation
Code Review
This pull request introduces a gRPC server entrypoint for vLLM, providing an alternative to the existing HTTP/REST API. This is a significant feature that enables more efficient communication through binary protocols and HTTP/2 multiplexing. The implementation is well-structured, with a dedicated GrpcRequestManager to handle the interaction with the vLLM engine, and a clean server implementation in grpc_server.py. The code includes graceful shutdown handling and client cancellation, which are important for a production-ready server.
My review focuses on improving robustness and security. I've identified a potential security vulnerability related to unlimited gRPC message sizes and several places where logging could be improved to include full tracebacks for easier debugging of production issues. These are important for maintaining a reliable and secure service.
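The message-size concern above is typically addressed with gRPC channel arguments when constructing the server. A minimal sketch, assuming an illustrative 32 MiB cap (this PR does not specify a value):

```python
# Channel arguments capping gRPC message sizes; the 32 MiB limit is an
# assumed example value, not taken from this PR.
MAX_MSG_BYTES = 32 * 1024 * 1024

GRPC_SERVER_OPTIONS = [
    # Reject inbound requests larger than the cap instead of buffering them.
    ("grpc.max_receive_message_length", MAX_MSG_BYTES),
    # Bound outbound responses as well.
    ("grpc.max_send_message_length", MAX_MSG_BYTES),
]

# These tuples would be passed as grpc.aio.server(options=GRPC_SERVER_OPTIONS)
# when building the server (the call itself is not made here).
```

Without such options, gRPC's defaults apply (4 MiB receive, unlimited send), so an explicit cap makes the limit auditable.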
💡 Codex Review
Here are some automated review suggestions for this pull request.
Hi @CatherineSue, the pre-commit checks have failed. Please run:

```
uv pip install pre-commit
pre-commit install
pre-commit run --all-files
```

Then, commit the changes and push to your branch. For future commits, …
Thanks @CatherineSue! I think this is definitely something worthwhile to add. It would be interesting to see how the performance compares in the same tests with e.g. …

Also, just for reference, here's a gRPC wrapper we had used in the past at IBM/Red Hat. It's probably quite similar, but might be useful to compare: https://github.com/opendatahub-io/vllm-tgis-adapter
Also, I think if this were to be added, it might make sense to make it part of the OpenAI API server. We merged the Anthropic endpoints into that entrypoint as well, so you can support both paths at the same time.
simon-mo left a comment
LGTM on the first pass, pending @njhill @robertgshaw2-redhat quick skims on grpc_request_manager.py and grpc_server.py for correct usage of AsyncLLM.
Another question is whether we should just enable this by default, so that `vllm serve` also starts the gRPC server on another port?
cc @robertgshaw2-redhat for the protocol to share with llm-d folks. I'm not expecting a lot of changes to this btw, but if there's some standard we can follow that will be useful as well.
There are some service endpoints that are missing (like /scale_elastic_ep), is the goal going to be 100% complete coverage?
```
// Generate Request
// =====================

message GenerateRequest {
```
@NickLucche as we discussed, we should review this, as it is probably the better approach in general for high-efficiency integration for disaggregation, the coordinator, and broader, more decoupled efforts.
The biggest question in both the http and grpc version of this is "how do we allow for reasonable custom fields for things like the coordinator, or to disaggregate how the openai response is generated".
I don't think proto is worse than http, but I do think we should be deliberate in the http schema evolution to avoid creating a mismatch between HTTP and gRPC, especially if we think most people would prefer to use the gRPC endpoint.
Can we start by centralizing the message definition such that the HTTP/gRPC interface definition is shared?

Re: custom fields on the protobuf side, would we just bite the bullet and use an `Any` field?
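If an `Any` field were adopted, the schema side might look like this sketch (field name and number are hypothetical, not from this PR's proto):

```proto
import "google/protobuf/any.proto";

message GenerateRequest {
  string request_id = 1;
  // Opaque, component-specific extensions (e.g. coordinator metadata);
  // each consumer unpacks only the message types it recognizes.
  repeated google.protobuf.Any extensions = 15;
}
```

The tradeoff is that `Any` defers type checking to the consumer, so mismatched extension types surface at runtime rather than at codegen time.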
vllm/grpc/vllm_engine.proto
Outdated
```
string request_id = 1;

// Pre-tokenized input (required)
TokenizedInput tokenized = 2;
```
I guess there would be no harm in supporting either text or token ids input?
That would require the gRPC server to include a tokenizer.

This can be added in the future; it's currently not supported. (One of the reasons to have gRPC is so that the OpenAI server and other related components can be written entirely in Rust or other languages, which is "more" production-ready.)
You can also pass a string prompt to AsyncLLM.generate(), so it would be trivial to expose this as an option so that the gRPC API could also be used in a standalone manner if desired (albeit probably not recommended for performance).
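A hypothetical helper (not part of this PR) shows how a servicer could accept either form, mirroring a proto `oneof` of text and token ids:

```python
from typing import Optional, Sequence, Union

def resolve_prompt(
    text: Optional[str],
    token_ids: Optional[Sequence[int]],
) -> Union[str, list]:
    """Return exactly one prompt form to hand to AsyncLLM.generate().

    Mirrors a proto oneof: exactly one of the two inputs must be set.
    """
    if (text is None) == (token_ids is None):
        raise ValueError("set exactly one of text or token_ids")
    return text if text is not None else list(token_ids)
```

Accepting text would make the gRPC API usable standalone, at the cost of running the tokenizer in-process as discussed above.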
There are currently no tests for this, which I highly recommend adding before this is merged.
> Another question is whether we should just enable this by default so at `vllm serve` we also start the gRPC server at another port
@simon-mo I would be in favor of opt-in first (enable with a flag, or explicitly start the entrypoint) and making it the default in a later release.
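An opt-in surface could look roughly like this sketch (flag names are hypothetical and not defined by this PR):

```python
import argparse

# Hypothetical CLI surface: gRPC stays off unless explicitly requested.
parser = argparse.ArgumentParser(prog="vllm-serve-sketch")
parser.add_argument("--enable-grpc", action="store_true",
                    help="also start the gRPC server (opt-in)")
parser.add_argument("--grpc-port", type=int, default=50051,
                    help="port for the gRPC server when enabled")

# Example invocation: opt in and override the port.
args = parser.parse_args(["--enable-grpc", "--grpc-port", "8081"])
assert args.enable_grpc and args.grpc_port == 8081
```

Flipping the `store_true` default to `True` in a later release would implement the "default on later" plan without changing the flag shape.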
vllm/grpc/grpc_request_manager.py
Outdated
```
# 1. Create a ParentRequest to track all child requests
# 2. Fan out multiple child EngineCoreRequests with different
#    request_index values
# 3. Aggregate outputs from all children
# For now, we only support n=1, so parent_req=None and
# request_index=0
```
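The n>1 plan in the comment above could be prototyped as follows; `fan_out` and `aggregate` are hypothetical stand-ins for the ParentRequest bookkeeping, not code from this PR:

```python
from collections import defaultdict

def fan_out(request_id, n):
    """Derive (child_id, request_index) pairs for n sampled completions."""
    return [(f"{request_id}-{i}", i) for i in range(n)]

def aggregate(chunks):
    """Merge streamed (request_index, text) chunks into per-child outputs,
    ordered by request_index."""
    per_child = defaultdict(list)
    for idx, text in chunks:
        per_child[idx].append(text)
    return ["".join(per_child[i]) for i in sorted(per_child)]
```

With n=1 this degenerates to a single child with request_index=0, which matches the currently supported path.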
It would also be great to add tests, but we can do that as a follow-on if needed.
Signed-off-by: Chang Su <chang.s.su@oracle.com>
Signed-off-by: Nick Hill <nickhill123@gmail.com>
This PR is failing AMD CI: https://buildkite.com/vllm/amd-ci/builds/2524/steps/canvas?jid=019b9d04-99d5-4f11-a7c6-f524c5e5b35e
Fixed by #31970
Signed-off-by: Chang Su <chang.s.su@oracle.com> Signed-off-by: njhill <nickhill123@gmail.com> Signed-off-by: Nick Hill <nickhill123@gmail.com> Co-authored-by: njhill <nickhill123@gmail.com> Co-authored-by: Simon Mo <simon.mo@hey.com>
Signed-off-by: dsuhinin <suhinin.dmitriy@gmail.com>
Purpose
Add gRPC server support to vLLM, enabling the community to integrate vLLM via the gRPC protocol from any upstream application or routing layer.
Key Benefits:

- Native gRPC protocol support
- Integration with `sgl-model-gateway`

Changed Files

Protocol & Codegen:

- `vllm_scheduler.proto` - Protocol buffer definition (source)
- `vllm_scheduler_pb2.py` - Generated protobuf messages (auto-generated)
- `vllm_scheduler_pb2_grpc.py` - Generated gRPC service (auto-generated)
- `compile_protos.py` - Script to compile proto files
- `__init__.py` - Module initialization

Server Implementation:

- `vllm/grpc/grpc_request_manager.py` - Request manager (`GrpcRequestManager` class)
- `vllm/entrypoints/grpc_server.py` - Server entrypoint (`VllmSchedulerServicer` + main)

Compilation
To regenerate the Python code from the `.proto` file:

Requirements: `pip install grpcio-tools`

This generates:

- `vllm_scheduler_pb2.py` - Message classes
- `vllm_scheduler_pb2_grpc.py` - Service stubs and servicers
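`compile_protos.py` wraps the codegen; a hand-run equivalent could assemble the `grpc_tools.protoc` invocation like this sketch (the include path is an assumption based on the changed-file list, and the command is constructed but not executed here):

```python
# Sketch of what compile_protos.py presumably does; paths are assumptions
# based on the changed-file list (vllm/grpc/vllm_scheduler.proto).
PROTO_DIR = "vllm/grpc"
PROTO_FILE = f"{PROTO_DIR}/vllm_scheduler.proto"

# grpc_tools.protoc is provided by the grpcio-tools package.
PROTOC_ARGV = [
    "python3", "-m", "grpc_tools.protoc",
    f"-I{PROTO_DIR}",
    f"--python_out={PROTO_DIR}",       # emits vllm_scheduler_pb2.py
    f"--grpc_python_out={PROTO_DIR}",  # emits vllm_scheduler_pb2_grpc.py
    PROTO_FILE,
]
# e.g. subprocess.run(PROTOC_ARGV, check=True) would invoke the compiler.
```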
Test Plan

Run the gRPC server with `Llama-3.1-8B-Instruct`:

```
python3 -m vllm.entrypoints.grpc_server \
    --model meta-llama/Llama-3.1-8B-Instruct \
    --port 8080 \
    --host 0.0.0.0 \
    --tensor-parallel-size 1
```

Test gateway integration with `sgl-model-gateway`.

Verify:

- `GetModelInfo` and `GetServerInfo` return correct metadata
Test Result

We used `genai-bench` to measure the http_server vs (grpc_server + `sgl-model-gateway`) with `Llama-3.3-70B-Instruct` on 4xH100.

Performance Results (Llama-3.3-70B, D100_100, Concurrency 256):

At high concurrency, gRPC demonstrates superior production characteristics:
Key Value Proposition: