
[Entrypoint] Add realtime OpenPI robot serving API #3673

Merged
Gaohan123 merged 7 commits into vllm-project:main from TKONIY:feature/openpi_serving on Jun 1, 2026

Conversation

@TKONIY
Contributor

@TKONIY TKONIY commented May 17, 2026

This PR adds a standalone realtime OpenPI-compatible robot serving endpoint at /v1/realtime/robot/openpi.

It includes:

  • WebSocket msgpack request/response handling compatible with OpenPI clients.
  • Policy server metadata handshake from policy_server_config.
  • Per-connection session/reset handling.
  • Generic forwarding to AsyncOmni.generate() and action extraction from multimodal_output["actions"].
  • Unit tests for connection handling, config discovery, request construction, and generic action extraction.

This PR intentionally contains only the serving API layer and does not include DreamZero model implementation code.
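The flow listed above (metadata handshake, per-connection session state, reset handling, one action response per observation) might be sketched roughly as follows. This is only an illustration under stated assumptions: `RobotSession`, `handle_message`, the `"type": "reset"` marker, and the response keys are hypothetical names, not taken from the PR, and msgpack encoding and the WebSocket transport are left out so the example stays standard-library only.

```python
from dataclasses import dataclass, field
from typing import Any, Callable


@dataclass
class RobotSession:
    """State owned by exactly one connection (hypothetical class)."""

    metadata: dict[str, Any]  # what the server would send as the handshake
    step: int = 0
    history: list[dict[str, Any]] = field(default_factory=list)

    def reset(self) -> None:
        # Drop everything accumulated for this connection's episode.
        self.step = 0
        self.history.clear()


def handle_message(
    session: RobotSession,
    message: dict[str, Any],
    policy: Callable[[dict[str, Any]], list[float]],
) -> dict[str, Any]:
    """Process one decoded request dict and return exactly one response dict."""
    if message.get("type") == "reset":  # assumed reset marker
        session.reset()
        return {"status": "reset"}
    session.history.append(message)
    session.step += 1
    # One observation in, one action chunk out.
    return {"actions": policy(message), "step": session.step}


# Usage: a constant policy stands in for the real model call.
session = RobotSession(metadata={"action_dim": 2})
response = handle_message(session, {"prompt": "pick up the cube"}, lambda obs: [0.0, 1.0])
```

Keeping the state on a per-connection object means two clients never share episode history, which is the property the session/reset bullet describes.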

@TKONIY TKONIY requested review from tzhouam and yenuo26 as code owners May 17, 2026 20:04
Contributor

@timzsu timzsu left a comment

The contract looks good overall. There is one blocking issue for GR00T. PTAL :)

Comment thread vllm_omni/entrypoints/openpi/serving.py
Contributor

@timzsu timzsu left a comment

LGTM. I have built #3798 on top of it to integrate GR00T.



@router.websocket("/v1/realtime/robot/openpi")
async def realtime_robot_openpi(websocket: WebSocket):
Contributor

Can the interface be unified with the standard OpenAI v1/realtime interface? And can different connections be used internally for different models?

Contributor Author

  1. These two APIs seem totally different, except that both use a WebSocket carrying dicts. Therefore, I don't think it is necessary to unify them.

  2. I am not sure I understand your question correctly. Could you explain it a bit more? For example, what specific case do you have in mind?

Contributor

What's the difference? The actions are updated at runtime, and the model returns the corresponding video frames, am I right?

At least, we should define a unified robot API (if it really is not similar to realtime video) that other world models could reuse.

Contributor

One problem is that OpenAI's /realtime only supports Omni understanding (VL) and TTS. So even if we reuse that API, we still need to define extra WebSocket payload formats for it.
UX-wise, users have to learn our custom interface anyway, regardless of the endpoint. And it may confuse users about the actual/typical usage of OpenAI's /realtime.

Plus, engineering-wise, our current /realtime implementation depends on vllm's base implementation, and its extensibility is unknown. The code might become messier, or the extension might grow so large that it amounts to a complete re-implementation.

@matchyc
Contributor

matchyc commented May 22, 2026

This PR fits the standard OpenPI serving paradigm; compatibility has been verified on evaluation tasks.

@wtomin wtomin self-requested a review May 26, 2026 02:09
Comment thread vllm_omni/entrypoints/openpi/connection.py
Comment thread vllm_omni/entrypoints/openai/realtime/robot/openpi_serving.py Outdated
@fhfuih
Contributor

fhfuih commented May 27, 2026

Is there a link to OpenPI's official documentation (about their endpoint design)? And is it only an endpoint convention? What types of models and modalities does it officially support? I'd appreciate it if you could attach it here for our reference 😄

Maybe related: In #3632 and #3737, I have designed and implemented another endpoint purely for real-time diffusion video generation (and, in the future, for changing the prompt during real-time video generation). It is a custom endpoint. So I wonder if this robot protocol is already commonly used by the community and whether it supports real-time video generation.

Collaborator

@wtomin wtomin left a comment

This PR LGTM. Also cc @fake0fan for comments.

@fhfuih
Contributor

fhfuih commented May 28, 2026

Hi @TKONIY, I saw you reacted to my earlier comment, so could you help attach the link to OpenPI's official documentation? 😂

@matchyc
Contributor

matchyc commented May 28, 2026

FYI: The OpenPI serving paradigm is more of a wire schema than a static protocol between models and real robots/simulation; it is used by the Pi-family models and DreamZero. On the last question: as far as I know, world models that support real or simulated robots have no unified API. However, this schema is used by downstream communities such as molmospace, and the common pipeline is that environments send images/prompt/states and the model returns actions. No video is transported (even if videos are generated inside the model).

A unified video-generation and robot API would seem to share only the WebSocket transport; different payload keys would still have to be added, so I am not sure it is a feasible idea. For example, there may later be a gRPC-based endpoint to support more robot policies based on lerobot.
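To make the "environments send images/prompt/states, the model returns actions" pipeline concrete, here is a small sketch of what an observation payload and a check on it could look like. The key names (`observation/image`, `observation/state`, `prompt`) and both helper functions are assumptions chosen for illustration; the actual keys depend on the model and environment and are not specified in this thread.

```python
from typing import Any, Sequence

# Assumed key names, for illustration only.
REQUIRED_KEYS = ("observation/image", "observation/state", "prompt")


def build_observation(image: Any, state: Sequence[float], prompt: str) -> dict[str, Any]:
    """Assemble the dict an environment would send for one control step."""
    return {
        "observation/image": image,        # e.g. an HxWx3 array from a camera
        "observation/state": list(state),  # e.g. joint positions
        "prompt": prompt,                  # natural-language task instruction
    }


def missing_keys(payload: dict[str, Any]) -> list[str]:
    """Return the required keys that are absent from a payload, in order."""
    return [key for key in REQUIRED_KEYS if key not in payload]


# Usage: a 1x1 "image" keeps the example tiny.
observation = build_observation([[[0, 0, 0]]], [0.1, 0.2], "open the drawer")
problems = missing_keys({"prompt": "open the drawer"})
```

Note that the payload carries no video in either direction, matching the description above: only observations go in and only actions come back.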

Contributor

@fake0fan fake0fan left a comment

Overall, I think the serving API code here looks good to me.

My understanding is that once DreamZero is fully supported end-to-end, we should be able to use these APIs in practice pretty soon, right?

@TKONIY
Contributor Author

TKONIY commented May 28, 2026

Thanks @fhfuih, @QiuMike, @fake0fan. I will try to address your concerns together here. I also cite @matchyc's reply above, which covers some of the detailed design considerations behind this API.

TKONIY and others added 3 commits May 28, 2026 10:53
Add a generic OpenPI-compatible robot policy websocket endpoint at /v1/realtime/robot/openpi. The endpoint performs msgpack request handling, policy-server metadata handshake, session/reset tracking, and forwards observations to AsyncOmni with robot-specific extra_args.

Keep model behavior out of the serving layer by requiring policy_server_config from model config and extracting actions from multimodal_output['actions']. This lets robot policy pipelines implement their own transforms and state without coupling the OpenPI protocol code to a specific model.

Add unit coverage for payload validation, missing optional openpi-client dependency, per-connection session state, reset handling, policy_server_config discovery, request construction, and generic actions extraction.

Signed-off-by: Yangshen Deng <yangshen.d@outlook.com>

Co-authored-by: Meng <meng_chen99@163.com>
Move the OpenPI protocol implementation out of the OpenAI realtime package while keeping the public websocket route unchanged.

Make engine request ids unique per inference so robot session ids remain state keys only, and tighten action extraction to require a single result with multimodal_output.

Co-authored-by: Yangshen Deng <yangshen.d@outlook.com>

Signed-off-by: Meng <meng_chen99@163.com>

Signed-off-by: Yangshen Deng <yangshen.d@outlook.com>
Replace the pickle-based websocket test serialization mock with a JSON bytes mock that can handle ndarray outputs, and apply the import ordering produced by ruff.

Co-authored-by: Meng <meng_chen99@163.com>
Signed-off-by: Yangshen Deng <yangshen.d@outlook.com>
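Two points from the commit messages above lend themselves to a short sketch: engine request ids are unique per inference while the robot session id remains only a state key, and action extraction requires a single result carrying `multimodal_output`. The function names and the shape of the result object below are hypothetical; the PR's actual code may differ.

```python
import uuid
from types import SimpleNamespace
from typing import Any


def new_request_id(session_id: str) -> str:
    """Return an engine request id that is unique for every inference call."""
    # The session id stays a stable state key; only the suffix changes per call.
    return f"{session_id}-{uuid.uuid4().hex}"


def extract_actions(results: list[Any]) -> Any:
    """Return results[0].multimodal_output["actions"], rejecting anything else."""
    if len(results) != 1:
        raise ValueError(f"expected exactly one result, got {len(results)}")
    output = getattr(results[0], "multimodal_output", None)
    if not isinstance(output, dict) or "actions" not in output:
        raise ValueError("result has no multimodal_output['actions']")
    return output["actions"]


# Stand-in for an engine result object (hypothetical shape).
fake_result = SimpleNamespace(multimodal_output={"actions": [[0.1, 0.2, 0.3]]})
actions = extract_actions([fake_result])
```

Failing loudly on zero or multiple results keeps the serving layer from guessing which output holds the actions, which is consistent with keeping model behavior out of the protocol code.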
@fhfuih
Contributor

fhfuih commented May 28, 2026

Thanks @matchyc @TKONIY for the extensive explanation! Now I understand that there is no conventional API endpoint naming, nor a conventional websocket payload format, but more like a conventional user flow design.

After a careful reading, I now see that this pure API implementation is model-agnostic and modality-agnostic. So, LGTM to merge such an endpoint, since other world-model development depends on it. My previous question on potential endpoint reuse seems irrelevant.

One last minor comment on my side (but forgivable since other people already approved this PR 😁):
This implementation is not quite "real-time"/streaming/duplex in terms of data flow and interaction (the code comment says so explicitly: one input -> wait for one output; the WebSocket is used only for session persistence). So the endpoint name /v1/realtime/robot/openpi doesn't fit perfectly, unless there is a plan to add streaming input/output or duplex to this endpoint, for instance if world models trend toward pipeline-level streaming. Furthermore, for OpenAI users, /v1/realtime is the familiar OpenAI-compatible real-time endpoint, so a subpath of it would intuitively be expected to be real-time too. I would imagine something like /v1/openpi or /v1/robot/openpi would work as well (depending on whether the "robot use case" needs special attention and whether there are other common protocols for it).

@TKONIY
Contributor Author

TKONIY commented May 28, 2026

@hsliuustc0106 @wtomin @amy-why-3459 PTAL at the URL path.

@amy-why-3459
Contributor

@Gaohan123 @hsliuustc0106 I think this PR is ready and can be merged.

@Gaohan123 Gaohan123 added the "ready" label (label to trigger buildkite CI) May 31, 2026
Collaborator

@Gaohan123 Gaohan123 left a comment

In the follow-up PRs, where will the data preprocessing and postprocessing logic live?

@TKONIY
Contributor Author

TKONIY commented Jun 1, 2026

We surveyed the robotics community and a range of models. Our conclusion is that data processing is specific to the combination of model and dataset, not to the dataset alone. Therefore, for now we put this logic inside each model pipeline rather than in the serving interface.

In the future, once we have more experience supporting new robotics models, we can introduce a new abstraction layer for it if there is an opportunity to unify.
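The split described above, with preprocessing and postprocessing owned by each model pipeline and a serving layer that only forwards, could look something like this sketch. `RobotPipeline`, `ToyPipeline`, and `serve_once` are hypothetical names, and the arithmetic merely stands in for real normalization and inference.

```python
from typing import Any, Protocol


class RobotPipeline(Protocol):
    """Interface a model pipeline would implement (hypothetical)."""

    def preprocess(self, observation: dict[str, Any]) -> Any: ...
    def infer(self, model_input: Any) -> Any: ...
    def postprocess(self, raw: Any) -> list[float]: ...


class ToyPipeline:
    """Stand-in pipeline that carries its own model-and-dataset constants."""

    scale = 2.0  # pretend normalization constant tied to this model and dataset

    def preprocess(self, observation: dict[str, Any]) -> list[float]:
        return [x / self.scale for x in observation["state"]]

    def infer(self, model_input: list[float]) -> list[float]:
        return [x + 1.0 for x in model_input]  # placeholder for the real model

    def postprocess(self, raw: list[float]) -> list[float]:
        return [x * self.scale for x in raw]


def serve_once(pipeline: RobotPipeline, observation: dict[str, Any]) -> dict[str, Any]:
    """Serving-layer step: no model-specific logic, only delegation."""
    model_input = pipeline.preprocess(observation)
    return {"actions": pipeline.postprocess(pipeline.infer(model_input))}


result = serve_once(ToyPipeline(), {"state": [2.0, 4.0]})
```

With this shape, supporting a new robotics model means adding a pipeline class, and a shared abstraction can be factored out later if the pipelines turn out to have enough in common.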

Ref:

@Gaohan123 Gaohan123 merged commit 634af60 into vllm-project:main Jun 1, 2026
6 of 8 checks passed

Labels

ready label to trigger buildkite CI

9 participants