[Entrypoint] Add realtime OpenPI robot serving API #3673
timzsu
left a comment
The contract looks good overall. There is one blocking issue for GR00T. PTAL :)
@router.websocket("/v1/realtime/robot/openpi")
async def realtime_robot_openpi(websocket: WebSocket):
Can the interface be unified to the standard OpenAI interface v1/realtime? And can different connections be used internally for different models?
- These two APIs seem totally different, except that both use a WebSocket carrying dict payloads. Therefore I don't think it is necessary to unify them.
- I am not sure I understand your question correctly. Can you explain it a little more? For example, what is the specific case?
What's the difference? The actions are updated at runtime, and the model returns the corresponding video frames, am I right?
At least, we should define a unified robot API (if it really is not similar to realtime video) that other world models could reuse.
One problem is that OpenAI's /realtime only supports Omni understanding (VL) and TTS. So even if we reuse that API, we still need to define extra WebSocket payload formats for it.
UX-wise, users have to learn our custom interface anyway, regardless of the endpoint, and reusing /realtime may confuse them about the actual/typical usage of OpenAI's /realtime.
Plus, engineering-wise, our current /realtime implementation depends on vllm's base implementation, and its extensibility is unknown. The code may get messier, or the extension may be so large that it amounts to a complete re-implementation.
This PR fits the standard openpi serving paradigm; compatibility has been verified on evaluation tasks.
Is there a link to OpenPI's official documentation (about their endpoint design)? And is it only an endpoint thing? What types of models and modalities does it officially support? I'd appreciate it if you could attach it here for our reference 😄
Maybe related: in #3632 and #3737, I designed and implemented another endpoint purely for real-time diffusion video generation (and, in the future, for changing the prompt during real-time video generation). It is a custom endpoint. So I wonder whether this robot protocol is already commonly used by the community and whether it supports real-time video generation.
Hi @TKONIY, I saw you reacted to my comment, so could you help attach the link to OpenPI's official documentation? 😂
FYI: The OpenPI serving paradigm is more of a wire schema than a static protocol between models and real robots or simulators; it is used by the Pi-family models and DreamZero. On the last question: as far as I know, world models that support real or simulated robots have no unified API. However, this schema is used by downstream communities such as molmospace, and the common pipeline is that the environment sends images, prompt, and state, and the model returns actions; no video is transported (even if video is generated inside the model). A unified video-generation and robot API would seem to share only the websocket and would still need different payload keys, so I am not sure it is a feasible idea. For example, there may later be a gRPC-based endpoint to support more robot policies based on lerobot.
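To make the flow described above concrete, here is a minimal sketch of one observation/action round trip. It is not the OpenPI wire schema: the key names (`images`, `prompt`, `state`, `actions`), the `dummy_policy` function, and the plain-dict payloads are assumptions for illustration only, and the sketch skips msgpack and the websocket entirely.

```python
from typing import Any, Callable

Observation = dict[str, Any]
ActionChunk = dict[str, Any]


def dummy_policy(obs: Observation) -> ActionChunk:
    """Hypothetical stand-in for a robot policy: returns a zero action chunk."""
    horizon = 4  # number of future action steps per chunk (assumed value)
    action_dim = len(obs["state"])
    return {"actions": [[0.0] * action_dim for _ in range(horizon)]}


def run_episode(policy: Callable[[Observation], ActionChunk], steps: int) -> list[ActionChunk]:
    """Environment side: send an observation, receive actions; no video comes back."""
    replies = []
    state = [0.0, 0.0, 0.0]
    for _ in range(steps):
        obs = {
            "images": {"front_camera": b""},  # encoded frame bytes would go here
            "prompt": "pick up the cube",
            "state": state,
        }
        reply = policy(obs)
        replies.append(reply)
        # Apply the first action of the chunk to the toy environment state.
        state = [s + a for s, a in zip(state, reply["actions"][0])]
    return replies
```

Note that the reply carries only `actions`; any video the model generates internally never leaves the model in this flow.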
fake0fan
left a comment
Overall, I think the serving API code here looks good to me.
My understanding is that once DreamZero is fully supported end-to-end, we should be able to use these APIs in practice pretty soon, right?
Thanks @fhfuih, @QiuMike, @fake0fan. I'll try to address your concerns together here. I also cite @matchyc's reply, which covers some of the detailed design considerations for this API.
Add a generic OpenPI-compatible robot policy websocket endpoint at /v1/realtime/robot/openpi. The endpoint performs msgpack request handling, policy-server metadata handshake, session/reset tracking, and forwards observations to AsyncOmni with robot-specific extra_args.

Keep model behavior out of the serving layer by requiring policy_server_config from model config and extracting actions from multimodal_output['actions']. This lets robot policy pipelines implement their own transforms and state without coupling the OpenPI protocol code to a specific model.

Add unit coverage for payload validation, missing optional openpi-client dependency, per-connection session state, reset handling, policy_server_config discovery, request construction, and generic actions extraction.

Signed-off-by: Yangshen Deng <yangshen.d@outlook.com>
Co-authored-by: Meng <meng_chen99@163.com>
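As a rough illustration of the per-connection session state and reset handling mentioned in this commit, the sketch below keeps one state object per websocket. The class name `ConnectionSession`, the `reset` payload key, the `extra_args` field names, and the choice to issue a new session id on reset are all assumptions, not the PR's actual implementation.

```python
import uuid
from dataclasses import dataclass, field


@dataclass
class ConnectionSession:
    """Hypothetical per-connection state; create one instance per websocket."""

    session_id: str = field(default_factory=lambda: uuid.uuid4().hex)
    step: int = 0

    def handle(self, payload: dict) -> dict:
        """Return robot-specific extra args for one observation payload."""
        reset = bool(payload.get("reset", False))
        if reset:
            # Assumed behavior: a reset starts a new episode with a new state key.
            self.session_id = uuid.uuid4().hex
            self.step = 0
        extra_args = {"session_id": self.session_id, "step": self.step, "reset": reset}
        self.step += 1
        return extra_args
```

Because each connection owns its own instance, two clients never share episode state even when they talk to the same model.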
Move the OpenPI protocol implementation out of the OpenAI realtime package while keeping the public websocket route unchanged. Make engine request ids unique per inference so robot session ids remain state keys only, and tighten action extraction to require a single result with multimodal_output.

Co-authored-by: Yangshen Deng <yangshen.d@outlook.com>
Signed-off-by: Meng <meng_chen99@163.com>
Signed-off-by: Yangshen Deng <yangshen.d@outlook.com>
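The two changes in this commit can be sketched roughly as follows. The helper names `make_request_id` and `extract_actions`, the id format, and the error messages are hypothetical; only the requirements (a fresh request id per inference, exactly one result, and a `multimodal_output` with an `actions` key) come from the commit message.

```python
import uuid
from typing import Any


def make_request_id(session_id: str) -> str:
    """Fresh engine request id per inference; the session id stays a state key only."""
    return f"{session_id}-{uuid.uuid4().hex}"


def extract_actions(results: list[Any]) -> Any:
    """Require exactly one result that carries multimodal_output['actions']."""
    if len(results) != 1:
        raise ValueError(f"expected exactly one result, got {len(results)}")
    output = getattr(results[0], "multimodal_output", None)
    if not isinstance(output, dict) or "actions" not in output:
        raise ValueError("result has no multimodal_output['actions']")
    return output["actions"]
```

Failing loudly on zero or multiple results keeps a malformed pipeline output from silently reaching the robot.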
Replace the pickle-based websocket test serialization mock with a JSON bytes mock that can handle ndarray outputs, and apply the import ordering produced by ruff.

Co-authored-by: Meng <meng_chen99@163.com>
Signed-off-by: Yangshen Deng <yangshen.d@outlook.com>
Thanks @matchyc @TKONIY for the extensive explanation! Now I understand that there is no conventional API endpoint naming or conventional websocket payload format, but rather a conventional user flow design. After a careful reading, I see that this pure API implementation is model-agnostic and modality-agnostic. So, LGTM to merge such an endpoint, since other world model development depends on it. My previous question on potential endpoint reuse seems irrelevant. One last minor comment on my side (but forgivable since other people already approved this PR 😁):
@hsliuustc0106 @wtomin @amy-why-3459 PTAL at the URL path.
@Gaohan123 @hsliuustc0106 I think this PR is ready and can be merged.
Gaohan123
left a comment
In the follow-up PRs, where will the data preprocessing and postprocessing logic live?
We have surveyed the robotics community and different models. The conclusion is that the data processing part seems to be specific to each model-and-dataset combination, not only to the dataset. Therefore, we now put this logic inside each model pipeline instead of the serving interface. In the future, when we have gained more experience supporting new robotics models, we can try to add a new abstraction layer for it if there is an opportunity for unification. Ref:
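A minimal sketch of that split, assuming a hypothetical `RobotPolicyPipeline` interface: the pipeline owns its transforms, and the serving function only forwards data. None of these names come from the PR, and `ToyPipeline` is a made-up example rather than a real model.

```python
from typing import Any, Protocol


class RobotPolicyPipeline(Protocol):
    """Hypothetical interface: each model pipeline owns its own transforms."""

    def preprocess(self, observation: dict[str, Any]) -> Any: ...
    def infer(self, model_input: Any) -> Any: ...
    def postprocess(self, model_output: Any) -> list[list[float]]: ...


class ToyPipeline:
    """Toy model-specific pipeline: scales the state, then unscales the actions."""

    scale = 10.0

    def preprocess(self, observation):
        return [x / self.scale for x in observation["state"]]

    def infer(self, model_input):
        return [model_input]  # pretend the model echoes one action step

    def postprocess(self, model_output):
        return [[x * self.scale for x in step] for step in model_output]


def serve_once(pipeline: RobotPolicyPipeline, observation: dict[str, Any]) -> dict[str, Any]:
    """Serving layer: forwards data and knows nothing about model transforms."""
    model_input = pipeline.preprocess(observation)
    model_output = pipeline.infer(model_input)
    return {"actions": pipeline.postprocess(model_output)}
```

Swapping in another model then means writing another pipeline class, with no change to `serve_once`.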
This PR adds a standalone realtime OpenPI-compatible robot serving endpoint at /v1/realtime/robot/openpi.

It includes:
- … policy_server_config.
- … AsyncOmni.generate() and action extraction from multimodal_output["actions"].

This PR intentionally contains only the serving API layer and does not include DreamZero model implementation code.