[WIP] [Full Duplex] Feat: Support Full-Duplex realtime runtime & add MiniCPM-o 4.5 demo by Sy0307 · Pull Request #3907 · vllm-project/vllm-omni

Sy0307 · 2026-05-27T11:13:31Z

NOTICE: This PR is still WIP.

Refs #3745. Stacked on #3642 (tc-mb:Support-MiniCPM-o-4.5).

This PR targets main so it can be opened against the upstream repository. Until #3642 is merged, the diff also includes the base MiniCPM-o 4.5 model-support changes from #3642. The full-duplex/realtime work is in the follow-up signed-off commit 32ab95b2ffdb982fefaa0bbffff5efafdf8a175e.

Purpose

This PR adds a MiniCPM-o 4.5 native full-duplex realtime runtime path for the architecture discussed in #3745.

The base model support comes from #3642. This PR extends that baseline from normal staged MiniCPM-o 4.5 serving into a session-oriented audio streaming path:

client websocket
  -> /v1/duplex or /v1/realtime?duplex=1
  -> duplex session actor / event adapter
  -> AsyncOmniEngine duplex data plane
  -> Stage0 MiniCPM-o listen/speak decode
  -> Stage0-to-Stage1 handoff payload
  -> Stage1 MiniCPM-o TTS / token2wav
  -> realtime audio delta / done events

The goal is not just a smoke test around the existing chat endpoint. The implementation wires a real audio-in -> Stage0 -> Stage1 -> audio-out loop and covers the core realtime control cases needed by the current demo: streaming input append, model listen/speak decisions, audio response streaming, cancel/barge-in, overlap handling, playback ack, and conversation item emission.

RFC Alignment

Implemented from the #3745 full-duplex direction:

Session-scoped duplex state: session id, response id, epoch, playback cursor, active response state, and close/cancel lifecycle are tracked explicitly instead of being hidden behind one-off request state.
Independent serving-side input/output flow: websocket input handling no longer has to treat every input append as a blocking single request-response turn. Cancel/barge-in can be observed while output is active.
Realtime event adapter: /v1/realtime?duplex=1 maps the main OpenAI Realtime-style events used by the demo, including session.update, input_audio_buffer.append, input_audio_buffer.commit, response.create, response.cancel, response.audio.delta, response.audio.done, response.done, and conversation item events.
Duplex data-plane integration: audio append and stage handoff use the engine/orchestrator/scheduler/worker path instead of relying on a fake chat-completion request as the only control surface.
Stage-native MiniCPM-o 4.5 runtime: Stage0 uses the model's audio streaming path and listen/speak policy; Stage1 consumes the handoff payload and emits TTS audio chunks.
Overlap and barge-in policy: input arriving while assistant audio is active is handled through an overlap policy path, not blindly treated as an unconditional cancel.
Playback-aware memory boundary: playback ack is represented in session state and is used to commit played assistant content instead of assuming every emitted byte has necessarily entered conversation memory.
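
The client side of the event adapter can be sketched as follows. This is a minimal illustration rather than the PR's code: the event type names come from the list above, while the `audio` field carrying base64 PCM16 follows the OpenAI Realtime convention and is an assumption about what the adapter accepts. The helper names `append_event` and `control_event` are hypothetical.

```python
import base64
import json


def append_event(pcm16_bytes: bytes) -> str:
    """Build an input_audio_buffer.append event carrying base64 PCM16 audio."""
    return json.dumps({
        "type": "input_audio_buffer.append",
        # Assumed field name, following the OpenAI Realtime convention.
        "audio": base64.b64encode(pcm16_bytes).decode("ascii"),
    })


def control_event(event_type: str) -> str:
    """Build a payload-free control event such as commit, create, or cancel."""
    allowed = {
        "input_audio_buffer.commit",
        "response.create",
        "response.cancel",
    }
    if event_type not in allowed:
        raise ValueError(f"unsupported client event: {event_type}")
    return json.dumps({"type": event_type})


# One turn as a client might send it over the /v1/realtime?duplex=1 websocket.
turn = [
    append_event(b"\x00\x00" * 1600),  # 1600 samples of PCM16 silence
    control_event("input_audio_buffer.commit"),
    control_event("response.create"),
]
```

Each string in `turn` would be sent as one websocket text frame; the server-side events (response.audio.delta, response.audio.done, response.done) arrive on the same socket.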

Intentional differences or improvements versus the initial RFC sketch:

The implementation keeps the MiniCPM-o 4.5 policy model-specific instead of pretending all models have the same duplex token and TTS handoff semantics.
Control-plane events such as open, close, cancel, and signal remain explicit, while high-volume audio/stage payloads are moved toward the data-plane path.
Persistent core KV lease is kept out of this PR by design; resumable/session state is used where available, but this PR does not claim the full scheduler-owned KV lease lifecycle.
The Realtime endpoint is introduced as an adapter over the native duplex runtime, so the model-specific path can be validated before claiming full byte-perfect OpenAI Realtime compatibility.

Technical Changes

Serving and protocol:

Add duplex protocol objects for session config, runtime capability, playback cursor, overlap policy, and data-plane result handling.
Add /v1/duplex websocket serving for the native duplex protocol.
Add /v1/realtime?duplex=1 websocket serving for the Realtime-compatible adapter path.
Add Realtime audio format conversion, including pcm16 client input to MiniCPM-o native pcm_f32le input.
Add response lifecycle emission for created, audio delta, audio done, output item/content part lifecycle, done, cancel, and close.
Add playback ack handling and assistant-history commit behavior.
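
The pcm16 to pcm_f32le conversion mentioned above can be illustrated with the standard library alone. This is a sketch under one assumption: samples are scaled by 1/32768 so they land in [-1.0, 1.0). The PR may use a different scaling or a vectorized implementation, and the function name is hypothetical.

```python
import struct


def pcm16_to_f32le(pcm16: bytes) -> bytes:
    """Convert little-endian signed 16-bit PCM to little-endian float32 PCM."""
    if len(pcm16) % 2:
        raise ValueError("pcm16 input must contain whole 16-bit samples")
    count = len(pcm16) // 2
    samples = struct.unpack(f"<{count}h", pcm16)
    # Assumed scaling: divide by 32768 so samples land in [-1.0, 1.0).
    return struct.pack(f"<{count}f", *(s / 32768.0 for s in samples))
```

The output is twice the size of the input (4 bytes per sample instead of 2), which matters when sizing append buffers.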

Engine / orchestrator / scheduler:

Add duplex data/control messages through AsyncOmniEngine and StagePool.
Route duplex append/signal/close results back to serving instead of swallowing runtime failures.
Add segment finish handling so Stage0 chunk boundaries can trigger Stage0 -> Stage1 forwarding.
Carry Stage0 -> Stage1 handoff payloads through the scheduler/orchestrator path rather than treating the whole runtime as a serving-only adapter.
Add model intermediate buffer helpers for duplex payloads so hidden states, token ids, and metadata are not passed as ad-hoc unrelated fields.

MiniCPM-o 4.5 runtime:

Add MiniCPM-o 4.5 duplex runtime/policy code for streaming audio append, listen/speak token handling, chunk eos handling, Stage0 result parsing, Stage1 TTS handoff, and token2wav output.
Reuse MiniCPM-o processor/audio/TTS components while isolating duplex session state from generic serving state.
Add support for multi-chunk and multi-turn session continuation in the native realtime path.
Add model-specific safeguards around unsupported modes and stage role/topology reporting.

Examples and configs:

Add MiniCPM-o 4.5 realtime duplex demo script.
Add streaming/stage-replica configs used by the MiniCPM-o 4.5 duplex path.
Update MiniCPM-o example documentation for the native duplex/realtime entrypoint.

Tests:

Add focused unit coverage for duplex protocol objects, serving handler behavior, runtime control result propagation, engine/orchestrator routing, worker native hooks, MiniCPM-o stage input processing, and Realtime event handling.

Current Verified Behavior

The latest controlled remote E2E covers the important demo path:

/v1/realtime?duplex=1 session creation.
pcm16 Realtime audio input conversion into MiniCPM-o native audio append.
streaming audio input commit.
Stage0 MiniCPM-o listen/speak decision.
Stage0 segment finish and Stage0 -> Stage1 handoff.
Stage1 TTS audio delta output.
response.audio.delta, response.audio.done, and response.done emission.
in-flight cancel / barge-in with stale epoch output filtered.
overlap listen case where short input does not incorrectly cancel the current response.
playback ack and committed assistant history accounting.
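
The stale-epoch filtering behind the cancel / barge-in case can be sketched like this. The class and method names are hypothetical and the real session state tracks far more than an epoch; the sketch only shows why tagging output with the epoch that produced it lets late audio be dropped after an interruption.

```python
from dataclasses import dataclass


@dataclass
class DuplexSession:
    """Hypothetical session state: only the epoch matters for this sketch."""
    epoch: int = 0

    def barge_in(self) -> int:
        """Advance the epoch so output from the interrupted response goes stale."""
        self.epoch += 1
        return self.epoch

    def accept(self, output_epoch: int) -> bool:
        """Forward an audio delta only if it belongs to the current epoch."""
        return output_epoch == self.epoch


session = DuplexSession()
in_flight = [(0, b"a"), (0, b"b")]   # deltas produced before the barge-in
session.barge_in()
late = in_flight + [(1, b"c")]       # stale deltas arrive alongside new ones
delivered = [chunk for epoch, chunk in late if session.accept(epoch)]
```

Here only the epoch-1 chunk is delivered, which is the condition the stale_audio_delta_count=0 signal checks.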

Known Boundaries / Not Claimed Complete

This PR is a substantial step toward the #3745 RFC, but it should not be reviewed as a final production-complete full-duplex core. Known remaining work:

Persistent core KV lease: not implemented in this PR. resumable/session state should not be interpreted as a full scheduler-owned KV lease with allocation, rollback, migration, and release semantics.
One long-lived request per stage: the implementation is closer to scheduler-managed resumable duplex data-plane requests, but it is not yet the final RFC stage actor lifecycle for every stage.
Byte-perfect OpenAI Realtime compatibility: the main demo event path is implemented, but the full Realtime schema surface is not complete.
Multi-session / multi-replica production policy: happy-path behavior has evidence, but admission control, replica binding, failure recovery, and fairness still need broader validation.
Playback-to-history precision: playback ack is represented and used, but exact token/audio alignment remains model- and mark-resolution dependent.
Long-duration natural conversation quality: the E2E proves the runtime path and key control semantics; it is not a claim that long-running turn-taking quality is fully tuned.
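
To make the playback-to-history boundary concrete, here is a minimal sketch of committing assistant content only once the client acknowledges having played it. The `PlaybackLedger` class, its millisecond cursor, and the segment granularity are all assumptions for illustration; as noted above, real alignment depends on the model and mark resolution.

```python
from dataclasses import dataclass, field


@dataclass
class PlaybackLedger:
    """Hypothetical ledger mapping emitted assistant segments to playback."""
    emitted: list = field(default_factory=list)    # (end_ms, text), in order
    committed: list = field(default_factory=list)  # text the user has heard
    emitted_ms: int = 0

    def emit(self, duration_ms: int, text: str) -> None:
        """Record a segment the server sent; it is not yet part of history."""
        self.emitted_ms += duration_ms
        self.emitted.append((self.emitted_ms, text))

    def ack(self, played_ms: int) -> None:
        """Commit every segment whose audio ended at or before the cursor."""
        while self.emitted and self.emitted[0][0] <= played_ms:
            self.committed.append(self.emitted.pop(0)[1])


ledger = PlaybackLedger()
ledger.emit(1000, "Hello, ")
ledger.emit(1000, "how are you?")
ledger.ack(1200)  # the client was interrupted 1.2 s into playback
```

After the ack, only the first segment is in `committed`; the second stays pending because its audio was emitted but not fully played.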

Test Plan

Ruff lint/format on the changed Python files.
git diff --check.
Targeted py_compile for the duplex/realtime serving and MiniCPM-o native runtime files.
Remote H20 MiniCPM-o 4.5 realtime duplex E2E on /v1/realtime?duplex=1.
Full CI / broader model matrix, to be covered by CI and follow-up validation.

Test Result

Local checks:

ruff check <changed-python-files>
ruff format --check <changed-python-files>
git diff --check
python3 -m py_compile <duplex-and-minicpmo-runtime-files>

E2E:

Remote server task: 54cf2229
E2E task: f33584ae
Result: returncode=0, ok=true

Key E2E signals:

overlap_listen=true
overlap_barge_in=true
short_ack_cancelled=false
model_listen_policy_observed=true
model_speak_event_ok=true
playback_commit_ok=true
playback_history_committed_count=1
stale_audio_delta_count=0
response.audio.delta=10
response.audio.done=2
response.output_audio.delta=0
response.output_audio.done=0
error=0

Server log audit for the E2E run:

ERROR=0
Traceback=0
RuntimeError=0
ValueError=0
DynamicCache=0
runtime_append_failed=0
Exception in ASGI=0



cc @lishunyang12 @linyueqian @vklimkov-nvidia @Sy0307 @tc-mb @Gaohan123 @amy-why-3459 @TKONIY @yinpeiqi

Nightwing-77 · 2026-05-31T20:15:47Z

Can we make this model-specific duplex runtime generic and reusable?

Nightwing-77

The changes look good overall; however, I'm a bit concerned about adding duplex execution logic directly inside the model executor, since it feels too model-specific. Can we explore a more generic approach instead? This could benefit many other models down the line, not just MiniCPM-o 4.5.
Can we build a wrapper that takes the model executor and manages session state and other responsibilities?

Signed-off-by: Sy03 <1370724210@qq.com>

… and restore per-segment streaming-input ingestion

Combines two fixes needed for MiniCPM-o 4.5 duplex on vllm-omni 0.22:

* Register the full thinker/talker architecture keys (MiniCPMO45OmniLLMForConditionalGeneration / MiniCPMO45OmniTTSForConditionalGeneration) and add a plain-chat (use_tts_template) tts_bos fallback so non-duplex chat-completions audio works: resolve <|tts_bos|> (151703) directly and bound the spoken region at <|im_end|> (151645).
* Restore per-segment streaming-input ingestion in _update_streaming_input_additional_info: read and accumulate the incoming per-segment model_intermediate_buffer (via streaming_accumulated_keys + torch.cat) instead of only resetting num_processed_tokens. A prior rebase had dropped this, starving duplex audio and producing garbled/doubled output.

Signed-off-by: linyueqian <linyueqian@outlook.com>

Add the continuous-duplex realtime web UI under examples/online_serving/minicpmo/realtime_web/: a browser client that streams mic audio to the duplex endpoint and plays back TTS, with selectable turn-detection (model-driven default, server_vad for other models), a voice picker, an interaction-mode toggle, a light/white theme, and 16 kHz anti-aliased mic capture for clean audio. Signed-off-by: linyueqian <linyueqian@outlook.com>

…uest/StreamingUpdate imports, typos) Signed-off-by: linyueqian <linyueqian@outlook.com>

… epochs The data-plane stage0 request id was epoch-scoped and barge_in() aborted all stage requests, so every turn/epoch advance started a fresh, context-less KV while the model helper still skipped re-prepending the system context, degenerating multi-turn output into token garbage. Make stage0 a single long-lived resumable request: its id is epoch-independent and barge_in() preserves the stage0 binding (only downstream stages are torn down), so conversation KV/context persists across turns/epochs as the topology already declares (stage0_long_lived_request). Signed-off-by: linyueqian <linyueqian@outlook.com>

…x sampler Mirror the official MiniCPM-o StreamDecoder.decode listen handling in the data-plane sampler: scale the listen-token logit and optionally force-keep listen only when it ranks within top-k. Defaults (1.0, None) preserve current behavior; tunable via MINICPMO45_LISTEN_PROB_SCALE / MINICPMO45_LISTEN_TOP_K for listen/speak balance. Signed-off-by: linyueqian <linyueqian@outlook.com>

…ation) Add an auto-response mode (session extra_body.auto_response/full_duplex) that runs per-chunk speak/listen generation continuously, matching the official duplex_generate loop, instead of waiting for an explicit response.create. Each ~chunk_period of streamed audio is emitted to the stage0 stream, and continuous chunks feed the ongoing stream rather than being routed through the discrete-response overlap/barge-in policy (explicit force_barge_in still interrupts). Signed-off-by: linyueqian <linyueqian@outlook.com>

…time web demo Full mode now requests server-side continuous auto-response (extra_body.auto_response) so the model speaks on its own. Turn mode no longer hangs on a model listen decision: response.listen / response.done reset the status. Signed-off-by: linyueqian <linyueqian@outlook.com>

NumberWan · 2026-06-10T07:38:17Z

May I know when the target merge date for this PR is? Is it scheduled?

…eaming

Two divergences from the official MiniCPMODuplexInference corrupted audio full-duplex output (degenerate / garbage transcripts vs the coherent official worker on identical input):

- _stage_prefill_embeddings_only re-emitted the assistant turn-open prefix (im_end + im_start assistant + tts_bos) on every audio chunk, re-opening the turn each chunk and producing repeated turn-initial greetings. The official feeds only <unit>+audio per chunk; the turn is opened once at session init and tts_bos/listen/turn_eos are model-generated. Drop the per-chunk prefix.
- _configure_streaming_processor used cnn_redundancy_ms=0, yielding 9 audio embed tokens/chunk vs the official's 10 (official duplex default is 20). This off-by-one misaligned the audio representation the model was trained on. Default to 20, and call processor.reset_streaming() at session init, mirroring the official init_streaming_processor (modeling_minicpmo_unified.py:207).

Verified against the official model on the same input: with these changes vllm's chunk-0 audio embeds match the official to 3 decimals (std 0.4339 vs 0.4336). A downstream LLM-forward/positions issue in the duplex data plane remains under investigation and is not addressed here.

… context budget

Follow-ups to the per-chunk assistant-prefix fix, aligning the MiniCPM-o 4.5 scheduler data-plane path with the official MiniCPMODuplexInference format:

- _stage_prefill_embeddings_only: prepend </unit> for chunks >= 1 so every unit is closed before the next <unit> opens (official finalize_unit feeds terminator + </unit>; the scheduler session update discards the sampled terminator, so only the closure is appended).
- preprocess: always place padding in front of the chunk embeddings. The appended duplex tokens occupy the tail of the request prompt and the runner schedules [num_computed_tokens, prompt_len); with the old suffix-split layout the audio embeds of chunks >= 1 landed outside the scheduled span and were never forwarded, so generation ran on pad tokens only. Keeping embeds last also puts the decode position right after the final audio embedding (official listen/speak decision point). Warn when the worker produces more embeddings than reserved slots instead of silently truncating.
- orchestrator/duplex: reserve extra scheduler token slots on the first append for the session context (system prompt + optional ref-audio embeddings); previously a long reference audio could overflow the chunk-0 budget and truncate the audio tail.
- _prepare_session_context: always emit the official <|im_start|>system\n{text}\n<|audio_start|>[ref]<|audio_end|><|im_end|> template, with or without reference audio.

Signed-off-by: Yueqian Lin <linyueqian@outlook.com>

… stale duplex tests

- _stage_ref_audio_embeddings: the split stage0 wrapper ports official get_audio_embedding(chunk_length=...) as get_audio_hidden_states, so the ref-audio path always fell into the streaming fallback. That truncated the reference audio to a single streaming chunk (~1 s of a 6 s prompt) and advanced the streaming mel/encoder state at session open, corrupting the first real audio chunk. Use the whole-clip encoder when available.
- duplex_scheduler_token_budget / first-append reserve: MiniCPM-o pools audio to one token per 100 ms, not 20 ms; the old 50 tok/s math reserved ~5x too many slots and filled the KV with hundreds of </unit> pad embeddings (451 of 482 chunk-0 positions measured on the dumped data plane). Use 1600 samples/token and tight margins.
- _prepare_session_context: keep the audio markers conditional on ref audio, matching MiniCPMODuplex.prepare() in the released checkpoint.
- tests: update stale duplex expectations (stage0 epoch-independent request id, barge-in preserving stage0, optional <|audio|> token, per-state session config, ref-audio stub signature, new budget numbers).

Verified by feeding the dumped vLLM chunk-0 embeddings through the official MiniCPMODuplex decoder: it reproduces the same degenerate logits the server samples, while the official pipeline on the same input yields listen logits of +12 vs garbage text at -10, isolating the bug to embed construction.

Signed-off-by: Yueqian Lin <linyueqian@outlook.com>
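
The budget arithmetic in this commit can be checked with a short worked example. It assumes 16 kHz audio, which is what makes 1600 samples equal 100 ms; the function names are hypothetical, and counting only whole pooled frames is an assumption of the sketch.

```python
SAMPLE_RATE_HZ = 16_000     # assumed capture rate (1600 samples == 100 ms)
SAMPLES_PER_TOKEN = 1_600   # from the commit message: one token per 100 ms


def audio_tokens(num_samples: int) -> int:
    """Pooled audio tokens for a clip, counting whole pooled frames only."""
    return num_samples // SAMPLES_PER_TOKEN


def old_estimate(num_samples: int) -> int:
    """The superseded 50 tok/s (one token per 20 ms) estimate."""
    return num_samples * 50 // SAMPLE_RATE_HZ


six_seconds = 6 * SAMPLE_RATE_HZ
```

For a 6 s reference clip this gives 60 tokens under the corrected math versus 300 under the old estimate, the 5x over-reservation the commit describes.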

Signed-off-by: Yueqian Lin <linyueqian@outlook.com>

…upt the model

Replaying the dumped vLLM chunk-0 data-plane embeddings through the official MiniCPMODuplex decoder isolates the final divergence: with the leading pad slots stripped the official decoder produces listen=16.25 (official pipeline: 16.38) on identical embeddings, while including the pad run yields the same degenerate logits the server samples. Any run of </unit> pad embeddings in the KV breaks the model, so scheduler slot reservations must match the worker-built embeddings exactly:

- serving (MiniCPMO45PcmAppendBuffer): emit only whole model chunks. The first emission is capped at one chunk (the worker's first unit consumes the official 1035 ms window); commit flushes zero-pad the tail to the chunk boundary (silence, in-distribution) instead of emitting partial chunks.
- serving adapter: trim reference audio to a whole number of pooled frames and precompute the exact session-context token count (shared template via MiniCPMO45DuplexPolicy.session_context_texts + samples/1600 pooling math) into duplex_first_append_context_tokens.
- engine: duplex_scheduler_token_budget returns the exact per-unit slot count (closure + <unit> + 10 audio embeddings) for whole-chunk payloads; duplex_first_append_context_reserve prefers the adapter-precomputed count; the orchestrator subtracts the absent closure slot on the first append.
- worker: _stage_prefill_embeddings_only consumes every complete chunk per append (multi-unit spans) so serving-side multi-chunk payloads stay exact.

Signed-off-by: Yueqian Lin <linyueqian@outlook.com>

Signed-off-by: Yueqian Lin <linyueqian@outlook.com>

…mit whole-chunk multi-payloads The first unit consumes the official ~1035 ms window, so a k-chunk first payload yields k-1 worker units (k>=2); cap-at-one-chunk emission stranded the rest of the committed turn at serving because the commit path flushes once. Emit all whole chunks in one payload instead and model the first-window consumption in the orchestrator's first-append slot budget. Signed-off-by: Yueqian Lin <linyueqian@outlook.com>
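
The whole-chunk emission rule can be sketched with a small buffer. `WholeChunkBuffer` is a hypothetical stand-in, not the PR's MiniCPMO45PcmAppendBuffer, and it operates on plain sample lists. `unit_slots` applies the per-unit count quoted in the earlier commit (closure + <unit> + 10 audio embeddings) and deliberately ignores the first-append adjustments.

```python
class WholeChunkBuffer:
    """Hypothetical stand-in for a PCM append buffer; works on sample lists."""

    def __init__(self, chunk_samples: int):
        self.chunk_samples = chunk_samples
        self.pending = []

    def append(self, samples):
        """Buffer samples and return every complete chunk as one payload."""
        self.pending.extend(samples)
        whole = len(self.pending) // self.chunk_samples * self.chunk_samples
        payload, self.pending = self.pending[:whole], self.pending[whole:]
        return payload

    def commit(self):
        """Flush the tail, zero-padded (silence) up to the chunk boundary."""
        if not self.pending:
            return []
        pad = self.chunk_samples - len(self.pending)
        payload, self.pending = self.pending + [0.0] * pad, []
        return payload


def unit_slots(num_chunks: int) -> int:
    """Scheduler slots for whole chunks: closure + <unit> + 10 audio embeds each."""
    return num_chunks * (1 + 1 + 10)
```

Because every payload is a whole number of chunks, the slot count computed on the serving side can match what the worker builds, which is the invariant these commits are after.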

With model-driven listen/speak restored, an explicit response.create can legitimately resolve to listen and produce no audio, which breaks the Realtime contract that response.create yields a response. Mark response-bound appends with force_speak and suppress only the listen token at the segment's decision step (official listen_prob_scale -> 0 semantics). Per-chunk full-duplex auto-response appends stay fully model-driven, and force_listen keeps precedence. Signed-off-by: Yueqian Lin <linyueqian@outlook.com>

A committed turn's tail was stranded: the first unit consumes the official ~1035 ms window, leaving up to a second of speech in the worker buffer with no following chunk to flush it, so the model decided on a partial question (and force_speak then produced an empty reply). On final appends the worker now builds exactly one extra unit - the zero-padded leftover if any, otherwise one full silence unit - matching the official post-turn silence beat at the decision step, and the scheduler budget reserves that unit. Signed-off-by: Yueqian Lin <linyueqian@outlook.com>

The official model answers across several units (the microphone keeps streaming silence while the assistant speaks); verified on the released checkpoint, a question is answered as 'I'm sorry, but / I can't answer / that question.' over three consecutive units. Turn-mode gave the model exactly one unit and stopped. While a response is still open after a segment finishes, serving now appends one silence unit at a time (capped) so the reply can complete, and force_speak suppresses the listen token at every step of response-bound segments, mirroring the official mid-turn listen -> tts_bos replacement. Signed-off-by: Yueqian Lin <linyueqian@outlook.com>

…t the continuation cap Forcing speak on the response-bound segment fires at the question's final unit, where the official model still listens (it answers one silence unit later with real content); the forced decision produced near-empty utterances. Keep response-bound segments model-driven and rely on the silence-continuation units for the official decision cadence, forcing speak only on the last unit before the continuation cap. Signed-off-by: Yueqian Lin <linyueqian@outlook.com>

…ce unit A model-driven listen on a response-bound segment closed the response immediately, so the continuation units never ran. While continuation budget remains, keep the response open and append the next silence unit as the model's decision point (official cadence: it often listens for a beat before answering). Signed-off-by: Yueqian Lin <linyueqian@outlook.com>

The official duplex format feeds the sampled terminator (listen/chunk_eos/turn_eos) + </unit> into the KV at every unit boundary, and the model's listen/speak policy conditions on its own past decisions. The scheduler session update discards the segment's final sampled token, so the KV never contained them and the model kept producing empty speak segments. The model sampler now records the terminator per duplex session (via a runner-published row -> session map) and the next append re-injects it ahead of the first unit closure; the scheduler budget reserves one extra slot for appends after the first. Signed-off-by: Yueqian Lin <linyueqian@outlook.com>

preprocess ran the duplex embedding path for every scheduling step of the data-plane request, including 1-token decode steps where token_offset is past the prompt: the prompt-embedding slice came up empty and was pad-filled, so every sampled token was forwarded as a </unit> embedding instead of itself. The model saw </unit> right after its own <|speak|> and terminated with empty utterances. Verified by replaying the dumped KV spans through the official decoder, which generates real text from the same state. Decode steps now use the normal embedding lookup of the sampled token ids. Signed-off-by: Yueqian Lin <linyueqian@outlook.com>

30 silence units kept turn-mode responses open too long and stalled subsequent turn-taking in the scenario flow; 8 s covers the official reply cadence (listen 1-2 units, speak 2-4 units). Signed-off-by: Yueqian Lin <linyueqian@outlook.com>
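
The silence-continuation cadence described in the last few commits can be sketched as a capped loop. This is an illustration only: `decide` is a hypothetical callback standing in for one silence-unit append, the decision labels are invented for the sketch, and the default cap of 8 assumes roughly one second per unit.

```python
def run_continuation(decide, cap_units=8):
    """Append silence units until the turn ends or the cap is reached.

    `decide(force_speak)` is a hypothetical callback standing in for one
    silence-unit append; it returns "listen", "speak", or "turn_end".
    """
    trace = []
    for unit in range(cap_units):
        force_speak = unit == cap_units - 1  # only the last unit is forced
        decision = decide(force_speak)
        trace.append(decision)
        if decision == "turn_end":
            break
    return trace
```

A model that listens for a beat and then answers over a few units finishes well inside the cap; a model that keeps listening is forced to speak only on the final unit.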

Signed-off-by: Yueqian Lin <linyueqian@outlook.com>

The generic stage-0 final-output message accompanies every duplex segment with cumulative thinker text and no audio; gating the empty flush on unit_end_of_turn pushed it into the text-without-audio error branch, flooding auto-respond clients with one error per chunk and displacing real decisions. Signed-off-by: Yueqian Lin <linyueqian@outlook.com>

Match the official per-chunk contract (exactly one result per audio chunk): duplex stage-0 segment boundaries no longer forward the raw thinker output as a final-output message. Listen decisions already flow via _emit_duplex_model_listen_output and spoken content via the talker stage; the extra message carried cumulative text with no audio and every downstream consumer had to filter it (mis-filtering caused either an error per chunk or silenced decisions). Reverts the converter-side text-without-audio suppression, no longer needed. Signed-off-by: Yueqian Lin <linyueqian@outlook.com>

The orchestrator stamps duplex_native_decision=listen/model_listen on model-listen segment outputs, but the converter only inspected completion token tails, which the wrapped listen output does not carry; listen decisions fell through to the text-without-audio error branch. Signed-off-by: Yueqian Lin <linyueqian@outlook.com>

Every duplex segment ends finished=True by design; exiting the drain after each delivered batch made every subsequent decision wait for the next append to start a fresh drain task, re-adding one chunk of latency per decision. Signed-off-by: Yueqian Lin <linyueqian@outlook.com>

…spond A stage-1 emission whose cumulative audio sliced to an empty delta can still carry delta text; in auto-respond mode that is normal streaming overlap, not a text-without-audio protocol error. Restores the guard removed in d73682f now that the listen-marker (f94ae50) and persistent drain (e30885d) fixes cover the cases it was blamed for. Signed-off-by: Yueqian Lin <linyueqian@outlook.com>

The official worker processes each chunk in one synchronous loop, so nothing can be lost between chunks. Our per-append cancel+restart of the data-plane drain task orphaned any decision arriving in the swap window (chunk 0 delivered, everything after raced). Keep one drain for the session's stable resumable request and skip the restart entirely. Signed-off-by: Yueqian Lin <linyueqian@outlook.com>
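
The single-drain design can be illustrated with asyncio. This is a sketch, not the PR's code: the queue, the `None` close sentinel, and the function names are assumptions. The point is that the drain task is created once per session and appends never cancel or restart it.

```python
import asyncio


async def drain(queue: asyncio.Queue, delivered: list) -> None:
    """Single long-lived drain: forward every result until a None sentinel."""
    while True:
        item = await queue.get()
        if item is None:
            return
        delivered.append(item)


async def session(num_appends: int) -> list:
    queue: asyncio.Queue = asyncio.Queue()
    delivered: list = []
    task = asyncio.create_task(drain(queue, delivered))  # started once
    for chunk in range(num_appends):
        # Each append only produces results; it never restarts the drain,
        # so there is no swap window in which a decision can be orphaned.
        await queue.put(f"decision-{chunk}")
    await queue.put(None)  # session close
    await task
    return delivered
```

Running `asyncio.run(session(3))` delivers all three decisions in order, including those produced after chunk 0.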

…ROFILE_LOGS Log every boundary of the auto-respond event path to localize where post-chunk-0 decisions are lost: append control result contents ([append-result]), drain task lifecycle ([drain-start], [drain]), pump routing with per-queue depth ([pump]), and collect-side request state census ([collect]). Signed-off-by: Yueqian Lin <linyueqian@outlook.com>

…trace Signed-off-by: Yueqian Lin <linyueqian@outlook.com>

…chet The model-listen wrapper carries the thinker's hidden states under the 'latent' mm key, which the audio encoder's key fallback treated as a waveform. Encoding it on the chunk-0 listen ratcheted the per-request cumulative audio offset to tokens*hidden_dim fake samples, so every later talker unit sliced to empty and was silently dropped by the auto-respond empty-audio guard: chunk-0 listen arrived, then the session went mute (audio only resurfaced once the real cumulative waveform outgrew the poisoned offset, ~16 units). Compute the native decision first and yield listen results before any audio work so listen batches never touch the offset. Signed-off-by: Yueqian Lin <linyueqian@outlook.com>

… batches A talker segment streams several cumulative-audio batches that all carry the same segment text; attaching it per batch re-delivered the text with every audio delta (official results carry per-unit deltas exactly once). Track delivered chars per request, attach only the unseen suffix, and reset at segment end so a genuinely repeated next segment still goes out. Signed-off-by: Yueqian Lin <linyueqian@outlook.com>

A segment whose finished batch slices to an empty audio delta never hit the in-branch reset, so the next segment's text was suffix-sliced against the previous segment and lost. Clear the per-request sent-segment text in the output iterator for every finished batch, and compare by content so a genuinely repeated next segment is still delivered. Signed-off-by: Yueqian Lin <linyueqian@outlook.com>

Browser session trace showed continuation units re-running the talker with the SAME handed text after finished=True (every engine segment ends finished, so finished does not partition text). The iterator-level reset re-attached the text once per continuation segment, duplicating the transcript ('of your day so far? of your day so far?') and re-opening a ghost bubble after end of turn. Compare by content only and replace the stored text when it actually changes; a verbatim-identical consecutive reply keeps its audio but not a second transcript copy (rare, accepted). Signed-off-by: Yueqian Lin <linyueqian@outlook.com>
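
The transcript de-duplication these three commits converge on can be sketched as a pure function. `unseen_text` and the `sent` dictionary are hypothetical; the sketch combines the unseen-suffix rule with the final compare-by-content rule, and is one reading of the described behavior rather than the actual implementation.

```python
def unseen_text(sent: dict, request_id: str, text: str) -> str:
    """Return the part of `text` not yet delivered for this request."""
    previous = sent.get(request_id, "")
    if text == previous:
        return ""                      # same content again: audio only
    sent[request_id] = text            # replace only when content changed
    if previous and text.startswith(previous):
        return text[len(previous):]    # cumulative growth: unseen suffix
    return text                        # different segment: deliver whole
```

With this rule a continuation unit that re-hands identical text yields an empty transcript delta, so the duplicated 'of your day so far?' case from the browser trace cannot recur.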

… handoff The resumable duplex prompt folds every earlier unit, so a <|tts_bos|> from an already-spoken reply can sit mid-prompt. The unbounded last-bos search re-sliced that stale region on text-less continuation units and re-handed already-spoken text to the talker, which re-synthesized it (official feeds a lone audio_bos for empty units and never re-feeds text). Restrict the search to the final prompt token (this unit's folded decision) and the current segment. Signed-off-by: Yueqian Lin <linyueqian@outlook.com>

Match the official MiniCPMODuplex talker conditioning, which our per-segment one-shot synthesis violated in five ways that together garbled the reply audio (correct text, mush between islands):

- One TTSStreamingGenerator per spoken turn with carried KV and text_start_pos, fed once per unit; reset only when <|turn_eos|> arrives. Previously every segment synthesized as an independent utterance from scratch.
- text_eos only at turn end (text_finished was True for every segment, stamping utterance-final prosody on each ~1s snippet).
- ~chunk_size codec tokens per unit instead of a 256+ token free-run, so audio length tracks the text again.
- Per-turn token2wav stream: ref-audio caches cloned once per turn with a [4218]*3 silence-token seed (stops ref-voice bleed at onsets) and overlapping pre_lookahead windows advancing chunk_size, flushed with last_chunk only at turn end. Previously each segment re-primed the caches from the ref wav and fed disjoint windows.
- Repetition penalty only for duplex TTS sampling (official constructs but never applies the top-p/top-k warpers).

Handoffs now accumulate in the runner streaming buffer (streaming_accumulated_keys, with list support) and the turn state consumes them by cursor, so a unit arriving mid-synthesis is queued instead of overwritten. llm2tts conditions on mid-unit <|speak|> tokens and includes <|turn_eos|> token+hidden (the trained stop signal), stamping its id in handoff meta for turn-end detection.

Signed-off-by: Yueqian Lin <linyueqian@outlook.com>

The first stream() call of each mode (mid-turn 25+pre_lookahead window and the last_chunk tail flush) costs ~20s of one-time compilation, which landed inside the first spoken turn of the first session (two ~20s stalls). Warm both modes against the default ref-audio caches when the profile/dummy run reaches the talker, mirroring the official demo's precompile step. Opt out with MINICPMO45_SKIP_T2W_WARMUP=1. Signed-off-by: Yueqian Lin <linyueqian@outlook.com>

…ndow A unit that sampled EOS early buffered fewer than window-size codec tokens, yielded no audio, and its chunk produced no client-visible result; the bridge's per-chunk lockstep then waited out its 20s timeout (the two ~20s first-reply stalls). Official pins min=max tokens per mid-turn unit with EOS at -inf; mirror it with a toggleable suppressor re-enabled per unit and lifted at turn end. Also keep the per-unit generate/vocode timing trace behind MINICPMO45_PROFILE_LOGS. Signed-off-by: Yueqian Lin <linyueqian@outlook.com>

…ity) The official demo forces the first force_listen_count units (default 3) to listen so the model never answers off one second of partial audio; we had no startup guard, so the model spoke at chunk 0/1 with confabulated content. Inject force_listen into the data-plane payload for the first N appends (configurable via extra_body.force_listen_count, 0 disables); the runner already applies it once per segment. Signed-off-by: Yueqian Lin <linyueqian@outlook.com>
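A rough sketch of the startup guard, assuming the data-plane payload is a dict and the flag is a boolean force_listen key. The ForceListenGuard class and the payload's other fields are hypothetical; the default of 3 and the meaning of 0 follow the commit message.

```python
from typing import Any, Dict


class ForceListenGuard:
    """Hypothetical per-session counter that forces the first N units to listen."""

    def __init__(self, force_listen_count: int = 3) -> None:
        # 0 (or a negative value) disables the guard entirely.
        self.remaining = max(0, force_listen_count)

    def annotate(self, payload: Dict[str, Any]) -> Dict[str, Any]:
        """Return a copy of the append payload, flagged while the guard is active."""
        out = dict(payload)
        if self.remaining > 0:
            out["force_listen"] = True
            self.remaining -= 1
        return out


guard = ForceListenGuard(force_listen_count=2)
flags = [
    guard.annotate({"audio": b"..."}).get("force_listen", False)
    for _ in range(3)
]

disabled = ForceListenGuard(force_listen_count=0)
disabled_flag = disabled.annotate({"audio": b"..."}).get("force_listen", False)
```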

Variable-size last_chunk windows hit a fresh ~20s vocoder compile per new token count, and an empty tail emitted no event for its chunk at all, starving the bridge's per-chunk lockstep into its 20s timeout at reply ends. Pad every tail to chunk_size+pre_lookahead silence tokens (one shape, warmed at startup; trailing silence at reply end is inaudible) and always emit the final waveform batch. Signed-off-by: Yueqian Lin <linyueqian@outlook.com>
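The fixed-shape tail can be sketched as a simple padding step. This is an illustration under assumptions: the helper name pad_tail is hypothetical, and using token id 4218 for the padding is inferred from the silence-token seed mentioned in an earlier commit message rather than confirmed by this one.

```python
from typing import List

SILENCE_TOKEN = 4218  # assumed to be the same silence id as the [4218]*3 seed


def pad_tail(
    tail: List[int],
    chunk_size: int,
    pre_lookahead: int,
    silence_token: int = SILENCE_TOKEN,
) -> List[int]:
    """Pad a turn-final tail to chunk_size + pre_lookahead tokens.

    Every tail then has one shape, so the vocoder compiles the tail path once.
    An empty tail becomes pure silence instead of producing no event.
    """
    target = chunk_size + pre_lookahead
    if len(tail) > target:
        raise ValueError("tail is longer than one window")
    return list(tail) + [silence_token] * (target - len(tail))


short_tail = pad_tail([11, 12], chunk_size=4, pre_lookahead=2)
empty_tail = pad_tail([], chunk_size=4, pre_lookahead=2)
```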

Clients that treat the audio.delta + transcript.delta pair as the per-unit completion signal (the official-demo bridge does) waited 20s at every reply end: the turn-end flush and deduplicated continuation units carry audio but no text, so no transcript followed, the pending unit never flushed, and the per-chunk lockstep timed out. Emit the transcript delta with an empty delta whenever audio was emitted. Signed-off-by: Yueqian Lin <linyueqian@outlook.com>
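The pairing rule can be sketched as follows. The event type strings are assumptions modeled on OpenAI Realtime-style naming, and events_for_unit is a hypothetical helper; the point is only that an audio delta is always followed by a transcript delta, even an empty one.

```python
from typing import Dict, List

AUDIO_EVENT = "response.audio.delta"
TRANSCRIPT_EVENT = "response.audio_transcript.delta"  # assumed event name


def events_for_unit(audio_b64: str, text: str) -> List[Dict[str, str]]:
    """Build the client-visible events for one unit."""
    events: List[Dict[str, str]] = []
    if audio_b64:
        events.append({"type": AUDIO_EVENT, "delta": audio_b64})
        # Always close the pair, even when the unit carries no new text
        # (turn-end flush, deduplicated continuation units).
        events.append({"type": TRANSCRIPT_EVENT, "delta": text})
    elif text:
        events.append({"type": TRANSCRIPT_EVENT, "delta": text})
    return events


flush_events = events_for_unit(audio_b64="AAAA", text="")
speak_events = events_for_unit(audio_b64="BBBB", text=" Hello")
silent_events = events_for_unit(audio_b64="", text="")
```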

…andoff Replaying our exact per-unit conditions through the OFFICIAL talker and vocoder reproduced the garble, exonerating our generator: the conditions themselves were missing ALTERNATING reply segments (decoded condition for 'This is a Chinese TV show called "The Legend of Qin Shi Huang".' was only ' TV show called' + ' of Qin' + '".') — the talker vocalized text it never received. Root cause: the runner's streaming-buffer update is not merge-safe for a resumable stage-1 request (in-place updates merge sub-keys, resume prefills REPLACE the buffer), so runner-side accumulation silently dropped segments and the consumer cursor then ate the head of each replacement. Accumulate in llm2tts instead (per-request bridge state, cleared on epoch reset) and hand the complete ids+hidden history every handoff, making downstream replace semantics lossless; drop the runner-side streaming_accumulated_keys to avoid double-merge. Signed-off-by: Yueqian Lin <linyueqian@outlook.com>
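Why handing over the complete history makes replace semantics lossless can be shown with a toy model. The HandoffAccumulator class, the payload keys, and the consume helper are hypothetical stand-ins for the per-request bridge state and the talker's cursor described in the commit message.

```python
from typing import Any, Dict, List, Tuple


class HandoffAccumulator:
    """Hypothetical per-request bridge state that keeps the full history."""

    def __init__(self) -> None:
        self._state: Dict[str, Dict[str, List[Any]]] = {}

    def handoff(
        self, request_id: str, ids: List[int], hidden: List[Any]
    ) -> Dict[str, List[Any]]:
        state = self._state.setdefault(request_id, {"ids": [], "hidden": []})
        state["ids"].extend(ids)
        state["hidden"].extend(hidden)
        # Hand over a copy of everything so far, not just the new segment.
        return {"ids": list(state["ids"]), "hidden": list(state["hidden"])}

    def reset(self, request_id: str) -> None:
        """Clear the history, e.g. on an epoch reset."""
        self._state.pop(request_id, None)


def consume(payload: Dict[str, List[Any]], cursor: int) -> Tuple[List[int], int]:
    """Read the ids past the cursor and return them with the new cursor."""
    new_ids = payload["ids"][cursor:]
    return new_ids, len(payload["ids"])


acc = HandoffAccumulator()
cursor = 0
seen: List[int] = []
for seg_ids, seg_hidden in [([1, 2], ["h1", "h2"]), ([3], ["h3"])]:
    # The downstream buffer is REPLACED by each payload, never merged.
    downstream = acc.handoff("req-1", seg_ids, seg_hidden)
    new_ids, cursor = consume(downstream, cursor)
    seen.extend(new_ids)
```

Because each payload is a superset of the previous one, the cursor stays valid across replacements and no segment is dropped.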

Two follow-ups to the accumulated-condition handoff: (1) a segment delta can start with several unit decisions (forced/model listens from chunks that produced no stage-1 handoff accumulate ahead of the speak), so the old output[0]!=listen check skipped the ENTIRE first speak segment of every reply (' This is a Chinese' / ' I think it's' never reached the talker); skip the leading listen run, then the speak decision. (2) the talker's consumed-cursor lived in the per-turn state and died at turn_eos, so the next reply re-read the whole accumulated history and re-synthesized the previous reply; keep the cursor per request, popped only when the request finishes. Signed-off-by: Yueqian Lin <linyueqian@outlook.com>
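The first follow-up amounts to stripping a run of listen decisions and then one speak decision from the front of the delta. A minimal sketch, with made-up token ids and a hypothetical helper name:

```python
from typing import List

LISTEN_ID = 900  # hypothetical id of the listen decision token
SPEAK_ID = 901   # hypothetical id of the <|speak|> decision token


def strip_leading_decisions(
    delta: List[int], listen_id: int = LISTEN_ID, speak_id: int = SPEAK_ID
) -> List[int]:
    """Drop the leading listen run, then one speak decision, from a delta.

    Only the front of the delta is touched; speak tokens that appear
    mid-unit are left in place.
    """
    i = 0
    while i < len(delta) and delta[i] == listen_id:
        i += 1
    if i < len(delta) and delta[i] == speak_id:
        i += 1
    return delta[i:]


queued_listens = strip_leading_decisions([LISTEN_ID, LISTEN_ID, SPEAK_ID, 10, 11])
plain_speak = strip_leading_decisions([SPEAK_ID, 10])
only_listens = strip_leading_decisions([LISTEN_ID, LISTEN_ID])
mid_unit = strip_leading_decisions([SPEAK_ID, 10, SPEAK_ID, 11])
```

A check that looks only at the first token would treat the first case as a listen segment and discard the text tokens 10 and 11, which matches the symptom described in (1).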

A segment delta's leading decision tokens can also be folded into the resumable prompt, so prompt_len + delta over-counts them and the front-aligned hidden indexing truncated each reply's first segment to its first token (' This [is a Chinese]' lost). The hidden tensor's last len(delta) rows are the delta's rows; index from the end. Signed-off-by: Yueqian Lin <linyueqian@outlook.com>
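Indexing from the end can be sketched with plain lists standing in for the hidden tensor's rows. The helper name is hypothetical; the explicit start index avoids the Python pitfall that a slice starting at -0 returns the whole sequence.

```python
from typing import Any, List


def delta_hidden_rows(hidden: List[Any], delta: List[int]) -> List[Any]:
    """Return the hidden rows that belong to the delta tokens.

    The last len(delta) rows are the delta's rows, regardless of how many
    leading tokens were folded into the prompt.
    """
    n = len(delta)
    if n > len(hidden):
        raise ValueError("delta is longer than the hidden history")
    # hidden[-n:] would return everything when n == 0, so index explicitly.
    return hidden[len(hidden) - n:]


hidden_rows = ["p0", "p1", "p2", "d0", "d1"]
two_token_rows = delta_hidden_rows(hidden_rows, [7, 8])
empty_rows = delta_hidden_rows(hidden_rows, [])
```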

Sy0307 · 2026-06-12T16:32:51Z

May I know when the target merge date for this PR is? Is it scheduled?

Not scheduled. @linyueqian and I are working on it.

linyueqian · 2026-06-12T16:45:08Z

For anyone who wants to drive this PR from the official browser demo: the adapter we use is examples/online_serving/minicpmo/official_demo_bridge_worker.py. It implements the worker surface the https://github.com/OpenBMB/MiniCPM-o-Demo gateway expects (GET /health + WS /ws/duplex) and translates the official per-chunk duplex protocol onto vLLM-Omni's /v1/realtime?duplex=1 endpoint in full-duplex (extra_body.auto_response) mode, so the stock prebuilt web frontend works unchanged.

Quick start:

# 1. vLLM duplex server (2 GPUs)
vllm-omni serve <MiniCPM-o-4_5 path> --omni \
  --stage-configs-path vllm_omni/model_executor/stage_configs/minicpmo45_2gpu_streaming.yaml \
  --trust-remote-code --port 8099

# 2. Bridge worker
python examples/online_serving/minicpmo/official_demo_bridge_worker.py --port 22500 \
  --vllm-ws "ws://localhost:8099/v1/realtime?duplex=1" --model <MiniCPM-o-4_5 path>

# 3. Official demo gateway (from OpenBMB/MiniCPM-o-Demo)
python gateway.py --http --workers localhost:22500

Then open the demo web UI and talk. The session runs in the official semantics: the model decides listen/speak per ~1s unit, results carry per-unit delta text and audio, and end of turn fires once per reply.

The prebuilt MiniCPM-o-Demo frontend schedules playback assuming every per-chunk result carries exactly one second of 24 kHz audio (listens are silence, non-final speak units are left-padded, only the turn-final result is an unpadded tail). Forwarding variable-size generation deltas with no audio on listens made the frontend drift and clip even though the delivered bytes were verbatim-clear. Buffer reply samples in the bridge and emit one paced unit per result. Signed-off-by: Yueqian Lin <linyueqian@outlook.com>
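The pacing rule can be sketched as a small sample buffer. This is an illustration, not the bridge's code: the PacedAudioBuffer class and its method names are hypothetical, samples are plain floats, and returning the whole remaining buffer as the turn-final tail is an assumption.

```python
from typing import List

SAMPLE_RATE = 24000  # one paced unit is one second of 24 kHz audio


class PacedAudioBuffer:
    """Hypothetical bridge-side buffer that emits one paced unit per result."""

    def __init__(self, unit_samples: int = SAMPLE_RATE) -> None:
        self.unit_samples = unit_samples
        self.samples: List[float] = []

    def add(self, new_samples: List[float]) -> None:
        self.samples.extend(new_samples)

    def next_unit(self, is_listen: bool, turn_final: bool = False) -> List[float]:
        n = self.unit_samples
        if is_listen:
            return [0.0] * n  # listens are silence
        if turn_final:
            tail, self.samples = self.samples, []
            return tail  # unpadded tail
        take, self.samples = self.samples[:n], self.samples[n:]
        return [0.0] * (n - len(take)) + take  # left-pad short units


buf = PacedAudioBuffer(unit_samples=4)  # tiny unit for illustration
buf.add([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
full_unit = buf.next_unit(is_listen=False)
listen_unit = buf.next_unit(is_listen=True)
padded_unit = buf.next_unit(is_listen=False)
buf.add([7.0])
final_unit = buf.next_unit(is_listen=False, turn_final=True)
```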

lishunyang12 · 2026-06-14T07:34:21Z

May I know when the target merge date for this PR is? Is it scheduled?

Not scheduled. @linyueqian and I are working on it.

This is expected to be ready before the next major release. I will have more time starting now and will review it thoroughly, as it introduces several structural changes designed to serve as an abstraction for easily integrating full-duplex type models. From a higher-level perspective, I think the abstraction should avoid being too model-specific, and we should also summarize the architectural trends of recent full-duplex models.

hsliuustc0106 mentioned this pull request May 30, 2026

[RFC] Full-Duplex Session Architecture for vLLM-OMNI #3745

Open

Nightwing-77 reviewed May 31, 2026

linyueqian force-pushed the sy03/minicpmo45-duplex-runtime branch from ef97974 to dfff075 Compare June 8, 2026 23:36

Sy0307 and others added 3 commits June 8, 2026 16:44

[Duplex] Add MiniCPM-o 4.5 realtime runtime

a5bcbd1

Signed-off-by: Sy03 <1370724210@qq.com>

linyueqian force-pushed the sy03/minicpmo45-duplex-runtime branch from dfff075 to 47a93f5 Compare June 8, 2026 23:56

style(minicpmo): satisfy pre-commit (ruff unused-import + missing Req…

42f5e9a

…uest/StreamingUpdate imports, typos) Signed-off-by: linyueqian <linyueqian@outlook.com>

linyueqian force-pushed the sy03/minicpmo45-duplex-runtime branch from 47a93f5 to 42f5e9a Compare June 9, 2026 00:07

linyueqian added 4 commits June 9, 2026 11:24

linyueqian added 15 commits June 11, 2026 05:10

test(duplex): update token budget expectations for 10 tok/s rate

c7c9642

Signed-off-by: Yueqian Lin <linyueqian@outlook.com>

fix(duplex): drain commit flush in whole-chunk appends

947fe29

Signed-off-by: Yueqian Lin <linyueqian@outlook.com>

linyueqian added 18 commits June 11, 2026 15:31

chore(duplex): bridge event logs at info level

4fa0e5e

Signed-off-by: Yueqian Lin <linyueqian@outlook.com>

chore(duplex): log raw audio samples and offset ratchet in converter …

3b36217

…trace Signed-off-by: Yueqian Lin <linyueqian@outlook.com>

linyueqian added 5 commits June 12, 2026 02:39

lishunyang12 self-assigned this Jun 14, 2026

Sy0307 wants to merge 54 commits into vllm-project:main from Sy0307:sy03/minicpmo45-duplex-runtime

5 participants
