[WIP] [Full Duplex] Feat: Support Full-Duplex realtime runtime & add MiniCPM-o 4.5 demo #3907

Draft
Sy0307 wants to merge 54 commits into vllm-project:main from Sy0307:sy03/minicpmo45-duplex-runtime

Conversation

@Sy0307 Sy0307 commented May 27, 2026

Collaborator

NOTICE: This PR is still WIP.

Refs #3745. Stacked on #3642 (tc-mb:Support-MiniCPM-o-4.5).

This PR targets main so it can be opened against the upstream repository. Until #3642 is merged, the diff also includes the base MiniCPM-o 4.5 model-support changes from #3642. The full-duplex/realtime work is in the follow-up signed-off commit 32ab95b2ffdb982fefaa0bbffff5efafdf8a175e.

Purpose

This PR adds a MiniCPM-o 4.5 native full-duplex realtime runtime path for the architecture discussed in #3745.

The base model support comes from #3642. This PR extends that baseline from normal staged MiniCPM-o 4.5 serving into a session-oriented audio streaming path:

client websocket
  -> /v1/duplex or /v1/realtime?duplex=1
  -> duplex session actor / event adapter
  -> AsyncOmniEngine duplex data plane
  -> Stage0 MiniCPM-o listen/speak decode
  -> Stage0-to-Stage1 handoff payload
  -> Stage1 MiniCPM-o TTS / token2wav
  -> realtime audio delta / done events

The goal is not just a smoke test around the existing chat endpoint. The implementation wires a real audio-in -> Stage0 -> Stage1 -> audio-out loop and covers the core realtime control cases needed by the current demo: streaming input append, model listen/speak decisions, audio response streaming, cancel/barge-in, overlap handling, playback ack, and conversation item emission.

RFC Alignment

Implemented from the #3745 full-duplex direction:

  • Session-scoped duplex state: session id, response id, epoch, playback cursor, active response state, and close/cancel lifecycle are tracked explicitly instead of being hidden behind one-off request state.
  • Independent serving-side input/output flow: websocket input handling no longer has to treat every input append as a blocking single request-response turn. Cancel/barge-in can be observed while output is active.
  • Realtime event adapter: /v1/realtime?duplex=1 maps the main OpenAI Realtime-style events used by the demo, including session.update, input_audio_buffer.append, input_audio_buffer.commit, response.create, response.cancel, response.audio.delta, response.audio.done, response.done, and conversation item events.
  • Duplex data-plane integration: audio append and stage handoff use the engine/orchestrator/scheduler/worker path instead of relying on a fake chat-completion request as the only control surface.
  • Stage-native MiniCPM-o 4.5 runtime: Stage0 uses the model's audio streaming path and listen/speak policy; Stage1 consumes the handoff payload and emits TTS audio chunks.
  • Overlap and barge-in policy: input arriving while assistant audio is active is handled through an overlap policy path, not blindly treated as an unconditional cancel.
  • Playback-aware memory boundary: playback ack is represented in session state and is used to commit played assistant content instead of assuming every emitted byte has necessarily entered conversation memory.
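
A minimal client-side sketch of the event sequence named in the list above, for illustration only. It assumes the adapter accepts the OpenAI Realtime convention of a base64-encoded `audio` field on `input_audio_buffer.append`; the helper names (`append_event`, `simple_event`, `turn_events`) are hypothetical and not part of this PR.

```python
import base64
import json


def append_event(pcm16_bytes: bytes) -> str:
    """Build an input_audio_buffer.append event (base64 `audio` field is an assumption)."""
    return json.dumps({
        "type": "input_audio_buffer.append",
        "audio": base64.b64encode(pcm16_bytes).decode("ascii"),
    })


def simple_event(event_type: str) -> str:
    """Build an event that carries only a type, such as commit or response.create."""
    return json.dumps({"type": event_type})


def turn_events(chunks):
    """Serialize one explicit turn: append every chunk, commit, then request a response."""
    events = [append_event(chunk) for chunk in chunks]
    events.append(simple_event("input_audio_buffer.commit"))
    events.append(simple_event("response.create"))
    return events
```

Each returned string would be sent as one websocket text frame to `/v1/realtime?duplex=1`; the transport itself is left out of the sketch.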

Intentional differences or improvements versus the initial RFC sketch:

  • The implementation keeps the MiniCPM-o 4.5 policy model-specific instead of pretending all models have the same duplex token and TTS handoff semantics.
  • Control-plane events such as open, close, cancel, and signal remain explicit, while high-volume audio/stage payloads are moved toward the data-plane path.
  • Persistent core KV lease is kept out of this PR by design; resumable/session state is used where available, but this PR does not claim the full scheduler-owned KV lease lifecycle.
  • The Realtime endpoint is introduced as an adapter over the native duplex runtime, so the model-specific path can be validated before claiming full byte-perfect OpenAI Realtime compatibility.

Technical Changes

Serving and protocol:

  • Add duplex protocol objects for session config, runtime capability, playback cursor, overlap policy, and data-plane result handling.
  • Add /v1/duplex websocket serving for the native duplex protocol.
  • Add /v1/realtime?duplex=1 websocket serving for the Realtime-compatible adapter path.
  • Add Realtime audio format conversion, including pcm16 client input to MiniCPM-o native pcm_f32le input.
  • Add response lifecycle emission for created, audio delta, audio done, output item/content part lifecycle, done, cancel, and close.
  • Add playback ack handling and assistant-history commit behavior.
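
The pcm16 to pcm_f32le conversion mentioned above can be sketched as follows. This is not the PR's implementation: the function name `pcm16_to_f32le` is hypothetical, and dividing by 32768 is one common normalization choice that the PR may or may not use.

```python
import struct


def pcm16_to_f32le(pcm16: bytes) -> bytes:
    """Convert little-endian signed 16-bit PCM to little-endian float32 in [-1.0, 1.0)."""
    if len(pcm16) % 2:
        raise ValueError("pcm16 payload must contain whole 16-bit samples")
    count = len(pcm16) // 2
    samples = struct.unpack(f"<{count}h", pcm16)
    # Assumed normalization: map the int16 range onto [-1.0, 1.0).
    return struct.pack(f"<{count}f", *(s / 32768.0 for s in samples))
```

Rejecting odd-length payloads up front avoids silently dropping half a sample at a chunk boundary.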

Engine / orchestrator / scheduler:

  • Add duplex data/control messages through AsyncOmniEngine and StagePool.
  • Route duplex append/signal/close results back to serving instead of swallowing runtime failures.
  • Add segment finish handling so Stage0 chunk boundaries can trigger Stage0 -> Stage1 forwarding.
  • Carry Stage0 -> Stage1 handoff payloads through the scheduler/orchestrator path rather than treating the whole runtime as a serving-only adapter.
  • Add model intermediate buffer helpers for duplex payloads so hidden states, token ids, and metadata are not passed as ad-hoc unrelated fields.

MiniCPM-o 4.5 runtime:

  • Add MiniCPM-o 4.5 duplex runtime/policy code for streaming audio append, listen/speak token handling, chunk eos handling, Stage0 result parsing, Stage1 TTS handoff, and token2wav output.
  • Reuse MiniCPM-o processor/audio/TTS components while isolating duplex session state from generic serving state.
  • Add support for multi-chunk and multi-turn session continuation in the native realtime path.
  • Add model-specific safeguards around unsupported modes and stage role/topology reporting.

Examples and configs:

  • Add MiniCPM-o 4.5 realtime duplex demo script.
  • Add streaming/stage-replica configs used by the MiniCPM-o 4.5 duplex path.
  • Update MiniCPM-o example documentation for the native duplex/realtime entrypoint.

Tests:

  • Add focused unit coverage for duplex protocol objects, serving handler behavior, runtime control result propagation, engine/orchestrator routing, worker native hooks, MiniCPM-o stage input processing, and Realtime event handling.

Current Verified Behavior

The latest controlled remote E2E covers the important demo path:

  • /v1/realtime?duplex=1 session creation.
  • pcm16 Realtime audio input conversion into MiniCPM-o native audio append.
  • streaming audio input commit.
  • Stage0 MiniCPM-o listen/speak decision.
  • Stage0 segment finish and Stage0 -> Stage1 handoff.
  • Stage1 TTS audio delta output.
  • response.audio.delta, response.audio.done, and response.done emission.
  • in-flight cancel / barge-in with stale epoch output filtered.
  • overlap listen case where short input does not incorrectly cancel the current response.
  • playback ack and committed assistant history accounting.
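
The stale-epoch filtering checked above can be illustrated with a small sketch. It assumes that a barge-in bumps a per-session epoch and that each output is stamped with the epoch it was produced under; `SessionState`, `AudioDelta`, and `filter_stale` are hypothetical names, not the PR's classes.

```python
from dataclasses import dataclass


@dataclass
class SessionState:
    epoch: int = 0

    def barge_in(self) -> None:
        # A cancel or barge-in invalidates everything produced under the old epoch.
        self.epoch += 1


@dataclass(frozen=True)
class AudioDelta:
    epoch: int
    payload: bytes


def filter_stale(state: SessionState, deltas):
    """Keep only deltas stamped with the session's current epoch."""
    return [d for d in deltas if d.epoch == state.epoch]
```

Under this scheme, `stale_audio_delta_count=0` in the E2E signals would mean that no delta from a superseded epoch reached the client.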

Known Boundaries / Not Claimed Complete

This PR is a substantial step toward the #3745 RFC, but it should not be reviewed as a final production-complete full-duplex core. Known remaining work:

  • Persistent core KV lease: not implemented in this PR. Resumable/session state should not be interpreted as a full scheduler-owned KV lease with allocation, rollback, migration, and release semantics.
  • One long-lived request per stage: the implementation is closer to scheduler-managed resumable duplex data-plane requests, but it is not yet the final RFC stage actor lifecycle for every stage.
  • Byte-perfect OpenAI Realtime compatibility: the main demo event path is implemented, but the full Realtime schema surface is not complete.
  • Multi-session / multi-replica production policy: happy-path behavior has evidence, but admission control, replica binding, failure recovery, and fairness still need broader validation.
  • Playback-to-history precision: playback ack is represented and used, but exact token/audio alignment remains model- and mark-resolution dependent.
  • Long-duration natural conversation quality: the E2E proves the runtime path and key control semantics; it is not a claim that long-running turn-taking quality is fully tuned.

Test Plan

  • Ruff lint/format on the changed Python files.
  • git diff --check.
  • Targeted py_compile for the duplex/realtime serving and MiniCPM-o native runtime files.
  • Remote H20 MiniCPM-o 4.5 realtime duplex E2E on /v1/realtime?duplex=1.
  • Full CI / broader model matrix, to be covered by CI and follow-up validation.

Test Result

Local checks:

ruff check <changed-python-files>
ruff format --check <changed-python-files>
git diff --check
python3 -m py_compile <duplex-and-minicpmo-runtime-files>

E2E:

  • Remote server task: 54cf2229
  • E2E task: f33584ae
  • Result: returncode=0, ok=true

Key E2E signals:

overlap_listen=true
overlap_barge_in=true
short_ack_cancelled=false
model_listen_policy_observed=true
model_speak_event_ok=true
playback_commit_ok=true
playback_history_committed_count=1
stale_audio_delta_count=0
response.audio.delta=10
response.audio.done=2
response.output_audio.delta=0
response.output_audio.done=0
error=0

Server log audit for the E2E run:

ERROR=0
Traceback=0
RuntimeError=0
ValueError=0
DynamicCache=0
runtime_append_failed=0
Exception in ASGI=0
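
The audit above amounts to counting log lines that contain each marker. A minimal sketch of such a count, assuming a plain-text server log; the `audit_log` helper is hypothetical, and the PR does not state which tool produced the numbers.

```python
PATTERNS = [
    "ERROR",
    "Traceback",
    "RuntimeError",
    "ValueError",
    "DynamicCache",
    "runtime_append_failed",
    "Exception in ASGI",
]


def audit_log(lines):
    """Return, for each marker, the number of lines that contain it."""
    counts = {pattern: 0 for pattern in PATTERNS}
    for line in lines:
        for pattern in PATTERNS:
            if pattern in line:
                counts[pattern] += 1
    return counts
```

A clean run corresponds to every count being zero, for example `all(v == 0 for v in audit_log(open(path)).values())` for some log `path`.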



cc @lishunyang12 @linyueqian @vklimkov-nvidia @Sy0307 @tc-mb @Gaohan123 @amy-why-3459 @TKONIY @yinpeiqi

Can we convert this model-specific duplex runtime into a generic, reusable one?

@Nightwing-77 Nightwing-77 left a comment

The changes look good overall. However, I'm a bit concerned about adding duplex execution logic directly inside the model executor; it feels too model-specific. Can we explore a more generic approach instead? This could benefit many other models down the line, not just MiniCPM-o 4.5.
Could we build a wrapper that takes the model executor and manages session state and other responsibilities?

@linyueqian linyueqian force-pushed the sy03/minicpmo45-duplex-runtime branch from ef97974 to dfff075 Compare June 8, 2026 23:36
Sy0307 and others added 3 commits June 8, 2026 16:44
Signed-off-by: Sy03 <1370724210@qq.com>
… and restore per-segment streaming-input ingestion

Combines two fixes needed for MiniCPM-o 4.5 duplex on vllm-omni 0.22:

* Register the full thinker/talker architecture keys
  (MiniCPMO45OmniLLMForConditionalGeneration /
  MiniCPMO45OmniTTSForConditionalGeneration) and add a plain-chat
  (use_tts_template) tts_bos fallback so non-duplex chat-completions
  audio works: resolve <|tts_bos|> (151703) directly and bound the
  spoken region at <|im_end|> (151645).

* Restore per-segment streaming-input ingestion in
  _update_streaming_input_additional_info: read and accumulate the
  incoming per-segment model_intermediate_buffer (via
  streaming_accumulated_keys + torch.cat) instead of only resetting
  num_processed_tokens. A prior rebase had dropped this, starving
  duplex audio and producing garbled/doubled output.

Signed-off-by: linyueqian <linyueqian@outlook.com>
Add the continuous-duplex realtime web UI under
examples/online_serving/minicpmo/realtime_web/: a browser client that
streams mic audio to the duplex endpoint and plays back TTS, with
selectable turn-detection (model-driven default, server_vad for other
models), a voice picker, an interaction-mode toggle, a light/white
theme, and 16 kHz anti-aliased mic capture for clean audio.

Signed-off-by: linyueqian <linyueqian@outlook.com>
@linyueqian linyueqian force-pushed the sy03/minicpmo45-duplex-runtime branch from dfff075 to 47a93f5 Compare June 8, 2026 23:56
…uest/StreamingUpdate imports, typos)

Signed-off-by: linyueqian <linyueqian@outlook.com>
@linyueqian linyueqian force-pushed the sy03/minicpmo45-duplex-runtime branch from 47a93f5 to 42f5e9a Compare June 9, 2026 00:07
… epochs

The data-plane stage0 request id was epoch-scoped and barge_in() aborted all stage requests, so every turn/epoch advance started a fresh, context-less KV while the model helper still skipped re-prepending the system context, degenerating multi-turn output into token garbage. Make stage0 a single long-lived resumable request: its id is epoch-independent and barge_in() preserves the stage0 binding (only downstream stages are torn down), so conversation KV/context persists across turns/epochs as the topology already declares (stage0_long_lived_request).

Signed-off-by: linyueqian <linyueqian@outlook.com>
…x sampler

Mirror the official MiniCPM-o StreamDecoder.decode listen handling in the data-plane sampler: scale the listen-token logit and optionally force-keep listen only when it ranks within top-k. Defaults (1.0, None) preserve current behavior; tunable via MINICPMO45_LISTEN_PROB_SCALE / MINICPMO45_LISTEN_TOP_K for listen/speak balance.

Signed-off-by: linyueqian <linyueqian@outlook.com>
…ation)

Add an auto-response mode (session extra_body.auto_response/full_duplex) that runs per-chunk speak/listen generation continuously, matching the official duplex_generate loop, instead of waiting for an explicit response.create. Each ~chunk_period of streamed audio is emitted to the stage0 stream, and continuous chunks feed the ongoing stream rather than being routed through the discrete-response overlap/barge-in policy (explicit force_barge_in still interrupts).

Signed-off-by: linyueqian <linyueqian@outlook.com>
…time web demo

Full mode now requests server-side continuous auto-response (extra_body.auto_response) so the model speaks on its own. Turn mode no longer hangs on a model listen decision: response.listen / response.done reset the status.

Signed-off-by: linyueqian <linyueqian@outlook.com>

NumberWan commented Jun 10, 2026

Contributor

May I know when the target merge date for this PR is? Is it scheduled?

…eaming

Two divergences from the official MiniCPMODuplexInference corrupted audio
full-duplex output (degenerate / garbage transcripts vs the coherent official
worker on identical input):

- _stage_prefill_embeddings_only re-emitted the assistant turn-open prefix
  (im_end + im_start assistant + tts_bos) on every audio chunk, re-opening the
  turn each chunk and producing repeated turn-initial greetings. The official
  feeds only <unit>+audio per chunk; the turn is opened once at session init and
  tts_bos/listen/turn_eos are model-generated. Drop the per-chunk prefix.

- _configure_streaming_processor used cnn_redundancy_ms=0, yielding 9 audio embed
  tokens/chunk vs the official's 10 (official duplex default is 20). This off-by-one
  misaligned the audio representation the model was trained on. Default to 20, and
  call processor.reset_streaming() at session init, mirroring the official
  init_streaming_processor (modeling_minicpmo_unified.py:207).

Verified against the official model on the same input: with these changes vllm's
chunk-0 audio embeds match the official to 3 decimals (std 0.4339 vs 0.4336).
A downstream LLM-forward/positions issue in the duplex data plane remains under
investigation and is not addressed here.
… context budget

Follow-ups to the per-chunk assistant-prefix fix, aligning the MiniCPM-o 4.5
scheduler data-plane path with the official MiniCPMODuplexInference format:

- _stage_prefill_embeddings_only: prepend </unit> for chunks >= 1 so every
  unit is closed before the next <unit> opens (official finalize_unit feeds
  terminator + </unit>; the scheduler session update discards the sampled
  terminator, so only the closure is appended).
- preprocess: always place padding in front of the chunk embeddings. The
  appended duplex tokens occupy the tail of the request prompt and the runner
  schedules [num_computed_tokens, prompt_len); with the old suffix-split
  layout the audio embeds of chunks >= 1 landed outside the scheduled span
  and were never forwarded, so generation ran on pad tokens only. Keeping
  embeds last also puts the decode position right after the final audio
  embedding (official listen/speak decision point). Warn when the worker
  produces more embeddings than reserved slots instead of silently truncating.
- orchestrator/duplex: reserve extra scheduler token slots on the first
  append for the session context (system prompt + optional ref-audio
  embeddings); previously a long reference audio could overflow the chunk-0
  budget and truncate the audio tail.
- _prepare_session_context: always emit the official
  <|im_start|>system\n{text}\n<|audio_start|>[ref]<|audio_end|><|im_end|>
  template, with or without reference audio.

Signed-off-by: Yueqian Lin <linyueqian@outlook.com>
… stale duplex tests

- _stage_ref_audio_embeddings: the split stage0 wrapper ports official
  get_audio_embedding(chunk_length=...) as get_audio_hidden_states, so the
  ref-audio path always fell into the streaming fallback. That truncated the
  reference audio to a single streaming chunk (~1 s of a 6 s prompt) and
  advanced the streaming mel/encoder state at session open, corrupting the
  first real audio chunk. Use the whole-clip encoder when available.
- duplex_scheduler_token_budget / first-append reserve: MiniCPM-o pools audio
  to one token per 100 ms, not 20 ms; the old 50 tok/s math reserved ~5x too
  many slots and filled the KV with hundreds of </unit> pad embeddings (451 of
  482 chunk-0 positions measured on the dumped data plane). Use 1600
  samples/token and tight margins.
- _prepare_session_context: keep the audio markers conditional on ref audio,
  matching MiniCPMODuplex.prepare() in the released checkpoint.
- tests: update stale duplex expectations (stage0 epoch-independent request
  id, barge-in preserving stage0, optional <|audio|> token, per-state session
  config, ref-audio stub signature, new budget numbers).

Verified by feeding the dumped vLLM chunk-0 embeddings through the official
MiniCPMODuplex decoder: it reproduces the same degenerate logits the server
samples, while the official pipeline on the same input yields listen logits
of +12 vs garbage text at -10, isolating the bug to embed construction.

Signed-off-by: Yueqian Lin <linyueqian@outlook.com>
Signed-off-by: Yueqian Lin <linyueqian@outlook.com>
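
The budget arithmetic in the commit message above can be checked with a short sketch. It assumes 16 kHz input audio (consistent with 1600 samples per 100 ms token) and floor division to whole pooled frames; the names are hypothetical and this is not the engine's `duplex_scheduler_token_budget`.

```python
SAMPLE_RATE_HZ = 16_000        # assumption: 16 kHz input audio
SAMPLES_PER_TOKEN = 1_600      # one pooled audio token per 100 ms (from the commit message)
OLD_SAMPLES_PER_TOKEN = 320    # the earlier 20 ms per token (50 tok/s) assumption


def audio_tokens(num_samples: int, samples_per_token: int = SAMPLES_PER_TOKEN) -> int:
    """Number of whole pooled audio tokens covered by num_samples."""
    return num_samples // samples_per_token


ref_samples = 6 * SAMPLE_RATE_HZ                              # a 6 s reference clip
new_budget = audio_tokens(ref_samples)                        # 60 tokens
old_budget = audio_tokens(ref_samples, OLD_SAMPLES_PER_TOKEN) # 300 tokens
```

The factor of five between the two results is the over-reservation the commit describes; the surplus slots are what ended up filled with pad embeddings.
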
…upt the model

Replaying the dumped vLLM chunk-0 data-plane embeddings through the official
MiniCPMODuplex decoder isolates the final divergence: with the leading pad
slots stripped the official decoder produces listen=16.25 (official pipeline:
16.38) on identical embeddings, while including the pad run yields the same
degenerate logits the server samples. Any run of </unit> pad embeddings in
the KV breaks the model, so scheduler slot reservations must match the
worker-built embeddings exactly:

- serving (MiniCPMO45PcmAppendBuffer): emit only whole model chunks. The
  first emission is capped at one chunk (the worker's first unit consumes the
  official 1035 ms window); commit flushes zero-pad the tail to the chunk
  boundary (silence, in-distribution) instead of emitting partial chunks.
- serving adapter: trim reference audio to a whole number of pooled frames
  and precompute the exact session-context token count (shared template via
  MiniCPMO45DuplexPolicy.session_context_texts + samples/1600 pooling math)
  into duplex_first_append_context_tokens.
- engine: duplex_scheduler_token_budget returns the exact per-unit slot count
  (closure + <unit> + 10 audio embeddings) for whole-chunk payloads;
  duplex_first_append_context_reserve prefers the adapter-precomputed count;
  the orchestrator subtracts the absent closure slot on the first append.
- worker: _stage_prefill_embeddings_only consumes every complete chunk per
  append (multi-unit spans) so serving-side multi-chunk payloads stay exact.

Signed-off-by: Yueqian Lin <linyueqian@outlook.com>
Signed-off-by: Yueqian Lin <linyueqian@outlook.com>
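
A minimal sketch of the whole-chunk emission rule described in the commit message above: appends release only complete chunks, and a commit zero-pads the leftover to the chunk boundary. The class name `WholeChunkPcmBuffer` and the default of 16000 samples per chunk are assumptions for illustration, not the actual `MiniCPMO45PcmAppendBuffer`; the sketch also emits every whole chunk at once rather than capping the first emission at one chunk.

```python
class WholeChunkPcmBuffer:
    def __init__(self, chunk_samples: int = 16_000):
        self.chunk_samples = chunk_samples
        self._pending: list[float] = []

    def append(self, samples):
        """Buffer samples and return only the complete chunks accumulated so far."""
        self._pending.extend(samples)
        whole = len(self._pending) // self.chunk_samples * self.chunk_samples
        out, self._pending = self._pending[:whole], self._pending[whole:]
        return out

    def commit(self):
        """Flush the leftover, zero-padded (silence) up to one full chunk."""
        if not self._pending:
            return []
        # After append(), the leftover is always shorter than one chunk.
        pad = self.chunk_samples - len(self._pending)
        out = self._pending + [0.0] * pad
        self._pending = []
        return out
```

Because every payload is a whole number of chunks, the scheduler can reserve an exact slot count per unit instead of padding the KV with filler embeddings.
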
…mit whole-chunk multi-payloads

The first unit consumes the official ~1035 ms window, so a k-chunk first
payload yields k-1 worker units (k>=2); cap-at-one-chunk emission stranded
the rest of the committed turn at serving because the commit path flushes
once. Emit all whole chunks in one payload instead and model the first-window
consumption in the orchestrator's first-append slot budget.

Signed-off-by: Yueqian Lin <linyueqian@outlook.com>
With model-driven listen/speak restored, an explicit response.create can
legitimately resolve to listen and produce no audio, which breaks the
Realtime contract that response.create yields a response. Mark
response-bound appends with force_speak and suppress only the listen token
at the segment's decision step (official listen_prob_scale -> 0 semantics).
Per-chunk full-duplex auto-response appends stay fully model-driven, and
force_listen keeps precedence.

Signed-off-by: Yueqian Lin <linyueqian@outlook.com>
A committed turn's tail was stranded: the first unit consumes the official
~1035 ms window, leaving up to a second of speech in the worker buffer with
no following chunk to flush it, so the model decided on a partial question
(and force_speak then produced an empty reply). On final appends the worker
now builds exactly one extra unit - the zero-padded leftover if any,
otherwise one full silence unit - matching the official post-turn silence
beat at the decision step, and the scheduler budget reserves that unit.

Signed-off-by: Yueqian Lin <linyueqian@outlook.com>
The official model answers across several units (the microphone keeps
streaming silence while the assistant speaks); verified on the released
checkpoint, a question is answered as 'I'm sorry, but / I can't answer /
that question.' over three consecutive units. Turn-mode gave the model
exactly one unit and stopped. While a response is still open after a
segment finishes, serving now appends one silence unit at a time (capped)
so the reply can complete, and force_speak suppresses the listen token at
every step of response-bound segments, mirroring the official mid-turn
listen -> tts_bos replacement.

Signed-off-by: Yueqian Lin <linyueqian@outlook.com>
…t the continuation cap

Forcing speak on the response-bound segment fires at the question's final
unit, where the official model still listens (it answers one silence unit
later with real content); the forced decision produced near-empty
utterances. Keep response-bound segments model-driven and rely on the
silence-continuation units for the official decision cadence, forcing
speak only on the last unit before the continuation cap.

Signed-off-by: Yueqian Lin <linyueqian@outlook.com>
…ce unit

A model-driven listen on a response-bound segment closed the response
immediately, so the continuation units never ran. While continuation
budget remains, keep the response open and append the next silence unit
as the model's decision point (official cadence: it often listens for a
beat before answering).

Signed-off-by: Yueqian Lin <linyueqian@outlook.com>
The official duplex format feeds the sampled terminator (listen/chunk_eos/
turn_eos) + </unit> into the KV at every unit boundary, and the model's
listen/speak policy conditions on its own past decisions. The scheduler
session update discards the segment's final sampled token, so the KV never
contained them and the model kept producing empty speak segments. The model
sampler now records the terminator per duplex session (via a runner-published
row -> session map) and the next append re-injects it ahead of the first unit
closure; the scheduler budget reserves one extra slot for appends after the
first.

Signed-off-by: Yueqian Lin <linyueqian@outlook.com>
preprocess ran the duplex embedding path for every scheduling step of the
data-plane request, including 1-token decode steps where token_offset is
past the prompt: the prompt-embedding slice came up empty and was pad-filled,
so every sampled token was forwarded as a </unit> embedding instead of
itself. The model saw </unit> right after its own <|speak|> and terminated
with empty utterances. Verified by replaying the dumped KV spans through the
official decoder, which generates real text from the same state. Decode
steps now use the normal embedding lookup of the sampled token ids.

Signed-off-by: Yueqian Lin <linyueqian@outlook.com>
30 silence units kept turn-mode responses open too long and stalled
subsequent turn-taking in the scenario flow; 8 s covers the official
reply cadence (listen 1-2 units, speak 2-4 units).

Signed-off-by: Yueqian Lin <linyueqian@outlook.com>
Signed-off-by: Yueqian Lin <linyueqian@outlook.com>
The generic stage-0 final-output message accompanies every duplex
segment with cumulative thinker text and no audio; gating the empty
flush on unit_end_of_turn pushed it into the text-without-audio error
branch, flooding auto-respond clients with one error per chunk and
displacing real decisions.

Signed-off-by: Yueqian Lin <linyueqian@outlook.com>
Match the official per-chunk contract (exactly one result per audio
chunk): duplex stage-0 segment boundaries no longer forward the raw
thinker output as a final-output message. Listen decisions already flow
via _emit_duplex_model_listen_output and spoken content via the talker
stage; the extra message carried cumulative text with no audio and every
downstream consumer had to filter it (mis-filtering caused either an
error per chunk or silenced decisions). Reverts the converter-side
text-without-audio suppression, no longer needed.

Signed-off-by: Yueqian Lin <linyueqian@outlook.com>
The orchestrator stamps duplex_native_decision=listen/model_listen on
model-listen segment outputs, but the converter only inspected
completion token tails, which the wrapped listen output does not carry;
listen decisions fell through to the text-without-audio error branch.

Signed-off-by: Yueqian Lin <linyueqian@outlook.com>
Every duplex segment ends finished=True by design; exiting the drain
after each delivered batch made every subsequent decision wait for the
next append to start a fresh drain task, re-adding one chunk of latency
per decision.

Signed-off-by: Yueqian Lin <linyueqian@outlook.com>
…spond

A stage-1 emission whose cumulative audio sliced to an empty delta can
still carry delta text; in auto-respond mode that is normal streaming
overlap, not a text-without-audio protocol error. Restores the guard
removed in d73682f now that the listen-marker (f94ae50) and persistent
drain (e30885d) fixes cover the cases it was blamed for.

Signed-off-by: Yueqian Lin <linyueqian@outlook.com>
The official worker processes each chunk in one synchronous loop, so
nothing can be lost between chunks. Our per-append cancel+restart of the
data-plane drain task orphaned any decision arriving in the swap window
(chunk 0 delivered, everything after raced). Keep one drain for the
session's stable resumable request and skip the restart entirely.

Signed-off-by: Yueqian Lin <linyueqian@outlook.com>
…ROFILE_LOGS

Log every boundary of the auto-respond event path to localize where
post-chunk-0 decisions are lost: append control result contents
([append-result]), drain task lifecycle ([drain-start], [drain]),
pump routing with per-queue depth ([pump]), and collect-side request
state census ([collect]).

Signed-off-by: Yueqian Lin <linyueqian@outlook.com>
…trace

Signed-off-by: Yueqian Lin <linyueqian@outlook.com>
…chet

The model-listen wrapper carries the thinker's hidden states under the
'latent' mm key, which the audio encoder's key fallback treated as a
waveform. Encoding it on the chunk-0 listen ratcheted the per-request
cumulative audio offset to tokens*hidden_dim fake samples, so every
later talker unit sliced to empty and was silently dropped by the
auto-respond empty-audio guard: chunk-0 listen arrived, then the
session went mute (audio only resurfaced once the real cumulative
waveform outgrew the poisoned offset, ~16 units).

Compute the native decision first and yield listen results before any
audio work so listen batches never touch the offset.

Signed-off-by: Yueqian Lin <linyueqian@outlook.com>
… batches

A talker segment streams several cumulative-audio batches that all carry
the same segment text; attaching it per batch re-delivered the text with
every audio delta (official results carry per-unit deltas exactly once).
Track delivered chars per request, attach only the unseen suffix, and
reset at segment end so a genuinely repeated next segment still goes out.

Signed-off-by: Yueqian Lin <linyueqian@outlook.com>
A segment whose finished batch slices to an empty audio delta never hit
the in-branch reset, so the next segment's text was suffix-sliced against
the previous segment and lost. Clear the per-request sent-segment text in
the output iterator for every finished batch, and compare by content so a
genuinely repeated next segment is still delivered.

Signed-off-by: Yueqian Lin <linyueqian@outlook.com>
Browser session trace showed continuation units re-running the talker
with the SAME handed text after finished=True (every engine segment ends
finished, so finished does not partition text). The iterator-level reset
re-attached the text once per continuation segment, duplicating the
transcript ('of your day so far? of your day so far?') and re-opening a
ghost bubble after end of turn. Compare by content only and replace the
stored text when it actually changes; a verbatim-identical consecutive
reply keeps its audio but not a second transcript copy (rare, accepted).

Signed-off-by: Yueqian Lin <linyueqian@outlook.com>
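
The transcript de-duplication described in the last three commits can be sketched as below. It combines the two rules from the messages (attach only the unseen suffix, and compare by content so identical consecutive text is not sent twice); `TranscriptDeduper` is a hypothetical name and the real converter logic may differ.

```python
class TranscriptDeduper:
    def __init__(self):
        # Last text delivered per request id.
        self._last_text = {}

    def delta(self, request_id: str, text: str) -> str:
        """Return the part of text that has not been delivered yet."""
        prev = self._last_text.get(request_id, "")
        if text == prev:
            # Same content again (e.g. a continuation unit): no second transcript copy.
            return ""
        self._last_text[request_id] = text
        if prev and text.startswith(prev):
            # Cumulative text grew: deliver only the unseen suffix.
            return text[len(prev):]
        # Genuinely different text: deliver it whole.
        return text
```

As the commit notes, the cost of comparing by content is that a verbatim-identical consecutive reply keeps its audio but loses its second transcript copy.
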
… handoff

The resumable duplex prompt folds every earlier unit, so a <|tts_bos|>
from an already-spoken reply can sit mid-prompt. The unbounded last-bos
search re-sliced that stale region on text-less continuation units and
re-handed already-spoken text to the talker, which re-synthesized it
(official feeds a lone audio_bos for empty units and never re-feeds
text). Restrict the search to the final prompt token (this unit's folded
decision) and the current segment.

Signed-off-by: Yueqian Lin <linyueqian@outlook.com>
Match the official MiniCPMODuplex talker conditioning, which our
per-segment one-shot synthesis violated in five ways that together
garbled the reply audio (correct text, mush between islands):

- One TTSStreamingGenerator per spoken turn with carried KV and
  text_start_pos, fed once per unit; reset only when <|turn_eos|>
  arrives. Previously every segment synthesized as an independent
  utterance from scratch.
- text_eos only at turn end (text_finished was True for every segment,
  stamping utterance-final prosody on each ~1s snippet).
- ~chunk_size codec tokens per unit instead of a 256+ token free-run,
  so audio length tracks the text again.
- Per-turn token2wav stream: ref-audio caches cloned once per turn with
  a [4218]*3 silence-token seed (stops ref-voice bleed at onsets) and
  overlapping pre_lookahead windows advancing chunk_size, flushed with
  last_chunk only at turn end. Previously each segment re-primed the
  caches from the ref wav and fed disjoint windows.
- Repetition penalty only for duplex TTS sampling (the official
  implementation constructs but never applies the top-p/top-k warpers).

Handoffs now accumulate in the runner streaming buffer
(streaming_accumulated_keys, with list support) and the turn state
consumes them by cursor, so a unit arriving mid-synthesis is queued
instead of overwritten. llm2tts conditions on mid-unit <|speak|> tokens
and includes <|turn_eos|> token+hidden (the trained stop signal),
stamping its id in handoff meta for turn-end detection.

Signed-off-by: Yueqian Lin <linyueqian@outlook.com>
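
The overlapping token2wav windows from the commit above can be pictured with a small sketch. It assumes the codec tokens of one turn sit in a plain Python list; `overlapping_windows` is a hypothetical helper, and the sizes used in the usage note are illustrative rather than the PR's values. Only the `[4218]*3` seed comes from the commit message.

```python
from typing import Iterator, List

SILENCE_TOKEN = 4218  # silence codec token quoted in the commit message


def overlapping_windows(
    tokens: List[int], chunk_size: int, pre_lookahead: int
) -> Iterator[List[int]]:
    """Yield windows of chunk_size + pre_lookahead tokens, advancing by
    chunk_size, so consecutive windows share pre_lookahead tokens."""
    window = chunk_size + pre_lookahead
    start = 0
    while start + window <= len(tokens):
        yield tokens[start:start + window]
        start += chunk_size
    # Tokens past the last full window stay buffered; in the commit's
    # design they are flushed with last_chunk only at turn end.


# Per-turn stream: seed with three silence tokens, then the codec tokens.
turn_tokens = [SILENCE_TOKEN] * 3 + list(range(10))
windows = list(overlapping_windows(turn_tokens, chunk_size=4, pre_lookahead=2))
```

The last `pre_lookahead` tokens of each window reappear at the head of the next one, which is the difference from the disjoint windows the commit says were fed previously.
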
The first stream() call of each mode (mid-turn 25+pre_lookahead window
and the last_chunk tail flush) costs ~20s of one-time compilation, which
landed inside the first spoken turn of the first session (two ~20s
stalls). Warm both modes against the default ref-audio caches when the
profile/dummy run reaches the talker, mirroring the official demo's
precompile step. Opt out with MINICPMO45_SKIP_T2W_WARMUP=1.

Signed-off-by: Yueqian Lin <linyueqian@outlook.com>
…ndow

A unit that sampled EOS early buffered fewer than window-size codec
tokens, yielded no audio, and its chunk produced no client-visible
result; the bridge's per-chunk lockstep then waited out its 20s timeout
(the two ~20s first-reply stalls). The official implementation pins
min=max tokens per mid-turn unit with EOS at -inf; mirror it with a
toggleable suppressor
re-enabled per unit and lifted at turn end. Also keep the per-unit
generate/vocode timing trace behind MINICPMO45_PROFILE_LOGS.

Signed-off-by: Yueqian Lin <linyueqian@outlook.com>
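
A toggleable EOS suppressor of the kind mentioned above could be sketched like this. The class name and the list-of-floats logits are assumptions made for a self-contained example; the PR's sampler operates on its own tensor types.

```python
import math
from typing import List


class ToggleableEosSuppressor:
    """Hypothetical logits hook: while enabled, EOS cannot be sampled."""

    def __init__(self, eos_id: int) -> None:
        self.eos_id = eos_id
        self.enabled = True  # re-enabled at the start of every mid-turn unit

    def __call__(self, logits: List[float]) -> List[float]:
        if not self.enabled:
            # Lifted at turn end so the talker can stop naturally.
            return logits
        masked = list(logits)
        masked[self.eos_id] = -math.inf
        return masked
```

Keeping the suppressor on for mid-turn units guarantees each unit buffers a full window of codec tokens, so every chunk yields a client-visible result.
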
…ity)

The official demo forces the first force_listen_count units (default 3)
to listen so the model never answers off one second of partial audio;
we had no startup guard, so the model spoke at chunk 0/1 with
confabulated content. Inject force_listen into the data-plane payload
for the first N appends (configurable via extra_body.force_listen_count,
0 disables); the runner already applies it once per segment.

Signed-off-by: Yueqian Lin <linyueqian@outlook.com>
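
The startup guard can be illustrated with a short sketch. `ForceListenGuard` and the payload shape are hypothetical; only the `force_listen` key, the default of 3, and the "0 disables" behavior come from the commit message.

```python
from typing import Any, Dict


class ForceListenGuard:
    """Hypothetical per-session guard: mark the first N appends as listen."""

    def __init__(self, force_listen_count: int = 3) -> None:
        # A count of 0 disables the guard entirely.
        self.remaining = max(force_listen_count, 0)

    def annotate(self, payload: Dict[str, Any]) -> Dict[str, Any]:
        """Return a copy of the data-plane payload, adding force_listen
        while forced units remain."""
        out = dict(payload)
        if self.remaining > 0:
            out["force_listen"] = True
            self.remaining -= 1
        return out
```
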
Variable-size last_chunk windows hit a fresh ~20s vocoder compile per
new token count, and an empty tail emitted no event for its chunk at
all, starving the bridge's per-chunk lockstep into its 20s timeout at
reply ends. Pad every tail to chunk_size+pre_lookahead silence tokens
(one shape, warmed at startup; trailing silence at reply end is
inaudible) and always emit the final waveform batch.

Signed-off-by: Yueqian Lin <linyueqian@outlook.com>
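
Padding the tail to one fixed shape might look like the sketch below. `pad_tail` is a hypothetical helper, and using 4218 as the padding token is an assumption carried over from the silence-token seed mentioned in an earlier commit.

```python
from typing import List


def pad_tail(
    tail_tokens: List[int],
    chunk_size: int,
    pre_lookahead: int,
    silence_token: int = 4218,
) -> List[int]:
    """Right-pad the turn-final tail with silence tokens so that every
    last_chunk call sees exactly chunk_size + pre_lookahead tokens."""
    window = chunk_size + pre_lookahead
    if len(tail_tokens) > window:
        raise ValueError("tail is longer than one window")
    return list(tail_tokens) + [silence_token] * (window - len(tail_tokens))
```

Because an empty tail also pads to a full window, the final call always produces a waveform batch, so the per-chunk lockstep has an event to wait on.
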
Clients that treat the audio.delta + transcript.delta pair as the
per-unit completion signal (the official-demo bridge does) waited 20s
at every reply end: the turn-end flush and deduplicated continuation
units carry audio but no text, so no transcript followed, the pending
unit never flushed, and the per-chunk lockstep timed out. Emit the
transcript delta with an empty delta whenever audio was emitted.

Signed-off-by: Yueqian Lin <linyueqian@outlook.com>
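
The pairing rule can be sketched as follows. The event dictionaries are simplified stand-ins: the type strings mirror the shorthand in the commit message rather than the server's exact event schema, and `events_for_unit` is a hypothetical name.

```python
from typing import Dict, List


def events_for_unit(audio_b64: str, text: str) -> List[Dict[str, str]]:
    """Build the events for one unit. Whenever audio is emitted, a
    transcript delta follows, even if its text is empty."""
    events: List[Dict[str, str]] = []
    if audio_b64:
        events.append({"type": "audio.delta", "delta": audio_b64})
        # An empty delta still completes the pair for lockstep clients.
        events.append({"type": "transcript.delta", "delta": text})
    elif text:
        events.append({"type": "transcript.delta", "delta": text})
    return events
```
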
…andoff

Replaying our exact per-unit conditions through the OFFICIAL talker and
vocoder reproduced the garble, exonerating our generator: the conditions
themselves were missing ALTERNATING reply segments (decoded condition for
'This is a Chinese TV show called "The Legend of Qin Shi Huang".' was
only ' TV show called' + ' of Qin' + '".') — the talker vocalized text
it never received. Root cause: the runner's streaming-buffer update is
not merge-safe for a resumable stage-1 request (in-place updates merge
sub-keys, resume prefills REPLACE the buffer), so runner-side
accumulation silently dropped segments and the consumer cursor then ate
the head of each replacement. Accumulate in llm2tts instead (per-request
bridge state, cleared on epoch reset) and hand the complete ids+hidden
history every handoff, making downstream replace semantics lossless;
drop the runner-side streaming_accumulated_keys to avoid double-merge.

Signed-off-by: Yueqian Lin <linyueqian@outlook.com>
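
Accumulating on the bridge side and handing the full history each time can be sketched like this. `HandoffAccumulator` and its method names are hypothetical, and hidden states are modeled as plain lists rather than tensors.

```python
from typing import Any, Dict, List


class HandoffAccumulator:
    """Hypothetical per-request bridge state: every handoff carries the
    complete ids + hidden history, so a downstream REPLACE loses nothing."""

    def __init__(self) -> None:
        self._ids: Dict[str, List[int]] = {}
        self._hidden: Dict[str, List[Any]] = {}

    def append(self, request_id: str, ids: List[int], hidden: List[Any]) -> Dict[str, List[Any]]:
        if len(ids) != len(hidden):
            raise ValueError("ids and hidden must have one row per token")
        self._ids.setdefault(request_id, []).extend(ids)
        self._hidden.setdefault(request_id, []).extend(hidden)
        # Hand over copies of the whole history, not just this segment.
        return {
            "ids": list(self._ids[request_id]),
            "hidden": list(self._hidden[request_id]),
        }

    def reset(self, request_id: str) -> None:
        """Clear the state, for example on an epoch reset."""
        self._ids.pop(request_id, None)
        self._hidden.pop(request_id, None)
```

Whether the receiver merges or replaces its buffer, the latest payload alone is enough to reconstruct every segment.
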
Two follow-ups to the accumulated-condition handoff: (1) a segment delta
can start with several unit decisions (forced/model listens from chunks
that produced no stage-1 handoff accumulate ahead of the speak), so the
old output[0]!=listen check skipped the ENTIRE first speak segment of
every reply (' This is a Chinese' / ' I think it's' never reached the
talker); skip the leading listen run, then the speak decision. (2) the
talker's consumed-cursor lived in the per-turn state and died at
turn_eos, so the next reply re-read the whole accumulated history and
re-synthesized the previous reply; keep the cursor per request, popped
only when the request finishes.

Signed-off-by: Yueqian Lin <linyueqian@outlook.com>
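
Both follow-ups can be sketched in a few lines. The names `strip_leading_decisions` and `RequestCursor` are hypothetical, and the listen/speak token ids in the usage are placeholders, not the model's real ids.

```python
from typing import Dict, List


def strip_leading_decisions(delta_ids: List[int], listen_id: int, speak_id: int) -> List[int]:
    """Skip the leading run of listen decisions, then one speak decision,
    and return the remaining reply tokens."""
    i = 0
    while i < len(delta_ids) and delta_ids[i] == listen_id:
        i += 1
    if i < len(delta_ids) and delta_ids[i] == speak_id:
        i += 1
    return delta_ids[i:]


class RequestCursor:
    """Consumed-position cursor keyed by request, so it outlives a turn."""

    def __init__(self) -> None:
        self._pos: Dict[str, int] = {}

    def take_new(self, request_id: str, history: List[int]) -> List[int]:
        """Return only the part of the accumulated history not yet consumed."""
        start = self._pos.get(request_id, 0)
        self._pos[request_id] = len(history)
        return history[start:]

    def finish(self, request_id: str) -> None:
        """Drop the cursor only when the whole request finishes."""
        self._pos.pop(request_id, None)
```

Because the cursor is not reset at turn end, the next reply starts reading after the previous reply instead of re-synthesizing it.
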
A segment delta's leading decision tokens can also be folded into the
resumable prompt, so prompt_len + delta over-counts them and the
front-aligned hidden indexing truncated each reply's first segment to
its first token (' This [is a Chinese]' lost). The hidden tensor's last
len(delta) rows are the delta's rows; index from the end.

Signed-off-by: Yueqian Lin <linyueqian@outlook.com>
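
Indexing from the end, as this commit describes, avoids depending on how many decision tokens were folded into the prompt. A minimal sketch, with hidden states modeled as a list of rows and `delta_hidden_rows` as a hypothetical helper:

```python
from typing import Any, List


def delta_hidden_rows(hidden_rows: List[Any], delta_len: int) -> List[Any]:
    """Return the rows that belong to the segment delta: the LAST
    delta_len rows, regardless of the prompt length."""
    if delta_len == 0:
        # Guard needed: hidden_rows[-0:] would return every row.
        return []
    if delta_len > len(hidden_rows):
        raise ValueError("delta is longer than the hidden tensor")
    return hidden_rows[-delta_len:]
```
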
@Sy0307

Sy0307 commented Jun 12, 2026

Collaborator Author

> May I know when the target merge date for this PR is? Is it scheduled?

Not scheduled. @linyueqian and I are working on it.

@linyueqian

Collaborator

For anyone who wants to drive this PR from the official browser demo: the adapter we use is examples/online_serving/minicpmo/official_demo_bridge_worker.py. It implements the worker surface the https://github.com/OpenBMB/MiniCPM-o-Demo gateway expects (GET /health + WS /ws/duplex) and translates the official per-chunk duplex protocol onto vLLM-Omni's /v1/realtime?duplex=1 endpoint in full-duplex (extra_body.auto_response) mode, so the stock prebuilt web frontend works unchanged.

Quick start:

# 1. vLLM duplex server (2 GPUs)
vllm-omni serve <MiniCPM-o-4_5 path> --omni \
  --stage-configs-path vllm_omni/model_executor/stage_configs/minicpmo45_2gpu_streaming.yaml \
  --trust-remote-code --port 8099

# 2. Bridge worker
python examples/online_serving/minicpmo/official_demo_bridge_worker.py --port 22500 \
  --vllm-ws "ws://localhost:8099/v1/realtime?duplex=1" --model <MiniCPM-o-4_5 path>

# 3. Official demo gateway (from OpenBMB/MiniCPM-o-Demo)
python gateway.py --http --workers localhost:22500

Then open the demo web UI and talk. The session runs in the official semantics: the model decides listen/speak per ~1s unit, results carry per-unit delta text and audio, and end of turn fires once per reply.

The prebuilt MiniCPM-o-Demo frontend schedules playback assuming every
per-chunk result carries exactly one second of 24 kHz audio (listens are
silence, non-final speak units are left-padded, only the turn-final
result is an unpadded tail). Forwarding variable-size generation deltas,
with no audio on listen units, made the frontend drift and clip even
though the delivered audio bytes were clean when played back verbatim.
Buffer reply samples in the bridge and emit one paced unit per result.

Signed-off-by: Yueqian Lin <linyueqian@outlook.com>
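
The pacing rule can be sketched as a small buffer. This is an illustration under stated assumptions: `PacedAudioBuffer` is a hypothetical class, samples are modeled as Python floats, and on the turn-final result the sketch simply returns whatever remains buffered.

```python
from typing import List

SAMPLE_RATE = 24_000
UNIT_SAMPLES = SAMPLE_RATE  # one second of audio per result


class PacedAudioBuffer:
    """Buffer reply samples and emit exactly one paced unit per result."""

    def __init__(self) -> None:
        self._pending: List[float] = []

    def push(self, samples: List[float]) -> None:
        self._pending.extend(samples)

    def next_unit(self, turn_final: bool = False) -> List[float]:
        if turn_final:
            # Turn-final result: the unpadded tail.
            tail, self._pending = self._pending, []
            return tail
        take = self._pending[:UNIT_SAMPLES]
        self._pending = self._pending[UNIT_SAMPLES:]
        # Listen units (nothing buffered) become pure silence; short
        # speak units are left-padded up to one second.
        return [0.0] * (UNIT_SAMPLES - len(take)) + take
```
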
@lishunyang12

lishunyang12 commented Jun 14, 2026

Collaborator

> > May I know when the target merge date for this PR is? Is it scheduled?
>
> Not scheduled. @linyueqian and I are working on it.

This is expected to be ready before the next major release. I have more time from now on and will review it thoroughly, since it introduces several structural changes intended to serve as an abstraction for integrating full-duplex models. From a higher-level perspective, I think the abstraction should avoid being too model-specific, and we should also summarize the architectural trends of recent full-duplex models.

@lishunyang12 lishunyang12 self-assigned this Jun 14, 2026


5 participants